
BEAM ELIXIR ERLANG LEARNING PROJECTS

Where BEAM Languages (Elixir/Erlang) Truly Shine

Goal

After completing these projects, you will deeply understand why BEAM-based languages (Elixir, Erlang, Gleam) excel at building fault-tolerant, massively concurrent systems. You’ll internalize the Actor Model, supervision trees, and the “let it crash” philosophy—not as abstract concepts, but as practical tools you’ve used to build systems that self-heal. You’ll be able to architect systems that handle millions of concurrent connections, survive component failures gracefully, and achieve soft real-time guarantees—capabilities that would require exponentially more complexity in other languages. Most importantly, you’ll recognize the specific problem domains where BEAM is not just better, but fundamentally the right choice.


Why BEAM Exists: The Telephone Switch Story

In 1986, Ericsson needed to build the next generation of telephone switches. Its AXE switches handled millions of simultaneous calls, and a single bug could not be allowed to take down the entire system. Existing languages could not deliver that kind of fault tolerance, so Joe Armstrong and his team at Ericsson created Erlang, and later the BEAM VM, specifically for this problem.

The constraints were brutal:

┌─────────────────────────────────────────────────────────────────────────┐
│                        TELEPHONE SWITCH REQUIREMENTS                     │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  ┌──────────────────┐    ┌──────────────────┐    ┌──────────────────┐   │
│  │  99.9999999%     │    │  Millions of     │    │  Zero Downtime   │   │
│  │  Uptime          │    │  Concurrent      │    │  Upgrades        │   │
│  │  (Nine Nines)    │    │  Calls           │    │                  │   │
│  │                  │    │                  │    │                  │   │
│  │  = 31.5ms        │    │  Each call is    │    │  Can't tell      │   │
│  │  downtime/year   │    │  independent     │    │  customers to    │   │
│  │                  │    │                  │    │  "please hold"   │   │
│  └──────────────────┘    └──────────────────┘    └──────────────────┘   │
│                                                                          │
│  ┌──────────────────┐    ┌──────────────────┐    ┌──────────────────┐   │
│  │  Soft Real-Time  │    │  Fault Isolation │    │  Self-Healing    │   │
│  │                  │    │                  │    │                  │   │
│  │  Voice can't     │    │  One bad call    │    │  System must     │   │
│  │  buffer - must   │    │  can't crash     │    │  recover without │   │
│  │  deliver in      │    │  the switch      │    │  human           │   │
│  │  ~150ms          │    │                  │    │  intervention    │   │
│  └──────────────────┘    └──────────────────┘    └──────────────────┘   │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

These requirements led to a fundamentally different approach to computing:

| Traditional Approach | BEAM Approach |
|---|---|
| Prevent all errors with defensive coding | Assume errors WILL happen; design for recovery |
| Heavy threads/processes (1MB+ each) | Ultra-lightweight processes (~2KB each) |
| Shared memory with locks | No shared memory; message passing only |
| Stop-the-world garbage collection | Per-process GC that never stops the world |
| Deploy = restart = downtime | Hot code reload while running |

The Actor Model: BEAM’s Foundation

Every BEAM process is an actor - an independent entity with its own memory that communicates only through messages. This is not a library or framework; it’s built into the VM itself.

┌─────────────────────────────────────────────────────────────────────────┐
│                           THE ACTOR MODEL                                │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│   Each Actor (Process) has:                                              │
│                                                                          │
│   ┌─────────────────────────────────────────────────────────────────┐   │
│   │                         PROCESS                                  │   │
│   │  ┌─────────────┐  ┌─────────────┐  ┌─────────────────────────┐  │   │
│   │  │   MAILBOX   │  │   STATE     │  │      BEHAVIOR           │  │   │
│   │  │             │  │             │  │                         │  │   │
│   │  │  [msg1]     │  │  count: 5   │  │  handle_call(:inc) ->   │  │   │
│   │  │  [msg2]     │  │  name: "A"  │  │    state + 1            │  │   │
│   │  │  [msg3]     │  │  ...        │  │                         │  │   │
│   │  │     ↓       │  │             │  │  handle_cast(:reset) -> │  │   │
│   │  │  (FIFO)     │  │             │  │    0                    │  │   │
│   │  └─────────────┘  └─────────────┘  └─────────────────────────┘  │   │
│   └─────────────────────────────────────────────────────────────────┘   │
│                                                                          │
│   Communication (ONLY via messages):                                     │
│                                                                          │
│   ┌─────────┐         send(pid, msg)           ┌─────────┐              │
│   │ Process │  ──────────────────────────────► │ Process │              │
│   │    A    │                                  │    B    │              │
│   └─────────┘  ◄──────────────────────────────  └─────────┘              │
│                        send(a, reply)                                    │
│                                                                          │
│   Key Properties:                                                        │
│   • NO shared memory - processes are completely isolated                 │
│   • Messages are COPIED, not referenced (no data races possible)         │
│   • Mailbox is FIFO - messages processed in order received              │
│   • Non-blocking sends - sender continues immediately                    │
│   • Blocking receives - process waits until matching message arrives     │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
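
In code, the primitives behind that diagram are just spawn, send, and receive. A minimal sketch in plain Elixir (no OTP needed), runnable in iex:

pid = spawn(fn ->
  receive do
    {:ping, from} -> send(from, :pong)   # reply to whoever asked
  end
end)

send(pid, {:ping, self()})   # non-blocking: the sender continues immediately
receive do
  :pong -> IO.puts("got pong")   # blocking: waits for a matching message
end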

Why this matters:

In traditional concurrent programming (Java, Go, C++), you share memory and use locks:

Traditional (Shared Memory + Locks):
┌─────────────────────────────────────────────────────────────────────────┐
│                                                                          │
│   Thread 1          SHARED MEMORY              Thread 2                  │
│   ┌───────┐        ┌───────────┐              ┌───────┐                 │
│   │       │◄──────►│ counter=5 │◄────────────►│       │                 │
│   │       │  LOCK  │           │    LOCK      │       │                 │
│   └───────┘        └───────────┘              └───────┘                 │
│                                                                          │
│   Problems:                                                              │
│   • Deadlocks (Thread 1 waits for lock Thread 2 holds, and vice versa)  │
│   • Race conditions (both threads modify at once)                        │
│   • Lock contention (threads wait, reducing parallelism)                 │
│   • One thread crashes → corrupts shared state → cascading failure       │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

BEAM (Message Passing):
┌─────────────────────────────────────────────────────────────────────────┐
│                                                                          │
│   Process 1                                    Process 2                 │
│   ┌───────────┐                               ┌───────────┐             │
│   │ my_count=5│       {:increment, 1}         │ my_data=X │             │
│   │           │ ────────────────────────────► │           │             │
│   │           │                               │           │             │
│   │           │ ◄──────────────────────────── │           │             │
│   │           │        {:ok, 6}               │           │             │
│   └───────────┘                               └───────────┘             │
│                                                                          │
│   Benefits:                                                              │
│   • No locks needed (no shared state to protect)                         │
│   • No race conditions (each process owns its data)                      │
│   • Process 1 crash → Process 2 unaffected (isolated memory)             │
│   • Easier to reason about (follow the messages)                         │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Lightweight Processes: The Scalability Secret

BEAM processes are NOT operating system processes or threads. They’re managed entirely by the BEAM VM:

┌─────────────────────────────────────────────────────────────────────────┐
│                    BEAM PROCESSES vs OS THREADS                          │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│   OS thread or green thread (Java, Go):                                  │
│   ┌─────────────────────────────────────────────────────────────────┐   │
│   │  ~1MB stack (Java) or ~2-8KB (Go goroutine)                     │   │
│   │  Context switch = expensive (save/restore registers, TLB flush) │   │
│   │  Managed by OS scheduler                                         │   │
│   │  Creating 1M threads = 1TB+ memory (Java) or ~8GB (Go)          │   │
│   └─────────────────────────────────────────────────────────────────┘   │
│                                                                          │
│   BEAM Process:                                                          │
│   ┌─────────────────────────────────────────────────────────────────┐   │
│   │  ~2KB initial memory (grows as needed)                          │   │
│   │  Context switch = cheap (just update a pointer)                  │   │
│   │  Managed by BEAM scheduler (not OS)                              │   │
│   │  Creating 1M processes = ~2GB memory                             │   │
│   │  WhatsApp: 2 million connections per server with ~32GB RAM      │   │
│   └─────────────────────────────────────────────────────────────────┘   │
│                                                                          │
│   Memory comparison for 1 million concurrent connections:                │
│                                                                          │
│   Java Threads:     ████████████████████████████████████████  1+ TB     │
│   Go Goroutines:    ████████                                  ~8 GB     │
│   BEAM Processes:   ██                                        ~2 GB     │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
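
You can check the per-process memory claim yourself. A rough iex experiment (bytes per process vary by OTP release and machine, and the default process limit must be raised first, e.g. start with iex --erl "+P 2000000"):

before = :erlang.memory(:processes)

pids =
  for _ <- 1..1_000_000 do
    spawn(fn ->
      receive do
        :stop -> :ok   # park until told to exit
      end
    end)
  end

after_spawn = :erlang.memory(:processes)
IO.puts("#{div(after_spawn - before, length(pids))} bytes per process")

Enum.each(pids, &send(&1, :stop))   # let them all exit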

The BEAM Scheduler: Preemptive Fairness

Unlike most VMs, BEAM guarantees that no single process can monopolize the CPU. It uses reduction counting for preemptive scheduling:

┌─────────────────────────────────────────────────────────────────────────┐
│                      BEAM PREEMPTIVE SCHEDULER                           │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│   What is a "reduction"?                                                 │
│   - Unit of work (roughly: one function call = one reduction)           │
│   - Each process gets ~4000 reductions before yielding                  │
│   - Cannot be disabled - built into the VM                              │
│                                                                          │
│   ┌─────────────────────────────────────────────────────────────────┐   │
│   │                         Timeline                                 │   │
│   ├─────────────────────────────────────────────────────────────────┤   │
│   │                                                                  │   │
│   │  Process A    ████████░░░░░░░░░░░░████████░░░░░░░░░░░░          │   │
│   │  (CPU-bound)       ↑                   ↑                        │   │
│   │               4000 reductions     4000 reductions               │   │
│   │               (forced yield)      (forced yield)                │   │
│   │                                                                  │   │
│   │  Process B    ░░░░░░░░████░░░░░░░░░░░░░░░░████░░░░░░░░░░░░       │   │
│   │  (I/O-bound)       ↑  ↑                    ↑  ↑                 │   │
│   │                yield  resume           yield  resume            │   │
│   │               (waiting for I/O)                                  │   │
│   │                                                                  │   │
│   │  Process C    ░░░░░░░░░░░░████████░░░░░░░░░░░░████████           │   │
│   │  (mixed)                                                         │   │
│   │                                                                  │   │
│   └─────────────────────────────────────────────────────────────────┘   │
│                                                                          │
│   Why this matters:                                                      │
│   • A CPU-bound infinite loop CANNOT starve other processes             │
│   • I/O operations automatically yield (no wasted CPU)                   │
│   • Soft real-time guarantees: response time is bounded                  │
│   • No process can "forget" to yield - it's enforced by the VM          │
│                                                                          │
│   Compare to Go:                                                         │
│   • Go scheduler is cooperative (goroutines must yield at specific      │
│     points like function calls, channel ops, etc.)                       │
│   • A tight loop with no function calls can starve other goroutines     │
│   • Go 1.14+ added async preemption, but it's still not as thorough     │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
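
You can observe the preemption guarantee from iex: a tight CPU-bound loop in one process does not freeze the shell, because the shell is just another process the scheduler keeps servicing. A minimal sketch:

# A process that burns CPU forever and never yields voluntarily:
busy = spawn(fn -> Stream.iterate(0, &(&1 + 1)) |> Stream.run() end)

# The shell still responds instantly; the scheduler preempts `busy`
# roughly every 4000 reductions:
1 + 1
#=> 2

Process.exit(busy, :kill)   # clean up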

Per-Process Garbage Collection: No Stop-the-World

Each BEAM process has its own heap and is garbage collected independently:

┌─────────────────────────────────────────────────────────────────────────┐
│                    GARBAGE COLLECTION COMPARISON                         │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│   JVM/Go (Stop-the-World GC):                                            │
│   ┌─────────────────────────────────────────────────────────────────┐   │
│   │                                                                  │   │
│   │  Timeline: ──────────────────────────────────────────────────►  │   │
│   │                                                                  │   │
│   │  Thread 1:  ████████████░░░░░░░░░░░░████████████████████████    │   │
│   │  Thread 2:  ████████████░░░░░░░░░░░░████████████████████████    │   │
│   │  Thread 3:  ████████████░░░░░░░░░░░░████████████████████████    │   │
│   │  Thread 4:  ████████████░░░░░░░░░░░░████████████████████████    │   │
│   │                        ^^^^^^^^^^^^                              │   │
│   │                        ALL THREADS PAUSED                        │   │
│   │                        for GC (10-100+ ms)                       │   │
│   │                                                                  │   │
│   │  Problem: ALL users experience latency spike during GC pause     │   │
│   │                                                                  │   │
│   └─────────────────────────────────────────────────────────────────┘   │
│                                                                          │
│   BEAM (Per-Process GC):                                                 │
│   ┌─────────────────────────────────────────────────────────────────┐   │
│   │                                                                  │   │
│   │  Timeline: ──────────────────────────────────────────────────►  │   │
│   │                                                                  │   │
│   │  Process 1: ██████░█████████████████████████████████████████    │   │
│   │                  ↑                                               │   │
│   │            (GC for P1 only - microseconds)                       │   │
│   │                                                                  │   │
│   │  Process 2: ████████████████░████████████████████████████████   │   │
│   │                            ↑                                     │   │
│   │                  (GC for P2 only - microseconds)                 │   │
│   │                                                                  │   │
│   │  Process 3: ███████████████████████████████████████████████████ │   │
│   │             (no GC needed - not enough garbage)                  │   │
│   │                                                                  │   │
│   │  Benefit: ONLY the affected process pauses, others continue     │   │
│   │           Small heaps = fast GC (microseconds, not milliseconds) │   │
│   │                                                                  │   │
│   └─────────────────────────────────────────────────────────────────┘   │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
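
Each heap is private and observable per process. A small iex sketch (the numbers shown are illustrative and will differ on your machine):

:erlang.process_info(self(), :total_heap_size)
#=> {:total_heap_size, 376}           # heap size in words; varies

Enum.to_list(1..500_000) |> length()  # allocate and discard a large list
:erlang.process_info(self(), :total_heap_size)
#=> {:total_heap_size, 1_439_468}     # this process's heap grew

:erlang.garbage_collect()             # collects ONLY the calling process
:erlang.process_info(self(), :total_heap_size)
#=> {:total_heap_size, 376}           # shrank again; no other process paused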

Supervision Trees: “Let It Crash” in Practice

The “let it crash” philosophy is not about being careless—it’s about designing systems that expect and recover from failures automatically:

┌─────────────────────────────────────────────────────────────────────────┐
│                         SUPERVISION TREE                                 │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│                          ┌─────────────────┐                             │
│                          │   Application   │                             │
│                          │   Supervisor    │                             │
│                          │   (one_for_one) │                             │
│                          └────────┬────────┘                             │
│                                   │                                      │
│            ┌──────────────────────┼──────────────────────┐               │
│            │                      │                      │               │
│            ▼                      ▼                      ▼               │
│   ┌────────────────┐    ┌────────────────┐    ┌────────────────┐        │
│   │ Web Supervisor │    │ DB Supervisor  │    │ Cache Supervisor│        │
│   │ (one_for_one)  │    │ (rest_for_one) │    │ (one_for_all)  │        │
│   └───────┬────────┘    └───────┬────────┘    └───────┬────────┘        │
│           │                     │                     │                  │
│     ┌─────┼─────┐         ┌─────┼─────┐         ┌─────┼─────┐           │
│     ▼     ▼     ▼         ▼     ▼     ▼         ▼     ▼     ▼           │
│   ┌───┐ ┌───┐ ┌───┐     ┌───┐ ┌───┐ ┌───┐     ┌───┐ ┌───┐ ┌───┐        │
│   │ W │ │ W │ │ W │     │ P │ │ R │ │ R │     │ C │ │ C │ │ C │        │
│   │ 1 │ │ 2 │ │ 3 │     │ o │ │ e │ │ e │     │ 1 │ │ 2 │ │ 3 │        │
│   │   │ │   │ │   │     │ o │ │ p │ │ p │     │   │ │   │ │   │        │
│   └───┘ └───┘ └───┘     │ l │ │ 1 │ │ 2 │     └───┘ └───┘ └───┘        │
│   Workers               └───┘ └───┘ └───┘     Cache nodes              │
│   (can crash            DB connection pool                              │
│    independently)                                                        │
│                                                                          │
│   RESTART STRATEGIES:                                                    │
│   ┌─────────────────────────────────────────────────────────────────┐   │
│   │                                                                  │   │
│   │  one_for_one:   Only restart the crashed child                  │   │
│   │                 "Web worker 2 died? Restart only worker 2"       │   │
│   │                                                                  │   │
│   │  rest_for_one:  Restart crashed child and all started after it  │   │
│   │                 "Pool died? Restart pool, then replicas"        │   │
│   │                 (replicas depend on pool)                        │   │
│   │                                                                  │   │
│   │  one_for_all:   Restart ALL children if one dies                │   │
│   │                 "Cache 2 died? Restart all caches"              │   │
│   │                 (caches must be in sync)                         │   │
│   │                                                                  │   │
│   └─────────────────────────────────────────────────────────────────┘   │
│                                                                          │
│   ESCALATION:                                                            │
│   - If a child crashes too many times, supervisor kills itself          │
│   - Its parent supervisor then restarts the entire subtree              │
│   - This continues up until the entire app restarts (or human looks)    │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
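
In Elixir, the restart strategy in that diagram is a single option passed when a supervisor initializes its children. A minimal sketch (the module names are illustrative):

defmodule MyApp.DBSupervisor do
  use Supervisor

  def start_link(arg) do
    Supervisor.start_link(__MODULE__, arg, name: __MODULE__)
  end

  def init(_arg) do
    # Replicas depend on the pool, so :rest_for_one fits:
    # if Pool dies, Pool and both replicas restart;
    # if Replica2 dies, only Replica2 restarts.
    children = [MyApp.Pool, MyApp.Replica1, MyApp.Replica2]
    Supervisor.init(children, strategy: :rest_for_one)
  end
end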

Why “Let It Crash” Works:

Traditional Defensive Programming:
┌─────────────────────────────────────────────────────────────────────────┐
│                                                                          │
│  try {                                                                   │
│    result = doRiskyOperation()                                          │
│    if (result == null) {                                                │
│      // Handle null case                                                 │
│      log("result was null, trying fallback...")                         │
│      result = fallbackOperation()                                       │
│      if (result == null) {                                              │
│        // Handle fallback null                                           │
│        return defaultValue                                               │
│      }                                                                   │
│    }                                                                     │
│    if (!isValid(result)) {                                              │
│      // Handle invalid case                                              │
│      result = sanitize(result)                                          │
│      if (!isValid(result)) {                                            │
│        throw new InvalidResultException()                                │
│      }                                                                   │
│    }                                                                     │
│    return result                                                         │
│  } catch (NetworkException e) {                                          │
│    // Handle network error                                               │
│    retry(3)                                                              │
│  } catch (TimeoutException e) {                                          │
│    // Handle timeout                                                     │
│  } catch (Exception e) {                                                 │
│    // Handle unknown error                                               │
│    log(e)                                                                │
│    return defaultValue                                                   │
│  }                                                                       │
│                                                                          │
│  Problems:                                                               │
│  • 90% of code is error handling                                         │
│  • Still can't handle every edge case                                    │
│  • State might be corrupted after partial failure                        │
│  • Defensive code often has its own bugs                                 │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

BEAM "Let It Crash" Approach:
┌─────────────────────────────────────────────────────────────────────────┐
│                                                                          │
│  def handle_call(:do_risky_operation, _from, state) do                  │
│    result = do_risky_operation()   # Just do it. If it fails, crash.    │
│    {:reply, result, state}                                               │
│  end                                                                     │
│                                                                          │
│  # Supervisor config (separate concern):                                 │
│  children = [                                                            │
│    {RiskyWorker, restart: :permanent, max_restarts: 3, max_seconds: 5}  │
│  ]                                                                       │
│                                                                          │
│  Benefits:                                                               │
│  • Clean, focused business logic                                         │
│  • Guaranteed fresh state after restart (no corruption)                  │
│  • Failure handling is declarative (supervisor config)                   │
│  • Self-healing: supervisor restarts crashed process                     │
│  • Escalation: too many crashes → restart larger subsystem               │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
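
A placement detail worth knowing: in Elixir, restart: is a child-spec option, while max_restarts/max_seconds (the restart intensity) are options on the supervisor itself. A minimal sketch (RiskyWorker is illustrative):

children = [
  # Child-level option: always restart this worker when it exits
  Supervisor.child_spec(RiskyWorker, restart: :permanent)
]

# Supervisor-level intensity: more than 3 restarts within 5 seconds makes the
# supervisor give up and crash itself, escalating the failure upward.
Supervisor.start_link(children,
  strategy: :one_for_one,
  max_restarts: 3,
  max_seconds: 5
)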

Distribution: Built Into the VM

BEAM nodes can form clusters and communicate transparently:

┌─────────────────────────────────────────────────────────────────────────┐
│                      DISTRIBUTED ERLANG                                  │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│   Node 1 (server1@192.168.1.1)        Node 2 (server2@192.168.1.2)      │
│   ┌─────────────────────────┐         ┌─────────────────────────┐       │
│   │  ┌───┐ ┌───┐ ┌───┐     │         │  ┌───┐ ┌───┐ ┌───┐     │       │
│   │  │P1 │ │P2 │ │P3 │     │◄───────►│  │P4 │ │P5 │ │P6 │     │       │
│   │  └───┘ └───┘ └───┘     │   TCP   │  └───┘ └───┘ └───┘     │       │
│   │                         │         │                         │       │
│   └─────────────────────────┘         └─────────────────────────┘       │
│                                                                          │
│   # From P1, send message to P5 on different machine:                    │
│   send({:registered_name, :"server2@192.168.1.2"}, {:hello, "world"})   │
│                                                                          │
│   # SAME SYNTAX as sending to local process!                             │
│   # The VM handles:                                                      │
│   #   - Serialization                                                    │
│   #   - Network transport                                                │
│   #   - Deserialization                                                  │
│   #   - Delivery to correct process                                      │
│                                                                          │
│   Built-in features:                                                     │
│   • Node.connect/1 - connect to another node                            │
│   • Node.list/0 - list connected nodes                                  │
│   • :global - global process registry across cluster                    │
│   • :pg - process groups for pub/sub                                    │
│   • Node.monitor/2 - detect node failures                               │
│                                                                          │
│   Compare to other languages:                                            │
│   • Java: Need gRPC, Protocol Buffers, service discovery (Consul/etcd)  │
│   • Go: Need gRPC, serialization libs, manual clustering                 │
│   • BEAM: Just connect and send messages (same syntax as local)          │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
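
A minimal two-node sketch you can run on one machine (node names and the cookie are placeholders):

# terminal 1:  iex --name a@127.0.0.1 --cookie secret
# terminal 2:  iex --name b@127.0.0.1 --cookie secret

# On node a:
Node.connect(:"b@127.0.0.1")
#=> true

# Spawn a process on node b directly from node a, then message it:
pid = Node.spawn(:"b@127.0.0.1", fn ->
  receive do
    {:hello, from} -> send(from, {:hi_back, node()})
  end
end)

send(pid, {:hello, self()})   # the same send/2 used for local processes
receive do
  {:hi_back, remote} -> IO.puts("reply from #{remote}")
end
#=> reply from b@127.0.0.1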

Where BEAM Is The Best Choice

| Problem Domain | Why BEAM Wins | What Others Struggle With |
|---|---|---|
| Millions of persistent connections | Lightweight processes (~2KB each) | Java threads cost ~1MB each; even Go goroutines use several times the memory, plus context-switching overhead |
| Fault isolation | Process crashes are contained; supervisors restart | Exceptions in Go/Java can corrupt shared state |
| Soft real-time | Per-process GC, preemptive scheduler | JVM/Go stop-the-world GC pauses cause latency spikes |
| Distributed systems | Distribution primitives built into the VM | Requires external tools (Consul, etcd, ZooKeeper) |
| Hot code reloading | Native support | Nearly impossible in most compiled runtimes |
| “Always up” systems | “Let it crash” + supervision trees | Defensive programming everywhere, still fragile |

Real-World Validation

| Company | Use Case | Scale | Why BEAM |
|---|---|---|---|
| WhatsApp | Messaging | 2M connections/server; ~50 engineers serving 900M users | Fault tolerance, connection density |
| Discord | Real-time chat | 5M concurrent users | Connection handling, real-time updates |
| Ericsson | Telecom switches | 99.9999999% uptime (~31.5 ms downtime/year) | Original use case |
| Pinterest | Notification system | High-throughput push notification delivery | Fault tolerance, throughput |
| Bleacher Report | Live sports updates | Millions of concurrent users during games | Real-time, spike handling |

Concept Summary Table

| Concept Cluster | What You Need to Internalize |
|---|---|
| Actor Model | Processes don’t share memory. All communication is via message passing. This eliminates data races by design. |
| Lightweight Processes | BEAM processes are ~2KB, managed by the VM, not the OS. You can have millions of them. |
| Preemptive Scheduling | No process can monopolize the CPU. The scheduler enforces fairness via reduction counting. |
| Per-Process GC | Each process has its own heap. GC pauses affect only that process. No stop-the-world. |
| Supervision Trees | Hierarchical fault recovery. Supervisors restart failed children. Failure is expected, not exceptional. |
| “Let It Crash” | Don’t try to handle every error. Let processes crash and restart with clean state. Supervisors handle recovery. |
| Distribution | Built into the VM. Send messages to remote processes with the same syntax as local ones. |
| Hot Code Reload | Update running code without stopping the system. Critical for zero-downtime deployments. |

Deep Dive Reading by Concept

This section maps each concept to specific book chapters for deeper understanding.

Actor Model & Message Passing

| Concept | Book & Chapter |
|---|---|
| Actor model fundamentals | Programming Erlang by Joe Armstrong — Ch. 8: “Concurrent Programming” |
| Process communication | Elixir in Action by Saša Jurić — Ch. 5: “Concurrency Primitives” |
| GenServer and OTP behaviors | The Little Elixir & OTP Guidebook by Benjamin Tan Wei Hao — Ch. 3-4 |
| Process design patterns | Designing for Scalability with Erlang/OTP by Francesco Cesarini — Ch. 3-4: “GenServer, Generic Servers” |

Supervision and Fault Tolerance

| Concept | Book & Chapter |
|---|---|
| Supervision trees | Elixir in Action by Saša Jurić — Ch. 9: “Isolating Error Effects” |
| Supervisor strategies | Programming Erlang by Joe Armstrong — Ch. 13: “Errors in Concurrent Programs” |
| “Let it crash” philosophy | The Zen of Erlang (free online) by Fred Hébert |
| Production fault tolerance | Designing for Scalability with Erlang/OTP by Francesco Cesarini — Ch. 9: “System Reliability” |

BEAM Internals

| Concept | Book & Chapter |
|---|---|
| How processes work | The BEAM Book (free online) by Erik Stenman — Process chapters |
| Scheduler and reductions | The BEAM Book by Erik Stenman — Scheduling chapter |
| Memory and GC | The BEAM Book by Erik Stenman — Memory management chapters |
| Distribution internals | Learn You Some Erlang (free online) — “Distribunomicon” chapter |

ETS and Performance

| Concept | Book & Chapter |
|---|---|
| ETS tables | Elixir in Action by Saša Jurić — Ch. 10: “Beyond GenServer” |
| When to use ETS vs GenServer | Erlang and OTP in Action by Martin Logan — Ch. 6 |
| Performance patterns | Designing for Scalability with Erlang/OTP by Francesco Cesarini — Ch. 6: “ETS and Mnesia” |

Distribution and Clustering

| Concept | Book & Chapter |
|---|---|
| BEAM distribution | Learn You Some Erlang (free online) — “Distribunomicon” chapter |
| Building distributed systems | Programming Erlang by Joe Armstrong — Ch. 17: “Distributed Programming” |
| Clustering patterns | Designing for Scalability with Erlang/OTP by Francesco Cesarini — Ch. 11: “Distribution” |

Essential Reading Order

  1. Week 1 - Foundation:
    • Elixir in Action Ch. 5 (concurrency primitives)
    • Programming Erlang Ch. 8 (concurrent programming)
    • The Little Elixir & OTP Guidebook Ch. 3-4 (GenServer)
  2. Week 2 - Fault Tolerance:
    • Elixir in Action Ch. 9 (supervision)
    • Programming Erlang Ch. 13 (errors)
    • The Zen of Erlang (philosophy)
  3. Week 3 - Performance:
    • Elixir in Action Ch. 10 (ETS)
    • The BEAM Book (internals)
  4. Week 4 - Distribution:
    • Learn You Some Erlang “Distribunomicon”
    • Programming Erlang Ch. 17 (distributed programming)

Project Recommendations


Project 1: Real-Time Chat System with Fault Tolerance

  • File: BEAM_ELIXIR_ERLANG_LEARNING_PROJECTS.md
  • Programming Language: Elixir
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: Level 4: The “Open Core” Infrastructure
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: Concurrency / Distributed Systems
  • Software or Tool: BEAM VM / OTP
  • Main Book: “Designing for Scalability with Erlang/OTP” by Francesco Cesarini

What you’ll build: A chat server where users can create rooms, join them, and send messages - but with a twist: you’ll intentionally crash parts of it and watch it self-heal.

Why it teaches BEAM’s strengths: This forces you to confront the core BEAM paradigm: processes, message passing, supervision trees, and the “let it crash” philosophy. You’ll see firsthand why WhatsApp could serve 2 million connections per server with a tiny team.

Core challenges you’ll face:

  • Modeling each chat room as a separate process (maps to: Actor model)
  • Handling user disconnections without losing room state (maps to: Process supervision)
  • Broadcasting messages to thousands of users without blocking (maps to: Lightweight concurrency)
  • Making the system survive crashes of individual components (maps to: Supervision trees)

Key Concepts:

  • OTP GenServer: “Designing for Scalability with Erlang/OTP” Ch. 3-4 - Francesco Cesarini
  • Supervision Trees: “Elixir in Action” Ch. 9 - Saša Jurić
  • Process Linking & Monitoring: “Programming Erlang” Ch. 13 - Joe Armstrong
  • Message Passing: “The Little Elixir & OTP Guidebook” Ch. 3-4 - Benjamin Tan Wei Hao

Difficulty: Intermediate
Time estimate: 1-2 weeks
Prerequisites: Basic Elixir syntax, understanding of processes

Learning milestones:

  1. First milestone: Single chat room works - you understand GenServer and message passing
  2. Second milestone: Multiple rooms with supervision - you understand why “let it crash” is actually safer than try/catch everywhere
  3. Final milestone: You can kill any process and the system recovers - you’ve internalized fault tolerance

Real World Outcome

You’ll have a fully functional chat server that demonstrates BEAM’s fault tolerance in action. Here’s exactly what you’ll see when running the completed project:

# Terminal 1: Start the chat server
$ iex -S mix
Compiling 5 files (.ex)
[info] Chat server starting...
[info] Supervisor tree initialized

iex> ChatServer.start()
:ok
[info] Chat server listening on port 4000

# Terminal 2: User "alice" connects via telnet
$ telnet localhost 4000
Connected to localhost.
Welcome to ChatServer! Enter your username:
> alice
[System] Welcome, alice! Commands: /join <room>, /leave, /list, /msg <text>

> /join general
[System] Joined room 'general' (1 user online)

> /msg Hello everyone!
[alice @ general] Hello everyone!

# Terminal 3: User "bob" connects
$ telnet localhost 4000
> bob
[System] Welcome, bob!

> /join general
[System] Joined room 'general' (2 users online)
[System] alice is here

# Bob sees Alice's messages, Alice sees Bob join
[alice @ general] Hello everyone!
[System] bob joined the room

> /msg Hi Alice!
[bob @ general] Hi Alice!

# Back in Terminal 2 (Alice's view):
[System] bob joined the room
[bob @ general] Hi Alice!

# NOW THE FUN PART - Kill the room process!
# In Terminal 1 (iex console):
iex> room_pid = ChatServer.RoomRegistry.get_room("general")
#PID<0.234.0>

iex> Process.exit(room_pid, :kill)
true

# Watch the logs:
[warn] Room 'general' process terminated (reason: killed)
[info] Supervisor restarting room 'general'...
[info] Room 'general' restarted with fresh state
[info] Reconnecting 2 users to 'general'...

# In Terminals 2 and 3, users see:
[System] Room temporarily unavailable, reconnecting...
[System] Reconnected to 'general' (2 users online)

# Users can immediately continue chatting!
> /msg The room crashed and recovered automatically!
[alice @ general] The room crashed and recovered automatically!

# View the supervision tree in Observer:
iex> :observer.start()
# You'll see:
┌─────────────────────────────────────────────────────────────────┐
│                      ChatServer.Application                      │
│                              │                                   │
│              ┌───────────────┴───────────────┐                  │
│              │                               │                  │
│      RoomSupervisor                  UserSupervisor             │
│      (DynamicSupervisor)             (DynamicSupervisor)        │
│              │                               │                  │
│    ┌─────────┼─────────┐           ┌─────────┼─────────┐        │
│    │         │         │           │         │         │        │
│  Room:     Room:     Room:      User:     User:     User:       │
│  general   random    elixir     alice      bob      carol       │
│                                                                  │
│  [Click any process to see its state, messages, memory]         │
└─────────────────────────────────────────────────────────────────┘

# Stress test: spawn 1000 users
iex> for i <- 1..1000, do: ChatServer.simulate_user("user#{i}", "general")
[info] Room 'general' now has 1002 users

# Broadcast a message - all 1000 users receive it instantly
iex> ChatServer.broadcast("general", "system", "Hello to all 1000 of you!")
:ok  # Returns immediately - non-blocking!

# Check process count
iex> :erlang.system_info(:process_count)
2847  # Each user = 1 process, each room = 1 process
      # All running smoothly with minimal memory

What you’ve proven:

  1. Each chat room is an isolated process - killing one doesn’t affect others
  2. Supervisors automatically restart crashed processes
  3. State can be recovered (users reconnect seamlessly)
  4. Broadcasting to 1000 users is trivial - BEAM handles it
  5. The Observer shows you exactly what’s happening inside

The Core Question You’re Answering

“How do you build a system where individual components can fail, but users never notice?”

This is the fundamental question that Ericsson faced when building telephone switches and that every production system must answer. Traditional approaches use defensive programming (try/catch everywhere), but BEAM uses a radically different approach: let processes crash and have supervisors restart them with clean state.

Before you write any code, sit with this: In most languages, a crash is a disaster. In BEAM, a crash is just an event that triggers recovery. Your mental model must shift from “prevent all crashes” to “design for recovery.”


Concepts You Must Understand First

Stop and research these before coding:

  1. Processes as Actors
    • What is a BEAM process, and how does it differ from an OS thread?
    • Why is each process completely isolated (no shared memory)?
    • What happens when you send a message to a process?
    • Book Reference: “Elixir in Action” by Saša Jurić — Ch. 5: “Concurrency Primitives”
  2. GenServer Behavior
    • What is a “behavior” in OTP, and why does it matter?
    • What’s the difference between handle_call (sync) and handle_cast (async)?
    • How does GenServer maintain state between calls?
    • What happens to the state when the process crashes?
    • Book Reference: “The Little Elixir & OTP Guidebook” by Benjamin Tan Wei Hao — Ch. 3-4
  3. Supervision Trees
    • What is a supervisor, and what can it supervise?
    • What are the restart strategies (:one_for_one, :one_for_all, :rest_for_one)?
    • What is :max_restarts and :max_seconds? What happens when exceeded?
    • How do you design a supervision hierarchy?
    • Book Reference: “Elixir in Action” by Saša Jurić — Ch. 9: “Isolating Error Effects”
  4. Process Linking and Monitoring
    • What’s the difference between link and monitor? (See the sketch after this list.)
    • When a linked process crashes, what happens to the other?
    • How does a supervisor know when a child crashes?
    • Book Reference: “Programming Erlang” by Joe Armstrong — Ch. 13: “Errors in Concurrent Programs”
  5. Message Passing Semantics
    • Messages are copied, not shared - why is this important?
    • What guarantees does BEAM provide about message ordering?
    • What is the mailbox, and what happens if it grows too large?
    • Book Reference: “Programming Erlang” by Joe Armstrong — Ch. 8: “Concurrent Programming”
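
To make the link/monitor distinction in item 4 concrete, a minimal iex sketch:

# link is bidirectional: when the spawned process exits abnormally, the exit
# signal propagates to the caller (which dies unless it traps exits). In iex
# the shell process is killed and restarted:
spawn_link(fn -> exit(:boom) end)
#=> ** (EXIT from #PID<0.105.0>) shell process exited with reason: :boom

# monitor is unidirectional: the caller is unaffected and simply receives a
# :DOWN message describing the death.
{pid, ref} = spawn_monitor(fn -> exit(:boom) end)
receive do
  {:DOWN, ^ref, :process, ^pid, reason} -> reason
end
#=> :boom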

Questions to Guide Your Design

Before implementing, think through these:

  1. Process Architecture
    • Should each chat room be a separate process? Why?
    • Should each connected user be a separate process? Why?
    • What’s the relationship between user processes and room processes?
    • How do users “join” a room? (Message passing? Process registration?)
  2. State Management
    • What state does a room process need to maintain?
    • What state does a user process need to maintain?
    • When a room crashes and restarts, what state is lost? What can be recovered?
    • Should you persist state to disk, or is in-memory acceptable?
  3. Supervision Hierarchy
    • Draw your supervision tree on paper
    • If a room crashes, should all users in that room also restart?
    • If a user crashes, should their room be affected?
    • What restart strategy makes sense for rooms? For users?
  4. Message Broadcasting
    • When Alice sends a message, how does it reach Bob?
    • Option A: Room process sends to each user process
    • Option B: Use Phoenix.PubSub or :pg for pub/sub
    • What are the tradeoffs?
  5. Failure Scenarios
    • What happens when a user’s network disconnects?
    • What happens when a room process crashes?
    • What happens when the room supervisor crashes?
    • How do you test these scenarios?

Thinking Exercise

Before coding, trace the following scenario on paper:

Scenario: Alice sends "Hello!" to the #general room where Bob is also connected.

Draw the message flow:

1. Alice's TCP connection receives "Hello!"
2. Alice's user process parses the message
3. User process sends {:broadcast, "alice", "Hello!"} to Room process
4. Room process iterates over member list [alice_pid, bob_pid]
5. Room sends {:new_message, "alice", "Hello!"} to each member
6. Each user process sends the formatted message to their TCP socket
7. Bob sees "[alice @ general] Hello!"

Now trace the crash scenario:

1. Room process for #general crashes (killed by admin)
2. RoomSupervisor detects child exit
3. Supervisor restarts Room process with fresh state (empty member list!)
4. How do Alice and Bob rejoin? Who initiates?

Options:
A) User processes are linked to room - they crash too and reconnect
B) User processes monitor room - they receive :DOWN and rejoin
C) Room stores member list in ETS (survives restart)

Which option preserves the best user experience? Draw each scenario.

State Machine for User Process:

┌─────────────────────────────────────────────────────────────────┐
│                    USER PROCESS STATE MACHINE                    │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│   ┌───────────┐   username received   ┌───────────┐             │
│   │ CONNECTED │ ────────────────────► │AUTHENTICATED│            │
│   │(waiting   │                       │ (no room)   │            │
│   │ for name) │                       └──────┬──────┘            │
│   └───────────┘                              │                   │
│                                        /join room                │
│                                              │                   │
│                                              ▼                   │
│                                       ┌───────────┐              │
│            room crashed ◄──────────── │ IN_ROOM   │              │
│            (monitor :DOWN)            │           │              │
│                  │                    └─────┬─────┘              │
│                  │                          │                    │
│                  │                    /leave or                  │
│                  │                    disconnect                 │
│                  │                          │                    │
│                  ▼                          ▼                    │
│           ┌───────────┐              ┌───────────┐               │
│           │RECONNECTING│              │DISCONNECTED│              │
│           │ (auto-retry)│             │  (cleanup) │              │
│           └──────┬──────┘             └───────────┘               │
│                  │                                               │
│            rejoin success                                        │
│                  │                                               │
│                  └────────────────► IN_ROOM                      │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘

The Interview Questions They’ll Ask

Prepare to answer these:

  1. “Why use a separate process for each chat room instead of one process handling all rooms?”
    • Answer: Isolation. If one room crashes (e.g., due to a malicious message), other rooms continue working. Also, each room can use its own CPU core, enabling parallelism. The overhead is minimal (~2KB per process).
  2. “How do you handle the case where a user sends a message to a room that just crashed?”
    • Answer: A bare send/2 to a dead PID is silently dropped, and a GenServer.call to it exits with :noproc. Options: (1) the caller catches the :noproc exit and retries after the room restarts, (2) the user process monitors the room and is told when it comes back, (3) look rooms up by name through Registry (a via tuple), so calls resolve to the new PID once the restarted room re-registers.
  3. “What’s the difference between link and monitor and which would you use here?”
    • Answer: link is bidirectional - if either process dies, the other receives an exit signal (and dies unless it traps exits). monitor is unidirectional - you observe another process’s death without being affected. For users watching rooms, use monitor - you want to know when the room dies but don’t want to die yourself.
  4. “If you have 10,000 users in a room and send a message, doesn’t the room process become a bottleneck?”
    • Answer: Yes, this is a known pattern called “hot spot.” Solutions: (1) Use :pg or Phoenix.PubSub for fan-out (distributes work), (2) Shard large rooms across multiple processes, (3) Use ETS for read-heavy operations. For most use cases, single process is fine - BEAM can send millions of messages per second.
  5. “How do you test that your supervision tree works correctly?”
    • Answer: Integration tests that Process.exit(pid, :kill) specific processes and assert the system recovers. Use :erlang.trace or Observer to verify restart counts. Write property-based tests with StreamData that randomly kill processes and verify invariants.
  6. “What happens if messages pile up in a room’s mailbox faster than it can process them?”
    • Answer: The mailbox grows unbounded, eventually consuming all memory (OOM). Solutions: (1) Add backpressure (reject messages if mailbox > N), (2) Use GenServer.call with timeout instead of cast, (3) Monitor mailbox size with :erlang.process_info(pid, :message_queue_len) and shed load (see the sketch after this list).
  7. “How would you make this chat system distributed across multiple servers?”
    • Answer: Use libcluster or Node.connect/1 to form a cluster. Use :pg (process groups) for cross-node pub/sub. Or use Phoenix.PubSub with Redis adapter. Room processes can be distributed using Horde or consistent hashing.
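
One way to make the answer to question 6 concrete: a room process can inspect its own mailbox length before doing expensive work and shed load past a threshold. A sketch of a handle_cast clause for the room GenServer (the threshold and the do_broadcast/3 helper are illustrative):

def handle_cast({:broadcast, from, msg}, state) do
  {:message_queue_len, queued} = Process.info(self(), :message_queue_len)

  if queued > 10_000 do
    # Overloaded: drop the message rather than let the mailbox grow unbounded
    {:noreply, state}
  else
    do_broadcast(state.members, from, msg)
    {:noreply, state}
  end
end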

Hints in Layers

Hint 1 - Project Structure:

# Start with this structure:
lib/
├── chat_server/
│   ├── application.ex      # Main supervisor
│   ├── room_supervisor.ex  # DynamicSupervisor for rooms
│   ├── user_supervisor.ex  # DynamicSupervisor for users
│   ├── room.ex             # GenServer for each room
│   ├── user.ex             # GenServer for each user connection
│   └── room_registry.ex    # Registry for name-based room lookup
└── chat_server.ex          # Public API

Hint 2 - Basic GenServer for Room:

defmodule ChatServer.Room do
  use GenServer

  def start_link(room_name) do
    GenServer.start_link(__MODULE__, room_name, name: via_tuple(room_name))
  end

  defp via_tuple(room_name) do
    {:via, Registry, {ChatServer.RoomRegistry, room_name}}
  end

  def init(room_name) do
    {:ok, %{name: room_name, members: MapSet.new()}}
  end

  def handle_call({:join, user_pid}, _from, state) do
    Process.monitor(user_pid)  # Know when user disconnects
    new_members = MapSet.put(state.members, user_pid)
    {:reply, :ok, %{state | members: new_members}}
  end

  def handle_cast({:broadcast, from, message}, state) do
    for member_pid <- state.members do
      send(member_pid, {:new_message, state.name, from, message})
    end
    {:noreply, state}
  end

  # When a monitored user dies, remove from members
  def handle_info({:DOWN, _ref, :process, user_pid, _reason}, state) do
    new_members = MapSet.delete(state.members, user_pid)
    {:noreply, %{state | members: new_members}}
  end
end

Hint 3 - Supervision Tree Setup:

defmodule ChatServer.Application do
  use Application

  def start(_type, _args) do
    children = [
      # Registry for room name -> pid lookup
      {Registry, keys: :unique, name: ChatServer.RoomRegistry},

      # Dynamic supervisor for rooms
      {DynamicSupervisor, name: ChatServer.RoomSupervisor, strategy: :one_for_one},

      # Dynamic supervisor for users
      {DynamicSupervisor, name: ChatServer.UserSupervisor, strategy: :one_for_one}
    ]

    Supervisor.start_link(children, strategy: :one_for_one, name: ChatServer.Supervisor)
  end
end

# To start a new room:
DynamicSupervisor.start_child(ChatServer.RoomSupervisor, {ChatServer.Room, "general"})

Hint 4 - User Process with Room Monitoring:

defmodule ChatServer.User do
  use GenServer

  def init({socket, username}) do
    {:ok, %{socket: socket, username: username, room: nil, room_monitor: nil}}
  end

  def handle_cast({:join_room, room_name}, state) do
    # Monitor the room so we know if it crashes
    room_pid = ChatServer.Room.whereis(room_name)
    ref = Process.monitor(room_pid)
    ChatServer.Room.join(room_name, self())
    {:noreply, %{state | room: room_name, room_monitor: ref}}
  end

  # Room crashed! Try to rejoin after it restarts
  def handle_info({:DOWN, ref, :process, _pid, _reason}, %{room_monitor: ref} = state) do
    # Give supervisor time to restart the room
    Process.send_after(self(), :rejoin_room, 100)
    send_to_socket(state.socket, "[System] Room temporarily unavailable, reconnecting...")
    {:noreply, %{state | room_monitor: nil}}
  end

  def handle_info(:rejoin_room, state) do
    case ChatServer.Room.whereis(state.room) do
      nil ->
        # Room not yet restarted, retry
        Process.send_after(self(), :rejoin_room, 100)
        {:noreply, state}
      pid ->
        ref = Process.monitor(pid)
        ChatServer.Room.join(state.room, self())
        send_to_socket(state.socket, "[System] Reconnected to '#{state.room}'")
        {:noreply, %{state | room_monitor: ref}}
    end
  end
end

Hint 5 - Testing Fault Tolerance:

defmodule ChatServer.FaultToleranceTest do
  use ExUnit.Case

  test "room restarts after crash and users reconnect" do
    # Setup
    {:ok, _} = ChatServer.Room.start("test-room")
    {:ok, user1} = ChatServer.User.start(fake_socket(), "alice")
    {:ok, user2} = ChatServer.User.start(fake_socket(), "bob")

    ChatServer.User.join_room(user1, "test-room")
    ChatServer.User.join_room(user2, "test-room")

    # Get room pid
    room_pid = ChatServer.Room.whereis("test-room")

    # Kill the room!
    Process.exit(room_pid, :kill)

    # Wait for restart and reconnection
    :timer.sleep(200)

    # Room should be back with new pid
    new_room_pid = ChatServer.Room.whereis("test-room")
    assert new_room_pid != room_pid
    assert new_room_pid != nil

    # Users should have reconnected
    members = ChatServer.Room.get_members("test-room")
    assert MapSet.size(members) == 2
  end
end

Books That Will Help

| Topic | Book | Chapter | Why It Helps |
|---|---|---|---|
| Process fundamentals | Elixir in Action by Saša Jurić | Ch. 5: “Concurrency Primitives” | Best introduction to BEAM processes, spawn, send, receive |
| GenServer deep dive | The Little Elixir & OTP Guidebook by Benjamin Tan Wei Hao | Ch. 3-4 | Practical GenServer patterns with real examples |
| Supervision trees | Elixir in Action by Saša Jurić | Ch. 9: “Isolating Error Effects” | Clear explanation of supervisors and restart strategies |
| Error handling in concurrent systems | Programming Erlang by Joe Armstrong | Ch. 13: “Errors in Concurrent Programs” | From Erlang’s creator - the philosophy behind “let it crash” |
| Production OTP patterns | Designing for Scalability with Erlang/OTP by Francesco Cesarini | Ch. 3-4, 9 | How to design GenServers and supervision for production |
| Building real-time apps | Programming Phoenix LiveView by Bruce Tate | Ch. 2-3 | If you want to add a web interface later |
| Understanding the philosophy | The Zen of Erlang (free online) by Fred Hébert | Entire article | Essential reading on the “let it crash” mindset |

Project 2: Rate Limiter / Circuit Breaker Service

  • File: BEAM_ELIXIR_ERLANG_LEARNING_PROJECTS.md
  • Main Programming Language: Elixir
  • Alternative Programming Languages: Erlang, Go, Rust
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: Level 4: The “Open Core” Infrastructure
  • Difficulty: Level 2: Intermediate (The Developer)
  • Knowledge Area: Concurrency, Distributed Systems
  • Software or Tool: Elixir, ETS, OTP
  • Main Book: “Elixir in Action” by Saša Jurić

What you’ll build: A service that other applications call to check if a request should be allowed (rate limiting) or if a downstream service is healthy (circuit breaker). This is infrastructure that every production system needs.

Why it teaches BEAM’s strengths: Rate limiters and circuit breakers need to handle massive concurrent requests, maintain state per-client, and never become the bottleneck themselves. BEAM’s ETS (Erlang Term Storage), atomic counters, and lightweight processes make this trivial - in other languages you’d need Redis or complex locking.

Core challenges you’ll face:

  • Tracking request counts per user across millions of users (maps to: ETS tables)
  • Implementing sliding window rate limiting without locks (maps to: Atomic operations)
  • Detecting downstream failures and opening circuits (maps to: Process monitoring)
  • Handling thundering herd when circuit closes (maps to: Half-open state with concurrency limits)

Key Concepts:

  • ETS Tables: “Elixir in Action” Ch. 10 - Saša Jurić
  • Atomic Operations: “Erlang and OTP in Action” Ch. 6 - Martin Logan
  • Circuit Breaker Pattern: Martin Fowler’s “Circuit Breaker” article + Fuse library source code
  • Sliding Window Algorithms: “System Design Interview” Ch. 4 - Alex Xu

Difficulty: Intermediate Time estimate: 1-2 weeks Prerequisites: Basic Elixir, GenServer understanding

Real world outcome: You’ll have an HTTP API where you can call POST /check/user123 and get back {allowed: true, remaining: 99} or {allowed: false, retry_after: 30}. You can hammer it with thousands of concurrent requests (use wrk or hey) and watch it handle them without breaking a sweat. Try implementing the same thing in Go with proper concurrency safety - you’ll appreciate BEAM’s approach.

Learning milestones:

  1. First milestone: Basic rate limiter with ETS - you understand why ETS exists and when to use it vs GenServer state
  2. Second milestone: Sliding window implementation - you understand atomic operations without locks
  3. Final milestone: Full circuit breaker - you understand process monitoring and state machines in BEAM

Real World Outcome (Detailed): the exact HTTP output you should see:

# Start the rate limiter service
$ iex -S mix phx.server
[info] Running RateLimiterWeb.Endpoint

# Test rate limiting (10 requests/minute limit)
$ curl -X POST http://localhost:4000/check/user123
{"allowed": true, "remaining": 9, "reset_in": 60}

$ for i in {1..10}; do curl -s -X POST http://localhost:4000/check/user123 | jq; done
{"allowed": true, "remaining": 8, "reset_in": 58}
{"allowed": true, "remaining": 7, "reset_in": 57}
...
{"allowed": false, "remaining": 0, "retry_after": 45}

# Circuit breaker in action - downstream service failing
$ curl http://localhost:4000/circuit/payment-service
{"status": "open", "failures": 5, "next_attempt_in": 30}

# Load test with 10000 concurrent requests
$ hey -n 10000 -c 100 http://localhost:4000/check/user123
Summary:
  Total:        0.8234 secs
  Requests/sec: 12145.23
  # No errors, all handled correctly!

The Core Question You’re Answering: “How do you protect systems from being overwhelmed while handling millions of concurrent checks without locks?”

Concepts You Must Understand First - Detailed prerequisites:

  • ETS tables and when to use them vs GenServer state: ETS provides concurrent read/write access without message passing overhead. GenServer state requires serialized access through a single process. For rate limiting where millions of requests need to check counters, ETS wins because reads are concurrent and in the caller’s process.
  • Atomic operations with :atomics and :counters: These provide lock-free counter updates that multiple processes can modify simultaneously without race conditions. Critical for implementing rate limiters that don’t become bottlenecks. (See the short :counters sketch after this list.)
  • Sliding window vs fixed window rate limiting: Fixed windows (reset every minute) allow bursts at window boundaries. Sliding windows track requests over a rolling time period, providing smoother rate limiting. Understanding the tradeoffs is essential.
  • Circuit breaker pattern (closed/open/half-open states): Closed = normal operation, Open = all requests fail fast (don’t call failing service), Half-open = test if service recovered. Prevents cascading failures when downstream services are down.
  • Process monitoring for failure detection: Using Process.monitor/1 to detect when downstream service calls fail, tracking failure counts to determine when to open the circuit.
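
To make the :counters bullet concrete, here is a minimal sketch of lock-free counting. The sketch itself is not from the original text; :counters.new/2, add/3, and get/2 are the standard Erlang API:

# Lock-free counters shared between processes - no GenServer, no lock.
# The :atomics option gives fully atomic reads and writes.
ref = :counters.new(1, [:atomics])

# Any process holding `ref` can increment concurrently without coordination:
:counters.add(ref, 1, 1)   # add 1 to the counter at index 1
:counters.get(ref, 1)      # => 1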

Questions to Guide Your Design - Implementation questions:

  • Why is ETS better than GenServer state for this use case? Because with GenServer, every rate limit check becomes a message that queues up in a single process. At 100k requests/sec, your GenServer becomes a bottleneck. With ETS, reads happen concurrently in the calling process with no message passing.
  • How do you implement sliding window without locks? Use ETS ordered_set to store timestamps of each request. On each check, delete entries older than the window, count remaining entries. The individual ETS operations are atomic. Alternative: Use multiple fixed windows with weighted counts. (A sliding-window sketch follows this list.)
  • When should a circuit breaker open? Close? Open when failure threshold is reached (e.g., 5 failures in 10 seconds). Transition to half-open after timeout (e.g., 30 seconds). Close if test request succeeds in half-open state. Reopen if test fails.
  • How do you handle the thundering herd problem? When circuit transitions from open to half-open, only allow ONE test request through, not thousands. Use atomic compare-and-swap to ensure single tester. Other requests still fail fast until circuit closes.
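
A hedged sketch of the sliding-window approach described above. The module and table names (RateLimiter.SlidingWindow, :sliding_limits) and the limits are illustrative, not prescribed by the original; it stores one ETS row per request and counts the rows still inside the window:

defmodule RateLimiter.SlidingWindow do
  # Assumes the table is created once at application start:
  #   :ets.new(:sliding_limits, [:ordered_set, :public, :named_table])

  @window_ms 60_000
  @limit 100

  def allow?(user_id) do
    now = System.system_time(:millisecond)
    cutoff = now - @window_ms

    # Drop this user's entries that have fallen out of the window.
    :ets.select_delete(:sliding_limits, [
      {{{user_id, :"$1", :_}, :_}, [{:<, :"$1", cutoff}], [true]}
    ])

    # Count the requests still inside the window.
    count =
      :ets.select_count(:sliding_limits, [
        {{{user_id, :_, :_}, :_}, [], [true]}
      ])

    if count < @limit do
      :ets.insert(:sliding_limits, {{user_id, now, make_ref()}, true})
      :allow
    else
      :deny
    end
  end
end

Note that each ETS call here is atomic, but the count-then-insert pair is not a single atomic step, so under heavy contention a user can briefly exceed the limit; that tradeoff is exactly what the question above is getting at.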

Thinking Exercise - A paper exercise: Before writing code, design on paper:

  1. ETS Table Structure Design: For rate limiting, sketch:
    • Table name: :rate_limits
    • Table type: set or ordered_set?
    • Key structure: What uniquely identifies each user/client?
    • Value structure: What data do you need to track?

    Example for fixed window: {user_id, window_start_time} => {count, expires_at}

    Example for sliding window: {user_id, request_timestamp, unique_id} => true

  2. Trace Concurrent Access Patterns: Draw a timeline with 3 concurrent processes checking rate limits for the same user:
    Process A: Read count -> Check limit -> Increment -> Write
    Process B:      Read count -> Check limit -> Increment -> Write
    Process C:           Read count -> Check limit -> Increment -> Write
    
    What could go wrong with naive implementation?
    How do :atomics or :ets.update_counter solve this?
    
  3. Circuit Breaker State Transitions: Draw a state machine diagram:
    [Closed] --5 failures--> [Open] --30 sec timeout--> [Half-Open]
        ^                       ^                            |
        |                       |<------- test failure ------|
        |<----------------------------- test success --------|
    
    For each state, list what happens to incoming requests.
    

The Interview Questions They’ll Ask:

  1. “Your rate limiter works great with 100 requests/sec, but at 100k req/sec it becomes the bottleneck. Why, and how do you fix it?”
    • Answer: GenServer serializes all requests through one process. Solution: Use ETS for concurrent reads, :atomics or :ets.update_counter for atomic increments, or partition across multiple ETS tables by key hash.
  2. “A user makes 100 requests at 11:59:59 and 100 more at 12:00:00. With a 100/minute limit, should this be allowed?”
    • Answer: Depends on sliding vs fixed window. Fixed window = yes (different windows). Sliding window = no (200 requests in 1 second). Discuss tradeoffs: fixed is simpler/faster, sliding is fairer.
  3. “Your circuit breaker transitions to half-open. 10,000 requests arrive simultaneously. What happens?”
    • Answer: Naive implementation floods the recovering service (thundering herd). Solution: Only allow 1 test request through, fail-fast the rest. Use atomic compare-and-swap or a single tester process.
  4. “How do you prevent race conditions when incrementing rate limit counters across multiple processes?”
    • Answer: Use :ets.update_counter (atomic operation), :atomics module, or :counters module. Never do read-modify-write separately. ETS operations are atomic per key.
  5. “Your rate limiter crashes and restarts. Should rate limits reset to zero?”
    • Answer: Depends on requirements. Options: (1) Persistent ETS table (outlives owner process), (2) Store in external system (Redis), (3) Accept reset as acceptable for this use case. Discuss supervision tree design.
  6. “Compare your BEAM implementation to a Redis-based rate limiter. What are the tradeoffs?”
    • Answer: BEAM: Lower latency (no network), simpler deployment, tightly coupled. Redis: Shared across services, persistent, extra network hop, additional failure point. BEAM can handle millions of req/sec locally; Redis is better for distributed rate limiting across multiple app servers.

Hints in Layers - Progressive hints:

Hint 1 - High Level Architecture:

Start with these modules:
- RateLimiter.FixedWindow - Simple fixed window using ETS
- RateLimiter.SlidingWindow - More sophisticated, still using ETS
- CircuitBreaker.Server - GenServer managing circuit state
- CircuitBreaker.Registry - Track multiple circuit breakers

Don't jump to optimization yet. Get it working first.

Hint 2 - ETS Table Design:

# For fixed window rate limiting:
:ets.new(:rate_limits, [:set, :public, :named_table])

# Store: {key, window_start} => count
# On each request:
# 1. Calculate current window
# 2. :ets.update_counter to atomically increment
# 3. If counter exceeds limit, deny

# The :public option allows any process to read/write
# The :named_table option lets you reference by name

Hint 3 - Avoiding Race Conditions:

# WRONG - Race condition: read-modify-write is not atomic
count =
  case :ets.lookup(:rate_limits, key) do
    [{_key, count}] -> count
    [] -> 0
  end

if count < limit do
  :ets.insert(:rate_limits, {key, count + 1})
end
# Two processes can both read the same count and both write count + 1,
# so one increment is silently lost.

# RIGHT - Atomic:
new_count = :ets.update_counter(:rate_limits, key, {2, 1}, {key, 0})
if new_count <= limit, do: :allow, else: :deny

# :ets.update_counter is atomic - no race condition possible
# ({2, 1} means "add 1 to element 2"; {key, 0} is the default row)
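
Putting Hints 2 and 3 together, a minimal fixed-window module might look like this (module name and window math are illustrative, not from the original):

defmodule RateLimiter.FixedWindow do
  @window_ms 60_000
  @limit 100

  # Call once at application start (Hint 2's table options).
  def create_table do
    :ets.new(:rate_limits, [:set, :public, :named_table])
  end

  def allow?(user_id) do
    # All requests in the same minute share one bucket key.
    window = div(System.system_time(:millisecond), @window_ms)
    key = {user_id, window}

    # Atomic increment with a default row of {key, 0} (Hint 3).
    count = :ets.update_counter(:rate_limits, key, {2, 1}, {key, 0})

    if count <= @limit, do: {:allow, @limit - count}, else: {:deny, 0}
  end

  # Rows for old windows are never read again; sweep them periodically
  # (e.g. from a small GenServer) so the table doesn't grow forever.
end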

Hint 4 - Circuit Breaker State Machine:

defmodule CircuitBreaker.Server do
  use GenServer

  # State: %{status: :closed | :open | :half_open,
  #          failures: 0,
  #          last_failure_time: nil}

  def handle_call(:check, _from, %{status: :open} = state) do
    if should_attempt_reset?(state) do
      {:reply, :half_open, %{state | status: :half_open}}
    else
      {:reply, :open, state}
    end
  end

  # Handle :closed and :half_open states...
end
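
To address the thundering-herd question from earlier, one option is an atomic compare-and-swap over a shared flag so that only one caller probes the recovering service. A sketch, assuming the breaker creates the flag when the circuit opens (module name illustrative):

defmodule CircuitBreaker.Probe do
  # Create the flag when the circuit transitions to :open.
  def new_flag, do: :atomics.new(1, signed: false)

  # Exactly one caller wins the 0 -> 1 swap and becomes the prober;
  # everyone else keeps failing fast until the circuit closes again.
  def try_become_prober(flag) do
    case :atomics.compare_exchange(flag, 1, 0, 1) do
      :ok -> :you_are_the_prober
      _already_taken -> :fail_fast
    end
  end
end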

Hint 5 - Load Testing & Measurement:

# Install hey: brew install hey

# Test rate limiter performance:
hey -n 100000 -c 100 http://localhost:4000/check/user123

# Watch process counts in observer:
iex> :observer.start()

# Profile with :fprof or :eprof:
:fprof.trace([:start])
# ... make requests ...
:fprof.trace([:stop])
:fprof.profile()
:fprof.analyse()

# If ETS is bottleneck, consider partitioning:
# Hash user_id across multiple ETS tables

Books That Will Help:

| Book | Author | Relevant Chapters | Why It Helps |
| --- | --- | --- | --- |
| Elixir in Action | Saša Jurić | Ch. 10 (ETS), Ch. 8 (Fault Tolerance) | Best explanation of when to use ETS vs GenServer state, includes performance patterns |
| Designing for Scalability with Erlang/OTP | Francesco Cesarini | Ch. 6 (ETS and Mnesia), Ch. 9 (System Reliability) | Deep dive into ETS concurrency model, production patterns |
| System Design Interview | Alex Xu | Ch. 4 (Rate Limiter) | Clear explanation of different rate limiting algorithms with diagrams |
| Release It! | Michael Nygard | Ch. 5 (Circuit Breaker Pattern) | Original source for circuit breaker pattern, stability patterns |
| Programming Erlang | Joe Armstrong | Ch. 15 (ETS and DETS) | From Erlang’s creator, explains the philosophy behind ETS design |
| Designing Data-Intensive Applications | Martin Kleppmann | Ch. 11 (Stream Processing) | Context on where rate limiting fits in system architecture |

Project 3: Distributed Key-Value Store

  • File: BEAM_ELIXIR_ERLANG_LEARNING_PROJECTS.md
  • Main Programming Language: Elixir
  • Alternative Programming Languages: Erlang, Go, Rust
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: Level 4: The “Open Core” Infrastructure
  • Difficulty: Level 3: Advanced (The Engineer)
  • Knowledge Area: Distributed Systems, Consensus
  • Software or Tool: BEAM VM, OTP, Distributed Erlang
  • Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann

What you’ll build: A key-value store that runs on multiple nodes, partitions data across them, and survives node failures. Think a simplified Redis Cluster or etcd.

Why it teaches BEAM’s strengths: BEAM has distribution built into the VM - nodes can communicate transparently, processes can be registered across a cluster, and you can send messages to processes on other machines with the same syntax as local ones. In other languages, this requires significant infrastructure (gRPC, service discovery, serialization).

Core challenges you’ll face:

  • Connecting multiple BEAM nodes in a cluster (maps to: Distribution primitives)
  • Partitioning keys across nodes using consistent hashing (maps to: Distributed algorithms)
  • Handling node failures and re-partitioning (maps to: Node monitoring, pg/pg2)
  • Implementing read/write quorums for consistency (maps to: Distributed systems theory)

Key Concepts:

  • BEAM Distribution: “Learn You Some Erlang” - Distribunomicon chapter (free online)
  • Consistent Hashing: “Designing Data-Intensive Applications” Ch. 6 - Martin Kleppmann
  • Distributed Process Groups: “Programming Erlang” Ch. 17 - Joe Armstrong
  • CAP Theorem tradeoffs: “Designing Data-Intensive Applications” Ch. 9 - Martin Kleppmann

Difficulty: Advanced Time estimate: 2-4 weeks Prerequisites: Solid understanding of GenServer, ETS, basic distributed systems concepts

Real world outcome: You’ll spin up 3 terminal windows, each running a BEAM node, connect them into a cluster, and store/retrieve data across them. Then you’ll kill -9 one of the nodes and watch the other two continue serving requests. You’ll query a key from any node and get the correct value even though it’s stored on a different node.

Learning milestones:

  1. First milestone: Two nodes talking - you understand BEAM’s distribution primitives
  2. Second milestone: Data partitioned with consistent hashing - you understand why this is trivial in BEAM vs building it in Go/Java
  3. Final milestone: Node failure recovery - you’ve built something production-worthy

Real World Outcome (Enhanced):

# Start 3 nodes in separate terminals
# Terminal 1
$ iex --sname node1@localhost -S mix
iex(node1@localhost)> DistKV.Cluster.join()
:ok

# Terminal 2
$ iex --sname node2@localhost -S mix
iex(node2@localhost)> DistKV.Cluster.join()
Connected to: [node1@localhost]

# Terminal 3
$ iex --sname node3@localhost -S mix
iex(node3@localhost)> DistKV.Cluster.join()
Connected to: [node1@localhost, node2@localhost]

# Store data from any node
iex(node1@localhost)> DistKV.put("user:123", %{name: "Alice"})
:ok  # Data is stored on node2 (determined by consistent hash)

# Retrieve from a different node
iex(node3@localhost)> DistKV.get("user:123")
%{name: "Alice"}  # Fetched from node2 transparently!

# Now KILL node2 (Ctrl+C twice or kill -9)
# Watch the cluster detect and recover
iex(node1@localhost)>
[warn] Node node2@localhost disconnected
[info] Redistributing 342 keys from failed node...

# Data is still accessible!
iex(node3@localhost)> DistKV.get("user:123")
%{name: "Alice"}  # Now served from node1 (replica)

The Core Question You’re Answering

“How do you build a database that survives machine failures without losing data or availability?”

This is the question that defined Amazon’s Dynamo, Apache Cassandra, Riak, and every modern distributed database. In traditional single-node databases, your data’s fate is tied to a single machine. A disk failure, power outage, or kernel panic means downtime and potential data loss.

BEAM gives you primitives that make distribution feel natural - nodes connect with a single function call, message passing works identically across machines, and failure detection is built into the VM. What takes weeks of infrastructure in other languages takes days in Elixir.


Concepts You Must Understand First

Stop and research these before coding:

  1. BEAM Distribution Primitives
    • How do BEAM nodes discover and connect to each other?
    • What is the Erlang cookie, and why does it matter for security?
    • How does message passing work between nodes? (Hint: Same syntax as local!)
    • What is :global vs :pg vs Registry for cross-node process registration?
    • Book Reference: “Learn You Some Erlang” — “Distribunomicon” chapter (free online)
  2. Consistent Hashing
    • Why can’t you just use hash(key) % num_nodes? (What happens when nodes change?)
    • How does a hash ring minimize data movement during topology changes?
    • What are virtual nodes, and why do they improve balance?
    • Book Reference: “Designing Data-Intensive Applications” by Martin Kleppmann — Ch. 6: “Partitioning”
  3. Replication Strategies
    • What is primary-backup replication, and when does it fail?
    • What is leaderless (Dynamo-style) replication?
    • How do quorums (N, W, R) provide tunable consistency?
    • What is “read your writes” consistency, and how do you achieve it?
    • Book Reference: “Designing Data-Intensive Applications” by Martin Kleppmann — Ch. 5: “Replication”
  4. CAP Theorem
    • What do Consistency, Availability, and Partition Tolerance actually mean?
    • Why must you choose between C and A during a partition?
    • What is eventual consistency, and what invariants does it break?
    • What is the difference between linearizable and serializable?
    • Book Reference: “Designing Data-Intensive Applications” by Martin Kleppmann — Ch. 9: “Consistency and Consensus”
  5. Node Failure Detection
    • How does BEAM detect that a remote node is down?
    • What is Node.monitor/2, and what messages does it send?
    • What’s the difference between a crashed node and a network partition?
    • How do you handle “false positives” (node thought dead but actually alive)?
    • Book Reference: “Programming Erlang” by Joe Armstrong — Ch. 17: “Distributed Programming”

Questions to Guide Your Design

Before implementing, think through these:

  1. Partitioning Strategy
    • How do you decide which node stores which key?
    • If you add a fourth node to a 3-node cluster, how many keys need to move?
    • How do you handle “hot keys” (one key gets 90% of traffic)?
  2. Replication Decisions
    • How many copies of each key do you store (replication factor)?
    • Which nodes hold the replicas? (Next N nodes on the ring?)
    • What happens when you write: write to one node or all replicas?
  3. Consistency vs Availability
    • If node2 is unreachable, do you:
      • Return an error (CP: prioritize consistency)?
      • Return potentially stale data from a replica (AP: prioritize availability)?
    • How do you configure this per-request?
  4. Failure Handling
    • What happens when a node crashes mid-write?
    • How do you detect node failures? How quickly?
    • When a node comes back, how do you sync its data with the cluster?
  5. Cluster Membership
    • How do nodes discover each other?
    • What happens when a new node joins?
    • How do you rebalance data without causing downtime?

Thinking Exercise

Before coding, work through this on paper:

Exercise 1: Draw the Consistent Hash Ring

┌─────────────────────────────────────────────────────────────────────────┐
│                     CONSISTENT HASH RING                                 │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│                              0                                           │
│                              │                                           │
│                    ┌─────────┴─────────┐                                │
│                    │                   │                                │
│          Node A ───●                   ●─── "user:456" lands here       │
│          (hash=10)                     (hash=25)                        │
│                                                                          │
│    ●                                         ●                          │
│    Node C                                    "user:123"                 │
│    (hash=70)                                 (hash=45)                  │
│                                                                          │
│                    ●───────────────────●                                │
│                    Node B              "order:789"                      │
│                    (hash=50)           (hash=60)                        │
│                                                                          │
│   Key assignment rule: Key goes to the NEXT node clockwise on the ring  │
│                                                                          │
│   • "user:456" (hash=25) → Node B (hash=50)  ✓                          │
│   • "user:123" (hash=45) → Node B (hash=50)  ✓                          │
│   • "order:789" (hash=60) → Node C (hash=70) ✓                          │
│                                                                          │
│   Now: What happens if Node B crashes?                                  │
│   - "user:456" → Node C (next clockwise from 25)                        │
│   - "user:123" → Node C (next clockwise from 45)                        │
│   - Only keys between A and B need to move! Others stay put.            │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Exercise 2: Trace a Distributed Read

Scenario: Client requests GET("user:123") from Node A.
Key "user:123" is stored on Node B (primary) and Node C (replica).

Step 1: Client → Node A: GET("user:123")
Step 2: Node A calculates hash("user:123") = 45 → primary is Node B
Step 3: Node A knows replicas are on [Node B, Node C] (replication factor = 2)
Step 4: Node A sends parallel requests to Node B and Node C
Step 5: Node B responds: {value: "Alice", version: 42}
Step 6: Node C responds: {value: "Alice", version: 42}
Step 7: Node A returns to client (quorum met: 2 responses out of 2)

Now trace: What if Node B is down?
- Step 4: Node A sends to Node B (times out) and Node C (responds)
- With R=1, we can still serve the read from Node C
- With R=2, we'd return an error (insufficient replicas)

Exercise 3: Trace a Write with Conflict

Scenario: Two clients write to "counter" simultaneously from different nodes.

Timeline:
  T1: Client1 → Node A: PUT("counter", 100)
  T1: Client2 → Node B: PUT("counter", 200)

Without coordination, both nodes might accept their write.
When you later read "counter", what do you get?

Options:
A) Last-write-wins (LWW): Use timestamps, most recent wins
   - Problem: Clocks aren't perfectly synchronized
   - Problem: Can lose data silently

B) Version vectors: Track causal history
   - Node A: {counter: 100, version: {A: 1}}
   - Node B: {counter: 200, version: {B: 1}}
   - On read: Detect conflict, return both, let client resolve

C) CRDTs: Use data structures that merge automatically
   - G-Counter: Only increments, merges by taking max per node
   - No conflicts possible!

Which strategy will you implement?
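
If you choose option C, a G-Counter is only a handful of lines. A minimal sketch (module name illustrative): each node increments only its own slot, and merge takes the per-node maximum, so concurrent updates never conflict:

defmodule GCounter do
  # Grow-only counter CRDT: a map of node_name => count.

  def new, do: %{}

  # Each node only ever bumps its own entry.
  def increment(counter, node \\ Node.self(), by \\ 1) do
    Map.update(counter, node, by, &(&1 + by))
  end

  # The observed value is the sum across all nodes.
  def value(counter), do: counter |> Map.values() |> Enum.sum()

  # Merging replicas takes the max per node; the result is the same
  # regardless of the order replicas exchange state, so they converge.
  def merge(a, b), do: Map.merge(a, b, fn _node, x, y -> max(x, y) end)
end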

The Interview Questions They’ll Ask

Prepare to answer these:

  1. “Explain how consistent hashing minimizes data movement when nodes are added or removed.”
    • Answer: In naive hashing (key % N), changing N remaps almost all keys. In consistent hashing, keys are assigned to the next node clockwise on a ring. Adding a node only affects keys between it and its predecessor. With virtual nodes, load is distributed evenly and movement is ~1/N of total keys.
  2. “What’s the difference between AP and CP systems in the CAP theorem? Which did you choose and why?”
    • Answer: During a network partition, CP systems (like etcd, Zookeeper) refuse writes to maintain consistency—every read returns the latest write. AP systems (like Cassandra, DynamoDB) accept writes on available nodes—may return stale data but stay available. I chose AP because [my use case values availability over immediate consistency] or CP because [my use case requires strong consistency guarantees].
  3. “How would you handle a network partition where node1 can see node2, but node3 can’t see either?”
    • Answer: This is a “split-brain” scenario. Options: (1) Use a quorum—with 3 nodes, require 2 nodes to agree (node3 alone can’t write). (2) Use an external arbiter (like Consul or etcd) to determine which partition is authoritative. (3) Accept writes on both partitions and merge conflicts later (AP approach with conflict resolution).
  4. “What’s the difference between strong consistency and eventual consistency? What are the tradeoffs?”
    • Answer: Strong (linearizable) consistency: Every read returns the most recent write. Requires coordination (slow). Eventual consistency: Reads may return stale data, but all replicas converge given enough time. Fast, available, but application must handle stale reads. Tradeoff is latency/availability vs. consistency guarantees.
  5. “How do you implement read repair in a distributed key-value store?”
    • Answer: On every read, send request to all replicas. Compare returned values and versions. If replicas disagree, send the newest version to stale replicas as a background repair. This probabilistically fixes inconsistencies during normal read traffic without requiring a separate consistency process.
  6. “Explain the quorum approach (N/W/R values). What happens with N=3, W=2, R=2?”
    • Answer: N=replication factor (3 copies), W=write quorum (2 must acknowledge), R=read quorum (2 must respond). With W+R > N (2+2 > 3), you’re guaranteed at least one node in every read saw the latest write—strong consistency. With N=3, W=1, R=1 (W+R = 2 ≤ N), you could read stale data—eventual consistency but faster.
  7. “How would you implement anti-entropy to ensure replicas converge?”
    • Answer: Run a background process that periodically compares data between replicas. Use Merkle trees: hash the data hierarchically so you can quickly identify which ranges differ. Exchange only the differing keys. This bounds consistency lag even if read repair misses some keys.
  8. “Your node receives a write, but one replica is unreachable. What do you do?”
    • Answer: Depends on configuration. With W=2, N=3: if 2 replicas acknowledge, write succeeds. The third replica gets “hinted handoff”—the coordinator holds the write and retries when the replica recovers. Alternative: sloppy quorum—temporarily write to another available node, repair later.

Hints in Layers

Hint 1 - Getting Nodes to Talk:

# In terminal 1:
$ iex --sname node1@localhost --cookie secret -S mix

# In terminal 2:
$ iex --sname node2@localhost --cookie secret -S mix

# From node2, connect to node1:
iex(node2@localhost)> Node.connect(:"node1@localhost")
true

iex(node2@localhost)> Node.list()
[:"node1@localhost"]

# Now you can send messages across nodes!
iex(node2@localhost)> Node.spawn(:"node1@localhost", fn -> IO.puts("Hello from node2!") end)

# The cookie must match, or connection fails
# Use --cookie or set in ~/.erlang.cookie

Hint 2 - Simple Consistent Hashing:

defmodule ConsistentHash do
  @ring_size 1_000_000  # Large ring for good distribution

  def hash(key) when is_binary(key) do
    :erlang.phash2(key, @ring_size)
  end

  def get_node(key, nodes) when is_list(nodes) and length(nodes) > 0 do
    key_hash = hash(key)

    # Sort nodes by their hash position
    node_positions = Enum.map(nodes, fn node ->
      {hash(Atom.to_string(node)), node}
    end)
    |> Enum.sort_by(fn {pos, _} -> pos end)

    # Find first node with position >= key_hash (or wrap around)
    case Enum.find(node_positions, fn {pos, _} -> pos >= key_hash end) do
      nil ->
        # Wrap around to first node
        {_, node} = hd(node_positions)
        node
      {_, node} ->
        node
    end
  end

  # Get N replica nodes for a key
  def get_replicas(key, nodes, replication_factor) do
    sorted_nodes = Enum.map(nodes, fn node ->
      {hash(Atom.to_string(node)), node}
    end)
    |> Enum.sort_by(fn {pos, _} -> pos end)

    key_hash = hash(key)

    # Rotate list so primary is first
    {before, after_} = Enum.split_while(sorted_nodes, fn {pos, _} -> pos < key_hash end)
    rotated = after_ ++ before

    # Take first N nodes as replicas
    rotated
    |> Enum.take(replication_factor)
    |> Enum.map(fn {_, node} -> node end)
  end
end
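
A quick sanity check of the module above in iex (node names are examples):

# Which node owns a key, and where do its replicas go?
nodes = [:"node1@localhost", :"node2@localhost", :"node3@localhost"]

ConsistentHash.get_node("user:123", nodes)
#=> one of the three nodes; stable for this key until the node list changes

ConsistentHash.get_replicas("user:123", nodes, 2)
#=> the owner plus the next node clockwise on the ring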

Hint 3 - Cross-Node GenServer Calls:

defmodule DistKV.Storage do
  use GenServer

  def start_link(opts) do
    GenServer.start_link(__MODULE__, %{}, name: __MODULE__)
  end

  def init(state) do
    {:ok, state}
  end

  # Local call
  def get(key), do: GenServer.call(__MODULE__, {:get, key})
  def put(key, value), do: GenServer.call(__MODULE__, {:put, key, value})

  # Remote call - SAME SYNTAX but with node name
  def remote_get(node, key) do
    GenServer.call({__MODULE__, node}, {:get, key})
  end

  def remote_put(node, key, value) do
    GenServer.call({__MODULE__, node}, {:put, key, value})
  end

  def handle_call({:get, key}, _from, state) do
    {:reply, Map.get(state, key), state}
  end

  def handle_call({:put, key, value}, _from, state) do
    {:reply, :ok, Map.put(state, key, value)}
  end
end

# From any node, call storage on a specific node:
# DistKV.Storage.remote_get(:"node2@localhost", "user:123")

Hint 4 - Node Failure Detection:

defmodule DistKV.ClusterMonitor do
  use GenServer

  def start_link(opts) do
    GenServer.start_link(__MODULE__, %{}, name: __MODULE__)
  end

  def init(state) do
    # Monitor all connected nodes
    :net_kernel.monitor_nodes(true)
    {:ok, %{nodes: MapSet.new(Node.list())}}
  end

  def handle_info({:nodeup, node}, state) do
    IO.puts("[Cluster] Node #{node} joined!")
    # Trigger rebalancing here
    new_nodes = MapSet.put(state.nodes, node)
    {:noreply, %{state | nodes: new_nodes}}
  end

  def handle_info({:nodedown, node}, state) do
    IO.puts("[Cluster] Node #{node} left!")
    # Trigger failover here - redistribute that node's keys
    new_nodes = MapSet.delete(state.nodes, node)
    {:noreply, %{state | nodes: new_nodes}}
  end
end

Hint 5 - Parallel Quorum Reads:

defmodule DistKV.Coordinator do
  @read_quorum 2
  @timeout 5_000

  def get(key) do
    nodes = ConsistentHash.get_replicas(key, Node.list() ++ [Node.self()], 3)

    # Send requests to all replicas in parallel
    tasks = Enum.map(nodes, fn node ->
      Task.async(fn ->
        try do
          DistKV.Storage.remote_get(node, key)
        catch
          :exit, _ -> {:error, :node_down}
        end
      end)
    end)

    # Wait for quorum responses
    results = tasks
    |> Task.yield_many(@timeout)
    |> Enum.map(fn
      {_task, {:ok, result}} -> result
      {task, nil} ->
        Task.shutdown(task, :brutal_kill)
        {:error, :timeout}
    end)
    |> Enum.reject(&match?({:error, _}, &1))

    if length(results) >= @read_quorum do
      # Return the value (could also do read repair here)
      {:ok, hd(results)}
    else
      {:error, :quorum_not_met}
    end
  end
end
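
Where the comment above says the coordinator "could also do read repair here", a hedged sketch of what that might look like. It assumes each stored value carries a version, i.e. replicas return {value, version} tuples, which is an extension of the Hint 3 storage module rather than something it already does:

defmodule DistKV.ReadRepair do
  # replies: list of {node, {value, version}} gathered by the coordinator.
  def repair(key, replies) do
    {_node, {_value, best_version} = best} =
      Enum.max_by(replies, fn {_node, {_value, version}} -> version end)

    # Push the newest version to any replica that returned an older one.
    for {node, {_value, version}} <- replies, version < best_version do
      # Fire-and-forget so the repair never blocks the client's read.
      Task.start(fn -> DistKV.Storage.remote_put(node, key, best) end)
    end

    best
  end
end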

Hint 6 - Putting It All Together:

defmodule DistKV do
  @doc "Public API - routes to correct node and handles replication"

  def put(key, value) do
    # Find which nodes should store this key
    replicas = ConsistentHash.get_replicas(key, get_cluster_nodes(), 3)

    # Write to all replicas in parallel
    results = Enum.map(replicas, fn node ->
      Task.async(fn ->
        DistKV.Storage.remote_put(node, key, value)
      end)
    end)
    |> Task.await_many(5_000)

    # Check write quorum
    successes = Enum.count(results, &(&1 == :ok))
    if successes >= 2 do
      :ok
    else
      {:error, :write_failed}
    end
  end

  def get(key) do
    DistKV.Coordinator.get(key)
  end

  defp get_cluster_nodes do
    [Node.self() | Node.list()]
  end
end

Books That Will Help

| Topic | Book | Chapter | Why It Helps |
| --- | --- | --- | --- |
| Partitioning & Consistent Hashing | Designing Data-Intensive Applications by Martin Kleppmann | Ch. 6: “Partitioning” | The definitive explanation of partitioning strategies with clear diagrams |
| Replication | Designing Data-Intensive Applications by Martin Kleppmann | Ch. 5: “Replication” | Covers primary-backup, multi-leader, and leaderless replication in depth |
| CAP Theorem & Consensus | Designing Data-Intensive Applications by Martin Kleppmann | Ch. 9: “Consistency and Consensus” | Clears up common CAP misconceptions, explains linearizability |
| BEAM Distribution | Learn You Some Erlang by Fred Hébert | “Distribunomicon” (free online) | Practical guide to distributed Erlang with code examples |
| Distributed Erlang | Programming Erlang by Joe Armstrong | Ch. 17: “Distributed Programming” | From Erlang’s creator - fundamentals of BEAM clustering |
| Dynamo Paper | Amazon, 2007 | Entire paper | The foundational paper for modern distributed KV stores |
| Production Patterns | Designing for Scalability with Erlang/OTP by Francesco Cesarini | Ch. 11: “Distribution” | How to build production-grade distributed Elixir systems |

Project 4: Real-Time Dashboard with Phoenix LiveView

  • File: BEAM_ELIXIR_ERLANG_LEARNING_PROJECTS.md
  • Main Programming Language: Elixir
  • Alternative Programming Languages: TypeScript, Go, Rust
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: Level 2: The “Micro-SaaS / Pro Tool”
  • Difficulty: Level 2: Intermediate (The Developer)
  • Knowledge Area: Real-time Web, WebSockets
  • Software or Tool: Phoenix, LiveView, PubSub
  • Main Book: “Programming Phoenix LiveView” by Bruce Tate

What you’ll build: A dashboard that displays live metrics (CPU, memory, request rates) updating in real-time for multiple servers. No JavaScript required for the real-time updates.

Why it teaches BEAM’s strengths: LiveView is only possible because of BEAM - each user connection is a lightweight process, state changes push diffs over websockets, and the server can handle hundreds of thousands of concurrent connections. React/Vue with websockets could do this, but requires much more infrastructure and can’t match the connection density.

Core challenges you’ll face:

  • Maintaining stateful connections for each user (maps to: Process per connection)
  • Pushing updates to thousands of users simultaneously (maps to: PubSub, broadcast)
  • Handling slow clients without blocking others (maps to: Process isolation, backpressure)
  • Surviving server restarts without losing all connections (maps to: Presence, CRDT)

Key Concepts:

  • Phoenix Channels & PubSub: “Programming Phoenix LiveView” Ch. 2-3 - Bruce Tate
  • LiveView State Management: “Programming Phoenix LiveView” Ch. 4-5 - Bruce Tate
  • Phoenix Presence (CRDTs): Phoenix Presence documentation + CRDT paper by Shapiro et al.
  • Backpressure: “Systems Performance” Ch. 2 - Brendan Gregg (concept) + GenStage docs

Difficulty: Intermediate Time estimate: 1-2 weeks Prerequisites: Basic Elixir, Phoenix basics

Real world outcome: Open your dashboard in 10 browser tabs. Watch all 10 update simultaneously when metrics change. Open the Observer and watch the process count - you’ll see each tab is a separate process. Now open 1000 tabs (or use a load testing tool) - it still works. Try this with a traditional request/response framework.

Learning milestones:

  1. First milestone: Single live-updating component - you understand LiveView’s model
  2. Second milestone: Broadcasting to multiple clients - you understand PubSub and why it’s cheap
  3. Final milestone: Handling thousands of connections - you understand why BEAM was built for this

Real World Outcome:

# Start the Phoenix server
$ mix phx.server
[info] Running DashboardWeb.Endpoint at http://localhost:4000

# Open browser to http://localhost:4000/dashboard
# You see a real-time metrics dashboard:
┌─────────────────────────────────────────────────────────┐
│              System Metrics Dashboard                   │
├─────────────────────────────────────────────────────────┤
│  Server: web-01          Status: ● ONLINE               │
│  ┌─────────────────────────────────────────────────────┐│
│  │ CPU: ████████░░░░░░░░ 45%                           ││
│  │ MEM: ██████████████░░ 72%                           ││
│  │ REQ: 1,247/sec    ↑ 12%                             ││
│  └─────────────────────────────────────────────────────┘│
│                                                         │
│  Server: web-02          Status: ● ONLINE               │
│  ┌─────────────────────────────────────────────────────┐│
│  │ CPU: ████░░░░░░░░░░░░ 22%                           ││
│  │ MEM: ████████████░░░░ 65%                           ││
│  │ REQ: 892/sec      ↓ 3%                              ││
│  └─────────────────────────────────────────────────────┘│
│                                                         │
│  Connected Users: 1,247  │  Updates: 50ms refresh      │
└─────────────────────────────────────────────────────────┘

# The bars update IN REAL-TIME without page refresh!
# No JavaScript written - all server-rendered with LiveView

# Check process count in iex:
iex> :erlang.system_info(:process_count)
2547  # Each connected browser = 1 LiveView process

The Core Question You’re Answering: “How do you push real-time updates to thousands of browser clients without writing JavaScript or building complex websocket infrastructure?”

Concepts You Must Understand First:

  • Phoenix Channels and WebSockets
  • LiveView process lifecycle
  • PubSub broadcasting patterns
  • Phoenix Presence for tracking connected users
  • Handling backpressure with slow clients

Questions to Guide Your Design:

  • How does LiveView diff updates to minimize data sent?
  • What happens when a browser tab closes?
  • How do you broadcast to 10,000 users without blocking?
  • How do you handle users with slow connections?

Thinking Exercise: Trace the message flow from server event to browser update:

  1. A system metric changes (e.g., CPU usage jumps to 60%)
  2. Your monitoring process detects this change
  3. It publishes to Phoenix.PubSub: PubSub.broadcast("metrics:web-01", {:cpu_update, 60})
  4. All LiveView processes subscribed to “metrics:web-01” receive this message
  5. Each LiveView process updates its assigns: assign(socket, :cpu, 60)
  6. LiveView diffs the new HTML against the old HTML
  7. Only the changed parts are sent over the websocket (not the entire page)
  8. The client receives the diff: {diff: [{0, "████████████░░░░ 60%"}]}
  9. The browser applies the diff to the DOM
  10. The user sees the bar update in real-time
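
A minimal sketch of steps 3 through 5 above as code. Module, topic, and PubSub names (DashboardWeb.MetricsLive, "metrics:web-01", Dashboard.PubSub) are illustrative; the PubSub name must match whatever your application supervises:

defmodule DashboardWeb.MetricsLive do
  use Phoenix.LiveView

  # Subscribe once the websocket is connected, so broadcasts reach us.
  def mount(_params, _session, socket) do
    if connected?(socket) do
      Phoenix.PubSub.subscribe(Dashboard.PubSub, "metrics:web-01")
    end

    {:ok, assign(socket, :cpu, 0)}
  end

  # Step 5: each broadcast lands here; updating the assign triggers a diff.
  def handle_info({:cpu_update, value}, socket) do
    {:noreply, assign(socket, :cpu, value)}
  end

  # Steps 6-9 happen automatically: LiveView re-renders, diffs against the
  # previous render, and pushes only the changed parts over the websocket.
  def render(assigns) do
    ~H"""
    <div>CPU: <%= @cpu %>%</div>
    """
  end
end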

The Interview Questions They’ll Ask:

  1. “What happens if you have 10,000 connected users and you broadcast an update? Does it block?”
    • Answer: No. PubSub uses process isolation - each LiveView is a separate process with its own mailbox. Broadcasting is O(1) - it puts the message in each process’s queue. Each process handles it independently without blocking others.
  2. “How does LiveView know what HTML to send to the client on initial page load vs updates?”
    • Answer: On mount, LiveView renders the full template and sends complete HTML. On updates, it re-renders and diffs against the previous render, sending only the changes (morphdom-style patching). This minimizes bandwidth.
  3. “What happens when a user’s browser tab crashes or they close the page?”
    • Answer: The WebSocket connection breaks, the LiveView process receives a shutdown signal, calls terminate/2 callback for cleanup, and terminates. No memory leaks - the process is garbage collected.
  4. “How would you handle a slow client that can’t keep up with updates?”
    • Answer: Each LiveView process has a message queue. If a client is slow, messages pile up in its queue. You can monitor queue length and either: 1) Skip intermediate updates, sending only the latest, 2) Disconnect slow clients, or 3) Use backpressure mechanisms like GenStage to slow down the producer.
  5. “Why can’t you build this easily in React/Node.js?”
    • Answer: You could, but you’d need: 1) A separate WebSocket server (Socket.io, etc.), 2) State management for each connection (Redis or in-memory store), 3) Manual diffing logic, 4) Clustering for horizontal scaling, 5) Load balancer with sticky sessions. LiveView gives you all of this built-in because BEAM processes are lightweight (2KB) and isolated.
  6. “How do you test LiveView components?”
    • Answer: Phoenix provides Phoenix.LiveViewTest with helpers like render_component/2, live/2 for mounting, and render_click/2, render_submit/2 for simulating user interactions. You can assert on the rendered HTML and test the entire lifecycle without a browser.

Hints in Layers:

  1. Hint 1 - The Architecture: Start by creating a GenServer that generates fake metrics every second. Use Process.send_after/3 to schedule recurring updates. This will be your data source (see the sketch after these hints).

  2. Hint 2 - The LiveView Process: In your LiveView’s mount/3, subscribe to PubSub: Phoenix.PubSub.subscribe(MyApp.PubSub, "metrics"). In handle_info/2, pattern match on the metric updates and use assign/3 to update socket state.

  3. Hint 3 - The Template: Use LiveView’s HEEx templates with dynamic attributes. For the CPU bar, calculate width based on percentage: <div style={"width: #{@cpu}%"}>. LiveView will automatically diff this when @cpu changes.

  4. Hint 4 - Scaling to Multiple Servers: Create a separate GenServer for each monitored server. Each one publishes to its own PubSub topic: “metrics:web-01”, “metrics:web-02”. In LiveView, subscribe to all topics you want to display.

  5. Hint 5 - The “Aha!” Moment: Open your browser’s Network tab and watch the WebSocket frames. You’ll see LiveView only sends tiny diffs like [0, "45%"] instead of the entire page. Now open 10 tabs - each is a separate process. Check Observer (iex> :observer.start()) and filter by LiveView processes. Watch them appear/disappear as you open/close tabs. This is why BEAM can handle millions of connections.
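
A sketch of the Hint 1 data source (module name and topic are illustrative): a GenServer that schedules itself with Process.send_after/3 and broadcasts a fake reading every second:

defmodule Dashboard.FakeMetrics do
  use GenServer

  def start_link(_opts), do: GenServer.start_link(__MODULE__, :ok)

  def init(:ok) do
    schedule_tick()
    {:ok, %{}}
  end

  # Every tick: publish a fake CPU value, then schedule the next tick.
  def handle_info(:tick, state) do
    cpu = :rand.uniform(100)
    Phoenix.PubSub.broadcast(Dashboard.PubSub, "metrics:web-01", {:cpu_update, cpu})
    schedule_tick()
    {:noreply, state}
  end

  defp schedule_tick, do: Process.send_after(self(), :tick, 1_000)
end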

Books That Will Help:

  • “Programming Phoenix LiveView” by Bruce Tate and Sophie DeBenedetto
    • Chapter 2: “Your First LiveView” - Understanding the LiveView lifecycle
    • Chapter 3: “Generators: Scaffolding LiveView” - Building forms and interactions
    • Chapter 4: “Live Data and Snapshots” - Managing state and temporary assigns
    • Chapter 5: “Changeset Strategies” - Form validation patterns
    • Chapter 6: “DOM Patching and Temporary Assigns” - Performance optimization
    • Chapter 7: “Building a Game with Presence” - Phoenix Presence and tracking users
    • Chapter 8: “Uploads” - Handling file uploads in LiveView
  • “Real-Time Phoenix” by Stephen Bussey
    • Chapter 4: “Manage Real-Time State with GenServer” - Backend for LiveView
    • Chapter 5: “Communicate with Phoenix PubSub” - Broadcasting patterns
    • Chapter 6: “Build a Real-Time Application” - Complete example
  • “Phoenix in Action” by Geoffrey Lessel
    • Chapter 13: “Phoenix Channels” - Understanding the WebSocket layer under LiveView

Project 5: Telephony System / Simple SIP Server

  • File: BEAM_ELIXIR_ERLANG_LEARNING_PROJECTS.md
  • Main Programming Language: Elixir
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: Level 4: The “Open Core” Infrastructure
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Telecommunications / VoIP
  • Software or Tool: SIP / BEAM
  • Main Book: “Erlang and OTP in Action” by Martin Logan

What you’ll build: A basic telephony server that can receive “calls” (simulated or real SIP), route them, handle voicemail, and implement call queues. This is literally what BEAM was invented for.

Why it teaches BEAM’s strengths: You’re building what Ericsson built BEAM for. Each call is a process, calls can crash without affecting others, supervision trees handle recovery, and you need real-time guarantees (audio can’t buffer).

Core challenges you’ll face:

  • Modeling each call as a finite state machine (maps to: gen_statem)
  • Handling call transfers without dropping audio (maps to: Process handoff)
  • Implementing call queues with timeouts (maps to: Supervision strategies)
  • Supporting thousands of concurrent calls (maps to: Lightweight processes)

Key Concepts:

  • gen_statem (State Machines): “Designing for Scalability with Erlang/OTP” Ch. 5 - Francesco Cesarini
  • Finite State Machines: “Erlang and OTP in Action” Ch. 7 - Martin Logan
  • Soft Real-time Guarantees: “The BEAM Book” (free online) - Erik Stenman
  • SIP Protocol Basics: RFC 3261 (overview sections) or any SIP tutorial

Difficulty: Advanced Time estimate: 2-4 weeks Prerequisites: Strong OTP understanding, networking basics

Real world outcome: You’ll have a system where you can “dial” a number (via telnet/HTTP), get routed to an “agent” (another connection), and have a conversation. Put an agent on hold, transfer them, hang up one side - the system handles it all. Kill the agent process - the caller gets transferred to voicemail automatically. This is how real phone systems work.

Learning milestones:

  1. First milestone: Single call as state machine - you understand gen_statem
  2. Second milestone: Call routing and queues - you understand supervision strategies
  3. Final milestone: Fault-tolerant call handling - you understand why Erlang runs phone networks

Enhanced Real World Outcome:

# Start the telephony server
$ iex -S mix
iex> Telephony.start()
[info] Telephony server listening on port 5060

# Simulate incoming call (Terminal 2)
$ telnet localhost 5060
INVITE sip:support@company.com
> 100 Trying...
> 180 Ringing...
> 200 OK (connected to Agent #1)

# Agent 1's terminal shows:
[RING] Incoming call from +1-555-0123
[CONNECTED] Call duration: 00:00:00

# Put caller on hold
HOLD
> 200 OK (on hold)
> [Playing hold music...]

# Transfer to Agent 2
TRANSFER sip:agent2@company.com
> 202 Accepted
> [Transferring...]
> 200 OK (connected to Agent #2)

# Now KILL Agent 1's process:
iex> Process.exit(Agent1.pid, :kill)

# Caller is automatically moved to voicemail!
[Call state machine transitions:]
:connected -> :agent_crashed -> :voicemail
> "Please leave a message after the tone..."

# Call queue visualization:
┌────────────────────────────────────────┐
│         Call Queue: Support            │
├────────────────────────────────────────┤
│ Position │ Caller     │ Wait Time     │
├──────────┼────────────┼───────────────┤
│    1     │ +1-555-001 │ 00:02:34      │
│    2     │ +1-555-002 │ 00:01:12      │
│    3     │ +1-555-003 │ 00:00:45      │
├──────────┴────────────┴───────────────┤
│ Agents: 2 available, 3 on calls        │
└────────────────────────────────────────┘

The Core Question You’re Answering: “How do you model complex state machines (like phone calls) that must handle failures gracefully and never drop the ball?”

Concepts You Must Understand First:

  • gen_statem for finite state machines
  • State machine design patterns
  • Soft real-time guarantees in BEAM
  • Process handoff and state transfer
  • Supervision strategies for stateful processes

Questions to Guide Your Design:

  • What are the states of a phone call?
  • How do you transfer a call without dropping audio?
  • What happens when an agent’s process crashes mid-call?
  • How do you ensure fair queue ordering?

Thinking Exercise: Draw the complete state machine for a phone call on paper. Include all states (ringing, connected, on_hold, transferring, voicemail, ended) and all possible transitions between them. Then consider: what happens at each transition if the process crashes? Which states can recover automatically, and which need supervision intervention?

The Interview Questions They’ll Ask:

  1. “Walk me through what happens when you transfer a call between two agents. How do you ensure the audio stream isn’t dropped?”
  2. “Your agent process crashes while handling a call. How does the system detect this and recover without the caller knowing?”
  3. “How would you implement a call queue where callers hear ‘You are number 3 in the queue’ and it updates automatically?”
  4. “What’s the difference between gen_server and gen_statem, and why would you choose gen_statem for a phone call?”
  5. “How do you prevent a slow agent from blocking the entire call queue?”
  6. “Explain how you’d implement call recording without coupling it tightly to the call process itself.”

Hints in Layers:

Hint 1 (Architecture): Think of each call as its own gen_statem process. The states are obvious: :idle, :ringing, :connected, :on_hold, :transferring, :voicemail, :ended. But the transitions are where BEAM shines - you can crash during a transition and the supervisor knows how to recover.

Hint 2 (Process Handoff): When transferring a call, don’t try to move state between processes. Instead, have the new agent process subscribe to the audio stream while the old one unsubscribes. The call process coordinates this, but doesn’t care which agent is listening.

Hint 3 (Fault Tolerance): Use Process.monitor/1 to watch agent processes. When an agent crashes, the call state machine receives a :DOWN message and can transition to :voicemail state automatically. This is why “let it crash” works - the failure is just another event.

Hint 4 (Call Queues): The queue is its own GenServer that maintains an ordered list of waiting calls. When an agent becomes available, it asks the queue for the next call. The queue doesn’t push - agents pull. This prevents blocking and makes the system backpressure-aware.

Hint 5 (Real-time Audio): You don’t need to actually stream audio for this project - simulate it with periodic messages. The important part is proving that your state machine can handle transitions without dropping messages. Use Process.send_after/3 to simulate audio packets every 20ms.
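
A compressed sketch of Hints 1 and 3 as a :gen_statem. The module name, state names, and event shapes are illustrative; only a few of the states listed in Hint 1 are shown:

defmodule Telephony.Call do
  @behaviour :gen_statem

  # Each call is its own state machine (Hint 1). The agent is monitored,
  # so its crash arrives as an ordinary :DOWN event (Hint 3).

  def start_link(caller_id), do: :gen_statem.start_link(__MODULE__, caller_id, [])

  def callback_mode, do: :state_functions

  def init(caller_id), do: {:ok, :ringing, %{caller: caller_id, agent: nil}}

  # :ringing - an agent answers; monitor them and connect the call.
  def ringing({:call, from}, {:answer, agent_pid}, data) do
    Process.monitor(agent_pid)
    {:next_state, :connected, %{data | agent: agent_pid}, [{:reply, from, :ok}]}
  end

  # :connected - the monitored agent died; fail over to voicemail.
  def connected(:info, {:DOWN, _ref, :process, agent, _reason}, %{agent: agent} = data) do
    {:next_state, :voicemail, %{data | agent: nil}}
  end

  def connected({:call, from}, :hangup, data) do
    {:stop_and_reply, :normal, [{:reply, from, :ok}], data}
  end

  # :voicemail - the real project would record here; the sketch just waits.
  def voicemail(:info, _audio_packet, data), do: {:keep_state, data}
end

# Driving it from iex:
#   {:ok, call} = Telephony.Call.start_link("+1-555-0123")
#   :gen_statem.call(call, {:answer, agent_pid})
#   Process.exit(agent_pid, :kill)   # the call transitions to :voicemail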

Books That Will Help:

  • “Erlang and OTP in Action” by Martin Logan, Eric Merritt, Richard Carlsson
    • Chapter 7: Finite State Machines (gen_statem fundamentals)
    • Chapter 4: OTP Supervisors (how to design supervision trees for telephony)
    • Chapter 11: Adding Distribution (scaling to multiple nodes)
  • “Designing for Scalability with Erlang/OTP” by Francesco Cesarini & Steve Vinoski
    • Chapter 5: Using gen_statem (advanced state machine patterns)
    • Chapter 6: Process Architectures (how to structure telephony systems)
    • Chapter 10: System Reliability (achieving telecom-grade reliability)
  • “Programming Erlang” by Joe Armstrong
    • Chapter 13: Errors in Concurrent Programs (why “let it crash” works for telephony)
    • Chapter 16: OTP Behaviors (when to use gen_statem vs gen_server)
  • “The BEAM Book” by Erik Stenman (free online at https://blog.stenmans.org/theBeamBook/)
    • Chapter on Processes (understanding process overhead and scheduling)
    • Chapter on Distribution (how BEAM handles network calls in telephony)

Project Comparison Table

| Project | Difficulty | Time | Depth of Understanding | Fun Factor | BEAM Concepts Covered |
| --- | --- | --- | --- | --- | --- |
| Chat System | Intermediate | 1-2 weeks | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | Processes, Supervision, Messaging |
| Rate Limiter | Intermediate | 1-2 weeks | ⭐⭐⭐ | ⭐⭐⭐ | ETS, Atomics, Performance |
| Distributed KV | Advanced | 2-4 weeks | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | Distribution, Clustering, CAP |
| LiveView Dashboard | Intermediate | 1-2 weeks | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ | Channels, PubSub, Real-time |
| Telephony System | Advanced | 2-4 weeks | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ | State Machines, Original Use Case |

Based on understanding where BEAM shines, here’s the recommended progression:

  1. Start with: Chat System (1-2 weeks)
    • This is the “hello world” of BEAM that isn’t trivial
    • Forces you to think in processes and supervision
    • Visible, satisfying result
  2. Then: Rate Limiter (1 week)
    • Teaches ETS and performance patterns
    • Shows why BEAM can be faster than “fast” languages for certain workloads
  3. Finally: Distributed KV Store (2-4 weeks)
    • This is where BEAM truly separates from the pack
    • You’ll build in weeks what would take months in other languages

Final Comprehensive Project: Discord Clone (Simplified)

What you’ll build: A real-time communication platform with servers, channels, voice chat indicators, presence (online/offline/typing), and direct messages. Essentially, the core of what Discord built on Elixir.

Why it teaches BEAM’s strengths: Discord famously scaled to millions of concurrent users on Elixir. This project combines everything: massive concurrency, fault tolerance, real-time updates, distribution, and presence tracking. It’s the ultimate demonstration of why BEAM exists.

Core challenges you’ll face:

  • Managing millions of presence updates efficiently (maps to: Phoenix Presence, CRDTs)
  • Fan-out messaging to large channels (maps to: PubSub optimization, manifold)
  • Sharding servers across nodes (maps to: Distributed Erlang, consistent hashing)
  • Handling network partitions gracefully (maps to: CAP tradeoffs, eventual consistency)
  • Zero-downtime deployments (maps to: Hot code reloading, rolling restarts)

Key Concepts:

  • Phoenix Presence & CRDTs: “Programming Phoenix LiveView” Ch. 7 + Phoenix.Presence source code
  • Distributed Systems at Scale: Discord’s engineering blog posts on Elixir scaling
  • Efficient Fan-out: “Designing Data-Intensive Applications” Ch. 11 - Martin Kleppmann
  • Hot Code Reloading: “Erlang and OTP in Action” Ch. 14 - Martin Logan
  • Process Sharding: libcluster and Horde library documentation

Difficulty: Advanced Time estimate: 1-2 months Prerequisites: All previous projects, solid distributed systems understanding

Real world outcome: You’ll have a working Discord-like app where users can:

  • Create servers and channels
  • See who’s online in real-time (green dot appears/disappears)
  • See “user is typing…” indicators
  • Send messages that appear instantly for everyone
  • Scale horizontally by adding nodes

Run it on 3 different machines (or Docker containers), connect 10,000 simulated users, and watch it handle the load. Kill one of the nodes - users on that node reconnect to others automatically. This is production-grade infrastructure.

Learning milestones:

  1. First milestone: Single-server Discord works - you understand LiveView + Channels + Presence
  2. Second milestone: Presence scales to thousands - you understand CRDTs and why they matter
  3. Third milestone: Multi-node distribution - you understand BEAM clustering
  4. Final milestone: Node failure recovery - you’ve built what Discord built, and you understand why they chose Elixir

Key Insight

The fundamental insight is this: BEAM treats failure as a first-class citizen. Every other runtime tries to prevent failures. BEAM assumes failures will happen and builds recovery into the VM itself.

This is why:

  • WhatsApp: 2M connections per server, 50 engineers
  • Discord: 5M concurrent users, started with a tiny team
  • Ericsson switches: 99.9999999% uptime (nine nines)

You can’t truly understand this by reading about it. You have to build something, crash it on purpose, and watch it recover. That’s what these projects teach.


Additional Resources

  1. “Elixir in Action” by Saša Jurić - Best introduction to Elixir + OTP
  2. “Programming Erlang” by Joe Armstrong - From the creator of Erlang
  3. “Designing for Scalability with Erlang/OTP” by Francesco Cesarini - Production patterns
  4. “The Little Elixir & OTP Guidebook” by Benjamin Tan Wei Hao - Practical OTP

Online Resources

  • “Learn You Some Erlang for Great Good!” (free online) - Excellent Erlang tutorial
  • “The BEAM Book” by Erik Stenman (free online) - Deep dive into VM internals
  • Elixir Forum - Active community for questions
  • Discord Engineering Blog - Real-world Elixir scaling stories

Video Resources

  • “The Soul of Erlang and Elixir” by Saša Jurić (YouTube) - Best visual explanation of BEAM
  • ElixirConf talks - Practical production experiences