Sprint: BEAM (Erlang/Elixir) Mastery - Real World Projects

Goal: Build a first-principles mental model of how BEAM languages (Erlang, Elixir, Gleam) achieve fault tolerance and massive concurrency through isolated lightweight processes, message passing, and supervision trees. You will internalize OTP behaviors (GenServer, Supervisor, Application), distribution semantics, and the mechanics of scheduling, memory, and code upgrades. By building production-style services such as real-time dashboards, rate limiters, distributed stores, and backpressure pipelines, you will learn when BEAM is the right tool and how to design systems that self-heal under failure.

Introduction

BEAM is the virtual machine and runtime system behind Erlang and Elixir. Its design assumes that failures are normal and that concurrency should be cheap, isolated, and easy to supervise. This guide teaches you to reason about BEAM systems as networks of small processes coordinated by supervisors, not as monolithic servers protected by locks.

What problem does it solve today?

  • It makes it practical to build fault-tolerant systems by isolating failures and restarting components automatically.
  • It enables huge numbers of concurrent activities without shared-memory locks, using message passing as the default.
  • It supports distribution (multiple nodes) with transparent message passing and links/monitors across nodes.

What you will build across the projects:

  • A supervised real-time chat system
  • A rate limiter and circuit breaker service
  • A distributed key-value store
  • A Phoenix LiveView dashboard
  • A GenStage backpressure pipeline
  • A fault-injection harness to test supervision strategies
  • A hot-code-upgrade drill with release handling artifacts
  • A presence and notification service (Discord-style)
  • An ETS-powered session/cache service
  • A telemetry and observability pipeline

In scope:

  • BEAM processes, message passing, and isolation
  • OTP behaviors (GenServer, Supervisor, Application)
  • Supervision strategies and fault recovery
  • Scheduling, reductions, and per-process GC
  • Distribution, nodes, links, and monitors
  • ETS/Mnesia storage concepts
  • Hot code upgrades and release handling
  • Real-time web with Phoenix LiveView and backpressure with GenStage

Out of scope:

  • Low-level VM internals beyond behavioral guarantees
  • Full SIP/Telco protocol implementations
  • Custom NIFs beyond conceptual mentions

Big-Picture ASCII Diagram

CLIENTS -> ROUTERS -> BEAM RUNTIME -> SUPERVISION TREE -> WORKERS
   |           |           |               |               |
   v           v           v               v               v
 WebSocket   API Gate   Process Scheduler  Supervisor    GenServer
 HTTP        Rate Lim  Message Passing     Strategies    ETS Cache

Failures -> local process crash -> supervisor restart -> system stays up

How to Use This Guide

  • Read the Theory Primer first to build the mental model (do not skip it).
  • Build projects in order for your first pass; later, jump by learning path.
  • After each project, verify behavior using the exact outputs in the Definition of Done.
  • Keep a failure log: for each crash you cause, write what recovered it and why.

Prerequisites & Background Knowledge

Essential Prerequisites (Must Have)

  • Programming fundamentals: functions, recursion, pattern matching or conditional logic
  • Basic concurrency concepts: what a process/thread is, what a message is
  • Command-line basics: running commands, reading logs
  • Recommended Reading: “Operating Systems: Three Easy Pieces” - Concurrency chapters

Helpful But Not Required

  • Basic networking (TCP/HTTP) for real-time projects
  • Database basics for distributed KV and dashboards
  • UI basics for LiveView (HTML/CSS)

Self-Assessment Questions

  1. Can you explain the difference between a process and a thread?
  2. Can you describe a queue and why ordering matters?
  3. Can you reason about what happens if one component in a system crashes?
  4. Can you read and interpret a simple log stream?

Development Environment Setup

Required Tools:

  • Erlang/OTP (latest stable)
  • Elixir (latest stable)
  • Mix (Elixir build tool)

Recommended Tools:

  • observer or :observer for runtime inspection
  • recon or telemetry libraries for instrumentation
  • PostgreSQL for the LiveView dashboard project

Testing Your Setup:

$ elixir --version
Erlang/OTP: <version>
Elixir: <version>

$ erl -eval 'erlang:display(erlang:system_info(otp_release)), halt().' -noshell
"<otp_release>"

Time Investment

  • Simple projects: 4-8 hours each
  • Moderate projects: 10-20 hours each
  • Complex projects: 20-40 hours each
  • Total sprint: 2-4 months

Important Reality Check BEAM mastery is about system behavior under failure, not just syntax. If your system never crashes during these projects, you are not pushing it hard enough. The goal is to learn how to design for recovery, not how to avoid every error.

Big Picture / Mental Model

Think of BEAM systems as a forest of tiny, isolated processes supervised by a hierarchy that enforces recovery policies. Message passing is the only coordination mechanism, and the scheduler guarantees fairness by preempting work.

          +-------------------+           +-------------------+
          |   Supervisor A    |           |   Supervisor B    |
          +---------+---------+           +---------+---------+
                    |                               |
          +---------+---------+           +---------+---------+
          |  Worker 1 (GS)    |           |  Worker 3 (GS)    |
          +---------+---------+           +---------+---------+
                    |                               |
          +---------+---------+           +---------+---------+
          |  Worker 2 (ETS)   |           |  Worker 4 (Stage) |
          +-------------------+           +-------------------+

If Worker 2 crashes -> Supervisor A restarts only Worker 2 (one_for_one).

Theory Primer

This primer is a mini-book. Each concept below maps directly to multiple projects.

Concept 1: BEAM Processes and Actor Isolation

Fundamentals BEAM processes are lightweight, isolated units of execution with their own heap and mailbox. They are not OS threads; they are managed by the BEAM runtime and are designed to be created in very large numbers. Messages between processes are copied by default, which eliminates shared-memory data races and makes isolation a first-class property. This isolation allows failures to be contained: if one process crashes, others remain unaffected unless they are explicitly linked or monitored. The actor model in BEAM is therefore not just a design pattern but a runtime guarantee.

Deep Dive In BEAM, every process is a self-contained actor with three core pieces: mailbox, state (heap), and behavior (message handlers). Unlike thread-based concurrency models that rely on shared memory and locks, BEAM uses message passing, which drastically reduces the complexity of reasoning about concurrency. The price you pay is message copying, but the reward is deterministic isolation: no process can mutate another process’s memory, and no data race is possible by construction.

This is not a theoretical statement. The efficiency guide documents that a newly spawned Erlang process uses a small, fixed amount of memory, with a conservative initial heap size so that systems can run hundreds of thousands or millions of processes. The runtime expands and shrinks heaps as needed, which means memory use is proportional to work, not to worst-case allocation. This supports a system design style where you spawn a new process per connection, per task, or per workflow step without worrying about OS thread exhaustion.

Message passing itself has important semantics. Messages are copied between heaps; this creates a clear ownership boundary and avoids aliasing bugs. When messages cross nodes, they are encoded into the external term format and transported over TCP, then decoded on the receiving node. This means distribution is conceptually the same as local message passing, but with explicit latency, serialization, and security considerations.

The mailbox is FIFO, but selective receive can reorder processing because the runtime scans the mailbox for a matching pattern. This is a subtle performance and correctness issue: if you write a receive clause that matches a rare pattern, the runtime may scan a long mailbox for every receive, adding overhead. Message copying and mailbox scanning together influence the architecture: you typically use tagged messages and short queues, or you split responsibilities across processes to keep mailboxes small.
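
The tagged-message idea can be sketched as a request/reply helper in Elixir. This is a hedged illustration, not part of any project spec; the module and function names are assumptions, and the unique reference is what keeps the receive clause cheap to match even when other messages are queued.

defmodule TaggedCall do
  # Sketch: tag the request with a unique ref so the reply can be matched
  # precisely instead of scanning the mailbox for a generic pattern.
  def request(server_pid, payload) do
    ref = make_ref()
    send(server_pid, {:request, ref, self(), payload})

    receive do
      {:reply, ^ref, result} -> {:ok, result}
    after
      5_000 -> {:error, :timeout}
    end
  end
end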

A BEAM process can be linked or monitored. Links are bidirectional and propagate exits; monitors are unidirectional and deliver a DOWN message. These primitives allow you to build fault detection and cascading recovery policies. Supervisors use these to restart crashed workers. When you design your own process tree, you must decide which failures should be isolated and which should propagate upward.

The key design mindset is that processes are cheap and disposable. Instead of writing complex defensive code for every edge case, you let the process crash and rely on supervision to recover. This is not recklessness; it is a deliberate architecture that trades local complexity for system-level stability.

How this fits into the projects You will use this concept in every project, especially the chat system, rate limiter, and distributed store.

Definitions & key terms

  • Process: A BEAM-managed actor with its own heap and mailbox.
  • Mailbox: FIFO queue of incoming messages.
  • Message passing: Communication by sending immutable messages between processes.
  • Link/Monitor: Failure propagation and observation primitives.

Mental model diagram

[Process A] --send--> [Mailbox B] -> [Receive Loop] -> [State Update]
           (copy)          (queue)         (pattern match)

How it works (step-by-step, with invariants and failure modes)

  1. A process sends a message to another process.
  2. The message is copied into the receiver’s mailbox.
  3. The receiver scans for a matching receive clause.
  4. On match, the process updates its local state.
  5. Invariant: no process can mutate another’s memory.
  6. Failure modes: mailbox buildup, selective receive overhead, unhandled messages.

Minimal concrete example

Process Counter:
- State: count
- On message {inc}: count = count + 1
- On message {get, reply_to}: send {count, value} to reply_to
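
As a sketch, the same counter can be written as a bare Elixir process with spawn and receive; the module name and message shapes are illustrative, and there is no OTP supervision here yet.

defmodule Counter do
  # Bare process loop mirroring the pseudocode above.
  def start, do: spawn(fn -> loop(0) end)

  defp loop(count) do
    receive do
      :inc ->
        loop(count + 1)

      {:get, reply_to} ->
        send(reply_to, {:count, count})
        loop(count)
    end
  end
end

# Assumed usage:
# pid = Counter.start()
# send(pid, :inc)
# send(pid, {:get, self()})
# receive do {:count, n} -> IO.puts(n) end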

Common misconceptions

  • “Processes are threads.” They are lighter and runtime-managed.
  • “Messages are references.” They are copied (with refc binary exceptions).

Check-your-understanding questions

  1. Why does message copying prevent data races?
  2. What happens when a mailbox grows very large?
  3. How does selective receive affect performance?

Check-your-understanding answers

  1. Each process owns its memory; no shared mutation is possible.
  2. Receive scans become expensive; latency grows.
  3. The runtime may scan many messages to find a match.

Real-world applications

  • Connection-per-process servers
  • Fault-isolated background jobs
  • Concurrent pipelines with message passing

Where you’ll apply it

  • Project 1 (Chat System)
  • Project 2 (Rate Limiter)
  • Project 3 (Distributed KV)
  • Project 6 (Fault Injection Harness)

References

  • Erlang Efficiency Guide: Processes.
  • Erlang Efficiency Guide: Process Messages.
  • OTP Design Principles: Supervision Trees.

Key insights Isolation + message passing is the foundation of BEAM reliability.

Summary BEAM processes are lightweight, isolated actors with mailbox-driven concurrency. This model avoids shared-memory hazards and makes failure recovery a system design concern rather than a local coding burden.

Homework/Exercises to practice the concept

  1. Draw a message flow between three processes for a request/response cycle.
  2. Explain why selective receive can slow a busy process.

Solutions to the homework/exercises

  1. The sender posts a message; the receiver replies; the sender processes the reply.
  2. The mailbox must be scanned for the matching pattern each time.

Concept 2: OTP Behaviors and Supervision Trees

Fundamentals OTP behaviors are standardized patterns for long-running processes (GenServer, Supervisor, Application). They provide uniform lifecycle, error handling, and integration with supervision. A supervision tree is a hierarchy of supervisors and workers where supervisors monitor and restart children according to a defined strategy. This structure is the practical implementation of the “let it crash” philosophy: local failures are expected and recovered by a supervising process rather than by complex defensive code.

Deep Dive OTP behaviors exist because many processes in BEAM systems follow the same lifecycle: initialize state, receive messages, handle calls/casts, and terminate cleanly. GenServer formalizes this cycle, providing callbacks for initialization, synchronous calls, asynchronous casts, and miscellaneous messages. The benefit is not convenience alone; it is interoperability. A GenServer can be supervised, introspected, and upgraded consistently across the system.
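
A minimal GenServer skeleton illustrates that callback cycle. This is a sketch; the module name and counter state are assumptions, not a prescribed design.

defmodule Counter.Server do
  use GenServer

  # Client API
  def start_link(opts \\ []), do: GenServer.start_link(__MODULE__, 0, opts)
  def inc(pid), do: GenServer.cast(pid, :inc)
  def value(pid), do: GenServer.call(pid, :value)

  # Callbacks: init, synchronous call, asynchronous cast, other messages
  @impl true
  def init(initial), do: {:ok, initial}

  @impl true
  def handle_call(:value, _from, count), do: {:reply, count, count}

  @impl true
  def handle_cast(:inc, count), do: {:noreply, count + 1}

  @impl true
  def handle_info(_msg, count), do: {:noreply, count}
end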

Supervision trees are the backbone of fault tolerance. The design principles describe supervisors as processes that monitor workers and restart them when they fail. Supervisors apply strategies such as one_for_one (restart only the failed child), one_for_all (restart all children), or rest_for_one (restart the failed child and those started after it). The strategy is a semantic decision about dependency: if workers are independent, one_for_one is safer; if they depend on shared state, one_for_all may be appropriate.

The restart intensity and period parameters provide circuit-breaker behavior: too many restarts in a short time can force a supervisor to give up, which then escalates the failure up the tree. This is the mechanism that prevents infinite restart loops and signals that a deeper issue exists. The tree is not a simple retry mechanism; it is a controlled failure policy.

OTP behaviors also encode system messages: a GenServer automatically handles system-level calls such as code upgrades or state inspection. This is why you should not write your own manual receive loops for long-lived services unless you need custom semantics. By using behaviors, you gain the runtime’s built-in tooling, introspection, and fault handling.

Another critical aspect is naming and registration. GenServers and supervisors can be registered locally or globally, which affects discoverability and distribution. Naming semantics are shared between GenServer and Supervisor in Elixir. If you register globally, you must handle network partitions and name conflicts; if you register locally, you need a discovery mechanism. These are architectural trade-offs that become explicit in distributed projects.

The “let it crash” philosophy only works when supervision trees are designed with intent. You must classify which failures are recoverable locally and which should cascade. For example, a crashed cache worker should be restarted; a corrupted database connection might need escalation to shut down the service cleanly. Supervision is therefore not just restart logic; it is policy design.

How this fits into the projects Projects 1, 2, 3, 5, and 6 are rooted in OTP behaviors and supervision decisions.

Definitions & key terms

  • OTP behavior: A standardized process pattern (GenServer, Supervisor).
  • Supervisor: A process that starts, monitors, and restarts children.
  • Supervision tree: Hierarchical arrangement of an application’s workers and supervisors.
  • Restart strategy: Policy for handling child failures.

Mental model diagram

Supervisor
  |-- Worker A (GenServer)
  |-- Worker B (GenServer)
  |-- Supervisor C
        |-- Worker C1

How it works (step-by-step, with invariants and failure modes)

  1. Supervisor starts children in defined order.
  2. Children run their workloads.
  3. On failure, supervisor applies its strategy.
  4. If restart intensity is exceeded, supervisor terminates and escalates.
  5. Invariant: supervisors remain responsible for child lifecycle.
  6. Failure modes: incorrect strategy choice, restart loops, missing cleanup.

Minimal concrete example

Service Tree:
- Top supervisor
  - DB worker (restart: transient)
  - Cache worker (restart: permanent)

If DB worker fails repeatedly, supervisor escalates; cache restarts alone.
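
A sketch of that tree as an Elixir module-based supervisor; DBWorker and CacheWorker are hypothetical modules assumed to implement start_link/1.

defmodule Service.Supervisor do
  use Supervisor

  def start_link(opts \\ []), do: Supervisor.start_link(__MODULE__, :ok, opts)

  @impl true
  def init(:ok) do
    children = [
      # transient: restarted only if it terminates abnormally
      %{id: DBWorker, start: {DBWorker, :start_link, [[]]}, restart: :transient},
      # permanent: always restarted
      %{id: CacheWorker, start: {CacheWorker, :start_link, [[]]}, restart: :permanent}
    ]

    # Escalate if more than 3 restarts occur within 5 seconds
    Supervisor.init(children, strategy: :one_for_one, max_restarts: 3, max_seconds: 5)
  end
end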

Common misconceptions

  • “Supervision trees prevent all outages.” They limit failure blast radius, not eliminate outages.
  • “One_for_all is always safer.” It is only safer when children are tightly coupled.

Check-your-understanding questions

  1. When would you choose one_for_one vs one_for_all?
  2. Why do supervisors terminate children in reverse start order on shutdown?
  3. What happens if restart intensity is exceeded?

Check-your-understanding answers

  1. One_for_one for independent children; one_for_all for tightly dependent ones.
  2. To respect dependencies: children started later may rely on those started earlier, so supervisors stop them in reverse start order.
  3. The supervisor terminates and failure escalates up the tree.

Real-world applications

  • Web servers with resilient worker pools
  • Background job systems
  • Fault-tolerant caches and queues

Where you’ll apply it

  • Project 1 (Chat System)
  • Project 2 (Rate Limiter)
  • Project 5 (GenStage Pipeline)
  • Project 6 (Fault Injection Harness)

References

  • OTP Supervision Principles.
  • Erlang supervisor manual.
  • Elixir GenServer documentation.
  • Elixir Supervisor documentation.

Key insights Supervision is policy, not just restart logic.

Summary OTP behaviors standardize process structure and supervision trees enforce fault recovery policies. This is the core of BEAM reliability.

Homework/Exercises to practice the concept

  1. Sketch a tree for a web app with DB, cache, and worker pool.
  2. Decide which children should be permanent vs transient.

Solutions to the homework/exercises

  1. Separate supervisors for DB and workers; cache as a child of app supervisor.
  2. DB connection often transient; worker pool permanent.

Concept 3: Scheduling, Reductions, and Per-Process GC

Fundamentals BEAM uses preemptive scheduling based on reductions to ensure no single process can monopolize CPU time. This creates fairness across thousands of processes and enables soft real-time behavior. Each process has its own heap and garbage collector, so GC pauses are localized rather than global. These design choices make latency more predictable than in stop-the-world GC systems.

Deep Dive The BEAM scheduler is designed for fairness and responsiveness. Instead of allowing a process to run indefinitely, the scheduler counts reductions (units of work) and yields execution after a quota. This means CPU-bound tasks cannot starve I/O-bound tasks. The exact reduction count is an implementation detail, but the design guarantee is that preemption occurs regularly and cannot be disabled by user code.

Per-process garbage collection is equally important. Each process has a private heap; GC runs on that heap only. The efficiency guide documents that the heap grows as needed and can shrink under certain conditions. This prevents global pauses. If one process allocates too much memory and triggers GC, only that process is paused. The rest of the system keeps running. This is a key property for soft real-time systems where latency spikes must be bounded.

The efficiency guide also notes that you can control minimum heap size to reduce GC overhead for short-lived processes, but warns this is an optimization that requires careful measurement. This highlights a broader lesson: BEAM’s defaults are conservative and safe, but you can tune them when you understand your workload.

Scheduler fairness interacts with mailbox patterns. If a process is constantly handling messages, it might be scheduled frequently, while a CPU-bound process will be preempted. This is why BEAM systems prefer to split heavy computation into separate processes or use ports/NIFs for compute-intensive tasks. The scheduler model encourages concurrency-friendly, reactive workloads rather than long-running CPU loops.

Understanding these mechanics matters because they shape how you design services. For example, a GenStage pipeline assumes that work is distributed across many processes so backpressure can be applied. If you put all work in a single process, scheduler fairness cannot help you; the bottleneck remains. The right architecture is one that aligns with BEAM’s scheduling and GC model.

How this fits into the projects Projects 2, 5, and 10 rely on predictable scheduling and GC behavior.

Definitions & key terms

  • Reduction: A unit of work used for scheduling fairness.
  • Per-process GC: Garbage collection scoped to a single process.
  • Soft real-time: Systems with bounded but not hard deterministic latency.

Mental model diagram

Scheduler
  -> run P1 for N reductions
  -> run P2 for N reductions
  -> run P3 for N reductions

GC runs inside each process heap, not globally.

How it works (step-by-step, with invariants and failure modes)

  1. Scheduler picks runnable processes.
  2. Runs each for a fixed reduction budget.
  3. Preempts and moves to the next.
  4. GC occurs when a process heap threshold is reached.
  5. Invariant: no global stop-the-world pauses.
  6. Failure modes: CPU-bound single process, excessive allocations, mailbox overflow.

Minimal concrete example

Pipeline:
- Producer process emits events
- Multiple worker processes handle tasks
- Each worker yields to scheduler periodically
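
One way to sketch that distribution in Elixir is Task.async_stream, which spreads work across short-lived processes and caps concurrency at the number of online schedulers. The squaring workload here is a stand-in for real work.

# Sketch: spread work across many processes so the scheduler can interleave
# them instead of running one long CPU-bound loop in a single process.
results =
  1..1_000
  |> Task.async_stream(fn n -> n * n end, max_concurrency: System.schedulers_online())
  |> Enum.map(fn {:ok, result} -> result end)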

Common misconceptions

  • “BEAM is always fast.” It is fair and responsive, but CPU-heavy tasks can still bottleneck.
  • “GC is global.” It is per-process.

Check-your-understanding questions

  1. Why does per-process GC reduce latency spikes?
  2. How does reduction-based scheduling prevent starvation?
  3. What happens if one process receives all work?

Check-your-understanding answers

  1. Only the allocating process pauses; others continue.
  2. Processes are preempted after a fixed budget.
  3. It becomes the bottleneck regardless of scheduler fairness.

Real-world applications

  • High-concurrency web services
  • Stream processing pipelines
  • Real-time dashboards

Where you’ll apply it

  • Project 2 (Rate Limiter)
  • Project 5 (GenStage Pipeline)
  • Project 10 (Telemetry Pipeline)

References

  • Erlang Efficiency Guide: Processes and heap size.
  • Erlang Efficiency Guide: Process Messages.

Key insights Fair scheduling and localized GC are core to BEAM responsiveness.

Summary BEAM schedules processes fairly and performs GC locally, enabling predictable latency under heavy concurrency when workloads are well-distributed.

Homework/Exercises to practice the concept

  1. Explain why a single CPU-bound process can hurt system responsiveness.
  2. Draw a process graph that avoids a single bottleneck.

Solutions to the homework/exercises

  1. The scheduler can preempt but the workload remains centralized.
  2. Use multiple workers with a supervisor and load distribution.

Concept 4: Distribution, Nodes, and Fault Boundaries

Fundamentals A distributed Erlang system consists of multiple runtime nodes that communicate over TCP/IP. Message passing between processes on different nodes is transparent when pids are used, as are links and monitors. Registered names, however, are local to each node. This means distribution feels like local message passing but introduces network latency, partial failure, and security concerns. Distributed nodes must be explicitly named, and secure distribution requires TLS configuration.

Deep Dive Distribution is one of BEAM’s strongest differentiators. When you connect nodes, you gain transparent messaging, links, and monitors across machines. The same primitives used for local fault detection can therefore be used across a cluster. This simplifies distributed design because you do not need a separate protocol for failure detection; the runtime already provides it.

However, transparency does not remove the realities of distributed systems. Messages must be serialized into an external term format and transmitted over TCP; this adds latency and increases the cost of large messages. The system can still experience partitions or node failures. When a node goes down, linked or monitored processes receive exit or DOWN messages, which you must handle in supervision logic. This is how BEAM represents network failure in the same language as process failure.

Naming is another critical issue. Pids are globally unique within a distributed system, but registered names are node-local. If you want to address a named process on another node, you must include the node name. This drives design decisions about service discovery: do you rely on global registries, or do you maintain a routing layer? These are not just configuration choices; they are part of your failure model.

Security is also explicit. The distribution documentation warns that starting a node without TLS (inet_tls) exposes it to attacks that may give complete access to the node and the cluster. This means secure distribution is not optional in production. You must choose cookies, TLS, and network isolation deliberately.

The distributed model lends itself to certain architectures: sharded state per node, data locality via process placement, and supervision trees that span nodes. But you must also design for partition tolerance. If you choose to register a single global name, what happens during a split-brain? Your projects will force you to answer these questions by building a distributed KV store and a presence service that handles node loss gracefully.

How this fits into the projects Projects 3, 7, and 9 rely on distributed Erlang semantics and failure handling.

Definitions & key terms

  • Node: A named Erlang runtime in a distributed system.
  • Distribution: Transparent messaging across nodes.
  • Registered name: A local alias for a pid on a node.
  • Partition: Loss of connectivity between nodes.

Mental model diagram

Node A (nodea@host) <--- TCP ---> Node B (nodeb@host)
   |                                 |
  Pid A                             Pid B
send(Pid B, Msg) works like local message send

How it works (step-by-step, with invariants and failure modes)

  1. Start nodes with explicit names.
  2. Connect nodes; establish distribution channel.
  3. Send messages using pids across nodes.
  4. Monitor remote pids for failure.
  5. Invariant: messaging semantics are the same locally and remotely.
  6. Failure modes: network partitions, node crashes, name collisions, insecure distribution.

Minimal concrete example

Cluster messaging:
- Node A sends {ping, t} to Pid on Node B
- Node B replies {pong, t}
- If Node B disconnects, Node A receives DOWN
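
A sketch of that exchange in Elixir, assuming two nodes already started with --sname a and --sname b on the same host and sharing a cookie; all node names here are placeholders.

# Run from node a@myhost
true = Node.connect(:"b@myhost")

remote_pid =
  Node.spawn(:"b@myhost", fn ->
    receive do
      {:ping, from, t} -> send(from, {:pong, t})
    end
  end)

ref = Process.monitor(remote_pid)
send(remote_pid, {:ping, self(), System.monotonic_time()})

receive do
  {:pong, _t} -> IO.puts("pong received")
  {:DOWN, ^ref, :process, _pid, reason} -> IO.puts("remote process down: #{inspect(reason)}")
end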

Common misconceptions

  • “Distributed Erlang hides all network issues.” It hides APIs, not failures.
  • “Registered names are global.” They are local per node.

Check-your-understanding questions

  1. What must be included when sending to a registered name on another node?
  2. Why is TLS important for distribution?
  3. How do you detect that a remote node is down?

Check-your-understanding answers

  1. You must include the node name because registrations are local.
  2. Without TLS the node may be fully compromised.
  3. Links or monitors deliver exit/DOWN messages.

Real-world applications

  • Distributed caches
  • Presence systems and chat backends
  • Clustered job processors

Where you’ll apply it

  • Project 3 (Distributed KV Store)
  • Project 7 (Presence and Notification Service)
  • Project 9 (Hot Code Upgrade Drill)

References

  • Distributed Erlang documentation.
  • External Term Format documentation.

Key insights Distribution is transparent in API, not in failure semantics.

Summary Distributed Erlang makes cross-node messaging feel local while preserving the realities of latency and failure. You must design supervision and naming with these realities in mind.

Homework/Exercises to practice the concept

  1. Sketch a two-node cluster and describe how to detect node failure.
  2. Describe a naming strategy for a distributed service.

Solutions to the homework/exercises

  1. Use monitors on remote pids and handle DOWN messages.
  2. Use a local registry plus a routing process per node.

Concept 5: State Management with ETS and Mnesia

Fundamentals BEAM systems often store mutable state inside processes, but ETS (Erlang Term Storage) provides in-memory tables with efficient access for larger shared datasets. ETS tables are dynamic tables created by a process, and they support multiple table types (set, ordered_set, bag, duplicate_bag). Mnesia builds on these concepts to provide a distributed database with transactions and replication, but it inherits ETS-like access patterns.

Deep Dive In BEAM, mutable state is typically owned by a single process, which serializes access. This is ideal for small, strongly encapsulated state. For large datasets or shared lookups, ETS is the standard tool. ETS provides constant-time access for sets and logarithmic access for ordered sets, making it useful for caches, registries, and session tables.

ETS tables are created by a process and are destroyed when that process terminates. This is an important lifecycle property: your table’s availability depends on the owner process. If you put ETS in a supervisor-managed owner process, the table can be recreated on restart, but you must think about persistence and recovery. For state that cannot be lost, you either persist to disk or use Mnesia, which provides transactions and replication across nodes.

ETS is not a database; it is a high-performance in-memory store optimized for BEAM use. It supports match and select operations, but these can be expensive and may require scanning the whole table. The tables-and-databases guide explicitly warns that select/match can become expensive and recommends structuring data to minimize full scans. This is a key design constraint: ETS is fast for key-based access, but you must design your keys and queries intentionally.

Mnesia adds transactions and distribution. Its API resembles ETS for basic operations, but it supports replication and transactions across nodes. This makes it attractive for distributed KV stores, but it also introduces complexity: partitions, schema management, and recovery. For learning projects, you can simulate a Mnesia-like log with ETS plus append-only persistence to understand the trade-offs.

In OTP systems, ETS often sits behind a GenServer or GenStage pipeline. The process serializes writes and defines a clear API, while ETS provides fast reads. This pattern gives you the speed of shared memory with the safety of controlled access. It is not magic; you still must handle concurrency, but you do so at the process boundary rather than with locks.

How this fits into the projects Projects 2, 3, and 8 depend on ETS-based storage and state management.

Definitions & key terms

  • ETS: Built-in term storage with fast access.
  • Table owner: The process that created an ETS table.
  • Select/match: Query operations that may scan tables.
  • Mnesia: Distributed DB built on Erlang concepts.

Mental model diagram

[Client] -> [GenServer] -> [ETS Table]
                |            (fast lookup)
                v
             [State]

How it works (step-by-step, with invariants and failure modes)

  1. A process creates an ETS table.
  2. Clients access via API or direct table reads.
  3. Data is inserted, updated, looked up by key.
  4. If owner dies, the table is destroyed.
  5. Invariant: ETS access is efficient for key-based reads.
  6. Failure modes: table loss on crash, expensive scans, inconsistent writes.

Minimal concrete example

Session Cache:
- Key: session_id
- Value: user_id, expiry
- Read path: lookup by session_id
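
A minimal ETS sketch of that read path. The table name and value shape are assumptions; in a real service the table would be owned by a supervised process so it can be recreated on restart.

# Create the table; it disappears if the creating process exits.
table = :ets.new(:sessions, [:set, :public, read_concurrency: true])

:ets.insert(table, {"session-abc", %{user_id: 42, expires_at: 1_700_000_000}})

case :ets.lookup(table, "session-abc") do
  [{_session_id, session}] -> {:ok, session}
  [] -> :not_found
end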

Common misconceptions

  • “ETS is durable.” It is in-memory; the table is destroyed on owner crash.
  • “ETS is a full database.” It is optimized for key access, not complex queries.

Check-your-understanding questions

  1. What happens to an ETS table when its owner crashes?
  2. Why are select/match operations potentially expensive?
  3. When would you choose Mnesia over ETS?

Check-your-understanding answers

  1. The table is destroyed with its owner process.
  2. They may scan the entire table.
  3. When you need replication or transactions across nodes.

Real-world applications

  • Session stores and caches
  • Local registries
  • Distributed KV stores

Where you’ll apply it

  • Project 2 (Rate Limiter)
  • Project 3 (Distributed KV Store)
  • Project 8 (ETS Cache Service)

References

  • ETS documentation.
  • Tables and Databases guide.

Key insights ETS gives you fast shared data when you respect its lifecycle and query limits.

Summary ETS is a high-performance in-memory table system; Mnesia adds distribution and transactions. Both require careful lifecycle and query design.

Homework/Exercises to practice the concept

  1. Design a table schema for a rate limiter.
  2. Identify which operations require full-table scans and avoid them.

Solutions to the homework/exercises

  1. Key by {user_id, window} with count as value.
  2. Use key lookups; avoid scan-based queries.

Concept 6: Real-Time Web and Backpressure (LiveView + GenStage)

Fundamentals Phoenix LiveView enables rich, real-time user experiences with server-rendered HTML, diff tracking, and WebSocket-based updates. A persistent connection is established between client and server, which reduces work per request and allows faster reactions to user events. GenStage provides backpressure-aware pipelines where consumers explicitly demand events and producers never send more than requested. Together, they demonstrate how BEAM’s process model supports real-time systems without heavy client-side code or external queues.

Deep Dive LiveView works by rendering HTML on the server, sending it to the client initially as a static page, then maintaining a persistent connection that streams updates. This design means the client does not need to own application state; instead, the server is authoritative. The runtime diffs state changes and sends only the minimal updates. This is powerful for dashboards, collaborative tools, and real-time monitoring where consistency matters more than offline capability.

GenStage addresses the other side of real-time systems: throughput and backpressure. In a naive system, producers can overwhelm consumers, leading to mailbox growth and latency spikes. GenStage lets consumers explicitly demand a specific number of events, ensuring that producers never outpace the system’s capacity. This is not a library convenience; it is a concurrency contract that aligns with BEAM’s message-passing model.

The Discord engineering blog describes how Elixir was used for a highly concurrent real-time system, reporting nearly five million concurrent users and millions of events per second by July 2017. This scale is only achievable when flow control is a first-class concern. You cannot rely on infinite queues or manual throttling. The runtime must treat demand as a signal that shapes how work moves through the system.

LiveView and GenStage are often combined. A real-time dashboard can subscribe to a GenStage pipeline that feeds events; the LiveView process receives messages and updates its state, and the runtime only sends changed parts of the UI. This is a BEAM-native approach to real-time systems: concurrency, backpressure, and UI updates all in one runtime.
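
A sketch of the LiveView side of that flow, assuming events arrive over Phoenix.PubSub under a hypothetical "metrics" topic; the module name, PubSub name, and message shape are illustrative, not Phoenix defaults.

defmodule DashboardLive do
  use Phoenix.LiveView

  @impl true
  def mount(_params, _session, socket) do
    if connected?(socket), do: Phoenix.PubSub.subscribe(MyApp.PubSub, "metrics")
    {:ok, assign(socket, :latest, nil)}
  end

  # Each incoming metric updates server-side state; LiveView sends only the diff.
  @impl true
  def handle_info({:metric, value}, socket) do
    {:noreply, assign(socket, :latest, value)}
  end

  @impl true
  def render(assigns) do
    ~H"<p>Latest metric: <%= @latest %></p>"
  end
end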

How this fits into the projects Projects 4, 5, and 10 are built on these concepts.

Definitions & key terms

  • LiveView: Server-rendered, real-time UI with diff updates.
  • Persistent connection: Long-lived channel between client and server.
  • Backpressure: Consumer-controlled demand to prevent overload.
  • Producer/Consumer: Components in a data pipeline.

Mental model diagram

Events -> GenStage Producer -> GenStage Consumer -> LiveView -> Browser
         (demand-based flow)        (bounded)       (diff updates)

How it works (step-by-step, with invariants and failure modes)

  1. Producers emit events only when consumers demand them.
  2. Consumers process events and update state.
  3. LiveView diffs state and pushes UI updates.
  4. Invariant: demand controls flow; UI updates are incremental.
  5. Failure modes: unbounded demand, heavy LiveView processes, burst storms.

Minimal concrete example

Real-time chart:
- Producer emits metrics
- Consumer aggregates
- LiveView updates chart every second
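
A sketch of the demand-driven pair behind that chart. The GenStage callbacks are standard, but the module names and integer events are placeholders for real metrics.

defmodule Metrics.Producer do
  use GenStage

  def start_link(_), do: GenStage.start_link(__MODULE__, 0, name: __MODULE__)

  @impl true
  def init(counter), do: {:producer, counter}

  # Emit only as many events as consumers have asked for.
  @impl true
  def handle_demand(demand, counter) when demand > 0 do
    events = Enum.to_list(counter..(counter + demand - 1))
    {:noreply, events, counter + demand}
  end
end

defmodule Metrics.Consumer do
  use GenStage

  def start_link(_), do: GenStage.start_link(__MODULE__, :ok)

  @impl true
  def init(:ok), do: {:consumer, :ok, subscribe_to: [{Metrics.Producer, max_demand: 10}]}

  @impl true
  def handle_events(events, _from, state) do
    Enum.each(events, &IO.inspect/1)
    {:noreply, [], state}
  end
end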

Common misconceptions

  • “LiveView is just websockets.” It is server-rendered with diff updates.
  • “Backpressure is optional.” Without it, queues grow until failure.

Check-your-understanding questions

  1. Why is demand-driven flow safer than unbounded queues?
  2. What does LiveView send after the initial render?
  3. How would you prevent a slow consumer from collapsing a pipeline?

Check-your-understanding answers

  1. It bounds work and prevents overload.
  2. Only the diffs (changed parts) are sent.
  3. Limit demand and use backpressure-aware stages.

Real-world applications

  • Monitoring dashboards
  • Chat systems and activity feeds
  • Burst-resistant pipelines

Where you’ll apply it

  • Project 4 (LiveView Dashboard)
  • Project 5 (GenStage Pipeline)
  • Project 10 (Telemetry Pipeline)

References

  • Phoenix LiveView README.
  • Phoenix LiveView docs (persistent connection).
  • GenStage documentation.
  • Discord engineering blog (5M concurrent users, millions events/sec).

Key insights Real-time systems need both UI diffing and backpressure to stay stable.

Summary LiveView delivers server-rendered realtime UI; GenStage delivers safe flow control. Combined, they are a BEAM-native real-time stack.

Homework/Exercises to practice the concept

  1. Design a pipeline with bounded demand and describe how to tune it.
  2. Sketch a LiveView state update flow for a dashboard.

Solutions to the homework/exercises

  1. Set demand limits and buffer sizes at each stage.
  2. Server state changes trigger diff updates to the client.

Concept 7: Code Loading and Release Handling

Fundamentals Erlang/OTP supports runtime code replacement and release handling through SASL. The release handler installs upgrades based on appup and relup instructions, and can reload, restart, or replace applications as needed. Core applications (ERTS, kernel, stdlib, sasl) require runtime restarts during upgrades via restart_new_emulator, while other upgrades may use restart_emulator or in-place changes.

Deep Dive Runtime code loading is a distinguishing feature of BEAM. The system can load a new version of a module while the system runs, and processes can transition to the new code when they next call into that module. This is not magic; it relies on careful design of process state and upgrade callbacks. The release handling framework in OTP formalizes this for full releases. It uses appup files to describe application upgrade steps and relup files to describe release-level steps.

Release handling is explicit about restart boundaries. The documentation describes restart_new_emulator for upgrades that change the runtime system or core applications. This instruction reboots the runtime and is required when ERTS or core apps are upgraded. For other upgrades, restart_emulator can be used at the end of a relup to reboot after upgrade instructions are executed. This makes it clear that “hot upgrade” has constraints: some upgrades require a controlled reboot, not a pure in-place swap.

A key challenge in hot upgrades is state migration. If the internal state structure of a process changes, you must define a code_change step to transform state. This is a design commitment; if you do not plan for it, hot upgrades will be painful. Projects in this guide include a controlled upgrade drill so you can practice the mechanics of a safe state transition.

Release handling also interacts with distribution. Each node can have its own release version, and upgrades can be coordinated across nodes using synchronization instructions. This enables rolling upgrades or staged rollouts, but only if you design your upgrade plan and compatibility boundaries.

The goal of learning this concept is not to make every project hot-upgradeable. It is to understand the constraints and the tooling so that when uptime requirements demand it, you can design for upgrade safety rather than retrofitting under pressure.

How this fits into the projects Project 9 is entirely about release handling, and projects 1-3 benefit from upgrade-safe state design.

Definitions & key terms

  • Release handler: OTP component that installs upgrades.
  • appup/relup: Upgrade instruction files for apps and releases.
  • restart_new_emulator: Required for core runtime upgrades.
  • restart_emulator: Reboot instruction for non-core upgrades.

Mental model diagram

Old Release -> appup/relup -> release_handler -> Upgrade Steps -> New Release
      |                                                |
      +---- code_change(state) ------------------------+

How it works (step-by-step, with invariants and failure modes)

  1. Build release with appup/relup instructions.
  2. Install release package on running node.
  3. Release handler executes upgrade steps.
  4. Processes transition state via code_change.
  5. Invariant: state transitions must be explicit and safe.
  6. Failure modes: incompatible state, missing appup, core upgrade without restart.

Minimal concrete example

State migration:
- v1 state: {user_id, count}
- v2 state: {user_id, count, last_seen}
- code_change adds last_seen with default
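
That migration maps onto a GenServer code_change callback. This is a sketch with an assumed tuple state; the version argument is ignored for brevity, and a real upgrade would also need matching appup instructions.

defmodule Session.Server do
  use GenServer

  @impl true
  def init(user_id), do: {:ok, {user_id, 0, nil}}

  # v1 state was {user_id, count}; v2 adds last_seen with a default of nil.
  @impl true
  def code_change(_old_vsn, {user_id, count}, _extra) do
    {:ok, {user_id, count, nil}}
  end
end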

Common misconceptions

  • “All upgrades are hot.” Core runtime upgrades require restart.
  • “State changes are automatic.” You must define migration steps.

Check-your-understanding questions

  1. Why do core OTP applications require runtime restart?
  2. What does an appup file describe?
  3. Why is state migration central to hot upgrades?

Check-your-understanding answers

  1. The runtime itself cannot hot-swap its core components.
  2. Upgrade instructions between application versions.
  3. Processes carry state across versions; you must transform it.

Real-world applications

  • Zero-downtime upgrades in telecom systems
  • Rolling upgrades in distributed services

Where you’ll apply it

  • Project 9 (Hot Code Upgrade Drill)
  • Project 1 (Chat System) as optional enhancement

References

  • Release Handling in OTP.
  • restart_new_emulator and restart_emulator instructions.

Key insights Hot upgrades are possible only when state and upgrade steps are explicit.

Summary Release handling makes runtime upgrades possible but requires careful design of state, versioning, and upgrade steps.

Homework/Exercises to practice the concept

  1. Describe a state change that requires migration.
  2. Sketch an upgrade plan with a rollback strategy.

Solutions to the homework/exercises

  1. Add a new field with a default value during upgrade.
  2. Deploy appup, verify health, and maintain rollback relup.

Glossary

  • BEAM: The Erlang virtual machine and runtime system.
  • OTP: Open Telecom Platform; a set of libraries and design principles.
  • GenServer: OTP behavior for server processes.
  • Supervisor: OTP behavior that monitors and restarts children.
  • Reduction: Scheduling unit for fairness.
  • ETS: In-memory term storage.
  • Distributed Erlang: Nodes connected for transparent message passing.
  • LiveView: Server-rendered real-time UI with diffs.
  • GenStage: Backpressure-aware pipeline.
  • Release handling: OTP upgrade framework.

Why BEAM Matters

  • BEAM’s process model and supervision trees are explicitly designed for fault tolerance at scale.
  • Discord reports scaling Elixir to nearly five million concurrent users and millions of events per second (July 6, 2017), demonstrating BEAM’s relevance for real-time systems.
  • LiveView and GenStage provide a BEAM-native approach to real-time UI and backpressure-driven pipelines.

ASCII diagram: old vs new concurrency model

OLD (Threads + Locks)             NEW (Processes + Messages)
Shared state                      Isolated state
Locks and contention              Message passing
One crash can corrupt state       Crash is isolated
Complex recovery                  Supervisor restarts

Concept Summary Table

Concept Cluster          | What You Need to Internalize
Process Model            | Lightweight isolated processes, message copying, mailbox semantics.
OTP + Supervision        | GenServer/Supervisor lifecycle and restart strategies.
Scheduling + GC          | Reduction-based fairness and per-process GC.
Distribution             | Node naming, transparent messaging, and failure semantics.
State + ETS              | In-memory tables, lifecycle, and query constraints.
Real-Time + Backpressure | LiveView diffing and GenStage demand control.
Release Handling         | appup/relup, code_change, and upgrade boundaries.

Project-to-Concept Map

Project    | Concepts Applied
Project 1  | Process Model, OTP + Supervision
Project 2  | OTP + Supervision, State + ETS
Project 3  | Distribution, State + ETS
Project 4  | Real-Time + Backpressure
Project 5  | Real-Time + Backpressure, Scheduling + GC
Project 6  | OTP + Supervision, Scheduling + GC
Project 7  | Distribution, Process Model
Project 8  | State + ETS
Project 9  | Release Handling
Project 10 | Scheduling + GC, Real-Time + Backpressure

Deep Dive Reading by Concept

Concept                  | Book and Chapter                                                     | Why This Matters
Process Model            | “Programming Erlang” by Joe Armstrong - Processes & Message Passing | Actor model and concurrency fundamentals
OTP + Supervision        | “Designing for Scalability with Erlang/OTP” - Supervision           | Practical fault-tolerance design
Scheduling + GC          | “The BEAM Book” - Scheduler & Memory                                 | Understand runtime behavior
Distribution             | “Programming Erlang” - Distributed Erlang                            | Cluster semantics and failure handling
State + ETS              | “Erlang and OTP in Action” - ETS/Mnesia                              | Data structures and storage
Real-Time + Backpressure | “Elixir in Action” - Processes and GenStage                          | Flow control in pipelines
Release Handling         | “Programming Erlang” - Code upgrades                                 | Safe runtime upgrades

Quick Start: Your First 48 Hours

Day 1:

  1. Read Concept 1 and Concept 2 in the Theory Primer.
  2. Start Project 1 and get a supervised GenServer running.

Day 2:

  1. Validate Project 1 against the Definition of Done.
  2. Read Concept 3 and Concept 5 for ETS and scheduling basics.

Path 1: The Web Builder

  • Project 4 -> Project 1 -> Project 2 -> Project 10

Path 2: The Systems Learner

  • Project 1 -> Project 2 -> Project 3 -> Project 7 -> Project 9

Path 3: The Distributed Systems Learner

  • Project 3 -> Project 7 -> Project 5 -> Project 10

Success Metrics

  • You can explain supervision strategies and justify your choices.
  • You can design a process tree for a real service.
  • You can identify and fix mailbox backlogs.
  • You can run a controlled upgrade with a state migration step.

Optional Appendix: BEAM Tooling Cheat Sheet

  • observer: visualize processes, memory, and message queues
  • :sys: inspect GenServer state and system messages
  • :dbg: trace function calls and message flow
  • telemetry: instrument and export metrics

Project Overview Table

#  | Project Name                      | Main Language | Difficulty | Time Estimate | Core Concepts         | Coolness
1  | Supervised Chat System            | Elixir        | Level 2    | 10-15 hrs     | OTP + Supervision     | Level 3
2  | Rate Limiter + Circuit Breaker    | Elixir        | Level 2    | 12-18 hrs     | ETS + Supervision     | Level 3
3  | Distributed KV Store              | Erlang/Elixir | Level 3    | 20-30 hrs     | Distribution + ETS    | Level 4
4  | LiveView Real-Time Dashboard      | Elixir        | Level 2    | 12-20 hrs     | LiveView              | Level 3
5  | GenStage Backpressure Pipeline    | Elixir        | Level 3    | 18-25 hrs     | Backpressure          | Level 4
6  | Fault Injection Harness           | Elixir        | Level 2    | 10-15 hrs     | Supervision           | Level 3
7  | Presence and Notification Service | Erlang/Elixir | Level 3    | 20-30 hrs     | Distribution          | Level 4
8  | ETS Cache Service                 | Erlang/Elixir | Level 2    | 10-15 hrs     | ETS                   | Level 3
9  | Hot Code Upgrade Drill            | Erlang        | Level 3    | 15-25 hrs     | Release Handling      | Level 4
10 | Telemetry Pipeline + Live Metrics | Elixir        | Level 3    | 15-25 hrs     | Scheduling + LiveView | Level 4

Project List

The following projects guide you from core OTP patterns to distributed, real-time systems.

Project 1: Supervised Real-Time Chat System

  • File: P01-supervised-chat-system.md
  • Main Programming Language: Elixir
  • Alternative Programming Languages: Erlang, Gleam
  • Coolness Level: Level 3 (See REFERENCE.md)
  • Business Potential: Level 2 (See REFERENCE.md)
  • Difficulty: Level 2 (See REFERENCE.md)
  • Knowledge Area: Concurrency, Fault Tolerance
  • Software or Tool: OTP/GenServer
  • Main Book: “Programming Erlang”

What you will build: A multi-room chat service where each room is a supervised process and failures are isolated.

Why it teaches BEAM: It forces you to model state as processes and to recover from crashes using supervisors.

Core challenges you will face:

  • Process isolation -> Process Model
  • Room supervision -> OTP + Supervision
  • Message routing -> Process Model

Real World Outcome

You can run a CLI client for two rooms, kill a room process, and watch it restart without losing other rooms.

$ chatctl create room general
room general: pid=<0.215.0>

$ chatctl send general "hello"
[general] user42: hello

$ chatctl crash room general
room general crashed; supervisor restarted it

$ chatctl send general "we are back"
[general] user42: we are back

The Core Question You Are Answering

“How do I build a service where one failing chat room does not bring down the whole system?”

Concepts You Must Understand First

  1. BEAM process isolation
    • Why are messages copied and state isolated?
    • Book Reference: “Programming Erlang” - Processes chapter
  2. Supervision strategy
    • Which restart strategy fits independent rooms?
    • Book Reference: “Designing for Scalability with Erlang/OTP” - Supervision
  3. GenServer lifecycle
    • How are calls and casts handled?
    • Book Reference: “Elixir in Action” - OTP section

Questions to Guide Your Design

  1. State model
    • Will each room be one process or multiple?
    • How will you store room history safely?
  2. Failure model
    • What happens when a room crashes mid-message?
    • How do you inform clients that the room restarted?

Thinking Exercise

Room Isolation Sketch

Draw a supervision tree for 3 rooms. Decide where to place a router process that handles room discovery.

Questions to answer:

  • What should happen if the router dies?
  • Which processes should be linked vs monitored?

The Interview Questions They Will Ask

  1. “Why use one process per room instead of shared state?”
  2. “How does supervision improve reliability?”
  3. “How do GenServer calls differ from casts?”
  4. “What happens to messages during a crash?”
  5. “How would you scale this across nodes?”

Hints in Layers

Hint 1: Starting Point Start with one room process and a simple send/receive API.

Hint 2: Next Level Add a supervisor that restarts the room on crash.

Hint 3: Technical Details Pseudocode:

Room process:
- state: list of users
- on {join, user}: add to list
- on {message, user, text}: broadcast
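
If you want a concrete starting shape, here is a hedged GenServer sketch of the room; user entries are assumed to be pids, and broadcasting details, history, and error handling are left for you to design.

defmodule Chat.Room do
  use GenServer

  def start_link(name), do: GenServer.start_link(__MODULE__, name)

  @impl true
  def init(name), do: {:ok, %{name: name, users: []}}

  @impl true
  def handle_cast({:join, user_pid}, state) do
    {:noreply, %{state | users: [user_pid | state.users]}}
  end

  @impl true
  def handle_cast({:message, from, text}, state) do
    Enum.each(state.users, &send(&1, {:chat, state.name, from, text}))
    {:noreply, state}
  end
end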

Hint 4: Tools/Debugging Use observer to confirm the room process restarts after a crash.

Books That Will Help

Topic       | Book                                        | Chapter
Processes   | “Programming Erlang”                        | Processes chapter
Supervision | “Designing for Scalability with Erlang/OTP” | Supervision chapter

Common Pitfalls and Debugging

Problem 1: “Room crashes kill the whole app”

  • Why: You started rooms without a supervisor.
  • Fix: Put each room under a one_for_one supervisor.
  • Quick test: Crash a room and verify the supervisor restarts only that room.

Definition of Done

  • Each room is its own process
  • Rooms restart on crash without affecting others
  • Messages still flow after a restart
  • Process tree visible in observer

Project 2: Rate Limiter and Circuit Breaker Service

  • File: P02-rate-limiter-circuit-breaker.md
  • Main Programming Language: Elixir
  • Alternative Programming Languages: Erlang, Gleam
  • Coolness Level: Level 3 (See REFERENCE.md)
  • Business Potential: Level 2 (See REFERENCE.md)
  • Difficulty: Level 2 (See REFERENCE.md)
  • Knowledge Area: State, Fault Tolerance
  • Software or Tool: ETS + GenServer
  • Main Book: “Erlang and OTP in Action”

What you will build: A rate limiter backed by ETS and a circuit breaker that trips on repeated failures.

Why it teaches BEAM: You must manage state safely and choose supervision policies.

Core challenges you will face:

  • ETS design -> State + ETS
  • Supervisor strategy -> OTP + Supervision
  • Timeout handling -> Scheduling + GC

Real World Outcome

$ ratelimit check user42
allowed (remaining=9)

$ ratelimit check user42
blocked (retry_in=12s)

$ breaker status serviceA
state: open (cooldown=30s)

The Core Question You Are Answering

“How do I enforce limits and protect dependencies without locks or shared memory races?”

Concepts You Must Understand First

  1. ETS tables
    • What happens when the owner crashes?
    • Book Reference: “Erlang and OTP in Action” - ETS chapter
  2. Supervision strategy
    • How do you restart a limiter safely?
    • Book Reference: “Designing for Scalability with Erlang/OTP”
  3. GenServer state
    • How do you serialize updates?
    • Book Reference: “Elixir in Action”

Questions to Guide Your Design

  1. Rate limiting algorithm
    • Fixed window, sliding window, or token bucket?
    • How will you store timestamps efficiently?
  2. Circuit breaker
    • What error threshold trips the breaker?
    • How long before a half-open state?

Thinking Exercise

State Table Design

Design the ETS key for {user_id, window} and decide how to prune expired windows.

Questions to answer:

  • How do you keep reads constant time?
  • What is your cleanup strategy?

The Interview Questions They Will Ask

  1. “Why choose ETS for a rate limiter?” citeturn2search7
  2. “How do you avoid race conditions without locks?”
  3. “What supervision strategy is best for a limiter?” citeturn0search0
  4. “How do you prevent restart storms?”

Hints in Layers

Hint 1: Starting Point Model the limiter as one GenServer that owns ETS.

Hint 2: Next Level Store counters per time window and expire old keys.

Hint 3: Technical Details Pseudocode:

on check(user):
  window = floor(now / window_size)
  count = lookup({user, window})
  if count < limit -> increment and allow
  else -> deny

Hint 4: Tools/Debugging Use observer to confirm ETS table size over time.
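
A sketch of Hint 3 in Elixir, assuming a named public ETS table owned by the limiter GenServer; the module name, table name, window size, and limit are illustrative:

defmodule RateLimiter do
  use GenServer

  @table :rate_limits
  @window_ms 60_000
  @limit 10

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)

  @impl true
  def init(_opts) do
    # The GenServer owns the table; if it crashes, counts reset (a documented trade-off).
    :ets.new(@table, [:named_table, :set, :public, write_concurrency: true])
    {:ok, %{}}
  end

  # Checks hit ETS directly, so they never queue behind the GenServer mailbox.
  # :ets.update_counter/4 inserts the default row and increments atomically,
  # so concurrent callers cannot race on a read-then-write.
  def check(user) do
    window = div(System.system_time(:millisecond), @window_ms)
    count = :ets.update_counter(@table, {user, window}, {2, 1}, {{user, window}, 0})

    if count <= @limit, do: {:allow, @limit - count}, else: {:deny, :retry_later}
  end
end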

Books That Will Help

Topic Book Chapter
ETS design “Erlang and OTP in Action” ETS chapter
Fault tolerance “Designing for Scalability with Erlang/OTP” Supervision chapter

Common Pitfalls and Debugging

Problem 1: “Limiter forgets counts after crash”

  • Why: An ETS table is destroyed when its owner process dies. citeturn2search7
  • Fix: Rebuild counters from a log, or accept the reset as an explicit design choice.
  • Quick test: Crash owner and verify expected behavior.

Definition of Done

  • Limits enforced per user and window
  • ETS table lifecycle documented
  • Circuit breaker transitions documented
  • Supervisor restarts without global failure

Project 3: Distributed Key-Value Store

  • File: P03-distributed-kv-store.md
  • Main Programming Language: Erlang or Elixir
  • Alternative Programming Languages: Gleam
  • Coolness Level: Level 4 (See REFERENCE.md)
  • Business Potential: Level 3 (See REFERENCE.md)
  • Difficulty: Level 3 (See REFERENCE.md)
  • Knowledge Area: Distribution, Data Stores
  • Software or Tool: Distributed Erlang + ETS
  • Main Book: “Programming Erlang”

What you will build: A sharded KV store that runs on multiple nodes and replicates keys.

Why it teaches BEAM: It forces you to design with node failure, message passing, and ETS-backed state. citeturn5search4turn2search7

Core challenges you will face:

  • Node discovery -> Distribution
  • Shard mapping -> Process Model
  • Replication -> Distribution + ETS

Real World Outcome

$ kv put user:42 "active" --node nodeA
ok

$ kv get user:42 --node nodeB
"active" (replica)

$ kv cluster status
nodes: [nodeA,nodeB,nodeC]
shards: 64
replication: 2

The Core Question You Are Answering

“How do I maintain state across nodes while handling node failures gracefully?”

Concepts You Must Understand First

  1. Distributed Erlang nodes
    • How are nodes named and connected? citeturn5search4
    • Book Reference: “Programming Erlang” - Distribution chapter
  2. ETS lifecycle
    • What happens to tables when a node restarts? citeturn2search7
    • Book Reference: “Erlang and OTP in Action” - ETS chapter
  3. Supervisor policies
    • How do you restart shard processes? citeturn0search0
    • Book Reference: “Designing for Scalability with Erlang/OTP”

Questions to Guide Your Design

  1. Sharding
    • How will you assign keys to shards?
  2. Replication
    • How many replicas, and how do you keep them consistent?

Thinking Exercise

Failure Drill

Simulate node loss: what happens to keys on that node and how do clients recover?

Questions to answer:

  • How do you detect that a node is down?
  • How do you reassign shards?

The Interview Questions They Will Ask

  1. “How does distributed Erlang handle messaging between nodes?” citeturn5search4
  2. “What happens to ETS state on node failure?” citeturn2search7
  3. “How do you handle split-brain?”
  4. “Why use supervision for shard processes?” citeturn0search0

Hints in Layers

Hint 1: Starting Point Start with two nodes and a single shard process on each.

Hint 2: Next Level Use consistent hashing to map keys to shard owners.

Hint 3: Technical Details Pseudocode:

shard = hash(key) mod shard_count
primary = shard_owner(shard)
replica = next_owner(shard)

Hint 4: Tools/Debugging Use node monitors to detect failures and log shard reassignment.
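
A sketch of the shard mapping from Hint 3 using :erlang.phash2/2. Note this is plain modulo placement; a real implementation would move to consistent hashing (Hint 2) to limit reshuffling when nodes join or leave. Module and node names are illustrative:

defmodule KV.Ring do
  @shard_count 64

  # Map a key to a shard, then a shard to its primary and replica owners.
  def shard_for(key), do: :erlang.phash2(key, @shard_count)

  def owners(key, nodes) when length(nodes) >= 2 do
    shard = shard_for(key)
    primary_index = rem(shard, length(nodes))
    replica_index = rem(primary_index + 1, length(nodes))
    {Enum.at(nodes, primary_index), Enum.at(nodes, replica_index)}
  end
end

# Example with hypothetical node names (the exact owners depend on the hash):
# KV.Ring.owners("user:42", [:"a@host", :"b@host", :"c@host"])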

Books That Will Help

Topic Book Chapter
Distribution “Programming Erlang” Distribution chapter
ETS “Erlang and OTP in Action” ETS chapter

Common Pitfalls and Debugging

Problem 1: “Keys disappear after node crash”

  • Why: State was only on the crashed node.
  • Fix: Add replication or persistence.
  • Quick test: Kill a node and verify replicas still answer.

Definition of Done

  • Keys retrievable from replica node
  • Shard ownership rebalances after node loss
  • Node discovery documented
  • Cluster health command works

Project 4: LiveView Real-Time Operations Dashboard

  • File: P04-liveview-ops-dashboard.md
  • Main Programming Language: Elixir
  • Alternative Programming Languages: None
  • Coolness Level: Level 3 (See REFERENCE.md)
  • Business Potential: Level 2 (See REFERENCE.md)
  • Difficulty: Level 2 (See REFERENCE.md)
  • Knowledge Area: Real-Time Web
  • Software or Tool: Phoenix LiveView
  • Main Book: “Programming Phoenix”

What you will build: A real-time dashboard that streams metrics to the browser with LiveView diff updates.

Why it teaches BEAM: LiveView uses server-rendered HTML with diff tracking and a persistent connection for real-time updates. citeturn3search0turn3search6

Core challenges you will face:

  • State streaming -> LiveView
  • Efficient updates -> LiveView diffing
  • Process isolation -> Process Model

Real World Outcome

The dashboard shows:

  • A top bar with system status (green/yellow/red)
  • A grid of cards for CPU, memory, message queue sizes
  • A live chart that updates once per second

Behavior:

  • When metrics spike, only the affected card updates (no full page reload). citeturn3search0

The Core Question You Are Answering

“How do I deliver live UI updates without building a heavy client app?”

Concepts You Must Understand First

  1. LiveView lifecycle
    • How does the initial render differ from updates? citeturn3search0turn3search6
    • Book Reference: “Programming Phoenix” - LiveView chapters
  2. Process model
    • What process owns the dashboard state?
    • Book Reference: “Programming Erlang” - Processes

Questions to Guide Your Design

  1. Update frequency
    • How often do you push updates without overwhelming clients?
  2. Data pipeline
    • Where do metrics originate and how are they buffered?

Thinking Exercise

Diff Strategy

Decide which UI components can update independently and which must update together.

Questions to answer:

  • What data should be computed per client?
  • What can be shared across clients?

The Interview Questions They Will Ask

  1. “Why use LiveView instead of client-side JS?” citeturn3search0
  2. “How does LiveView minimize network traffic?” citeturn3search0
  3. “How do you handle slow clients?”
  4. “What happens if a LiveView process crashes?”

Hints in Layers

Hint 1: Starting Point Build a static LiveView page with placeholders.

Hint 2: Next Level Add a timer process that sends metric updates.

Hint 3: Technical Details Pseudocode:

Every 1s:
  read metrics
  update assigns
  LiveView pushes diff

Hint 4: Tools/Debugging Use browser devtools to confirm small diff payloads.
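
A minimal LiveView sketch of the loop in Hint 3, assuming a Phoenix 1.7-era project; the module name, markup, and read_metrics/0 helper are placeholders:

defmodule DashboardWeb.MetricsLive do
  use Phoenix.LiveView

  @impl true
  def mount(_params, _session, socket) do
    # Only start the timer on the connected (WebSocket) mount, not the static render.
    if connected?(socket), do: :timer.send_interval(1_000, self(), :tick)
    {:ok, assign(socket, metrics: read_metrics())}
  end

  @impl true
  def handle_info(:tick, socket) do
    # Reassigning only :metrics keeps the diff small.
    {:noreply, assign(socket, metrics: read_metrics())}
  end

  @impl true
  def render(assigns) do
    ~H"""
    <div :for={{name, value} <- @metrics} class="card">
      <span><%= name %></span> <strong><%= value %></strong>
    </div>
    """
  end

  # Placeholder: pull real values from :erlang.statistics/1, :erlang.memory/0, etc.
  defp read_metrics do
    %{"memory_mb" => div(:erlang.memory(:total), 1_048_576)}
  end
end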

Books That Will Help

Topic Book Chapter
LiveView “Programming Phoenix” LiveView chapters

Common Pitfalls and Debugging

Problem 1: “Whole page re-renders”

  • Why: You are reassigning the entire state on each tick.
  • Fix: Update only the changed assigns.
  • Quick test: Track diff size in logs.

Definition of Done

  • Live metrics update without page refresh
  • Diff payloads remain small
  • UI degrades gracefully under load
  • Crash of dashboard process recovers cleanly

Project 5: GenStage Backpressure Pipeline

  • File: P05-genstage-backpressure-pipeline.md
  • Main Programming Language: Elixir
  • Alternative Programming Languages: Erlang
  • Coolness Level: Level 4 (See REFERENCE.md)
  • Business Potential: Level 3 (See REFERENCE.md)
  • Difficulty: Level 3 (See REFERENCE.md)
  • Knowledge Area: Stream Processing
  • Software or Tool: GenStage
  • Main Book: “Elixir in Action”

What you will build: A producer-consumer pipeline that processes bursty events with explicit demand control.

Why it teaches BEAM: GenStage's explicit demand protocol is a concrete backpressure mechanism. citeturn4search2

Core challenges you will face:

  • Demand control -> Backpressure
  • Work distribution -> Scheduler/GC
  • Failure recovery -> Supervision

Real World Outcome

$ pipeline send 10000
accepted

$ pipeline status
producer_buffer=500
consumer_demand=200
processed=9800

The Core Question You Are Answering

“How do I prevent a burst of events from collapsing my system?”

Concepts You Must Understand First

  1. Backpressure
    • Why should demand be explicit? citeturn4search2
    • Book Reference: “Elixir in Action” - GenStage section
  2. Scheduler fairness
    • Why distribute work across processes?
    • Book Reference: “The BEAM Book” - Scheduler chapter

Questions to Guide Your Design

  1. Demand size
    • How many events should a consumer request at a time?
  2. Failure policy
    • What happens if a consumer crashes mid-batch?

Thinking Exercise

Burst Simulation

Design a scenario where input spikes 10x and decide how the pipeline should respond.

Questions to answer:

  • Where do events buffer?
  • How do you detect lag?

The Interview Questions They Will Ask

  1. “What is backpressure and why do we need it?” citeturn4search2
  2. “How do GenStage producers and consumers coordinate?” citeturn4search2
  3. “How do you measure pipeline lag?”
  4. “What happens if a consumer crashes?”

Hints in Layers

Hint 1: Starting Point Start with one producer and one consumer.

Hint 2: Next Level Add multiple consumers and split demand.

Hint 3: Technical Details Pseudocode:

Consumer requests N events
Producer sends N events
Consumer processes and requests more

Hint 4: Tools/Debugging Log demand and buffer size on each stage.
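
A compact GenStage sketch of the demand loop in Hint 3, assuming the gen_stage dependency is installed; the module names and the counter-producing events are illustrative:

defmodule Pipeline.Producer do
  use GenStage

  def start_link(_), do: GenStage.start_link(__MODULE__, 0, name: __MODULE__)

  @impl true
  def init(counter), do: {:producer, counter}

  # The producer only emits what consumers have asked for: that is the backpressure.
  @impl true
  def handle_demand(demand, counter) when demand > 0 do
    events = Enum.to_list(counter..(counter + demand - 1))
    {:noreply, events, counter + demand}
  end
end

defmodule Pipeline.Consumer do
  use GenStage

  def start_link(_), do: GenStage.start_link(__MODULE__, :ok)

  @impl true
  def init(:ok) do
    # max_demand bounds how many events this consumer ever has in flight.
    {:consumer, :ok, subscribe_to: [{Pipeline.Producer, max_demand: 200}]}
  end

  @impl true
  def handle_events(events, _from, state) do
    Enum.each(events, &process/1)
    {:noreply, [], state}
  end

  defp process(_event), do: :timer.sleep(1)
end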

Books That Will Help

Topic Book Chapter
Backpressure “Elixir in Action” GenStage chapter

Common Pitfalls and Debugging

Problem 1: “Mailbox grows without bound”

  • Why: You are emitting without honoring demand. citeturn4search2
  • Fix: Enforce demand-based flow control.
  • Quick test: Trigger a large burst and verify buffers stay bounded.

Definition of Done

  • Demand-based flow implemented
  • Burst input does not crash the pipeline
  • Consumer failure is recovered via supervision
  • Metrics show bounded queues

Project 6: Fault Injection and Supervision Harness

  • File: P06-fault-injection-harness.md
  • Main Programming Language: Elixir
  • Alternative Programming Languages: Erlang
  • Coolness Level: Level 3 (See REFERENCE.md)
  • Business Potential: Level 2 (See REFERENCE.md)
  • Difficulty: Level 2 (See REFERENCE.md)
  • Knowledge Area: Fault Tolerance
  • Software or Tool: OTP Supervision
  • Main Book: “Designing for Scalability with Erlang/OTP”

What you will build: A harness that deliberately crashes workers under different strategies and records recovery behavior.

Why it teaches BEAM: You will see supervision strategies in action, not just read about them. citeturn0search0turn0search5

Core challenges you will face:

  • Strategy selection -> OTP + Supervision
  • Crash scenarios -> Process Model
  • Observability -> Scheduling/GC

Real World Outcome

$ faultlab run --strategy one_for_one
worker A crashed -> restarted
worker B unaffected

$ faultlab run --strategy one_for_all
worker A crashed -> all workers restarted

The Core Question You Are Answering

“What actually happens when a process crashes under different supervision strategies?”

Concepts You Must Understand First

  1. Supervision strategies citeturn0search0
  2. Process isolation citeturn5search0

Questions to Guide Your Design

  1. Crash injection
    • How will you trigger controlled failures?
  2. Measurement
    • What metrics show recovery speed and impact?

Thinking Exercise

Failure Tree

Draw a tree with 3 workers and decide which should restart together.

The Interview Questions They Will Ask

  1. “How do one_for_one and one_for_all differ?” citeturn0search0
  2. “How would you test your supervision tree?”
  3. “What is a restart storm and how do you prevent it?” citeturn0search0

Hints in Layers

Hint 1: Starting Point Create workers that crash on demand.

Hint 2: Next Level Swap the supervisor strategy and log the outcomes.

Hint 3: Technical Details Pseudocode:

if trigger == crash:
  worker exits with error
supervisor applies strategy

Hint 4: Tools/Debugging Use observer to watch process restarts.
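
A small sketch of the harness, assuming crash-on-demand workers; the FaultLab.* names are placeholders:

defmodule FaultLab.Worker do
  use GenServer

  def start_link(id), do: GenServer.start_link(__MODULE__, id, name: :"worker_#{id}")

  # Deliberately crash a worker to observe the supervisor's reaction.
  def crash(id), do: GenServer.cast(:"worker_#{id}", :crash)

  @impl true
  def init(id), do: {:ok, id}

  @impl true
  def handle_cast(:crash, id), do: raise("injected failure in worker #{id}")
end

defmodule FaultLab do
  # Start the same three workers under whichever strategy you want to compare.
  def start(strategy) when strategy in [:one_for_one, :one_for_all, :rest_for_one] do
    children =
      for id <- [:a, :b, :c] do
        Supervisor.child_spec({FaultLab.Worker, id}, id: id)
      end

    Supervisor.start_link(children, strategy: strategy, max_restarts: 3, max_seconds: 5)
  end
end

Running FaultLab.start(:one_for_one) and then FaultLab.Worker.crash(:a) should restart only worker a; with :one_for_all all three restart, matching the Real World Outcome above.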

Books That Will Help

Topic Book Chapter
Supervision “Designing for Scalability with Erlang/OTP” Supervision chapter

Common Pitfalls and Debugging

Problem 1: “All workers restart unexpectedly”

  • Why: You selected one_for_all for independent workers.
  • Fix: Use one_for_one when workers are independent. citeturn0search0
  • Quick test: Crash one worker and verify only that worker restarts.

Definition of Done

  • Supports multiple strategies
  • Logs restart behavior
  • Demonstrates escalation when restart intensity is exceeded
  • Clear report of recovery time

Project 7: Presence and Notification Service

  • File: P07-presence-notifications.md
  • Main Programming Language: Elixir or Erlang
  • Alternative Programming Languages: Gleam
  • Coolness Level: Level 4 (See REFERENCE.md)
  • Business Potential: Level 3 (See REFERENCE.md)
  • Difficulty: Level 3 (See REFERENCE.md)
  • Knowledge Area: Distribution, Real-Time
  • Software or Tool: Distributed Erlang
  • Main Book: “Programming Erlang”

What you will build: A multi-node presence service that tracks online users and pushes notifications.

Why it teaches BEAM: It forces you to use node naming, message passing, and failure detection across nodes. citeturn5search4

Core challenges you will face:

  • Node awareness -> Distribution
  • Presence consistency -> Process Model
  • Failure handling -> Supervision

Real World Outcome

$ presence login user42 --node nodeA
ok

$ presence status user42 --node nodeB
online (via nodeA)

$ presence notify user42 "ping"
queued

The Core Question You Are Answering

“How do I maintain a consistent view of online users across nodes?”

Concepts You Must Understand First

  1. Distributed nodes citeturn5search4
  2. Message passing semantics citeturn1search6

Questions to Guide Your Design

  1. Source of truth
    • Is presence authoritative on one node or replicated?
  2. Failure handling
    • What happens when a node goes down?

Thinking Exercise

Partition Scenario

Simulate a network split. How do you reconcile presence when nodes reconnect?

The Interview Questions They Will Ask

  1. “How does distributed Erlang handle messaging?” citeturn5search4
  2. “What are the risks of global registries?” citeturn6search0
  3. “How do you detect node failures?”

Hints in Layers

Hint 1: Starting Point Start with local presence per node.

Hint 2: Next Level Add a router that forwards queries across nodes.

Hint 3: Technical Details Pseudocode:

if user not found locally:
  query other nodes

Hint 4: Tools/Debugging Kill a node and verify presence cleans up.
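
A sketch of the cleanup side of Hint 4, assuming the presence tracker is a GenServer holding a %{user => node} map; the module name and state shape are assumptions, and the blocking cross-node calls are acceptable for a drill, not for production:

defmodule Presence.Tracker do
  use GenServer

  def start_link(_), do: GenServer.start_link(__MODULE__, :ok, name: __MODULE__)

  def login(user), do: GenServer.call(__MODULE__, {:login, user})
  def status(user), do: GenServer.call(__MODULE__, {:status, user})

  @impl true
  def init(:ok) do
    # Ask the kernel to deliver :nodeup / :nodedown messages to this process.
    :net_kernel.monitor_nodes(true)
    {:ok, %{}}
  end

  @impl true
  def handle_call({:login, user}, _from, users) do
    {:reply, :ok, Map.put(users, user, node())}
  end

  def handle_call({:status, user}, _from, users) do
    {:reply, users |> Map.get(user) |> found_or_ask_cluster(user), users}
  end

  def handle_call({:local_status, user}, _from, users) do
    {:reply, Map.get(users, user), users}
  end

  @impl true
  def handle_info({:nodedown, down}, users) do
    # Purge every user whose presence lived on the lost node.
    {:noreply, users |> Enum.reject(fn {_u, n} -> n == down end) |> Map.new()}
  end

  def handle_info({:nodeup, _node}, users), do: {:noreply, users}

  defp found_or_ask_cluster(nil, user) do
    # Hint 3: if the user is not local, ask the named tracker on each other node.
    Enum.find_value(Node.list(), :offline, fn n ->
      case GenServer.call({__MODULE__, n}, {:local_status, user}) do
        nil -> nil
        found_node -> {:online, found_node}
      end
    end)
  end

  defp found_or_ask_cluster(found_node, _user), do: {:online, found_node}
end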

Books That Will Help

Topic Book Chapter
Distribution “Programming Erlang” Distribution chapter

Common Pitfalls and Debugging

Problem 1: “Presence shows users online after node crash”

  • Why: You did not remove presence on DOWN events.
  • Fix: Monitor nodes and purge their presence entries on disconnect.
  • Quick test: Stop a node and re-check status.

Definition of Done

  • Presence queries work across nodes
  • Notifications routed to correct node
  • Node failure cleans up state
  • Basic partition handling documented

Project 8: ETS Cache and Session Service

  • File: P08-ets-cache-service.md
  • Main Programming Language: Erlang or Elixir
  • Alternative Programming Languages: Gleam
  • Coolness Level: Level 3 (See REFERENCE.md)
  • Business Potential: Level 2 (See REFERENCE.md)
  • Difficulty: Level 2 (See REFERENCE.md)
  • Knowledge Area: Data Storage
  • Software or Tool: ETS
  • Main Book: “Erlang and OTP in Action”

What you will build: An ETS-backed cache with TTL eviction and stats reporting.

Why it teaches BEAM: It forces you to manage ETS lifecycle and cleanup safely. citeturn2search7turn2search0

Core challenges you will face:

  • Table lifecycle -> ETS
  • TTL expiration -> Scheduling
  • Concurrency -> Process Model

Real World Outcome

$ cache put key1 value1 --ttl 30s
ok

$ cache get key1
value1

$ cache stats
entries=1200 evictions=45

The Core Question You Are Answering

“How do I build a fast in-memory cache without corrupting shared state?”

Concepts You Must Understand First

  1. ETS basics citeturn2search7
  2. Table lifecycle citeturn2search7

Questions to Guide Your Design

  1. Eviction
    • Time-based vs size-based eviction?
  2. Ownership
    • Which process owns the table?

Thinking Exercise

TTL Sweep

Design a cleanup process that removes expired keys without scanning all entries every time.

The Interview Questions They Will Ask

  1. “What happens if the ETS owner dies?” citeturn2search7
  2. “How do you avoid full-table scans?” citeturn2search0
  3. “Why use ETS instead of a GenServer map?”

Hints in Layers

Hint 1: Starting Point Store expiry timestamps with each key.

Hint 2: Next Level Use a periodic sweep process.

Hint 3: Technical Details Pseudocode:

Every 5s:
  scan a slice of keys
  delete expired

Hint 4: Tools/Debugging Track table size and eviction counts over time.
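
A sketch of the sweep from Hint 3, assuming rows are stored as {key, value, expires_at}. The match specification deletes expired rows inside ETS without copying them out, though it still walks the whole table, so a very large cache would sweep in chunks instead:

defmodule Cache.Sweeper do
  use GenServer

  @table :cache
  @sweep_every 5_000

  def start_link(_), do: GenServer.start_link(__MODULE__, :ok, name: __MODULE__)

  @impl true
  def init(:ok) do
    :ets.new(@table, [:named_table, :set, :public, read_concurrency: true])
    Process.send_after(self(), :sweep, @sweep_every)
    {:ok, %{evictions: 0}}
  end

  def put(key, value, ttl_ms) do
    :ets.insert(@table, {key, value, System.monotonic_time(:millisecond) + ttl_ms})
    :ok
  end

  def get(key) do
    now = System.monotonic_time(:millisecond)

    case :ets.lookup(@table, key) do
      [{^key, value, expires_at}] when expires_at > now -> {:ok, value}
      _ -> :miss
    end
  end

  @impl true
  def handle_info(:sweep, state) do
    now = System.monotonic_time(:millisecond)
    # Match spec: delete rows whose third element (expires_at) is already in the past.
    deleted = :ets.select_delete(@table, [{{:_, :_, :"$1"}, [{:<, :"$1", now}], [true]}])
    Process.send_after(self(), :sweep, @sweep_every)
    {:noreply, %{state | evictions: state.evictions + deleted}}
  end
end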

Books That Will Help

Topic Book Chapter
ETS “Erlang and OTP in Action” ETS chapter

Common Pitfalls and Debugging

Problem 1: “Eviction freezes system”

  • Why: Full table scan in one process.
  • Fix: Incremental sweeps and batching.
  • Quick test: Load with 1M keys and verify stable latency.

Definition of Done

  • TTL eviction works
  • Table survives normal load
  • Stats report accurate counts
  • Owner process supervised

Project 9: Hot Code Upgrade Drill

  • File: P09-hot-code-upgrade-drill.md
  • Main Programming Language: Erlang
  • Alternative Programming Languages: Elixir
  • Coolness Level: Level 4 (See REFERENCE.md)
  • Business Potential: Level 3 (See REFERENCE.md)
  • Difficulty: Level 3 (See REFERENCE.md)
  • Knowledge Area: Release Engineering
  • Software or Tool: Release handling
  • Main Book: “Programming Erlang”

What you will build: A controlled upgrade scenario with state migration and rollback.

Why it teaches BEAM: Release handling is unique to OTP and critical for zero-downtime systems. citeturn1search0turn1search2

Core challenges you will face:

  • appup design -> Release Handling
  • State migration -> Code Change
  • Rollback -> Release Handling

Real World Outcome

$ reltool build
release built: v1 -> v2

$ reltool upgrade --apply
upgrade applied
state migrated: ok

$ reltool rollback
rollback complete

The Core Question You Are Answering

“How do I upgrade running systems without corrupting state?”

Concepts You Must Understand First

  1. Release handler citeturn1search0
  2. appup/relup instructions citeturn1search0

Questions to Guide Your Design

  1. Migration
    • What state changes need transformation?
  2. Rollback
    • How do you verify upgrade success before committing?

Thinking Exercise

State Evolution

Design a version change that adds a new field to state and requires a default.

The Interview Questions They Will Ask

  1. “What is an appup file?” citeturn1search0
  2. “Why do core apps require restart?” citeturn1search2
  3. “How do you handle rollback?”

Hints in Layers

Hint 1: Starting Point Create a minimal appup with restart_application.

Hint 2: Next Level Add a code_change step for state migration.

Hint 3: Technical Details Pseudocode:

code_change:
  add default field to state

Hint 4: Tools/Debugging Use release_handler logs to verify upgrade steps. citeturn1search0
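
In Elixir terms (Erlang uses the same code_change/3 callback), the state migration from Hint 3 might look like the sketch below; the :max_retries field is purely illustrative:

defmodule Upgradable.Server do
  use GenServer

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)

  @impl true
  def init(_opts), do: {:ok, %{retries: 0}}

  # Called by the release handler while the process is suspended during an upgrade
  # (old_vsn is {:down, vsn} when rolling back).
  @impl true
  def code_change(_old_vsn, state, _extra) do
    # v2 adds a :max_retries field; give state carried over from v1 a default.
    {:ok, Map.put_new(state, :max_retries, 3)}
  end
end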

Books That Will Help

Topic Book Chapter
Upgrades “Programming Erlang” Code upgrade chapter

Common Pitfalls and Debugging

Problem 1: “Upgrade fails at runtime”

  • Why: Missing appup instructions.
  • Fix: Add explicit upgrade steps and test on a staging node.
  • Quick test: Run upgrade in a controlled environment.

Definition of Done

  • Upgrade runs without downtime
  • State migration works
  • Rollback works
  • Logs capture each step

Project 10: Telemetry Pipeline + Live Metrics

  • File: P10-telemetry-live-metrics.md
  • Main Programming Language: Elixir
  • Alternative Programming Languages: Erlang
  • Coolness Level: Level 4 (See REFERENCE.md)
  • Business Potential: Level 3 (See REFERENCE.md)
  • Difficulty: Level 3 (See REFERENCE.md)
  • Knowledge Area: Observability
  • Software or Tool: Telemetry + LiveView
  • Main Book: “Elixir in Action”

What you will build: A telemetry pipeline that collects metrics and streams them to a LiveView dashboard.

Why it teaches BEAM: It combines scheduling, backpressure, and LiveView updates in one system. citeturn3search0turn4search2

Core challenges you will face:

  • Event aggregation -> Scheduler + GC
  • Backpressure -> GenStage
  • Live UI updates -> LiveView

Real World Outcome

The UI shows:

  • A rolling chart of request latency
  • A table of top message queue sizes
  • A live count of process restarts

CLI validation:

$ metricsctl status
events/sec=1200 queue=low live_clients=8

The Core Question You Are Answering

“How do I observe a BEAM system in real time without overwhelming it?”

Concepts You Must Understand First

  1. Backpressure citeturn4search2
  2. LiveView rendering citeturn3search0

Questions to Guide Your Design

  1. Sampling
    • How often do you emit events?
  2. Aggregation
    • Do you batch or stream raw events?

Thinking Exercise

Noise Control

Decide which metrics are critical and which can be sampled.

The Interview Questions They Will Ask

  1. “Why is backpressure necessary for telemetry?” citeturn4search2
  2. “How does LiveView reduce bandwidth?” citeturn3search0
  3. “What metrics reveal mailbox pressure?”

Hints in Layers

Hint 1: Starting Point Emit simple counters and render them.

Hint 2: Next Level Add a GenStage pipeline for aggregation.

Hint 3: Technical Details Pseudocode:

collect -> aggregate -> publish -> LiveView

Hint 4: Tools/Debugging Check event rates and backlog in your logs.
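
A sketch of both halves using the :telemetry library; the event name, measurements, and the Aggregator.record/2 hand-off (imagined here as the Project 5 producer) are assumptions:

defmodule MetricsPipeline.Handler do
  # Telemetry handlers run in the process that emitted the event,
  # so hand off quickly instead of doing heavy work here.
  def handle_event([:metrics, :request, :stop], measurements, metadata, _config) do
    Aggregator.record(measurements.duration_ms, metadata.route)
  end

  def attach do
    :telemetry.attach(
      "metrics-aggregator",            # unique handler id
      [:metrics, :request, :stop],     # event name to listen for
      &__MODULE__.handle_event/4,
      nil
    )
  end
end

# Emitting side, somewhere in the instrumented code path:
:telemetry.execute([:metrics, :request, :stop], %{duration_ms: 42}, %{route: "/health"})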

Books That Will Help

Topic Book Chapter
Pipelines “Elixir in Action” Processes/GenStage

Common Pitfalls and Debugging

Problem 1: “Dashboard lags under load”

  • Why: Too many events pushed to LiveView.
  • Fix: Aggregate before publishing.
  • Quick test: Increase load and observe latency trend.

Definition of Done

  • Metrics update in real time
  • Pipeline handles burst without backlog
  • Dashboard remains responsive
  • Metrics are accurate and documented

Project Comparison Table

Project Difficulty Time Depth of Understanding Fun Factor
1. Chat System Level 2 Weekend+ High ★★★★☆
2. Rate Limiter Level 2 Weekend+ High ★★★☆☆
3. Distributed KV Level 3 2-3 weeks Very High ★★★★☆
4. LiveView Dashboard Level 2 Weekend+ High ★★★★☆
5. GenStage Pipeline Level 3 2-3 weeks Very High ★★★★☆
6. Fault Harness Level 2 Weekend+ High ★★★☆☆
7. Presence Service Level 3 2-3 weeks Very High ★★★★☆
8. ETS Cache Level 2 Weekend+ High ★★★☆☆
9. Upgrade Drill Level 3 2-3 weeks Very High ★★★★☆
10. Telemetry Pipeline Level 3 2-3 weeks Very High ★★★★☆

Recommendation

  • If you are new to BEAM: Start with Project 1 to internalize process isolation and supervision.
  • If you are a web developer: Start with Project 4 to see LiveView’s real-time model in action.
  • If you want distributed systems mastery: Focus on Projects 3 and 7.

Final Overall Project: Fault-Tolerant Real-Time Platform

The Goal: Combine Projects 1, 3, 4, and 10 into a single platform with chat, presence, metrics, and distributed storage.

  1. Use the chat system as the real-time core.
  2. Store presence and sessions in the distributed KV store.
  3. Render operational metrics via LiveView.
  4. Add telemetry with backpressure to keep the system stable under load.

Success Criteria: The platform keeps running under simulated crashes, recovers via supervision, and shows live metrics during recovery.

From Learning to Production: What Is Next

Your Project Production Equivalent Gap to Fill
Project 1 Phoenix Presence + Channels Auth, persistence, scaling
Project 3 Riak / Mnesia-based store Replication, partition handling
Project 4 LiveView ops dashboards Auth, multi-tenant UI
Project 9 Release handling pipelines CI/CD automation

Summary

This learning path covers BEAM systems through 10 hands-on projects.

# Project Name Main Language Difficulty Time Estimate
1 Supervised Chat System Elixir Level 2 10-15 hrs
2 Rate Limiter + Circuit Breaker Elixir Level 2 12-18 hrs
3 Distributed KV Store Erlang/Elixir Level 3 20-30 hrs
4 LiveView Dashboard Elixir Level 2 12-20 hrs
5 GenStage Backpressure Pipeline Elixir Level 3 18-25 hrs
6 Fault Injection Harness Elixir Level 2 10-15 hrs
7 Presence Service Erlang/Elixir Level 3 20-30 hrs
8 ETS Cache Service Erlang/Elixir Level 2 10-15 hrs
9 Hot Code Upgrade Drill Erlang Level 3 15-25 hrs
10 Telemetry Pipeline Elixir Level 3 15-25 hrs

Expected Outcomes

  • Design supervision trees that isolate failures
  • Build distributed BEAM services with clear failure semantics
  • Implement backpressure pipelines and real-time dashboards

Additional Resources and References

Standards and Specifications

  • OTP Supervision Principles. citeturn0search5
  • Distributed Erlang documentation. citeturn5search4
  • Release Handling documentation. citeturn1search0

Industry Analysis

  • Discord on scaling Elixir to 5,000,000 concurrent users (2017). citeturn4search0

Books

  • “Programming Erlang” by Joe Armstrong - the classic reference
  • “Elixir in Action” by Sasa Juric - practical OTP patterns
  • “Designing for Scalability with Erlang/OTP” by Cesarini/Thompson - supervision and reliability