Sprint: BEAM (Erlang/Elixir) Mastery - Real World Projects
Goal: Build a first-principles mental model of how BEAM languages (Erlang, Elixir, Gleam) achieve fault tolerance and massive concurrency through isolated lightweight processes, message passing, and supervision trees. You will internalize OTP behaviors (GenServer, Supervisor, Application), distribution semantics, and the mechanics of scheduling, memory, and code upgrades. By building production-style services such as real-time dashboards, rate limiters, distributed stores, and backpressure pipelines, you will learn when BEAM is the right tool and how to design systems that self-heal under failure.
Introduction
BEAM is the virtual machine and runtime system behind Erlang and Elixir. Its design assumes that failures are normal and that concurrency should be cheap, isolated, and easy to supervise. This guide teaches you to reason about BEAM systems as networks of small processes coordinated by supervisors, not as monolithic servers protected by locks.
What problem does it solve today?
- It makes it practical to build fault-tolerant systems by isolating failures and restarting components automatically.
- It enables huge numbers of concurrent activities without shared-memory locks, using message passing as the default.
- It supports distribution (multiple nodes) with transparent message passing and links/monitors across nodes.
What you will build across the projects:
- A supervised real-time chat system
- A rate limiter and circuit breaker service
- A distributed key-value store
- A Phoenix LiveView dashboard
- A GenStage backpressure pipeline
- A fault-injection harness to test supervision strategies
- A hot-code-upgrade drill with release handling artifacts
- A presence and notification service (Discord-style)
- An ETS-powered session/cache service
- A telemetry and observability pipeline
In scope:
- BEAM processes, message passing, and isolation
- OTP behaviors (GenServer, Supervisor, Application)
- Supervision strategies and fault recovery
- Scheduling, reductions, and per-process GC
- Distribution, nodes, links, and monitors
- ETS/Mnesia storage concepts
- Hot code upgrades and release handling
- Real-time web with Phoenix LiveView and backpressure with GenStage
Out of scope:
- Low-level VM internals beyond behavioral guarantees
- Full SIP/Telco protocol implementations
- Custom NIFs beyond conceptual mentions
Big-Picture ASCII Diagram
CLIENTS     ->  ROUTERS    ->  BEAM RUNTIME       ->  SUPERVISION TREE  ->  WORKERS
   |               |                |                       |                  |
   v               v                v                       v                  v
WebSocket       API Gate       Process Scheduler       Supervisor         GenServer
HTTP            Rate Lim       Message Passing         Strategies         ETS Cache

Failures -> local process crash -> supervisor restart -> system stays up
How to Use This Guide
- Read the Theory Primer first to build the mental model (do not skip it).
- Build projects in order for your first pass; later, jump by learning path.
- After each project, verify behavior using the exact outputs in the Definition of Done.
- Keep a failure log: for each crash you cause, write what recovered it and why.
Prerequisites & Background Knowledge
Essential Prerequisites (Must Have)
- Programming fundamentals: functions, recursion, pattern matching or conditional logic
- Basic concurrency concepts: what a process/thread is, what a message is
- Command-line basics: running commands, reading logs
- Recommended Reading: “Operating Systems: Three Easy Pieces” - Concurrency chapters
Helpful But Not Required
- Basic networking (TCP/HTTP) for real-time projects
- Database basics for distributed KV and dashboards
- UI basics for LiveView (HTML/CSS)
Self-Assessment Questions
- Can you explain the difference between a process and a thread?
- Can you describe a queue and why ordering matters?
- Can you reason about what happens if one component in a system crashes?
- Can you read and interpret a simple log stream?
Development Environment Setup
Required Tools:
- Erlang/OTP (latest stable)
- Elixir (latest stable)
- Mix (Elixir build tool)
Recommended Tools:
- observer or :observer for runtime inspection
- recon or telemetry libraries for instrumentation
- PostgreSQL for the LiveView dashboard project
Testing Your Setup:
$ elixir --version
Erlang/OTP: <version>
Elixir: <version>
$ erl -eval 'erlang:display(erlang:system_info(otp_release)), halt().' -noshell
"<otp_release>"
Time Investment
- Simple projects: 4-8 hours each
- Moderate projects: 10-20 hours each
- Complex projects: 20-40 hours each
- Total sprint: 2-4 months
Important Reality Check BEAM mastery is about system behavior under failure, not just syntax. If your system never crashes during these projects, you are not pushing it hard enough. The goal is to learn how to design for recovery, not how to avoid every error.
Big Picture / Mental Model
Think of BEAM systems as a forest of tiny, isolated processes supervised by a hierarchy that enforces recovery policies. Message passing is the only coordination mechanism, and the scheduler guarantees fairness by preempting work.
+-------------------+ +-------------------+
| Supervisor A | | Supervisor B |
+---------+---------+ +---------+---------+
| |
+---------+---------+ +---------+---------+
| Worker 1 (GS) | | Worker 3 (GS) |
+---------+---------+ +---------+---------+
| |
+---------+---------+ +---------+---------+
| Worker 2 (ETS) | | Worker 4 (Stage) |
+-------------------+ +-------------------+
If Worker 2 crashes -> Supervisor A restarts only Worker 2 (one_for_one).
Theory Primer
This primer is a mini-book. Each concept below maps directly to multiple projects.
Concept 1: BEAM Processes and Actor Isolation
Fundamentals BEAM processes are lightweight, isolated units of execution with their own heap and mailbox. They are not OS threads; they are managed by the BEAM runtime and are designed to be created in very large numbers. Messages between processes are copied by default, which eliminates shared-memory data races and makes isolation a first-class property. This isolation allows failures to be contained: if one process crashes, others remain unaffected unless they are explicitly linked or monitored. The actor model in BEAM is therefore not just a design pattern but a runtime guarantee.
Deep Dive In BEAM, every process is a self-contained actor with three core pieces: mailbox, state (heap), and behavior (message handlers). Unlike thread-based concurrency models that rely on shared memory and locks, BEAM uses message passing, which drastically reduces the complexity of reasoning about concurrency. The price you pay is message copying, but the reward is deterministic isolation: no process can mutate another process’s memory, and no data race is possible by construction.
This is not a theoretical statement. The efficiency guide documents that a newly spawned Erlang process uses a small, fixed amount of memory, with a conservative initial heap size so that systems can run hundreds of thousands or millions of processes. The runtime expands and shrinks heaps as needed, which means memory use is proportional to work, not to worst-case allocation. This supports a system design style where you spawn a new process per connection, per task, or per workflow step without worrying about OS thread exhaustion.
Message passing itself has important semantics. Messages are copied between heaps; this creates a clear ownership boundary and avoids aliasing bugs. When messages cross nodes, they are encoded into the external term format and transported over TCP, then decoded on the receiving node. This means distribution is conceptually the same as local message passing, but with explicit latency, serialization, and security considerations.
The mailbox is FIFO, but selective receive can reorder processing because the runtime scans the mailbox for a matching pattern. This is a subtle performance and correctness issue: if you write a receive clause that matches a rare pattern, the runtime may scan a long mailbox for every receive, adding overhead. Message copying and mailbox scanning together influence the architecture: you typically use tagged messages and short queues, or you split responsibilities across processes to keep mailboxes small.
A BEAM process can be linked or monitored. Links are bidirectional and propagate exits; monitors are unidirectional and deliver a DOWN message. These primitives allow you to build fault detection and cascading recovery policies. Supervisors use these to restart crashed workers. When you design your own process tree, you must decide which failures should be isolated and which should propagate upward.
The key design mindset is that processes are cheap and disposable. Instead of writing complex defensive code for every edge case, you let the process crash and rely on supervision to recover. This is not recklessness; it is a deliberate architecture that trades local complexity for system-level stability.
How this fits into the projects: You will use this concept in every project, especially the chat system, rate limiter, and distributed store.
Definitions & key terms
- Process: A BEAM-managed actor with its own heap and mailbox.
- Mailbox: FIFO queue of incoming messages.
- Message passing: Communication by sending immutable messages between processes.
- Link/Monitor: Failure propagation and observation primitives.
Mental model diagram
[Process A] --send--> [Mailbox B] -> [Receive Loop] -> [State Update]
   (copy)              (queue)       (pattern match)
How it works (step-by-step, with invariants and failure modes)
- A process sends a message to another process.
- The message is copied into the receiver’s mailbox.
- The receiver scans for a matching receive clause.
- On match, the process updates its local state.
- Invariant: no process can mutate another’s memory.
- Failure modes: mailbox buildup, selective receive overhead, unhandled messages.
Minimal concrete example
Process Counter:
- State: count
- On message {inc}: count = count + 1
- On message {get, reply_to}: send {count, value} to reply_to
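A minimal Elixir sketch of this counter, using raw spawn/receive (module and message names are illustrative):

```elixir
defmodule Counter do
  # Spawn an isolated process whose only state is the loop argument.
  def start, do: spawn(fn -> loop(0) end)

  defp loop(count) do
    receive do
      {:inc} ->
        loop(count + 1)

      {:get, reply_to} ->
        send(reply_to, {:count, count})
        loop(count)
    end
  end
end

# Usage:
#   pid = Counter.start()
#   send(pid, {:inc})
#   send(pid, {:get, self()})
#   receive do {:count, n} -> n end  #=> 1
```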
Common misconceptions
- “Processes are threads.” They are lighter and runtime-managed.
- “Messages are references.” They are copied (with refc binary exceptions).
Check-your-understanding questions
- Why does message copying prevent data races?
- What happens when a mailbox grows very large?
- How does selective receive affect performance?
Check-your-understanding answers
- Each process owns its memory; no shared mutation is possible.
- Receive scans become expensive; latency grows.
- The runtime may scan many messages to find a match.
Real-world applications
- Connection-per-process servers
- Fault-isolated background jobs
- Concurrent pipelines with message passing
Where you’ll apply it
- Project 1 (Chat System)
- Project 2 (Rate Limiter)
- Project 3 (Distributed KV)
- Project 6 (Fault Injection Harness)
References
- Erlang Efficiency Guide: Processes.
- Erlang Efficiency Guide: Process Messages.
- OTP Design Principles: Supervision Trees.
Key insights Isolation + message passing is the foundation of BEAM reliability.
Summary BEAM processes are lightweight, isolated actors with mailbox-driven concurrency. This model avoids shared-memory hazards and makes failure recovery a system design concern rather than a local coding burden.
Homework/Exercises to practice the concept
- Draw a message flow between three processes for a request/response cycle.
- Explain why selective receive can slow a busy process.
Solutions to the homework/exercises
- The sender posts a message; the receiver replies; the sender processes the reply.
- The mailbox must be scanned for the matching pattern each time.
Concept 2: OTP Behaviors and Supervision Trees
Fundamentals OTP behaviors are standardized patterns for long-running processes (GenServer, Supervisor, Application). They provide uniform lifecycle, error handling, and integration with supervision. A supervision tree is a hierarchy of supervisors and workers where supervisors monitor and restart children according to a defined strategy. This structure is the practical implementation of the “let it crash” philosophy: local failures are expected and recovered by a supervising process rather than by complex defensive code.
Deep Dive OTP behaviors exist because many processes in BEAM systems follow the same lifecycle: initialize state, receive messages, handle calls/casts, and terminate cleanly. GenServer formalizes this cycle, providing callbacks for initialization, synchronous calls, asynchronous casts, and miscellaneous messages. The benefit is not convenience alone; it is interoperability. A GenServer can be supervised, introspected, and upgraded consistently across the system.
Supervision trees are the backbone of fault tolerance. The design principles describe supervisors as processes that monitor workers and restart them when they fail. Supervisors apply strategies such as one_for_one (restart only the failed child), one_for_all (restart all children), or rest_for_one (restart the failed child and those started after it). The strategy is a semantic decision about dependency: if workers are independent, one_for_one is safer; if they depend on shared state, one_for_all may be appropriate.
The restart intensity and period parameters provide circuit-breaker behavior: too many restarts in a short time can force a supervisor to give up, which then escalates the failure up the tree. This is the mechanism that prevents infinite restart loops and signals that a deeper issue exists. The tree is not a simple retry mechanism; it is a controlled failure policy.
OTP behaviors also encode system messages: a GenServer automatically handles system-level calls such as code upgrades or state inspection. This is why you should not write your own manual receive loops for long-lived services unless you need custom semantics. By using behaviors, you gain the runtime’s built-in tooling, introspection, and fault handling.
Another critical aspect is naming and registration. GenServers and supervisors can be registered locally or globally, which affects discoverability and distribution. Naming semantics are shared between GenServer and Supervisor in Elixir. If you register globally, you must handle network partitions and name conflicts; if you register locally, you need a discovery mechanism. These are architectural trade-offs that become explicit in distributed projects.
The “let it crash” philosophy only works when supervision trees are designed with intent. You must classify which failures are recoverable locally and which should cascade. For example, a crashed cache worker should be restarted; a corrupted database connection might need escalation to shut down the service cleanly. Supervision is therefore not just restart logic; it is policy design.
How this fits into the projects: Projects 1, 2, 3, 5, and 6 are rooted in OTP behaviors and supervision decisions.
Definitions & key terms
- OTP behavior: A standardized process pattern (GenServer, Supervisor).
- Supervisor: A process that starts, monitors, and restarts children.
- Supervision tree: Hierarchical arrangement of an application into supervisors and workers.
- Restart strategy: Policy for handling child failures.
Mental model diagram
Supervisor
|-- Worker A (GenServer)
|-- Worker B (GenServer)
|-- Supervisor C
    |-- Worker C1
How it works (step-by-step, with invariants and failure modes)
- Supervisor starts children in defined order.
- Children run their workloads.
- On failure, supervisor applies its strategy.
- If restart intensity is exceeded, supervisor terminates and escalates.
- Invariant: supervisors remain responsible for child lifecycle.
- Failure modes: incorrect strategy choice, restart loops, missing cleanup.
Minimal concrete example
Service Tree:
- Top supervisor
- DB worker (restart: transient)
- Cache worker (restart: permanent)
If DB worker fails repeatedly, supervisor escalates; cache restarts alone.
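A sketch of this tree as an Elixir Supervisor. MyApp.DBWorker and MyApp.CacheWorker are hypothetical GenServer modules; the restart values mirror the example:

```elixir
defmodule MyApp.TopSupervisor do
  use Supervisor

  def start_link(arg), do: Supervisor.start_link(__MODULE__, arg, name: __MODULE__)

  @impl true
  def init(_arg) do
    children = [
      # Transient: restarted only after an abnormal exit.
      Supervisor.child_spec(MyApp.DBWorker, restart: :transient),
      # Permanent: always restarted, independently of its sibling.
      Supervisor.child_spec(MyApp.CacheWorker, restart: :permanent)
    ]

    # one_for_one: a crash in one child restarts only that child.
    # max_restarts/max_seconds is the restart-intensity escalation limit.
    Supervisor.init(children, strategy: :one_for_one, max_restarts: 3, max_seconds: 5)
  end
end
```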
Common misconceptions
- “Supervision trees prevent all outages.” They limit failure blast radius, not eliminate outages.
- “One_for_all is always safer.” It is only safer when children are tightly coupled.
Check-your-understanding questions
- When would you choose one_for_one vs one_for_all?
- Why do supervisors restart children in reverse order on shutdown?
- What happens if restart intensity is exceeded?
Check-your-understanding answers
- One_for_one for independent children; one_for_all for tightly dependent ones.
- Supervisors terminate children in reverse start order.
- The supervisor terminates and failure escalates up the tree.
Real-world applications
- Web servers with resilient worker pools
- Background job systems
- Fault-tolerant caches and queues
Where you’ll apply it
- Project 1 (Chat System)
- Project 2 (Rate Limiter)
- Project 5 (GenStage Pipeline)
- Project 6 (Fault Injection Harness)
References
- OTP Supervision Principles.
- Erlang supervisor manual.
- Elixir GenServer documentation.
- Elixir Supervisor documentation.
Key insights Supervision is policy, not just restart logic.
Summary OTP behaviors standardize process structure and supervision trees enforce fault recovery policies. This is the core of BEAM reliability.
Homework/Exercises to practice the concept
- Sketch a tree for a web app with DB, cache, and worker pool.
- Decide which children should be permanent vs transient.
Solutions to the homework/exercises
- Separate supervisors for DB and workers; cache as a child of app supervisor.
- DB connection often transient; worker pool permanent.
Concept 3: Scheduling, Reductions, and Per-Process GC
Fundamentals BEAM uses preemptive scheduling based on reductions to ensure no single process can monopolize CPU time. This creates fairness across thousands of processes and enables soft real-time behavior. Each process has its own heap and garbage collector, so GC pauses are localized rather than global. These design choices make latency more predictable than in stop-the-world GC systems.
Deep Dive The BEAM scheduler is designed for fairness and responsiveness. Instead of allowing a process to run indefinitely, the scheduler counts reductions (units of work) and yields execution after a quota. This means CPU-bound tasks cannot starve I/O-bound tasks. The exact reduction count is an implementation detail, but the design guarantee is that preemption occurs regularly and cannot be disabled by user code.
Per-process garbage collection is equally important. Each process has a private heap; GC runs on that heap only. The efficiency guide documents that the heap grows as needed and can shrink under certain conditions. This prevents global pauses. If one process allocates too much memory and triggers GC, only that process is paused. The rest of the system keeps running. This is a key property for soft real-time systems where latency spikes must be bounded.
The efficiency guide also notes that you can control minimum heap size to reduce GC overhead for short-lived processes, but warns this is an optimization that requires careful measurement. This highlights a broader lesson: BEAM’s defaults are conservative and safe, but you can tune them when you understand your workload.
Scheduler fairness interacts with mailbox patterns. If a process is constantly handling messages, it might be scheduled frequently, while a CPU-bound process will be preempted. This is why BEAM systems prefer to split heavy computation into separate processes or use ports/NIFs for compute-intensive tasks. The scheduler model encourages concurrency-friendly, reactive workloads rather than long-running CPU loops.
Understanding these mechanics matters because they shape how you design services. For example, a GenStage pipeline assumes that work is distributed across many processes so backpressure can be applied. If you put all work in a single process, scheduler fairness cannot help you; the bottleneck remains. The right architecture is one that aligns with BEAM’s scheduling and GC model.
How this fits into the projects: Projects 2, 5, and 10 rely on predictable scheduling and GC behavior.
Definitions & key terms
- Reduction: A unit of work used for scheduling fairness.
- Per-process GC: Garbage collection scoped to a single process.
- Soft real-time: Systems with bounded but not hard deterministic latency.
Mental model diagram
Scheduler
-> run P1 for N reductions
-> run P2 for N reductions
-> run P3 for N reductions
GC runs inside each process heap, not globally.
How it works (step-by-step, with invariants and failure modes)
- Scheduler picks runnable processes.
- Runs each for a fixed reduction budget.
- Preempts and moves to the next.
- GC occurs when a process heap threshold is reached.
- Invariant: no global stop-the-world pauses.
- Failure modes: CPU-bound single process, excessive allocations, mailbox overflow.
Minimal concrete example
Pipeline:
- Producer process emits events
- Multiple worker processes handle tasks
- Each worker yields to scheduler periodically
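You can watch the scheduler's reduction accounting directly with Process.info/2. A small sketch; the busy loop is just an illustrative CPU-bound workload:

```elixir
defmodule Busy do
  # Endless CPU-bound loop; the scheduler preempts it after each
  # reduction budget, so the shell stays responsive while it runs.
  def loop do
    :erlang.phash2(:work)
    loop()
  end
end

pid = spawn(&Busy.loop/0)
{:reductions, r1} = Process.info(pid, :reductions)
Process.sleep(50)
{:reductions, r2} = Process.info(pid, :reductions)
Process.exit(pid, :kill)
IO.puts("reductions consumed in ~50ms: #{r2 - r1}")
```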
Common misconceptions
- “BEAM is always fast.” It is fair and responsive, but CPU-heavy tasks can still bottleneck.
- “GC is global.” It is per-process.
Check-your-understanding questions
- Why does per-process GC reduce latency spikes?
- How does reduction-based scheduling prevent starvation?
- What happens if one process receives all work?
Check-your-understanding answers
- Only the allocating process pauses; others continue.
- Processes are preempted after a fixed budget.
- It becomes the bottleneck regardless of scheduler fairness.
Real-world applications
- High-concurrency web services
- Stream processing pipelines
- Real-time dashboards
Where you’ll apply it
- Project 2 (Rate Limiter)
- Project 5 (GenStage Pipeline)
- Project 10 (Telemetry Pipeline)
References
- Erlang Efficiency Guide: Processes and heap size.
- Erlang Efficiency Guide: Process Messages.
Key insights Fair scheduling and localized GC are core to BEAM responsiveness.
Summary BEAM schedules processes fairly and performs GC locally, enabling predictable latency under heavy concurrency when workloads are well-distributed.
Homework/Exercises to practice the concept
- Explain why a single CPU-bound process can hurt system responsiveness.
- Draw a process graph that avoids a single bottleneck.
Solutions to the homework/exercises
- The scheduler can preempt but the workload remains centralized.
- Use multiple workers with a supervisor and load distribution.
Concept 4: Distribution, Nodes, and Fault Boundaries
Fundamentals A distributed Erlang system consists of multiple runtime nodes that communicate over TCP/IP. Message passing between processes at different nodes, as well as links and monitors, are transparent when pids are used. Registered names, however, are local to each node. This means distribution feels like local message passing but introduces network latency, partial failure, and security concerns. Distributed nodes must be explicitly named, and secure distribution requires TLS configuration.
Deep Dive Distribution is one of BEAM’s strongest differentiators. When you connect nodes, you gain transparent messaging, links, and monitors across machines. The same primitives used for local fault detection can therefore be used across a cluster. This simplifies distributed design because you do not need a separate protocol for failure detection; the runtime already provides it.
However, transparency does not remove the realities of distributed systems. Messages must be serialized into an external term format and transmitted over TCP; this adds latency and increases the cost of large messages. The system can still experience partitions or node failures. When a node goes down, linked or monitored processes receive exit or DOWN messages, which you must handle in supervision logic. This is how BEAM represents network failure in the same language as process failure.
Naming is another critical issue. Pids are globally unique within a distributed system, but registered names are node-local. If you want to address a named process on another node, you must include the node name. This drives design decisions about service discovery: do you rely on global registries, or do you maintain a routing layer? These are not just configuration choices; they are part of your failure model.
Security is also explicit. The distribution documentation warns that starting a node without TLS (inet_tls) exposes it to attacks that may give complete access to the node and the cluster. This means secure distribution is not optional in production. You must choose cookies, TLS, and network isolation deliberately.
The distributed model lends itself to certain architectures: sharded state per node, data locality via process placement, and supervision trees that span nodes. But you must also design for partition tolerance. If you choose to register a single global name, what happens during a split-brain? Your projects will force you to answer these questions by building a distributed KV store and a presence service that handles node loss gracefully.
How this fits into the projects: Projects 3, 7, and 9 rely on distributed Erlang semantics and failure handling.
Definitions & key terms
- Node: A named Erlang runtime in a distributed system.
- Distribution: Transparent messaging across nodes.
- Registered name: A local alias for a pid on a node.
- Partition: Loss of connectivity between nodes.
Mental model diagram
Node A (name@host) <--- TCP ---> Node B (name@host)
| |
Pid A Pid B
send(Pid B, Msg) works like local message send
How it works (step-by-step, with invariants and failure modes)
- Start nodes with explicit names.
- Connect nodes; establish distribution channel.
- Send messages using pids across nodes.
- Monitor remote pids for failure.
- Invariant: messaging semantics are the same locally and remotely.
- Failure modes: network partitions, node crashes, name collisions, insecure distribution.
Minimal concrete example
Cluster messaging:
- Node A sends {ping, t} to Pid on Node B
- Node B replies {pong, t}
- If Node B disconnects, Node A receives DOWN
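A sketch of this exchange using distributed Elixir primitives. The node name :b@host is an assumption; start both nodes yourself (for example with iex --sname a and iex --sname b) and adjust the name:

```elixir
remote = :b@host

true = Node.connect(remote)         # establish the distribution channel
:erlang.monitor_node(remote, true)  # request {:nodedown, node} messages

# Spawn an echo process on the remote node and message it by pid,
# exactly as you would a local process.
pid =
  Node.spawn(remote, fn ->
    receive do
      {:ping, from, t} -> send(from, {:pong, t})
    end
  end)

send(pid, {:ping, self(), :erlang.monotonic_time()})

receive do
  {:pong, t} -> IO.puts("pong received (sent at #{t})")
  {:nodedown, ^remote} -> IO.puts("node went down before replying")
end
```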
Common misconceptions
- “Distributed Erlang hides all network issues.” It hides APIs, not failures.
- “Registered names are global.” They are local per node.
Check-your-understanding questions
- What must be included when sending to a registered name on another node?
- Why is TLS important for distribution?
- How do you detect that a remote node is down?
Check-your-understanding answers
- You must include the node name because registrations are local.
- Without TLS the node may be fully compromised.
- Links or monitors deliver exit/DOWN messages.
Real-world applications
- Distributed caches
- Presence systems and chat backends
- Clustered job processors
Where you’ll apply it
- Project 3 (Distributed KV Store)
- Project 7 (Presence and Notification Service)
- Project 9 (Hot Code Upgrade Drill)
References
- Distributed Erlang documentation.
- External Term Format documentation.
Key insights Distribution is transparent in API, not in failure semantics.
Summary Distributed Erlang makes cross-node messaging feel local while preserving the realities of latency and failure. You must design supervision and naming with these realities in mind.
Homework/Exercises to practice the concept
- Sketch a two-node cluster and describe how to detect node failure.
- Describe a naming strategy for a distributed service.
Solutions to the homework/exercises
- Use monitors on remote pids and handle DOWN messages.
- Use a local registry plus a routing process per node.
Concept 5: State Management with ETS and Mnesia
Fundamentals BEAM systems often store mutable state inside processes, but ETS (Erlang Term Storage) provides in-memory tables with efficient access for larger shared datasets. ETS tables are dynamic tables created by a process, and they support multiple table types (set, ordered_set, bag, duplicate_bag). Mnesia builds on these concepts to provide a distributed database with transactions and replication, but it inherits ETS-like access patterns.
Deep Dive In BEAM, mutable state is typically owned by a single process, which serializes access. This is ideal for small, strongly encapsulated state. For large datasets or shared lookups, ETS is the standard tool. ETS provides constant-time access for sets and logarithmic access for ordered sets, making it useful for caches, registries, and session tables.
ETS tables are created by a process and are destroyed when that process terminates. This is an important lifecycle property: your table’s availability depends on the owner process. If you put ETS in a supervisor-managed owner process, the table can be recreated on restart, but you must think about persistence and recovery. For state that cannot be lost, you either persist to disk or use Mnesia, which provides transactions and replication across nodes.
ETS is not a database; it is a high-performance in-memory store optimized for BEAM use. It supports match and select operations, but these can be expensive and may require scanning the whole table. The tables-and-databases guide explicitly warns that select/match can become expensive and recommends structuring data to minimize full scans. This is a key design constraint: ETS is fast for key-based access, but you must design your keys and queries intentionally.
Mnesia adds transactions and distribution. Its API resembles ETS for basic operations, but it supports replication and transactions across nodes. This makes it attractive for distributed KV stores, but it also introduces complexity: partitions, schema management, and recovery. For learning projects, you can simulate a Mnesia-like log with ETS plus append-only persistence to understand the trade-offs.
In OTP systems, ETS often sits behind a GenServer or GenStage pipeline. The process serializes writes and defines a clear API, while ETS provides fast reads. This pattern gives you the speed of shared memory with the safety of controlled access. It is not magic; you still must handle concurrency, but you do so at the process boundary rather than with locks.
How this fits into the projects: Projects 2, 3, and 8 depend on ETS-based storage and state management.
Definitions & key terms
- ETS: Built-in term storage with fast access.
- Table owner: The process that created an ETS table.
- Select/match: Query operations that may scan tables.
- Mnesia: Distributed DB built on Erlang concepts.
Mental model diagram
[Client] -> [GenServer] -> [ETS Table]
| (fast lookup)
v
[State]
How it works (step-by-step, with invariants and failure modes)
- A process creates an ETS table.
- Clients access via API or direct table reads.
- Data is inserted, updated, looked up by key.
- If owner dies, the table is destroyed.
- Invariant: ETS access is efficient for key-based reads.
- Failure modes: table loss on crash, expensive scans, inconsistent writes.
Minimal concrete example
Session Cache:
- Key: session_id
- Value: user_id, expiry
- Read path: lookup by session_id
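A sketch of this cache with raw ETS calls. The table and key names are illustrative; in a real service a supervised process would own the table so it can be rebuilt on restart:

```elixir
# :set gives constant-time lookup by key; :public allows reads from
# any process (writes should still go through one owner in practice).
table = :ets.new(:sessions, [:set, :public, read_concurrency: true])

expiry = System.system_time(:second) + 3600
:ets.insert(table, {"sess-abc123", {42, expiry}})

case :ets.lookup(table, "sess-abc123") do
  [{_sid, {user_id, exp}}] ->
    if exp > System.system_time(:second), do: {:ok, user_id}, else: :expired

  [] ->
    :not_found
end
```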
Common misconceptions
- “ETS is durable.” It is in-memory; the table is destroyed on owner crash.
- “ETS is a full database.” It is optimized for key access, not complex queries.
Check-your-understanding questions
- What happens to an ETS table when its owner crashes?
- Why are select/match operations potentially expensive?
- When would you choose Mnesia over ETS?
Check-your-understanding answers
- The table is destroyed with its owner process.
- They may scan the entire table.
- When you need replication or transactions across nodes.
Real-world applications
- Session stores and caches
- Local registries
- Distributed KV stores
Where you’ll apply it
- Project 2 (Rate Limiter)
- Project 3 (Distributed KV Store)
- Project 8 (ETS Cache Service)
References
- ETS documentation.
- Tables and Databases guide.
Key insights ETS gives you fast shared data when you respect its lifecycle and query limits.
Summary ETS is a high-performance in-memory table system; Mnesia adds distribution and transactions. Both require careful lifecycle and query design.
Homework/Exercises to practice the concept
- Design a table schema for a rate limiter.
- Identify which operations require full-table scans and avoid them.
Solutions to the homework/exercises
- Key by {user_id, window} with count as value.
- Use key lookups; avoid scan-based queries.
Concept 6: Real-Time Web and Backpressure (LiveView + GenStage)
Fundamentals Phoenix LiveView enables rich, real-time user experiences with server-rendered HTML, diff tracking, and WebSocket-based updates. A persistent connection is established between client and server, which reduces work per request and allows faster reactions to user events. GenStage provides backpressure-aware pipelines where consumers explicitly demand events and producers never send more than requested. Together, they demonstrate how BEAM’s process model supports real-time systems without heavy client-side code or external queues.
Deep Dive LiveView works by rendering HTML on the server, sending it to the client initially as a static page, then maintaining a persistent connection that streams updates. This design means the client does not need to own application state; instead, the server is authoritative. The runtime diffs state changes and sends only the minimal updates. This is powerful for dashboards, collaborative tools, and real-time monitoring where consistency matters more than offline capability.
GenStage addresses the other side of real-time systems: throughput and backpressure. In a naive system, producers can overwhelm consumers, leading to mailbox growth and latency spikes. GenStage lets consumers explicitly demand a specific number of events, ensuring that producers never outpace the system’s capacity. This is not a library convenience; it is a concurrency contract that aligns with BEAM’s message-passing model.
The Discord engineering blog describes how Elixir was used for a highly concurrent real-time system, reporting nearly five million concurrent users and millions of events per second by July 2017. This scale is only achievable when flow control is a first-class concern. You cannot rely on infinite queues or manual throttling. The runtime must treat demand as a signal that shapes how work moves through the system.
LiveView and GenStage are often combined. A real-time dashboard can subscribe to a GenStage pipeline that feeds events; the LiveView process receives messages and updates its state, and the runtime only sends changed parts of the UI. This is a BEAM-native approach to real-time systems: concurrency, backpressure, and UI updates all in one runtime.
How this fits into the projects: Projects 4, 5, and 10 are built on these concepts.
Definitions & key terms
- LiveView: Server-rendered, real-time UI with diff updates.
- Persistent connection: Long-lived channel between client and server.
- Backpressure: Consumer-controlled demand to prevent overload.
- Producer/Consumer: Components in a data pipeline.
Mental model diagram
Events -> GenStage Producer -> GenStage Consumer -> LiveView -> Browser
(demand-based flow) (bounded) (diff updates)
How it works (step-by-step, with invariants and failure modes)
- Producers emit events only when consumers demand them.
- Consumers process events and update state.
- LiveView diffs state and pushes UI updates.
- Invariant: demand controls flow; UI updates are incremental.
- Failure modes: unbounded demand, heavy LiveView processes, burst storms.
Minimal concrete example
Real-time chart:
- Producer emits metrics
- Consumer aggregates
- LiveView updates chart every second
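A minimal GenStage pair matching this shape. It assumes {:gen_stage, "~> 1.2"} in your mix deps; module names and the demand limit are illustrative:

```elixir
defmodule Metrics.Producer do
  use GenStage

  def start_link(_), do: GenStage.start_link(__MODULE__, 0, name: __MODULE__)

  @impl true
  def init(counter), do: {:producer, counter}

  # Emit exactly as many events as were demanded, never more.
  @impl true
  def handle_demand(demand, counter) when demand > 0 do
    events = Enum.to_list(counter..(counter + demand - 1))
    {:noreply, events, counter + demand}
  end
end

defmodule Metrics.Consumer do
  use GenStage

  def start_link(_), do: GenStage.start_link(__MODULE__, :ok)

  # max_demand bounds how many events can be in flight at once.
  @impl true
  def init(:ok), do: {:consumer, :ok, subscribe_to: [{Metrics.Producer, max_demand: 10}]}

  @impl true
  def handle_events(events, _from, state) do
    IO.inspect(events, label: "batch")
    {:noreply, [], state}
  end
end
```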
Common misconceptions
- “LiveView is just websockets.” It is server-rendered with diff updates.
- “Backpressure is optional.” Without it, queues grow until failure.
Check-your-understanding questions
- Why is demand-driven flow safer than unbounded queues?
- What does LiveView send after the initial render?
- How would you prevent a slow consumer from collapsing a pipeline?
Check-your-understanding answers
- It bounds work and prevents overload.
- Only the diffs (changed parts) are sent.
- Limit demand and use backpressure-aware stages.
Real-world applications
- Monitoring dashboards
- Chat systems and activity feeds
- Burst-resistant pipelines
Where you’ll apply it
- Project 4 (LiveView Dashboard)
- Project 5 (GenStage Pipeline)
- Project 10 (Telemetry Pipeline)
References
- Phoenix LiveView README.
- Phoenix LiveView docs (persistent connection).
- GenStage documentation.
- Discord engineering blog (5M concurrent users, millions events/sec).
Key insights Real-time systems need both UI diffing and backpressure to stay stable.
Summary LiveView delivers server-rendered realtime UI; GenStage delivers safe flow control. Combined, they are a BEAM-native real-time stack.
Homework/Exercises to practice the concept
- Design a pipeline with bounded demand and describe how to tune it.
- Sketch a LiveView state update flow for a dashboard.
Solutions to the homework/exercises
- Set demand limits and buffer sizes at each stage.
- Server state changes trigger diff updates to the client.
Concept 7: Code Loading and Release Handling
Fundamentals Erlang/OTP supports runtime code replacement and release handling through SASL. The release handler installs upgrades based on appup and relup instructions, and can reload, restart, or replace applications as needed. Core applications (ERTS, kernel, stdlib, sasl) require runtime restarts during upgrades via restart_new_emulator, while other upgrades may use restart_emulator or in-place changes.
Deep Dive Runtime code loading is a distinguishing feature of BEAM. The system can load a new version of a module while the system runs, and processes can transition to the new code when they next call into that module. This is not magic; it relies on careful design of process state and upgrade callbacks. The release handling framework in OTP formalizes this for full releases. It uses appup files to describe application upgrade steps and relup files to describe release-level steps.
Release handling is explicit about restart boundaries. The documentation describes restart_new_emulator for upgrades that change the runtime system or core applications. This instruction reboots the runtime and is required when ERTS or core apps are upgraded. For other upgrades, restart_emulator can be used at the end of a relup to reboot after upgrade instructions are executed. This makes it clear that “hot upgrade” has constraints: some upgrades require a controlled reboot, not a pure in-place swap.
A key challenge in hot upgrades is state migration. If the internal state structure of a process changes, you must define a code_change step to transform state. This is a design commitment; if you do not plan for it, hot upgrades will be painful. Projects in this guide include a controlled upgrade drill so you can practice the mechanics of a safe state transition.
Release handling also interacts with distribution. Each node can have its own release version, and upgrades can be coordinated across nodes using synchronization instructions. This enables rolling upgrades or staged rollouts, but only if you design your upgrade plan and compatibility boundaries.
The goal of learning this concept is not to make every project hot-upgradeable. It is to understand the constraints and the tooling so that when uptime requirements demand it, you can design for upgrade safety rather than retrofitting under pressure.
How this fits into the projects: Project 9 is entirely about release handling, and Projects 1-3 benefit from upgrade-safe state design.
Definitions & key terms
- Release handler: OTP component that installs upgrades.
- appup/relup: Upgrade instruction files for apps and releases.
- restart_new_emulator: Required for core runtime upgrades.
- restart_emulator: Reboot instruction for non-core upgrades.
Mental model diagram
Old Release -> appup/relup -> release_handler -> Upgrade Steps -> New Release
| |
+---- code_change(state) ------------------------+
How it works (step-by-step, with invariants and failure modes)
- Build release with appup/relup instructions.
- Install release package on running node.
- Release handler executes upgrade steps.
- Processes transition state via code_change.
- Invariant: state transitions must be explicit and safe.
- Failure modes: incompatible state, missing appup, core upgrade without restart.
Minimal concrete example
State migration:
- v1 state: {user_id, count}
- v2 state: {user_id, count, last_seen}
- code_change adds last_seen with default
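A sketch of this migration as a GenServer code_change/3 callback; the module name and state shape are illustrative:

```elixir
defmodule UserCounter do
  use GenServer

  # v2 state: {user_id, count, last_seen}
  @impl true
  def init(user_id), do: {:ok, {user_id, 0, nil}}

  # Runs during a hot upgrade: transform the old v1 state
  # {user_id, count} by adding the new field with a safe default.
  @impl true
  def code_change(_old_vsn, {user_id, count}, _extra) do
    {:ok, {user_id, count, nil}}
  end
end
```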
Common misconceptions
- “All upgrades are hot.” Core runtime upgrades require restart.
- “State changes are automatic.” You must define migration steps.
Check-your-understanding questions
- Why do core OTP applications require runtime restart?
- What does an appup file describe?
- Why is state migration central to hot upgrades?
Check-your-understanding answers
- The runtime itself cannot hot-swap its core components.
- Upgrade instructions between application versions.
- Processes carry state across versions; you must transform it.
Real-world applications
- Zero-downtime upgrades in telecom systems
- Rolling upgrades in distributed services
Where you’ll apply it
- Project 9 (Hot Code Upgrade Drill)
- Project 1 (Chat System) as optional enhancement
References
- Release Handling in OTP.
- restart_new_emulator and restart_emulator instructions.
Key insights Hot upgrades are possible only when state and upgrade steps are explicit.
Summary Release handling makes runtime upgrades possible but requires careful design of state, versioning, and upgrade steps.
Homework/Exercises to practice the concept
- Describe a state change that requires migration.
- Sketch an upgrade plan with a rollback strategy.
Solutions to the homework/exercises
- Add a new field with a default value during upgrade.
- Deploy appup, verify health, and maintain rollback relup.
Glossary
- BEAM: The Erlang virtual machine and runtime system.
- OTP: Open Telecom Platform; a set of libraries and design principles.
- GenServer: OTP behavior for server processes.
- Supervisor: OTP behavior that monitors and restarts children.
- Reduction: Scheduling unit for fairness.
- ETS: In-memory term storage.
- Distributed Erlang: Nodes connected for transparent message passing.
- LiveView: Server-rendered real-time UI with diffs.
- GenStage: Backpressure-aware pipeline.
- Release handling: OTP upgrade framework.
Why BEAM Matters
- BEAM’s process model and supervision trees are explicitly designed for fault tolerance at scale.
- Discord reports scaling Elixir to nearly five million concurrent users and millions of events per second (July 6, 2017), demonstrating BEAM’s relevance for real-time systems.
- LiveView and GenStage provide a BEAM-native approach to real-time UI and backpressure-driven pipelines.
ASCII diagram: old vs new concurrency model
OLD (Threads + Locks) NEW (Processes + Messages)
Shared state Isolated state
Locks and contention Message passing
One crash can corrupt state Crash is isolated
Complex recovery Supervisor restarts
Concept Summary Table
| Concept Cluster | What You Need to Internalize |
|---|---|
| Process Model | Lightweight isolated processes, message copying, mailbox semantics. |
| OTP + Supervision | GenServer/Supervisor lifecycle and restart strategies. |
| Scheduling + GC | Reduction-based fairness and per-process GC. |
| Distribution | Node naming, transparent messaging, and failure semantics. |
| State + ETS | In-memory tables, lifecycle, and query constraints. |
| Real-Time + Backpressure | LiveView diffing and GenStage demand control. |
| Release Handling | appup/relup, code_change, and upgrade boundaries. |
Project-to-Concept Map
| Project | Concepts Applied |
|---|---|
| Project 1 | Process Model, OTP + Supervision |
| Project 2 | OTP + Supervision, State + ETS |
| Project 3 | Distribution, State + ETS |
| Project 4 | Real-Time + Backpressure |
| Project 5 | Real-Time + Backpressure, Scheduling + GC |
| Project 6 | OTP + Supervision, Scheduling + GC |
| Project 7 | Distribution, Process Model |
| Project 8 | State + ETS |
| Project 9 | Release Handling |
| Project 10 | Scheduling + GC, Real-Time + Backpressure |
Deep Dive Reading by Concept
| Concept | Book and Chapter | Why This Matters |
|---|---|---|
| Process Model | “Programming Erlang” by Joe Armstrong - Processes & Message Passing | Actor model and concurrency fundamentals |
| OTP + Supervision | “Designing for Scalability with Erlang/OTP” - Supervision | Practical fault-tolerance design |
| Scheduling + GC | “The BEAM Book” - Scheduler & Memory | Understand runtime behavior |
| Distribution | “Programming Erlang” - Distributed Erlang | Cluster semantics and failure handling |
| State + ETS | “Erlang and OTP in Action” - ETS/Mnesia | Data structures and storage |
| Real-Time + Backpressure | “Elixir in Action” - Processes and GenStage | Flow control in pipelines |
| Release Handling | “Programming Erlang” - Code upgrades | Safe runtime upgrades |
Quick Start: Your First 48 Hours
Day 1:
- Read Concept 1 and Concept 2 in the Theory Primer.
- Start Project 1 and get a supervised GenServer running.
Day 2:
- Validate Project 1 against the Definition of Done.
- Read Concept 3 and Concept 5 for ETS and scheduling basics.
Recommended Learning Paths
Path 1: The Web Builder
- Project 4 -> Project 1 -> Project 2 -> Project 10
Path 2: The Systems Learner
- Project 1 -> Project 2 -> Project 3 -> Project 7 -> Project 9
Path 3: The Distributed Systems Learner
- Project 3 -> Project 7 -> Project 5 -> Project 10
Success Metrics
- You can explain supervision strategies and justify your choices.
- You can design a process tree for a real service.
- You can identify and fix mailbox backlogs.
- You can run a controlled upgrade with a state migration step.
Optional Appendix: BEAM Tooling Cheat Sheet
- observer: visualize processes, memory, and message queues
- :sys: inspect GenServer state and system messages
- :dbg: trace function calls and message flow
- telemetry: instrument and export metrics
Project Overview Table
| # | Project Name | Main Language | Difficulty | Time Estimate | Core Concepts | Coolness |
|---|---|---|---|---|---|---|
| 1 | Supervised Chat System | Elixir | Level 2 | 10-15 hrs | OTP + Supervision | Level 3 |
| 2 | Rate Limiter + Circuit Breaker | Elixir | Level 2 | 12-18 hrs | ETS + Supervision | Level 3 |
| 3 | Distributed KV Store | Erlang/Elixir | Level 3 | 20-30 hrs | Distribution + ETS | Level 4 |
| 4 | LiveView Real-Time Dashboard | Elixir | Level 2 | 12-20 hrs | LiveView | Level 3 |
| 5 | GenStage Backpressure Pipeline | Elixir | Level 3 | 18-25 hrs | Backpressure | Level 4 |
| 6 | Fault Injection Harness | Elixir | Level 2 | 10-15 hrs | Supervision | Level 3 |
| 7 | Presence and Notification Service | Erlang/Elixir | Level 3 | 20-30 hrs | Distribution | Level 4 |
| 8 | ETS Cache Service | Erlang/Elixir | Level 2 | 10-15 hrs | ETS | Level 3 |
| 9 | Hot Code Upgrade Drill | Erlang | Level 3 | 15-25 hrs | Release Handling | Level 4 |
| 10 | Telemetry Pipeline + Live Metrics | Elixir | Level 3 | 15-25 hrs | Scheduling + LiveView | Level 4 |
Project List
The following projects guide you from core OTP patterns to distributed, real-time systems.
Project 1: Supervised Real-Time Chat System
- File: P01-supervised-chat-system.md
- Main Programming Language: Elixir
- Alternative Programming Languages: Erlang, Gleam
- Coolness Level: Level 3 (See REFERENCE.md)
- Business Potential: Level 2 (See REFERENCE.md)
- Difficulty: Level 2 (See REFERENCE.md)
- Knowledge Area: Concurrency, Fault Tolerance
- Software or Tool: OTP/GenServer
- Main Book: “Programming Erlang”
What you will build: A multi-room chat service where each room is a supervised process and failures are isolated.
Why it teaches BEAM: It forces you to model state as processes and to recover from crashes using supervisors.
Core challenges you will face:
- Process isolation -> Process Model
- Room supervision -> OTP + Supervision
- Message routing -> Process Model
Real World Outcome
You can run a CLI client for two rooms, kill a room process, and watch it restart without losing other rooms.
$ chatctl create room general
room general: pid=<0.215.0>
$ chatctl send general "hello"
[general] user42: hello
$ chatctl crash room general
room general crashed; supervisor restarted it
$ chatctl send general "we are back"
[general] user42: we are back
The Core Question You Are Answering
“How do I build a service where one failing chat room does not bring down the whole system?”
Concepts You Must Understand First
- BEAM process isolation
- Why are messages copied and state isolated?
- Book Reference: “Programming Erlang” - Processes chapter
- Supervision strategy
- Which restart strategy fits independent rooms?
- Book Reference: “Designing for Scalability with Erlang/OTP” - Supervision
- GenServer lifecycle
- How are calls and casts handled?
- Book Reference: “Elixir in Action” - OTP section
Questions to Guide Your Design
- State model
- Will each room be one process or multiple?
- How will you store room history safely?
- Failure model
- What happens when a room crashes mid-message?
- How do you inform clients that the room restarted?
Thinking Exercise
Room Isolation Sketch
Draw a supervision tree for 3 rooms. Decide where to place a router process that handles room discovery.
Questions to answer:
- What should happen if the router dies?
- Which processes should be linked vs monitored?
The Interview Questions They Will Ask
- “Why use one process per room instead of shared state?”
- “How does supervision improve reliability?”
- “How do GenServer calls differ from casts?”
- “What happens to messages during a crash?”
- “How would you scale this across nodes?”
Hints in Layers
Hint 1: Starting Point Start with one room process and a simple send/receive API.
Hint 2: Next Level Add a supervisor that restarts the room on crash.
Hint 3: Technical Details Pseudocode:
Room process:
- state: list of users
- on {join, user}: add to list
- on {message, user, text}: broadcast
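One possible Elixir shape for this room process (module and message names are illustrative; broadcast here is a plain send to each joined pid):

```elixir
defmodule Chat.Room do
  use GenServer

  def start_link(name), do: GenServer.start_link(__MODULE__, name)

  @impl true
  def init(name), do: {:ok, %{name: name, users: []}}

  @impl true
  def handle_cast({:join, user_pid}, state) do
    {:noreply, %{state | users: [user_pid | state.users]}}
  end

  def handle_cast({:message, from, text}, state) do
    # Broadcast to every joined user process.
    for pid <- state.users, do: send(pid, {state.name, from, text})
    {:noreply, state}
  end
end
```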
Hint 4: Tools/Debugging
Use observer to confirm the room process restarts after a crash.
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Processes | “Programming Erlang” | Processes chapter |
| Supervision | “Designing for Scalability with Erlang/OTP” | Supervision chapter |
Common Pitfalls and Debugging
Problem 1: “Room crashes kill the whole app”
- Why: You started rooms without a supervisor.
- Fix: Put each room under a one_for_one supervisor.
- Quick test: Crash a room and verify the supervisor restarts only that room.
Definition of Done
- Each room is its own process
- Rooms restart on crash without affecting others
- Messages still flow after a restart
- Process tree visible in observer
Project 2: Rate Limiter and Circuit Breaker Service
- File: P02-rate-limiter-circuit-breaker.md
- Main Programming Language: Elixir
- Alternative Programming Languages: Erlang, Gleam
- Coolness Level: Level 3 (See REFERENCE.md)
- Business Potential: Level 2 (See REFERENCE.md)
- Difficulty: Level 2 (See REFERENCE.md)
- Knowledge Area: State, Fault Tolerance
- Software or Tool: ETS + GenServer
- Main Book: “Erlang and OTP in Action”
What you will build: A rate limiter backed by ETS and a circuit breaker that trips on repeated failures.
Why it teaches BEAM: You must manage state safely and choose supervision policies.
Core challenges you will face:
- ETS design -> State + ETS
- Supervisor strategy -> OTP + Supervision
- Timeout handling -> Scheduling + GC
Real World Outcome
$ ratelimit check user42
allowed (remaining=9)
$ ratelimit check user42
blocked (retry_in=12s)
$ breaker status serviceA
state: open (cooldown=30s)
The Core Question You Are Answering
“How do I enforce limits and protect dependencies without locks or shared memory races?”
Concepts You Must Understand First
- ETS tables
- What happens when the owner crashes?
- Book Reference: “Erlang and OTP in Action” - ETS chapter
- Supervision strategy
- How do you restart a limiter safely?
- Book Reference: “Designing for Scalability with Erlang/OTP”
- GenServer state
- How do you serialize updates?
- Book Reference: “Elixir in Action”
Questions to Guide Your Design
- Rate limiting algorithm
- Fixed window, sliding window, or token bucket?
- How will you store timestamps efficiently?
- Circuit breaker
- What error threshold trips the breaker?
- How long before a half-open state?
Thinking Exercise
State Table Design
Design the ETS key for {user_id, window} and decide how to prune expired windows.
Questions to answer:
- How do you keep reads constant time?
- What is your cleanup strategy?
The Interview Questions They Will Ask
- “Why choose ETS for a rate limiter?”
- “How do you avoid race conditions without locks?”
- “What supervision strategy is best for a limiter?”
- “How do you prevent restart storms?”
Hints in Layers
Hint 1: Starting Point Model the limiter as one GenServer that owns ETS.
Hint 2: Next Level Store counters per time window and expire old keys.
Hint 3: Technical Details Pseudocode:
on check(user):
window = floor(now / window_size)
count = lookup({user, window})
if count < limit -> increment and allow
else -> deny
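An ETS-backed sketch of this check. :ets.update_counter/4 increments atomically and inserts the default row when the key is absent; the table name and limits are illustrative:

```elixir
defmodule RateLimiter do
  @table :rate_limits
  @window_ms 60_000
  @limit 10

  def setup, do: :ets.new(@table, [:set, :public, :named_table])

  def check(user_id) do
    window = div(System.system_time(:millisecond), @window_ms)
    key = {user_id, window}

    # Atomic increment; {key, 0} is inserted first if the key is missing.
    count = :ets.update_counter(@table, key, 1, {key, 0})
    if count <= @limit, do: :allow, else: :deny
  end
end
```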
Hint 4: Tools/Debugging
Use observer to confirm ETS table size over time.
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| ETS design | “Erlang and OTP in Action” | ETS chapter |
| Fault tolerance | “Designing for Scalability with Erlang/OTP” | Supervision chapter |
Common Pitfalls and Debugging
Problem 1: “Limiter forgets counts after crash”
- Why: ETS table is destroyed when owner dies.
- Fix: Rebuild from logs or accept reset as design choice.
- Quick test: Crash owner and verify expected behavior.
Definition of Done
- Limits enforced per user and window
- ETS table lifecycle documented
- Circuit breaker transitions documented
- Supervisor restarts without global failure
Project 3: Distributed Key-Value Store
- File: P03-distributed-kv-store.md
- Main Programming Language: Erlang or Elixir
- Alternative Programming Languages: Gleam
- Coolness Level: Level 4 (See REFERENCE.md)
- Business Potential: Level 3 (See REFERENCE.md)
- Difficulty: Level 3 (See REFERENCE.md)
- Knowledge Area: Distribution, Data Stores
- Software or Tool: Distributed Erlang + ETS
- Main Book: “Programming Erlang”
What you will build: A sharded KV store that runs on multiple nodes and replicates keys.
Why it teaches BEAM: It forces you to design with node failure, message passing, and ETS-backed state. citeturn5search4turn2search7
Core challenges you will face:
- Node discovery -> Distribution
- Shard mapping -> Process Model
- Replication -> Distribution + ETS
Real World Outcome
$ kv put user:42 "active" --node nodeA
ok
$ kv get user:42 --node nodeB
"active" (replica)
$ kv cluster status
nodes: [nodeA,nodeB,nodeC]
shards: 64
replication: 2
The Core Question You Are Answering
“How do I maintain state across nodes while handling node failures gracefully?”
Concepts You Must Understand First
- Distributed Erlang nodes
- How are nodes named and connected? citeturn5search4
- Book Reference: “Programming Erlang” - Distribution chapter
- ETS lifecycle
- What happens to tables when a node restarts? citeturn2search7
- Book Reference: “Erlang and OTP in Action” - ETS chapter
- Supervisor policies
- How do you restart shard processes? citeturn0search0
- Book Reference: “Designing for Scalability with Erlang/OTP”
Questions to Guide Your Design
- Sharding
- How will you assign keys to shards?
- Replication
- How many replicas, and how do you keep them consistent?
Thinking Exercise
Failure Drill
Simulate node loss: what happens to keys on that node and how do clients recover?
Questions to answer:
- How do you detect that a node is down?
- How do you reassign shards?
The Interview Questions They Will Ask
- “How does distributed Erlang handle messaging between nodes?” citeturn5search4
- “What happens to ETS state on node failure?” citeturn2search7
- “How do you handle split-brain?”
- “Why use supervision for shard processes?” citeturn0search0
Hints in Layers
Hint 1: Starting Point Start with two nodes and a single shard process on each.
Hint 2: Next Level Use consistent hashing to map keys to shard owners.
Hint 3: Technical Details Pseudocode:
shard = hash(key) mod shard_count
primary = shard_owner(shard)
replica = next_owner(shard)
Hint 4: Tools/Debugging Use node monitors to detect failures and log shard reassignment.
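A minimal Elixir sketch of the shard-mapping hint, assuming every node runs a hypothetical KV.Shard module and that all nodes compute owners from the same sorted node list:
```elixir
defmodule KV.Router do
  @shards 64

  # Deterministic mapping: every node derives the same owners for a key.
  def owners(key) do
    nodes = Enum.sort([Node.self() | Node.list()])
    shard = :erlang.phash2(key, @shards)
    primary = Enum.at(nodes, rem(shard, length(nodes)))
    replica = Enum.at(nodes, rem(shard + 1, length(nodes)))
    {shard, primary, replica}
  end

  # Read from the primary; fall back to the replica if the primary is down.
  def get(key) do
    {_shard, primary, replica} = owners(key)

    try do
      :erpc.call(primary, KV.Shard, :get, [key], 5_000)
    catch
      _class, _reason -> :erpc.call(replica, KV.Shard, :get, [key], 5_000)
    end
  end
end
```
Note that this simple modulo scheme remaps many keys whenever membership changes; true consistent hashing (the Hint 2 suggestion) reduces that churn.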
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Distribution | “Programming Erlang” | Distribution chapter |
| ETS | “Erlang and OTP in Action” | ETS chapter |
Common Pitfalls and Debugging
Problem 1: “Keys disappear after node crash”
- Why: State was only on the crashed node.
- Fix: Add replication or persistence.
- Quick test: Kill a node and verify replicas still answer.
Definition of Done
- Keys retrievable from replica node
- Shard ownership rebalances after node loss
- Node discovery documented
- Cluster health command works
Project 4: LiveView Real-Time Operations Dashboard
- File: P04-liveview-ops-dashboard.md
- Main Programming Language: Elixir
- Alternative Programming Languages: None
- Coolness Level: Level 3 (See REFERENCE.md)
- Business Potential: Level 2 (See REFERENCE.md)
- Difficulty: Level 2 (See REFERENCE.md)
- Knowledge Area: Real-Time Web
- Software or Tool: Phoenix LiveView
- Main Book: “Programming Phoenix”
What you will build: A real-time dashboard that streams metrics to the browser with LiveView diff updates.
Why it teaches BEAM: LiveView uses server-rendered HTML with diff tracking and a persistent connection for real-time updates. citeturn3search0turn3search6
Core challenges you will face:
- State streaming -> LiveView
- Efficient updates -> LiveView diffing
- Process isolation -> Process Model
Real World Outcome
The dashboard shows:
- A top bar with system status (green/yellow/red)
- A grid of cards for CPU, memory, message queue sizes
- A live chart that updates once per second
Behavior:
- When metrics spike, only the affected card updates (no full page reload). citeturn3search0
The Core Question You Are Answering
“How do I deliver live UI updates without building a heavy client app?”
Concepts You Must Understand First
- LiveView lifecycle
- How does the initial render differ from updates? citeturn3search0turn3search6
- Book Reference: “Programming Phoenix” - LiveView chapters
- Process model
- What process owns the dashboard state?
- Book Reference: “Programming Erlang” - Processes
Questions to Guide Your Design
- Update frequency
- How often do you push updates without overwhelming clients?
- Data pipeline
- Where do metrics originate and how are they buffered?
Thinking Exercise
Diff Strategy
Decide which UI components can update independently and which must update together.
Questions to answer:
- What data should be computed per client?
- What can be shared across clients?
The Interview Questions They Will Ask
- “Why use LiveView instead of client-side JS?” citeturn3search0
- “How does LiveView minimize network traffic?” citeturn3search0
- “How do you handle slow clients?”
- “What happens if a LiveView process crashes?”
Hints in Layers
Hint 1: Starting Point Build a static LiveView page with placeholders.
Hint 2: Next Level Add a timer process that sends metric updates.
Hint 3: Technical Details Pseudocode:
Every 1s:
read metrics
update assigns
LiveView pushes diff
Hint 4: Tools/Debugging Use browser devtools to confirm small diff payloads.
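A minimal sketch of the tick loop as a LiveView, assuming a Phoenix project on a recent LiveView version and a hypothetical Metrics.read/0 that returns the current metric map:
```elixir
defmodule DashboardLive do
  use Phoenix.LiveView

  def mount(_params, _session, socket) do
    # Schedule ticks only on the live (connected) mount, not the static render.
    if connected?(socket), do: Process.send_after(self(), :tick, 1_000)
    {:ok, assign(socket, :metrics, %{})}
  end

  def handle_info(:tick, socket) do
    Process.send_after(self(), :tick, 1_000)
    # Updating only the :metrics assign lets LiveView diff and push just the
    # changed values instead of re-rendering the whole page.
    {:noreply, assign(socket, :metrics, Metrics.read())}
  end

  def render(assigns) do
    ~H"""
    <div :for={{name, value} <- @metrics} class="card"><%= name %>: <%= value %></div>
    """
  end
end
```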
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| LiveView | “Programming Phoenix” | LiveView chapters |
Common Pitfalls and Debugging
Problem 1: “Whole page re-renders”
- Why: You are reassigning the entire state on each tick.
- Fix: Update only the changed assigns.
- Quick test: Track diff size in logs.
Definition of Done
- Live metrics update without page refresh
- Diff payloads remain small
- UI degrades gracefully under load
- Crash of dashboard process recovers cleanly
Project 5: GenStage Backpressure Pipeline
- File: P05-genstage-backpressure-pipeline.md
- Main Programming Language: Elixir
- Alternative Programming Languages: Erlang
- Coolness Level: Level 4 (See REFERENCE.md)
- Business Potential: Level 3 (See REFERENCE.md)
- Difficulty: Level 3 (See REFERENCE.md)
- Knowledge Area: Stream Processing
- Software or Tool: GenStage
- Main Book: “Elixir in Action”
What you will build: A producer-consumer pipeline that processes bursty events with explicit demand control.
Why it teaches BEAM: GenStage makes demand explicit, giving you a concrete backpressure mechanism. citeturn4search2
Core challenges you will face:
- Demand control -> Backpressure
- Work distribution -> Scheduler/GC
- Failure recovery -> Supervision
Real World Outcome
$ pipeline send 10000
accepted
$ pipeline status
producer_buffer=500
consumer_demand=200
processed=9800
The Core Question You Are Answering
“How do I prevent a burst of events from collapsing my system?”
Concepts You Must Understand First
- Backpressure
- Why should demand be explicit? citeturn4search2
- Book Reference: “Elixir in Action” - GenStage section
- Scheduler fairness
- Why distribute work across processes?
- Book Reference: “The BEAM Book” - Scheduler chapter
Questions to Guide Your Design
- Demand size
- How many events should a consumer request at a time?
- Failure policy
- What happens if a consumer crashes mid-batch?
Thinking Exercise
Burst Simulation
Design a scenario where input spikes 10x and decide how the pipeline should respond.
Questions to answer:
- Where do events buffer?
- How do you detect lag?
The Interview Questions They Will Ask
- “What is backpressure and why do we need it?” citeturn4search2
- “How do GenStage producers and consumers coordinate?” citeturn4search2
- “How do you measure pipeline lag?”
- “What happens if a consumer crashes?”
Hints in Layers
Hint 1: Starting Point Start with one producer and one consumer.
Hint 2: Next Level Add multiple consumers and split demand.
Hint 3: Technical Details Pseudocode:
Consumer requests N events
Producer sends N events
Consumer processes and requests more
Hint 4: Tools/Debugging Log demand and buffer size on each stage.
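A minimal demand-driven pair, close to the canonical example in the GenStage docs; the counter producer and batch-logging consumer are illustrative:
```elixir
defmodule Producer do
  use GenStage

  def start_link(start), do: GenStage.start_link(__MODULE__, start, name: __MODULE__)
  def init(counter), do: {:producer, counter}

  # Emit only as many events as were demanded: this is the backpressure.
  def handle_demand(demand, counter) when demand > 0 do
    events = Enum.to_list(counter..(counter + demand - 1))
    {:noreply, events, counter + demand}
  end
end

defmodule Consumer do
  use GenStage

  def start_link(_), do: GenStage.start_link(__MODULE__, :ok)

  # max_demand bounds how many events can be in flight at this consumer.
  def init(:ok), do: {:consumer, :ok, subscribe_to: [{Producer, max_demand: 200}]}

  def handle_events(events, _from, state) do
    IO.inspect(length(events), label: "processed batch")
    {:noreply, [], state}
  end
end
```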
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Backpressure | “Elixir in Action” | GenStage chapter |
Common Pitfalls and Debugging
Problem 1: “Mailboxes grow without bound”
- Why: You are emitting without honoring demand. citeturn4search2
- Fix: Enforce demand-based flow control.
- Quick test: Trigger a large burst and verify buffers stay bounded.
Definition of Done
- Demand-based flow implemented
- Burst input does not crash the pipeline
- Consumer failure is recovered via supervision
- Metrics show bounded queues
Project 6: Fault Injection and Supervision Harness
- File: P06-fault-injection-harness.md
- Main Programming Language: Elixir
- Alternative Programming Languages: Erlang
- Coolness Level: Level 3 (See REFERENCE.md)
- Business Potential: Level 2 (See REFERENCE.md)
- Difficulty: Level 2 (See REFERENCE.md)
- Knowledge Area: Fault Tolerance
- Software or Tool: OTP Supervision
- Main Book: “Designing for Scalability with Erlang/OTP”
What you will build: A harness that deliberately crashes workers under different strategies and records recovery behavior.
Why it teaches BEAM: You will see supervision strategies in action, not just read about them. citeturn0search0turn0search5
Core challenges you will face:
- Strategy selection -> OTP + Supervision
- Crash scenarios -> Process Model
- Observability -> Scheduling/GC
Real World Outcome
$ faultlab run --strategy one_for_one
worker A crashed -> restarted
worker B unaffected
$ faultlab run --strategy one_for_all
worker A crashed -> all workers restarted
The Core Question You Are Answering
“What actually happens when a process crashes under different supervision strategies?”
Concepts You Must Understand First
- Supervision strategies citeturn0search0
- Process isolation citeturn5search0
Questions to Guide Your Design
- Crash injection
- How will you trigger controlled failures?
- Measurement
- What metrics show recovery speed and impact?
Thinking Exercise
Failure Tree
Draw a tree with 3 workers and decide which should restart together.
The Interview Questions They Will Ask
- “How do one_for_one and one_for_all differ?” citeturn0search0
- “How would you test your supervision tree?”
- “What is a restart storm and how do you prevent it?” citeturn0search0
Hints in Layers
Hint 1: Starting Point Create workers that crash on demand.
Hint 2: Next Level Swap the supervisor strategy and log the outcomes.
Hint 3: Technical Details Pseudocode:
if trigger == crash:
worker exits with error
supervisor applies strategy
Hint 4: Tools/Debugging
Use observer to watch process restarts.
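A minimal sketch of a crash-on-demand worker under a swappable strategy; the module and worker names are illustrative:
```elixir
defmodule FaultLab.Worker do
  use GenServer

  def start_link(name), do: GenServer.start_link(__MODULE__, name, name: name)
  def init(name), do: {:ok, name}

  # GenServer.cast(:worker_a, :crash) injects a controlled failure.
  def handle_cast(:crash, name), do: raise("injected crash in #{name}")
end

defmodule FaultLab.Supervisor do
  use Supervisor

  def start_link(strategy), do: Supervisor.start_link(__MODULE__, strategy)

  def init(strategy) do
    children = [
      Supervisor.child_spec({FaultLab.Worker, :worker_a}, id: :worker_a),
      Supervisor.child_spec({FaultLab.Worker, :worker_b}, id: :worker_b)
    ]

    # Swap :one_for_one for :one_for_all and compare which workers restart.
    Supervisor.init(children, strategy: strategy, max_restarts: 3, max_seconds: 5)
  end
end
```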
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Supervision | “Designing for Scalability with Erlang/OTP” | Supervision chapter |
Common Pitfalls and Debugging
Problem 1: “All workers restart unexpectedly”
- Why: You selected one_for_all for independent workers.
- Fix: Use one_for_one when workers are independent. citeturn0search0
- Quick test: Crash one worker and verify only that worker restarts.
Definition of Done
- Supports multiple strategies
- Logs restart behavior
- Demonstrates escalation when restart intensity is exceeded
- Clear report of recovery time
Project 7: Presence and Notification Service
- File: P07-presence-notifications.md
- Main Programming Language: Elixir or Erlang
- Alternative Programming Languages: Gleam
- Coolness Level: Level 4 (See REFERENCE.md)
- Business Potential: Level 3 (See REFERENCE.md)
- Difficulty: Level 3 (See REFERENCE.md)
- Knowledge Area: Distribution, Real-Time
- Software or Tool: Distributed Erlang
- Main Book: “Programming Erlang”
What you will build: A multi-node presence service that tracks online users and pushes notifications.
Why it teaches BEAM: It forces you to use node naming, message passing, and failure detection across nodes. citeturn5search4
Core challenges you will face:
- Node awareness -> Distribution
- Presence consistency -> Process Model
- Failure handling -> Supervision
Real World Outcome
$ presence login user42 --node nodeA
ok
$ presence status user42 --node nodeB
online (via nodeA)
$ presence notify user42 "ping"
queued
The Core Question You Are Answering
“How do I maintain a consistent view of online users across nodes?”
Concepts You Must Understand First
- Distributed nodes citeturn5search4
- Message passing semantics citeturn1search6
Questions to Guide Your Design
- Source of truth
- Is presence authoritative on one node or replicated?
- Failure handling
- What happens when a node goes down?
Thinking Exercise
Partition Scenario
Simulate a network split. How do you reconcile presence when nodes reconnect?
The Interview Questions They Will Ask
- “How does distributed Erlang handle messaging?” citeturn5search4
- “What are the risks of global registries?” citeturn6search0
- “How do you detect node failures?”
Hints in Layers
Hint 1: Starting Point Start with local presence per node.
Hint 2: Next Level Add a router that forwards queries across nodes.
Hint 3: Technical Details Pseudocode:
if user not found locally:
query other nodes
Hint 4: Tools/Debugging Kill a node and verify presence cleans up.
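A minimal sketch of the local-first lookup with cross-node fallback, assuming each node runs a hypothetical Presence.Local server exposing lookup/1:
```elixir
defmodule Presence do
  def status(user_id) do
    case Presence.Local.lookup(user_id) do
      {:ok, _meta} -> {:online, Node.self()}
      :error -> query_peers(user_id)
    end
  end

  defp query_peers(user_id) do
    nodes = Node.list()

    # :erpc.multicall fans the lookup out to all connected nodes in parallel;
    # results come back in the same order as the node list.
    nodes
    |> :erpc.multicall(Presence.Local, :lookup, [user_id], 5_000)
    |> Enum.zip(nodes)
    |> Enum.find_value(:offline, fn
      {{:ok, {:ok, _meta}}, node} -> {:online, node}
      _other -> nil
    end)
  end
end
```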
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Distribution | “Programming Erlang” | Distribution chapter |
Common Pitfalls and Debugging
Problem 1: “Presence shows users online after node crash”
- Why: Presence entries were not purged when the node-down message arrived.
- Fix: Monitor node and purge presence on disconnect.
- Quick test: Stop a node and re-check status.
Definition of Done
- Presence queries work across nodes
- Notifications routed to correct node
- Node failure cleans up state
- Basic partition handling documented
Project 8: ETS Cache and Session Service
- File: P08-ets-cache-service.md
- Main Programming Language: Erlang or Elixir
- Alternative Programming Languages: Gleam
- Coolness Level: Level 3 (See REFERENCE.md)
- Business Potential: Level 2 (See REFERENCE.md)
- Difficulty: Level 2 (See REFERENCE.md)
- Knowledge Area: Data Storage
- Software or Tool: ETS
- Main Book: “Erlang and OTP in Action”
What you will build: An ETS-backed cache with TTL eviction and stats reporting.
Why it teaches BEAM: It forces you to manage ETS lifecycle and cleanup safely. citeturn2search7turn2search0
Core challenges you will face:
- Table lifecycle -> ETS
- TTL expiration -> Scheduling
- Concurrency -> Process Model
Real World Outcome
$ cache put key1 value1 --ttl 30s
ok
$ cache get key1
value1
$ cache stats
entries=1200 evictions=45
The Core Question You Are Answering
“How do I build a fast in-memory cache without corrupting shared state?”
Concepts You Must Understand First
- ETS basics citeturn2search7
- Table lifecycle citeturn2search7
Questions to Guide Your Design
- Eviction
- Time-based vs size-based eviction?
- Ownership
- Which process owns the table?
Thinking Exercise
TTL Sweep
Design a cleanup process that removes expired keys without scanning all entries every time.
The Interview Questions They Will Ask
- “What happens if the ETS owner dies?” citeturn2search7
- “How do you avoid full-table scans?” citeturn2search0
- “Why use ETS instead of a GenServer map?”
Hints in Layers
Hint 1: Starting Point Store expiry timestamps with each key.
Hint 2: Next Level Use a periodic sweep process.
Hint 3: Technical Details Pseudocode:
Every 5s:
scan a slice of keys
delete expired
Hint 4: Tools/Debugging Track table size and eviction counts over time.
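A minimal sketch of TTL reads plus a sweep, assuming a public named table :my_cache created by the supervised owner; names and intervals are illustrative:
```elixir
# Assumes: :ets.new(:my_cache, [:named_table, :public, :set])
defmodule Cache do
  def put(key, value, ttl_ms) do
    expires_at = System.monotonic_time(:millisecond) + ttl_ms
    :ets.insert(:my_cache, {key, value, expires_at})
    :ok
  end

  def get(key) do
    now = System.monotonic_time(:millisecond)

    case :ets.lookup(:my_cache, key) do
      [{^key, value, expires_at}] when expires_at > now ->
        {:ok, value}

      [{^key, _value, _expired}] ->
        # Lazy deletion on read keeps expired entries from leaking to callers.
        :ets.delete(:my_cache, key)
        :error

      [] ->
        :error
    end
  end

  # select_delete removes all expired rows in one pass inside the VM; for very
  # large tables, sweep in slices as the hint above suggests.
  def sweep do
    now = System.monotonic_time(:millisecond)
    :ets.select_delete(:my_cache, [{{:_, :_, :"$1"}, [{:<, :"$1", now}], [true]}])
  end
end
```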
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| ETS | “Erlang and OTP in Action” | ETS chapter |
Common Pitfalls and Debugging
Problem 1: “Eviction freezes system”
- Why: Full table scan in one process.
- Fix: Incremental sweeps and batching.
- Quick test: Load with 1M keys and verify stable latency.
Definition of Done
- TTL eviction works
- Table survives normal load
- Stats report accurate counts
- Owner process supervised
Project 9: Hot Code Upgrade Drill
- File: P09-hot-code-upgrade-drill.md
- Main Programming Language: Erlang
- Alternative Programming Languages: Elixir
- Coolness Level: Level 4 (See REFERENCE.md)
- Business Potential: Level 3 (See REFERENCE.md)
- Difficulty: Level 3 (See REFERENCE.md)
- Knowledge Area: Release Engineering
- Software or Tool: Release handling
- Main Book: “Programming Erlang”
What you will build: A controlled upgrade scenario with state migration and rollback.
Why it teaches BEAM: Release handling is unique to OTP and critical for zero-downtime systems. citeturn1search0turn1search2
Core challenges you will face:
- appup design -> Release Handling
- State migration -> Code Change
- Rollback -> Release Handling
Real World Outcome
$ relctl build
release built: v1 -> v2
$ relctl upgrade --apply
upgrade applied
state migrated: ok
$ relctl rollback
rollback complete
The Core Question You Are Answering
“How do I upgrade running systems without corrupting state?”
Concepts You Must Understand First
- Release handler citeturn1search0
- appup/relup instructions citeturn1search0
Questions to Guide Your Design
- Migration
- What state changes need transformation?
- Rollback
- How do you verify upgrade success before committing?
Thinking Exercise
State Evolution
Design a version change that adds a new field to state and requires a default.
The Interview Questions They Will Ask
- “What is an appup file?” citeturn1search0
- “Why do core apps require restart?” citeturn1search2
- “How do you handle rollback?”
Hints in Layers
Hint 1: Starting Point Create a minimal appup with restart_application.
Hint 2: Next Level Add a code_change step for state migration.
Hint 3: Technical Details Pseudocode:
code_change:
add default field to state
Hint 4: Tools/Debugging Use release_handler logs to verify upgrade steps. citeturn1search0
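A minimal sketch of the code_change step from Hint 3, assuming a v1 state of {name, count} that gains a rate field in v2 (the module and fields are illustrative):
```elixir
defmodule Counter.Server do
  use GenServer

  def init(name), do: {:ok, {name, 0, 0.0}}

  # Invoked by the release handler while the process is suspended, via an
  # {update, Module, {advanced, Extra}} instruction in the appup.
  @impl true
  def code_change(_old_vsn, {name, count}, _extra) do
    # v1 state was {name, count}; v2 adds :rate with a safe default so
    # already-running processes survive the upgrade.
    {:ok, {name, count, 0.0}}
  end
end
```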
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Upgrades | “Programming Erlang” | Code upgrade chapter |
Common Pitfalls and Debugging
Problem 1: “Upgrade fails at runtime”
- Why: Missing appup instructions.
- Fix: Add explicit upgrade steps and test on a staging node.
- Quick test: Run upgrade in a controlled environment.
Definition of Done
- Upgrade runs without downtime
- State migration works
- Rollback works
- Logs capture each step
Project 10: Telemetry Pipeline + Live Metrics
- File: P10-telemetry-live-metrics.md
- Main Programming Language: Elixir
- Alternative Programming Languages: Erlang
- Coolness Level: Level 4 (See REFERENCE.md)
- Business Potential: Level 3 (See REFERENCE.md)
- Difficulty: Level 3 (See REFERENCE.md)
- Knowledge Area: Observability
- Software or Tool: Telemetry + LiveView
- Main Book: “Elixir in Action”
What you will build: A telemetry pipeline that collects metrics and streams them to a LiveView dashboard.
Why it teaches BEAM: It combines scheduling, backpressure, and LiveView updates in one system. citeturn3search0turn4search2
Core challenges you will face:
- Event aggregation -> Scheduler + GC
- Backpressure -> GenStage
- Live UI updates -> LiveView
Real World Outcome
The UI shows:
- A rolling chart of request latency
- A table of top message queue sizes
- A live count of process restarts
CLI validation:
$ metricsctl status
events/sec=1200 queue=low live_clients=8
The Core Question You Are Answering
“How do I observe a BEAM system in real time without overwhelming it?”
Concepts You Must Understand First
- Backpressure citeturn4search2
- LiveView rendering citeturn3search0
Questions to Guide Your Design
- Sampling
- How often do you emit events?
- Aggregation
- Do you batch or stream raw events?
Thinking Exercise
Noise Control
Decide which metrics are critical and which can be sampled.
The Interview Questions They Will Ask
- “Why is backpressure necessary for telemetry?” citeturn4search2
- “How does LiveView reduce bandwidth?” citeturn3search0
- “What metrics reveal mailbox pressure?”
Hints in Layers
Hint 1: Starting Point Emit simple counters and render them.
Hint 2: Next Level Add a GenStage pipeline for aggregation.
Hint 3: Technical Details Pseudocode:
collect -> aggregate -> publish -> LiveView
Hint 4: Tools/Debugging Check event rates and backlog in your logs.
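A minimal sketch of the collect step using the :telemetry library; the event name and the hypothetical Metrics.Aggregator stage are illustrative:
```elixir
defmodule Metrics.Collector do
  def attach do
    :telemetry.attach(
      "request-latency-handler",
      [:my_app, :request, :stop],
      &__MODULE__.handle_event/4,
      nil
    )
  end

  # Handlers run inline in the emitting process: do the minimum here and hand
  # off to the aggregation stage so bursts never block callers.
  def handle_event(_event, %{duration: duration}, _metadata, _config) do
    Metrics.Aggregator.record(:latency, duration)
  end
end
```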
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Pipelines | “Elixir in Action” | Processes/GenStage |
Common Pitfalls and Debugging
Problem 1: “Dashboard lags under load”
- Why: Too many events pushed to LiveView.
- Fix: Aggregate before publishing.
- Quick test: Increase load and observe latency trend.
Definition of Done
- Metrics update in real time
- Pipeline handles burst without backlog
- Dashboard remains responsive
- Metrics are accurate and documented
Project Comparison Table
| Project | Difficulty | Time | Depth of Understanding | Fun Factor |
|---|---|---|---|---|
| 1. Chat System | Level 2 | Weekend+ | High | ★★★★☆ |
| 2. Rate Limiter | Level 2 | Weekend+ | High | ★★★☆☆ |
| 3. Distributed KV | Level 3 | 2-3 weeks | Very High | ★★★★☆ |
| 4. LiveView Dashboard | Level 2 | Weekend+ | High | ★★★★☆ |
| 5. GenStage Pipeline | Level 3 | 2-3 weeks | Very High | ★★★★☆ |
| 6. Fault Harness | Level 2 | Weekend+ | High | ★★★☆☆ |
| 7. Presence Service | Level 3 | 2-3 weeks | Very High | ★★★★☆ |
| 8. ETS Cache | Level 2 | Weekend+ | High | ★★★☆☆ |
| 9. Upgrade Drill | Level 3 | 2-3 weeks | Very High | ★★★★☆ |
| 10. Telemetry Pipeline | Level 3 | 2-3 weeks | Very High | ★★★★☆ |
Recommendation
- If you are new to BEAM: start with Project 1 to internalize process isolation and supervision.
- If you are a web developer: start with Project 4 to see LiveView’s real-time model in action.
- If you want distributed systems mastery: focus on Projects 3 and 7.
Final Overall Project: Fault-Tolerant Real-Time Platform
The Goal: Combine Projects 1, 3, 4, and 10 into a single platform with chat, presence, metrics, and distributed storage.
- Use the chat system as the real-time core.
- Store presence and sessions in the distributed KV store.
- Render operational metrics via LiveView.
- Add telemetry with backpressure to keep the system stable under load.
Success Criteria: The platform keeps running under simulated crashes, recovers via supervision, and shows live metrics during recovery.
From Learning to Production: What Is Next
| Your Project | Production Equivalent | Gap to Fill |
|---|---|---|
| Project 1 | Phoenix Presence + Channels | Auth, persistence, scaling |
| Project 3 | Riak / Mnesia-based store | Replication, partition handling |
| Project 4 | LiveView ops dashboards | Auth, multi-tenant UI |
| Project 9 | Release handling pipelines | CI/CD automation |
Summary
This learning path covers BEAM systems through 10 hands-on projects.
| # | Project Name | Main Language | Difficulty | Time Estimate |
|---|---|---|---|---|
| 1 | Supervised Chat System | Elixir | Level 2 | 10-15 hrs |
| 2 | Rate Limiter + Circuit Breaker | Elixir | Level 2 | 12-18 hrs |
| 3 | Distributed KV Store | Erlang/Elixir | Level 3 | 20-30 hrs |
| 4 | LiveView Dashboard | Elixir | Level 2 | 12-20 hrs |
| 5 | GenStage Backpressure Pipeline | Elixir | Level 3 | 18-25 hrs |
| 6 | Fault Injection Harness | Elixir | Level 2 | 10-15 hrs |
| 7 | Presence Service | Erlang/Elixir | Level 3 | 20-30 hrs |
| 8 | ETS Cache Service | Erlang/Elixir | Level 2 | 10-15 hrs |
| 9 | Hot Code Upgrade Drill | Erlang | Level 3 | 15-25 hrs |
| 10 | Telemetry Pipeline | Elixir | Level 3 | 15-25 hrs |
Expected Outcomes
- Design supervision trees that isolate failures
- Build distributed BEAM services with clear failure semantics
- Implement backpressure pipelines and real-time dashboards
Additional Resources and References
Standards and Specifications
- OTP Supervision Principles. citeturn0search5
- Distributed Erlang documentation. citeturn5search4
- Release Handling documentation. citeturn1search0
Industry Analysis
- Discord on scaling Elixir to 5,000,000 concurrent users (2017). citeturn4search0
Books
- “Programming Erlang” by Joe Armstrong - the classic reference
- “Elixir in Action” by Sasa Juric - practical OTP patterns
- “Designing for Scalability with Erlang/OTP” by Cesarini/Thompson - supervision and reliability