Project 16: OTP Control Plane with gen_statem, Dynamic Supervision, Registry, ETS, and Mnesia

Build a workflow control plane using explicit state machines, supervised workers, fast lookup, and replicated metadata.

Quick Reference

Attribute                          Value
Difficulty                         Expert
Time Estimate                      2-3 weeks
Main Programming Language          Elixir
Alternative Programming Languages  Erlang, Shell/JS/Rust (as needed)
Coolness Level                     Level 4-5
Business Potential                 Resume Gold to Open Core
Prerequisites                      Project 15 and OTP design principles
Key Topics                         gen_statem, supervision strategy, registry, ETS, Mnesia

1. Learning Objectives

By completing this project, you will:

  1. Build a production-like subsystem with explicit failure semantics.
  2. Define measurable success criteria using metrics, traces, and logs.
  3. Validate behavior under stress and fault scenarios.
  4. Document architecture and operational tradeoffs clearly.

2. All Theory Needed (Per-Concept Breakdown)

Core Concepts

  • State machines: explicit transitions and timeout behavior
  • Supervision architecture: restart semantics by failure domain
  • Data locality: ETS hot path and Mnesia replicated metadata

Mental Model Diagram

Requirements -> Architecture -> Implementation -> Instrumentation -> Fault Injection -> Validation

How It Works (Step-by-Step)

  1. Define invariant and SLO targets.
  2. Build smallest working path for core behavior.
  3. Add instrumentation for correctness and latency.
  4. Introduce realistic load and failure scenarios.
  5. Compare observed behavior against stated invariants.

Minimal Concrete Example

project_status:
  phase: verify
  slo_latency_p99_ms: within_budget
  invariant_violations: 0

Common Misconceptions

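  • "Happy-path success means production readiness." It does not: crash, timeout, and restart paths need their own tests.
  • "Average latency is enough to judge reliability." It is not: an average hides tail behavior, so track p99 as well.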
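The second misconception is easy to see with numbers. The sketch below compares the mean with a nearest-rank p99 on a latency sample; the LatencyStats module and the sample values are made up for illustration.

```elixir
defmodule LatencyStats do
  # Nearest-rank percentile over latency samples (milliseconds).
  # p is an integer percentile in 1..100.
  def percentile(samples, p) when samples != [] and is_integer(p) and p in 1..100 do
    sorted = Enum.sort(samples)
    n = length(sorted)
    # Integer ceiling of p * n / 100, which avoids float rounding.
    rank = div(p * n + 99, 100)
    Enum.at(sorted, rank - 1)
  end

  def mean(samples) when samples != [] do
    Enum.sum(samples) / length(samples)
  end
end

# Made-up sample: 98 fast requests and 2 slow ones.
samples = List.duplicate(10, 98) ++ [2_000, 2_000]
IO.inspect(LatencyStats.mean(samples), label: "mean_ms")
IO.inspect(LatencyStats.percentile(samples, 50), label: "p50_ms")
IO.inspect(LatencyStats.percentile(samples, 99), label: "p99_ms")
```

With this sample the mean stays under 50 ms while the p99 lands on a 2,000 ms request. This is why the project states its latency target as a p99.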

Check-Your-Understanding Questions

  1. Which invariant is most important for this project?
  2. What failure mode can violate it?
  3. What metric proves the fix worked?

Check-Your-Understanding Answers

  1. The invariant tied to safety/data integrity.
  2. The failure mode that bypasses backpressure/retry/consistency policy.
  3. A stable metric delta across repeatable scenarios.

Real-World Applications

  • Multi-tenant SaaS backends
  • Real-time collaboration systems
  • Distributed workflow services

Key Insight

Model the failure path first, then optimize the happy path.


3. Project Specification

3.1 What You Will Build

Build a workflow control plane using explicit state machines, supervised workers, fast lookup, and replicated metadata.

3.2 Functional Requirements

  1. Implement the primary runtime behavior: workflow state machines with explicit transitions and timeouts, running as supervised workers.
  2. Expose observability signals required to validate correctness.
  3. Include one controlled fault scenario and recovery flow.

3.3 Non-Functional Requirements

  • Performance: maintain target p99 latency under defined load.
  • Reliability: recover from injected fault within recovery budget.
  • Operability: produce clear dashboards/log traces for diagnosis.

3.4 Example Usage / Output

deterministic_outcome:

  • workflow transitions observable and deterministic
  • supervised recovery restores active state
  • lookup and metadata replication meet latency target
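For the metadata outcome above, a minimal sketch of a Mnesia-backed store could look like the following. It assumes a single local node with a ram_copies table, so it shows the transaction API and not actual replication. The MetadataStore module, the :workflow_meta table, and its attributes are hypothetical names.

```elixir
defmodule MetadataStore do
  # Hypothetical table holding one record per workflow: {id, state, version}.
  @table :workflow_meta

  def setup do
    :ok = :mnesia.start()

    # ram_copies on the local node only. In a cluster, the replica nodes
    # would be listed here as well (an assumption, not shown in this sketch).
    case :mnesia.create_table(@table, attributes: [:id, :state, :version], ram_copies: [node()]) do
      {:atomic, :ok} -> :ok
      {:aborted, {:already_exists, @table}} -> :ok
    end

    :mnesia.wait_for_tables([@table], 5_000)
  end

  def put(id, state, version) do
    :mnesia.transaction(fn -> :mnesia.write({@table, id, state, version}) end)
  end

  def get(id) do
    case :mnesia.transaction(fn -> :mnesia.read(@table, id) end) do
      {:atomic, [{@table, ^id, state, version}]} -> {:ok, state, version}
      {:atomic, []} -> :not_found
    end
  end
end
```

Reads go through a transaction here for simplicity. A hot path would more likely read from ETS and treat Mnesia as the replicated source of truth, which is the split the core concepts describe.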

3.5 Edge Cases

  • Burst traffic and queue growth
  • Partial dependency failure
  • Restart during active workload

4. Solution Architecture

4.1 High-Level Design

Ingress -> Domain Process -> Storage/Cache -> Async Pipeline -> Observability -> Operator Actions

4.2 Key Components

Component            Responsibility                       Key Decision
Ingress Handler      Admits work and validates contracts  Fast-fail invalid input
Domain Workers       Execute core state transitions       Supervised and isolated
Observability Layer  Emits metrics/traces/logs            Low-cardinality schema
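The "supervised and isolated" decision for domain workers can be sketched with a Registry and a DynamicSupervisor. The names ControlPlane, ControlPlane.Registry, ControlPlane.WorkerSup, and ControlPlane.Worker are assumptions for illustration, and a plain GenServer stands in for the real state machine to keep the example short.

```elixir
defmodule ControlPlane.Worker do
  # Stand-in worker: one process per workflow id, found by name via Registry.
  use GenServer

  def start_link(workflow_id) do
    GenServer.start_link(__MODULE__, workflow_id, name: via(workflow_id))
  end

  def via(workflow_id), do: {:via, Registry, {ControlPlane.Registry, workflow_id}}

  def state(workflow_id), do: GenServer.call(via(workflow_id), :state)

  @impl true
  def init(workflow_id), do: {:ok, %{id: workflow_id, state: :pending}}

  @impl true
  def handle_call(:state, _from, data), do: {:reply, data.state, data}
end

defmodule ControlPlane do
  def start_tree do
    children = [
      {Registry, keys: :unique, name: ControlPlane.Registry},
      {DynamicSupervisor, name: ControlPlane.WorkerSup, strategy: :one_for_one}
    ]

    # :rest_for_one: if the Registry dies, the worker supervisor restarts too,
    # so no worker keeps running without a registered name.
    Supervisor.start_link(children, strategy: :rest_for_one)
  end

  def start_workflow(workflow_id) do
    DynamicSupervisor.start_child(ControlPlane.WorkerSup, {ControlPlane.Worker, workflow_id})
  end
end
```

Because the registry keys are unique, starting the same workflow id twice returns {:error, {:already_started, pid}} instead of creating a second worker.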

4.3 Data Structures (No Full Code)

  • Command envelope: idempotency key, tenant, correlation id
  • State snapshot: versioned domain state
  • Metric tags: route, operation, outcome class
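As a shape-only sketch (not a full implementation), the command envelope could be a struct, with an ETS table used to reject repeated idempotency keys. The Command and IdempotencyTable modules, the :seen_commands table name, and the payload field are hypothetical.

```elixir
defmodule Command do
  # Envelope fields taken from the list above; payload is an added assumption.
  @enforce_keys [:idempotency_key, :tenant, :correlation_id]
  defstruct [:idempotency_key, :tenant, :correlation_id, payload: %{}]

  def new(tenant, idempotency_key, correlation_id, payload \\ %{}) do
    %__MODULE__{
      tenant: tenant,
      idempotency_key: idempotency_key,
      correlation_id: correlation_id,
      payload: payload
    }
  end
end

defmodule IdempotencyTable do
  @table :seen_commands

  # The calling process owns the table; in a real system a long-lived
  # supervised process would own it instead.
  def init do
    :ets.new(@table, [:set, :public, :named_table, read_concurrency: true])
  end

  # Returns :ok the first time a {tenant, key} pair is seen, :duplicate afterwards.
  def admit(%Command{tenant: tenant, idempotency_key: key}) do
    if :ets.insert_new(@table, {{tenant, key}, System.monotonic_time()}) do
      :ok
    else
      :duplicate
    end
  end
end
```

Keying on {tenant, key} rather than the key alone keeps tenants from colliding on the same idempotency key.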

4.4 Algorithm Overview

  1. Validate command and current state.
  2. Apply transition or reject with explicit reason.
  3. Persist and publish state changes.
  4. Emit telemetry and evaluate SLO impact.
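Steps 1 and 2 of the algorithm can be sketched as a pure function over a versioned snapshot. The Transitions module and its transition table are illustrative assumptions. Persistence and telemetry (steps 3 and 4) are deliberately left out.

```elixir
defmodule Transitions do
  # Hypothetical transition table: {current_state, command} => next_state.
  @table %{
    {:pending, :start} => :running,
    {:running, :complete} => :done,
    {:running, :fail} => :failed,
    {:failed, :retry} => :pending
  }

  # Applies a command to a snapshot, or rejects it with an explicit reason.
  # The version is bumped only when a transition actually happens.
  def apply_command(%{state: state, version: version} = snapshot, command) do
    case Map.fetch(@table, {state, command}) do
      {:ok, next} -> {:ok, %{snapshot | state: next, version: version + 1}}
      :error -> {:error, {:invalid_transition, state, command}}
    end
  end
end
```

Keeping this step free of side effects means the core transitions can be unit tested without starting any processes.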

5. Implementation Guide

5.1 Development Environment Setup

  • Install dependencies (mix deps.get).
  • Run database setup if used.
  • Run the baseline test suite (mix test).

5.2 Project Structure

  • lib/my_app//runtime.ex
  • lib/my_app//supervisor.ex
  • lib/my_app//telemetry.ex
  • priv/labs/runbook.exs

5.3 The Core Question You Are Answering

Which OTP primitive should own each workflow concern so failures recover predictably?

5.4 Concepts You Must Understand First

  • BEAM process model and failure isolation
  • OTP supervision semantics
  • Runtime observability discipline

5.5 Questions to Guide Your Design

  1. What is your safety invariant?
  2. What is your bounded recovery target?
  3. What is your fallback behavior under stress?

5.6 Milestones

  1. Baseline behavior validated.
  2. Instrumentation and dashboards available.
  3. Fault injection and recovery validated.
  4. Final report with tradeoffs and next steps.

6. Validation and Testing

6.1 Test Strategy

  • Unit tests for core transitions
  • Integration tests for runtime flows
  • Fault scenario tests for recovery
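A fault scenario test can be as small as killing a supervised worker and checking that a replacement appears within a time budget. The sketch below assumes a worker registered under its module name. FaultLab, FaultLab.Worker, and the 1,000 ms default budget are hypothetical.

```elixir
defmodule FaultLab.Worker do
  use GenServer

  def start_link(_arg), do: GenServer.start_link(__MODULE__, :ok, name: __MODULE__)

  @impl true
  def init(:ok), do: {:ok, %{}}
end

defmodule FaultLab do
  # Kills the named process, then polls until the supervisor has
  # registered a new pid under the same name or the budget runs out.
  def kill_and_await_restart(name, timeout_ms \\ 1_000) do
    old = Process.whereis(name)
    ref = Process.monitor(old)
    Process.exit(old, :kill)

    receive do
      {:DOWN, ^ref, :process, ^old, _reason} -> :ok
    after
      timeout_ms -> exit(:worker_did_not_die)
    end

    deadline = System.monotonic_time(:millisecond) + timeout_ms
    await_new_pid(name, old, deadline)
  end

  defp await_new_pid(name, old, deadline) do
    case Process.whereis(name) do
      pid when is_pid(pid) and pid != old ->
        {:ok, pid}

      _ ->
        if System.monotonic_time(:millisecond) > deadline do
          {:error, :not_restarted}
        else
          Process.sleep(5)
          await_new_pid(name, old, deadline)
        end
    end
  end
end
```

This only proves that a process came back, not that its state was restored. Checking restored state needs the worker to reload from a store, which is outside this sketch.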

6.2 Verification Steps

  • Run the full test suite (mix test).
  • Run the deterministic lab script (for example, priv/labs/runbook.exs from the project structure).

6.3 Definition of Done

  • Core functionality works on reference scenarios.
  • Failure path behavior matches design policy.
  • SLO metrics are captured and reproducible.
  • Findings are documented with evidence.

7. Production Hardening Checklist

  • Alert thresholds and runbook entries exist.
  • Rollback/degrade strategy documented.
  • Capacity assumptions verified by test.
  • Security and tenant boundaries reviewed.

8. Interview Deep Dive

  1. What tradeoff did you choose and why?
  2. How do you prove correctness under failure?
  3. Which metric is your leading indicator of incident risk?

9. Extensions

  • Add stricter invariants and adversarial load profiles.
  • Add multi-region network-latency simulation.
  • Add automated CI gates for regression thresholds.

10. Resources

  • https://hexdocs.pm/phoenix/overview.html
  • https://hexdocs.pm/telemetry/
  • https://www.erlang.org/doc/