Project 16: OTP Control Plane with gen_statem, Dynamic Supervision, Registry, ETS, and Mnesia

Build a workflow control plane using explicit state machines, supervised workers, fast lookup, and replicated metadata.

Quick Reference

Attribute	Value
Difficulty	Expert
Time Estimate	2-3 weeks
Main Programming Language	Elixir
Alternative Programming Languages	Erlang, Shell/JS/Rust (as needed)
Coolness Level	Level 4-5
Business Potential	Resume Gold to Open Core
Prerequisites	Project 15 and OTP design principles
Key Topics	gen_statem, supervision strategy, registry, ETS, Mnesia

1. Learning Objectives

By completing this project, you will:

Build a production-like subsystem with explicit failure semantics.
Define measurable success criteria using metrics, traces, and logs.
Validate behavior under stress and fault scenarios.
Document architecture and operational tradeoffs clearly.

2. All Theory Needed (Per-Concept Breakdown)

Core Concepts

State machines: explicit transitions and timeout behavior
Supervision architecture: restart semantics by failure domain
Data locality: ETS hot path and Mnesia replicated metadata
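The first concept above, explicit transitions with timeout behavior, can be sketched with Erlang's :gen_statem called from Elixir. This is a minimal illustration, not the project's reference design: the module name WorkflowFSM, the states (:pending, :running, :done, :failed), the event names, and the run_timeout_ms option are all assumptions made for the example.

```elixir
defmodule WorkflowFSM do
  @moduledoc "Hypothetical workflow: :pending -> :running -> :done | :failed."
  @behaviour :gen_statem

  # Public API (illustrative names).
  def start_link(opts \\ []), do: :gen_statem.start_link(__MODULE__, opts, [])
  def start_run(pid), do: :gen_statem.call(pid, :start_run)
  def complete(pid), do: :gen_statem.call(pid, :complete)
  def current_state(pid), do: :gen_statem.call(pid, :current_state)

  @impl true
  def callback_mode, do: :handle_event_function

  @impl true
  def init(opts) do
    timeout_ms = Keyword.get(opts, :run_timeout_ms, 5_000)
    {:ok, :pending, %{run_timeout_ms: timeout_ms}}
  end

  @impl true
  def handle_event({:call, from}, :current_state, state, _data) do
    {:keep_state_and_data, [{:reply, from, state}]}
  end

  def handle_event({:call, from}, :start_run, :pending, data) do
    # Arm a state timeout: if we are still in :running when it fires,
    # the machine receives a :state_timeout event.
    actions = [{:reply, from, :ok}, {:state_timeout, data.run_timeout_ms, :run_deadline}]
    {:next_state, :running, data, actions}
  end

  def handle_event({:call, from}, :complete, :running, data) do
    # Leaving :running cancels the pending state timeout automatically.
    {:next_state, :done, data, [{:reply, from, :ok}]}
  end

  def handle_event(:state_timeout, :run_deadline, :running, data) do
    {:next_state, :failed, data}
  end

  # Any other call is rejected with an explicit reason instead of crashing.
  def handle_event({:call, from}, event, state, _data) do
    {:keep_state_and_data, [{:reply, from, {:error, {:invalid_transition, state, event}}}]}
  end
end
```

Because every legal transition is its own function clause, an illegal command falls through to the last clause and is rejected with a reason, which keeps the failure path as visible as the happy path.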

Mental Model Diagram

Requirements -> Architecture -> Implementation -> Instrumentation -> Fault Injection -> Validation

How It Works (Step-by-Step)

Define invariants and SLO targets.
Build the smallest working path for core behavior.
Add instrumentation for correctness and latency.
Introduce realistic load and failure scenarios.
Compare observed behavior against stated invariants.

Minimal Concrete Example

project_status:
  phase: verify
  slo_latency_p99_ms: within_budget
  invariant_violations: 0

Common Misconceptions

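Happy-path success means production readiness. In practice, readiness depends on behavior under crashes, timeouts, and overload.
Average latency is enough to judge reliability. Averages hide tail latency, which is why this project tracks p99.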
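To see why the average misleads, the sketch below compares the mean with a nearest-rank p99 on a made-up sample set. The LatencyStats module and the sample values are hypothetical and exist only for this illustration.

```elixir
defmodule LatencyStats do
  @moduledoc "Illustrative helpers; not part of any library."

  def mean(samples) when samples != [] do
    Enum.sum(samples) / length(samples)
  end

  # Nearest-rank percentile: the smallest sample such that at least
  # p percent of all samples are less than or equal to it.
  def percentile(samples, p) when samples != [] and is_integer(p) and p in 1..100 do
    sorted = Enum.sort(samples)
    # Integer ceiling of p * n / 100, avoiding float rounding.
    rank = div(p * length(sorted) + 99, 100)
    Enum.at(sorted, rank - 1)
  end
end

# 98 fast requests (10 ms) and two slow ones (900 ms and 1000 ms).
samples = List.duplicate(10, 98) ++ [900, 1_000]

IO.inspect(LatencyStats.mean(samples), label: "mean_ms")          # 28.8
IO.inspect(LatencyStats.percentile(samples, 50), label: "p50_ms") # 10
IO.inspect(LatencyStats.percentile(samples, 99), label: "p99_ms") # 900
```

The mean of 28.8 ms looks healthy, while the p99 of 900 ms shows that one request in a hundred is slow enough to matter.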

Check-Your-Understanding Questions

Which invariant is most important for this project?
What failure mode can violate it?
What metric proves the fix worked?

Check-Your-Understanding Answers

The invariant tied to safety/data integrity.
The failure mode that bypasses backpressure/retry/consistency policy.
A stable metric delta across repeatable scenarios.

Real-World Applications

Multi-tenant SaaS backends
Real-time collaboration systems
Distributed workflow services

Key Insight

Model the failure path first, then optimize the happy path.

3. Project Specification

3.1 What You Will Build

Build a workflow control plane using explicit state machines, supervised workers, fast lookup, and replicated metadata.

3.2 Functional Requirements

Implement the primary runtime behavior: workflow state machines with explicit transitions and timeouts, run as supervised workers.
Expose observability signals required to validate correctness.
Include one controlled fault scenario and recovery flow.

3.3 Non-Functional Requirements

Performance: maintain target p99 latency under defined load.
Reliability: recover from injected fault within recovery budget.
Operability: produce clear dashboards/log traces for diagnosis.

3.4 Example Usage / Output

deterministic_outcome:
  - workflow transitions observable and deterministic
  - supervised recovery restores active state
  - lookup and metadata replication meet latency target
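The metadata side of the last outcome can be sketched with Mnesia's transactional API. This is a single-node illustration under stated assumptions: the MetaStore module, the :workflow_meta table, and its attributes are hypothetical, and ram_copies lists only the local node. Real replication would list several connected nodes, which this example does not attempt. In a Mix project, :mnesia would also need to be listed under extra_applications.

```elixir
defmodule MetaStore do
  @moduledoc "Hypothetical versioned metadata store on a single-node Mnesia table."
  @table :workflow_meta

  def setup do
    :ok = :mnesia.start()

    case :mnesia.create_table(@table, attributes: [:id, :owner, :version], ram_copies: [node()]) do
      {:atomic, :ok} -> :ok
      {:aborted, {:already_exists, @table}} -> :ok
    end

    :ok = :mnesia.wait_for_tables([@table], 5_000)
  end

  # Compare-and-set: the write succeeds only if the caller saw the latest version.
  def put(id, owner, expected_version) do
    :mnesia.transaction(fn ->
      current =
        case :mnesia.read(@table, id, :write) do
          [] -> 0
          [{@table, ^id, _owner, version}] -> version
        end

      if current == expected_version do
        :ok = :mnesia.write({@table, id, owner, current + 1})
        current + 1
      else
        :mnesia.abort({:version_conflict, current})
      end
    end)
  end

  def get(id) do
    case :mnesia.transaction(fn -> :mnesia.read(@table, id) end) do
      {:atomic, [{@table, ^id, owner, version}]} -> {:ok, %{owner: owner, version: version}}
      {:atomic, []} -> :not_found
    end
  end
end
```

A stale writer gets {:aborted, {:version_conflict, current}} back instead of silently overwriting newer metadata, which makes the outcome of concurrent updates deterministic.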

3.5 Edge Cases

Burst traffic and queue growth
Partial dependency failure
Restart during active workload

4. Solution Architecture

4.1 High-Level Design

Ingress -> Domain Process -> Storage/Cache -> Async Pipeline -> Observability -> Operator Actions

4.2 Key Components

Component	Responsibility	Key Decision
Ingress Handler	Admits work and validates contracts	Fast-fail invalid input
Domain Workers	Execute core state transitions	Supervised and isolated
Observability Layer	Emits metrics/traces/logs	Low-cardinality schema
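The "supervised and isolated" decision for Domain Workers can be sketched with a Registry and a DynamicSupervisor. The module names under ControlPlane and the workflow id format are assumptions for this example, and the worker is a bare GenServer standing in for the real state machine.

```elixir
defmodule ControlPlane.Worker do
  # :transient restarts the worker after abnormal exits only.
  use GenServer, restart: :transient

  def start_link(workflow_id) do
    GenServer.start_link(__MODULE__, workflow_id, name: via(workflow_id))
  end

  # Name resolution goes through the Registry, so callers never hold raw pids.
  def via(workflow_id), do: {:via, Registry, {ControlPlane.Registry, workflow_id}}

  def whereis(workflow_id) do
    case Registry.lookup(ControlPlane.Registry, workflow_id) do
      [{pid, _value}] -> pid
      [] -> nil
    end
  end

  @impl true
  def init(workflow_id), do: {:ok, %{id: workflow_id}}
end

defmodule ControlPlane.Runtime do
  def start do
    children = [
      {Registry, keys: :unique, name: ControlPlane.Registry},
      {DynamicSupervisor, name: ControlPlane.WorkerSup, strategy: :one_for_one}
    ]

    # :rest_for_one: if the Registry dies, the workers that depend on it restart too.
    Supervisor.start_link(children, strategy: :rest_for_one)
  end

  def start_workflow(id) do
    DynamicSupervisor.start_child(ControlPlane.WorkerSup, {ControlPlane.Worker, id})
  end
end

{:ok, _sup} = ControlPlane.Runtime.start()
{:ok, pid} = ControlPlane.Runtime.start_workflow("wf-1")

# Inject a fault: kill the worker, then look it up again by id.
Process.exit(pid, :kill)
Process.sleep(100)
IO.inspect(ControlPlane.Worker.whereis("wf-1") != pid, label: "restarted_with_new_pid")
```

Note that the restarted worker comes back with fresh in-memory state. Restoring the active state it held before the crash is a separate concern for whatever storage the project chooses.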

4.3 Data Structures (No Full Code)

Command envelope: idempotency key, tenant, correlation id
State snapshot: versioned domain state
Metric tags: route, operation, outcome class
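One possible shape for these three structures is sketched below. The struct and field names beyond those listed above (name, payload, workflow_id, version) and the set of outcome classes are assumptions made for illustration.

```elixir
defmodule Command do
  # Envelope fields from the list above, plus a hypothetical command name and payload.
  @enforce_keys [:idempotency_key, :tenant, :correlation_id, :name]
  defstruct [:idempotency_key, :tenant, :correlation_id, :name, payload: %{}]
end

defmodule Snapshot do
  @enforce_keys [:workflow_id, :state]
  defstruct [:workflow_id, :state, version: 0]

  # Every state change bumps the version, so stale writers can be detected.
  def advance(%__MODULE__{} = snapshot, new_state) do
    %{snapshot | state: new_state, version: snapshot.version + 1}
  end
end

defmodule MetricTags do
  # A closed set of outcome classes keeps tag cardinality low.
  @outcomes [:applied, :rejected, :duplicate]

  def build(route, operation, outcome) when outcome in @outcomes do
    %{route: route, operation: operation, outcome: outcome}
  end
end

command =
  struct!(Command,
    idempotency_key: "k-1",
    tenant: "acme",
    correlation_id: "c-1",
    name: :start
  )

snapshot = struct!(Snapshot, workflow_id: "wf-1", state: :pending)
IO.inspect(Snapshot.advance(snapshot, :running).version, label: "version")
IO.inspect(MetricTags.build("/workflows", command.name, :applied), label: "tags")
```

Tenant and correlation id stay on the envelope rather than in the metric tags, which is one way to keep the telemetry schema low-cardinality.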

4.4 Algorithm Overview

Validate command and current state.
Apply transition or reject with explicit reason.
Persist and publish state changes.
Emit telemetry and evaluate SLO impact.
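The four steps above can be sketched as one function over ETS tables. Everything named here is an assumption for the example: the TransitionPipeline module, the table names, the transition map, and the emit callback, which stands in for a real telemetry call. Publishing to subscribers and SLO evaluation are left out. The read-then-write on ETS is not atomic, so the sketch assumes a single owning process per workflow.

```elixir
defmodule TransitionPipeline do
  @snapshots :wf_snapshots
  @seen_keys :wf_seen_keys

  # Hypothetical transition table: {current_state, command} => next_state.
  @transitions %{
    {:pending, :start} => :running,
    {:running, :complete} => :done,
    {:running, :fail} => :failed
  }

  def init_tables do
    :ets.new(@snapshots, [:named_table, :set, :public, read_concurrency: true])
    :ets.new(@seen_keys, [:named_table, :set, :public])
    :ok
  end

  def handle(%{workflow_id: id, idempotency_key: key, name: name}, emit \\ fn _tags -> :ok end) do
    # Step 1: validate the command against the current state.
    {state, version} = load(id)

    result =
      cond do
        :ets.member(@seen_keys, key) ->
          {:ok, :duplicate}

        next = Map.get(@transitions, {state, name}) ->
          # Steps 2 and 3: apply the transition and persist the new snapshot.
          :ets.insert(@snapshots, {id, next, version + 1})
          :ets.insert(@seen_keys, {key})
          {:ok, %{state: next, version: version + 1}}

        true ->
          # Step 2, rejection branch: explicit reason, no state change.
          {:error, {:invalid_transition, state, name}}
      end

    # Step 4: emit telemetry with a low-cardinality outcome class.
    emit.(%{operation: name, outcome: outcome_class(result)})
    result
  end

  defp load(id) do
    case :ets.lookup(@snapshots, id) do
      [{^id, state, version}] -> {state, version}
      [] -> {:pending, 0}
    end
  end

  defp outcome_class({:ok, :duplicate}), do: :duplicate
  defp outcome_class({:ok, _snapshot}), do: :applied
  defp outcome_class({:error, _reason}), do: :rejected
end
```

Replaying a command with the same idempotency key returns {:ok, :duplicate} without touching the snapshot, so retries after a timeout cannot advance the workflow twice.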

5. Implementation Guide

5.1 Development Environment Setup

Install dependencies (mix deps.get).
Run database setup if used.
Run the baseline test suite (mix test).

5.2 Project Structure

lib/my_app//runtime.ex
lib/my_app//supervisor.ex
lib/my_app//telemetry.ex
priv/labs/runbook.exs

5.3 The Core Question You Are Answering

Which OTP primitive should own each workflow concern so failures recover predictably?

5.4 Concepts You Must Understand First

BEAM process model and failure isolation
OTP supervision semantics
Runtime observability discipline

5.5 Questions to Guide Your Design

What is your safety invariant?
What is your bounded recovery target?
What is your fallback behavior under stress?

5.6 Milestones

Baseline behavior validated.
Instrumentation and dashboards available.
Fault injection and recovery validated.
Final report with tradeoffs and next steps.

6. Validation and Testing

6.1 Test Strategy

Unit tests for core transitions
Integration tests for runtime flows
Fault scenario tests for recovery

6.2 Verification Steps

Run the full test suite (mix test).
Run the deterministic lab script (priv/labs/runbook.exs).

6.3 Definition of Done

Core functionality works on reference scenarios.
Failure path behavior matches design policy.
SLO metrics are captured and reproducible.
Findings are documented with evidence.

7. Production Hardening Checklist

Alert thresholds and runbook entries exist.
Rollback/degrade strategy documented.
Capacity assumptions verified by test.
Security and tenant boundaries reviewed.

8. Interview Deep Dive

What tradeoff did you choose and why?
How do you prove correctness under failure?
Which metric is your leading indicator of incident risk?

9. Extensions

Add stricter invariants and adversarial load profiles.
Add multi-region network-latency simulation.
Add automated CI gates for regression thresholds.

10. Resources

https://hexdocs.pm/phoenix/overview.html
https://hexdocs.pm/telemetry/
https://www.erlang.org/doc/