# Project 19: Release Engineering Lab (Hot Upgrades, Runtime Config, Mix vs Distillery Internals)
Build a release workflow with config preflight, canary deployment, hot-upgrade rehearsal, and automated rollback drill.
## Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Expert |
| Time Estimate | 2 weeks |
| Main Programming Language | Elixir |
| Alternative Programming Languages | Erlang, Shell/JS/Rust (as needed) |
| Coolness Level | Level 4-5 |
| Business Potential | Resume Gold to Open Core |
| Prerequisites | Projects 15-18 and deploy fundamentals |
| Key Topics | Mix releases, runtime config, hot upgrade strategy, rollback |
## 1. Learning Objectives
By completing this project, you will:
- Build a production-like subsystem with explicit failure semantics.
- Define measurable success criteria using metrics, traces, and logs.
- Validate behavior under stress and fault scenarios.
- Document architecture and operational tradeoffs clearly.
## 2. All Theory Needed (Per-Concept Breakdown)

### Core Concepts
- Release preflight: config and schema validation
- State migration: version-aware live process conversion
- Rollback discipline: rehearsed safety path
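The first concept, config preflight, can be sketched as a boot-time check. A minimal sketch, assuming a hypothetical `MyApp.ConfigPreflight` module and an illustrative list of required environment variables; adapt both to your application:

```elixir
defmodule MyApp.ConfigPreflight do
  # Illustrative required keys; replace with your release's real settings.
  @required_env ~w(DATABASE_URL SECRET_KEY_BASE RELEASE_COOKIE)

  # Validates required environment variables before the app serves traffic.
  # Raising here makes a misconfigured release fail at boot, not under load.
  def validate! do
    missing = Enum.filter(@required_env, &(System.get_env(&1) in [nil, ""]))

    if missing != [] do
      raise "config preflight failed; missing env vars: " <> Enum.join(missing, ", ")
    end

    :ok
  end
end
```

Calling `MyApp.ConfigPreflight.validate!()` from `config/runtime.exs` or the application's `start/2` ensures a bad release never reaches the canary stage.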
### Mental Model Diagram

```
Requirements -> Architecture -> Implementation -> Instrumentation -> Fault Injection -> Validation
```
### How It Works (Step-by-Step)
- Define invariant and SLO targets.
- Build smallest working path for core behavior.
- Add instrumentation for correctness and latency.
- Introduce realistic load and failure scenarios.
- Compare observed behavior against stated invariants.
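The instrumentation step above can be sketched as a wrapper that records duration and outcome for every operation. This is a sketch with assumed module names (`MyApp.Metrics`, `MyApp.Instrumentation`); the collector process stands in for `:telemetry.execute/3`, which you would use in a real release:

```elixir
defmodule MyApp.Metrics do
  # Simple in-memory event collector; production code would emit telemetry.
  def start_link, do: Agent.start_link(fn -> [] end, name: __MODULE__)
  def record(event), do: Agent.update(__MODULE__, &[event | &1])
  def events, do: Agent.get(__MODULE__, &Enum.reverse/1)
end

defmodule MyApp.Instrumentation do
  # Wraps a unit of work, recording duration (native time units) and
  # outcome class, then re-raising so failure semantics are unchanged.
  def measure(operation, fun) do
    start = System.monotonic_time()

    try do
      result = fun.()
      record(operation, start, :ok)
      result
    rescue
      e ->
        record(operation, start, :error)
        reraise e, __STACKTRACE__
    end
  end

  defp record(operation, start, outcome) do
    duration = System.monotonic_time() - start
    MyApp.Metrics.record(%{operation: operation, duration: duration, outcome: outcome})
  end
end
```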
### Minimal Concrete Example

```yaml
project_status:
  phase: verify
  slo_latency_p99_ms: within_budget
  invariant_violations: 0
```
### Common Misconceptions
- Happy-path success means production readiness.
- Average latency is enough to judge reliability.
### Check-Your-Understanding Questions
- Which invariant is most important for this project?
- What failure mode can violate it?
- What metric proves the fix worked?
### Check-Your-Understanding Answers
- The invariant tied to safety/data integrity.
- The failure mode that bypasses backpressure/retry/consistency policy.
- A stable metric delta across repeatable scenarios.
### Real-World Applications
- Multi-tenant SaaS backends
- Real-time collaboration systems
- Distributed workflow services
### Key Insight
Model the failure path first, then optimize the happy path.
## 3. Project Specification

### 3.1 What You Will Build
Build a release workflow with config preflight, canary deployment, hot-upgrade rehearsal, and automated rollback drill.
### 3.2 Functional Requirements

- Implement the primary runtime behavior: config preflight, canary deployment, hot-upgrade rehearsal, and automated rollback.
- Expose the observability signals required to validate correctness.
- Include one controlled fault scenario and its recovery flow.
### 3.3 Non-Functional Requirements
- Performance: maintain target p99 latency under defined load.
- Reliability: recover from injected fault within recovery budget.
- Operability: produce clear dashboards/log traces for diagnosis.
### 3.4 Example Usage / Output

```yaml
deterministic_outcome:
  - canary upgrade completed with health gates
  - rollback rehearsal passes within target window
  - config errors fail before serving traffic
```
### 3.5 Edge Cases
- Burst traffic and queue growth
- Partial dependency failure
- Restart during active workload
## 4. Solution Architecture

### 4.1 High-Level Design

```
Ingress -> Domain Process -> Storage/Cache -> Async Pipeline -> Observability -> Operator Actions
```
### 4.2 Key Components
| Component | Responsibility | Key Decision |
|---|---|---|
| Ingress Handler | Admits work and validates contracts | Fast-fail invalid input |
| Domain Workers | Execute core state transitions | Supervised and isolated |
| Observability Layer | Emits metrics/traces/logs | Low-cardinality schema |
### 4.3 Data Structures (No Full Code)
- Command envelope: idempotency key, tenant, correlation id
- State snapshot: versioned domain state
- Metric tags: route, operation, outcome class
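The envelope and snapshot above can be sketched as structs. Field names beyond those listed in the text (such as `:payload`) are assumptions:

```elixir
defmodule MyApp.Command do
  # Command envelope: identity fields are mandatory so every unit of work
  # is idempotent, tenant-scoped, and traceable.
  @enforce_keys [:idempotency_key, :tenant, :correlation_id]
  defstruct [:idempotency_key, :tenant, :correlation_id, :payload]
end

defmodule MyApp.Snapshot do
  # Versioned domain state: the :version tag is what lets a hot upgrade's
  # code_change/3 convert live state between release versions.
  @enforce_keys [:version, :state]
  defstruct [:version, :state]
end
```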
### 4.4 Algorithm Overview
- Validate command and current state.
- Apply transition or reject with explicit reason.
- Persist and publish state changes.
- Emit telemetry and evaluate SLO impact.
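The first two steps above (validate, then apply or reject with an explicit reason) can be sketched as a pure transition function. The `MyApp.Transition` module, the balance domain, and the rejection reasons are all illustrative; persistence and telemetry would wrap this call:

```elixir
defmodule MyApp.Transition do
  # Applies a withdrawal-style command to versionable state, or rejects it
  # with an explicit reason instead of crashing or silently dropping it.
  def handle(%{amount: amount}, %{balance: balance} = state) when amount > 0 do
    if amount <= balance do
      {:ok, %{state | balance: balance - amount}}
    else
      {:rejected, :insufficient_balance}
    end
  end

  # Anything that fails contract validation is rejected, never applied.
  def handle(_command, _state), do: {:rejected, :invalid_command}
end
```

Keeping the transition pure makes it trivial to unit-test and to replay during upgrade rehearsals.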
## 5. Implementation Guide

### 5.1 Development Environment Setup
- Install dependencies (`mix deps.get`).
- Set up the database, if one is used.
- Run the test suite (`mix test`) to establish a baseline.
### 5.2 Project Structure

- lib/my_app/runtime.ex
- lib/my_app/supervisor.ex
- lib/my_app/telemetry.ex
- priv/labs/runbook.exs
### 5.3 The Core Question You Are Answering
Can I safely upgrade and rollback under active traffic without violating correctness?
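The hot-upgrade half of that question hinges on `code_change/3`: during a release upgrade, OTP suspends each process and asks it to convert its state in place. A minimal sketch, assuming a hypothetical counter server whose v1 state was a bare integer and whose v2 state is a map:

```elixir
defmodule MyApp.Counter do
  use GenServer

  def init(_), do: {:ok, %{count: 0}}

  def handle_call(:increment, _from, state) do
    {:reply, :ok, %{state | count: state.count + 1}}
  end

  # Upgrade path: v1 stored a bare integer, v2 wraps it in a map.
  def code_change(_old_vsn, count, _extra) when is_integer(count) do
    {:ok, %{count: count}}
  end

  # Downgrade path: OTP passes {:down, vsn}, so we unwrap back to v1 shape.
  def code_change({:down, _vsn}, %{count: count}, _extra) do
    {:ok, count}
  end

  def code_change(_old_vsn, state, _extra), do: {:ok, state}
end
```

Rehearsing both directions is what makes the rollback drill trustworthy: a downgrade that cannot convert state is not a rollback, it is a restart.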
### 5.4 Concepts You Must Understand First
- BEAM process model and failure isolation
- OTP supervision semantics
- Runtime observability discipline
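For the supervision-semantics prerequisite, a sketch of a `:one_for_one` tree where a crashing worker restarts without disturbing its siblings. The module name mirrors the project structure above; the `Agent` child stands in for a real domain worker:

```elixir
defmodule MyApp.Supervisor do
  use Supervisor

  def start_link(opts \\ []) do
    Supervisor.start_link(__MODULE__, opts, name: __MODULE__)
  end

  @impl true
  def init(_opts) do
    children = [
      # Placeholder worker; each child restarts independently under
      # :one_for_one, and max_restarts bounds crash loops.
      {Agent, fn -> %{} end}
    ]

    Supervisor.init(children, strategy: :one_for_one, max_restarts: 3, max_seconds: 5)
  end
end
```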
### 5.5 Questions to Guide Your Design
- What is your safety invariant?
- What is your bounded recovery target?
- What is your fallback behavior under stress?
### 5.6 Milestones
- Baseline behavior validated.
- Instrumentation and dashboards available.
- Fault injection and recovery validated.
- Final report with tradeoffs and next steps.
## 6. Validation and Testing

### 6.1 Test Strategy
- Unit tests for core transitions
- Integration tests for runtime flows
- Fault scenario tests for recovery
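A fault-scenario test from the last bullet can be sketched as a deterministic script: kill a supervised worker, then assert the supervisor brings it back. The named `Agent` stands in for a domain worker; the name `:drill_worker` is illustrative:

```elixir
children = [
  %{id: :worker, start: {Agent, :start_link, [fn -> 0 end, [name: :drill_worker]]}}
]

{:ok, _sup} = Supervisor.start_link(children, strategy: :one_for_one)

old_pid = Process.whereis(:drill_worker)
true = is_pid(old_pid)

# Inject the fault: a brutal kill bypasses terminate callbacks.
Process.exit(old_pid, :kill)

# Recovery check: poll until a fresh pid registers under the same name.
new_pid =
  Enum.find_value(1..100, fn _ ->
    Process.sleep(10)
    pid = Process.whereis(:drill_worker)
    if is_pid(pid) and pid != old_pid, do: pid
  end)

true = is_pid(new_pid)
0 = Agent.get(new_pid, & &1)
```

The same shape scales up to the full rollback drill: inject, observe, and assert recovery within the stated budget.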
### 6.2 Verification Steps

- Run the full test suite.
- Run the deterministic lab script (priv/labs/runbook.exs).
### 6.3 Definition of Done
- Core functionality works on reference scenarios.
- Failure path behavior matches design policy.
- SLO metrics are captured and reproducible.
- Findings are documented with evidence.
## 7. Production Hardening Checklist
- Alert thresholds and runbook entries exist.
- Rollback/degrade strategy documented.
- Capacity assumptions verified by test.
- Security and tenant boundaries reviewed.
## 8. Interview Deep Dive
- What tradeoff did you choose and why?
- How do you prove correctness under failure?
- Which metric is your leading indicator of incident risk?
## 9. Extensions
- Add stricter invariants and adversarial load profiles.
- Add multi-region network-latency simulation.
- Add automated CI gates for regression thresholds.
## 10. Resources
- https://hexdocs.pm/phoenix/overview.html
- https://hexdocs.pm/telemetry/
- https://www.erlang.org/doc/