# Project 19: Release Engineering Lab (Hot Upgrades, Runtime Config, Mix vs Distillery Internals)
Build a release workflow with config preflight, canary deployment, hot-upgrade rehearsal, and automated rollback drill.
## Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Expert |
| Time Estimate | 2 weeks |
| Main Programming Language | Elixir |
| Alternative Programming Languages | Erlang, Shell/JS/Rust (as needed) |
| Coolness Level | Level 4-5 |
| Business Potential | Resume Gold to Open Core |
| Prerequisites | Projects 15-18 and deploy fundamentals |
| Key Topics | Mix releases, runtime config, hot upgrade strategy, rollback |
## 1. Learning Objectives
By completing this project, you will:
- Build a production-like subsystem with explicit failure semantics.
- Define measurable success criteria using metrics, traces, and logs.
- Validate behavior under stress and fault scenarios.
- Document architecture and operational tradeoffs clearly.
## 2. All Theory Needed (Per-Concept Breakdown)

### Core Concepts
- Release preflight: config and schema validation
- State migration: version-aware live process conversion
- Rollback discipline: rehearsed safety path
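The first concept, config preflight, can be sketched as a boot-time check. A minimal sketch, assuming a hypothetical `MyApp.ConfigPreflight` module and an illustrative list of required environment variables; adapt both to your application:

```elixir
defmodule MyApp.ConfigPreflight do
  # Illustrative required keys; replace with your release's real settings.
  @required_env ~w(DATABASE_URL SECRET_KEY_BASE RELEASE_COOKIE)

  # Validates required environment variables before the app serves traffic.
  # Raising here makes a misconfigured release fail at boot, not under load.
  def validate! do
    missing = Enum.filter(@required_env, &(System.get_env(&1) in [nil, ""]))

    if missing != [] do
      raise "config preflight failed; missing env vars: " <> Enum.join(missing, ", ")
    end

    :ok
  end
end
```

Calling `MyApp.ConfigPreflight.validate!()` from `config/runtime.exs` or the application's `start/2` ensures a bad release never reaches the canary stage.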
### Mental Model Diagram

```
Requirements -> Architecture -> Implementation -> Instrumentation -> Fault Injection -> Validation
```
### How It Works (Step-by-Step)
- Define invariant and SLO targets.
- Build smallest working path for core behavior.
- Add instrumentation for correctness and latency.
- Introduce realistic load and failure scenarios.
- Compare observed behavior against stated invariants.
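The instrumentation step above can be sketched as a wrapper that records duration and outcome for every operation. This is a sketch with assumed module names (`MyApp.Metrics`, `MyApp.Instrumentation`); the collector process stands in for `:telemetry.execute/3`, which you would use in a real release:

```elixir
defmodule MyApp.Metrics do
  # Simple in-memory event collector; production code would emit telemetry.
  def start_link, do: Agent.start_link(fn -> [] end, name: __MODULE__)
  def record(event), do: Agent.update(__MODULE__, &[event | &1])
  def events, do: Agent.get(__MODULE__, &Enum.reverse/1)
end

defmodule MyApp.Instrumentation do
  # Wraps a unit of work, recording duration (native time units) and
  # outcome class, then re-raising so failure semantics are unchanged.
  def measure(operation, fun) do
    start = System.monotonic_time()

    try do
      result = fun.()
      record(operation, start, :ok)
      result
    rescue
      e ->
        record(operation, start, :error)
        reraise e, __STACKTRACE__
    end
  end

  defp record(operation, start, outcome) do
    duration = System.monotonic_time() - start
    MyApp.Metrics.record(%{operation: operation, duration: duration, outcome: outcome})
  end
end
```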
### Minimal Concrete Example

```yaml
project_status:
  phase: verify
  slo_latency_p99_ms: within_budget
  invariant_violations: 0
```
### Common Misconceptions
- Happy-path success means production readiness.
- Average latency is enough to judge reliability.
### Check-Your-Understanding Questions
- Which invariant is most important for this project?
- What failure mode can violate it?
- What metric proves the fix worked?
### Check-Your-Understanding Answers
- The invariant tied to safety/data integrity.
- The failure mode that bypasses backpressure/retry/consistency policy.
- A stable metric delta across repeatable scenarios.
### Real-World Applications
- Multi-tenant SaaS backends
- Real-time collaboration systems
- Distributed workflow services
### Key Insight
Model the failure path first, then optimize the happy path.
## 3. Project Specification

### 3.1 What You Will Build
Build a release workflow with config preflight, canary deployment, hot-upgrade rehearsal, and automated rollback drill.
### 3.2 Functional Requirements

- Implement the primary runtime behavior: config preflight, canary deployment, hot-upgrade rehearsal, and automated rollback.
- Expose the observability signals required to validate correctness.
- Include one controlled fault scenario and its recovery flow.
### 3.3 Non-Functional Requirements
- Performance: maintain target p99 latency under defined load.
- Reliability: recover from injected fault within recovery budget.
- Operability: produce clear dashboards/log traces for diagnosis.
### 3.4 Example Usage / Output

```yaml
deterministic_outcome:
  - canary upgrade completed with health gates
  - rollback rehearsal passes within target window
  - config errors fail before serving traffic
```
### 3.5 Edge Cases
- Burst traffic and queue growth
- Partial dependency failure
- Restart during active workload
## 4. Solution Architecture

### 4.1 High-Level Design

```
Ingress -> Domain Process -> Storage/Cache -> Async Pipeline -> Observability -> Operator Actions
```
### 4.2 Key Components
| Component | Responsibility | Key Decision |
|---|---|---|
| Ingress Handler | Admits work and validates contracts | Fast-fail invalid input |
| Domain Workers | Execute core state transitions | Supervised and isolated |
| Observability Layer | Emits metrics/traces/logs | Low-cardinality schema |
### 4.3 Data Structures (No Full Code)
- Command envelope: idempotency key, tenant, correlation id
- State snapshot: versioned domain state
- Metric tags: route, operation, outcome class
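The envelope and snapshot above can be sketched as structs. Field names beyond those listed in the text (such as `:payload`) are assumptions:

```elixir
defmodule MyApp.Command do
  # Command envelope: identity fields are mandatory so every unit of work
  # is idempotent, tenant-scoped, and traceable.
  @enforce_keys [:idempotency_key, :tenant, :correlation_id]
  defstruct [:idempotency_key, :tenant, :correlation_id, :payload]
end

defmodule MyApp.Snapshot do
  # Versioned domain state: the :version tag is what lets a hot upgrade's
  # code_change/3 convert live state between release versions.
  @enforce_keys [:version, :state]
  defstruct [:version, :state]
end
```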
### 4.4 Algorithm Overview
- Validate command and current state.
- Apply transition or reject with explicit reason.
- Persist and publish state changes.
- Emit telemetry and evaluate SLO impact.
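The first two steps above (validate, then apply or reject with an explicit reason) can be sketched as a pure transition function. The `MyApp.Transition` module, the balance domain, and the rejection reasons are all illustrative; persistence and telemetry would wrap this call:

```elixir
defmodule MyApp.Transition do
  # Applies a withdrawal-style command to versionable state, or rejects it
  # with an explicit reason instead of crashing or silently dropping it.
  def handle(%{amount: amount}, %{balance: balance} = state) when amount > 0 do
    if amount <= balance do
      {:ok, %{state | balance: balance - amount}}
    else
      {:rejected, :insufficient_balance}
    end
  end

  # Anything that fails contract validation is rejected, never applied.
  def handle(_command, _state), do: {:rejected, :invalid_command}
end
```

Keeping the transition pure makes it trivial to unit-test and to replay during upgrade rehearsals.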
## 5. Implementation Guide

### 5.1 Development Environment Setup
- Install dependencies (`mix deps.get`).
- Set up the database, if one is used.
- Run the test suite (`mix test`) to establish a baseline.
### 5.2 Project Structure

- lib/my_app/runtime.ex
- lib/my_app/supervisor.ex
- lib/my_app/telemetry.ex
- priv/labs/runbook.exs
### 5.3 The Core Question You Are Answering
Can I safely upgrade and rollback under active traffic without violating correctness?
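The hot-upgrade half of that question hinges on `code_change/3`: during a release upgrade, OTP suspends each process and asks it to convert its state in place. A minimal sketch, assuming a hypothetical counter server whose v1 state was a bare integer and whose v2 state is a map:

```elixir
defmodule MyApp.Counter do
  use GenServer

  def init(_), do: {:ok, %{count: 0}}

  def handle_call(:increment, _from, state) do
    {:reply, :ok, %{state | count: state.count + 1}}
  end

  # Upgrade path: v1 stored a bare integer, v2 wraps it in a map.
  def code_change(_old_vsn, count, _extra) when is_integer(count) do
    {:ok, %{count: count}}
  end

  # Downgrade path: OTP passes {:down, vsn}, so we unwrap back to v1 shape.
  def code_change({:down, _vsn}, %{count: count}, _extra) do
    {:ok, count}
  end

  def code_change(_old_vsn, state, _extra), do: {:ok, state}
end
```

Rehearsing both directions is what makes the rollback drill trustworthy: a downgrade that cannot convert state is not a rollback, it is a restart.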
### 5.4 Concepts You Must Understand First
- BEAM process model and failure isolation
- OTP supervision semantics
- Runtime observability discipline
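For the supervision-semantics prerequisite, a sketch of a `:one_for_one` tree where a crashing worker restarts without disturbing its siblings. The module name mirrors the project structure above; the `Agent` child stands in for a real domain worker:

```elixir
defmodule MyApp.Supervisor do
  use Supervisor

  def start_link(opts \\ []) do
    Supervisor.start_link(__MODULE__, opts, name: __MODULE__)
  end

  @impl true
  def init(_opts) do
    children = [
      # Placeholder worker; each child restarts independently under
      # :one_for_one, and max_restarts bounds crash loops.
      {Agent, fn -> %{} end}
    ]

    Supervisor.init(children, strategy: :one_for_one, max_restarts: 3, max_seconds: 5)
  end
end
```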
### 5.5 Questions to Guide Your Design
- What is your safety invariant?
- What is your bounded recovery target?
- What is your fallback behavior under stress?
### 5.6 Milestones
- Baseline behavior validated.
- Instrumentation and dashboards available.
- Fault injection and recovery validated.
- Final report with tradeoffs and next steps.
## 6. Validation and Testing

### 6.1 Test Strategy
- Unit tests for core transitions
- Integration tests for runtime flows
- Fault scenario tests for recovery
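A fault-scenario test from the last bullet can be sketched as a deterministic script: kill a supervised worker, then assert the supervisor brings it back. The named `Agent` stands in for a domain worker; the name `:drill_worker` is illustrative:

```elixir
children = [
  %{id: :worker, start: {Agent, :start_link, [fn -> 0 end, [name: :drill_worker]]}}
]

{:ok, _sup} = Supervisor.start_link(children, strategy: :one_for_one)

old_pid = Process.whereis(:drill_worker)
true = is_pid(old_pid)

# Inject the fault: a brutal kill bypasses terminate callbacks.
Process.exit(old_pid, :kill)

# Recovery check: poll until a fresh pid registers under the same name.
new_pid =
  Enum.find_value(1..100, fn _ ->
    Process.sleep(10)
    pid = Process.whereis(:drill_worker)
    if is_pid(pid) and pid != old_pid, do: pid
  end)

true = is_pid(new_pid)
0 = Agent.get(new_pid, & &1)
```

The same shape scales up to the full rollback drill: inject, observe, and assert recovery within the stated budget.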
### 6.2 Verification Steps

- Run the full test suite.
- Run the deterministic lab script (priv/labs/runbook.exs).
### 6.3 Definition of Done
- Core functionality works on reference scenarios.
- Failure path behavior matches design policy.
- SLO metrics are captured and reproducible.
- Findings are documented with evidence.
## 7. Production Hardening Checklist
- Alert thresholds and runbook entries exist.
- Rollback/degrade strategy documented.
- Capacity assumptions verified by test.
- Security and tenant boundaries reviewed.
## 8. Interview Deep Dive
- What tradeoff did you choose and why?
- How do you prove correctness under failure?
- Which metric is your leading indicator of incident risk?
## 9. Extensions
- Add stricter invariants and adversarial load profiles.
- Add multi-region network-latency simulation.
- Add automated CI gates for regression thresholds.
## 10. Resources
- https://hexdocs.pm/phoenix/overview.html
- https://hexdocs.pm/telemetry/
- https://www.erlang.org/doc/