Project 8: Production-Ready Rust Service

Build an async Rust service that is observable, configurable, and able to shut down gracefully under load.

Quick Reference

Attribute	Value
Difficulty	Level 3: Advanced
Time Estimate	1-2 weeks
Main Programming Language	Rust
Alternative Programming Languages	Go, Java
Coolness Level	Level 3: Genuinely Clever
Business Potential	4. The “Open Core” Infrastructure
Prerequisites	Projects 1-7
Key Topics	tracing/log, config management, graceful shutdown, telemetry

1. Learning Objectives

Implement structured logging and tracing for request lifecycle visibility.
Build validated config loading with explicit precedence rules.
Implement deterministic graceful shutdown.
Define and publish a minimal telemetry contract: the events, fields, and metrics operators can rely on in production.

2. Theoretical Foundation

2.1 Core Concepts

Structured observability: logs + spans + metrics + error events.
Config discipline: validate once at startup; no scattered env reads.
Graceful shutdown: stop intake, drain, flush, exit.
Operational contracts: define what “healthy” and “ready” mean.

2.2 Why This Matters

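Many deployment incidents are lifecycle and operability failures (bad config accepted at startup, traffic served before the service is ready, requests dropped on termination), not failures of code that does not compile. This project targets those failure classes.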

2.3 Common Misconceptions

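“println! is enough for logging” -> false for distributed diagnostics, where events must be correlated across requests and services.
“Graceful shutdown is optional” -> false for rolling deploy reliability.
“Defaults will cover a bad environment” -> dangerous in production: a default can silently hide a misspelled or invalid variable.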
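
To make the last misconception concrete, here is a minimal sketch of a setting resolver that applies a default only when the variable is absent and fails when it is present but invalid. The function name `resolve_port`, the `SettingError` type, and the variable name `OPS_PORT` in the usage note are hypothetical, chosen only for illustration.

```rust
/// Error for a setting that was provided but could not be used.
#[derive(Debug, PartialEq)]
pub enum SettingError {
    /// The variable was set, but its value did not parse as a port.
    Invalid { name: String, value: String },
}

/// Resolve a port from an optional raw value (for example, read from the environment).
/// `None` means "unset" and falls back to `default`; `Some` must parse or startup fails.
pub fn resolve_port(name: &str, raw: Option<&str>, default: u16) -> Result<u16, SettingError> {
    match raw {
        // Absent: the default is a deliberate choice, not a cover-up.
        None => Ok(default),
        // Present: never fall back silently, report the bad value instead.
        Some(value) => value.trim().parse::<u16>().map_err(|_| SettingError::Invalid {
            name: name.to_string(),
            value: value.to_string(),
        }),
    }
}
```

A caller might pass `std::env::var("OPS_PORT").ok().as_deref()` as `raw`. Note that `.ok()` also treats a non-Unicode value as unset, which a stricter loader may want to reject as well.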

3. Project Specification

3.1 What You Will Build

A small async service with:

typed config schema (default + file + env)
request-scoped tracing spans
health + readiness endpoints
graceful SIGTERM flow with drain budget
telemetry flush before exit
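
The typed config with default + file + env layering could be sketched as follows. This assumes the file and environment layers have already been read into string maps; the field names `bind` and `drain_budget_secs`, the default values, and the 1 to 300 second range are illustrative assumptions, not requirements of the project.

```rust
use std::collections::HashMap;

/// Typed, validated configuration (field names are illustrative).
#[derive(Debug, Clone, PartialEq)]
pub struct Config {
    pub bind: String,
    pub drain_budget_secs: u64,
}

#[derive(Debug, PartialEq)]
pub enum ConfigError {
    EmptyValue { key: String },
    InvalidNumber { key: String, value: String },
    OutOfRange { key: String, value: u64 },
}

/// Merge layers so later ones override earlier ones (default < file < env),
/// then validate once. Unknown keys are ignored in this sketch.
pub fn load_config(
    file: &HashMap<String, String>,
    env: &HashMap<String, String>,
) -> Result<Config, ConfigError> {
    // Layer 1: built-in defaults (assumed values).
    let mut merged: HashMap<String, String> = HashMap::new();
    merged.insert("bind".into(), "127.0.0.1:8080".into());
    merged.insert("drain_budget_secs".into(), "30".into());

    // Layers 2 and 3: file first, then env, so env wins on conflict.
    for layer in [file, env] {
        for (key, value) in layer {
            merged.insert(key.clone(), value.clone());
        }
    }

    // Validation happens here, once, before anything else starts.
    let bind = merged["bind"].trim().to_string();
    if bind.is_empty() {
        return Err(ConfigError::EmptyValue { key: "bind".into() });
    }
    let raw = merged["drain_budget_secs"].clone();
    let drain_budget_secs: u64 = raw.trim().parse().map_err(|_| ConfigError::InvalidNumber {
        key: "drain_budget_secs".into(),
        value: raw.clone(),
    })?;
    if drain_budget_secs == 0 || drain_budget_secs > 300 {
        return Err(ConfigError::OutOfRange {
            key: "drain_budget_secs".into(),
            value: drain_budget_secs,
        });
    }
    Ok(Config { bind, drain_budget_secs })
}
```

Passing the layers in as maps keeps the precedence logic testable without touching the real environment or filesystem.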

3.2 Functional Requirements

Invalid config prevents startup.
Every request has correlation fields.
SIGTERM triggers controlled drain and clean exit.
Health/readiness reflect true lifecycle state.

3.3 Non-Functional Requirements

Reliability: zero dropped in-flight requests on the golden (expected) shutdown path.
Observability: actionable events for startup, runtime, shutdown.
Operational clarity: error categories are explicit and searchable.

3.4 Example Usage / Output

$ RUST_ENV=staging ./target/release/ops_service
INFO service.start version=0.1.0 env=staging bind=127.0.0.1:8080
INFO service.ready ready=true

$ curl -s http://127.0.0.1:8080/health
{"status":"ok"}

$ kill -TERM <pid>
INFO shutdown.begin in_flight=3
INFO shutdown.drain_complete in_flight=0
INFO telemetry.flush status=ok
INFO service.exit code=0
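
The sample lines above follow a `LEVEL event.name key=value` shape. A minimal formatter for that shape might look like the sketch below. It uses only the standard library so it stands alone; a real service would more likely emit these events through a logging or tracing crate. The function name `format_event` and the field name `request_id` used in the usage note are hypothetical.

```rust
/// Render one event as `LEVEL event.name key=value ...`.
pub fn format_event(level: &str, name: &str, fields: &[(&str, String)]) -> String {
    let mut line = format!("{} {}", level, name);
    for (key, value) in fields {
        if value.contains(char::is_whitespace) {
            // Quote values containing whitespace so key=value pairs stay separable.
            line.push_str(&format!(" {}={:?}", key, value));
        } else {
            line.push_str(&format!(" {}={}", key, value));
        }
    }
    line
}
```

For example, `format_event("INFO", "shutdown.begin", &[("in_flight", 3u32.to_string())])` yields `INFO shutdown.begin in_flight=3`. Adding a correlation field such as `("request_id", id)` to every request-path event is one way to satisfy the correlation requirement.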

3.5 Real World Outcome

An operator can answer three questions immediately: “Is it healthy?”, “What failed?”, and “Can it stop safely now?”

4. Solution Architecture

4.1 High-Level Design

Config Loader -> Service Runtime -> Request Path -> Telemetry Export
      │               │               │                 │
      └------ lifecycle state machine + graceful shutdown ------┘

4.2 Key Components

Component	Responsibility	Key Decision
Config module	Load layers, apply precedence, validate	Fail fast on invalid values
Observability layer	Structured logs + spans	Standard field schema
Lifecycle manager	Ready/draining/stopped transitions	Explicit state model
Shutdown coordinator	Drain + flush orchestration	Bounded timeout and metrics
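
As an illustration of the shutdown coordinator's bounded drain, the sketch below tracks in-flight requests and waits for the count to reach zero or for the budget to expire. It uses threads and a `Condvar` from the standard library to stay self-contained. An async service would typically use its runtime's equivalents instead, and the `InFlight` type and its method names are hypothetical.

```rust
use std::sync::{Condvar, Mutex};
use std::time::{Duration, Instant};

/// Counts in-flight requests and lets the shutdown path wait for zero.
#[derive(Default)]
pub struct InFlight {
    count: Mutex<usize>,
    changed: Condvar,
}

impl InFlight {
    /// Call when a request is accepted (intake must already be stopped before draining).
    pub fn begin(&self) {
        *self.count.lock().unwrap() += 1;
    }

    /// Call when a request finishes; must be paired with an earlier `begin`.
    pub fn end(&self) {
        let mut count = self.count.lock().unwrap();
        *count -= 1;
        self.changed.notify_all();
    }

    /// Wait until no requests remain or the budget expires.
    /// Returns the number still in flight (0 means a clean drain).
    pub fn drain(&self, budget: Duration) -> usize {
        let deadline = Instant::now() + budget;
        let mut count = self.count.lock().unwrap();
        while *count > 0 {
            let now = Instant::now();
            if now >= deadline {
                break;
            }
            // Sleep until a request ends or the remaining budget runs out.
            let (guard, _timed_out) = self.changed.wait_timeout(count, deadline - now).unwrap();
            count = guard;
        }
        *count
    }
}
```

Returning the leftover count rather than a boolean lets the caller log it and emit it as a metric when the drain budget is exceeded.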

5. Implementation Guide

5.1 The Core Question You’re Answering

“Can this Rust service operate predictably through startup, steady state, and shutdown?”

5.2 Concepts You Must Understand First

Structured logging and tracing spans.
Async task cancellation and join behavior.
Health vs readiness semantics.
Telemetry sink reliability expectations.

5.3 Questions to Guide Your Design

Which fields belong on every critical event?
What config precedence and schema validation rules are required?
What is your maximum acceptable drain time?

5.4 Thinking Exercise

Draw a lifecycle state machine for the service and annotate transitions with concrete triggers (startup complete, SIGTERM received, drain done, flush done).
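
Once the diagram exists, it can be encoded as a pure transition function. The sketch below is one possible encoding using the four triggers named above. The `Starting` and `Flushing` states, the shortcut from `Starting` straight to `Flushing`, and the choice to reject unexpected triggers are assumptions of this example.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum State {
    Starting,
    Ready,
    Draining,
    Flushing,
    Stopped,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Trigger {
    StartupComplete,
    SigtermReceived,
    DrainDone,
    FlushDone,
}

/// Apply one trigger. Unexpected (state, trigger) pairs are returned as errors
/// so the caller can decide whether to log, ignore, or abort.
pub fn transition(state: State, trigger: Trigger) -> Result<State, (State, Trigger)> {
    match (state, trigger) {
        (State::Starting, Trigger::StartupComplete) => Ok(State::Ready),
        // Assumption: nothing is in flight before startup completes, so skip the drain.
        (State::Starting, Trigger::SigtermReceived) => Ok(State::Flushing),
        (State::Ready, Trigger::SigtermReceived) => Ok(State::Draining),
        (State::Draining, Trigger::DrainDone) => Ok(State::Flushing),
        (State::Flushing, Trigger::FlushDone) => Ok(State::Stopped),
        other => Err(other),
    }
}
```

Keeping the function free of I/O makes every transition in the drawing checkable with a plain unit test.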

5.5 The Interview Questions They’ll Ask

“How does graceful shutdown work in async Rust services?”
“How do you prevent serving traffic before dependencies are ready?”
“Why do traces matter beyond logs?”
“What happens if the telemetry backend is unavailable during shutdown?”
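
For the last question, one defensible answer is that the flush gets a bounded budget and the service exits either way, recording the outcome. The sketch below shows that idea with a helper thread and a channel from the standard library. The `flush` closure stands in for whatever export call the telemetry library provides, and the names `flush_with_budget` and `FlushOutcome` are hypothetical.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

#[derive(Debug, PartialEq)]
pub enum FlushOutcome {
    Flushed,
    Failed(String),
    TimedOut,
}

/// Run `flush` on a helper thread and wait at most `budget` for it to finish.
pub fn flush_with_budget<F>(flush: F, budget: Duration) -> FlushOutcome
where
    F: FnOnce() -> Result<(), String> + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // The receiver may already be gone if the budget expired; ignore that send error.
        let _ = tx.send(flush());
    });
    match rx.recv_timeout(budget) {
        Ok(Ok(())) => FlushOutcome::Flushed,
        Ok(Err(reason)) => FlushOutcome::Failed(reason),
        Err(mpsc::RecvTimeoutError::Timeout) => FlushOutcome::TimedOut,
        // The helper thread ended without sending, which means `flush` panicked.
        Err(mpsc::RecvTimeoutError::Disconnected) => {
            FlushOutcome::Failed("flush task panicked".to_string())
        }
    }
}
```

On `TimedOut` the helper thread is simply abandoned, so a stuck backend cannot hold the process past its shutdown deadline.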

5.6 Hints in Layers

Hint 1: Centralize config loading and validation.
Hint 2: Add tracing spans at request and dependency boundaries.
Hint 3: Separate intake-stop from task-drain phases.
Hint 4: Rehearse SIGTERM drills with synthetic load.

5.7 Books That Will Help

Topic	Book	Chapter
Production Rust practices	“Rust for Rustaceans”	Reliability-focused chapters
Concurrency foundations	“The Rust Programming Language”	Ch. 16
Telemetry standards	OpenTelemetry Rust docs	Setup + concepts

6. Testing Strategy

Startup tests for invalid/partial config.
Integration tests for health/readiness transitions.
Shutdown drill tests under concurrent request load.
Telemetry contract checks for mandatory fields.
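
A telemetry contract check can be as small as a function that reports which mandatory fields an emitted line lacks. The sketch below assumes events are rendered as whitespace-separated `key=value` pairs whose values contain no whitespace. The mandatory field set (`service`, `env`, `request_id`) is a hypothetical contract for illustration.

```rust
/// Fields this sketch assumes every request-path event must carry (hypothetical contract).
pub const MANDATORY_FIELDS: [&str; 3] = ["service", "env", "request_id"];

/// Return the mandatory keys missing from one event line, in contract order.
/// Assumes `key=value` tokens separated by whitespace, with no whitespace inside values.
pub fn missing_fields(line: &str) -> Vec<&'static str> {
    // Collect the keys of all key=value tokens; tokens without '=' (level, event name) are skipped.
    let present: Vec<&str> = line
        .split_whitespace()
        .filter_map(|token| token.split_once('=').map(|(key, _)| key))
        .collect();
    MANDATORY_FIELDS
        .iter()
        .copied()
        .filter(|field| !present.contains(field))
        .collect()
}
```

A test suite could capture the service's log output during an integration run and assert that `missing_fields` is empty for every request-path line.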

7. Common Pitfalls & Debugging

Pitfall	Symptom	Solution
Premature readiness	errors right after deploy	gate readiness on dependencies
Abrupt termination	dropped requests	stop intake then drain with timeout
Unstructured logs	impossible incident correlation	enforce event schema with IDs
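
The first two rows both depend on readiness reporting the true lifecycle state. A minimal sketch of that gate follows, assuming the handlers return plain HTTP status codes. The `Probes` type, its method names, and the use of 503 for "not ready" are choices made for this example.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Shared flags that the health and readiness handlers read.
/// Both start as `false`: not ready until dependencies are confirmed.
#[derive(Default)]
pub struct Probes {
    dependencies_ok: AtomicBool,
    draining: AtomicBool,
}

impl Probes {
    /// Record the result of the latest dependency check.
    pub fn set_dependencies_ok(&self, ok: bool) {
        self.dependencies_ok.store(ok, Ordering::SeqCst);
    }

    /// Called at the start of shutdown so new traffic stops being routed here.
    pub fn begin_drain(&self) {
        self.draining.store(true, Ordering::SeqCst);
    }

    /// Health (liveness): the process is up and able to answer at all.
    pub fn health_status(&self) -> u16 {
        200
    }

    /// Readiness: accept traffic only when dependencies are up and no drain is in progress.
    pub fn readiness_status(&self) -> u16 {
        let ready = self.dependencies_ok.load(Ordering::SeqCst)
            && !self.draining.load(Ordering::SeqCst);
        if ready { 200 } else { 503 }
    }
}
```

Keeping health at 200 while readiness drops to 503 during the drain lets an orchestrator stop routing traffic to the instance without restarting it.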

8. Self-Assessment Checklist

Config schema is validated before startup completes.
Request path emits correlated events/spans.
Graceful shutdown is deterministic and tested.
Telemetry flush behavior is documented.

9. Completion Criteria

Minimum Viable Completion

Service starts/stops correctly with visible lifecycle events.

Full Completion

Lifecycle tests and shutdown drill evidence included.

Excellence

Includes on-call runbook excerpt and incident simulation results.