Project 8: Production-Ready Rust Service

Build an async Rust service that is observable, configurable, and able to shut down gracefully under load.

Quick Reference

Attribute                          Value
Difficulty                         Level 3: Advanced
Time Estimate                      1-2 weeks
Main Programming Language          Rust
Alternative Programming Languages  Go, Java
Coolness Level                     Level 3: Genuinely Clever
Business Potential                 4. The “Open Core” Infrastructure
Prerequisites                      Projects 1-7
Key Topics                         tracing/log, config management, graceful shutdown, telemetry

1. Learning Objectives

  1. Implement structured logging and tracing for request lifecycle visibility.
  2. Build validated config loading with explicit precedence rules.
  3. Implement deterministic graceful shutdown.
  4. Publish minimal production telemetry contracts.

2. Theoretical Foundation

2.1 Core Concepts

  • Structured observability: logs + spans + metrics + error events.
  • Config discipline: validate once at startup; no scattered env reads.
  • Graceful shutdown: stop intake, drain, flush, exit.
  • Operational contracts: define what “healthy” and “ready” mean.

2.2 Why This Matters

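Many deployment incidents come from lifecycle and operability failures (misconfiguration, premature readiness, requests dropped during a rollout) rather than from anything the compiler could have caught. This project targets those failure classes.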

2.3 Common Misconceptions

  • “println! is enough for logging” -> false for distributed diagnostics.
  • “Graceful shutdown is optional” -> false for rolling deploy reliability.
  • “Config defaults are a safe fallback” -> dangerous in production: a default can silently hide a missing or malformed env value.

3. Project Specification

3.1 What You Will Build

A small async service with:

  • typed config schema (default + file + env)
  • request-scoped tracing spans
  • health + readiness endpoints
  • graceful SIGTERM flow with drain budget
  • telemetry flush before exit

3.2 Functional Requirements

  1. Invalid config prevents startup.
  2. Every request has correlation fields.
  3. SIGTERM triggers controlled drain and clean exit.
  4. Health/readiness reflect true lifecycle state.
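
The config requirements above (default + file + env precedence, and invalid config preventing startup) can be sketched with the standard library alone. This is a minimal sketch under stated assumptions, not a definitive implementation: the `Config` fields, the `key=value` file format, the validation range, and the `OPS_BIND` / `OPS_DRAIN_SECS` variable names are all hypothetical. A real service would more likely use a config crate with typed deserialization.

```rust
use std::collections::HashMap;
use std::net::SocketAddr;
use std::time::Duration;

/// Typed, validated configuration (field names are illustrative).
#[derive(Debug, Clone, PartialEq)]
pub struct Config {
    pub bind: SocketAddr,
    pub drain_budget: Duration,
}

#[derive(Debug, PartialEq)]
pub enum ConfigError {
    InvalidBind(String),
    InvalidDrainSecs(String),
    DrainOutOfRange(u64),
}

/// Parse a hypothetical `key=value` file body; blank lines and `#` comments are skipped.
pub fn parse_file(body: &str) -> HashMap<String, String> {
    body.lines()
        .map(str::trim)
        .filter(|l| !l.is_empty() && !l.starts_with('#'))
        .filter_map(|l| l.split_once('='))
        .map(|(k, v)| (k.trim().to_string(), v.trim().to_string()))
        .collect()
}

/// Merge layers with precedence env > file > default, then validate once.
/// `env` is passed in as a map so the function stays testable; at startup
/// it could be built from `std::env::vars()`.
pub fn load(
    file: &HashMap<String, String>,
    env: &HashMap<String, String>,
) -> Result<Config, ConfigError> {
    // Look up the env override first, then the file value, then the default.
    let pick = |env_key: &str, file_key: &str, default: &str| -> String {
        env.get(env_key)
            .or_else(|| file.get(file_key))
            .cloned()
            .unwrap_or_else(|| default.to_string())
    };

    let bind_raw = pick("OPS_BIND", "bind", "127.0.0.1:8080");
    let drain_raw = pick("OPS_DRAIN_SECS", "drain_secs", "10");

    // A malformed value is an error; it never falls back to the default.
    let bind = bind_raw
        .parse::<SocketAddr>()
        .map_err(|_| ConfigError::InvalidBind(bind_raw.clone()))?;
    let secs = drain_raw
        .parse::<u64>()
        .map_err(|_| ConfigError::InvalidDrainSecs(drain_raw.clone()))?;
    // The 1..=300 second range is an assumed policy, chosen for illustration.
    if secs == 0 || secs > 300 {
        return Err(ConfigError::DrainOutOfRange(secs));
    }
    Ok(Config { bind, drain_budget: Duration::from_secs(secs) })
}
```

Returning a `Result` from a single `load` call keeps all environment reads in one place and lets startup abort with a specific, searchable error category.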

3.3 Non-Functional Requirements

  • Reliability: zero dropped in-flight requests on the golden shutdown path (drain completes within its budget).
  • Observability: actionable events for startup, runtime, shutdown.
  • Operational clarity: error categories are explicit and searchable.

3.4 Example Usage / Output

$ RUST_ENV=staging ./target/release/ops_service
INFO service.start version=0.1.0 env=staging bind=127.0.0.1:8080
INFO service.ready ready=true

$ curl -s http://127.0.0.1:8080/health
{"status":"ok"}

$ kill -TERM <pid>
INFO shutdown.begin in_flight=3
INFO shutdown.drain_complete in_flight=0
INFO telemetry.flush status=ok
INFO service.exit code=0
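
The log lines above follow a `LEVEL event.name key=value ...` shape. One way to produce lines in that shape is sketched below using only the standard library. The `Event` builder and its methods are hypothetical names introduced for illustration; in a real service a structured logging crate would normally do this work, and the quoting rule for values with spaces is an assumption, not something the sample output specifies.

```rust
use std::fmt::Write as _;

/// Severity levels (only two are sketched here).
#[derive(Debug, Clone, Copy)]
pub enum Level {
    Info,
    Error,
}

/// One structured event: a stable name plus ordered key=value fields.
pub struct Event {
    level: Level,
    name: &'static str,
    fields: Vec<(&'static str, String)>,
}

impl Event {
    pub fn new(level: Level, name: &'static str) -> Self {
        Event { level, name, fields: Vec::new() }
    }

    /// Attach a field; anything implementing `ToString` is accepted.
    pub fn field(mut self, key: &'static str, value: impl ToString) -> Self {
        self.fields.push((key, value.to_string()));
        self
    }

    /// Render as `LEVEL event.name key=value ...`, quoting values that contain spaces.
    pub fn render(&self) -> String {
        let level = match self.level {
            Level::Info => "INFO",
            Level::Error => "ERROR",
        };
        let mut line = format!("{level} {}", self.name);
        for (key, value) in &self.fields {
            if value.contains(' ') {
                // `{:?}` on a string adds quotes and escapes, keeping the line parseable.
                let _ = write!(line, " {key}={value:?}");
            } else {
                let _ = write!(line, " {key}={value}");
            }
        }
        line
    }
}
```

For example, `Event::new(Level::Info, "shutdown.begin").field("in_flight", 3).render()` yields `INFO shutdown.begin in_flight=3`. Keeping event names as fixed strings makes them easy to search for during an incident.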

3.5 Real World Outcome

An operator can answer three questions immediately: “Is it healthy?”, “What failed?”, and “Can it stop safely now?”


4. Solution Architecture

4.1 High-Level Design

Config Loader -> Service Runtime -> Request Path -> Telemetry Export
      │               │               │                 │
      └------ lifecycle state machine + graceful shutdown ------┘

4.2 Key Components

Component             Responsibility                      Key Decision
Config module         Parse/validate precedence           Fail fast on invalid values
Observability layer   Structured logs + spans             Standard field schema
Lifecycle manager     Ready/draining/stopped transitions  Explicit state model
Shutdown coordinator  Drain + flush orchestration         Bounded timeout and metrics
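
The lifecycle manager's explicit state model could look like the sketch below. The ready/draining/stopped states come from the table; the `Starting` and `Flushing` states, the exact transition rules, and the definitions of `healthy` and `ready` are assumptions made for illustration, and a given service may reasonably choose different ones.

```rust
/// Lifecycle states. `Starting` and `Flushing` are assumed additions to the
/// ready/draining/stopped states named in the component table.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum State {
    Starting,
    Ready,
    Draining,
    Flushing,
    Stopped,
}

/// Concrete triggers that move the service between states.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Trigger {
    StartupComplete,
    SigtermReceived,
    DrainDone,
    FlushDone,
}

impl State {
    /// Apply a trigger. An illegal transition is returned as an error
    /// instead of being silently ignored.
    pub fn on(self, trigger: Trigger) -> Result<State, (State, Trigger)> {
        use State::*;
        use Trigger::*;
        match (self, trigger) {
            (Starting, StartupComplete) => Ok(Ready),
            // Nothing is in flight before readiness, so skip the drain phase.
            (Starting, SigtermReceived) => Ok(Flushing),
            (Ready, SigtermReceived) => Ok(Draining),
            (Draining, DrainDone) => Ok(Flushing),
            (Flushing, FlushDone) => Ok(Stopped),
            // A repeated SIGTERM during shutdown is treated as a no-op.
            (Draining, SigtermReceived) => Ok(Draining),
            (Flushing, SigtermReceived) => Ok(Flushing),
            (state, trigger) => Err((state, trigger)),
        }
    }

    /// Health (liveness), as assumed here: the process has not stopped.
    pub fn healthy(self) -> bool {
        self != State::Stopped
    }

    /// Readiness: the service should receive new traffic.
    pub fn ready(self) -> bool {
        self == State::Ready
    }
}
```

With this model a draining service still reports healthy but no longer ready, which is the signal a load balancer needs to stop routing new requests during a rolling deploy.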

5. Implementation Guide

5.1 The Core Question You’re Answering

“Can this Rust service operate predictably through startup, steady state, and shutdown?”

5.2 Concepts You Must Understand First

  1. Structured logging and tracing spans.
  2. Async task cancellation and join behavior.
  3. Health vs readiness semantics.
  4. Telemetry sink reliability expectations.
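
To make the first concept concrete, the sketch below imitates a request-scoped span with a plain RAII guard: it emits an event on entry, attaches the same correlation fields to every event inside the span, and emits a closing event with elapsed time when dropped. It uses only the standard library and an in-memory sink so that it stays self-contained; `RequestSpan`, `Sink`, and the event names are hypothetical, and a production service would normally rely on a tracing library instead.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::{Arc, Mutex};
use std::time::Instant;

/// In-memory sink standing in for a real log or trace exporter.
pub type Sink = Arc<Mutex<Vec<String>>>;

/// Process-wide counter used to mint correlation IDs.
static NEXT_REQUEST_ID: AtomicU64 = AtomicU64::new(1);

/// A request-scoped span: emits `request.start` on entry and
/// `request.end` (with elapsed time) when it is dropped.
pub struct RequestSpan {
    request_id: u64,
    route: String,
    started: Instant,
    sink: Sink,
}

impl RequestSpan {
    pub fn enter(route: &str, sink: &Sink) -> Self {
        let span = RequestSpan {
            request_id: NEXT_REQUEST_ID.fetch_add(1, Ordering::Relaxed),
            route: route.to_string(),
            started: Instant::now(),
            sink: Arc::clone(sink),
        };
        span.event("INFO", "request.start", "");
        span
    }

    pub fn request_id(&self) -> u64 {
        self.request_id
    }

    /// Emit an event inside the span; correlation fields are added automatically.
    pub fn event(&self, level: &str, name: &str, extra: &str) {
        let mut line = format!(
            "{level} {name} request_id={} route={}",
            self.request_id, self.route
        );
        if !extra.is_empty() {
            line.push(' ');
            line.push_str(extra);
        }
        self.sink.lock().unwrap().push(line);
    }
}

impl Drop for RequestSpan {
    fn drop(&mut self) {
        let elapsed_ms = self.started.elapsed().as_millis();
        self.event("INFO", "request.end", &format!("elapsed_ms={elapsed_ms}"));
    }
}
```

Because the closing event is emitted from `Drop`, it is recorded on every exit path, including early returns through `?`.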

5.3 Questions to Guide Your Design

  1. Which fields belong on every critical event?
  2. What config precedence and schema validation rules are required?
  3. What is your maximum acceptable drain time?

5.4 Thinking Exercise

Draw a lifecycle state machine for the service and annotate transitions with concrete triggers (startup complete, SIGTERM received, drain done, flush done).

5.5 The Interview Questions They’ll Ask

  1. “How does graceful shutdown work in async Rust services?”
  2. “How do you prevent serving traffic before dependencies are ready?”
  3. “Why do traces matter beyond logs?”
  4. “What happens if the telemetry backend is unavailable during shutdown?”

5.6 Hints in Layers

  • Hint 1: Centralize config loading and validation.
  • Hint 2: Add tracing spans at request and dependency boundaries.
  • Hint 3: Separate intake-stop from task-drain phases.
  • Hint 4: Rehearse SIGTERM drills with synthetic load.
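
Hint 3 (separating intake-stop from task-drain) can be sketched as a small gate that counts in-flight requests. This version uses threads and a `Condvar` from the standard library so it runs without an async runtime; the `Gate` and `Permit` names are hypothetical. In an async service the same two phases would be built on the runtime's own signal handling and task tracking.

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::time::{Duration, Instant};

/// Tracks whether intake is open and how many requests are in flight.
struct Inner {
    accepting: bool,
    in_flight: usize,
}

#[derive(Clone)]
pub struct Gate {
    inner: Arc<(Mutex<Inner>, Condvar)>,
}

/// RAII token for one in-flight request; dropping it marks the request done.
pub struct Permit {
    gate: Gate,
}

impl Gate {
    pub fn new() -> Self {
        let inner = Inner { accepting: true, in_flight: 0 };
        Gate { inner: Arc::new((Mutex::new(inner), Condvar::new())) }
    }

    /// Steady state: admit a request unless intake has been stopped.
    pub fn try_enter(&self) -> Option<Permit> {
        let mut st = self.inner.0.lock().unwrap();
        if !st.accepting {
            return None;
        }
        st.in_flight += 1;
        Some(Permit { gate: self.clone() })
    }

    /// Phase 1: stop intake. Returns the in-flight count at that moment.
    pub fn stop_intake(&self) -> usize {
        let mut st = self.inner.0.lock().unwrap();
        st.accepting = false;
        st.in_flight
    }

    /// Phase 2: wait for in-flight work to finish, bounded by `budget`.
    /// Returns the number of requests still running when the wait ended.
    pub fn drain(&self, budget: Duration) -> usize {
        let deadline = Instant::now() + budget;
        let (lock, cvar) = &*self.inner;
        let mut st = lock.lock().unwrap();
        while st.in_flight > 0 {
            let left = deadline.saturating_duration_since(Instant::now());
            if left.is_zero() {
                break; // budget exhausted; report the stragglers
            }
            st = cvar.wait_timeout(st, left).unwrap().0;
        }
        st.in_flight
    }
}

impl Drop for Permit {
    fn drop(&mut self) {
        let (lock, cvar) = &*self.gate.inner;
        lock.lock().unwrap().in_flight -= 1;
        cvar.notify_all();
    }
}
```

The value returned by `stop_intake` maps to the `in_flight` field of a `shutdown.begin` event, and a non-zero return from `drain` is the case worth logging and counting, since it means the drain budget was too short for the load.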

5.7 Books That Will Help

Topic                      Book                             Chapter
Production Rust practices  “Rust for Rustaceans”            Reliability-focused chapters
Concurrency foundations    “The Rust Programming Language”  Ch. 16
Telemetry standards        OpenTelemetry Rust docs          Setup + concepts

6. Testing Strategy

  • Startup tests for invalid/partial config.
  • Integration tests for health/readiness transitions.
  • Shutdown drill tests under concurrent request load.
  • Telemetry contract checks for mandatory fields.
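
A telemetry contract check can be as small as a helper that reports which mandatory fields are missing from an emitted line. The sketch below assumes events are rendered as `LEVEL name key=value ...` with no spaces inside values; the function name and the example field set (`request_id`, `route`, `status`) are hypothetical.

```rust
use std::collections::HashSet;

/// Return the required keys that are absent (or empty) in a
/// `LEVEL name key=value ...` line, in the order they were required.
/// Assumes values contain no spaces; quoted values would need a real parser.
pub fn missing_fields<'a>(line: &str, required: &[&'a str]) -> Vec<&'a str> {
    let present: HashSet<&str> = line
        .split_whitespace()
        .skip(2) // skip the level and the event name
        .filter_map(|token| token.split_once('='))
        .filter(|(_, value)| !value.is_empty())
        .map(|(key, _)| key)
        .collect();
    required
        .iter()
        .copied()
        .filter(|key| !present.contains(*key))
        .collect()
}
```

In a test, asserting that `missing_fields(line, &required)` is empty for every captured request event turns the field schema into something that fails the build when it is violated.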

7. Common Pitfalls & Debugging

Pitfall              Symptom                          Solution
Premature readiness  Errors right after deploy        Gate readiness on dependencies
Abrupt termination   Dropped requests                 Stop intake then drain with timeout
Unstructured logs    Impossible incident correlation  Enforce event schema with IDs

8. Self-Assessment Checklist

  • Config schema is validated before startup completes.
  • Request path emits correlated events/spans.
  • Graceful shutdown is deterministic and tested.
  • Telemetry flush behavior is documented.

9. Completion Criteria

Minimum Viable Completion

  • Service starts/stops correctly with visible lifecycle events.

Full Completion

  • Lifecycle tests and shutdown drill evidence are included.

Excellence

  • Includes an on-call runbook excerpt and incident simulation results.