Project 12: Service Watchdog

Build a process supervisor with restart policies, signal escalation, and health checks.

Quick Reference

| Attribute | Value |
|---|---|
| Difficulty | Advanced |
| Time Estimate | 2-3 weeks |
| Language | Go (Alternatives: Rust, Python, C) |
| Prerequisites | Projects 1-11 |
| Key Topics | supervision, SIGTERM/SIGKILL, restart logic |

1. Learning Objectives

By completing this project, you will:

  1. Monitor services and detect crashes.
  2. Implement restart policies with backoff.
  3. Handle graceful shutdown with signal escalation.
  4. Track health checks and expose status.

2. Theoretical Foundation

2.1 Core Concepts

  • Supervision: A parent process starts children, watches for their exits, restarts them, and tracks their state.
  • Signal escalation: Send SIGTERM first so the service can shut down gracefully, then SIGKILL after a timeout if it has not exited (sketched below).
  • Health checks: Liveness is not the same as health; a process can still be running while hung or failing its checks.
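
A minimal escalation sketch in Go, assuming the child was started with os/exec and that a separate goroutine calls cmd.Wait() and closes the exited channel (package and names are illustrative):

```go
package watchdog

import (
    "os/exec"
    "syscall"
    "time"
)

// stopWithEscalation asks the service to exit with SIGTERM, then falls
// back to SIGKILL if it is still running after gracefulTimeout.
// exited must be closed by the goroutine that calls cmd.Wait().
func stopWithEscalation(cmd *exec.Cmd, exited <-chan struct{}, gracefulTimeout time.Duration) {
    _ = cmd.Process.Signal(syscall.SIGTERM) // polite request: clean up and exit

    select {
    case <-exited:
        // Graceful shutdown succeeded within the window.
    case <-time.After(gracefulTimeout):
        _ = cmd.Process.Kill() // SIGKILL: cannot be caught or ignored
    }
}
```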

2.2 Why This Matters

Supervisors like systemd, supervisord, and container runtimes rely on these principles.

2.3 Historical Context / Background

Init systems have long provided basic supervision (for example, respawn entries in inittab); modern systems formalize restart policies and health checks.

2.4 Common Misconceptions

  • “If it runs, it is healthy”: A hung process is still running but doing no useful work.
  • “kill -9 is cleanup”: SIGKILL cannot be caught or handled, so it bypasses graceful shutdown entirely.

3. Project Specification

3.1 What You Will Build

A supervisor that manages multiple services defined in a config file, restarts on failure, logs events, and exposes health status.
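
One possible shape for the config file; the schema and field names below are illustrative, not prescribed:

```yaml
# services.yaml (hypothetical schema)
services:
  - name: web
    command: ["/usr/local/bin/web-server", "--port", "8080"]
    restart:
      policy: backoff      # fixed | backoff
      max_retries: 5
    graceful_timeout: 10s  # SIGTERM -> SIGKILL window
    health_check:
      command: ["curl", "-fsS", "http://localhost:8080/healthz"]
      interval: 30s
  - name: worker
    command: ["/usr/local/bin/worker"]
    restart:
      policy: backoff
      max_retries: 5
```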

3.2 Functional Requirements

  1. Start and stop services defined in config.
  2. Detect exits and restart with retry limits.
  3. Implement SIGTERM then SIGKILL escalation.
  4. Record event logs and service status.

3.3 Non-Functional Requirements

  • Reliability: Avoid restart storms with backoff.
  • Safety: Ensure children are reaped.
  • Usability: Clear status and logs.

3.4 Example Usage / Output

$ ./watchdog --config services.yaml
Service web: UP (pid 1234, restarts 0)
Service worker: DOWN (restarts 5, failed)

3.5 Real World Outcome

You will run the watchdog and see per-service status and restart events, as in the example output shown in section 3.4 above.

4. Solution Architecture

4.1 High-Level Design

config -> start services -> monitor -> restart/escalate -> report

4.2 Key Components

| Component | Responsibility | Key Decisions |
|---|---|---|
| Supervisor | Start/stop services | One goroutine per service |
| Policy | Restart/backoff | Exponential backoff |
| Health | Run checks | Command or HTTP |
| Reporter | Status output | Table format |
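
One way these components might fit together, following the one-goroutine-per-service decision. `ServiceConfig`, `superviseLoop`, and the field names are illustrative, and `ServiceState` is the struct defined in 4.3 below:

```go
package watchdog

import "sync"

// ServiceConfig stands in for the per-service configuration
// (command, restart policy, health check, ...).
type ServiceConfig struct {
    Name string
}

// Supervisor owns all managed services and their runtime state.
type Supervisor struct {
    mu       sync.Mutex
    services map[string]*ServiceState // keyed by service name, see 4.3
}

// Run starts one supervision goroutine per configured service and
// blocks until all of them have stopped.
func (s *Supervisor) Run(configs []ServiceConfig) {
    var wg sync.WaitGroup
    for _, cfg := range configs {
        wg.Add(1)
        go func(cfg ServiceConfig) {
            defer wg.Done()
            s.superviseLoop(cfg) // start, monitor, restart, escalate
        }(cfg)
    }
    wg.Wait()
}

// superviseLoop is left as a stub; the restart policy in 4.4 fills it in.
func (s *Supervisor) superviseLoop(cfg ServiceConfig) {}
```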

4.3 Data Structures

type ServiceState struct {
    Name     string // service name from the config
    PID      int    // current process ID (0 when not running)
    Restarts int    // restarts since the supervisor started
    Healthy  bool   // result of the most recent health check
}

4.4 Algorithm Overview

Key Algorithm: Restart Policy

  1. Start the service.
  2. On exit, increment the restart count.
  3. Sleep for the backoff interval, then restart, until the retry limit is reached (see the sketch below).

Complexity Analysis:

  • Time: O(1) per event
  • Space: O(n) services
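
A minimal sketch of this policy, assuming a `runService` callback that starts the service and blocks until it exits; the 1-second base and 30-second cap are arbitrary choices:

```go
package watchdog

import (
    "log"
    "time"
)

// backoff returns the delay before the next restart attempt:
// 1s, 2s, 4s, ... capped at 30s.
func backoff(restarts int) time.Duration {
    d := time.Second << uint(restarts)
    if d <= 0 || d > 30*time.Second { // cap, and guard against shift overflow
        d = 30 * time.Second
    }
    return d
}

// supervise runs the service until maxRetries restarts have been used up.
// runService starts the service and blocks until it exits.
func supervise(name string, maxRetries int, runService func() error) {
    for restarts := 0; restarts <= maxRetries; restarts++ {
        if restarts > 0 {
            time.Sleep(backoff(restarts - 1)) // back off before retrying
        }
        err := runService()
        log.Printf("service %s exited: %v (restarts=%d)", name, err, restarts)
    }
    log.Printf("service %s reached its restart limit, marking as failed", name)
}
```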

5. Implementation Guide

5.1 Development Environment Setup

go version

5.2 Project Structure

project-root/
├── cmd/watchdog/main.go
├── services.yaml
└── README.md

5.3 The Core Question You’re Answering

“How do I keep critical services running, and shut them down safely when they must stop?”

5.4 Concepts You Must Understand First

Stop and research these before coding; a launch-and-reap sketch follows the list:

  1. Process groups and signals
  2. wait() and reaping
  3. Backoff strategies
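
A minimal launch-and-reap sketch for concepts 1 and 2 (Linux-specific; /bin/sleep stands in for a real service): the child gets its own process group so the whole group can be signalled, and Wait reaps the child so no zombie is left behind.

```go
package main

import (
    "log"
    "os/exec"
    "syscall"
)

func main() {
    cmd := exec.Command("/bin/sleep", "5")
    // Put the child in its own process group so the child and any
    // grandchildren can be signalled as a unit.
    cmd.SysProcAttr = &syscall.SysProcAttr{Setpgid: true}

    if err := cmd.Start(); err != nil {
        log.Fatalf("start: %v", err)
    }
    pgid := cmd.Process.Pid // with Setpgid, the group ID equals the child PID

    // A negative PID targets the whole process group.
    _ = syscall.Kill(-pgid, syscall.SIGTERM)

    // Wait reaps the child, so no zombie is left behind.
    log.Printf("child exited: %v", cmd.Wait())
}
```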

5.5 Questions to Guide Your Design

Before implementing, think through these:

  1. How many restarts should be allowed?
  2. What is a reasonable escalation timeout?
  3. How will you define health checks?

5.6 Thinking Exercise

Simulate a crash loop

Write a small program or script that exits immediately (for example, the Go program below) and observe how your restart policy prevents a tight restart loop.
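
One such crash-looping “service”, as a tiny Go program:

```go
// A deliberately failing service for exercising the restart policy.
package main

import "os"

func main() {
    os.Exit(1) // exit immediately with a failure status
}
```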

5.7 The Interview Questions They’ll Ask

Prepare to answer these:

  1. “How do you avoid restart storms?”
  2. “What does graceful shutdown mean?”
  3. “How do you detect a hung service?”

5.8 Hints in Layers

Hint 1: Start with one service. Then generalize to multiple.

Hint 2: Use context for timeouts. Timeouts simplify escalation logic.

Hint 3: Track restarts. Persist counts and last start times.

5.9 Books That Will Help

| Topic | Book | Chapter |
|---|---|---|
| Daemon management | “TLPI” | Ch. 37 |
| Process groups | “APUE” | Ch. 9 |
| Signals | “Linux System Programming” | Ch. 6 |

5.10 Implementation Phases

Phase 1: Foundation (4-5 days)

Goals:

  • Start and stop services reliably.

Tasks:

  1. Parse config (a parsing sketch follows the checkpoint).
  2. Launch a service and track PID.

Checkpoint: Service starts and stops cleanly.
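
A config-parsing sketch for task 1, assuming a schema like the illustrative services.yaml in section 3.1 and the gopkg.in/yaml.v3 package (any YAML or JSON decoder works):

```go
package main

import (
    "log"
    "os"

    "gopkg.in/yaml.v3"
)

// Config mirrors a subset of the hypothetical services.yaml schema.
type Config struct {
    Services []struct {
        Name    string   `yaml:"name"`
        Command []string `yaml:"command"`
    } `yaml:"services"`
}

func main() {
    data, err := os.ReadFile("services.yaml")
    if err != nil {
        log.Fatalf("read config: %v", err)
    }
    var cfg Config
    if err := yaml.Unmarshal(data, &cfg); err != nil {
        log.Fatalf("parse config: %v", err)
    }
    for _, svc := range cfg.Services {
        log.Printf("loaded service %q: %v", svc.Name, svc.Command)
    }
}
```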

Phase 2: Core Functionality (1 week)

Goals:

  • Add monitoring and restart logic.

Tasks:

  1. Detect exits.
  2. Implement backoff and limits.

Checkpoint: Crash loop is contained.

Phase 3: Polish & Edge Cases (3-4 days)

Goals:

  • Add health checks and logs.

Tasks:

  1. Run health probes (a probe sketch follows the checkpoint).
  2. Log events and status.

Checkpoint: Status output shows health and restarts.
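
A command-based health probe sketch: the service counts as healthy if the probe command exits zero before the timeout (names are illustrative):

```go
package main

import (
    "context"
    "log"
    "os/exec"
    "time"
)

// probeCommand runs a health-check command and reports whether it
// exited with status zero before the timeout expired.
func probeCommand(argv []string, timeout time.Duration) bool {
    ctx, cancel := context.WithTimeout(context.Background(), timeout)
    defer cancel()

    cmd := exec.CommandContext(ctx, argv[0], argv[1:]...)
    return cmd.Run() == nil // non-zero exit or timeout => unhealthy
}

func main() {
    log.Printf("healthy=%v", probeCommand([]string{"true"}, 2*time.Second))
}
```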

5.11 Key Implementation Decisions

| Decision | Options | Recommendation | Rationale |
|---|---|---|---|
| Health check | command vs HTTP | command | Works locally |
| Restart policy | fixed vs backoff | backoff | Prevent storms |
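
If you opt for HTTP checks instead, the probe stays small; this sketch assumes the service exposes a health endpoint (the /healthz path is illustrative):

```go
package main

import (
    "log"
    "net/http"
    "time"
)

// probeHTTP reports whether the endpoint answered with a 2xx status
// within the client timeout.
func probeHTTP(url string) bool {
    client := &http.Client{Timeout: 2 * time.Second}
    resp, err := client.Get(url)
    if err != nil {
        return false
    }
    defer resp.Body.Close()
    return resp.StatusCode >= 200 && resp.StatusCode < 300
}

func main() {
    log.Printf("healthy=%v", probeHTTP("http://localhost:8080/healthz"))
}
```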

6. Testing Strategy

6.1 Test Categories

| Category | Purpose | Examples |
|---|---|---|
| Restart | Crash loop | Service exits immediately |
| Escalation | SIGTERM timeout | Unresponsive service |
| Health | Failing check | Returns non-zero |

6.2 Critical Test Cases

  1. Service exits -> restart occurs.
  2. Service ignores SIGTERM -> SIGKILL issued.
  3. Restart limit reached -> service marked failed.

6.3 Test Data

Service: /bin/sleep 1
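
To exercise critical test case 3 without spawning real processes, the backoff helper sketched in section 4.4 can be unit tested directly; the expected values below assume that sketch's doubling-from-1s, 30s-cap behaviour:

```go
package watchdog

import (
    "testing"
    "time"
)

// TestBackoffGrowsAndCaps checks the restart delays produced by the
// backoff helper from the section 4.4 sketch.
func TestBackoffGrowsAndCaps(t *testing.T) {
    cases := []struct {
        restarts int
        want     time.Duration
    }{
        {0, 1 * time.Second},
        {1, 2 * time.Second},
        {3, 8 * time.Second},
        {10, 30 * time.Second}, // capped
    }
    for _, c := range cases {
        if got := backoff(c.restarts); got != c.want {
            t.Errorf("backoff(%d) = %v, want %v", c.restarts, got, c.want)
        }
    }
}
```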

7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

| Pitfall | Symptom | Solution |
|---|---|---|
| Missing wait | Zombies | Reap children |
| No backoff | Restart storm | Add delay |
| Killing only parent | Orphaned children | Use process groups |

7.2 Debugging Strategies

  • Log every state transition.
  • Use ps -o pid,ppid,stat to confirm cleanup.

7.3 Performance Traps

Aggressive health checks can add load; space them out.


8. Extensions & Challenges

8.1 Beginner Extensions

  • Add a simple CLI status command.
  • Add restart counters per service.

8.2 Intermediate Extensions

  • Add HTTP health checks.
  • Add rolling restarts.

8.3 Advanced Extensions

  • Integrate with systemd unit files.
  • Add metrics endpoint for Prometheus.

9. Real-World Connections

9.1 Industry Applications

  • Process supervision in production systems and container runtimes.
  • systemd: https://systemd.io
  • supervisord: http://supervisord.org

9.2 Interview Relevance

  • Supervision and graceful shutdown are core infra topics.

10. Resources

10.1 Essential Reading

  • signal(7) - man 7 signal
  • wait(2) - man 2 wait

10.2 Video Resources

  • Service supervision talks (search “process supervisor Linux”)

10.3 Tools & Documentation

  • systemd.service(5) - man 5 systemd.service

11. Self-Assessment Checklist

11.1 Understanding

  • I can explain supervision and backoff.
  • I can describe signal escalation.
  • I can implement health checks.

11.2 Implementation

  • Services restart on crash.
  • Escalation works reliably.
  • Status output is clear.

11.3 Growth

  • I can adapt the watchdog to real services.
  • I can describe this design in interviews.

12. Submission / Completion Criteria

Minimum Viable Completion:

  • Supervisor starts, monitors, and restarts a service.

Full Completion:

  • Add escalation and health checks.

Excellence (Going Above & Beyond):

  • Add metrics endpoint and systemd integration.

This guide was generated from LINUX_SYSTEM_TOOLS_MASTERY.md. For the complete learning path, see the parent directory.