Project 12: Service Watchdog

Build a process supervisor with restart policies, signal escalation, and health checks.

Quick Reference

| Attribute | Value |
|---|---|
| Difficulty | Advanced |
| Time Estimate | 2-3 weeks |
| Language | Go (Alternatives: Rust, Python, C) |
| Prerequisites | Projects 1-11 |
| Key Topics | supervision, SIGTERM/SIGKILL, restart logic |

1. Learning Objectives

By completing this project, you will:

  1. Monitor services and detect crashes.
  2. Implement restart policies with backoff.
  3. Handle graceful shutdown with signal escalation.
  4. Track health checks and expose status.

2. Theoretical Foundation

2.1 Core Concepts

  • Supervision: A parent process starts children, watches for their exits, restarts them, and tracks their state.
  • Signal escalation: Send SIGTERM first so the service can shut down gracefully, then SIGKILL after a timeout if it has not exited (sketched below).
  • Health checks: Liveness is not the same as health; a process can still be running while hung or failing its checks.
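
A minimal escalation sketch in Go, assuming the child was started with os/exec and that a separate goroutine calls cmd.Wait() and closes the exited channel (package and names are illustrative):

```go
package watchdog

import (
    "os/exec"
    "syscall"
    "time"
)

// stopWithEscalation asks the service to exit with SIGTERM, then falls
// back to SIGKILL if it is still running after gracefulTimeout.
// exited must be closed by the goroutine that calls cmd.Wait().
func stopWithEscalation(cmd *exec.Cmd, exited <-chan struct{}, gracefulTimeout time.Duration) {
    _ = cmd.Process.Signal(syscall.SIGTERM) // polite request: clean up and exit

    select {
    case <-exited:
        // Graceful shutdown succeeded within the window.
    case <-time.After(gracefulTimeout):
        _ = cmd.Process.Kill() // SIGKILL: cannot be caught or ignored
    }
}
```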

2.2 Why This Matters

Supervisors like systemd, supervisord, and container runtimes rely on these principles.

2.3 Historical Context / Background

Init systems have long provided basic supervision (for example, respawn entries in inittab); modern systems formalize restart policies and health checks.

2.4 Common Misconceptions

  • “If it runs, it is healthy”: A hung process is still running but doing no useful work.
  • “kill -9 is cleanup”: SIGKILL cannot be caught or handled, so it bypasses graceful shutdown entirely.

3. Project Specification

3.1 What You Will Build

A supervisor that manages multiple services defined in a config file, restarts on failure, logs events, and exposes health status.
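
One possible shape for the config file; the schema and field names below are illustrative, not prescribed:

```yaml
# services.yaml (hypothetical schema)
services:
  - name: web
    command: ["/usr/local/bin/web-server", "--port", "8080"]
    restart:
      policy: backoff      # fixed | backoff
      max_retries: 5
    graceful_timeout: 10s  # SIGTERM -> SIGKILL window
    health_check:
      command: ["curl", "-fsS", "http://localhost:8080/healthz"]
      interval: 30s
  - name: worker
    command: ["/usr/local/bin/worker"]
    restart:
      policy: backoff
      max_retries: 5
```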

3.2 Functional Requirements

  1. Start and stop services defined in config.
  2. Detect exits and restart with retry limits.
  3. Implement SIGTERM then SIGKILL escalation.
  4. Record event logs and service status.

3.3 Non-Functional Requirements

  • Reliability: Avoid restart storms with backoff.
  • Safety: Ensure children are reaped.
  • Usability: Clear status and logs.

3.4 Example Usage / Output

$ ./watchdog --config services.yaml
Service web: UP (pid 1234, restarts 0)
Service worker: DOWN (restarts 5, failed)

3.5 Real World Outcome

You will run the watchdog and see per-service status and restart events, as in the example output shown in section 3.4 above.

4. Solution Architecture

4.1 High-Level Design

config -> start services -> monitor -> restart/escalate -> report

4.2 Key Components

| Component | Responsibility | Key Decisions |
|---|---|---|
| Supervisor | Start/stop services | One goroutine per service |
| Policy | Restart/backoff | Exponential backoff |
| Health | Run checks | Command or HTTP |
| Reporter | Status output | Table format |
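
One way these components might fit together, following the one-goroutine-per-service decision. `ServiceConfig`, `superviseLoop`, and the field names are illustrative, and `ServiceState` is the struct defined in 4.3 below:

```go
package watchdog

import "sync"

// ServiceConfig stands in for the per-service configuration
// (command, restart policy, health check, ...).
type ServiceConfig struct {
    Name string
}

// Supervisor owns all managed services and their runtime state.
type Supervisor struct {
    mu       sync.Mutex
    services map[string]*ServiceState // keyed by service name, see 4.3
}

// Run starts one supervision goroutine per configured service and
// blocks until all of them have stopped.
func (s *Supervisor) Run(configs []ServiceConfig) {
    var wg sync.WaitGroup
    for _, cfg := range configs {
        wg.Add(1)
        go func(cfg ServiceConfig) {
            defer wg.Done()
            s.superviseLoop(cfg) // start, monitor, restart, escalate
        }(cfg)
    }
    wg.Wait()
}

// superviseLoop is left as a stub; the restart policy in 4.4 fills it in.
func (s *Supervisor) superviseLoop(cfg ServiceConfig) {}
```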

4.3 Data Structures

type ServiceState struct {
    Name     string // service name from the config
    PID      int    // current process ID (0 when not running)
    Restarts int    // restarts since the supervisor started
    Healthy  bool   // result of the most recent health check
}

4.4 Algorithm Overview

Key Algorithm: Restart Policy

  1. Start the service.
  2. On exit, increment the restart count.
  3. Sleep for the backoff interval, then restart, until the retry limit is reached (see the sketch below).

Complexity Analysis:

  • Time: O(1) per event
  • Space: O(n) services
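
A minimal sketch of this policy, assuming a `runService` callback that starts the service and blocks until it exits; the 1-second base and 30-second cap are arbitrary choices:

```go
package watchdog

import (
    "log"
    "time"
)

// backoff returns the delay before the next restart attempt:
// 1s, 2s, 4s, ... capped at 30s.
func backoff(restarts int) time.Duration {
    d := time.Second << uint(restarts)
    if d <= 0 || d > 30*time.Second { // cap, and guard against shift overflow
        d = 30 * time.Second
    }
    return d
}

// supervise runs the service until maxRetries restarts have been used up.
// runService starts the service and blocks until it exits.
func supervise(name string, maxRetries int, runService func() error) {
    for restarts := 0; restarts <= maxRetries; restarts++ {
        if restarts > 0 {
            time.Sleep(backoff(restarts - 1)) // back off before retrying
        }
        err := runService()
        log.Printf("service %s exited: %v (restarts=%d)", name, err, restarts)
    }
    log.Printf("service %s reached its restart limit, marking as failed", name)
}
```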

5. Implementation Guide

5.1 Development Environment Setup

go version

5.2 Project Structure

project-root/
├── cmd/watchdog/main.go
├── services.yaml
└── README.md

5.3 The Core Question You’re Answering

“How do I keep critical services running, and shut them down safely when they must stop?”

5.4 Concepts You Must Understand First

Stop and research these before coding; a launch-and-reap sketch follows the list:

  1. Process groups and signals
  2. wait() and reaping
  3. Backoff strategies
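
A minimal launch-and-reap sketch for concepts 1 and 2 (Linux-specific; /bin/sleep stands in for a real service): the child gets its own process group so the whole group can be signalled, and Wait reaps the child so no zombie is left behind.

```go
package main

import (
    "log"
    "os/exec"
    "syscall"
)

func main() {
    cmd := exec.Command("/bin/sleep", "5")
    // Put the child in its own process group so the child and any
    // grandchildren can be signalled as a unit.
    cmd.SysProcAttr = &syscall.SysProcAttr{Setpgid: true}

    if err := cmd.Start(); err != nil {
        log.Fatalf("start: %v", err)
    }
    pgid := cmd.Process.Pid // with Setpgid, the group ID equals the child PID

    // A negative PID targets the whole process group.
    _ = syscall.Kill(-pgid, syscall.SIGTERM)

    // Wait reaps the child, so no zombie is left behind.
    log.Printf("child exited: %v", cmd.Wait())
}
```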

5.5 Questions to Guide Your Design

Before implementing, think through these:

  1. How many restarts should be allowed?
  2. What is a reasonable escalation timeout?
  3. How will you define health checks?

5.6 Thinking Exercise

Simulate a crash loop

Write a small program or script that exits immediately (for example, the Go program below) and observe how your restart policy prevents a tight restart loop.
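
One such crash-looping “service”, as a tiny Go program:

```go
// A deliberately failing service for exercising the restart policy.
package main

import "os"

func main() {
    os.Exit(1) // exit immediately with a failure status
}
```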

5.7 The Interview Questions They’ll Ask

Prepare to answer these:

  1. “How do you avoid restart storms?”
  2. “What does graceful shutdown mean?”
  3. “How do you detect a hung service?”

5.8 Hints in Layers

Hint 1: Start with one service. Then generalize to multiple.

Hint 2: Use context for timeouts. Timeouts simplify escalation logic.

Hint 3: Track restarts. Persist counts and last start times.

5.9 Books That Will Help

| Topic | Book | Chapter |
|---|---|---|
| Daemon management | “TLPI” | Ch. 37 |
| Process groups | “APUE” | Ch. 9 |
| Signals | “Linux System Programming” | Ch. 6 |

5.10 Implementation Phases

Phase 1: Foundation (4-5 days)

Goals:

  • Start and stop services reliably.

Tasks:

  1. Parse config (a parsing sketch follows the checkpoint).
  2. Launch a service and track PID.

Checkpoint: Service starts and stops cleanly.
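
A config-parsing sketch for task 1, assuming a schema like the illustrative services.yaml in section 3.1 and the gopkg.in/yaml.v3 package (any YAML or JSON decoder works):

```go
package main

import (
    "log"
    "os"

    "gopkg.in/yaml.v3"
)

// Config mirrors a subset of the hypothetical services.yaml schema.
type Config struct {
    Services []struct {
        Name    string   `yaml:"name"`
        Command []string `yaml:"command"`
    } `yaml:"services"`
}

func main() {
    data, err := os.ReadFile("services.yaml")
    if err != nil {
        log.Fatalf("read config: %v", err)
    }
    var cfg Config
    if err := yaml.Unmarshal(data, &cfg); err != nil {
        log.Fatalf("parse config: %v", err)
    }
    for _, svc := range cfg.Services {
        log.Printf("loaded service %q: %v", svc.Name, svc.Command)
    }
}
```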

Phase 2: Core Functionality (1 week)

Goals:

  • Add monitoring and restart logic.

Tasks:

  1. Detect exits.
  2. Implement backoff and limits.

Checkpoint: Crash loop is contained.

Phase 3: Polish & Edge Cases (3-4 days)

Goals:

  • Add health checks and logs.

Tasks:

  1. Run health probes (a probe sketch follows the checkpoint).
  2. Log events and status.

Checkpoint: Status output shows health and restarts.
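
A command-based health probe sketch: the service counts as healthy if the probe command exits zero before the timeout (names are illustrative):

```go
package main

import (
    "context"
    "log"
    "os/exec"
    "time"
)

// probeCommand runs a health-check command and reports whether it
// exited with status zero before the timeout expired.
func probeCommand(argv []string, timeout time.Duration) bool {
    ctx, cancel := context.WithTimeout(context.Background(), timeout)
    defer cancel()

    cmd := exec.CommandContext(ctx, argv[0], argv[1:]...)
    return cmd.Run() == nil // non-zero exit or timeout => unhealthy
}

func main() {
    log.Printf("healthy=%v", probeCommand([]string{"true"}, 2*time.Second))
}
```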

5.11 Key Implementation Decisions

| Decision | Options | Recommendation | Rationale |
|---|---|---|---|
| Health check | command vs HTTP | command | Works locally |
| Restart policy | fixed vs backoff | backoff | Prevent storms |
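
If you opt for HTTP checks instead, the probe stays small; this sketch assumes the service exposes a health endpoint (the /healthz path is illustrative):

```go
package main

import (
    "log"
    "net/http"
    "time"
)

// probeHTTP reports whether the endpoint answered with a 2xx status
// within the client timeout.
func probeHTTP(url string) bool {
    client := &http.Client{Timeout: 2 * time.Second}
    resp, err := client.Get(url)
    if err != nil {
        return false
    }
    defer resp.Body.Close()
    return resp.StatusCode >= 200 && resp.StatusCode < 300
}

func main() {
    log.Printf("healthy=%v", probeHTTP("http://localhost:8080/healthz"))
}
```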

6. Testing Strategy

6.1 Test Categories

| Category | Purpose | Examples |
|---|---|---|
| Restart | Crash loop | Service exits immediately |
| Escalation | SIGTERM timeout | Unresponsive service |
| Health | Failing check | Returns non-zero |

6.2 Critical Test Cases

  1. Service exits -> restart occurs.
  2. Service ignores SIGTERM -> SIGKILL issued.
  3. Restart limit reached -> service marked failed.

6.3 Test Data

Service: /bin/sleep 1
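
To exercise critical test case 3 without spawning real processes, the backoff helper sketched in section 4.4 can be unit tested directly; the expected values below assume that sketch's doubling-from-1s, 30s-cap behaviour:

```go
package watchdog

import (
    "testing"
    "time"
)

// TestBackoffGrowsAndCaps checks the restart delays produced by the
// backoff helper from the section 4.4 sketch.
func TestBackoffGrowsAndCaps(t *testing.T) {
    cases := []struct {
        restarts int
        want     time.Duration
    }{
        {0, 1 * time.Second},
        {1, 2 * time.Second},
        {3, 8 * time.Second},
        {10, 30 * time.Second}, // capped
    }
    for _, c := range cases {
        if got := backoff(c.restarts); got != c.want {
            t.Errorf("backoff(%d) = %v, want %v", c.restarts, got, c.want)
        }
    }
}
```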

7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

| Pitfall | Symptom | Solution |
|---|---|---|
| Missing wait | Zombies | Reap children |
| No backoff | Restart storm | Add delay |
| Killing only parent | Orphaned children | Use process groups |

7.2 Debugging Strategies

  • Log every state transition.
  • Use ps -o pid,ppid,stat to confirm cleanup.

7.3 Performance Traps

Aggressive health checks can add load; space them out.


8. Extensions & Challenges

8.1 Beginner Extensions

  • Add a simple CLI status command.
  • Add restart counters per service.

8.2 Intermediate Extensions

  • Add HTTP health checks.
  • Add rolling restarts.

8.3 Advanced Extensions

  • Integrate with systemd unit files.
  • Add metrics endpoint for Prometheus.

9. Real-World Connections

9.1 Industry Applications

  • Process supervision in production systems and container runtimes.
  • systemd: https://systemd.io
  • supervisord: http://supervisord.org

9.2 Interview Relevance

  • Supervision and graceful shutdown are core infra topics.

10. Resources

10.1 Essential Reading

  • signal(7) - man 7 signal
  • wait(2) - man 2 wait

10.2 Video Resources

  • Service supervision talks (search “process supervisor Linux”)

10.3 Tools & Documentation

  • systemd.service(5) - man 5 systemd.service

11. Self-Assessment Checklist

11.1 Understanding

  • I can explain supervision and backoff.
  • I can describe signal escalation.
  • I can implement health checks.

11.2 Implementation

  • Services restart on crash.
  • Escalation works reliably.
  • Status output is clear.

11.3 Growth

  • I can adapt the watchdog to real services.
  • I can describe this design in interviews.

12. Submission / Completion Criteria

Minimum Viable Completion:

  • Supervisor starts, monitors, and restarts a service.

Full Completion:

  • Add escalation and health checks.

Excellence (Going Above & Beyond):

  • Add metrics endpoint and systemd integration.

This guide was generated from LINUX_SYSTEM_TOOLS_MASTERY.md. For the complete learning path, see the parent directory.