Project 12: Service Watchdog
Build a process supervisor with restart policies, signal escalation, and health checks.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Advanced |
| Time Estimate | 2-3 weeks |
| Language | Go (Alternatives: Rust, Python, C) |
| Prerequisites | Projects 1-11 |
| Key Topics | supervision, SIGTERM/SIGKILL, restart logic |
1. Learning Objectives
By completing this project, you will:
- Monitor services and detect crashes.
- Implement restart policies with backoff.
- Handle graceful shutdown with signal escalation.
- Track health checks and expose status.
2. Theoretical Foundation
2.1 Core Concepts
- Supervision: A long-running parent process starts children, notices when they exit, and restarts them according to policy.
- Signal escalation: Ask for a graceful shutdown with SIGTERM first, then force termination with SIGKILL after a grace period (see the sketch after this list).
- Health checks: Liveness is not the same as health; a process can be running yet unable to do useful work.
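The escalation step is small enough to sketch up front. A minimal Go version, assuming the child was started with os/exec and that no other goroutine is already waiting on it:

```go
package watchdog

import (
	"os/exec"
	"syscall"
	"time"
)

// gracefulStop asks a child to exit with SIGTERM and escalates to SIGKILL if
// it is still alive after the timeout. Assumes cmd has been started and that
// no other goroutine calls cmd.Wait concurrently.
func gracefulStop(cmd *exec.Cmd, timeout time.Duration) error {
	if err := cmd.Process.Signal(syscall.SIGTERM); err != nil {
		return err
	}
	done := make(chan error, 1)
	go func() { done <- cmd.Wait() }()
	select {
	case err := <-done:
		return err // exited on its own within the grace period
	case <-time.After(timeout):
		_ = cmd.Process.Kill() // SIGTERM was ignored; escalate to SIGKILL
		return <-done
	}
}
```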
2.2 Why This Matters
Supervisors like systemd, supervisord, and container runtimes rely on these principles.
2.3 Historical Context / Background
Init systems have always provided basic supervision (classic init respawns entries from /etc/inittab); modern systems such as systemd formalize restart policies, backoff, and health checks.
2.4 Common Misconceptions
- “If it runs, it is healthy”: A hung or deadlocked process is still running yet serves nothing, so liveness alone proves little.
- “kill -9 is cleanup”: SIGKILL cannot be caught, so shutdown handlers never run and locks, temp files, and child processes can be left behind.
3. Project Specification
3.1 What You Will Build
A supervisor that manages multiple services defined in a config file, restarts on failure, logs events, and exposes health status.
3.2 Functional Requirements
- Start and stop services defined in config.
- Detect exits and restart with retry limits.
- Implement SIGTERM then SIGKILL escalation.
- Record event logs and service status.
3.3 Non-Functional Requirements
- Reliability: Avoid restart storms with backoff.
- Safety: Ensure children are reaped.
- Usability: Clear status and logs.
3.4 Example Usage / Output
$ ./watchdog --config services.yaml
Service web: UP (pid 1234, restarts 0)
Service worker: DOWN (restarts 5, failed)
3.5 Real World Outcome
You will run the watchdog against a real config and watch it keep services alive: healthy services report UP with their PID and restart count (as in the example above), a crashing service accumulates restarts with backoff between attempts, and a service that exhausts its retry limit is marked failed.
4. Solution Architecture
4.1 High-Level Design
config -> start services -> monitor -> restart/escalate -> report
4.2 Key Components
| Component | Responsibility | Key Decisions |
|---|---|---|
| Supervisor | Start/stop services | One goroutine per service |
| Policy | Restart/backoff | Exponential backoff |
| Health | Run checks | Command or HTTP |
| Reporter | Status output | Table format |
4.3 Data Structures
type ServiceState struct { Name string; PID int; Restarts int; Healthy bool }
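A fuller version of the state struct, plus a per-service config struct that mirrors services.yaml, might look like the sketch below. The field names and yaml tags are illustrative assumptions, not a fixed schema:

```go
package watchdog

import "time"

// ServiceConfig mirrors one entry in services.yaml. Field names and yaml tags
// are assumptions for illustration; adapt them to your own schema.
type ServiceConfig struct {
	Name           string   `yaml:"name"`
	Command        []string `yaml:"command"`          // argv: program followed by its arguments
	MaxRestarts    int      `yaml:"max_restarts"`     // mark the service failed after this many restarts
	StopTimeoutSec int      `yaml:"stop_timeout_sec"` // grace period between SIGTERM and SIGKILL
	HealthCmd      []string `yaml:"health_cmd"`       // optional liveness probe command
}

// ServiceState is the runtime view that the reporter prints.
type ServiceState struct {
	Name      string
	PID       int
	Restarts  int
	Healthy   bool
	LastStart time.Time
	Status    string // "UP", "DOWN", or "FAILED"
}
```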
4.4 Algorithm Overview
Key Algorithm: Restart Policy
- Start the service and record its PID.
- When it exits, increment its restart count.
- Sleep for the backoff interval, then restart, until the retry limit is reached (a Go sketch of this loop follows the complexity notes below).
Complexity Analysis:
- Time: O(1) per event
- Space: O(n) services
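A minimal per-service supervision loop, assuming one goroutine per service as the component table suggests; error handling and shutdown signaling are elided:

```go
package watchdog

import (
	"log"
	"os/exec"
	"time"
)

// supervise runs one service, restarting it on exit with exponential backoff
// until the restart limit is hit. Real code would also reset the counter after
// a period of stable running, honor shutdown requests, and run health checks.
func supervise(name string, argv []string, maxRestarts int) {
	backoff := time.Second
	for restarts := 0; ; restarts++ {
		cmd := exec.Command(argv[0], argv[1:]...)
		if err := cmd.Start(); err != nil {
			log.Printf("%s: start failed: %v", name, err)
		} else {
			log.Printf("%s: UP (pid %d, restarts %d)", name, cmd.Process.Pid, restarts)
			err := cmd.Wait() // also reaps the child, so no zombie is left
			log.Printf("%s: exited: %v", name, err)
		}
		if restarts >= maxRestarts {
			log.Printf("%s: DOWN (restarts %d, failed)", name, restarts)
			return
		}
		time.Sleep(backoff)
		backoff *= 2 // exponential backoff to avoid a restart storm
		if backoff > time.Minute {
			backoff = time.Minute // cap the delay so recovery is not too slow
		}
	}
}
```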
5. Implementation Guide
5.1 Development Environment Setup
Verify your toolchain with go version (Go 1.20 or newer is assumed by some of the sketches below).
5.2 Project Structure
project-root/
├── cmd/watchdog/main.go
├── services.yaml
└── README.md
5.3 The Core Question You’re Answering
“How do I keep critical services running, and shut them down safely when I need to?”
5.4 Concepts You Must Understand First
Stop and research these before coding (a process-group sketch follows this list):
- Process groups and signals
- wait() and reaping
- Backoff strategies
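To make signal escalation reach the whole service, including any children it spawns, the usual Linux approach is to put each service in its own process group and signal the group. A minimal sketch, assuming a Linux target:

```go
package watchdog

import (
	"os/exec"
	"syscall"
)

// startInGroup launches a service in its own process group so that signals
// can be delivered to the whole tree (the service and any children it forks).
func startInGroup(argv []string) (*exec.Cmd, error) {
	cmd := exec.Command(argv[0], argv[1:]...)
	cmd.SysProcAttr = &syscall.SysProcAttr{Setpgid: true} // new group, pgid = child pid
	if err := cmd.Start(); err != nil {
		return nil, err
	}
	return cmd, nil
}

// signalGroup sends sig to every process in the child's group.
// A negative PID targets the process group; see kill(2).
func signalGroup(cmd *exec.Cmd, sig syscall.Signal) error {
	return syscall.Kill(-cmd.Process.Pid, sig)
}
```

Calling signalGroup(cmd, syscall.SIGTERM) and, after the grace period, signalGroup(cmd, syscall.SIGKILL) mirrors the escalation sketched in section 2, but for the whole tree.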
5.5 Questions to Guide Your Design
Before implementing, think through these:
- How many restarts should be allowed?
- What is a reasonable escalation timeout?
- How will you define health checks?
5.6 Thinking Exercise
Simulate a crash loop
Create a script that exits immediately. Observe how restart policies prevent tight loops.
5.7 The Interview Questions They’ll Ask
Prepare to answer these:
- “How do you avoid restart storms?”
- “What does graceful shutdown mean?”
- “How do you detect a hung service?”
5.8 Hints in Layers
Hint 1: Start with one service. Get a single child starting, stopping, and restarting cleanly before generalizing to multiple services.
Hint 2: Use context for timeouts. A context deadline simplifies escalation logic; see the Go 1.20+ sketch after these hints.
Hint 3: Track restarts. Persist restart counts and last start times so the policy can tell a crash loop from an occasional failure.
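One way to let the standard library do the escalation bookkeeping, assuming Go 1.20 or newer (which added exec.Cmd's Cancel and WaitDelay fields):

```go
package watchdog

import (
	"context"
	"os/exec"
	"syscall"
	"time"
)

// startWithEscalation wires SIGTERM/SIGKILL escalation into os/exec: when ctx
// is canceled, Cancel sends SIGTERM, and if the process is still alive after
// WaitDelay, Wait kills it forcibly.
func startWithEscalation(ctx context.Context, argv []string, grace time.Duration) (*exec.Cmd, error) {
	cmd := exec.CommandContext(ctx, argv[0], argv[1:]...)
	cmd.Cancel = func() error {
		return cmd.Process.Signal(syscall.SIGTERM) // graceful request first
	}
	cmd.WaitDelay = grace // escalate to SIGKILL after this much patience
	return cmd, cmd.Start()
}
```

With this wiring, canceling ctx asks the service to stop, and Wait handles the forced kill if the grace period expires.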
5.9 Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Daemon management | “The Linux Programming Interface” (TLPI) | Ch. 37 |
| Process groups | “Advanced Programming in the UNIX Environment” (APUE) | Ch. 9 |
| Signals | “Linux System Programming” | Ch. 6 |
5.10 Implementation Phases
Phase 1: Foundation (4-5 days)
Goals:
- Start and stop services reliably.
Tasks:
- Parse config.
- Launch a service and track PID.
Checkpoint: Service starts and stops cleanly.
Phase 2: Core Functionality (1 week)
Goals:
- Add monitoring and restart logic.
Tasks:
- Detect exits.
- Implement backoff and limits.
Checkpoint: Crash loop is contained.
Phase 3: Polish & Edge Cases (3-4 days)
Goals:
- Add health checks and logs.
Tasks:
- Run health probes.
- Log events and status.
Checkpoint: Status output shows health and restarts.
5.11 Key Implementation Decisions
| Decision | Options | Recommendation | Rationale |
|---|---|---|---|
| Health check | command vs HTTP | command | Works locally; see sketch below |
| Restart policy | fixed vs backoff | backoff | Prevent storms |
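Following the table's recommendation, a command-based probe can be as small as the sketch below; the probe command itself comes from the service's config (an assumption here):

```go
package watchdog

import (
	"context"
	"os/exec"
	"time"
)

// checkHealth runs a probe command with a deadline; a non-zero exit or a
// timeout both count as unhealthy.
func checkHealth(probe []string, timeout time.Duration) bool {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	cmd := exec.CommandContext(ctx, probe[0], probe[1:]...)
	return cmd.Run() == nil
}
```

A probe that hangs also counts as unhealthy, because the context deadline kills it and Run returns an error.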
6. Testing Strategy
6.1 Test Categories
| Category | Purpose | Examples |
|---|---|---|
| Restart | Crash loop | service exits immediately |
| Escalation | SIGTERM timeout | unresponsive service |
| Health | Failing check | returns non-zero |
6.2 Critical Test Cases
- Service exits -> restart occurs.
- Service ignores SIGTERM -> SIGKILL issued.
- Restart limit reached -> service marked failed.
6.3 Test Data
Service: /bin/sleep 1
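/bin/sleep 1 exercises the happy path. The escalation case also needs a child that ignores SIGTERM; a hypothetical test sketch, reusing the gracefulStop helper from section 2:

```go
package watchdog

import (
	"os/exec"
	"testing"
	"time"
)

// TestEscalationKillsStubbornChild starts a shell that ignores SIGTERM and
// verifies that escalation still terminates it within the grace period.
func TestEscalationKillsStubbornChild(t *testing.T) {
	cmd := exec.Command("sh", "-c", `trap "" TERM; sleep 60`)
	if err := cmd.Start(); err != nil {
		t.Fatal(err)
	}
	start := time.Now()
	_ = gracefulStop(cmd, 2*time.Second) // SIGTERM is ignored, SIGKILL follows
	if elapsed := time.Since(start); elapsed > 5*time.Second {
		t.Fatalf("escalation took too long: %v", elapsed)
	}
}
```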
7. Common Pitfalls & Debugging
7.1 Frequent Mistakes
| Pitfall | Symptom | Solution |
|---|---|---|
| Missing wait | Zombies | Reap children |
| No backoff | Restart storm | Add delay |
| Killing only parent | Orphaned children | Use process groups |
7.2 Debugging Strategies
- Log every state transition.
- Use ps -o pid,ppid,stat to confirm cleanup (no Z/zombie entries, no orphaned children).
7.3 Performance Traps
Aggressive health checks can add load; space them out.
8. Extensions & Challenges
8.1 Beginner Extensions
- Add a simple CLI status command.
- Add restart counters per service.
8.2 Intermediate Extensions
- Add HTTP health checks (a minimal probe sketch follows this list).
- Add rolling restarts.
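For the HTTP variant, a small probe that treats any 2xx response within the timeout as healthy might look like this; a /healthz path is a common convention, not something the watchdog mandates:

```go
package watchdog

import (
	"net/http"
	"time"
)

// httpHealthy probes a URL (e.g. http://localhost:8080/healthz) and reports
// whether it answered with a 2xx status before the timeout.
func httpHealthy(url string, timeout time.Duration) bool {
	client := &http.Client{Timeout: timeout}
	resp, err := client.Get(url)
	if err != nil {
		return false
	}
	defer resp.Body.Close()
	return resp.StatusCode >= 200 && resp.StatusCode < 300
}
```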
8.3 Advanced Extensions
- Integrate with systemd unit files.
- Add metrics endpoint for Prometheus.
9. Real-World Connections
9.1 Industry Applications
- Process supervision in production systems and container runtimes.
9.2 Related Open Source Projects
- systemd: https://systemd.io
- supervisord: http://supervisord.org
9.3 Interview Relevance
- Supervision and graceful shutdown are core infra topics.
10. Resources
10.1 Essential Reading
- signal(7): man 7 signal
- wait(2): man 2 wait
10.2 Video Resources
- Service supervision talks (search “process supervisor Linux”)
10.3 Tools & Documentation
- systemd.service(5): man 5 systemd.service
10.4 Related Projects in This Series
- Process Debugging Toolkit: use it for watchdog diagnosis.
11. Self-Assessment Checklist
11.1 Understanding
- I can explain supervision and backoff.
- I can describe signal escalation.
- I can implement health checks.
11.2 Implementation
- Services restart on crash.
- Escalation works reliably.
- Status output is clear.
11.3 Growth
- I can adapt the watchdog to real services.
- I can describe this design in interviews.
12. Submission / Completion Criteria
Minimum Viable Completion:
- Supervisor starts, monitors, and restarts a service.
Full Completion:
- Add escalation and health checks.
Excellence (Going Above & Beyond):
- Add metrics endpoint and systemd integration.
This guide was generated from LINUX_SYSTEM_TOOLS_MASTERY.md. For the complete learning path, see the parent directory.