Project 16: Device Health Monitor and Self-Healing Service
Build a watchdog-style service that monitors system health and recovers from failures safely.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Intermediate |
| Time Estimate | 1–2 weekends |
| Main Programming Language | Python |
| Alternative Programming Languages | Go, Rust, Bash |
| Coolness Level | High |
| Business Potential | High |
| Prerequisites | systemd basics, Linux monitoring |
| Key Topics | Health checks, restart policies, backoff, logging |
1. Learning Objectives
By completing this project, you will:
- Define health metrics for a headless device.
- Implement a monitoring loop and thresholds.
- Trigger controlled restarts without reboot loops.
- Log and report recovery actions.
2. All Theory Needed (Per-Concept Breakdown)
Concept 1: Health Checks, Thresholds, and Self-Healing Strategies
Fundamentals
A headless device must detect when it is unhealthy and recover automatically. Health checks are measurements of system state such as CPU load, memory usage, disk space, temperature, and application liveness. A self-healing service uses these checks to decide when to restart an application or reboot the system. If thresholds are too aggressive, the device will thrash; if too lax, failures persist. The goal is a balanced, deterministic policy with clear logging and backoff.
Deep Dive into the concept
Health monitoring starts with defining what “healthy” means. For a device like a Pi Zero 2 W, core metrics include CPU load, memory usage, disk usage, temperature, and network connectivity. Application-specific checks include process liveness (is the sensor app running?) and output freshness (has it produced data in the last N minutes?). Each metric needs a threshold and a time window. For example, CPU > 90% for 5 minutes may be unhealthy, but short spikes are normal. This is why you need hysteresis and time-based thresholds.
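The time-windowed threshold with hysteresis described above can be sketched as follows (class and attribute names, and the 90%/75% limits, are illustrative choices, not a prescribed API):

```python
import time

class ThresholdCheck:
    """Flags unhealthy only after a metric stays above its limit for a
    sustained window, and recovers at a lower limit (hysteresis)."""

    def __init__(self, high=90.0, low=75.0, window_s=300):
        self.high = high          # enter-unhealthy limit
        self.low = low            # exit-unhealthy limit (hysteresis gap)
        self.window_s = window_s  # how long the breach must persist
        self.breach_start = None
        self.unhealthy = False

    def update(self, value, now=None):
        now = time.monotonic() if now is None else now
        if self.unhealthy:
            # Only recover once the metric drops below the *lower* limit,
            # so the state does not flap around a single threshold.
            if value < self.low:
                self.unhealthy = False
                self.breach_start = None
        elif value > self.high:
            if self.breach_start is None:
                self.breach_start = now
            elif now - self.breach_start >= self.window_s:
                self.unhealthy = True
        else:
            self.breach_start = None  # short spike: reset the window
        return self.unhealthy
```

Passing `now` explicitly makes the policy deterministic under test; in production you would simply call `update(value)` each sampling interval.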
Self-healing strategies include restarting a process, restarting a service, or rebooting the device. systemd already provides restart policies (Restart=on-failure) and watchdog support (WatchdogSec). You can integrate with systemd by defining a watchdog service that reports health and triggers actions when thresholds are exceeded. But you must also prevent restart loops: if a service keeps crashing, you need a cooldown and a maximum number of restarts before escalating to a reboot or safe mode.
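A hedged sketch of how these directives fit together in a unit file (the unit name and paths are illustrative; note that a service declaring `WatchdogSec=` must also send `WATCHDOG=1` keep-alives via `sd_notify`, or systemd will treat it as hung):

```ini
# /etc/systemd/system/sensor_app.service  (name is illustrative)
[Unit]
Description=Sensor application with supervised restarts
# Stop retrying after 3 failures within 10 minutes (prevents loops).
StartLimitIntervalSec=600
StartLimitBurst=3

[Service]
ExecStart=/usr/local/bin/sensor_app
Restart=on-failure
RestartSec=5
# The service must ping sd_notify("WATCHDOG=1") within this interval.
WatchdogSec=30

[Install]
WantedBy=multi-user.target
```

The `StartLimit*` settings are systemd's built-in answer to restart loops; your own monitor layers application-level checks (staleness, temperature) on top of this.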
A robust health monitor should implement a rate limiter. For example, allow at most 3 restarts per hour. If the threshold is exceeded, the service should stop attempting restarts and log a critical error. This prevents endless loops that wear storage and reduce uptime. Logging is essential: every recovery action should be recorded with a timestamp, metric values, and reason. This log becomes the forensic record when diagnosing failures.
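The 3-restarts-per-hour rate limiter described above might be implemented as a sliding window (the class and method names are assumptions for illustration):

```python
import time

class RestartLimiter:
    """Allows at most `max_restarts` within `period_s`; beyond that,
    the caller should stop restarting and escalate (critical log,
    safe mode, or reboot)."""

    def __init__(self, max_restarts=3, period_s=3600):
        self.max_restarts = max_restarts
        self.period_s = period_s
        self.history = []  # monotonic timestamps of recent restarts

    def allow(self, now=None):
        now = time.monotonic() if now is None else now
        # Drop restarts that have fallen out of the sliding window.
        self.history = [t for t in self.history if now - t < self.period_s]
        if len(self.history) >= self.max_restarts:
            return False  # budget exhausted: escalate instead of restarting
        self.history.append(now)
        return True
```

Every `allow()` decision, true or false, is worth logging with the metric values that triggered it, so the forensic record stays complete.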
Because this is a headless device, you should also expose a health summary via CLI or HTTP. This summary should include current metrics, last recovery action, and uptime. The health monitor must run reliably at boot; systemd ensures this by restarting it if it crashes. However, if the monitor itself is broken, you need a fallback such as systemd’s own watchdog or a hardware watchdog (not required for this project).
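One minimal way to expose the health summary over HTTP is the standard library's `http.server`; the state dictionary, route, and port below are placeholders that the monitoring loop would keep updated:

```python
import json
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

START = time.monotonic()
# Shared state the monitoring loop updates; these values are placeholders.
STATE = {"service": "sensor_app", "status": "OK", "recoveries": 0}

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/health":
            self.send_error(404)
            return
        report = {**STATE, "uptime_s": int(time.monotonic() - START)}
        body = json.dumps(report).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the monitor's own stdout clean

# To serve on the device:
# HTTPServer(("0.0.0.0", 8080), HealthHandler).serve_forever()
```

For a CLI summary, the same `report` dictionary can simply be printed; either way the monitor and the reporter share one source of truth.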
How this fits into the projects
This concept is used in §3 and §5.10 and ties into Project 13 (boot readiness) and the capstone project.
Definitions & key terms
- Health check: Measurement used to determine system health.
- Threshold: Limit that triggers action when exceeded.
- Backoff: Delay between repeated recovery attempts.
- Watchdog: Mechanism to detect unresponsive services.
Mental model diagram (ASCII)
Metrics -> Health Check -> Decision -> Restart/Log
How it works (step-by-step, with invariants and failure modes)
- Collect metrics on interval.
- Compare to thresholds.
- If unhealthy, attempt recovery with backoff.
- Log action and update health state.
Failure modes:
- Threshold too low -> false positives.
- No backoff -> restart loops.
- Missing logs -> no diagnosis.
Minimal concrete example
if cpu > 90 and duration > 300:
    restart("sensor_app")
Common misconceptions
- “Restarting is always safe.” It can cause data loss.
- “Health is only CPU usage.” It includes liveness and data freshness.
Check-your-understanding questions
- Why do you need hysteresis in health checks?
- What is the risk of unlimited restart attempts?
- How does systemd watchdog help?
Check-your-understanding answers
- To prevent flapping between healthy/unhealthy states.
- It can create endless loops and hide root cause.
- It restarts unresponsive services automatically.
Real-world applications
- Remote monitoring devices, industrial IoT, kiosks.
Where you’ll apply it
- This project: §3.2, §5.10.
- Other projects: Project 13, Project 17.
References
- Google SRE book, health checks chapter
- systemd watchdog documentation
Key insights
Health monitoring is a policy decision; the thresholds define your system behavior.
Summary
A self-healing service must monitor, decide, and recover with backoff and clear logs.
Homework/Exercises to practice the concept
- Define thresholds for CPU, memory, disk.
- Simulate a crash and confirm one restart.
- Implement a restart cooldown.
Solutions to the homework/exercises
- Use realistic thresholds (CPU > 90% for 5 min).
- Kill the process and observe restart.
- Add a timestamp check before restarting again.
3. Project Specification
3.1 What You Will Build
A health monitor service that checks system metrics and restarts a failing app with backoff.
3.2 Functional Requirements
- Collect CPU, memory, disk, temperature metrics.
- Detect app failure or staleness.
- Restart services with cooldown.
- Log every recovery action.
3.3 Non-Functional Requirements
- Performance: Health check loop < 1% CPU.
- Reliability: No restart loops.
- Usability: Clear health status output.
3.4 Example Usage / Output
$ ./health_monitor
Service: sensor_app Status: OK Uptime: 2h14m
CPU: 22% Memory: 38% Temp: 51 C
Recovery actions taken: 0
3.5 Data Formats / Schemas / Protocols
Health report JSON:
{"service":"sensor_app","status":"OK","cpu":22,"mem":38,"temp":51,"recoveries":0}
3.6 Edge Cases
- Service crashes repeatedly.
- Disk full prevents log writing.
- Metrics unavailable due to permissions.
3.7 Real World Outcome
The device recovers from failures without manual intervention and avoids restart loops.
3.7.1 How to Run (Copy/Paste)
python3 health_monitor.py --service sensor_app --interval 10
3.7.2 Golden Path Demo (Deterministic)
export FIXED_TIME="2026-01-01T14:00:00Z"
python3 health_monitor.py --simulate --healthy
Expected output:
[2026-01-01T14:00:00Z] Status OK, recoveries=0
3.7.3 Failure Demo (Deterministic)
python3 health_monitor.py --simulate --crash
Expected output:
[ERROR] Service crash detected, restart initiated
Exit code: 161
3.7.4 CLI Exit Codes
- 0: Success
- 160: Metrics unavailable
- 161: Restart failed
4. Solution Architecture
4.1 High-Level Design
Metrics -> Health Policy -> Recovery Actions -> Logs
4.2 Key Components
| Component | Responsibility | Key Decisions |
|---|---|---|
| Metrics Collector | Read /proc and sensors | Sampling interval |
| Policy Engine | Evaluate thresholds | Hysteresis |
| Recovery | Restart services | Backoff strategy |
| Reporter | Output health status | JSON vs text |
4.3 Data Structures (No Full Code)
last_restart = 0
restart_count = 0
4.4 Algorithm Overview
Key Algorithm: Backoff Restart
- Detect failure.
- Check cooldown timer.
- Restart service or escalate.
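The backoff-restart steps above can be sketched like this; the `systemctl` call and function names are illustrative, and the restart action is injectable so the policy can be tested without systemd:

```python
import subprocess
import time

def backoff_delay(attempt, base_s=5.0, cap_s=300.0):
    """Exponential backoff with a cap: 5s, 10s, 20s, ... up to 5 minutes."""
    return min(base_s * (2 ** attempt), cap_s)

def recover(service, attempt, max_attempts=5, restart=None, sleep=time.sleep):
    """One pass of the backoff-restart algorithm: check the budget,
    wait out the cooldown, then restart or escalate."""
    if attempt >= max_attempts:
        return "escalate"  # budget spent: reboot, safe mode, or alert
    sleep(backoff_delay(attempt))
    if restart is None:
        # Default action shells out to systemd (assumed available).
        restart = lambda s: subprocess.run(
            ["systemctl", "restart", s], check=False)
    restart(service)
    return "restarted"
```

Capping the delay keeps worst-case recovery latency bounded while still spacing out repeated attempts enough to break tight loops.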
Complexity Analysis:
- Time: O(1) per loop
- Space: O(1)
5. Implementation Guide
5.1 Development Environment Setup
sudo apt-get install -y systemd
5.2 Project Structure
project-root/
├── health_monitor.py
├── policy.py
└── README.md
5.3 The Core Question You’re Answering
“How do you keep a headless device healthy without human supervision?”
5.4 Concepts You Must Understand First
- Health check thresholds.
- systemd service supervision.
- Backoff and cooldown logic.
5.5 Questions to Guide Your Design
- When do you restart vs reboot?
- What metrics define “healthy” for your device?
5.6 Thinking Exercise
Define a failure policy for a sensor service and simulate it.
5.7 The Interview Questions They’ll Ask
- What is the risk of auto-restart loops?
- How does systemd supervise services?
- When should you reboot a device?
5.8 Hints in Layers
Hint 1: Start with CPU and memory metrics.
Hint 2: Add a restart cooldown.
Hint 3: Log every recovery action.
5.9 Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Reliability | Site Reliability Engineering | Ch. 5 |
| Processes | The Linux Programming Interface | Ch. 6 |
5.10 Implementation Phases
Phase 1: Metrics (3 hours)
- Gather CPU, memory, disk.
Phase 2: Policy (4 hours)
- Implement thresholds and cooldown.
Phase 3: Recovery (3 hours)
- Restart services and log actions.
5.11 Key Implementation Decisions
| Decision | Options | Recommendation | Rationale |
|---|---|---|---|
| Recovery | Restart / Reboot | Restart first | Less disruptive |
| Cooldown | Fixed / Exponential | Exponential | Avoid loops |
6. Testing Strategy
6.1 Test Categories
| Category | Purpose | Examples |
|---|---|---|
| Unit Tests | Policy evaluation | Threshold crossing |
| Integration Tests | systemd restart | Service crash |
| Edge Case Tests | Disk full | Log failure |
6.2 Critical Test Cases
- Single crash triggers one restart.
- Multiple crashes trigger cooldown.
- Metrics missing -> exit code 160.
6.3 Test Data
CPU=95% for 6 min -> restart
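This test datum can be encoded as a small deterministic check (the `should_restart` helper is hypothetical, written here only to make the threshold-plus-window rule concrete):

```python
def should_restart(samples, limit=90.0, window_s=300, interval_s=60):
    """True when every sample in the trailing window exceeds the limit.
    `samples` is the list of recent CPU readings, one per interval."""
    needed = window_s // interval_s  # samples required to fill the window
    recent = samples[-needed:]
    return len(recent) >= needed and all(v > limit for v in recent)

# Test data from 6.3: CPU at 95% for 6 minutes (one sample per minute).
assert should_restart([95] * 6) is True
# A short 2-minute spike must not trigger a restart.
assert should_restart([30, 30, 30, 30, 95, 95]) is False
```

Writing the policy as a pure function over a sample list is what makes this kind of table-driven unit test possible.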
7. Common Pitfalls & Debugging
7.1 Frequent Mistakes
| Pitfall | Symptom | Solution |
|---|---|---|
| No cooldown | Restart loop | Add backoff |
| Missing permissions | Metrics unavailable | Use /proc or sudo |
| Silent failures | No logs | Log every action |
7.2 Debugging Strategies
- Use systemctl status to verify restarts.
- Inspect logs for timestamps.
7.3 Performance Traps
- Overly frequent checks waste CPU.
8. Extensions & Challenges
8.1 Beginner Extensions
- Add temperature threshold alerts.
8.2 Intermediate Extensions
- Add remote health report via MQTT.
8.3 Advanced Extensions
- Integrate hardware watchdog.
9. Real-World Connections
9.1 Industry Applications
- Industrial controllers, field sensors, kiosks.
9.2 Related Open Source Projects
- systemd watchdog examples.
9.3 Interview Relevance
- Reliability engineering and recovery logic.
10. Resources
10.1 Essential Reading
- SRE book health checks chapter.
10.2 Video Resources
- systemd watchdog tutorials.
10.3 Tools & Documentation
- systemctl and journalctl documentation.
10.4 Related Projects in This Series
- Previous: Project 15
- Next: Project 17
11. Self-Assessment Checklist
11.1 Understanding
- I can explain health check thresholds and backoff.
- I can explain systemd restart policies.
11.2 Implementation
- Health monitor runs and logs recoveries.
- Restart loops are prevented.
11.3 Growth
- I can discuss self-healing systems in interviews.
12. Submission / Completion Criteria
Minimum Viable Completion:
- Health monitor reads metrics and logs status.
Full Completion:
- Automatic restart with cooldown and logging.
Excellence (Going Above & Beyond):
- Hardware watchdog integration and remote reporting.