Project 16: Device Health Monitor and Self-Healing Service

Build a watchdog-style service that monitors system health and recovers from failures safely.

Quick Reference

| Attribute | Value |
|---|---|
| Difficulty | Intermediate |
| Time Estimate | 1–2 weekends |
| Main Programming Language | Python |
| Alternative Programming Languages | Go, Rust, Bash |
| Coolness Level | High |
| Business Potential | High |
| Prerequisites | systemd basics, Linux monitoring |
| Key Topics | Health checks, restart policies, backoff, logging |

1. Learning Objectives

By completing this project, you will:

  1. Define health metrics for a headless device.
  2. Implement a monitoring loop and thresholds.
  3. Trigger controlled restarts without reboot loops.
  4. Log and report recovery actions.

2. All Theory Needed (Per-Concept Breakdown)

Concept 1: Health Checks, Thresholds, and Self-Healing Strategies

Fundamentals

A headless device must detect when it is unhealthy and recover automatically. Health checks are measurements of system state such as CPU load, memory usage, disk space, temperature, and application liveness. A self-healing service uses these checks to decide when to restart an application or reboot the system. If thresholds are too aggressive, the device will thrash; if too lax, failures persist. The goal is a balanced, deterministic policy with clear logging and backoff.

Deep Dive into the concept

Health monitoring starts with defining what “healthy” means. For a device like a Pi Zero 2 W, core metrics include CPU load, memory usage, disk usage, temperature, and network connectivity. Application-specific checks include process liveness (is the sensor app running?) and output freshness (has it produced data in the last N minutes?). Each metric needs a threshold and a time window. For example, CPU > 90% for 5 minutes may be unhealthy, but short spikes are normal. This is why you need hysteresis and time-based thresholds.
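The time-window and hysteresis logic described above can be sketched as a small evaluator. This is illustrative only; the class name, thresholds, and `clear` parameter are not part of any prescribed API:

```python
class WindowedThreshold:
    """Flag unhealthy only after `limit` is exceeded for `window_s` seconds;
    recovery requires dropping below `clear` (hysteresis).
    Illustrative sketch, not a required interface."""

    def __init__(self, limit, window_s, clear=None):
        self.limit = limit
        self.window_s = window_s
        self.clear = clear if clear is not None else limit  # clear < limit gives hysteresis
        self.breach_start = None
        self.unhealthy = False

    def update(self, value, now):
        if value > self.limit:
            if self.breach_start is None:
                self.breach_start = now          # start of a sustained breach
            if now - self.breach_start >= self.window_s:
                self.unhealthy = True            # breach lasted the full window
        elif value < self.clear:
            self.breach_start = None             # fully recovered
            self.unhealthy = False
        # values between clear and limit change nothing: that is the hysteresis band
        return self.unhealthy
```

With `limit=90`, `window_s=300`, and `clear=80`, a 95% CPU spike lasting 4 minutes stays "healthy", while a reading of 85% after an unhealthy period keeps the unhealthy flag set until CPU drops below 80%.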

Self-healing strategies include restarting a process, restarting a service, or rebooting the device. systemd already provides restart policies (Restart=on-failure) and watchdog support (WatchdogSec). You can integrate with systemd by defining a watchdog service that reports health and triggers actions when thresholds are exceeded. But you must also prevent restart loops: if a service keeps crashing, you need a cooldown and a maximum number of restarts before escalating to a reboot or safe mode.

A robust health monitor should implement a rate limiter. For example, allow at most 3 restarts per hour. If the threshold is exceeded, the service should stop attempting restarts and log a critical error. This prevents endless loops that wear storage and reduce uptime. Logging is essential: every recovery action should be recorded with a timestamp, metric values, and reason. This log becomes the forensic record when diagnosing failures.
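The "at most 3 restarts per hour" rule can be implemented as a sliding-window rate limiter. One possible sketch (class and parameter names are illustrative):

```python
from collections import deque


class RestartLimiter:
    """Permit at most `max_restarts` within any `period_s` window.
    Illustrative sketch of the rate-limiting policy described above."""

    def __init__(self, max_restarts=3, period_s=3600):
        self.max_restarts = max_restarts
        self.period_s = period_s
        self.history = deque()  # timestamps of granted restarts

    def allow(self, now):
        # drop restarts that have aged out of the window
        while self.history and now - self.history[0] >= self.period_s:
            self.history.popleft()
        if len(self.history) >= self.max_restarts:
            return False  # caller should log a critical error and stop retrying
        self.history.append(now)
        return True
```

When `allow` returns False, the monitor should log the refusal with the metric values that triggered it, so the forensic record shows both the failures and the decision not to keep restarting.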

Because this is a headless device, you should also expose a health summary via CLI or HTTP. This summary should include current metrics, last recovery action, and uptime. The health monitor must run reliably at boot; systemd ensures this by restarting it if it crashes. However, if the monitor itself is broken, you need a fallback such as systemd’s own watchdog or a hardware watchdog (not required for this project).
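The HTTP health summary can be served with nothing but the standard library. A minimal sketch; the `STATE` contents and the `serve` helper are assumptions for illustration, not a prescribed design:

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Illustrative shared state; a real monitor would update this on each check loop.
STATE = {"service": "sensor_app", "status": "OK", "recoveries": 0}


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = json.dumps(STATE).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):
        pass  # keep per-request logging out of stdout


def serve(port=0):
    """Serve the health summary from a daemon thread; port 0 picks a free port."""
    srv = HTTPServer(("127.0.0.1", port), HealthHandler)
    threading.Thread(target=srv.serve_forever, daemon=True).start()
    return srv
```

Binding to 127.0.0.1 keeps the endpoint local; expose it on the LAN interface only if you actually need remote access from another host.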

How this fits into the projects

This concept is used in §3 and §5.10 and ties into Project 13 (boot readiness) and the capstone project.

Definitions & key terms

  • Health check: Measurement used to determine system health.
  • Threshold: Limit that triggers action when exceeded.
  • Backoff: Delay between repeated recovery attempts.
  • Watchdog: Mechanism to detect unresponsive services.

Mental model diagram (ASCII)

Metrics -> Health Check -> Decision -> Restart/Log

How it works (step-by-step, with invariants and failure modes)

  1. Collect metrics on interval.
  2. Compare to thresholds.
  3. If unhealthy, attempt recovery with backoff.
  4. Log action and update health state.

Failure modes:

  • Threshold too low -> false positives.
  • No backoff -> restart loops.
  • Missing logs -> no diagnosis.

Minimal concrete example

# unhealthy: CPU above 90% sustained for more than 300 s (5 min)
if cpu > 90 and duration > 300:
    restart("sensor_app")

Common misconceptions

  • “Restarting is always safe.” It can cause data loss.
  • “Health is only CPU usage.” It includes liveness and data freshness.

Check-your-understanding questions

  1. Why do you need hysteresis in health checks?
  2. What is the risk of unlimited restart attempts?
  3. How does systemd watchdog help?

Check-your-understanding answers

  1. To prevent flapping between healthy/unhealthy states.
  2. It can create endless loops and hide root cause.
  3. It restarts unresponsive services automatically.

Real-world applications

  • Remote monitoring devices, industrial IoT, kiosks.

Where you’ll apply it

You’ll apply this concept in §3 (Project Specification) and §5.10 (Implementation Phases).
References

  • Google SRE book, health checks chapter
  • systemd watchdog documentation

Key insights

Health monitoring is a policy decision; the thresholds define your system behavior.

Summary

A self-healing service must monitor, decide, and recover with backoff and clear logs.

Homework/Exercises to practice the concept

  1. Define thresholds for CPU, memory, disk.
  2. Simulate a crash and confirm one restart.
  3. Implement a restart cooldown.

Solutions to the homework/exercises

  1. Use realistic thresholds (CPU > 90% for 5 min).
  2. Kill the process and observe restart.
  3. Add a timestamp check before restarting again.
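The timestamp check from solution 3 might look like the following; the 10-minute cooldown is an assumed value, and a real monitor would keep the state on an object rather than in module globals:

```python
import time

COOLDOWN_S = 600       # assumed 10-minute cooldown
last_restart = 0.0     # timestamp of the last granted restart


def may_restart(now=None):
    """Refuse a restart attempt that falls inside the cooldown window.
    Sketch of the homework solution; not production-ready state handling."""
    global last_restart
    now = now if now is not None else time.time()
    if now - last_restart < COOLDOWN_S:
        return False
    last_restart = now
    return True
```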

3. Project Specification

3.1 What You Will Build

A health monitor service that checks system metrics and restarts a failing app with backoff.

3.2 Functional Requirements

  1. Collect CPU, memory, disk, temperature metrics.
  2. Detect app failure or staleness.
  3. Restart services with cooldown.
  4. Log every recovery action.
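Staleness detection (requirement 2) can be built on the output file's modification time. A sketch; `output_fresh` is a hypothetical helper name, and the 5-minute default is an assumption:

```python
import os
import time


def output_fresh(path, max_age_s=300, now=None):
    """True if `path` was modified within the last `max_age_s` seconds.
    A missing or unreadable file counts as stale."""
    try:
        mtime = os.path.getmtime(path)
    except OSError:
        return False
    now = now if now is not None else time.time()
    return (now - mtime) <= max_age_s
```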

3.3 Non-Functional Requirements

  • Performance: Health check loop < 1% CPU.
  • Reliability: No restart loops.
  • Usability: Clear health status output.

3.4 Example Usage / Output

$ ./health_monitor
Service: sensor_app  Status: OK  Uptime: 2h14m
CPU: 22%  Memory: 38%  Temp: 51 C
Recovery actions taken: 0

3.5 Data Formats / Schemas / Protocols

Health report JSON:

{"service":"sensor_app","status":"OK","cpu":22,"mem":38,"temp":51,"recoveries":0}
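The report above can be produced with `json.dumps` and compact separators. A sketch; the function name and argument order are illustrative:

```python
import json


def health_report(service, status, cpu, mem, temp, recoveries):
    """Serialize a health report matching the schema above (no whitespace)."""
    return json.dumps(
        {"service": service, "status": status, "cpu": cpu,
         "mem": mem, "temp": temp, "recoveries": recoveries},
        separators=(",", ":"),  # compact output for logs and wire transfer
    )
```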

3.6 Edge Cases

  • Service crashes repeatedly.
  • Disk full prevents log writing.
  • Metrics unavailable due to permissions.

3.7 Real World Outcome

The device recovers from failures without manual intervention and avoids restart loops.

3.7.1 How to Run (Copy/Paste)

python3 health_monitor.py --service sensor_app --interval 10

3.7.2 Golden Path Demo (Deterministic)

export FIXED_TIME="2026-01-01T14:00:00Z"
python3 health_monitor.py --simulate --healthy

Expected output:

[2026-01-01T14:00:00Z] Status OK, recoveries=0

3.7.3 Failure Demo (Deterministic)

python3 health_monitor.py --simulate --crash

Expected output:

[ERROR] Service crash detected, restart initiated

Exit code: 161

3.7.4 CLI Exit Codes

  • 0: Success
  • 160: Metrics unavailable
  • 161: Restart failed

4. Solution Architecture

4.1 High-Level Design

Metrics -> Health Policy -> Recovery Actions -> Logs

4.2 Key Components

| Component | Responsibility | Key Decisions |
|---|---|---|
| Metrics Collector | Read /proc and sensors | Sampling interval |
| Policy Engine | Evaluate thresholds | Hysteresis |
| Recovery | Restart services | Backoff strategy |
| Reporter | Output health status | JSON vs text |

4.3 Data Structures (No Full Code)

last_restart = 0
restart_count = 0

4.4 Algorithm Overview

Key Algorithm: Backoff Restart

  1. Detect failure.
  2. Check cooldown timer.
  3. Restart service or escalate.
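The cooldown in step 2 can grow exponentially between attempts, matching the recommendation in §5.11. A sketch with assumed base and cap values:

```python
def backoff_delay(attempt, base_s=30.0, cap_s=1800.0):
    """Exponential backoff: 30 s, 60 s, 120 s, ... capped at 30 min.
    `base_s` and `cap_s` are assumed values; tune them for your device."""
    return min(base_s * (2 ** attempt), cap_s)
```

The cap matters: without it, a service that fails overnight could accumulate a multi-hour delay and stay down long after the underlying fault has cleared.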

Complexity Analysis:

  • Time: O(1) per loop
  • Space: O(1)

5. Implementation Guide

5.1 Development Environment Setup

sudo apt-get install -y systemd

5.2 Project Structure

project-root/
├── health_monitor.py
├── policy.py
└── README.md

5.3 The Core Question You’re Answering

“How do you keep a headless device healthy without human supervision?”

5.4 Concepts You Must Understand First

  1. Health check thresholds.
  2. systemd service supervision.
  3. Backoff and cooldown logic.

5.5 Questions to Guide Your Design

  1. When do you restart vs reboot?
  2. What metrics define “healthy” for your device?

5.6 Thinking Exercise

Define a failure policy for a sensor service and simulate it.

5.7 The Interview Questions They’ll Ask

  1. What is the risk of auto-restart loops?
  2. How does systemd supervise services?
  3. When should you reboot a device?

5.8 Hints in Layers

Hint 1: Start with CPU and memory metrics.

Hint 2: Add a restart cooldown.

Hint 3: Log every recovery action.

5.9 Books That Will Help

| Topic | Book | Chapter |
|---|---|---|
| Reliability | Site Reliability Engineering | Ch. 5 |
| Processes | The Linux Programming Interface | Ch. 6 |

5.10 Implementation Phases

Phase 1: Metrics (3 hours)

  • Gather CPU, memory, disk.
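On Linux, load and memory can be read straight from /proc with no extra dependencies. A Linux-only sketch (field names are as documented in proc(5); `MemAvailable` requires kernel 3.14+):

```python
def read_load_and_mem():
    """Return (1-min load average, memory-used percent) from /proc.
    Linux-only sketch for Phase 1."""
    with open("/proc/loadavg") as f:
        load1 = float(f.read().split()[0])
    info = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, value = line.split(":")
            info[key] = int(value.split()[0])  # values are in kB
    mem_pct = 100.0 * (1 - info["MemAvailable"] / info["MemTotal"])
    return load1, mem_pct
```

Using `MemAvailable` rather than `MemFree` avoids counting reclaimable page cache as "used", which would otherwise trip memory thresholds on a perfectly healthy system.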

Phase 2: Policy (4 hours)

  • Implement thresholds and cooldown.

Phase 3: Recovery (3 hours)

  • Restart services and log actions.
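Restarting through systemctl and logging the action might look like this; the `ctl` override exists only so the logic can be exercised off-device and is not part of the project spec:

```python
import logging
import subprocess


def restart_service(name, ctl=None):
    """Restart `name` via systemctl and log the outcome.
    `ctl` replaces the command list for off-device testing (hypothetical hook)."""
    cmd = ctl if ctl is not None else ["systemctl", "restart", name]
    result = subprocess.run(cmd, capture_output=True, text=True)
    ok = result.returncode == 0
    # every recovery action is logged with its outcome for later diagnosis
    logging.info("recovery: restart %s -> %s",
                 name, "ok" if ok else result.stderr.strip())
    return ok
```

On a real device the monitor needs permission to run systemctl, either by running as root or via a scoped sudoers/polkit rule for the one service it manages.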

5.11 Key Implementation Decisions

| Decision | Options | Recommendation | Rationale |
|---|---|---|---|
| Recovery | Restart / Reboot | Restart first | Less disruptive |
| Cooldown | Fixed / Exponential | Exponential | Avoid loops |


6. Testing Strategy

6.1 Test Categories

| Category | Purpose | Examples |
|---|---|---|
| Unit Tests | Policy evaluation | Threshold crossing |
| Integration Tests | systemd restart | Service crash |
| Edge Case Tests | Disk full | Log failure |

6.2 Critical Test Cases

  1. Single crash triggers one restart.
  2. Multiple crashes trigger cooldown.
  3. Metrics missing -> exit 160.

6.3 Test Data

CPU=95% for 6 min -> restart

7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

| Pitfall | Symptom | Solution |
|---|---|---|
| No cooldown | Restart loop | Add backoff |
| Missing permissions | Metrics unavailable | Use /proc or sudo |
| Silent failures | No logs | Log every action |

7.2 Debugging Strategies

  • Use systemctl status to verify restart.
  • Inspect logs for timestamps.

7.3 Performance Traps

  • Overly frequent checks waste CPU.

8. Extensions & Challenges

8.1 Beginner Extensions

  • Add temperature threshold alerts.

8.2 Intermediate Extensions

  • Add remote health report via MQTT.

8.3 Advanced Extensions

  • Integrate hardware watchdog.

9. Real-World Connections

9.1 Industry Applications

  • Industrial controllers, field sensors, kiosks.

9.2 Open Source Examples

  • systemd watchdog examples.

9.3 Interview Relevance

  • Reliability engineering and recovery logic.

10. Resources

10.1 Essential Reading

  • SRE book health checks chapter.

10.2 Video Resources

  • systemd watchdog tutorials.

10.3 Tools & Documentation

  • systemctl, journalctl documentation.

11. Self-Assessment Checklist

11.1 Understanding

  • I can explain health check thresholds and backoff.
  • I can explain systemd restart policies.

11.2 Implementation

  • Health monitor runs and logs recoveries.
  • Restart loops are prevented.

11.3 Growth

  • I can discuss self-healing systems in interviews.

12. Submission / Completion Criteria

Minimum Viable Completion:

  • Health monitor reads metrics and logs status.

Full Completion:

  • Automatic restart with cooldown and logging.

Excellence (Going Above & Beyond):

  • Hardware watchdog integration and remote reporting.