Project 16: Device Health Monitor and Self-Healing Service

Build a watchdog-style service that monitors system health and recovers from failures safely.

Quick Reference

| Attribute | Value |
|---|---|
| Difficulty | Intermediate |
| Time Estimate | 1–2 weekends |
| Main Programming Language | Python |
| Alternative Programming Languages | Go, Rust, Bash |
| Coolness Level | High |
| Business Potential | High |
| Prerequisites | systemd basics, Linux monitoring |
| Key Topics | Health checks, restart policies, backoff, logging |

1. Learning Objectives

By completing this project, you will:

  1. Define health metrics for a headless device.
  2. Implement a monitoring loop and thresholds.
  3. Trigger controlled restarts without reboot loops.
  4. Log and report recovery actions.

2. All Theory Needed (Per-Concept Breakdown)

Concept 1: Health Checks, Thresholds, and Self-Healing Strategies

Fundamentals

A headless device must detect when it is unhealthy and recover automatically. Health checks are measurements of system state such as CPU load, memory usage, disk space, temperature, and application liveness. A self-healing service uses these checks to decide when to restart an application or reboot the system. If thresholds are too aggressive, the device will thrash; if too lax, failures persist. The goal is a balanced, deterministic policy with clear logging and backoff.

Deep Dive into the concept

Health monitoring starts with defining what “healthy” means. For a device like a Pi Zero 2 W, core metrics include CPU load, memory usage, disk usage, temperature, and network connectivity. Application-specific checks include process liveness (is the sensor app running?) and output freshness (has it produced data in the last N minutes?). Each metric needs a threshold and a time window. For example, CPU > 90% for 5 minutes may be unhealthy, but short spikes are normal. This is why you need hysteresis and time-based thresholds.
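The time-window and hysteresis logic described above can be sketched as a small evaluator. This is illustrative only; the class name, thresholds, and `clear` parameter are not part of any prescribed API:

```python
class WindowedThreshold:
    """Flag unhealthy only after `limit` is exceeded for `window_s` seconds;
    recovery requires dropping below `clear` (hysteresis).
    Illustrative sketch, not a required interface."""

    def __init__(self, limit, window_s, clear=None):
        self.limit = limit
        self.window_s = window_s
        self.clear = clear if clear is not None else limit  # clear < limit gives hysteresis
        self.breach_start = None
        self.unhealthy = False

    def update(self, value, now):
        if value > self.limit:
            if self.breach_start is None:
                self.breach_start = now          # start of a sustained breach
            if now - self.breach_start >= self.window_s:
                self.unhealthy = True            # breach lasted the full window
        elif value < self.clear:
            self.breach_start = None             # fully recovered
            self.unhealthy = False
        # values between clear and limit change nothing: that is the hysteresis band
        return self.unhealthy
```

With `limit=90`, `window_s=300`, and `clear=80`, a 95% CPU spike lasting 4 minutes stays "healthy", while a reading of 85% after an unhealthy period keeps the unhealthy flag set until CPU drops below 80%.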

Self-healing strategies include restarting a process, restarting a service, or rebooting the device. systemd already provides restart policies (Restart=on-failure) and watchdog support (WatchdogSec). You can integrate with systemd by defining a watchdog service that reports health and triggers actions when thresholds are exceeded. But you must also prevent restart loops: if a service keeps crashing, you need a cooldown and a maximum number of restarts before escalating to a reboot or safe mode.

A robust health monitor should implement a rate limiter. For example, allow at most 3 restarts per hour. If the threshold is exceeded, the service should stop attempting restarts and log a critical error. This prevents endless loops that wear storage and reduce uptime. Logging is essential: every recovery action should be recorded with a timestamp, metric values, and reason. This log becomes the forensic record when diagnosing failures.
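The "at most 3 restarts per hour" rule can be implemented as a sliding-window rate limiter. One possible sketch (class and parameter names are illustrative):

```python
from collections import deque


class RestartLimiter:
    """Permit at most `max_restarts` within any `period_s` window.
    Illustrative sketch of the rate-limiting policy described above."""

    def __init__(self, max_restarts=3, period_s=3600):
        self.max_restarts = max_restarts
        self.period_s = period_s
        self.history = deque()  # timestamps of granted restarts

    def allow(self, now):
        # drop restarts that have aged out of the window
        while self.history and now - self.history[0] >= self.period_s:
            self.history.popleft()
        if len(self.history) >= self.max_restarts:
            return False  # caller should log a critical error and stop retrying
        self.history.append(now)
        return True
```

When `allow` returns False, the monitor should log the refusal with the metric values that triggered it, so the forensic record shows both the failures and the decision not to keep restarting.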

Because this is a headless device, you should also expose a health summary via CLI or HTTP. This summary should include current metrics, last recovery action, and uptime. The health monitor must run reliably at boot; systemd ensures this by restarting it if it crashes. However, if the monitor itself is broken, you need a fallback such as systemd’s own watchdog or a hardware watchdog (not required for this project).
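The HTTP health summary can be served with nothing but the standard library. A minimal sketch; the `STATE` contents and the `serve` helper are assumptions for illustration, not a prescribed design:

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Illustrative shared state; a real monitor would update this on each check loop.
STATE = {"service": "sensor_app", "status": "OK", "recoveries": 0}


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = json.dumps(STATE).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):
        pass  # keep per-request logging out of stdout


def serve(port=0):
    """Serve the health summary from a daemon thread; port 0 picks a free port."""
    srv = HTTPServer(("127.0.0.1", port), HealthHandler)
    threading.Thread(target=srv.serve_forever, daemon=True).start()
    return srv
```

Binding to 127.0.0.1 keeps the endpoint local; expose it on the LAN interface only if you actually need remote access from another host.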

How this fits into the projects

This concept is used in §3 and §5.10 and ties into Project 13 (boot readiness) and the capstone project.

Definitions & key terms

  • Health check: Measurement used to determine system health.
  • Threshold: Limit that triggers action when exceeded.
  • Backoff: Delay between repeated recovery attempts.
  • Watchdog: Mechanism to detect unresponsive services.

Mental model diagram (ASCII)

Metrics -> Health Check -> Decision -> Restart/Log

How it works (step-by-step, with invariants and failure modes)

  1. Collect metrics on interval.
  2. Compare to thresholds.
  3. If unhealthy, attempt recovery with backoff.
  4. Log action and update health state.

Failure modes:

  • Threshold too low -> false positives.
  • No backoff -> restart loops.
  • Missing logs -> no diagnosis.

Minimal concrete example

# unhealthy: CPU above 90% sustained for more than 300 s (5 min)
if cpu > 90 and duration > 300:
    restart("sensor_app")

Common misconceptions

  • “Restarting is always safe.” It can cause data loss.
  • “Health is only CPU usage.” It includes liveness and data freshness.

Check-your-understanding questions

  1. Why do you need hysteresis in health checks?
  2. What is the risk of unlimited restart attempts?
  3. How does systemd watchdog help?

Check-your-understanding answers

  1. To prevent flapping between healthy/unhealthy states.
  2. It can create endless loops and hide root cause.
  3. It restarts unresponsive services automatically.

Real-world applications

  • Remote monitoring devices, industrial IoT, kiosks.

Where you’ll apply it

You’ll apply this concept in §3 (Project Specification) and §5.10 (Implementation Phases).
References

  • Google SRE book, health checks chapter
  • systemd watchdog documentation

Key insights

Health monitoring is a policy decision; the thresholds define your system behavior.

Summary

A self-healing service must monitor, decide, and recover with backoff and clear logs.

Homework/Exercises to practice the concept

  1. Define thresholds for CPU, memory, disk.
  2. Simulate a crash and confirm one restart.
  3. Implement a restart cooldown.

Solutions to the homework/exercises

  1. Use realistic thresholds (CPU > 90% for 5 min).
  2. Kill the process and observe restart.
  3. Add a timestamp check before restarting again.
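The timestamp check from solution 3 might look like the following; the 10-minute cooldown is an assumed value, and a real monitor would keep the state on an object rather than in module globals:

```python
import time

COOLDOWN_S = 600       # assumed 10-minute cooldown
last_restart = 0.0     # timestamp of the last granted restart


def may_restart(now=None):
    """Refuse a restart attempt that falls inside the cooldown window.
    Sketch of the homework solution; not production-ready state handling."""
    global last_restart
    now = now if now is not None else time.time()
    if now - last_restart < COOLDOWN_S:
        return False
    last_restart = now
    return True
```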

3. Project Specification

3.1 What You Will Build

A health monitor service that checks system metrics and restarts a failing app with backoff.

3.2 Functional Requirements

  1. Collect CPU, memory, disk, temperature metrics.
  2. Detect app failure or staleness.
  3. Restart services with cooldown.
  4. Log every recovery action.
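Staleness detection (requirement 2) can be built on the output file's modification time. A sketch; `output_fresh` is a hypothetical helper name, and the 5-minute default is an assumption:

```python
import os
import time


def output_fresh(path, max_age_s=300, now=None):
    """True if `path` was modified within the last `max_age_s` seconds.
    A missing or unreadable file counts as stale."""
    try:
        mtime = os.path.getmtime(path)
    except OSError:
        return False
    now = now if now is not None else time.time()
    return (now - mtime) <= max_age_s
```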

3.3 Non-Functional Requirements

  • Performance: Health check loop < 1% CPU.
  • Reliability: No restart loops.
  • Usability: Clear health status output.

3.4 Example Usage / Output

$ ./health_monitor
Service: sensor_app  Status: OK  Uptime: 2h14m
CPU: 22%  Memory: 38%  Temp: 51 C
Recovery actions taken: 0

3.5 Data Formats / Schemas / Protocols

Health report JSON:

{"service":"sensor_app","status":"OK","cpu":22,"mem":38,"temp":51,"recoveries":0}
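The report above can be produced with `json.dumps` and compact separators. A sketch; the function name and argument order are illustrative:

```python
import json


def health_report(service, status, cpu, mem, temp, recoveries):
    """Serialize a health report matching the schema above (no whitespace)."""
    return json.dumps(
        {"service": service, "status": status, "cpu": cpu,
         "mem": mem, "temp": temp, "recoveries": recoveries},
        separators=(",", ":"),  # compact output for logs and wire transfer
    )
```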

3.6 Edge Cases

  • Service crashes repeatedly.
  • Disk full prevents log writing.
  • Metrics unavailable due to permissions.

3.7 Real World Outcome

The device recovers from failures without manual intervention and avoids restart loops.

3.7.1 How to Run (Copy/Paste)

python3 health_monitor.py --service sensor_app --interval 10

3.7.2 Golden Path Demo (Deterministic)

export FIXED_TIME="2026-01-01T14:00:00Z"
python3 health_monitor.py --simulate --healthy

Expected output:

[2026-01-01T14:00:00Z] Status OK, recoveries=0

3.7.3 Failure Demo (Deterministic)

python3 health_monitor.py --simulate --crash

Expected output:

[ERROR] Service crash detected, restart initiated

Exit code: 161

3.7.4 CLI Exit Codes

  • 0: Success
  • 160: Metrics unavailable
  • 161: Restart failed

4. Solution Architecture

4.1 High-Level Design

Metrics -> Health Policy -> Recovery Actions -> Logs

4.2 Key Components

| Component | Responsibility | Key Decisions |
|---|---|---|
| Metrics Collector | Read /proc and sensors | Sampling interval |
| Policy Engine | Evaluate thresholds | Hysteresis |
| Recovery | Restart services | Backoff strategy |
| Reporter | Output health status | JSON vs text |

4.3 Data Structures (No Full Code)

last_restart = 0
restart_count = 0

4.4 Algorithm Overview

Key Algorithm: Backoff Restart

  1. Detect failure.
  2. Check cooldown timer.
  3. Restart service or escalate.
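The cooldown in step 2 can grow exponentially between attempts, matching the recommendation in §5.11. A sketch with assumed base and cap values:

```python
def backoff_delay(attempt, base_s=30.0, cap_s=1800.0):
    """Exponential backoff: 30 s, 60 s, 120 s, ... capped at 30 min.
    `base_s` and `cap_s` are assumed values; tune them for your device."""
    return min(base_s * (2 ** attempt), cap_s)
```

The cap matters: without it, a service that fails overnight could accumulate a multi-hour delay and stay down long after the underlying fault has cleared.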

Complexity Analysis:

  • Time: O(1) per loop
  • Space: O(1)

5. Implementation Guide

5.1 Development Environment Setup

sudo apt-get install -y systemd

5.2 Project Structure

project-root/
├── health_monitor.py
├── policy.py
└── README.md

5.3 The Core Question You’re Answering

“How do you keep a headless device healthy without human supervision?”

5.4 Concepts You Must Understand First

  1. Health check thresholds.
  2. systemd service supervision.
  3. Backoff and cooldown logic.

5.5 Questions to Guide Your Design

  1. When do you restart vs reboot?
  2. What metrics define “healthy” for your device?

5.6 Thinking Exercise

Define a failure policy for a sensor service and simulate it.

5.7 The Interview Questions They’ll Ask

  1. What is the risk of auto-restart loops?
  2. How does systemd supervise services?
  3. When should you reboot a device?

5.8 Hints in Layers

Hint 1: Start with CPU and memory metrics.

Hint 2: Add a restart cooldown.

Hint 3: Log every recovery action.

5.9 Books That Will Help

| Topic | Book | Chapter |
|---|---|---|
| Reliability | Site Reliability Engineering | Ch. 5 |
| Processes | The Linux Programming Interface | Ch. 6 |

5.10 Implementation Phases

Phase 1: Metrics (3 hours)

  • Gather CPU, memory, disk.
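On Linux, load and memory can be read straight from /proc with no extra dependencies. A Linux-only sketch (field names are as documented in proc(5); `MemAvailable` requires kernel 3.14+):

```python
def read_load_and_mem():
    """Return (1-min load average, memory-used percent) from /proc.
    Linux-only sketch for Phase 1."""
    with open("/proc/loadavg") as f:
        load1 = float(f.read().split()[0])
    info = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, value = line.split(":")
            info[key] = int(value.split()[0])  # values are in kB
    mem_pct = 100.0 * (1 - info["MemAvailable"] / info["MemTotal"])
    return load1, mem_pct
```

Using `MemAvailable` rather than `MemFree` avoids counting reclaimable page cache as "used", which would otherwise trip memory thresholds on a perfectly healthy system.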

Phase 2: Policy (4 hours)

  • Implement thresholds and cooldown.

Phase 3: Recovery (3 hours)

  • Restart services and log actions.
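Restarting through systemctl and logging the action might look like this; the `ctl` override exists only so the logic can be exercised off-device and is not part of the project spec:

```python
import logging
import subprocess


def restart_service(name, ctl=None):
    """Restart `name` via systemctl and log the outcome.
    `ctl` replaces the command list for off-device testing (hypothetical hook)."""
    cmd = ctl if ctl is not None else ["systemctl", "restart", name]
    result = subprocess.run(cmd, capture_output=True, text=True)
    ok = result.returncode == 0
    # every recovery action is logged with its outcome for later diagnosis
    logging.info("recovery: restart %s -> %s",
                 name, "ok" if ok else result.stderr.strip())
    return ok
```

On a real device the monitor needs permission to run systemctl, either by running as root or via a scoped sudoers/polkit rule for the one service it manages.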

5.11 Key Implementation Decisions

| Decision | Options | Recommendation | Rationale |
|---|---|---|---|
| Recovery | Restart / Reboot | Restart first | Less disruptive |
| Cooldown | Fixed / Exponential | Exponential | Avoid loops |


6. Testing Strategy

6.1 Test Categories

| Category | Purpose | Examples |
|---|---|---|
| Unit Tests | Policy evaluation | Threshold crossing |
| Integration Tests | systemd restart | Service crash |
| Edge Case Tests | Disk full | Log failure |

6.2 Critical Test Cases

  1. Single crash triggers one restart.
  2. Multiple crashes trigger cooldown.
  3. Metrics missing -> exit 160.

6.3 Test Data

CPU=95% for 6 min -> restart

7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

| Pitfall | Symptom | Solution |
|---|---|---|
| No cooldown | Restart loop | Add backoff |
| Missing permissions | Metrics unavailable | Use /proc or sudo |
| Silent failures | No logs | Log every action |

7.2 Debugging Strategies

  • Use systemctl status to verify restart.
  • Inspect logs for timestamps.

7.3 Performance Traps

  • Overly frequent checks waste CPU.

8. Extensions & Challenges

8.1 Beginner Extensions

  • Add temperature threshold alerts.

8.2 Intermediate Extensions

  • Add remote health report via MQTT.

8.3 Advanced Extensions

  • Integrate hardware watchdog.

9. Real-World Connections

9.1 Industry Applications

  • Industrial controllers, field sensors, kiosks.

9.2 Open Source Examples

  • systemd watchdog examples.

9.3 Interview Relevance

  • Reliability engineering and recovery logic.

10. Resources

10.1 Essential Reading

  • SRE book health checks chapter.

10.2 Video Resources

  • systemd watchdog tutorials.

10.3 Tools & Documentation

  • systemctl, journalctl documentation.

11. Self-Assessment Checklist

11.1 Understanding

  • I can explain health check thresholds and backoff.
  • I can explain systemd restart policies.

11.2 Implementation

  • Health monitor runs and logs recoveries.
  • Restart loops are prevented.

11.3 Growth

  • I can discuss self-healing systems in interviews.

12. Submission / Completion Criteria

Minimum Viable Completion:

  • Health monitor reads metrics and logs status.

Full Completion:

  • Automatic restart with cooldown and logging.

Excellence (Going Above & Beyond):

  • Hardware watchdog integration and remote reporting.