Project 11: FDIR Watchdog (The Dead Man’s Switch)
Build a fault detection, isolation, and recovery (FDIR) watchdog that monitors task heartbeats, escalates recovery actions, and enters safe mode on persistent faults.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Level 3: Advanced |
| Time Estimate | 1-2 weeks |
| Main Programming Language | C |
| Alternative Programming Languages | Python |
| Coolness Level | Level 3: The Survivor |
| Business Potential | Level 3: Reliability Patterns |
| Prerequisites | Embedded scheduling, timers, state machines |
| Key Topics | Watchdog timers, fault escalation, safe mode |
1. Learning Objectives
By completing this project, you will:
- Implement a heartbeat-based watchdog for multiple tasks.
- Design a fault escalation ladder (restart -> reboot -> safe mode).
- Log fault reason codes and recovery actions.
- Prevent reboot loops using cooldowns and counters.
- Validate FDIR behavior with deterministic fault injection.
2. All Theory Needed (Per-Concept Breakdown)
Watchdog Timers and Heartbeat Monitoring
Fundamentals A watchdog timer is a safety mechanism that resets or recovers a system when software becomes unresponsive. In flight software, you cannot rely on manual intervention; a hung task can end a mission. A heartbeat is a periodic signal from each task that indicates it is alive. The watchdog checks these heartbeats and triggers recovery if they stop. Designing heartbeat intervals and watchdog timeouts is a balance: too short, and you get false resets; too long, and faults go undetected.
Deep Dive into the concept Watchdogs come in two forms: hardware watchdogs and software watchdogs. Hardware watchdogs are independent timers that reset the processor if not “kicked.” Software watchdogs run inside the system and can monitor individual tasks. In flight software, you often combine both: each critical task updates a heartbeat counter; a supervisory task checks these counters. If a counter has not advanced within a timeout, it indicates a hang or scheduling failure.
Selecting heartbeat intervals depends on task rates. A 1 Hz task might update its heartbeat every second; a 10 Hz task might update every 100 ms. The watchdog timeout should be larger than the expected period plus margin. For example, a 1 Hz task with occasional jitter might use a 3-second timeout. The margin must account for CPU load spikes and scheduling jitter; otherwise, you will get false positives.
When a fault is detected, you should distinguish between transient and persistent failures. A single missed heartbeat might be a transient glitch; repeated misses indicate a deeper problem. Therefore, the watchdog should maintain a fault counter and only escalate after N consecutive misses. This improves robustness. The watchdog should also record timestamped fault events with reason codes that operators can inspect.
In addition, the watchdog must be deterministic. Heartbeat checks should occur at fixed intervals, and the algorithm should produce consistent results given the same heartbeat trace. This is crucial for testing and for flight certification.
How this fits into the project This concept drives the Section 3.2 watchdog requirements and the Section 6 test cases, and ties into P02 mode management.
Definitions & key terms
- Heartbeat -> Periodic signal indicating a task is alive.
- Timeout -> Maximum allowed interval between heartbeats.
- Watchdog -> Monitor that triggers recovery on timeout.
Mental model diagram (ASCII)
Tasks -> Heartbeats -> Watchdog -> Fault Event -> Recovery
How it works (step-by-step, with invariants and failure modes)
- Each task increments its heartbeat counter.
- Watchdog checks counters at fixed intervals.
- If no increment within timeout, declare fault.
- Escalate recovery based on counters.
Invariants: heartbeat intervals deterministic; timeout > task period; counters monotonically increase.
Failure modes: false positives, missed detection, watchdog starvation.
Minimal concrete example
if (now - last_heartbeat[task] > timeout) fault(task);
Common misconceptions
- “Hardware watchdog is enough” -> It can’t identify which task failed.
- “Shorter timeouts are always better” -> They cause false resets.
Check-your-understanding questions
- Why use per-task heartbeats instead of a global one?
- How do you choose a timeout margin?
- What is the risk of false positives?
Check-your-understanding answers
- You can isolate which task failed.
- Add jitter margin to worst-case period.
- Unnecessary resets and lost mission time.
Real-world applications
- Flight software watchdogs in CubeSats.
- Safety monitors in automotive and avionics.
Where you’ll apply it
- See Section 3.2 and Section 6.2.
- Also used in: P02-the-flight-state-machine-the-life-cycle.md
References
- NASA GSFC-HDBK-8007 (fault management)
- Elecia White, Making Embedded Systems (reliability)
Key insights A watchdog is a time-based contract between tasks and the system.
Summary Heartbeat monitoring enables fast detection of hung tasks.
Homework/Exercises to practice the concept
- Choose heartbeat intervals and timeouts for 1 Hz and 10 Hz tasks.
Solutions to the homework/exercises
- Example: 1 Hz timeout 3s, 10 Hz timeout 0.5s.
Fault Escalation and Safe Mode Entry
Fundamentals Not all faults require the same response. A fault escalation ladder starts with the least disruptive recovery (restart a task), then escalates to rebooting a subsystem, and finally to entering safe mode. This avoids unnecessary resets while ensuring survival. Safe mode is the last-resort configuration that minimizes power and risk.
Deep Dive into the concept Fault escalation is about balancing recovery with mission continuity. If a single task stalls, restarting it is often sufficient. If multiple restarts fail, the subsystem might be corrupted; rebooting the subsystem could restore functionality. If the failure persists, the system should enter safe mode to preserve power and thermal stability while awaiting ground intervention.
A well-designed escalation ladder includes thresholds and cooldowns. For example: after 1 missed heartbeat, restart the task; after 3 consecutive failures, reboot the subsystem; after 5 failures in 10 minutes, enter safe mode. Cooldowns prevent rapid oscillation between resets. You also need a reset counter to avoid endless reboot loops. If the system reboots repeatedly within a short window, you should enter safe mode and wait.
Safe mode configuration should be explicit: disable payload, reduce comms, enable beacon, and set ADCS to sun-pointing. The watchdog should coordinate with the mode manager (P02) so that entering safe mode is a controlled transition with proper telemetry. All escalation steps should be logged with reason codes and timestamps.
Fault isolation is also important. If only the radio task is failing, you may reboot the radio subsystem without affecting attitude control. This requires mapping tasks to subsystems and defining recovery actions for each. In this project, you can simulate this with a mapping table and stub recovery functions.
Deterministic testing is crucial. Create scripted fault injections (e.g., freeze COMMS at t=100) and verify that the watchdog escalates exactly as designed. A correct system must produce identical logs for identical fault scripts.
How this fits into the project This concept drives the Section 3.2 recovery requirements and Section 7 debugging, and integrates with P02 mode logic.
Definitions & key terms
- Escalation ladder -> Ordered set of recovery actions.
- Safe mode -> Minimal survival configuration.
- Cooldown -> Minimum time between resets.
Mental model diagram (ASCII)
Fault -> Restart Task -> Reboot Subsystem -> Safe Mode
How it works (step-by-step, with invariants and failure modes)
- Detect fault via heartbeat timeout.
- Attempt task restart; increment fault counter.
- If repeated, reboot subsystem.
- If persistent, enter safe mode and log.
Invariants: escalation order fixed; cooldown enforced; safe mode is terminal state.
Failure modes: infinite reboot loop, missing reason codes, unsafe recovery actions.
Minimal concrete example
if (fail_count >= 5) enter_safe_mode();
else if (fail_count >= 3) reboot_subsystem();
else if (fail_count >= 1) restart_task();
Common misconceptions
- “Resetting is always safe” -> Resets can cause data loss or system instability.
- “Safe mode is optional” -> It is the ultimate survival fallback.
Check-your-understanding questions
- Why use a recovery ladder instead of always rebooting?
- How do cooldowns prevent reboot loops?
- What should safe mode always preserve?
Check-your-understanding answers
- To minimize disruption and preserve mission continuity.
- They prevent immediate repeated resets.
- Power, thermal safety, and basic communications.
Real-world applications
- Fault protection in CubeSat missions.
- Autonomous recovery in deep-space probes.
Where you’ll apply it
- See Section 3.2 and Section 6.2.
- Also used in: P02-the-flight-state-machine-the-life-cycle.md
References
- NASA GSFC-HDBK-8007 (fault management)
- Space Mission Engineering (operations)
Key insights Escalation is about resilience: recover fast, but don’t thrash.
Summary A disciplined escalation ladder keeps the mission alive without overreacting.
Homework/Exercises to practice the concept
- Design a recovery ladder for a COMMS task and justify thresholds.
Solutions to the homework/exercises
- Example: restart after 1 miss, reboot after 3, safe mode after 5 in 10 min.
3. Project Specification
3.1 What You Will Build
A watchdog system that monitors task heartbeats, detects faults, escalates recovery actions, and enters safe mode on persistent failures.
3.2 Functional Requirements
- Heartbeat tracking for multiple tasks.
- Timeout detection with configurable thresholds.
- Escalation ladder with restart/reboot/safe mode.
- Logging of fault codes and recovery actions.
3.3 Non-Functional Requirements
- Determinism: fixed timing and deterministic fault injection.
- Reliability: no infinite reboot loops.
- Transparency: clear reason codes for every action.
3.4 Example Usage / Output
$ ./fdir_watchdog --sim
[OK] All tasks alive
[FAULT] COMMS task stalled
[RECOVER] Rebooted radio
[SAFE] Entering safe mode
3.5 Data Formats / Schemas / Protocols
Fault log JSON:
{"t":120,"task":"COMMS","action":"REBOOT","reason":"TIMEOUT"}
3.6 Edge Cases
- Multiple tasks failing simultaneously.
- Heartbeat jitter causing false positives.
- Reboot loop without cooldown.
3.7 Real World Outcome
A simulation trace showing watchdog responses and safe mode entry under persistent faults.
3.7.1 How to Run (Copy/Paste)
./fdir_watchdog --sim --seed 42
3.7.2 Golden Path Demo (Deterministic)
- Use traces/nominal.json with no faults.
- Expect zero recovery actions.
3.7.3 Failure Demo (Deterministic)
./fdir_watchdog --sim --trace traces/comms_stall.json
Expected: restart -> reboot -> safe mode sequence; exit code 3.
3.7.4 If CLI: Exact Terminal Transcript
$ ./fdir_watchdog --sim --trace traces/comms_stall.json
[FAULT] COMMS timeout
[RECOVER] Restarted COMMS
[FAULT] COMMS timeout
[RECOVER] Rebooted COMMS subsystem
[SAFE] Entering safe mode
ExitCode=3
4. Solution Architecture
4.1 High-Level Design
Task Heartbeats -> Watchdog -> Escalation Engine -> Recovery Actions -> Logs
4.2 Key Components
| Component | Responsibility | Key Decisions |
|---|---|---|
| Heartbeat Table | Track last update times | Per-task timeouts |
| Escalation Engine | Decide recovery action | Threshold policy |
| Recovery Actions | Restart/reboot/safe | Mocked actions |
| Logger | Fault events | Reason codes |
4.3 Data Structures (No Full Code)
typedef struct { uint32_t last_tick; uint8_t fail_count; } task_state_t;
4.4 Algorithm Overview
Key Algorithm: Watchdog loop
- Check each task’s heartbeat age.
- If timeout, increment fail count.
- Choose recovery action based on fail count.
- Log action and apply cooldown.
Complexity Analysis:
- Time: O(tasks) per tick.
- Space: O(tasks).
5. Implementation Guide
5.1 Development Environment Setup
cc -O2 -Wall -Wextra -o fdir_watchdog src/*.c
5.2 Project Structure
project-root/
+-- src/
| +-- watchdog.c
| +-- escalation.c
| +-- main.c
+-- traces/
+-- README.md
5.3 The Core Question You’re Answering
“How does a spacecraft recover when the software is hung?”
5.4 Concepts You Must Understand First
- Heartbeat monitoring.
- Fault escalation.
- Safe mode constraints.
5.5 Questions to Guide Your Design
- What timeout margins avoid false positives?
- How many retries before reboot?
- What telemetry should be logged for operators?
5.6 Thinking Exercise
Design a recovery ladder for a payload task with low criticality.
5.7 The Interview Questions They’ll Ask
- “What is the difference between hardware and software watchdogs?”
- “How do you avoid reboot loops?”
- “What should safe mode preserve?”
5.8 Hints in Layers
Hint 1: Start with a single watchdog timer.
Hint 2: Add per-task heartbeat tracking.
Hint 3: Add escalation and cooldown counters.
Hint 4: Add structured fault logs.
5.9 Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Fault management | NASA GSFC-HDBK-8007 | Fault protection |
| Embedded reliability | Elecia White | Reliability |
| Systems ops | Space Mission Engineering | Recovery |
5.10 Implementation Phases
Phase 1: Heartbeat Monitor (3-4 days)
Goals: detect missed heartbeats. Tasks: implement heartbeat table and timeout checks. Checkpoint: missing heartbeat triggers fault log.
Phase 2: Escalation Ladder (3-4 days)
Goals: restart/reboot/safe logic. Tasks: implement fail counters and cooldowns. Checkpoint: faults escalate in correct order.
Phase 3: Fault Injection (2-3 days)
Goals: deterministic validation. Tasks: create fault traces and run tests. Checkpoint: logs match golden outputs.
5.11 Key Implementation Decisions
| Decision | Options | Recommendation | Rationale |
|---|---|---|---|
| Timeout | static / per-task | per-task | Matches task rate |
| Escalation | fixed / adaptive | fixed | Deterministic |
| Logging | text / JSON | JSON | Machine-readable |
6. Testing Strategy
6.1 Test Categories
| Category | Purpose | Examples |
|---|---|---|
| Unit Tests | Timeout logic | synthetic timers |
| Integration Tests | Fault traces | comms_stall.json |
| Edge Case Tests | Multi-fault | simultaneous stalls |
6.2 Critical Test Cases
- Single stall: restart only.
- Repeated stall: reboot after threshold.
- Persistent failure: safe mode entry.
6.3 Test Data
traces/comms_stall.json
traces/multi_fault.json
7. Common Pitfalls & Debugging
7.1 Frequent Mistakes
| Pitfall | Symptom | Solution |
|---|---|---|
| No cooldown | Reboot loop | Add cooldown timers |
| Wrong timeout | False faults | Increase margin |
| Missing logs | Ops blind | Log reason codes |
7.2 Debugging Strategies
- Replay fault traces deterministically.
- Verify escalation order in logs.
7.3 Performance Traps
Minimal; watchdog runs at low rate.
8. Extensions & Challenges
8.1 Beginner Extensions
- Add heartbeat jitter simulation.
8.2 Intermediate Extensions
- Add subsystem dependency checks (e.g., COMMS depends on EPS).
8.3 Advanced Extensions
- Integrate with a mode manager and simulate safe mode actions.
9. Real-World Connections
9.1 Industry Applications
- Fault protection in CubeSat missions.
- Safety systems in autonomous robots.
9.2 Related Open Source Projects
- cFS fault management apps.
- RTEMS watchdog examples.
9.3 Interview Relevance
- Demonstrates safety engineering and reliability design.
10. Resources
10.1 Essential Reading
- NASA GSFC-HDBK-8007.
- Elecia White, Making Embedded Systems.
10.2 Video Resources
- Fault management lectures.
10.3 Tools & Documentation
- C timers and watchdog APIs.
10.4 Related Projects in This Series
- P02-the-flight-state-machine-the-life-cycle.md
11. Self-Assessment Checklist
11.1 Understanding
- I can explain watchdog timeout selection.
- I can design escalation ladders.
- I can explain safe mode entry criteria.
11.2 Implementation
- Fault traces produce expected logs.
- Cooldowns prevent reboot loops.
- Deterministic outputs with fixed traces.
11.3 Growth
- I can integrate watchdog with mode management.
12. Submission / Completion Criteria
Minimum Viable Completion:
- Heartbeat monitoring and basic timeout detection.
Full Completion:
- Escalation ladder with restart/reboot/safe mode.
Excellence (Going Above & Beyond):
- Integration with mode manager and dependency-aware fault handling.