Project 11: FDIR Watchdog (The Dead Man’s Switch)
Build a fault detection, isolation, and recovery (FDIR) watchdog that monitors task heartbeats, escalates recovery actions, and enters safe mode on persistent faults.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Level 3: Advanced |
| Time Estimate | 1-2 weeks |
| Main Programming Language | C |
| Alternative Programming Languages | Python |
| Coolness Level | Level 3: The Survivor |
| Business Potential | Level 3: Reliability Patterns |
| Prerequisites | Embedded scheduling, timers, state machines |
| Key Topics | Watchdog timers, fault escalation, safe mode |
1. Learning Objectives
By completing this project, you will:
- Implement a heartbeat-based watchdog for multiple tasks.
- Design a fault escalation ladder (restart -> reboot -> safe mode).
- Log fault reason codes and recovery actions.
- Prevent reboot loops using cooldowns and counters.
- Validate FDIR behavior with deterministic fault injection.
2. All Theory Needed (Per-Concept Breakdown)
Watchdog Timers and Heartbeat Monitoring
Fundamentals A watchdog timer is a safety mechanism that resets or recovers a system when software becomes unresponsive. In flight software, you cannot rely on manual intervention; a hung task can end a mission. A heartbeat is a periodic signal from each task that indicates it is alive. The watchdog checks these heartbeats and triggers recovery if they stop. Designing heartbeat intervals and watchdog timeouts is a balance: too short, and you get false resets; too long, and faults go undetected.
Deep Dive into the concept Watchdogs come in two forms: hardware watchdogs and software watchdogs. Hardware watchdogs are independent timers that reset the processor if not “kicked.” Software watchdogs run inside the system and can monitor individual tasks. In flight software, you often combine both: each critical task updates a heartbeat counter; a supervisory task checks these counters. If a counter has not advanced within a timeout, it indicates a hang or scheduling failure.
Selecting heartbeat intervals depends on task rates. A 1 Hz task might update its heartbeat every second; a 10 Hz task might update every 100 ms. The watchdog timeout should be larger than the expected period plus margin. For example, a 1 Hz task with occasional jitter might use a 3-second timeout. The margin must account for CPU load spikes and scheduling jitter; otherwise, you will get false positives.
When a fault is detected, you should distinguish between transient and persistent failures. A single missed heartbeat might be a transient glitch; repeated misses indicate a deeper problem. Therefore, the watchdog should maintain a fault counter and only escalate after N consecutive misses. This improves robustness. The watchdog should also record timestamped fault events with reason codes that operators can inspect.
In addition, the watchdog must be deterministic. Heartbeat checks should occur at fixed intervals, and the algorithm should produce consistent results given the same heartbeat trace. This is crucial for testing and for flight certification.
How this fits into the project This concept drives the Section 3.2 watchdog requirements and the Section 6 test cases, and ties into P02 mode management.
Definitions & key terms
- Heartbeat -> Periodic signal indicating a task is alive.
- Timeout -> Maximum allowed interval between heartbeats.
- Watchdog -> Monitor that triggers recovery on timeout.
Mental model diagram (ASCII)
Tasks -> Heartbeats -> Watchdog -> Fault Event -> Recovery
How it works (step-by-step, with invariants and failure modes)
- Each task increments its heartbeat counter.
- Watchdog checks counters at fixed intervals.
- If no increment within timeout, declare fault.
- Escalate recovery based on counters.
Invariants: heartbeat intervals deterministic; timeout > task period; counters monotonically increase.
Failure modes: false positives, missed detection, watchdog starvation.
Minimal concrete example
if (now - last_heartbeat[task] > timeout) fault(task);
Common misconceptions
- “Hardware watchdog is enough” -> It can’t identify which task failed.
- “Shorter timeouts are always better” -> They cause false resets.
Check-your-understanding questions
- Why use per-task heartbeats instead of a global one?
- How do you choose a timeout margin?
- What is the risk of false positives?
Check-your-understanding answers
- You can isolate which task failed.
- Add jitter margin to worst-case period.
- Unnecessary resets and lost mission time.
Real-world applications
- Flight software watchdogs in CubeSats.
- Safety monitors in automotive and avionics.
Where you’ll apply it
- See Section 3.2 and Section 6.2.
- Also used in: P02-the-flight-state-machine-the-life-cycle.md
References
- NASA GSFC-HDBK-8007 (fault management)
- Elecia White, Making Embedded Systems (reliability)
Key insights A watchdog is a time-based contract between tasks and the system.
Summary Heartbeat monitoring enables fast detection of hung tasks.
Homework/Exercises to practice the concept
- Choose heartbeat intervals and timeouts for 1 Hz and 10 Hz tasks.
Solutions to the homework/exercises
- Example: 1 Hz timeout 3s, 10 Hz timeout 0.5s.
Fault Escalation and Safe Mode Entry
Fundamentals Not all faults require the same response. A fault escalation ladder starts with the least disruptive recovery (restart a task), then escalates to rebooting a subsystem, and finally to entering safe mode. This avoids unnecessary resets while ensuring survival. Safe mode is the last-resort configuration that minimizes power and risk.
Deep Dive into the concept Fault escalation is about balancing recovery with mission continuity. If a single task stalls, restarting it is often sufficient. If multiple restarts fail, the subsystem might be corrupted; rebooting the subsystem could restore functionality. If the failure persists, the system should enter safe mode to preserve power and thermal stability while awaiting ground intervention.
A well-designed escalation ladder includes thresholds and cooldowns. For example: after 1 missed heartbeat, restart the task; after 3 consecutive failures, reboot the subsystem; after 5 failures in 10 minutes, enter safe mode. Cooldowns prevent rapid oscillation between resets. You also need a reset counter to avoid endless reboot loops. If the system reboots repeatedly within a short window, you should enter safe mode and wait.
Safe mode configuration should be explicit: disable payload, reduce comms, enable beacon, and set ADCS to sun-pointing. The watchdog should coordinate with the mode manager (P02) so that entering safe mode is a controlled transition with proper telemetry. All escalation steps should be logged with reason codes and timestamps.
Fault isolation is also important. If only the radio task is failing, you may reboot the radio subsystem without affecting attitude control. This requires mapping tasks to subsystems and defining recovery actions for each. In this project, you can simulate this with a mapping table and stub recovery functions.
Deterministic testing is crucial. Create scripted fault injections (e.g., freeze COMMS at t=100) and verify that the watchdog escalates exactly as designed. A correct system must produce identical logs for identical fault scripts.
How this fits into the project This concept drives the Section 3.2 recovery requirements and Section 7 debugging, and integrates with P02 mode logic.
Definitions & key terms
- Escalation ladder -> Ordered set of recovery actions.
- Safe mode -> Minimal survival configuration.
- Cooldown -> Minimum time between resets.
Mental model diagram (ASCII)
Fault -> Restart Task -> Reboot Subsystem -> Safe Mode
How it works (step-by-step, with invariants and failure modes)
- Detect fault via heartbeat timeout.
- Attempt task restart; increment fault counter.
- If repeated, reboot subsystem.
- If persistent, enter safe mode and log.
Invariants: escalation order fixed; cooldown enforced; safe mode is terminal state.
Failure modes: infinite reboot loop, missing reason codes, unsafe recovery actions.
Minimal concrete example
if (fail_count >= 5) enter_safe_mode();
else if (fail_count >= 3) reboot_subsystem();
else if (fail_count >= 1) restart_task();
Common misconceptions
- “Resetting is always safe” -> Resets can cause data loss or system instability.
- “Safe mode is optional” -> It is the ultimate survival fallback.
Check-your-understanding questions
- Why use a recovery ladder instead of always rebooting?
- How do cooldowns prevent reboot loops?
- What should safe mode always preserve?
Check-your-understanding answers
- To minimize disruption and preserve mission continuity.
- They prevent immediate repeated resets.
- Power, thermal safety, and basic communications.
Real-world applications
- Fault protection in CubeSat missions.
- Autonomous recovery in deep-space probes.
Where you’ll apply it
- See Section 3.2 and Section 6.2.
- Also used in: P02-the-flight-state-machine-the-life-cycle.md
References
- NASA GSFC-HDBK-8007 (fault management)
- Space Mission Engineering (operations)
Key insights Escalation is about resilience: recover fast, but don’t thrash.
Summary A disciplined escalation ladder keeps the mission alive without overreacting.
Homework/Exercises to practice the concept
- Design a recovery ladder for a COMMS task and justify thresholds.
Solutions to the homework/exercises
- Example: restart after 1 miss, reboot after 3, safe mode after 5 in 10 min.
3. Project Specification
3.1 What You Will Build
A watchdog system that monitors task heartbeats, detects faults, escalates recovery actions, and enters safe mode on persistent failures.
3.2 Functional Requirements
- Heartbeat tracking for multiple tasks.
- Timeout detection with configurable thresholds.
- Escalation ladder with restart/reboot/safe mode.
- Logging of fault codes and recovery actions.
3.3 Non-Functional Requirements
- Determinism: fixed timing and deterministic fault injection.
- Reliability: no infinite reboot loops.
- Transparency: clear reason codes for every action.
3.4 Example Usage / Output
$ ./fdir_watchdog --sim
[OK] All tasks alive
[FAULT] COMMS task stalled
[RECOVER] Rebooted radio
[SAFE] Entering safe mode
3.5 Data Formats / Schemas / Protocols
Fault log JSON:
{"t":120,"task":"COMMS","action":"REBOOT","reason":"TIMEOUT"}
3.6 Edge Cases
- Multiple tasks failing simultaneously.
- Heartbeat jitter causing false positives.
- Reboot loop without cooldown.
3.7 Real World Outcome
A simulation trace showing watchdog responses and safe mode entry under persistent faults.
3.7.1 How to Run (Copy/Paste)
./fdir_watchdog --sim --seed 42
3.7.2 Golden Path Demo (Deterministic)
- Use traces/nominal.json with no faults.
- Expect zero recovery actions.
3.7.3 Failure Demo (Deterministic)
./fdir_watchdog --sim --trace traces/comms_stall.json
Expected: restart -> reboot -> safe mode sequence; exit code 3.
3.7.4 If CLI: Exact Terminal Transcript
$ ./fdir_watchdog --sim --trace traces/comms_stall.json
[FAULT] COMMS timeout
[RECOVER] Restarted COMMS
[FAULT] COMMS timeout
[RECOVER] Rebooted COMMS subsystem
[SAFE] Entering safe mode
ExitCode=3
4. Solution Architecture
4.1 High-Level Design
Task Heartbeats -> Watchdog -> Escalation Engine -> Recovery Actions -> Logs
4.2 Key Components
| Component | Responsibility | Key Decisions |
|---|---|---|
| Heartbeat Table | Track last update times | Per-task timeouts |
| Escalation Engine | Decide recovery action | Threshold policy |
| Recovery Actions | Restart/reboot/safe | Mocked actions |
| Logger | Fault events | Reason codes |
4.3 Data Structures (No Full Code)
typedef struct { uint32_t last_tick; uint8_t fail_count; } task_state_t;
4.4 Algorithm Overview
Key Algorithm: Watchdog loop
- Check each task’s heartbeat age.
- If timeout, increment fail count.
- Choose recovery action based on fail count.
- Log action and apply cooldown.
Complexity Analysis:
- Time: O(tasks) per tick.
- Space: O(tasks).
5. Implementation Guide
5.1 Development Environment Setup
cc -O2 -Wall -Wextra -o fdir_watchdog src/*.c
5.2 Project Structure
project-root/
+-- src/
| +-- watchdog.c
| +-- escalation.c
| +-- main.c
+-- traces/
+-- README.md
5.3 The Core Question You’re Answering
“How does a spacecraft recover when the software is hung?”
5.4 Concepts You Must Understand First
- Heartbeat monitoring.
- Fault escalation.
- Safe mode constraints.
5.5 Questions to Guide Your Design
- What timeout margins avoid false positives?
- How many retries before reboot?
- What telemetry should be logged for operators?
5.6 Thinking Exercise
Design a recovery ladder for a payload task with low criticality.
5.7 The Interview Questions They’ll Ask
- “What is the difference between hardware and software watchdogs?”
- “How do you avoid reboot loops?”
- “What should safe mode preserve?”
5.8 Hints in Layers
Hint 1: Start with a single watchdog timer.
Hint 2: Add per-task heartbeat tracking.
Hint 3: Add escalation and cooldown counters.
Hint 4: Add structured fault logs.
5.9 Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Fault management | NASA GSFC-HDBK-8007 | Fault protection |
| Embedded reliability | Elecia White | Reliability |
| Systems ops | Space Mission Engineering | Recovery |
5.10 Implementation Phases
Phase 1: Heartbeat Monitor (3-4 days)
Goals: detect missed heartbeats. Tasks: implement heartbeat table and timeout checks. Checkpoint: missing heartbeat triggers fault log.
Phase 2: Escalation Ladder (3-4 days)
Goals: restart/reboot/safe logic. Tasks: implement fail counters and cooldowns. Checkpoint: faults escalate in correct order.
Phase 3: Fault Injection (2-3 days)
Goals: deterministic validation. Tasks: create fault traces and run tests. Checkpoint: logs match golden outputs.
5.11 Key Implementation Decisions
| Decision | Options | Recommendation | Rationale |
|---|---|---|---|
| Timeout | static / per-task | per-task | Matches task rate |
| Escalation | fixed / adaptive | fixed | Deterministic |
| Logging | text / JSON | JSON | Machine-readable |
6. Testing Strategy
6.1 Test Categories
| Category | Purpose | Examples |
|---|---|---|
| Unit Tests | Timeout logic | synthetic timers |
| Integration Tests | Fault traces | comms_stall.json |
| Edge Case Tests | Multi-fault | simultaneous stalls |
6.2 Critical Test Cases
- Single stall: restart only.
- Repeated stall: reboot after threshold.
- Persistent failure: safe mode entry.
6.3 Test Data
traces/comms_stall.json
traces/multi_fault.json
7. Common Pitfalls & Debugging
7.1 Frequent Mistakes
| Pitfall | Symptom | Solution |
|---|---|---|
| No cooldown | Reboot loop | Add cooldown timers |
| Wrong timeout | False faults | Increase margin |
| Missing logs | Ops blind | Log reason codes |
7.2 Debugging Strategies
- Replay fault traces deterministically.
- Verify escalation order in logs.
7.3 Performance Traps
Minimal; watchdog runs at low rate.
8. Extensions & Challenges
8.1 Beginner Extensions
- Add heartbeat jitter simulation.
8.2 Intermediate Extensions
- Add subsystem dependency checks (e.g., COMMS depends on EPS).
8.3 Advanced Extensions
- Integrate with a mode manager and simulate safe mode actions.
9. Real-World Connections
9.1 Industry Applications
- Fault protection in CubeSat missions.
- Safety systems in autonomous robots.
9.2 Related Open Source Projects
- cFS fault management apps.
- RTEMS watchdog examples.
9.3 Interview Relevance
- Demonstrates safety engineering and reliability design.
10. Resources
10.1 Essential Reading
- NASA GSFC-HDBK-8007.
- Elecia White, Making Embedded Systems.
10.2 Video Resources
- Fault management lectures.
10.3 Tools & Documentation
- C timers and watchdog APIs.
10.4 Related Projects in This Series
- P02-the-flight-state-machine-the-life-cycle.md
11. Self-Assessment Checklist
11.1 Understanding
- I can explain watchdog timeout selection.
- I can design escalation ladders.
- I can explain safe mode entry criteria.
11.2 Implementation
- Fault traces produce expected logs.
- Cooldowns prevent reboot loops.
- Deterministic outputs with fixed traces.
11.3 Growth
- I can integrate watchdog with mode management.
12. Submission / Completion Criteria
Minimum Viable Completion:
- Heartbeat monitoring and basic timeout detection.
Full Completion:
- Escalation ladder with restart/reboot/safe mode.
Excellence (Going Above & Beyond):
- Integration with mode manager and dependency-aware fault handling.