Project 13: Fault Hunter - Watchdog and Fault Logging
Implement watchdog resets, persistent fault logs, and safe recovery paths.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Level 3: Advanced |
| Time Estimate | 20-30 hours |
| Main Programming Language | C++ |
| Alternative Programming Languages | C |
| Coolness Level | Level 4: Hardcore Tech Flex |
| Business Potential | 3. The “Service” Model |
| Prerequisites | C/C++ basics, Teensyduino setup, basic electronics, ability to use a multimeter/logic analyzer |
| Key Topics | watchdog, fault logging, reset causes |
1. Learning Objectives
By completing this project, you will:
- Explain the core question for this project in your own words.
- Implement the main workflow and validate it with measurements.
- Handle at least two failure modes and document recovery.
- Produce a deterministic report that matches hardware behavior.
2. All Theory Needed (Per-Concept Breakdown)
Watchdog Timers, Fault Logging, and Recovery
Fundamentals
A watchdog timer resets the system if software fails to feed it in time. This prevents permanent hangs but is only useful if you can diagnose why the reset happened. Fault logging records reset causes and error context in nonvolatile memory so failures can be reproduced. Reliable systems combine watchdogs, logs, and safe recovery.
Deep Dive into the concept
Watchdogs are simple but powerful. The timeout must be long enough to avoid false resets but short enough to recover quickly from real faults. Feeding the watchdog only after critical tasks complete turns it into a health indicator. Fault logging must be power-failure tolerant: append-only records with checksums are common. Storage choices (EEPROM emulation vs LittleFS) affect durability and wear. Recovery strategies include safe mode after repeated resets, reduced feature sets, and explicit error reporting. This project has you inject faults, verify logs, and implement deterministic recovery behavior so resets become diagnosable events rather than mysteries.
How this fits into the project
This concept directly powers the implementation choices and validation steps in this project.
Definitions & key terms
- Watchdog: Hardware timer that resets the system if not serviced.
- Reset cause: Register indicating why the MCU reset.
- Fault log: Persistent record of failures across reboots.
- Safe mode: Reduced functionality mode after repeated faults.
Mental model diagram (ASCII)
Main loop -> Feed watchdog -> (fault) -> Reset -> Log -> Safe mode
How it works (step-by-step)
- Configure watchdog timeout and enable it.
- Log reset cause on boot.
- Inject a fault and verify reset + log.
- Implement safe mode after repeated faults.
Minimal concrete example
// API names are illustrative; the exact calls depend on your MCU and
// watchdog library (e.g., WDT_T4 on Teensy 4.x).
wdt_enable(1000);          // arm the watchdog with a 1 s timeout
if (healthy) wdt_feed();   // feed only when the system checks out
Common misconceptions
- Watchdogs remove the need for debugging.
- Any reset is fine as long as it reboots.
- Logging can be done in the last millisecond.
Check-your-understanding questions
- Why should watchdog feeding be conditional?
- How do you log without corrupting storage?
- What is a good fault injection strategy?
Check-your-understanding answers
- It ensures the system only feeds the watchdog when healthy.
- Use append-only records with checksum and flush.
- Force a dead loop or corrupt buffer to trigger reset.
Real-world applications
- Industrial controllers
- Remote sensors
- Automotive ECUs
Where you’ll apply it
- See §3.2 Functional Requirements and §5.10 Implementation Phases in this file.
- Also used in: P12-power-budgeter-battery-life-and-low-power-modes.md, P17-teensy-instrument-platform.md
References
- MCU watchdog documentation
- Embedded reliability engineering texts
Key insights
A watchdog without logs only tells you that something failed, not why.
Summary
Fault logging turns watchdog resets into actionable diagnostics.
Homework/Exercises to practice the concept
- Trigger a watchdog reset and confirm the log entry persists.
- Test safe mode by forcing three consecutive faults.
Solutions to the homework/exercises
- Write a record before reset and read it on next boot.
- Increment a boot counter and branch to safe mode at threshold.
3. Project Specification
3.1 What You Will Build
Implement watchdog resets, persistent fault logs, and safe recovery paths.
3.2 Functional Requirements
- Enable watchdog with safe timeout.
- Log reset cause and context to nonvolatile memory.
- Inject faults to test recovery.
- Implement safe mode after repeated failures.
3.3 Non-Functional Requirements
- Performance: Meet the target timing/throughput for the project.
- Reliability: Detect errors and recover without undefined behavior.
- Usability: Provide clear logs and a repeatable workflow.
3.4 Example Usage / Output
./P13-fault-hunter-watchdog-and-fault-logging --run
3.5 Data Formats / Schemas / Protocols
Log record: [boot_id][reset_cause][error_code][crc]
3.6 Edge Cases
- Power loss during log write
- Watchdog triggers during long task
- Log storage full
3.7 Real World Outcome
You will run the project and see deterministic logs and measurements that match physical hardware behavior.
3.7.1 How to Run (Copy/Paste)
cd project-root
make
./P13-fault-hunter-watchdog-and-fault-logging --run
3.7.2 Golden Path Demo (Deterministic)
Use a fixed input configuration and a known test signal. Capture output for 60 seconds and verify it matches expected values.
3.7.3 If CLI: exact terminal transcript
$ ./P13-fault-hunter-watchdog-and-fault-logging --run --seed 42
[INFO] Fault Hunter - Watchdog and Fault Logging starting
[INFO] Report saved to data/report.csv
[INFO] Status: OK
$ echo $?
0
Failure Demo (Deterministic)
$ ./P13-fault-hunter-watchdog-and-fault-logging --run --missing-device
[ERROR] Device not detected
$ echo $?
2
4. Solution Architecture
4.1 High-Level Design
Inputs -> Acquisition -> Processing -> Output/Log
4.2 Key Components
| Component | Responsibility | Key Decisions |
|---|---|---|
| Acquisition | Configure peripherals and capture data | Use stable clock settings |
| Processing | Convert raw data to meaningful values | Apply calibration/filters |
| Output/Log | Emit reports and logs | CSV for reproducibility |
4.3 Data Structures (No Full Code)
struct Sample {
uint32_t timestamp_us;
uint32_t value;
uint32_t flags;
};
4.4 Algorithm Overview
Key Algorithm: Measurement + Report
- Initialize hardware and verify configuration.
- Capture data and record timestamps.
- Compute metrics and write report.
Complexity Analysis:
- Time: O(n) in samples
- Space: O(n) for log storage
5. Implementation Guide
5.1 Development Environment Setup
# Arduino IDE + Teensyduino must be installed
# Optional CLI workflow
arduino-cli core update-index
arduino-cli core install teensy:avr
5.2 Project Structure
project-root/
├── src/
│ ├── main.ino
│ ├── hw_config.h
│ └── measurements.cpp
├── tools/
│ └── analyze.py
├── data/
│ └── samples.csv
└── README.md
5.3 The Core Question You’re Answering
“How do I make failures diagnosable instead of mysterious?”
5.4 Concepts You Must Understand First
Stop and research these before coding:
- watchdog, fault logging, reset causes
- Data logging and measurement techniques
- Basic timing math and error analysis
5.5 Questions to Guide Your Design
- Where will logs be stored?
- How will you prevent log corruption?
- What triggers safe mode?
5.6 Thinking Exercise
Design a fault record that survives power loss.
5.7 The Interview Questions They’ll Ask
- What is a watchdog used for?
- How do you log reset causes?
- Why is safe mode important?
5.8 Hints in Layers
- Use append-only logging with checksums.
- Feed watchdog only after key tasks complete.
- Test with a deliberate infinite loop.
5.9 Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Reliability | Making Embedded Systems | Ch. 9 |
| Fault logging | Designing Embedded Systems | Ch. 7 |
| Storage | Embedded Systems Handbook | Ch. 10 |
5.10 Implementation Phases
Phase 1: Foundation (6 hours)
Goals:
- Enable watchdog
- Read reset cause
Tasks:
- Enable watchdog
- Read reset cause
Checkpoint: Watchdog verified
Phase 2: Core Functionality (10 hours)
Goals:
- Implement logging
- Inject faults
Tasks:
- Implement logging
- Inject faults
Checkpoint: Fault logs persistent
Phase 3: Polish (6 hours)
Goals:
- Safe mode
- Recovery report
Tasks:
- Safe mode
- Recovery report
Checkpoint: Full workflow
5.11 Key Implementation Decisions
| Decision | Options | Recommendation | Rationale |
|---|---|---|---|
| Buffering | Single buffer, double buffer | Double buffer | Avoids data loss during processing |
| Logging format | CSV, binary | CSV | Human-readable while still scriptable |
| Clock speed | Default, overclock | Default | Keeps peripherals in spec |
6. Testing Strategy
6.1 Test Categories
| Category | Purpose | Examples |
|---|---|---|
| Unit Tests | Validate math, parsing, and conversions | Timer math, CRC checks |
| Integration Tests | Verify peripherals and pipelines | DMA -> buffer -> log |
| Edge Case Tests | Handle boundary conditions | Brownout, missing sensor |
6.2 Critical Test Cases
- A forced infinite loop triggers a watchdog reset and a persistent log entry.
- Power loss during a log write leaves earlier records intact (CRC still valid).
- Three consecutive faulty boots enter safe mode; a clean boot clears the counter.
- A full log store is handled without corrupting existing records.
6.3 Test Data
Use a fixed test input pattern and record outputs to data/report.csv
7. Common Pitfalls & Debugging
7.1 Frequent Mistakes
| Pitfall | Symptom | Solution |
|---|---|---|
| Feeding the watchdog unconditionally | Hung tasks never trigger a reset | Feed only after critical tasks complete |
| Watchdog timeout too short | Spurious resets during long but healthy operations | Size the timeout against the longest legitimate task |
| Deferring all logging to the moment of failure | Missing or corrupt fault records | Append checksummed records as events occur, not in the last millisecond |
7.2 Debugging Strategies
- Print the reset-cause register first thing on every boot.
- Toggle a spare GPIO around each watchdog feed and inspect the timing on a logic analyzer.
- Inject one fault at a time (dead loop, corrupted buffer) and confirm the expected log entry appears.
7.3 Performance Traps
Large buffers improve stability but increase latency. Measure both throughput and jitter to choose the right size.
8. Extensions & Challenges
8.1 Beginner Extensions
{ex_begin}
8.2 Intermediate Extensions
{ex_inter}
8.3 Advanced Extensions
{ex_adv}
9. Real-World Connections
9.1 Industry Applications
{industry_apps}
9.2 Related Open Source Projects
{open_source}
9.3 Interview Relevance
{interview_rel}
10. Resources
10.1 Essential Reading
{resources}
10.2 Video Resources
- Embedded systems timing walkthrough (YouTube)
- Teensy hardware deep dive (Conference talk)
10.3 Tools & Documentation
- Teensyduino: Toolchain for Teensy boards
- Logic Analyzer: Timing verification
- Multimeter: Voltage and current measurement
10.4 Related Projects in This Series
{related_projects}
11. Self-Assessment Checklist
11.1 Understanding
- I can explain the main concept without notes.
- I can explain why the measurements match (or do not match) expectations.
- I understand at least one tradeoff made in this project.
11.2 Implementation
- All functional requirements are met.
- All critical test cases pass.
- Logs and reports are reproducible.
- Edge cases are handled.
11.3 Growth
- I documented lessons learned.
- I can explain this project in a job interview.
- I identified one improvement for next iteration.
12. Submission / Completion Criteria
Minimum Viable Completion: Watchdog enabled with a safe timeout; reset cause logged to nonvolatile memory; one injected fault produces a persistent log entry.
Full Completion: All functional requirements met, including safe mode after repeated failures, with the §3.6 edge cases handled and critical tests passing.
Excellence (Going Above & Beyond): Wear-aware log storage, a recovery report tool, and documented fault-injection coverage beyond the required cases.