Project 13: Fault Hunter - Watchdog and Fault Logging
Implement watchdog resets, persistent fault logs, and safe recovery paths.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Level 3: Advanced |
| Time Estimate | 20-30 hours |
| Main Programming Language | C++ |
| Alternative Programming Languages | C |
| Coolness Level | Level 4: Hardcore Tech Flex |
| Business Potential | 3. The “Service” Model |
| Prerequisites | C/C++ basics, Teensyduino setup, basic electronics, ability to use a multimeter/logic analyzer |
| Key Topics | watchdog, fault logging, reset causes |
1. Learning Objectives
By completing this project, you will:
- Explain the core question for this project in your own words.
- Implement the main workflow and validate it with measurements.
- Handle at least two failure modes and document recovery.
- Produce a deterministic report that matches hardware behavior.
2. All Theory Needed (Per-Concept Breakdown)
Watchdog Timers, Fault Logging, and Recovery
Fundamentals
A watchdog timer resets the system if software fails to feed it in time. This prevents permanent hangs but is only useful if you can diagnose why the reset happened. Fault logging records reset causes and error context in nonvolatile memory so failures can be reproduced. Reliable systems combine watchdogs, logs, and safe recovery.
Deep Dive into the concept
Watchdogs are simple but powerful. The timeout must be long enough to avoid false resets but short enough to recover quickly from real faults. Feeding the watchdog only after critical tasks complete turns it into a health indicator. Fault logging must be power-failure tolerant: append-only records with checksums are common. Storage choices (EEPROM emulation vs LittleFS) affect durability and wear. Recovery strategies include safe mode after repeated resets, reduced feature sets, and explicit error reporting. This project has you inject faults, verify logs, and implement deterministic recovery behavior so resets become diagnosable events rather than mysteries.
How this fits into the project
This concept directly powers the implementation choices and validation steps in this project.
Definitions & key terms
- Watchdog: Hardware timer that resets the system if not serviced.
- Reset cause: Register indicating why the MCU reset.
- Fault log: Persistent record of failures across reboots.
- Safe mode: Reduced functionality mode after repeated faults.
Mental model diagram (ASCII)
Main loop -> Feed watchdog -> (fault) -> Reset -> Log -> Safe mode
How it works (step-by-step)
- Configure watchdog timeout and enable it.
- Log reset cause on boot.
- Inject a fault and verify reset + log.
- Implement safe mode after repeated faults.
Minimal concrete example
// API names are illustrative; the exact calls depend on your MCU and
// watchdog library (e.g., WDT_T4 on Teensy 4.x).
wdt_enable(1000);          // arm the watchdog with a 1 s timeout
if (healthy) wdt_feed();   // feed only when the system checks out
Common misconceptions
- Watchdogs remove the need for debugging.
- Any reset is fine as long as it reboots.
- Logging can be done in the last millisecond.
Check-your-understanding questions
- Why should watchdog feeding be conditional?
- How do you log without corrupting storage?
- What is a good fault injection strategy?
Check-your-understanding answers
- It ensures the system only feeds the watchdog when healthy.
- Use append-only records with checksum and flush.
- Force a dead loop or corrupt buffer to trigger reset.
Real-world applications
- Industrial controllers
- Remote sensors
- Automotive ECUs
Where you’ll apply it
- See §3.2 Functional Requirements and §5.10 Implementation Phases in this file.
- Also used in: P12-power-budgeter-battery-life-and-low-power-modes.md, P17-teensy-instrument-platform.md
References
- MCU watchdog documentation
- Embedded reliability engineering texts
Key insights
A watchdog without logs only tells you that something failed, not why.
Summary
Fault logging turns watchdog resets into actionable diagnostics.
Homework/Exercises to practice the concept
- Trigger a watchdog reset and confirm the log entry persists.
- Test safe mode by forcing three consecutive faults.
Solutions to the homework/exercises
- Write a record before reset and read it on next boot.
- Increment a boot counter and branch to safe mode at threshold.
3. Project Specification
3.1 What You Will Build
Implement watchdog resets, persistent fault logs, and safe recovery paths.
3.2 Functional Requirements
- Enable watchdog with safe timeout.
- Log reset cause and context to nonvolatile memory.
- Inject faults to test recovery.
- Implement safe mode after repeated failures.
3.3 Non-Functional Requirements
- Performance: Meet the target timing/throughput for the project.
- Reliability: Detect errors and recover without undefined behavior.
- Usability: Provide clear logs and a repeatable workflow.
3.4 Example Usage / Output
./P13-fault-hunter-watchdog-and-fault-logging --run
3.5 Data Formats / Schemas / Protocols
Log record: [boot_id][reset_cause][error_code][crc]
3.6 Edge Cases
- Power loss during log write
- Watchdog triggers during long task
- Log storage full
3.7 Real World Outcome
You will run the project and see deterministic logs and measurements that match physical hardware behavior.
3.7.1 How to Run (Copy/Paste)
cd project-root
make
./P13-fault-hunter-watchdog-and-fault-logging --run
3.7.2 Golden Path Demo (Deterministic)
Use a fixed input configuration and a known test signal. Capture output for 60 seconds and verify it matches expected values.
3.7.3 If CLI: exact terminal transcript
$ ./P13-fault-hunter-watchdog-and-fault-logging --run --seed 42
[INFO] Fault Hunter - Watchdog and Fault Logging starting
[INFO] Report saved to data/report.csv
[INFO] Status: OK
$ echo $?
0
Failure Demo (Deterministic)
$ ./P13-fault-hunter-watchdog-and-fault-logging --run --missing-device
[ERROR] Device not detected
$ echo $?
2
4. Solution Architecture
4.1 High-Level Design
Inputs -> Acquisition -> Processing -> Output/Log
4.2 Key Components
| Component | Responsibility | Key Decisions |
|---|---|---|
| Acquisition | Configure peripherals and capture data | Use stable clock settings |
| Processing | Convert raw data to meaningful values | Apply calibration/filters |
| Output/Log | Emit reports and logs | CSV for reproducibility |
4.3 Data Structures (No Full Code)
struct Sample {
uint32_t timestamp_us;
uint32_t value;
uint32_t flags;
};
4.4 Algorithm Overview
Key Algorithm: Measurement + Report
- Initialize hardware and verify configuration.
- Capture data and record timestamps.
- Compute metrics and write report.
Complexity Analysis:
- Time: O(n) in samples
- Space: O(n) for log storage
5. Implementation Guide
5.1 Development Environment Setup
# Arduino IDE + Teensyduino must be installed
# Optional CLI workflow
arduino-cli core update-index
arduino-cli core install teensy:avr
5.2 Project Structure
project-root/
├── src/
│ ├── main.ino
│ ├── hw_config.h
│ └── measurements.cpp
├── tools/
│ └── analyze.py
├── data/
│ └── samples.csv
└── README.md
5.3 The Core Question You’re Answering
“How do I make failures diagnosable instead of mysterious?”
5.4 Concepts You Must Understand First
Stop and research these before coding:
- watchdog, fault logging, reset causes
- Data logging and measurement techniques
- Basic timing math and error analysis
5.5 Questions to Guide Your Design
- Where will logs be stored?
- How will you prevent log corruption?
- What triggers safe mode?
5.6 Thinking Exercise
Design a fault record that survives power loss.
5.7 The Interview Questions They’ll Ask
- What is a watchdog used for?
- How do you log reset causes?
- Why is safe mode important?
5.8 Hints in Layers
- Use append-only logging with checksums.
- Feed watchdog only after key tasks complete.
- Test with a deliberate infinite loop.
5.9 Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Reliability | Making Embedded Systems | Ch. 9 |
| Fault logging | Designing Embedded Systems | Ch. 7 |
| Storage | Embedded Systems Handbook | Ch. 10 |
5.10 Implementation Phases
Phase 1: Foundation (6 hours)
Goals:
- Enable watchdog
- Read reset cause
Tasks:
- Enable watchdog
- Read reset cause
Checkpoint: Watchdog verified
Phase 2: Core Functionality (10 hours)
Goals:
- Implement logging
- Inject faults
Tasks:
- Implement logging
- Inject faults
Checkpoint: Fault logs persistent
Phase 3: Polish (6 hours)
Goals:
- Safe mode
- Recovery report
Tasks:
- Safe mode
- Recovery report
Checkpoint: Full workflow
5.11 Key Implementation Decisions
| Decision | Options | Recommendation | Rationale |
|---|---|---|---|
| Buffering | Single buffer, double buffer | Double buffer | Avoids data loss during processing |
| Logging format | CSV, binary | CSV | Human-readable while still scriptable |
| Clock speed | Default, overclock | Default | Keeps peripherals in spec |
6. Testing Strategy
6.1 Test Categories
| Category | Purpose | Examples |
|---|---|---|
| Unit Tests | Validate math, parsing, and conversions | Timer math, CRC checks |
| Integration Tests | Verify peripherals and pipelines | DMA -> buffer -> log |
| Edge Case Tests | Handle boundary conditions | Brownout, missing sensor |
6.2 Critical Test Cases
- A forced infinite loop triggers a watchdog reset and a persistent log entry.
- Power loss during a log write leaves earlier records intact (CRC still valid).
- Three consecutive faulty boots enter safe mode; a clean boot clears the counter.
- A full log store is handled without corrupting existing records.
6.3 Test Data
Use a fixed test input pattern and record outputs to data/report.csv
7. Common Pitfalls & Debugging
7.1 Frequent Mistakes
| Pitfall | Symptom | Solution |
|---|---|---|
| Feeding the watchdog unconditionally | Hung tasks never trigger a reset | Feed only after critical tasks complete |
| Watchdog timeout too short | Spurious resets during long but healthy operations | Size the timeout against the longest legitimate task |
| Deferring all logging to the moment of failure | Missing or corrupt fault records | Append checksummed records as events occur, not in the last millisecond |
7.2 Debugging Strategies
- Print the reset-cause register first thing on every boot.
- Toggle a spare GPIO around each watchdog feed and inspect the timing on a logic analyzer.
- Inject one fault at a time (dead loop, corrupted buffer) and confirm the expected log entry appears.
7.3 Performance Traps
Large buffers improve stability but increase latency. Measure both throughput and jitter to choose the right size.
8. Extensions & Challenges
8.1 Beginner Extensions
{ex_begin}
8.2 Intermediate Extensions
{ex_inter}
8.3 Advanced Extensions
{ex_adv}
9. Real-World Connections
9.1 Industry Applications
{industry_apps}
9.2 Related Open Source Projects
{open_source}
9.3 Interview Relevance
{interview_rel}
10. Resources
10.1 Essential Reading
{resources}
10.2 Video Resources
- Embedded systems timing walkthrough (YouTube)
- Teensy hardware deep dive (Conference talk)
10.3 Tools & Documentation
- Teensyduino: Toolchain for Teensy boards
- Logic Analyzer: Timing verification
- Multimeter: Voltage and current measurement
10.4 Related Projects in This Series
{related_projects}
11. Self-Assessment Checklist
11.1 Understanding
- I can explain the main concept without notes.
- I can explain why the measurements match (or do not match) expectations.
- I understand at least one tradeoff made in this project.
11.2 Implementation
- All functional requirements are met.
- All critical test cases pass.
- Logs and reports are reproducible.
- Edge cases are handled.
11.3 Growth
- I documented lessons learned.
- I can explain this project in a job interview.
- I identified one improvement for next iteration.
12. Submission / Completion Criteria
Minimum Viable Completion: Watchdog enabled with a safe timeout; reset cause logged to nonvolatile memory; one injected fault produces a persistent log entry.
Full Completion: All functional requirements met, including safe mode after repeated failures, with the §3.6 edge cases handled and critical tests passing.
Excellence (Going Above & Beyond): Wear-aware log storage, a recovery report tool, and documented fault-injection coverage beyond the required cases.