Project 14: Fault Injection and Watchdog Recovery

A reliability test harness that injects faults, triggers watchdog resets, and logs recovery behavior.

Quick Reference

Attribute                          Value
Difficulty                         Level 3: Advanced
Time Estimate                      1-2 weeks
Main Programming Language          C
Alternative Programming Languages  C++, Rust, Ada
Coolness Level                     Level 4: Hardcore Tech Flex
Business Potential                 1. The “Resume Gold”
Prerequisites                      Watchdog configuration, boot diagnostics, reset cause flags
Key Topics                         Watchdogs, fault injection, safe-state recovery

1. Learning Objectives

By completing this project, you will:

  1. Configure and service watchdogs correctly.
  2. Inject faults and verify reset behavior.
  3. Log reset causes and safe-state recovery.
  4. Build a repeatable fault test plan.

2. All Theory Needed (Per-Concept Breakdown)

Watchdogs, Fault Injection, and Recovery

Fundamentals A watchdog is a hardware timer that resets the MCU if software fails to service it in time. It is a safety net for hangs, deadlocks, or unexpected infinite loops. Fault injection is the practice of deliberately causing errors to verify that recovery mechanisms work. Together, they form the backbone of reliable embedded systems.

Deep Dive into the concept STM32 MCUs provide independent and windowed watchdogs. The independent watchdog (IWDG) runs from its own clock source (the LSI oscillator), so it keeps counting even if the main clock fails. The windowed watchdog (WWDG) must be refreshed within a time window, so it can detect both stalls (refresh too late) and runaway loops (refresh too early). A robust design chooses one or both depending on safety requirements.

Fault injection exercises these systems by simulating failures: intentionally blocking the main loop, corrupting memory, disabling interrupts, or creating brown-out conditions. The key is to observe how the system recovers and to log the reset cause. STM32 provides reset cause flags (e.g., watchdog reset, power-on reset), which should be read and cleared at boot; flags that are not cleared stay set across later resets and produce misleading reports. A fault-recovery plan includes a safe-state strategy: after reset, peripherals should be placed in a known safe state before resuming operation. In a sensor system, that might mean disabling actuators until diagnostics pass.

You should also define watchdog servicing points. Instead of refreshing in the main loop unconditionally, you can refresh only after critical tasks complete, ensuring the watchdog resets the system if those tasks hang.

Another aspect is timeout selection. If the timeout is too short, you get false resets; if it is too long, faults persist too long. The right value is derived from worst-case execution time plus margin.

Fault injection results must be deterministic: you should produce a reproducible scenario, log the fault, see a reset, and then confirm recovery. This project teaches that reliability is tested, not assumed.

How this fits into the project In Fault Injection and Watchdog Recovery, you inject controlled faults, verify watchdog resets, and implement a safe-state recovery flow.

Definitions & key terms

  • Watchdog -> Hardware timer that resets the MCU if not periodically refreshed.
  • IWDG -> Independent watchdog clocked from a dedicated oscillator.
  • WWDG -> Windowed watchdog that must be refreshed within a specific time window.
  • Reset cause -> Hardware flags indicating why the MCU reset.
  • Safe state -> A known output configuration that avoids unsafe behavior after reset.
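The difference between a plain timeout and a window can be explored on a host PC before touching hardware. The sketch below is a simplified timing model, not the WWDG register interface; the names `wdg_model_t` and `wdg_refresh_outcome` and the millisecond values in the usage note are hypothetical.

```c
#include <stdint.h>

/* Hypothetical host-side model of a windowed watchdog (not a register API). */
typedef struct {
    uint32_t window_open_ms;  /* refreshing earlier than this is a fault */
    uint32_t timeout_ms;      /* not refreshing by this time is a fault  */
} wdg_model_t;

typedef enum { WDG_OK, WDG_RESET_EARLY, WDG_RESET_LATE } wdg_outcome_t;

/* elapsed_ms: time since the previous refresh when the next one is attempted. */
wdg_outcome_t wdg_refresh_outcome(const wdg_model_t *w, uint32_t elapsed_ms)
{
    if (elapsed_ms < w->window_open_ms) return WDG_RESET_EARLY; /* runaway loop */
    if (elapsed_ms >= w->timeout_ms)    return WDG_RESET_LATE;  /* stall        */
    return WDG_OK;
}
```

With a window of 20 ms to 50 ms, a refresh after 5 ms or after 50 ms is a fault in this model, while a refresh after 30 ms is accepted. Setting `window_open_ms` to 0 models a watchdog without a lower bound, which is how the IWDG is used in this project.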

Mental model diagram (ASCII)

Task progress -> refresh watchdog -> continue
(if no refresh) -> watchdog timeout -> reset -> safe-state

How it works (step-by-step, with invariants and failure modes)

  1. Configure watchdog timeout based on worst-case task time.
  2. Refresh watchdog only after critical tasks complete.
  3. Inject faults by blocking the loop or disabling interrupts.
  4. On reset, read reset cause flags and log them.
  5. Invariant: faults lead to reset and safe-state; failure mode: watchdog refreshed too early or not configured.
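Step 4 can be sketched as a pure decode function that takes the raw flag register value, which keeps it testable off-target. The bit positions below are assumptions based on the usual RCC_CSR layout and should be checked against the reference manual for your exact part; the function name and the returned strings are hypothetical.

```c
#include <stdint.h>

/* Assumed RCC_CSR reset-flag bit positions; verify against the reference
 * manual for your exact part before relying on them. */
#define CSR_PINRSTF   (1UL << 26)
#define CSR_PORRSTF   (1UL << 27)
#define CSR_SFTRSTF   (1UL << 28)
#define CSR_IWDGRSTF  (1UL << 29)
#define CSR_WWDGRSTF  (1UL << 30)

/* Maps a raw flag word to a short name for the boot log. */
const char *reset_cause_name(uint32_t csr)
{
    /* Check the watchdog flags first: the pin flag is commonly set
     * together with other flags, so it is the least specific cause. */
    if (csr & CSR_IWDGRSTF) return "IWDG";
    if (csr & CSR_WWDGRSTF) return "WWDG";
    if (csr & CSR_SFTRSTF)  return "SOFTWARE";
    if (csr & CSR_PORRSTF)  return "POWER_ON";
    if (csr & CSR_PINRSTF)  return "PIN";
    return "UNKNOWN";
}
```

On the target, the argument would be the value read from the reset flag register early in boot, and the flags would be cleared once the result has been logged so the next reset starts from a clean slate.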

Minimal concrete example

if (task_ok) {
    IWDG->KR = 0xAAAA; // refresh
}
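A single flag generalizes to one check-in bit per critical task, so the refresh happens only when every task has reported progress since the last refresh. This is a minimal sketch of that idea; the task names and the functions `wdg_checkin` and `wdg_should_refresh` are hypothetical, and the actual write to the watchdog is left to the caller.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical task identifiers; one bit per task that must report progress. */
enum { TASK_SENSOR = 1u << 0, TASK_CONTROL = 1u << 1, TASK_COMMS = 1u << 2 };
#define ALL_TASKS (TASK_SENSOR | TASK_CONTROL | TASK_COMMS)

static uint32_t checkins;   /* bits set since the last allowed refresh */

/* Called by each task when it completes one cycle of useful work. */
void wdg_checkin(uint32_t task_bit) { checkins |= task_bit; }

/* Returns true when every task has checked in, and starts a new round.
 * On the target, the caller would then write the refresh key. */
bool wdg_should_refresh(void)
{
    if ((checkins & ALL_TASKS) != ALL_TASKS) return false;
    checkins = 0;
    return true;
}
```

If any one task stops checking in, `wdg_should_refresh()` keeps returning false and the watchdog eventually expires, even though the main loop itself is still running.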

Common misconceptions

  • ‘Watchdog guarantees safety’ ignores unsafe outputs after reset.
  • ‘Refresh in the main loop is enough’ ignores that the loop can still run while tasks are stuck.
  • ‘Fault injection is optional’ ignores that recovery must be proven.

Check-your-understanding questions

  1. Why might you prefer the independent watchdog over the windowed watchdog?
  2. How do you ensure that the watchdog detects a stalled critical task?
  3. What should happen immediately after a watchdog reset?

Check-your-understanding answers

  1. IWDG uses its own clock and is less dependent on system clocks, making it more robust.
  2. Refresh only after the task completes, not unconditionally in the main loop.
  3. Set outputs to a safe state, log reset cause, then resume normal init.

Real-world applications

  • Industrial controllers that must recover from faults.
  • Automotive ECUs with safety requirements.
  • Remote IoT devices where manual reset is impossible.

Where you’ll apply it

References

  • STM32F3 Reference Manual (IWDG/WWDG chapters).
  • IEC 61508 safety standard concepts (watchdog usage).
  • ST application notes on reliability and watchdog design.

Key insights

  • A watchdog can only be trusted once faults have been injected and recovery has been verified.

Summary Watchdogs and fault injection convert reliability into a measurable, testable property. They ensure the system recovers predictably when things go wrong.

Homework/Exercises to practice the concept

  1. Set a watchdog timeout and intentionally stall the main loop to confirm reset.
  2. Log reset cause after a watchdog event and verify flags are cleared.
  3. Define a safe-state output configuration and test it after reset.

Solutions to the homework/exercises

  1. The MCU should reset after the timeout; if not, check watchdog configuration.
  2. Read reset cause registers early in boot; clear them after logging.
  3. Set all PWM outputs to 0 and disable actuators before normal init.
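The third solution can be sketched as a pair of functions: one that forces the safe state and one that verifies it. The `outputs_t` structure is a hypothetical stand-in for whatever registers or driver handles your system uses; on real hardware the fields would be timer compare registers and a GPIO enable line.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical snapshot of the outputs this system drives. */
typedef struct {
    uint16_t pwm_duty[4];     /* compare values, 0 = output off */
    bool     actuator_enable; /* master enable line             */
} outputs_t;

/* Forces every output to its safe value. */
void apply_safe_state(outputs_t *o)
{
    for (int i = 0; i < 4; i++) o->pwm_duty[i] = 0;
    o->actuator_enable = false;
}

/* Reports whether the outputs currently match the safe state. */
bool outputs_are_safe(const outputs_t *o)
{
    for (int i = 0; i < 4; i++)
        if (o->pwm_duty[i] != 0) return false;
    return !o->actuator_enable;
}
```

Keeping the check separate from the action lets the boot log report SAFE_STATE=OK or SAFE_STATE=FAIL from a readback rather than from an assumption.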

Boot and Reset Sequence (Vector Table, Startup, and Memory Map)

Fundamentals On a Cortex-M MCU, power-up and reset are not vague events; they are a strict and observable sequence of reads and writes. The core starts by fetching the initial stack pointer and reset handler from the vector table at a fixed address. It then runs startup code that sets up memory, initializes .data and .bss, and finally calls main. If any of those steps are wrong, nothing else matters: your program never truly starts. For STM32F3, the memory map is fixed by the silicon, and your linker script determines where code and data live. Understanding how the vector table is laid out, how the stack pointer is initialized, and how the reset handler transitions into C code is the difference between ‘it sometimes boots’ and ‘it always boots’. Bring-up work relies on this knowledge because every later peripheral setup is layered on top of a correct reset path.

Deep Dive into the concept The reset sequence on Cortex-M is deterministic and documented, which makes it measurable. At reset, the core reads address 0x00000000 (after any memory remap) to load the initial Main Stack Pointer (MSP). The next word is the reset handler address. This means your linker script and vector table must be in the right place; otherwise, the CPU jumps into invalid memory. Startup code then performs three vital tasks: it configures the vector table location (VTOR if you remap), initializes the data segment by copying initial values from flash to SRAM, and clears the .bss segment to zero. Only after those steps does it call SystemInit() (often to set clocks) and then main(). If any of these steps are skipped, global variables may contain garbage and peripheral drivers may read invalid configuration.

On STM32 devices, option bytes and BOOT pins determine whether the device boots from user flash, system memory (factory bootloader), or SRAM. That means a hardware strap or option byte can change the initial vector table base. For robust firmware, you should treat boot configuration as a first-class system requirement: document the expected BOOT pin state, verify it, and log it if possible.

Another common subtlety is stack alignment. The ARM procedure call standard (AAPCS) requires the stack to be 8-byte aligned at function-call boundaries, whether or not the FPU is used, and Cortex-M cores can align the stack to 8 bytes on exception entry (the STKALIGN bit in the Configuration and Control Register). If your linker script or startup code sets a misaligned initial stack, the resulting faults or corrupted values may show up only under interrupt load.

Memory map knowledge is crucial for fault handling too. The STM32F303 has flash and SRAM ranges, plus peripheral address ranges. When a hard fault occurs, the fault status registers (CFSR, HFSR) and stacked registers point to an address. Knowing whether that address is in flash, SRAM, or peripheral space tells you whether the fault came from a bad pointer, an invalid register access, or an executing-from-data bug.

Finally, boot-time initialization controls determinism. If your clock setup depends on external crystals, startup time may vary with temperature or load. If you use the system bootloader for DFU, it may reconfigure clocks or remap memory. A disciplined bring-up includes measuring the reset-to-main latency, capturing the clock source at startup, and asserting that vector table and stack pointers are in range. That is why a ‘boot checklist’ is not busywork; it is a measurable contract between the MCU and your firmware.
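Classifying a faulting address by region is simple enough to sketch as a lookup. The boundaries below are the architectural Cortex-M regions, which are wider than the flash and SRAM actually fitted to a given device; the function name and the returned labels are hypothetical.

```c
#include <stdint.h>

/* Architectural Cortex-M regions; device-specific flash and SRAM
 * ranges are narrower and come from the reference manual. */
const char *addr_region(uint32_t addr)
{
    if (addr < 0x20000000u) return "code";        /* flash, system memory  */
    if (addr < 0x40000000u) return "sram";
    if (addr < 0x60000000u) return "peripheral";
    if (addr < 0xE0000000u) return "external";    /* external RAM/devices  */
    return "system";                              /* core registers (PPB)  */
}
```

A fault handler could pass the stacked program counter or the reported fault address to this function and include the label in its log, which narrows the search before you open a debugger.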

How this fits into the project In Fault Injection and Watchdog Recovery, you build a bring-up checklist that proves the reset path is correct. You verify that the vector table is in the right place, that the startup code reached main, and that the system can print or blink predictable outputs without random early faults.

Definitions & key terms

  • Vector table -> An array of exception and interrupt handler addresses located at a fixed base address.
  • Reset handler -> The first code executed after reset; it initializes memory and calls main.
  • MSP/PSP -> Main and Process Stack Pointers used by the core; MSP is used during reset and exceptions.
  • .data/.bss -> Memory sections for initialized and zero-initialized globals in SRAM.
  • BOOT pins -> Hardware straps that select boot source (flash, system memory, or SRAM).

Mental model diagram (ASCII)

Reset -> Read MSP -> Read Reset Handler -> Init data/bss -> SystemInit -> main()
   |               |                     |                     |
   |               v                     v                     v
 Vector Table    Stack Pointer       Memory Layout        App Logic

How it works (step-by-step, with invariants and failure modes)

  1. Reset is asserted; core fetches MSP and reset handler from vector table.
  2. Startup code initializes .data and clears .bss, ensuring globals are valid.
  3. SystemInit configures clocks and vector table relocation if needed.
  4. main() runs; peripheral drivers assume memory and clock configuration are stable.
  5. Invariant: vector table address and stack pointer must point to valid memory; failure mode: hard fault before main.

Minimal concrete example

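extern unsigned long _estack;   // top of stack, defined by the linker script
void SystemInit(void);
int main(void);
void Reset_Handler(void);

// First two vector table entries: initial MSP, then the reset handler.
__attribute__((section(".isr_vector"), used))
const void* vector_table[] = {
    (void*)&_estack,
    (void*)Reset_Handler,
};

void Reset_Handler(void) {
    // A complete handler copies .data from flash and zeroes .bss here.
    SystemInit();
    main();
    for (;;) { }                // main() should never return
}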
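The memory initialization that a complete reset handler performs can be sketched as a plain function. On the target the pointers would come from linker-script symbols (commonly named along the lines of `_sidata`, `_sdata`, `_edata`, `_sbss`, `_ebss` in GCC-based templates, which is an assumption about your toolchain); here they are parameters so the routine can be exercised on a host. The name `init_memory` is hypothetical.

```c
#include <stdint.h>

/* Copy the initial values of .data from its load image, then zero .bss.
 * All ranges are word-aligned and end pointers are one past the last word. */
void init_memory(const uint32_t *data_load,
                 uint32_t *data_start, uint32_t *data_end,
                 uint32_t *bss_start,  uint32_t *bss_end)
{
    while (data_start < data_end) *data_start++ = *data_load++;
    while (bss_start  < bss_end)  *bss_start++  = 0;
}
```

Calling this before `SystemInit()` and `main()` is what guarantees that initialized globals hold their declared values and uninitialized globals start at zero.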

Common misconceptions

  • ‘Reset just jumps to main’ ignores memory initialization and vector table requirements.
  • ‘If code compiles, it will boot’ ignores linker script and BOOT pin configuration.
  • ‘Stack alignment only matters for floating point’ is false; misalignment can break exception entry.

Check-your-understanding questions

  1. Why does the MCU read two words from address 0x00000000 during reset?
  2. What happens if the vector table is placed in the wrong memory region?
  3. How can you prove that .bss was cleared correctly?
  4. Why might a firmware boot correctly when debugging but fail after power cycle?

Check-your-understanding answers

  1. The first word initializes MSP and the second provides the reset handler address so the core knows where to execute.
  2. The CPU jumps to an invalid address, often leading to a hard fault before main.
  3. Place a global variable without initialization and check that it starts at zero after reset.
  4. Debug tools can remap memory or configure clocks; power cycle uses raw BOOT configuration.

Real-world applications

  • Bootloader designs that switch between factory and application firmware.
  • Field-upgradable devices that must always recover from partial updates.
  • Safety-critical systems that log reset causes for post-mortem analysis.

Where you’ll apply it

References

  • Joseph Yiu, ‘The Definitive Guide to ARM Cortex-M3/M4’ (startup and vector tables).
  • STMicroelectronics STM32F3 Reference Manual (memory map and system configuration).
  • Elecia White, ‘Making Embedded Systems’ (bring-up discipline).

Key insights

  • Boot is just a deterministic memory fetch sequence; if you can validate each step, you can trust every later subsystem.

Summary A reliable firmware starts with a reliable reset path. The vector table, stack pointer, and startup code are the first contracts your MCU executes. Proving they are correct is the foundation for every timing, peripheral, and safety claim you make later.

Homework/Exercises to practice the concept

  1. Open the linker map file and identify where the vector table is located.
  2. Write a test that prints the MSP value at boot and verify it is within SRAM.
  3. Simulate a wrong BOOT pin configuration and observe the effect.

Solutions to the homework/exercises

  1. The vector table should live at the flash base address unless you intentionally remap.
  2. On STM32F303, a valid MSP should fall inside the SRAM address range; if not, fix the linker script.
  3. With BOOT0 high at reset (and the boot option bits at their defaults), the device boots from system memory, so the factory bootloader runs and your application in flash does not start.
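The range checks from exercise 2 can be written as small predicates over the first two vector table words. The flash and SRAM sizes below are assumptions for illustration; take the real values from your linker script. The function names are hypothetical.

```c
#include <stdint.h>
#include <stdbool.h>

/* Assumed memory layout; take the real values from your linker script. */
#define FLASH_BASE_ADDR 0x08000000u
#define FLASH_SIZE      (256u * 1024u)
#define SRAM_BASE_ADDR  0x20000000u
#define SRAM_SIZE       (40u * 1024u)

/* The initial MSP may equal the first address past SRAM, because the
 * stack is full-descending. It should also be 8-byte aligned. */
bool msp_is_valid(uint32_t msp)
{
    return msp > SRAM_BASE_ADDR
        && msp <= SRAM_BASE_ADDR + SRAM_SIZE
        && (msp % 8u) == 0;
}

/* Handler addresses in the vector table have bit 0 set (Thumb state). */
bool reset_vector_is_valid(uint32_t vec)
{
    uint32_t addr = vec & ~1u;
    return (vec & 1u) != 0
        && addr >= FLASH_BASE_ADDR
        && addr < FLASH_BASE_ADDR + FLASH_SIZE;
}
```

Early in boot, the firmware could read the two words at the vector table base, run both checks, and log the result alongside the reset cause.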

Power Modes, Clock Gating, and Wake-Up Sources

Fundamentals Power modes reduce energy consumption by shutting down parts of the MCU or lowering clock speeds. STM32 devices typically provide sleep, stop, and standby modes, each with different trade-offs between power and wake latency. Clock gating disables unused peripherals, saving power without stopping the CPU entirely. Understanding these modes lets you design efficient battery-powered systems.

Deep Dive into the concept Power consumption in MCUs is largely driven by clock frequency and active peripherals. Sleep mode stops the CPU but keeps peripherals running; stop mode halts most clocks while retaining RAM; standby mode powers down most of the chip and resets on wake. Each mode has a different wake-up time and context retention. Clock gating is another important technique: even when the CPU runs, you can disable clocks to unused peripherals in the RCC to reduce power. On STM32F3, the voltage regulator mode also affects power: in stop mode the regulator can stay in main mode or switch to low-power mode, which lowers current at the cost of a longer wake-up time.

A power mode explorer project should measure current draw in each mode using a multimeter or current probe. You must also identify which wake sources are available: GPIO interrupts, RTC alarms, watchdogs, or timer events. The firmware should log the wake reason, because debugging power modes without that log is difficult.

Another key aspect is peripheral state. Some peripherals must be reinitialized after wake, especially after stop or standby. You need a clean re-init path and a test that proves it works. Power modes also interact with clocks: after stop mode, the system may revert to a default clock source. A robust design reconfigures the clock tree on wake and revalidates timing. By building a power mode explorer, you learn to treat energy as a measurable resource and to build systems that can trade performance for battery life predictably.

How this fits into the project In Fault Injection and Watchdog Recovery, you cycle through power modes, measure current draw, and verify wake-up behavior and clock restoration.

Definitions & key terms

  • Sleep -> CPU halted, peripherals running.
  • Stop -> Most clocks off, RAM retained.
  • Standby -> Deepest low-power mode, most state lost.
  • Clock gating -> Disabling peripheral clocks to save power.
  • Wake source -> Event that brings MCU back to active mode.

Mental model diagram (ASCII)

Run -> Sleep -> Stop -> Standby
 ^       ^        ^        ^
 |   wake sources (GPIO/RTC/Timer)

How it works (step-by-step, with invariants and failure modes)

  1. Configure and enter each power mode.
  2. Measure current draw and log results.
  3. Trigger wake-up via GPIO or timer.
  4. Reconfigure clocks and peripherals after wake.
  5. Invariant: system resumes reliably; failure mode: lost configuration or unexpected reset.

Minimal concrete example

// Enter stop mode and wait for interrupt
HAL_PWR_EnterSTOPMode(PWR_LOWPOWERREGULATOR_ON, PWR_STOPENTRY_WFI);

Common misconceptions

  • ‘Sleep and stop are the same’ ignores clock and state differences.
  • ‘Wake-up restores everything’ ignores that clocks and peripherals often reset.
  • ‘Power modes are only for battery devices’ ignores thermal and reliability benefits.

Check-your-understanding questions

  1. Which mode retains RAM while stopping most clocks?
  2. Why must you reconfigure clocks after stop mode?
  3. How do you measure current draw accurately?

Check-your-understanding answers

  1. Stop mode retains RAM but disables most clocks.
  2. The system often reverts to HSI or default clock after stop; you must restore PLL settings.
  3. Use a multimeter in series or a current probe and log stable readings.

Real-world applications

  • Battery-powered sensors and IoT devices.
  • Energy-efficient industrial monitoring.
  • Thermally constrained systems.

Where you’ll apply it

References

  • STM32F3 Reference Manual (power control chapter).
  • ST application notes on low-power design.
  • ‘Making Embedded Systems’ (power and reliability).

Key insights

  • Low power is an engineered state, not a default; you must measure and verify it.

Summary Power modes trade performance for energy savings. By measuring current draw and validating wake behavior, you gain control over your system’s energy profile.

Homework/Exercises to practice the concept

  1. Measure current draw in run and sleep modes and compute percentage savings.
  2. Test wake-up from a GPIO interrupt and log the wake reason.
  3. Verify that UART baud rate is correct after waking from stop mode.

Solutions to the homework/exercises

  1. Sleep current should be significantly lower; compute (run - sleep) / run.
  2. After wake, read the wake-up and pending-interrupt flags, then store the decoded wake reason in backup registers or RAM so it can be logged.
  3. Reinitialize clock tree and UART after wake to restore baud accuracy.
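The savings formula from solution 1 as a small helper, with a guard for a zero or negative run-mode reading. The function name is hypothetical and the currents in the usage note are illustrative values, not measurements.

```c
/* Fractional current saving of a low-power mode relative to run mode. */
double current_saving(double run_ma, double low_ma)
{
    if (run_ma <= 0.0) return 0.0;   /* guard against a bad measurement */
    return (run_ma - low_ma) / run_ma;
}
```

For example, an illustrative run current of 20 mA and a sleep current of 5 mA give (20 - 5) / 20 = 0.75, a 75% saving.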

3. Project Specification

3.1 What You Will Build

A fault injection harness that intentionally stalls tasks or disables interrupts to force watchdog resets, then logs recovery.

3.2 Functional Requirements

  1. Watchdog Setup: Configure IWDG or WWDG with appropriate timeout.
  2. Fault Injection: Provide at least two fault scenarios (stall loop, disable IRQ).
  3. Reset Cause Logging: Log watchdog reset cause at boot.
  4. Safe-State Recovery: Ensure outputs are safe after reset.

3.3 Non-Functional Requirements

  • Performance: Watchdog timeout within specified bounds.
  • Reliability: Faults always lead to reset and recovery.
  • Usability: Clear log describing fault and recovery.

3.4 Example Usage / Output

FAULT=stall_loop
RESET_CAUSE=IWDG
SAFE_STATE=OK
RESULT=PASS

3.5 Data Formats / Schemas / Protocols

Log format, one key=value pair per line:

FAULT=<name>
RESET_CAUSE=<flag>
SAFE_STATE=<OK|FAIL>
RESULT=<PASS|FAIL>
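A formatter for the line-oriented log shown in the transcripts of Section 3.7 might look like the sketch below. The `fault_log_t` structure and `format_fault_log` function are hypothetical; the field names in the output are taken from the transcripts.

```c
#include <stdio.h>
#include <stdint.h>

/* Hypothetical, already-decoded report for one fault test. */
typedef struct {
    const char *fault_name;
    const char *reset_cause;   /* e.g. "IWDG" or "UNKNOWN" */
    uint8_t     safe_state_ok;
    uint8_t     passed;
} fault_log_t;

/* Writes the report into buf. Returns what snprintf returns: the number
 * of characters needed (excluding the terminator), or a negative value. */
int format_fault_log(char *buf, size_t len, const fault_log_t *r)
{
    return snprintf(buf, len,
                    "FAULT=%s\nRESET_CAUSE=%s\nSAFE_STATE=%s\nRESULT=%s\n",
                    r->fault_name, r->reset_cause,
                    r->safe_state_ok ? "OK" : "FAIL",
                    r->passed ? "PASS" : "FAIL");
}
```

Formatting into a buffer first, rather than printing directly, makes the output easy to compare against expected text in a host-side test and leaves the choice of UART driver to the caller.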

3.6 Edge Cases

  • Watchdog refreshed too early, masking faults.
  • Reset cause flags not cleared, causing false reports.
  • Safe-state not applied before outputs re-enable.
  • Fault injection code disables watchdog incorrectly.

3.7 Real World Outcome

You will force a fault, observe a watchdog reset, and verify safe-state recovery.

3.7.1 How to Run (Copy/Paste)

$ make flash
$ screen /dev/tty.usbmodem* 115200

3.7.2 Golden Path Demo (Deterministic)

  • Trigger the stall fault and observe watchdog reset.

3.7.3 CLI Transcript (Success)

FAULT=stall_loop
RESET_CAUSE=IWDG
SAFE_STATE=OK
RESULT=PASS
# Exit code: 0

3.7.4 Failure Demo (Watchdog Not Armed)

FAULT=stall_loop
RESET_CAUSE=UNKNOWN
SAFE_STATE=FAIL
RESULT=FAIL
# Exit code: 2

4. Solution Architecture

Fault scenarios trigger watchdog resets; boot code logs reset cause and enforces safe-state outputs.

4.1 High-Level Design

Fault Injection -> Watchdog Timeout -> Reset -> Safe-State -> Log

4.2 Key Components

Component         Responsibility                          Key Decisions
Watchdog Manager  Configures and refreshes watchdog       Refresh only after critical tasks
Fault Injector    Triggers controlled faults              Menu-controlled fault selection
Recovery Handler  Sets safe outputs and logs reset cause  Run before normal init

4.3 Data Structures (No Full Code)

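#include <stdint.h>

typedef struct {
    const char* fault_name;    // which fault scenario was injected
    uint32_t reset_flags;      // raw reset cause flags read at boot
    uint8_t safe_state_ok;     // nonzero if safe-state outputs were verified
} fault_report_t;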

4.4 Algorithm Overview

Fault Test Flow

  1. Enable watchdog.
  2. Trigger fault scenario.
  3. On reset, log cause and verify safe state.

Complexity: O(1) per test.
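Step 3 needs to know which fault was injected before the reset. One way to carry that across a watchdog reset is a small marker in RAM that the startup code does not initialize, validated by a magic number. This is a sketch under that assumption: the structure, the functions, and the magic value are hypothetical, and placing the variable in a no-init section requires matching linker script support.

```c
#include <stdint.h>
#include <stdbool.h>

#define FAULT_MAGIC 0xFA17C0DEu   /* arbitrary marker value */

/* Hypothetical record kept in RAM that startup code leaves untouched. */
typedef struct {
    uint32_t magic;
    uint32_t fault_id;
    uint32_t check;      /* magic ^ fault_id, guards against random contents */
} fault_marker_t;

/* Call just before injecting a fault. */
void marker_arm(fault_marker_t *m, uint32_t fault_id)
{
    m->magic = FAULT_MAGIC;
    m->fault_id = fault_id;
    m->check = FAULT_MAGIC ^ fault_id;
}

/* Call at boot. Returns true and the fault id if a valid marker is
 * present; the marker is invalidated either way so it is reported once. */
bool marker_take(fault_marker_t *m, uint32_t *fault_id)
{
    bool valid = (m->magic == FAULT_MAGIC)
              && (m->check == (m->magic ^ m->fault_id));
    if (valid) *fault_id = m->fault_id;
    m->magic = 0;
    return valid;
}
```

After a power-on reset the RAM contents are arbitrary, so the magic and check words are what distinguish a real marker from noise.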


5. Implementation Guide

5.1 Development Environment Setup

make init
make flash
screen /dev/tty.usbmodem* 115200

5.2 Project Structure

project-root/
|-- src/
|   |-- main.c
|   |-- drivers/
|   `-- app/
|-- include/
|-- Makefile
`-- README.md

5.3 The Core Question You’re Answering

“How do I prove my system recovers safely from faults?”

5.4 Concepts You Must Understand First

  1. Watchdog operation and timeouts.
  2. Reset cause flags.
  3. Safe-state design.

5.5 Questions to Guide Your Design

  1. Which faults are realistic for your system?
  2. What outputs must be forced safe after reset?
  3. What timeout balances safety and false resets?

5.6 Thinking Exercise

Watchdog Window

Worst-case task time = 150 ms
Watchdog timeout = 250 ms
Margin = 100 ms
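Turning the 250 ms figure into register values is a short calculation, using the IWDG relationship timeout = (reload + 1) x prescaler / f_LSI. The sketch below assumes a nominal 40 kHz LSI, prescalers from 4 to 256 in powers of two, and a 12-bit reload; check these against your device's reference manual. The function name is hypothetical.

```c
#include <stdint.h>

/* Assumed nominal LSI frequency; the real oscillator varies from part
 * to part, so treat computed timeouts as approximate. */
#define LSI_HZ 40000u

/* Picks the smallest prescaler (4..256) whose 12-bit reload can reach
 * timeout_ms. Returns 0 on success, -1 if the timeout is out of range. */
int iwdg_compute(uint32_t timeout_ms, uint32_t *prescaler, uint32_t *reload)
{
    for (uint32_t div = 4; div <= 256; div *= 2) {
        uint64_t ticks = ((uint64_t)timeout_ms * LSI_HZ) / (1000u * (uint64_t)div);
        if (ticks >= 1 && ticks <= 4096) {
            *prescaler = div;
            *reload = (uint32_t)(ticks - 1);
            return 0;
        }
    }
    return -1;
}
```

For the 250 ms timeout above, this yields prescaler 4 and reload 2499 at the assumed 40 kHz. If the oscillator actually ran at 50 kHz, the same settings would expire after 200 ms, which is still above the 150 ms worst-case task time; that is the kind of check the margin is for.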

5.7 The Interview Questions They’ll Ask

  1. Why use a watchdog?
  2. What is the difference between IWDG and WWDG?
  3. How do you prove watchdog recovery works?

5.8 Hints in Layers

Hint 1: Refresh watchdog only after critical tasks complete.
Hint 2: Read reset cause flags early in boot.
Hint 3: Force a fault by disabling interrupts.

5.9 Books That Will Help

Topic        Book                     Chapter
Reliability  Making Embedded Systems  Ch. 11
Watchdogs    STM32 Reference Manual   IWDG/WWDG

5.10 Implementation Phases

Phase 1: Watchdog Bring-Up (2 days)

Configure watchdog and confirm reset.

Phase 2: Fault Injection (4 days)

Add stall and IRQ-disable faults.

Phase 3: Recovery Validation (3 days)

Log reset cause and safe-state outputs.

5.11 Key Implementation Decisions

Decision       Options                Recommendation       Rationale
Watchdog type  IWDG vs WWDG           IWDG                 Independent clock reliability
Faults         Stall vs memory fault  Stall + IRQ disable  Simple and reproducible

6. Testing Strategy

6.1 Test Categories

Category           Purpose              Examples
Unit Tests         Reset flag decode    Flag mapping checks
Integration Tests  Watchdog reset flow  Stall loop fault
Edge Case Tests    No watchdog armed    Fail scenario

6.2 Critical Test Cases

  1. Stall Fault: Watchdog resets within timeout.
  2. Reset Cause: Reset cause logged as IWDG.
  3. Safe-State: Outputs forced safe after reset.
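The three critical cases can be folded into one pass/fail decision per scenario. The sketch below assumes a test runner that has already collected the observations; the `fault_obs_t` structure and `fault_test_passed` function are hypothetical.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical observations collected for one fault scenario. */
typedef struct {
    bool     reset_seen;
    uint32_t reset_latency_ms;   /* fault injection to reboot message */
    bool     cause_is_iwdg;
    bool     safe_state_ok;
} fault_obs_t;

/* A scenario passes only if all three critical cases hold. */
bool fault_test_passed(const fault_obs_t *o, uint32_t max_latency_ms)
{
    return o->reset_seen
        && o->reset_latency_ms <= max_latency_ms
        && o->cause_is_iwdg
        && o->safe_state_ok;
}
```

The latency bound should be somewhat larger than the configured watchdog timeout, since the measured time also includes boot and the first log output.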

6.3 Test Data

Fault=stall, Reset=IWDG, Safe=OK

7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

Pitfall                        Symptom                        Solution
Refreshing watchdog too often  Faults not detected            Refresh only after critical tasks
No reset flag clear            Wrong reset cause              Clear flags after logging
Unsafe outputs                 Actuators enabled after reset  Set safe state before init

7.2 Debugging Strategies

  • Use a GPIO to indicate watchdog refresh moments.
  • Log reset flags early in boot.
  • Validate safe-state outputs with a multimeter.

7.3 Performance Traps

A watchdog timeout that is too long delays recovery; one that is too short causes false resets.


8. Extensions & Challenges

8.1 Beginner Extensions

  • Add a watchdog status LED.

8.2 Intermediate Extensions

  • Add a fault injection menu over UART.

8.3 Advanced Extensions

  • Implement brown-out detection and recovery.

9. Real-World Connections

9.1 Industry Applications

  • Safety-critical systems: Watchdog recovery is mandatory.
  • Remote IoT devices: Automatic recovery without physical access.

9.2 Related Open Source Projects

  • Zephyr watchdog subsystem: Reference for watchdog usage.
  • STM32Cube examples: Watchdog configuration demos.

9.3 Interview Relevance

  • Watchdog and fault recovery questions.
  • Reset cause and safe-state design discussions.

10. Resources

10.1 Essential Reading

  • Making Embedded Systems by Elecia White - Reliability and watchdogs.
  • STM32F3 Reference Manual by ST - Watchdog configuration.

10.2 Video Resources

  • Watchdog timer tutorial for STM32.
  • Fault injection concepts in embedded systems.

10.3 Tools & Documentation

  • STM32CubeIDE: Build and debug.
  • Logic analyzer: Observe reset behavior.

11. Self-Assessment Checklist

11.1 Understanding

  • I can explain watchdog types and behavior.
  • I can interpret reset cause flags.
  • I can design a safe-state strategy.

11.2 Implementation

  • Fault injection triggers reset as expected.
  • Safe-state outputs applied after reset.
  • Logs show correct reset cause.

11.3 Growth

  • I can add new fault scenarios.
  • I can justify watchdog timeouts.
  • I can explain this system in interviews.

12. Submission / Completion Criteria

Minimum Viable Completion:

  • Watchdog configured.
  • At least one fault injected.
  • Reset cause logged.

Full Completion:

  • Multiple faults tested.
  • Safe-state verified.

Excellence (Going Above & Beyond):

  • Automated fault test suite and report.