Project 25: Production Firmware Platform (State Machines, Logging, Watchdog, Boot Policy)

Build a production-style firmware platform layer for NeoTrellis with explicit state machines, structured diagnostics, safe config migration, and recovery-first behavior.

Quick Reference

Attribute	Value
Difficulty	Level 5: Master
Time Estimate	3-4 weeks
Main Programming Language	C/C++
Alternative Programming Languages	Rust embedded (architecture port)
Coolness Level	Level 5: The “WTF, That’s Possible?”
Business Potential	4. The “Disruptor”
Prerequisites	state machines, memory reliability, USB/device basics
Key Topics	state architecture, versioned config, logging, crash handling, watchdog policy, boot safety

1. Learning Objectives

By completing this project, you will:

Define explicit firmware state/event contracts.
Implement versioned config schema with migration logic.
Add structured, rate-limited event logging and crash records.
Build watchdog policy with fault evidence retention.
Validate boot image checks and safe fallback behavior.

2. All Theory Needed (Project-Scoped)

2.1 Explicit State Architecture

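An explicit state/event table replaces scattered boolean flags: every legal transition is listed in one place, so hidden flag interactions disappear and each failure mode can be reviewed as a missing or unwanted table entry.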
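
As a rough illustration of the table-driven idea, here is a minimal sketch in C. The states, events, and the names fw_dispatch and k_transitions are hypothetical and chosen only for this example; a real NeoTrellis firmware would define its own set.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical states and events, for illustration only. */
typedef enum { ST_BOOT, ST_IDLE, ST_ACTIVE, ST_FAULT } fw_state_t;
typedef enum { EV_INIT_OK, EV_START, EV_STOP, EV_ERROR, EV_RESET } fw_event_t;

typedef struct {
    fw_state_t from;
    fw_event_t event;
    fw_state_t to;
} fw_transition_t;

/* Every legal transition is listed here; anything absent is illegal. */
static const fw_transition_t k_transitions[] = {
    { ST_BOOT,   EV_INIT_OK, ST_IDLE   },
    { ST_BOOT,   EV_ERROR,   ST_FAULT  },
    { ST_IDLE,   EV_START,   ST_ACTIVE },
    { ST_IDLE,   EV_ERROR,   ST_FAULT  },
    { ST_ACTIVE, EV_STOP,    ST_IDLE   },
    { ST_ACTIVE, EV_ERROR,   ST_FAULT  },
    { ST_FAULT,  EV_RESET,   ST_BOOT   },
};

typedef struct {
    fw_state_t state;
    unsigned   illegal_count;   /* evidence for later review */
} fw_machine_t;

/* Applies a legal transition and returns true; otherwise leaves the
   state unchanged, counts the rejection, and returns false. */
static bool fw_dispatch(fw_machine_t *m, fw_event_t ev)
{
    const size_t n = sizeof k_transitions / sizeof k_transitions[0];
    for (size_t i = 0; i < n; ++i) {
        if (k_transitions[i].from == m->state && k_transitions[i].event == ev) {
            m->state = k_transitions[i].to;
            return true;
        }
    }
    m->illegal_count++;
    return false;
}
```

A linear scan is adequate for a handful of rows; a two-dimensional lookup array indexed by state and event is a common alternative when the table grows.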

2.2 Versioned Configuration

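Storing a schema version with every config record, and providing a migration step for each version change, lets new firmware upgrade old settings instead of misreading or discarding them.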
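
One way to structure this is a chain of single-step migrations, each upgrading version N to N+1, applied in order until the record reaches the current version. The sketch below assumes a hypothetical three-version schema; the field names, default values, and the cfg_migrate function are invented for illustration, and CRC verification of the stored record is assumed to have happened before this step.

```c
#include <stdbool.h>
#include <stdint.h>

#define CFG_CURRENT_VERSION 3u

/* Hypothetical in-RAM config; older versions leave later fields unset. */
typedef struct {
    uint16_t version;
    uint8_t  brightness;    /* present since v1 */
    uint8_t  debounce_ms;   /* added in v2 */
    uint16_t log_level;     /* added in v3 */
} cfg_t;

typedef bool (*cfg_migrate_fn)(cfg_t *c);

/* Each step fills in the new field with a default and bumps the version. */
static bool migrate_v1_to_v2(cfg_t *c) { c->debounce_ms = 5; c->version = 2; return true; }
static bool migrate_v2_to_v3(cfg_t *c) { c->log_level = 1;   c->version = 3; return true; }

/* Entry i upgrades version i+1 to version i+2. */
static const cfg_migrate_fn k_steps[] = { migrate_v1_to_v2, migrate_v2_to_v3 };

/* Runs the chain on a copy and commits only if every step succeeds,
   so a failed migration never leaves a half-upgraded record behind. */
static bool cfg_migrate(cfg_t *c)
{
    if (c->version == 0 || c->version > CFG_CURRENT_VERSION) {
        return false;   /* unknown or newer-than-firmware schema */
    }
    cfg_t work = *c;
    while (work.version < CFG_CURRENT_VERSION) {
        uint16_t before = work.version;
        if (!k_steps[before - 1](&work) || work.version != before + 1) {
            return false;
        }
    }
    *c = work;
    return true;
}
```

Chaining single-step migrations keeps each step small and testable, and a config written by any older version reaches the current schema by the same path.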

2.3 Crash Observability

A watchdog reset without retained context only masks the fault: the device comes back, but nothing records why it hung, so the root cause stays hidden.
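
A common way to retain context is a small record in RAM that survives reset, guarded by a magic value and a checksum so that random power-on contents are not mistaken for a real crash. The following is a host-testable sketch under that assumption; the names crash_capture and crash_take, the magic value, and the XOR checksum are all illustrative choices, and on a real target the slot would have to sit in a memory region the startup code does not zero (linker-script specific).

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define CRASH_MAGIC 0xC0DEFA17u   /* arbitrary marker, hypothetical */

typedef struct {
    uint32_t magic;
    uint32_t reset_reason;
    uint32_t fault_pc;
    uint32_t fault_lr;
    uint32_t checksum;   /* XOR of the four words above */
} crash_slot_t;

/* Stand-in for a no-init RAM slot; a plain static is used on the host. */
static crash_slot_t g_crash_slot;

static uint32_t crash_checksum(const crash_slot_t *s)
{
    return s->magic ^ s->reset_reason ^ s->fault_pc ^ s->fault_lr;
}

/* Called from the fault path just before the reset. */
static void crash_capture(uint32_t reason, uint32_t pc, uint32_t lr)
{
    g_crash_slot.magic        = CRASH_MAGIC;
    g_crash_slot.reset_reason = reason;
    g_crash_slot.fault_pc     = pc;
    g_crash_slot.fault_lr     = lr;
    g_crash_slot.checksum     = crash_checksum(&g_crash_slot);
}

/* Called early in boot: copies out a valid record, then clears the slot
   so the same crash is reported only once. */
static bool crash_take(crash_slot_t *out)
{
    bool valid = g_crash_slot.magic == CRASH_MAGIC &&
                 g_crash_slot.checksum == crash_checksum(&g_crash_slot);
    if (valid) {
        *out = g_crash_slot;
    }
    memset(&g_crash_slot, 0, sizeof g_crash_slot);
    return valid;
}
```

On a Cortex-M part the pc and lr values would typically be read from the stacked exception frame inside the fault handler; that part is hardware specific and left out here.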

2.4 Boot and Recovery Policy

Boot must validate the application image and the stored config before jumping to the application, and must fall back to a safe mode when either check fails.
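
The gate can be sketched as a pure decision function: check the image header, check the payload CRC, check the config verdict, and choose between the application and recovery. The header layout, the IMG_MAGIC value, and the boot_select name below are assumptions made for this example, as is the policy of sending an invalid config to recovery rather than to defaults. The CRC routine is the standard reflected CRC-32 (polynomial 0xEDB88320).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define IMG_MAGIC 0x494D4731u   /* ASCII "IMG1", hypothetical */

typedef struct {
    uint32_t magic;
    uint32_t length;   /* payload size in bytes */
    uint32_t crc32;    /* CRC-32 of the payload */
} img_header_t;

typedef enum { BOOT_APP, BOOT_RECOVERY } boot_target_t;

/* Bitwise CRC-32 (reflected, polynomial 0xEDB88320); slow but table-free. */
static uint32_t crc32_ieee(const uint8_t *p, size_t n)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < n; ++i) {
        crc ^= p[i];
        for (int b = 0; b < 8; ++b) {
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
        }
    }
    return ~crc;
}

/* Every failed check routes to recovery; the application is the
   target only when all checks pass. */
static boot_target_t boot_select(const img_header_t *h, const uint8_t *payload,
                                 size_t slot_size, bool config_ok)
{
    if (h->magic != IMG_MAGIC)                        return BOOT_RECOVERY;
    if (h->length == 0 || h->length > slot_size)      return BOOT_RECOVERY;
    if (crc32_ieee(payload, h->length) != h->crc32)   return BOOT_RECOVERY;
    if (!config_ok)                                   return BOOT_RECOVERY;
    return BOOT_APP;
}
```

Bounding the length against the slot size before computing the CRC matters: a corrupted header must not be able to make the bootloader read outside the image slot.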

2.5 OTA Feasibility Planning

Even without radio support, memory and boot design should preserve future staged-update paths.

3. Project Specification

3.1 What You Will Build

A platform layer that includes:

state transition engine
config versioning + migration
structured event logging
watchdog heartbeat strategy
crash record retention
boot validation gate
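
For the watchdog heartbeat strategy in the list above, one possible shape is a check-in mask: each critical task sets its bit, and a supervisor refreshes the hardware watchdog only when every bit was set since the last check. The task names, wdt_checkin, wdt_supervise, and the stubbed wdt_hw_refresh are hypothetical; the real refresh call depends on the target's watchdog peripheral.

```c
#include <stdbool.h>
#include <stdint.h>

/* One bit per task that must prove liveness (hypothetical task set). */
enum { TASK_INPUT = 1 << 0, TASK_LED = 1 << 1, TASK_USB = 1 << 2 };
#define TASK_ALL (TASK_INPUT | TASK_LED | TASK_USB)

static uint32_t g_alive_mask;
static unsigned g_wdt_kicks;   /* counts refreshes so the sketch is testable */

/* Stub standing in for the hardware watchdog refresh. */
static void wdt_hw_refresh(void) { g_wdt_kicks++; }

/* Each task calls this from its main loop with its own bit. */
static void wdt_checkin(uint32_t task_bit) { g_alive_mask |= task_bit; }

/* Called periodically by a supervisor. The hardware watchdog is refreshed
   only if every task checked in since the previous call; otherwise the
   refresh is withheld and the watchdog is left to expire. */
static bool wdt_supervise(void)
{
    bool all_alive = (g_alive_mask & TASK_ALL) == TASK_ALL;
    g_alive_mask = 0;
    if (all_alive) {
        wdt_hw_refresh();
    }
    return all_alive;
}
```

The point of the mask is that a single stuck task stops the refresh even while the rest of the system keeps running. If tasks check in from interrupt context or from preemptive threads, updates to the mask would need to be atomic.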

3.2 Functional Requirements

Illegal state transitions are detected and rejected.
Config migrations succeed across at least three version jumps.
Watchdog recovers induced hangs without boot loops.
Crash records persist across reset and can be decoded.
Boot validation prevents a jump to any image whose metadata is invalid.

3.3 Non-Functional Requirements

Performance: logging overhead is bounded (fixed-size buffer, rate-limited producers) and low enough not to disturb normal operation.
Reliability: repeated fault injection yields safe recovery.
Maintainability: architecture docs map to implementation clearly.
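
To make the bounded-overhead requirement concrete, here is a minimal sketch of a fixed-size event ring with a per-window emission budget. The capacity, burst size, and the names log_emit and log_refill are assumptions for illustration; the refill period would be whatever periodic tick the firmware already has.

```c
#include <stdbool.h>
#include <stdint.h>

#define LOG_CAPACITY 8u   /* ring slots */
#define LOG_BURST    4u   /* events accepted per refill window */

/* Compact record: event ID plus a small payload, no strings. */
typedef struct {
    uint32_t ts;
    uint16_t id;
    uint16_t arg;
} log_event_t;

typedef struct {
    log_event_t ring[LOG_CAPACITY];
    uint32_t head;      /* total events written so far */
    uint32_t tokens;    /* remaining budget in this window */
    uint32_t dropped;   /* events rejected by the rate limit */
} log_t;

static void log_init(log_t *l)
{
    l->head = 0;
    l->tokens = LOG_BURST;
    l->dropped = 0;
}

/* Called from a periodic tick to restore the budget. */
static void log_refill(log_t *l) { l->tokens = LOG_BURST; }

/* Constant-time: writes one slot (overwriting the oldest when the ring
   is full) or counts a drop when the budget is exhausted. */
static bool log_emit(log_t *l, uint32_t ts, uint16_t id, uint16_t arg)
{
    if (l->tokens == 0) {
        l->dropped++;
        return false;
    }
    l->tokens--;
    log_event_t *e = &l->ring[l->head % LOG_CAPACITY];
    e->ts = ts;
    e->id = id;
    e->arg = arg;
    l->head++;
    return true;
}
```

Keeping a dropped counter means a log storm still leaves evidence that it happened, even though the individual events were discarded.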

3.4 Real World Outcome

$ fw_reliability_suite --fault-injection all --cycles 500
[STATE] illegal_transition_count=0
[CONFIG] migrations_pass=4/4 crc_failures=0
[WDT] induced_hangs=120 recovered=120 boot_loop=0
[CRASHLOG] retained=120 decoded=120
[BOOT] image_validation pass=500 fail=0
PASS: production gate cleared

4. Solution Architecture

4.1 High-Level Design

Boot stage --> image/config validation --> runtime state machine
                                         |--> watchdog heartbeat monitor
                                         |--> structured log ring
                                         |--> crash capture on fault

4.2 Key Components

Component	Responsibility	Key Decision
State engine	legal transition control	explicit event/state tables
Config manager	schema versioning + migration	CRC + rollback-safe update
Event logger	runtime observability	compact event IDs + payloads
Fault manager	watchdog + crash capture	reset with retained evidence
Boot gate	startup integrity checks	fail-safe recovery mode

4.3 Core Record Shapes (Pseudocode)

state_event = {from_state, event, to_state, guard_result, ts}
config_record = {version, length, crc, payload}
crash_record = {reset_reason, fault_pc, fault_lr, mode, counters}
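
The record shapes above could map to fixed-width C structs along these lines. The field widths, the 56-byte payload, and the four-entry counters array are assumptions picked so that each record has a predictable size; the pseudocode itself does not specify them.

```c
#include <stdint.h>

/* One logged transition: 8 bytes. */
typedef struct {
    uint8_t  from_state;
    uint8_t  event;
    uint8_t  to_state;
    uint8_t  guard_result;   /* 1 = guard passed, 0 = rejected */
    uint32_t ts;             /* tick count at the transition */
} state_event_t;

/* One stored config: 64 bytes including an 8-byte header. */
typedef struct {
    uint16_t version;        /* schema version of the payload */
    uint16_t length;         /* bytes of payload actually used */
    uint32_t crc;            /* checksum over the used payload */
    uint8_t  payload[56];
} config_record_t;

/* One retained crash: 32 bytes. */
typedef struct {
    uint32_t reset_reason;
    uint32_t fault_pc;
    uint32_t fault_lr;
    uint32_t mode;           /* firmware state at the time of the fault */
    uint32_t counters[4];    /* e.g. boot count, fault count (assumed) */
} crash_record_t;

/* Fail the build if a compiler lays the records out differently. */
_Static_assert(sizeof(state_event_t)   == 8,  "state_event_t layout");
_Static_assert(sizeof(config_record_t) == 64, "config_record_t layout");
_Static_assert(sizeof(crash_record_t)  == 32, "crash_record_t layout");
```

Pinning sizes with static assertions is useful here because these records are written to flash or retained RAM and read back by a later firmware build or a host-side decoder.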

5. Implementation Guide

5.1 Phases

Build state/event map and transition validator.
Add config versioning + migration harness.
Add structured logs and bounded buffering.
Add watchdog and crash capture.
Add boot validation and recovery mode tests.

5.2 The Core Question You’re Answering

“Can this firmware fail safely, recover predictably, and preserve enough evidence for root-cause analysis?”

5.3 Questions to Guide Design

Which transitions are legal and why?
What minimum crash data must survive reset?
Which conditions trigger recovery mode instead of normal boot?

5.4 Thinking Exercise

Create a top-10 failure-mode matrix with detection signal, immediate action, and retained evidence for each mode.

5.5 Interview Questions They Will Ask

How do explicit state machines improve embedded reliability?
Why is config migration mandatory in long-lived firmware?
How should watchdog and crash logging complement each other?
What makes boot validation robust enough for field updates?
How would you extend this design for OTA in future hardware revisions?

5.6 Hints in Layers

Hint 1: Define architecture tables before implementation.
Hint 2: Keep logging structured and rate-limited.
Hint 3: Capture crash context before watchdog reset when possible.
Hint 4: Exercise repeated fault-injection loops early.

5.7 Common Pitfalls and Debugging

Problem	Why	Fix	Quick Test
Mode lockups	implicit state coupling	centralize transition table	randomized event-order test
Reset loops	watchdog recovery without escalation	add loop counter + safe mode	forced startup fault test
Upgrade breaks settings	missing migration coverage	test version corpus + migration chain	replay archived configs
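
The loop counter plus safe mode fix for reset loops in the table above can be sketched as follows. The threshold of three attempts and the names boot_begin and boot_mark_healthy are illustrative assumptions, and the counter is assumed to live in storage that survives a reset.

```c
#include <stdint.h>

#define BOOT_MAX_ATTEMPTS 3u   /* assumed threshold before escalation */

/* Assumed to be kept in memory that survives a reset. */
typedef struct {
    uint32_t failed_boots;
} boot_state_t;

typedef enum { MODE_NORMAL, MODE_SAFE } boot_mode_t;

/* Called at the start of every boot. The attempt is counted as failed
   up front; only a later health confirmation clears the counter. */
static boot_mode_t boot_begin(boot_state_t *s)
{
    if (s->failed_boots >= BOOT_MAX_ATTEMPTS) {
        return MODE_SAFE;
    }
    s->failed_boots++;
    return MODE_NORMAL;
}

/* Called once the application has run healthily for long enough. */
static void boot_mark_healthy(boot_state_t *s)
{
    s->failed_boots = 0;
}
```

Counting the attempt before launching the application is what breaks the loop: a firmware that crashes before it can report anything still advances the counter, so repeated early faults end in safe mode instead of resetting forever.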

5.8 Definition of Done

Transition table and invariants are implemented and tested.
Config migration suite passes all target versions.
Watchdog/fault-injection run shows safe recovery and evidence retention.
Boot validation rejects invalid image metadata deterministically.

6. References

ARM Cortex-M4 Technical Reference Manual
TinyUSB Documentation
USB 2.0 Specification (USB-IF)
“Design Patterns for Embedded Systems in C” by Bruce Powel Douglass
“Making Embedded Systems, 2nd Ed” by Elecia White