Project 25: Production Firmware Platform (State Machines, Logging, Watchdog, Boot Policy)
Build a production-style firmware platform layer for NeoTrellis with explicit state machines, structured diagnostics, safe config migration, and recovery-first behavior.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Level 5: Master |
| Time Estimate | 3-4 weeks |
| Main Programming Language | C/C++ |
| Alternative Programming Languages | Rust embedded (architecture port) |
| Coolness Level | Level 5: The “WTF, That’s Possible?” |
| Business Potential | 4. The “Disruptor” |
| Prerequisites | state machines, memory reliability, USB/device basics |
| Key Topics | state architecture, versioned config, logging, crash handling, watchdog policy, boot safety |
1. Learning Objectives
By completing this project, you will:
- Define explicit firmware state/event contracts.
- Implement versioned config schema with migration logic.
- Add structured, rate-limited event logging and crash records.
- Build watchdog policy with fault evidence retention.
- Validate boot image checks and safe fallback behavior.
2. All Theory Needed (Project-Scoped)
2.1 Explicit State Architecture
State tables prevent hidden flag interactions and make failure modes reviewable.
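As a minimal sketch of the idea, the table below lists every legal transition and rejects everything else. The state and event names (`ST_BOOT`, `EV_KEY_PRESS`, and so on) and the `fw_sm_dispatch` helper are hypothetical, chosen only for illustration; the real set depends on your firmware's modes.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical states and events, for illustration only. */
typedef enum { ST_BOOT, ST_IDLE, ST_ACTIVE, ST_FAULT, ST_COUNT } fw_state_t;
typedef enum { EV_INIT_OK, EV_KEY_PRESS, EV_TIMEOUT, EV_ERROR, EV_COUNT } fw_event_t;

typedef struct {
    fw_state_t from;
    fw_event_t event;
    fw_state_t to;
} transition_t;

/* Every legal transition is listed here; anything absent is illegal. */
static const transition_t k_transitions[] = {
    { ST_BOOT,   EV_INIT_OK,   ST_IDLE   },
    { ST_BOOT,   EV_ERROR,     ST_FAULT  },
    { ST_IDLE,   EV_KEY_PRESS, ST_ACTIVE },
    { ST_IDLE,   EV_ERROR,     ST_FAULT  },
    { ST_ACTIVE, EV_TIMEOUT,   ST_IDLE   },
    { ST_ACTIVE, EV_ERROR,     ST_FAULT  },
};

typedef struct {
    fw_state_t state;
    uint32_t   illegal_count; /* rejected (state, event) pairs seen so far */
} fw_sm_t;

/* Applies a legal transition and returns true; otherwise leaves the
 * state untouched, counts the rejection, and returns false. */
static bool fw_sm_dispatch(fw_sm_t *sm, fw_event_t ev)
{
    for (size_t i = 0; i < sizeof k_transitions / sizeof k_transitions[0]; i++) {
        if (k_transitions[i].from == sm->state && k_transitions[i].event == ev) {
            sm->state = k_transitions[i].to;
            return true;
        }
    }
    sm->illegal_count++;
    return false;
}
```

Because the table is plain data, a reviewer can audit the legal transitions without reading control flow, and a test can enumerate every (state, event) pair.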
2.2 Versioned Configuration
Schema versioning and migration prevent upgrades from bricking settings.
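One common way to structure this is a chain of single-step migrations, each upgrading exactly one version. The sketch below assumes a hypothetical four-version settings layout (`fw_config_t` and its fields are invented for illustration) and works on a copy so that a failed migration leaves the stored config untouched.

```c
#include <stdbool.h>
#include <stdint.h>

#define CFG_LATEST_VERSION 4u

/* Hypothetical settings layout: a superset of every version's fields. */
typedef struct {
    uint16_t version;
    uint8_t  brightness;     /* v1-v3: 0..255, v4: 0..100 percent */
    uint8_t  debounce_ms;    /* added in v2 */
    uint16_t idle_timeout_s; /* added in v3 */
} fw_config_t;

/* Each step upgrades exactly one version and fills in safe defaults. */
static bool migrate_v1_to_v2(fw_config_t *c) { c->debounce_ms = 10; c->version = 2; return true; }
static bool migrate_v2_to_v3(fw_config_t *c) { c->idle_timeout_s = 300; c->version = 3; return true; }
static bool migrate_v3_to_v4(fw_config_t *c)
{
    c->brightness = (uint8_t)((c->brightness * 100u) / 255u); /* rescale to percent */
    c->version = 4;
    return true;
}

typedef bool (*migrate_fn)(fw_config_t *);

/* Entry i upgrades version i+1 to version i+2. */
static const migrate_fn k_migrations[] = {
    migrate_v1_to_v2, migrate_v2_to_v3, migrate_v3_to_v4,
};

/* Walks the chain until the config reaches the latest version.
 * Operates on a copy so a failed step leaves *cfg unchanged. */
static bool cfg_migrate(fw_config_t *cfg)
{
    if (cfg->version == 0 || cfg->version > CFG_LATEST_VERSION) {
        return false; /* unknown or newer-than-firmware version */
    }
    fw_config_t work = *cfg;
    while (work.version < CFG_LATEST_VERSION) {
        uint16_t before = work.version;
        if (!k_migrations[before - 1](&work) ||
            work.version != (uint16_t)(before + 1u)) {
            return false;
        }
    }
    *cfg = work;
    return true;
}
```

Chaining single steps means each new release adds one function rather than a migration from every older version.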
2.3 Crash Observability
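A watchdog reset that retains no context restores service but hides the root cause. Capture the reset reason and fault context before the reset so every recovery can be diagnosed afterwards.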
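A minimal sketch of retained crash context follows. It assumes the record lives in RAM that startup code does not clear (on a real target that placement is linker-script specific; here an ordinary static variable stands in for it). The `crash_slot_t` type, the magic value, and the XOR checksum are illustrative assumptions; a CRC would detect corruption more reliably.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define CRASH_MAGIC 0xC0DEFA17u /* arbitrary marker, assumption for this sketch */

typedef struct {
    uint32_t magic;
    uint32_t reset_reason;
    uint32_t fault_pc;
    uint32_t fault_lr;
    uint32_t checksum; /* XOR of the four fields above */
} crash_slot_t;

/* Stand-in for a no-init RAM region that survives a warm reset. */
static crash_slot_t g_crash;

static uint32_t crash_checksum(const crash_slot_t *r)
{
    return r->magic ^ r->reset_reason ^ r->fault_pc ^ r->fault_lr;
}

/* Called from the fault path, before the reset happens. */
static void crash_slot_save(uint32_t reason, uint32_t pc, uint32_t lr)
{
    g_crash.magic = CRASH_MAGIC;
    g_crash.reset_reason = reason;
    g_crash.fault_pc = pc;
    g_crash.fault_lr = lr;
    g_crash.checksum = crash_checksum(&g_crash);
}

/* Called early after boot. Copies out a valid record, then clears the
 * slot so the same crash is never reported twice. */
static bool crash_slot_take(crash_slot_t *out)
{
    bool valid = g_crash.magic == CRASH_MAGIC &&
                 g_crash.checksum == crash_checksum(&g_crash);
    if (valid) {
        *out = g_crash;
    }
    memset(&g_crash, 0, sizeof g_crash);
    return valid;
}
```

The magic plus checksum pair lets the boot path tell a genuine record apart from random power-on RAM contents.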
2.4 Boot and Recovery Policy
The boot stage must validate both the image and the config before jumping to the application, and must support a safe fallback mode when either check fails.
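The gate can be sketched as a pure decision function, which keeps it easy to test off-target. Everything named here is an assumption for illustration: the header layout, `IMG_MAGIC`, the slot size, and the three-way outcome (normal boot, boot with default settings, recovery). The caller is assumed to compute `computed_crc` over the image only after the length has been bounds-checked.

```c
#include <stdbool.h>
#include <stdint.h>

#define IMG_MAGIC     0x46574D34u     /* hypothetical marker value */
#define IMG_SLOT_SIZE (256u * 1024u)  /* hypothetical application slot size */

typedef struct {
    uint32_t magic;
    uint32_t image_len;    /* bytes covered by image_crc */
    uint32_t entry_offset; /* offset of the entry point inside the image */
    uint32_t image_crc;
} image_header_t;

typedef enum { BOOT_NORMAL, BOOT_SAFE_DEFAULTS, BOOT_RECOVERY } boot_decision_t;

/* Cheap structural checks that must pass before the image is even read. */
static bool image_header_sane(const image_header_t *h)
{
    return h->magic == IMG_MAGIC &&
           h->image_len != 0u &&
           h->image_len <= IMG_SLOT_SIZE &&
           h->entry_offset < h->image_len;
}

/* Decides what to boot. An invalid image always means recovery; a valid
 * image with a bad config boots with defaults instead of stored settings. */
static boot_decision_t boot_gate(const image_header_t *h,
                                 uint32_t computed_crc,
                                 bool config_ok)
{
    if (!image_header_sane(h) || h->image_crc != computed_crc) {
        return BOOT_RECOVERY;
    }
    return config_ok ? BOOT_NORMAL : BOOT_SAFE_DEFAULTS;
}
```

Keeping the decision free of hardware access means the same function can be exercised with corrupted headers in a host-side fault-injection test.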
2.5 OTA Feasibility Planning
Even without radio support, memory and boot design should preserve future staged-update paths.
3. Project Specification
3.1 What You Will Build
A platform layer that includes:
- state transition engine
- config versioning + migration
- structured event logging
- watchdog heartbeat strategy
- crash record retention
- boot validation gate
3.2 Functional Requirements
- Illegal state transitions are detected and rejected.
- Config migrations succeed across at least three version jumps.
- Watchdog recovers induced hangs without boot loops.
- Crash records persist across reset and can be decoded.
- Boot validation prevents a jump to any image whose metadata is invalid.
3.3 Non-Functional Requirements
- Performance: per-event logging overhead is bounded and small relative to the main loop budget.
- Reliability: repeated fault injection yields safe recovery.
- Maintainability: architecture docs map to implementation clearly.
3.4 Real World Outcome
$ fw_reliability_suite --fault-injection all --cycles 500
[STATE] illegal_transition_count=0
[CONFIG] migrations_pass=4/4 crc_failures=0
[WDT] induced_hangs=120 recovered=120 boot_loop=0
[CRASHLOG] retained=120 decoded=120
[BOOT] image_validation pass=500 fail=0
PASS: production gate cleared
4. Solution Architecture
4.1 High-Level Design
Boot stage --> image/config validation --> runtime state machine
                                            |--> watchdog heartbeat monitor
                                            |--> structured log ring
                                            |--> crash capture on fault
4.2 Key Components
| Component | Responsibility | Key Decision |
|---|---|---|
| State engine | legal transition control | explicit event/state tables |
| Config manager | schema versioning + migration | CRC + rollback-safe update |
| Event logger | runtime observability | compact event IDs + payloads |
| Fault manager | watchdog + crash capture | reset with retained evidence |
| Boot gate | startup integrity checks | fail-safe recovery mode |
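For the fault manager's heartbeat side, one possible policy is to refresh the hardware watchdog only when every monitored task has checked in during the current window. The sketch below is an assumption about how that could look: `TASK_COUNT`, `wdt_monitor_t`, and both functions are hypothetical, and the actual hardware refresh is left as a comment because it is MCU specific.

```c
#include <stdbool.h>
#include <stdint.h>

#define TASK_COUNT     3u                        /* hypothetical number of monitored tasks */
#define ALL_TASKS_MASK ((1u << TASK_COUNT) - 1u)

typedef struct {
    uint32_t checked_in; /* one bit per task that reported this window */
    uint32_t kicks;      /* number of watchdog refreshes issued */
} wdt_monitor_t;

/* Each task calls this from its own loop to prove it is still running. */
static void wdt_task_checkin(wdt_monitor_t *m, unsigned task_id)
{
    if (task_id < TASK_COUNT) {
        m->checked_in |= 1u << task_id;
    }
}

/* Called once per monitoring window. Refreshes the watchdog only when
 * every task checked in, then starts a new window either way. */
static bool wdt_service(wdt_monitor_t *m)
{
    bool all_alive = (m->checked_in & ALL_TASKS_MASK) == ALL_TASKS_MASK;
    if (all_alive) {
        m->kicks++;
        /* On target: refresh the hardware watchdog here. */
    }
    m->checked_in = 0;
    return all_alive;
}
```

With this arrangement a single hung task stops the refresh, so the watchdog fires even if the main loop itself is still spinning.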
4.3 Core Record Shapes (Pseudocode)
state_event = {from_state, event, to_state, guard_result, ts}
config_record = {version, length, crc, payload}
crash_record = {reset_reason, fault_pc, fault_lr, mode, counters}
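The record shapes above could map to C structs roughly as follows. Field widths, `CFG_PAYLOAD_MAX`, the four-entry `counters` array, and the choice to compute the CRC over the payload only are all assumptions of this sketch, not something the pseudocode fixes. The checksum shown is the standard reflected CRC-32 (polynomial 0xEDB88320), computed bitwise to avoid a lookup table.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CFG_PAYLOAD_MAX 64u /* hypothetical payload capacity */

typedef struct {
    uint8_t  from_state;
    uint8_t  event;
    uint8_t  to_state;
    uint8_t  guard_result;
    uint32_t ts;
} state_event_t;

typedef struct {
    uint16_t version;
    uint16_t length;                  /* valid bytes in payload */
    uint32_t crc;                     /* CRC-32 of payload[0..length) */
    uint8_t  payload[CFG_PAYLOAD_MAX];
} config_record_t;

typedef struct {
    uint32_t reset_reason;
    uint32_t fault_pc;
    uint32_t fault_lr;
    uint8_t  mode;
    uint32_t counters[4];
} crash_record_t;

/* Bitwise CRC-32 (reflected, init 0xFFFFFFFF, final XOR 0xFFFFFFFF). */
static uint32_t crc32_ieee(const uint8_t *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++) {
            /* Mask is all ones when the low bit is set, else zero. */
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
        }
    }
    return ~crc;
}

/* Rejects records with an out-of-range length before touching payload. */
static bool config_record_valid(const config_record_t *r)
{
    return r->length <= CFG_PAYLOAD_MAX &&
           r->crc == crc32_ieee(r->payload, r->length);
}
```

Checking `length` before computing the CRC matters: a corrupted length field must never drive a read past the end of the payload buffer.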
5. Implementation Guide
5.1 Phases
- Build state/event map and transition validator.
- Add config versioning + migration harness.
- Add structured logs and bounded buffering.
- Add watchdog and crash capture.
- Add boot validation and recovery mode tests.
5.2 The Core Question You’re Answering
“Can this firmware fail safely, recover predictably, and preserve enough evidence for root-cause analysis?”
5.3 Questions to Guide Design
- Which transitions are legal and why?
- What minimum crash data must survive reset?
- Which conditions trigger recovery mode instead of normal boot?
5.4 Thinking Exercise
Create a top-10 failure-mode matrix with detection signal, immediate action, and retained evidence for each mode.
5.5 Interview Questions They Will Ask
- How do explicit state machines improve embedded reliability?
- Why is config migration mandatory in long-lived firmware?
- How should watchdog and crash logging complement each other?
- What makes boot validation robust enough for field updates?
- How would you extend this design for OTA in future hardware revisions?
5.6 Hints in Layers
- Hint 1: Define architecture tables before implementation.
- Hint 2: Keep logging structured and rate-limited.
- Hint 3: Capture crash context before watchdog reset when possible.
- Hint 4: Exercise repeated fault-injection loops early.
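To make Hint 2 concrete, here is a minimal sketch of a structured log ring with a per-event-ID minimum interval. The sizes, the 100 ms interval, and the `logger_t` / `log_event` names are assumptions for illustration. Timestamps use unsigned subtraction so the comparison stays correct across tick counter wraparound.

```c
#include <stdbool.h>
#include <stdint.h>

#define LOG_RING_SIZE       8u   /* hypothetical ring capacity */
#define LOG_ID_COUNT        4u   /* hypothetical number of event IDs */
#define LOG_MIN_INTERVAL_MS 100u /* hypothetical per-ID rate limit */

typedef struct {
    uint32_t ts_ms;
    uint16_t id;
    uint16_t payload;
} log_entry_t;

typedef struct {
    log_entry_t ring[LOG_RING_SIZE];
    uint32_t    head;                  /* total entries written so far */
    uint32_t    last_ts[LOG_ID_COUNT]; /* last accepted timestamp per ID */
    bool        seen[LOG_ID_COUNT];
    uint32_t    dropped;               /* entries suppressed by rate limit */
} logger_t;

/* Stores a compact entry, overwriting the oldest one when full.
 * Returns false if the ID is unknown or the event was rate-limited. */
static bool log_event(logger_t *lg, uint32_t now_ms, uint16_t id, uint16_t payload)
{
    if (id >= LOG_ID_COUNT) {
        return false;
    }
    if (lg->seen[id] && (now_ms - lg->last_ts[id]) < LOG_MIN_INTERVAL_MS) {
        lg->dropped++;
        return false;
    }
    lg->seen[id] = true;
    lg->last_ts[id] = now_ms;
    lg->ring[lg->head % LOG_RING_SIZE] = (log_entry_t){ now_ms, id, payload };
    lg->head++;
    return true;
}
```

Counting dropped entries keeps the rate limit honest: a decoder can still see that an event storm happened even though most of it was not stored.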
5.7 Common Pitfalls and Debugging
| Problem | Why | Fix | Quick Test |
|---|---|---|---|
| Mode lockups | implicit state coupling | centralize transition table | randomized event-order test |
| Reset loops | watchdog recovery without escalation | add loop counter + safe mode | forced startup fault test |
| Upgrade breaks settings | missing migration coverage | test version corpus + migration chain | replay archived configs |
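The "loop counter + safe mode" fix for reset loops can be sketched as below. It assumes a counter held in reset-surviving storage (represented here by a plain struct) and a hypothetical limit of three consecutive watchdog resets; `boot_select_mode` and `boot_mark_healthy` are illustrative names.

```c
#include <stdbool.h>
#include <stdint.h>

#define BOOT_FAIL_LIMIT 3u /* hypothetical escalation threshold */

typedef enum { MODE_NORMAL, MODE_SAFE } boot_mode_t;

/* Assumed to live in storage that survives a watchdog reset. */
typedef struct {
    uint32_t consecutive_wdt_resets;
} retained_boot_state_t;

/* Called once at boot with the decoded reset cause. Escalates to safe
 * mode after too many watchdog resets in a row. */
static boot_mode_t boot_select_mode(retained_boot_state_t *rs, bool reset_was_watchdog)
{
    if (reset_was_watchdog) {
        if (rs->consecutive_wdt_resets < UINT32_MAX) {
            rs->consecutive_wdt_resets++;
        }
    } else {
        rs->consecutive_wdt_resets = 0;
    }
    return rs->consecutive_wdt_resets >= BOOT_FAIL_LIMIT ? MODE_SAFE : MODE_NORMAL;
}

/* Called after the application has run stably for a while, so that
 * isolated watchdog resets do not accumulate toward the limit. */
static void boot_mark_healthy(retained_boot_state_t *rs)
{
    rs->consecutive_wdt_resets = 0;
}
```

The forced startup fault test from the table then reduces to asserting that the third consecutive watchdog reset selects `MODE_SAFE`.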
5.8 Definition of Done
- Transition table and invariants are implemented and tested.
- Config migration suite passes all target versions.
- Watchdog/fault-injection run shows safe recovery and evidence retention.
- Boot validation rejects invalid image metadata deterministically.
6. References
- ARM Cortex-M4 Technical Reference Manual
- TinyUSB Documentation
- USB 2.0 Specification (USB-IF)
- “Design Patterns for Embedded Systems in C” by Bruce Powel Douglass
- “Making Embedded Systems, 2nd Ed” by Elecia White