Project 25: Production Firmware Platform (State Machines, Logging, Watchdog, Boot Policy)

Build a production-style firmware platform layer for NeoTrellis with explicit state machines, structured diagnostics, safe config migration, and recovery-first behavior.

Quick Reference

Attribute | Value
Difficulty | Level 5: Master
Time Estimate | 3-4 weeks
Main Programming Language | C/C++
Alternative Programming Languages | Rust embedded (architecture port)
Coolness Level | Level 5: The “WTF, That’s Possible?”
Business Potential | 4. The “Disruptor”
Prerequisites | state machines, memory reliability, USB/device basics
Key Topics | state architecture, versioned config, logging, crash handling, watchdog policy, boot safety

1. Learning Objectives

By completing this project, you will:

  1. Define explicit firmware state/event contracts.
  2. Implement versioned config schema with migration logic.
  3. Add structured, rate-limited event logging and crash records.
  4. Build watchdog policy with fault evidence retention.
  5. Validate boot image checks and safe fallback behavior.

2. All Theory Needed (Project-Scoped)

2.1 Explicit State Architecture

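An explicit state/event table replaces scattered boolean flags, which prevents hidden flag interactions and puts every legal and illegal transition in one place where failure modes can be reviewed.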
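
A table-driven validator is one way to make this concrete. The sketch below is illustrative only: the state names, event names, and the `fw_sm_dispatch` helper are assumptions, not part of the project specification.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical states and events, chosen only for illustration. */
typedef enum { ST_BOOT, ST_IDLE, ST_ACTIVE, ST_RECOVERY, ST_COUNT } fw_state_t;
typedef enum { EV_INIT_OK, EV_KEY_PRESS, EV_TIMEOUT, EV_FAULT, EV_COUNT } fw_event_t;

#define ST_INVALID ST_COUNT /* marks an illegal (state, event) pair */

/* One row per state, one column per event, so every cell can be reviewed. */
static const fw_state_t transition_table[ST_COUNT][EV_COUNT] = {
    /*                EV_INIT_OK  EV_KEY_PRESS EV_TIMEOUT  EV_FAULT     */
    [ST_BOOT]     = { ST_IDLE,    ST_INVALID,  ST_INVALID, ST_RECOVERY },
    [ST_IDLE]     = { ST_INVALID, ST_ACTIVE,   ST_IDLE,    ST_RECOVERY },
    [ST_ACTIVE]   = { ST_INVALID, ST_ACTIVE,   ST_IDLE,    ST_RECOVERY },
    [ST_RECOVERY] = { ST_BOOT,    ST_INVALID,  ST_INVALID, ST_RECOVERY },
};

typedef struct {
    fw_state_t state;
    uint32_t   illegal_count; /* rejected transitions, kept for diagnostics */
} fw_sm_t;

/* Apply an event. Returns true on a legal transition. On an illegal one the
 * state is left unchanged and the rejection is counted. */
static bool fw_sm_dispatch(fw_sm_t *sm, fw_event_t ev)
{
    if (sm->state >= ST_COUNT || ev >= EV_COUNT) {
        sm->illegal_count++;
        return false;
    }
    fw_state_t next = transition_table[sm->state][ev];
    if (next == ST_INVALID) {
        sm->illegal_count++;
        return false;
    }
    sm->state = next;
    return true;
}
```

Because the table is `const` data, a reviewer can audit the whole behavior without reading control flow, and a test can enumerate every cell.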

2.2 Versioned Configuration

Schema versioning and migration prevent a firmware upgrade from corrupting or silently discarding stored settings.
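
One common way to structure this is a chain of single-step migrations, each upgrading exactly one version. The following is a minimal sketch under simplifying assumptions: the setting names, default values, and `cfg_migrate_to_latest` are hypothetical, and all versions share one in-memory struct, whereas real stored layouts would usually differ per version.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CFG_VERSION_LATEST 4u

/* Hypothetical settings; the version that introduced each field is noted. */
typedef struct {
    uint16_t version;
    uint8_t  brightness;     /* v1 */
    uint8_t  debounce_ms;    /* added in v2 */
    uint16_t idle_timeout_s; /* added in v3 */
    uint8_t  log_level;      /* added in v4 */
} fw_config_t;

typedef bool (*cfg_migrate_fn)(fw_config_t *cfg);

/* Each step fills in defaults for the fields its target version introduced. */
static bool migrate_v1_to_v2(fw_config_t *c) { c->debounce_ms = 10;     c->version = 2; return true; }
static bool migrate_v2_to_v3(fw_config_t *c) { c->idle_timeout_s = 300; c->version = 3; return true; }
static bool migrate_v3_to_v4(fw_config_t *c) { c->log_level = 2;        c->version = 4; return true; }

/* steps[i] upgrades version i+1 to version i+2. */
static const cfg_migrate_fn steps[] = {
    migrate_v1_to_v2, migrate_v2_to_v3, migrate_v3_to_v4,
};

/* Walk the chain on a scratch copy so a failed step leaves *cfg untouched. */
static bool cfg_migrate_to_latest(fw_config_t *cfg)
{
    fw_config_t work = *cfg;
    if (work.version == 0 || work.version > CFG_VERSION_LATEST)
        return false; /* unknown or future version: refuse, do not guess */
    while (work.version < CFG_VERSION_LATEST) {
        uint16_t before = work.version;
        if (!steps[before - 1](&work) || work.version != before + 1)
            return false;
    }
    *cfg = work;
    return true;
}
```

Chaining single steps means a new version adds one function rather than one function per older version, and the scratch copy keeps the update rollback-safe.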

2.3 Crash Observability

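A watchdog reset that retains no context only masks the underlying fault: the device recovers, but the cause is lost. Crash evidence such as the reset reason and fault registers must survive the reset.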
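
A retained record needs a validity marker so that boot code can tell a real crash record from uninitialized RAM. The sketch below assumes the record lives in memory that startup code does not clear (on target this typically needs a toolchain-specific section attribute, not shown). The magic value, the field subset, and the helper names are hypothetical, and the additive checksum stands in for a stronger CRC.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define CRASH_MAGIC 0xC0DEFA17u /* hypothetical marker value */

typedef struct {
    uint32_t magic;
    uint32_t reset_reason;
    uint32_t fault_pc;
    uint32_t fault_lr;
    uint32_t mode;
    uint32_t uptime_ms; /* one example counter */
    uint32_t checksum;  /* sum of all preceding fields */
} crash_record_t;

/* Assumption: on target this object is placed in non-zeroed RAM. */
static crash_record_t g_crash_record;

static uint32_t crash_checksum(const crash_record_t *r)
{
    return r->magic + r->reset_reason + r->fault_pc + r->fault_lr
         + r->mode + r->uptime_ms;
}

/* Called from the fault path, before the reset happens. */
static void crash_record_store(uint32_t reason, uint32_t pc, uint32_t lr,
                               uint32_t mode, uint32_t uptime_ms)
{
    g_crash_record.magic = CRASH_MAGIC;
    g_crash_record.reset_reason = reason;
    g_crash_record.fault_pc = pc;
    g_crash_record.fault_lr = lr;
    g_crash_record.mode = mode;
    g_crash_record.uptime_ms = uptime_ms;
    g_crash_record.checksum = crash_checksum(&g_crash_record);
}

/* Called early in boot. Copies out a valid record, then clears the slot so
 * the same crash is not reported twice. Returns false if nothing valid. */
static bool crash_record_take(crash_record_t *out)
{
    bool valid = g_crash_record.magic == CRASH_MAGIC
              && g_crash_record.checksum == crash_checksum(&g_crash_record);
    if (valid)
        *out = g_crash_record;
    memset(&g_crash_record, 0, sizeof g_crash_record);
    return valid;
}
```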

2.4 Boot and Recovery Policy

The boot stage must validate the image and config before jumping to the application, and must support a safe fallback mode when validation fails.
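
A validation gate of this kind might look like the sketch below. The header layout, `IMAGE_MAGIC`, and `boot_gate` are assumptions made for illustration; the checksum is the standard reflected CRC-32 (polynomial 0xEDB88320), computed bit by bit for brevity rather than speed.

```c
#include <stddef.h>
#include <stdint.h>

#define IMAGE_MAGIC 0x464D5721u /* hypothetical marker value */

typedef struct {
    uint32_t magic;
    uint32_t length; /* payload size in bytes */
    uint32_t crc32;  /* CRC-32 of the payload */
} image_header_t;

typedef enum { BOOT_NORMAL, BOOT_RECOVERY } boot_decision_t;

/* Bitwise reflected CRC-32 (polynomial 0xEDB88320, init and final XOR all ones). */
static uint32_t crc32_ieee(const uint8_t *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

/* Check cheap metadata first, then the payload CRC. Any failure selects
 * recovery instead of jumping to the image. */
static boot_decision_t boot_gate(const image_header_t *hdr,
                                 const uint8_t *payload, size_t slot_size)
{
    if (hdr->magic != IMAGE_MAGIC)
        return BOOT_RECOVERY;
    if (hdr->length == 0 || hdr->length > slot_size)
        return BOOT_RECOVERY; /* never read past the image slot */
    if (crc32_ieee(payload, hdr->length) != hdr->crc32)
        return BOOT_RECOVERY;
    return BOOT_NORMAL;
}
```

Checking the length against the slot size before computing the CRC matters: a corrupted length field must not cause the gate itself to read out of bounds.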

2.5 OTA Feasibility Planning

Even without radio support, the memory layout and boot design should leave room for a future staged-update path.


3. Project Specification

3.1 What You Will Build

A platform layer that includes:

  • state transition engine
  • config versioning + migration
  • structured event logging
  • watchdog heartbeat strategy
  • crash record retention
  • boot validation gate

3.2 Functional Requirements

  1. Illegal state transitions are detected and rejected.
  2. Config migrations succeed across at least three version jumps.
  3. Watchdog recovers induced hangs without boot loops.
  4. Crash records persist across reset and can be decoded.
  5. Boot validation prevents a jump to any image whose metadata is invalid.

3.3 Non-Functional Requirements

  • Performance: logging overhead is bounded and kept low.
  • Reliability: repeated fault injection yields safe recovery.
  • Maintainability: architecture docs map clearly to the implementation.

3.4 Real World Outcome

$ fw_reliability_suite --fault-injection all --cycles 500
[STATE] illegal_transition_count=0
[CONFIG] migrations_pass=4/4 crc_failures=0
[WDT] induced_hangs=120 recovered=120 boot_loop=0
[CRASHLOG] retained=120 decoded=120
[BOOT] image_validation pass=500 fail=0
PASS: production gate cleared

4. Solution Architecture

4.1 High-Level Design

Boot stage --> image/config validation --> runtime state machine
                                         |--> watchdog heartbeat monitor
                                         |--> structured log ring
                                         |--> crash capture on fault
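
The watchdog heartbeat monitor in this diagram can be sketched as a supervisor that feeds the hardware watchdog only when every required task has checked in during the last window. The task names and `hw_wdt_feed` are hypothetical; the stub counts feeds in place of a real register write, and a preemptive system would need atomic access to the check-in mask.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical task bits; one bit per task that must prove liveness. */
enum { TASK_INPUT = 1u << 0, TASK_LED = 1u << 1, TASK_USB = 1u << 2 };
#define TASKS_REQUIRED (TASK_INPUT | TASK_LED | TASK_USB)

static uint32_t g_checkins;   /* bits set since the last supervise call */
static uint32_t g_feed_count; /* stands in for the hardware feed */

/* Stub for the hardware watchdog refresh (assumption, target-specific). */
static void hw_wdt_feed(void) { g_feed_count++; }

/* Each task calls this from its main loop to report progress. */
static void wdt_checkin(uint32_t task_bit) { g_checkins |= task_bit; }

/* Called once per supervision window. Feeds the hardware watchdog only if
 * all required tasks checked in, then starts a fresh window. */
static bool wdt_supervise(void)
{
    bool all_alive = (g_checkins & TASKS_REQUIRED) == TASKS_REQUIRED;
    g_checkins = 0;
    if (all_alive)
        hw_wdt_feed();
    return all_alive;
}
```

Feeding from a single supervisor, rather than from a timer interrupt, means a hang in any one task eventually starves the watchdog instead of being hidden by a still-running interrupt.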

4.2 Key Components

Component | Responsibility | Key Decision
State engine | legal transition control | explicit event/state tables
Config manager | schema versioning + migration | CRC + rollback-safe update
Event logger | runtime observability | compact event IDs + payloads
Fault manager | watchdog + crash capture | reset with retained evidence
Boot gate | startup integrity checks | fail-safe recovery mode

4.3 Core Record Shapes (Pseudocode)

state_event = {from_state, event, to_state, guard_result, ts}
config_record = {version, length, crc, payload}
crash_record = {reset_reason, fault_pc, fault_lr, mode, counters}
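
As one possible C rendering of the `state_event` shape, the sketch below stores fixed-size records in a bounded ring that overwrites the oldest entry when full. The field widths, `LOG_RING_CAPACITY`, and the function names are assumptions chosen for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Fixed-width rendering of the state_event shape (widths are assumptions). */
typedef struct {
    uint8_t  from_state;
    uint8_t  event;
    uint8_t  to_state;
    uint8_t  guard_result;
    uint32_t ts;
} state_event_t;

#define LOG_RING_CAPACITY 8u /* hypothetical; a power of two keeps wrap math simple */

typedef struct {
    state_event_t slots[LOG_RING_CAPACITY];
    uint32_t head;    /* total records written */
    uint32_t tail;    /* total records consumed or overwritten */
    uint32_t dropped; /* records overwritten before being read */
} log_ring_t;

/* Append a record. When the ring is full the oldest record is discarded,
 * so memory use stays bounded and the newest evidence is always kept. */
static void log_ring_push(log_ring_t *r, const state_event_t *e)
{
    if (r->head - r->tail == LOG_RING_CAPACITY) {
        r->tail++;
        r->dropped++;
    }
    r->slots[r->head % LOG_RING_CAPACITY] = *e;
    r->head++;
}

/* Remove the oldest record. Returns false when the ring is empty. */
static bool log_ring_pop(log_ring_t *r, state_event_t *out)
{
    if (r->head == r->tail)
        return false;
    *out = r->slots[r->tail % LOG_RING_CAPACITY];
    r->tail++;
    return true;
}
```

Keeping a `dropped` counter makes log loss itself observable, which helps when checking that logging overhead stays bounded under load.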

5. Implementation Guide

5.1 Phases

  1. Build state/event map and transition validator.
  2. Add config versioning + migration harness.
  3. Add structured logs and bounded buffering.
  4. Add watchdog and crash capture.
  5. Add boot validation and recovery mode tests.

5.2 The Core Question You’re Answering

“Can this firmware fail safely, recover predictably, and preserve enough evidence for root-cause analysis?”

5.3 Questions to Guide Design

  1. Which transitions are legal and why?
  2. What minimum crash data must survive reset?
  3. Which conditions trigger recovery mode instead of normal boot?

5.4 Thinking Exercise

Create a top-10 failure-mode matrix with detection signal, immediate action, and retained evidence for each mode.

5.5 Interview Questions They Will Ask

  1. How do explicit state machines improve embedded reliability?
  2. Why is config migration mandatory in long-lived firmware?
  3. How should watchdog and crash logging complement each other?
  4. What makes boot validation robust enough for field updates?
  5. How would you extend this design for OTA in future hardware revisions?

5.6 Hints in Layers

  • Hint 1: Define architecture tables before implementation.
  • Hint 2: Keep logging structured and rate-limited.
  • Hint 3: Capture crash context before watchdog reset when possible.
  • Hint 4: Exercise repeated fault-injection loops early.
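
For Hint 2, one way to rate-limit logging is a token bucket: a burst of events is allowed up to a cap, after which events are suppressed until tokens refill over time. This is a sketch under assumed names (`log_limiter_t`, `log_limiter_allow`) and a caller-supplied millisecond tick; it is not prescribed by the project.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint32_t capacity;       /* maximum burst size, in events */
    uint32_t refill_ms;      /* milliseconds needed to earn one token */
    uint32_t tokens;         /* events currently allowed */
    uint32_t last_refill_ms; /* tick at which tokens were last credited */
    uint32_t suppressed;     /* events rejected, for later reporting */
} log_limiter_t;

static void log_limiter_init(log_limiter_t *l, uint32_t capacity,
                             uint32_t refill_ms, uint32_t now_ms)
{
    l->capacity = capacity;
    l->refill_ms = refill_ms ? refill_ms : 1u; /* avoid division by zero */
    l->tokens = capacity;
    l->last_refill_ms = now_ms;
    l->suppressed = 0;
}

/* Returns true if an event may be logged at time now_ms. Unsigned
 * subtraction keeps the elapsed-time math correct across tick wraparound. */
static bool log_limiter_allow(log_limiter_t *l, uint32_t now_ms)
{
    uint32_t elapsed = now_ms - l->last_refill_ms;
    uint32_t earned = elapsed / l->refill_ms;
    if (earned > 0) {
        uint32_t room = l->capacity - l->tokens;
        l->tokens += (earned < room) ? earned : room;
        l->last_refill_ms += earned * l->refill_ms;
    }
    if (l->tokens == 0) {
        l->suppressed++;
        return false;
    }
    l->tokens--;
    return true;
}
```

Reporting the `suppressed` count periodically keeps the log honest about what it left out during a fault storm.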

5.7 Common Pitfalls and Debugging

Problem | Why | Fix | Quick Test
Mode lockups | implicit state coupling | centralize transition table | randomized event-order test
Reset loops | watchdog recovery without escalation | add loop counter + safe mode | forced startup fault test
Upgrade breaks settings | missing migration coverage | test version corpus + migration chain | replay archived configs
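
The "loop counter + safe mode" fix for reset loops can be sketched as follows. The threshold, type names, and function names are hypothetical, and the counter is assumed to live in storage that survives a reset (retained RAM or flash).

```c
#include <stdint.h>

#define BOOT_ATTEMPT_LIMIT 3u /* hypothetical escalation threshold */

typedef enum { START_NORMAL, START_SAFE_MODE } start_mode_t;

/* Assumption: this state is kept somewhere that survives a reset. */
typedef struct {
    uint32_t failed_boots;
} boot_state_t;

/* Call early in boot. Every attempt is counted as a failure up front; only a
 * later call to boot_mark_healthy() clears it. After too many consecutive
 * unproven boots, escalate to safe mode instead of retrying forever. */
static start_mode_t boot_select_mode(boot_state_t *s)
{
    if (s->failed_boots >= BOOT_ATTEMPT_LIMIT)
        return START_SAFE_MODE;
    s->failed_boots++;
    return START_NORMAL;
}

/* Call once the application has run long enough to be considered stable. */
static void boot_mark_healthy(boot_state_t *s)
{
    s->failed_boots = 0;
}
```

Counting the attempt before the application starts is the key design choice: a crash during early startup never reaches code that could record the failure afterwards.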

5.8 Definition of Done

  • Transition table and invariants are implemented and tested.
  • Config migration suite passes all target versions.
  • Watchdog/fault-injection run shows safe recovery and evidence retention.
  • Boot validation rejects invalid image metadata deterministically.

6. References