Project 2: The Flight State Machine (The Life Cycle)

Design and implement a deterministic spacecraft mode manager that enforces invariants, prevents oscillation, and coordinates subsystem behavior across SAFE, NOMINAL, and SCIENCE modes.

Quick Reference

Attribute	Value
Difficulty	Level 2: Intermediate
Time Estimate	1-2 weeks
Main Programming Language	C
Alternative Programming Languages	C++, Rust, Python
Coolness Level	Level 4: Systems Orchestrator
Business Potential	Level 3: Mission Reliability Tooling
Prerequisites	State machines, embedded scheduling, basic subsystem concepts
Key Topics	Mode logic, hysteresis, safety invariants, command gating

1. Learning Objectives

By completing this project, you will:

Build a formal mode/state machine for a CubeSat flight software stack.
Implement safety invariants that gate commands and subsystem actions.
Add hysteresis and cooldown timers to prevent mode thrashing.
Design deterministic transition logic with clear telemetry events.
Validate the state machine with fault injections and time-tagged transitions.

2. All Theory Needed (Per-Concept Breakdown)

Mode Management and Hysteresis in Safety-Critical Systems

Fundamentals

Spacecraft modes are not cosmetic states; they are operational contracts that define what the spacecraft is allowed to do. SAFE mode protects power and attitude, NOMINAL mode handles routine operations, and SCIENCE mode focuses on payload tasks. A mode system must be deterministic: given the same inputs, it must always pick the same next mode. Hysteresis is the practice of using different entry and exit thresholds (or cooldown timers) so the system does not oscillate. Without hysteresis, a battery voltage that hovers near a threshold could cause constant mode flipping, which is operationally disastrous.

Deep Dive into the concept

A mode manager sits at the top of the flight software hierarchy. Its job is to coordinate subsystem behavior and impose global invariants such as “do not transmit when battery is below X” or “do not enable payload in eclipse.” The most robust approach is to define a mode state machine where each mode has explicit entry and exit conditions, permissible actions, and required subsystem configurations. For example, SAFE mode might force ADCS into sun-pointing, disable the payload, reduce telemetry, and enable a beacon. NOMINAL mode might allow downlink and housekeeping, while SCIENCE mode might enable payload processing and more aggressive attitude maneuvers.

Hysteresis is implemented by defining separate thresholds for entering and exiting a mode. Suppose SAFE mode triggers at SoC < 30% and exits at SoC > 40%. This prevents flapping when SoC hovers around 30%. Another form of hysteresis is time-based: require that the triggering condition persist for N seconds before a transition, and then enforce a cooldown period before another transition. These timers must be part of the state machine and must be preserved across resets if you want stable behavior after reboot.
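The two forms of hysteresis described above can be combined in a few lines. The following is a minimal sketch, not a definitive implementation: the type and function names are hypothetical, SoC is assumed to be in percent, and the 30%/40% band simply mirrors the example in the text.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names; thresholds are supplied by the caller. */
typedef struct {
    float    enter_below_pct;  /* enter SAFE when SoC drops below this */
    float    exit_above_pct;   /* leave SAFE only when SoC rises above this */
    uint32_t persist_s;        /* condition must hold this long before acting */
} soc_hysteresis_cfg_t;

typedef struct {
    bool     in_safe;          /* latched output of the hysteresis band */
    bool     pending;          /* a flip condition is currently being timed */
    uint32_t pending_since_s;  /* when the flip condition first appeared */
} soc_hysteresis_state_t;

/* Feed one sample; returns true while the low-SoC SAFE condition is latched. */
bool soc_hysteresis_update(const soc_hysteresis_cfg_t *cfg,
                           soc_hysteresis_state_t *st,
                           float soc_pct, uint32_t now_s)
{
    assert(cfg != NULL && st != NULL);
    assert(cfg->enter_below_pct < cfg->exit_above_pct);

    /* Which threshold matters depends on which side of the band we are on. */
    bool wants_flip = st->in_safe ? (soc_pct > cfg->exit_above_pct)
                                  : (soc_pct < cfg->enter_below_pct);
    if (!wants_flip) {
        st->pending = false;   /* condition cleared: restart the timer */
        return st->in_safe;
    }
    if (!st->pending) {
        st->pending = true;
        st->pending_since_s = now_s;
    }
    if (now_s - st->pending_since_s >= cfg->persist_s) {
        st->in_safe = !st->in_safe;  /* condition persisted long enough */
        st->pending = false;
    }
    return st->in_safe;
}
```

Because the latch and the pending timer live in one small struct, that struct is the state that would need to survive a reset for behavior to stay stable after reboot.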

Mode managers must also handle conflicting triggers. For example, a thermal overtemperature might require SAFE mode even if battery is healthy. You should prioritize triggers by severity and define a precedence order. A common pattern is a decision ladder: evaluate fault triggers from most severe to least, choose the highest-priority mode, and ignore lower-priority triggers. This ensures that you never accidentally leave SAFE mode because a less severe condition cleared.
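A decision ladder of this kind might look like the sketch below. The struct fields and the `REASON_*` names are illustrative assumptions; only the ordering idea (most severe trigger first, lower-priority triggers ignored) comes from the text.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { MODE_SAFE, MODE_NOMINAL, MODE_SCIENCE } flight_mode_t;

/* Hypothetical reason codes, listed from most to least severe. */
typedef enum { REASON_NONE, REASON_HIGH_TEMP, REASON_LOW_SOC,
               REASON_SCIENCE_WINDOW } reason_code_t;

/* Hypothetical trigger snapshot, already debounced by the hysteresis logic. */
typedef struct {
    bool high_temp;           /* thermal fault flag */
    bool low_soc;             /* low-power condition is latched */
    bool science_window;      /* payload opportunity is open */
    bool soc_ok_for_science;  /* SoC is above the science threshold */
} triggers_t;

typedef struct {
    flight_mode_t mode;
    reason_code_t reason;
} decision_t;

/* Walk the ladder top-down; the first matching rung wins. */
decision_t select_target(const triggers_t *t)
{
    assert(t != NULL);
    if (t->high_temp)
        return (decision_t){ MODE_SAFE, REASON_HIGH_TEMP };
    if (t->low_soc)
        return (decision_t){ MODE_SAFE, REASON_LOW_SOC };
    if (t->science_window && t->soc_ok_for_science)
        return (decision_t){ MODE_SCIENCE, REASON_SCIENCE_WINDOW };
    return (decision_t){ MODE_NOMINAL, REASON_NONE };
}
```

Returning the reason together with the mode means a conflict such as high temperature plus low SoC always reports the more severe cause.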

Operationally, mode transitions must be observable. The state machine should emit telemetry events that include previous mode, new mode, and reason code (e.g., LOW_SOC, HIGH_TEMP). Ground operators rely on this to understand behavior. Your implementation should also allow explicit command overrides (if safe) and use an approval flag to prevent immediate re-entry into a risky mode. This is where command gating is essential: before executing any command, you check whether the current mode allows it.

When you test a mode manager, you should inject fault conditions in a deterministic script: drop SoC below threshold, simulate eclipse, then recover and verify mode exit. A correct design will never oscillate and will always respect invariants. This is the heart of operational safety.

How this fits into the project

This concept drives the mode transition requirements in Section 3.2, the architecture in Section 4, and the anti-thrashing debugging strategies in Section 7.

Definitions & key terms

Mode -> Operational state defining allowed behaviors and subsystem configurations.
Invariant -> A constraint that must always hold (e.g., payload off in eclipse).
Hysteresis -> Different entry/exit thresholds or time delays to prevent oscillation.
Cooldown -> Minimum time before another transition is allowed.

Mental model diagram (ASCII)

SAFE <----(low power / high temp)---- NOMINAL ----(science window)----> SCIENCE
  ^                    |                                      |
  |--------------------+-----------(fault trigger)------------+

How it works (step-by-step, with invariants and failure modes)

Evaluate fault triggers in priority order.
Decide candidate mode based on triggers and timers.
If candidate differs from current mode and cooldown expired, transition.
Apply mode-specific subsystem configuration and log reason code.

Invariants: SAFE mode disables payload; SCIENCE never runs in eclipse; comms limited below SoC threshold.

Failure modes: mode thrashing, conflicting triggers causing undefined states, missing telemetry events.
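The transition step in the list above can be sketched as a single function. This is one possible shape, with hypothetical names. It also makes one design assumption that the text does not state: entry into SAFE bypasses the cooldown, so that a fault is never delayed by a recent transition. Subsystem configuration and logging are left out to keep the focus on the cooldown check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { MODE_SAFE, MODE_NOMINAL, MODE_SCIENCE } flight_mode_t;

/* Hypothetical manager state. */
typedef struct {
    flight_mode_t mode;
    uint32_t      last_transition_s;
    uint32_t      cooldown_s;
    bool          has_transitioned;  /* false until the first transition */
} mode_manager_t;

/* Applies the candidate mode if allowed; returns true if a transition happened. */
bool mode_step(mode_manager_t *mm, flight_mode_t candidate, uint32_t now_s)
{
    assert(mm != NULL);
    if (candidate == mm->mode)
        return false;

    bool cooling = mm->has_transitioned &&
                   (now_s - mm->last_transition_s) < mm->cooldown_s;

    /* Assumption: SAFE entry is never blocked by the cooldown. */
    if (cooling && candidate != MODE_SAFE)
        return false;

    mm->mode = candidate;
    mm->last_transition_s = now_s;
    mm->has_transitioned = true;
    return true;
}
```

A caller would feed `mode_step` the output of the trigger evaluation once per telemetry sample and emit a mode event whenever it returns true.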

Minimal concrete example

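/* soc is in percent here; the trace schema in Section 3.5 stores it as a fraction. */
/* Priority ladder: the most severe trigger is checked first. */
/* Single thresholds only; exit thresholds and cooldown are applied on top of this. */
if (fault_high_temp()) target = MODE_SAFE;
else if (soc < 30) target = MODE_SAFE;
else if (science_window && soc > 60) target = MODE_SCIENCE;
else target = MODE_NOMINAL;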

Common misconceptions

“Modes are UI states” -> Modes control hard constraints and safety actions.
“A single threshold is enough” -> Without hysteresis, thrashing is common.

Check-your-understanding questions

Why should SAFE mode exit threshold be higher than entry threshold?
How do you prioritize conflicting fault triggers?
What telemetry should accompany mode transitions?

Check-your-understanding answers

To prevent oscillation when metrics hover near the boundary.
Use a severity ladder; choose the most severe trigger.
Include previous mode, new mode, and a reason code.

Real-world applications

CubeSat safe mode control logic.
Fault-protection frameworks in larger spacecraft.

Where you’ll apply it

See Section 3.2 for the transition requirements, Section 5.10 for the implementation phases, and Section 6 for testing.
Also used in: P11-fdir-watchdog-the-dead-mans-switch.md, P13-full-mission-simulator-the-digital-twin.md

References

NASA GSFC-HDBK-8007, Mission Success Handbook for CubeSat Missions (fault management)
Space Mission Engineering (Operations and modes)

Key insights

Mode management is the safety boundary between a healthy mission and a dead satellite.

Summary

A deterministic, hysteresis-aware state machine is required for stable operations.

Homework/Exercises to practice the concept

Define SAFE/NOMINAL/SCIENCE thresholds and draw a transition table.

Solutions to the homework/exercises

Use entry/exit thresholds separated by 10-20 percentage points of SoC (for example, enter SAFE below 30% and exit above 40%) plus a 60 s cooldown.

Command Gating and Subsystem Contracts

Fundamentals

Subsystems are independent components with their own constraints. Command gating is the practice of checking whether a command is allowed in the current mode and power/thermal state. Without gating, a single operator command could violate safety constraints (e.g., enabling payload in eclipse). Subsystem contracts are written promises: EPS promises to provide SoC and voltage; ADCS promises attitude; COMMS promises link status. The mode manager is the policy layer that ensures these contracts are respected.

Deep Dive into the concept

Flight software must mediate between operators and hardware. This means commands are not always safe to execute immediately. Command gating adds a validation layer that inspects the command type, the current mode, and relevant subsystem telemetry. For example, a “payload on” command might require: mode == SCIENCE, SoC > 60%, thermal < 50C, and not in eclipse. If any condition fails, the command is rejected or queued. This logic must be deterministic and explainable; every rejected command should return a reason code for operators.

Subsystem contracts are the interface definitions that make gating possible. They specify what telemetry is published, what commands are accepted, and what state transitions are safe. Without a contract, one subsystem might interpret “enable” as a latch, while another expects a timed pulse. The mode manager relies on contracts to know which actions are reversible and which are destructive. A contract also includes units and update rates. If the mode manager assumes temperatures in Celsius but receives Kelvin, gating fails. Therefore contracts must be explicit and validated.

Practically, you can implement gating as a set of policy rules: for each command, define allowed modes, required telemetry thresholds, and any additional constraints. Then evaluate rules before executing the command. This can be implemented as a table-driven policy (data-driven) or as code (procedural). Table-driven policies are easier to audit and modify; procedural policies allow richer logic but are harder to review. For safety-critical systems, auditability matters, so prefer a table-based approach with clearly defined fields.
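A table-driven policy along these lines could be sketched as follows. The `CMD_PAYLOAD_ON` row encodes the rule stated earlier (SCIENCE only, SoC > 60%, temperature < 50C, not in eclipse). The `CMD_DOWNLINK_START` row, the `GATE_*` result codes, and all type names are hypothetical and exist only to show a second entry.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { MODE_SAFE, MODE_NOMINAL, MODE_SCIENCE } flight_mode_t;
typedef enum { CMD_PAYLOAD_ON, CMD_DOWNLINK_START } command_id_t;
typedef enum { GATE_OK, GATE_UNKNOWN_CMD, GATE_BAD_MODE, GATE_LOW_SOC,
               GATE_TOO_HOT, GATE_IN_ECLIPSE } gate_result_t;

typedef struct { float soc_pct; float temp_c; bool eclipse; } gate_telemetry_t;

#define MODE_BIT(m) (1u << (m))

/* One auditable row per command. */
typedef struct {
    command_id_t cmd;
    unsigned     allowed_modes;      /* bitmask built with MODE_BIT() */
    float        min_soc_pct;        /* SoC must be strictly above this */
    float        max_temp_c;         /* temperature must be strictly below this */
    bool         forbid_in_eclipse;
} gate_rule_t;

static const gate_rule_t GATE_RULES[] = {
    { CMD_PAYLOAD_ON, MODE_BIT(MODE_SCIENCE), 60.0f, 50.0f, true },
    /* Illustrative values only; not taken from the project text. */
    { CMD_DOWNLINK_START, MODE_BIT(MODE_NOMINAL) | MODE_BIT(MODE_SCIENCE),
      40.0f, 60.0f, false },
};

/* Returns GATE_OK or the first failed check; unknown commands are denied. */
gate_result_t gate_check(command_id_t cmd, flight_mode_t mode,
                         const gate_telemetry_t *tm)
{
    assert(tm != NULL);
    for (size_t i = 0; i < sizeof GATE_RULES / sizeof GATE_RULES[0]; i++) {
        const gate_rule_t *r = &GATE_RULES[i];
        if (r->cmd != cmd) continue;
        if (!(r->allowed_modes & MODE_BIT(mode))) return GATE_BAD_MODE;
        if (tm->soc_pct <= r->min_soc_pct)        return GATE_LOW_SOC;
        if (tm->temp_c >= r->max_temp_c)          return GATE_TOO_HOT;
        if (r->forbid_in_eclipse && tm->eclipse)  return GATE_IN_ECLIPSE;
        return GATE_OK;
    }
    return GATE_UNKNOWN_CMD;
}
```

Checks run in a fixed order, so the same inputs always produce the same reason code, and a reviewer can audit the policy by reading the table alone.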

Another subtlety is time-tagged commands. If you accept a command for future execution, you must ensure the constraints are checked at execution time, not just acceptance time. For example, a command uploaded during a pass might execute in eclipse later. The gating logic must re-validate conditions at execution. If invalid, the command should be rejected and logged.
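Re-validation at execution time can be illustrated with a small dispatcher. This is a sketch under assumptions: the queue layout, the `gate_fn` callback type, and the counters are hypothetical, and a real dispatcher would execute the command and log a reason code where the comments indicate.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct { float soc_pct; bool eclipse; } live_telemetry_t;

/* Hypothetical time-tagged command slot. */
typedef struct {
    int      cmd_id;
    uint32_t execute_at_s;
    bool     done;          /* executed or rejected; never retried here */
} tagged_cmd_t;

/* Policy callback, evaluated against telemetry at execution time. */
typedef bool (*gate_fn)(int cmd_id, const live_telemetry_t *now);

typedef struct { unsigned executed; unsigned rejected; } dispatch_stats_t;

/* Runs every due command through the gate using current telemetry. */
void dispatch_due(tagged_cmd_t *queue, size_t n, uint32_t now_s,
                  const live_telemetry_t *now, gate_fn gate,
                  dispatch_stats_t *stats)
{
    assert(queue != NULL && now != NULL && gate != NULL && stats != NULL);
    for (size_t i = 0; i < n; i++) {
        tagged_cmd_t *c = &queue[i];
        if (c->done || c->execute_at_s > now_s)
            continue;
        if (gate(c->cmd_id, now))
            stats->executed++;   /* real code would execute the command here */
        else
            stats->rejected++;   /* real code would log a reason code here */
        c->done = true;
    }
}

/* Example gate using the payload rule from the text (SoC > 60%, no eclipse). */
bool payload_gate(int cmd_id, const live_telemetry_t *now)
{
    (void)cmd_id;
    return now->soc_pct > 60.0f && !now->eclipse;
}
```

The gate never sees the telemetry from upload time, so a command accepted during a sunlit pass is still rejected if it comes due in eclipse. The rejection counter doubles as the telemetry counter for repeated rejections.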

Finally, command gating is part of overall fault management. If a command is repeatedly rejected, that might indicate a deeper issue (e.g., power deficit or stuck mode). Therefore you should emit telemetry counters and logs for rejections.

How this fits into the project

This concept drives Section 3.2 (command validation) and Section 5.5 (design questions), and feeds into the P12 ground console design.

Definitions & key terms

Command gating -> Validating a command against mode and telemetry constraints.
Subsystem contract -> Documented interface: commands, telemetry, units, and constraints.
Reason code -> Structured explanation for rejection or acceptance.

Mental model diagram (ASCII)

Command -> Policy Check -> [Allowed?] -> Execute -> Telemetry Event
                       \-> Reject + Reason

How it works (step-by-step, with invariants and failure modes)

Receive command (immediate or time-tagged).
Check allowed modes and thresholds.
If valid, execute and log success.
If invalid, reject and log reason code.

Invariants: no unsafe command executes; every rejection has a reason; telemetry uses correct units.

Failure modes: missing unit conversions, stale telemetry, non-deterministic policy outcomes.

Minimal concrete example

/* PAYLOAD_ON rule from the text; soc in percent, temp_c in degrees Celsius. */
if (cmd == CMD_PAYLOAD_ON && mode == MODE_SCIENCE && soc > 60 && temp_c < 50 && !eclipse)
  accept();
else
  reject(REASON_UNSAFE_MODE);

Common misconceptions

“Operators won’t send unsafe commands” -> Mistakes happen; software must guard.
“Check once at upload time” -> Conditions change; re-check at execution.

Check-your-understanding questions

Why should command gating be data-driven when possible?
What telemetry fields are critical for payload enablement?
How do you handle stale telemetry in gating?

Check-your-understanding answers

It is auditable and easier to review for safety.
SoC, eclipse status, thermal limits, current mode.
Reject or defer commands when telemetry age exceeds limit.
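The stale-telemetry answer can be made concrete with an age check in front of the threshold check. A minimal sketch follows; the `stamped_sample_t` type, the `GATE_DEFER_STALE` outcome, and the choice to treat future-dated samples as invalid are all assumptions for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical telemetry value with its sampling time. */
typedef struct {
    float    value;
    uint32_t sampled_at_s;
} stamped_sample_t;

typedef enum { GATE_ACCEPT, GATE_DEFER_STALE, GATE_REJECT } gate_outcome_t;

/* True if the sample is no older than max_age_s. */
bool sample_is_fresh(stamped_sample_t s, uint32_t now_s, uint32_t max_age_s)
{
    /* Assumption: a sample stamped in the future is treated as invalid. */
    if (s.sampled_at_s > now_s)
        return false;
    return (now_s - s.sampled_at_s) <= max_age_s;
}

/* Checks freshness first, then the SoC threshold. */
gate_outcome_t gate_on_soc(stamped_sample_t soc_pct, float min_soc_pct,
                           uint32_t now_s, uint32_t max_age_s)
{
    assert(max_age_s > 0);
    if (!sample_is_fresh(soc_pct, now_s, max_age_s))
        return GATE_DEFER_STALE;
    return soc_pct.value > min_soc_pct ? GATE_ACCEPT : GATE_REJECT;
}
```

Keeping "stale" distinct from "rejected" lets operators tell a telemetry gap apart from a genuine power deficit.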

Real-world applications

Ground command approval systems.
Onboard autonomy scripts that must obey safety policies.

Where you’ll apply it

See Section 3.2 (functional requirements), Section 5.5 (design questions).
Also used in: P12-ground-station-command-console-the-hmi.md

References

Spacecraft Systems Engineering (command and data handling)
NASA Mission Success Handbook (operational safety)

Key insights

Commands are not actions; they are requests filtered through safety policy.

Summary

Command gating turns a state machine into a safety system.

Homework/Exercises to practice the concept

Build a small table of commands and allowed modes.

Solutions to the homework/exercises

Example: PAYLOAD_ON allowed only in SCIENCE with SoC > 60%.

3. Project Specification

3.1 What You Will Build

A deterministic mode manager that transitions between SAFE, NOMINAL, and SCIENCE based on telemetry thresholds and fault triggers, with explicit command gating rules and reason-code telemetry.

3.2 Functional Requirements

Mode state machine: define states, transitions, and entry/exit actions.
Hysteresis & cooldown: implement thresholds and minimum dwell times.
Command gating: validate commands against mode and telemetry.
Telemetry events: emit mode changes and rejection reasons.

3.3 Non-Functional Requirements

Determinism: same telemetry trace yields identical mode transitions.
Reliability: no oscillation under noisy telemetry.
Traceability: every transition logs a reason code.

3.4 Example Usage / Output

$ ./mode_sim --trace traces/low_soc.json
[t=120s] MODE=SAFE reason=LOW_SOC
[t=480s] MODE=NOMINAL reason=SOC_RECOVERED

3.5 Data Formats / Schemas / Protocols

Input telemetry trace schema:

{"t":0,"soc":0.55,"temp":32.1,"eclipse":false,"fault":null}

Mode event schema:

{"t":120,"prev":"NOMINAL","next":"SAFE","reason":"LOW_SOC"}
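One way to produce a line in the mode event schema is a small formatter built on `snprintf`. This is a sketch: `mode_name` and `format_mode_event` are hypothetical helpers, and the reason string is assumed to be a fixed identifier such as `LOW_SOC` that needs no JSON escaping.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

typedef enum { MODE_SAFE, MODE_NOMINAL, MODE_SCIENCE } flight_mode_t;

/* Maps a mode to the string used in the event schema. */
const char *mode_name(flight_mode_t m)
{
    switch (m) {
    case MODE_SAFE:    return "SAFE";
    case MODE_NOMINAL: return "NOMINAL";
    case MODE_SCIENCE: return "SCIENCE";
    }
    return "UNKNOWN";
}

/* Writes one event as a JSON object.
 * Returns the number of characters written, or -1 if buf is too small.
 * Assumption: reason contains no characters that need JSON escaping. */
int format_mode_event(char *buf, size_t len, uint32_t t_s,
                      flight_mode_t prev, flight_mode_t next,
                      const char *reason)
{
    assert(buf != NULL && reason != NULL);
    int n = snprintf(buf, len,
                     "{\"t\":%lu,\"prev\":\"%s\",\"next\":\"%s\",\"reason\":\"%s\"}",
                     (unsigned long)t_s, mode_name(prev), mode_name(next),
                     reason);
    return (n < 0 || (size_t)n >= len) ? -1 : n;
}
```

Called with `t_s = 120`, `MODE_NOMINAL`, `MODE_SAFE`, and `"LOW_SOC"`, this yields the same object as the schema example above.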

3.6 Edge Cases

SoC oscillating around threshold.
Conflicting triggers (high temp + low SoC).
Missing telemetry samples (gaps).

3.7 Real World Outcome

A mission-quality mode manager that can be exercised with traces and shows stable behavior under faults.

3.7.1 How to Run (Copy/Paste)

./mode_sim --trace traces/low_soc.json --cooldown 60

3.7.2 Golden Path Demo (Deterministic)

Use traces/nominal.json with fixed timestamps.
Expect a single transition into NOMINAL at t=0 and no further transitions.

3.7.3 Failure Demo (Deterministic)

./mode_sim --trace traces/thrashing.json

Expected behavior: with hysteresis enabled, only one SAFE entry and one exit despite noisy SoC.

3.7.4 If CLI: Exact Terminal Transcript

$ ./mode_sim --trace traces/low_soc.json --cooldown 60
[t=120s] MODE=SAFE reason=LOW_SOC
[t=480s] MODE=NOMINAL reason=SOC_RECOVERED
ExitCode=0

4. Solution Architecture

4.1 High-Level Design

Telemetry Trace -> Evaluator -> Mode Decision -> Actions -> Event Log
                         ^           |
                         |-----------| (hysteresis + cooldown)

4.2 Key Components

Component	Responsibility	Key Decisions
Evaluator	Reads telemetry and fault flags	Stale data handling
Mode FSM	Determines next mode	Priority order
Gating Engine	Approves commands	Rule format
Logger	Emits mode events	Reason codes

4.3 Data Structures (No Full Code)

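/* Named flight_mode_t because POSIX already defines mode_t in <sys/types.h>. */
typedef enum { MODE_SAFE, MODE_NOMINAL, MODE_SCIENCE } flight_mode_t;

typedef struct {
  float soc;     /* state of charge as a fraction, 0.0-1.0, as in the trace schema */
  float temp_c;  /* temperature in degrees Celsius */
  int eclipse;   /* nonzero while in eclipse */
} telemetry_t;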

4.4 Algorithm Overview

Key Algorithm: Mode decision loop

Evaluate trigger conditions (power, thermal, faults).
Select target mode by severity.
Apply hysteresis and cooldown.
Transition and log if mode changes.

Complexity Analysis:

Time: O(1) per telemetry sample.
Space: O(1) state.

5. Implementation Guide

5.1 Development Environment Setup

cc -O2 -Wall -Wextra -o mode_sim src/*.c

5.2 Project Structure

project-root/
+-- src/
|   +-- main.c
|   +-- fsm.c
|   +-- gating.c
|   +-- trace.c
+-- traces/
|   +-- nominal.json
|   +-- low_soc.json
|   +-- thrashing.json
+-- README.md

5.3 The Core Question You’re Answering

“How does a spacecraft decide what it is allowed to do right now?”

5.4 Concepts You Must Understand First

Deterministic state machines.
Hysteresis and cooldown strategies.
Safety invariants and command gating.

5.5 Questions to Guide Your Design

What is the minimum SoC for SAFE entry and exit?
Which faults override all other conditions?
What is the operator-facing reason code set?

5.6 Thinking Exercise

Draw the state machine and annotate each transition with a threshold and a timer.

5.7 The Interview Questions They’ll Ask

“How do you prevent mode thrashing?”
“What must remain enabled in safe mode?”
“How do you prioritize fault triggers?”

5.8 Hints in Layers

Hint 1: Start with a simple if/else ladder for mode selection.

Hint 2: Add a cooldown timer to block rapid oscillations.

Hint 3: Encode thresholds in a config file to simplify testing.

Hint 4: Log transitions with structured reason codes.

5.9 Books That Will Help

Topic	Book	Chapter
Fault management	NASA GSFC-HDBK-8007	Fault protection
Space operations	Space Mission Engineering	Modes and ops
FSM design	“Practical UML Statecharts”	State machine patterns

5.10 Implementation Phases

Phase 1: FSM Skeleton (2-3 days)

Goals: implement modes and transitions.

Tasks:

Define enums and transition logic.
Add simple trace playback.

Checkpoint: transitions occur as expected on simple traces.

Phase 2: Hysteresis + Cooldown (3-4 days)

Goals: stabilize transitions.

Tasks:

Implement entry/exit thresholds.
Add cooldown timers.

Checkpoint: thrashing trace produces stable behavior.

Phase 3: Command Gating (3-4 days)

Goals: enforce mode-based command rules.

Tasks:

Add a small policy table.
Log rejection reasons.

Checkpoint: unsafe commands are rejected with reason codes.

5.11 Key Implementation Decisions

Decision	Options	Recommendation	Rationale
Transition logic	table-driven / code	Table-driven	Easier to audit
Hysteresis	threshold gap / time delay	Both	Combats noise and delays
Command gating	rules engine / hard-coded	Rules engine	Auditability

6. Testing Strategy

6.1 Test Categories

Category	Purpose	Examples
Unit Tests	Trigger evaluation	low_soc, high_temp
Integration Tests	Full trace playback	nominal.json
Edge Case Tests	Thrashing	jittery SoC trace

6.2 Critical Test Cases

LOW_SOC entry: SoC drops below threshold -> SAFE.
Recovery: SoC rises above exit threshold -> NOMINAL.
Conflict: high temp + low SoC -> SAFE with high-temp reason.

6.3 Test Data

{"t":100,"soc":0.29,"temp":30,"eclipse":true,"fault":null}

7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

Pitfall	Symptom	Solution
No hysteresis	Rapid mode flips	Add entry/exit thresholds
Missing reason codes	Operators confused	Log reason codes
Stale telemetry	Wrong transitions	Reject stale samples

7.2 Debugging Strategies

Replay traces: deterministic playback reveals transitions.
Event logs: compare logs against expected timeline.

7.3 Performance Traps

None; logic is O(1). Focus on correctness.

8. Extensions & Challenges

8.1 Beginner Extensions

Add a config file for thresholds.

8.2 Intermediate Extensions

Add a fourth mode (COMMS) with its own gating rules.

8.3 Advanced Extensions

Persist state and timers across simulated reboot.
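As a starting point for this extension, the sketch below saves the mode and last transition time into a record guarded by a magic number and a checksum. Everything here is an assumption for illustration: the record layout, the XOR checksum (a placeholder for a real CRC), and the policy of falling back to SAFE when the record fails validation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { MODE_SAFE, MODE_NOMINAL, MODE_SCIENCE } flight_mode_t;

#define STATE_MAGIC 0x4D4F4445u  /* ASCII "MODE" */

/* Hypothetical record kept in storage that survives a reboot. */
typedef struct {
    uint32_t magic;
    uint32_t mode;
    uint32_t last_transition_s;
    uint32_t checksum;
} persisted_state_t;

/* Placeholder integrity check; a real design would use a CRC. */
static uint32_t state_checksum(const persisted_state_t *p)
{
    return p->magic ^ p->mode ^ p->last_transition_s ^ 0xA5A5A5A5u;
}

void state_save(persisted_state_t *slot, flight_mode_t mode,
                uint32_t last_transition_s)
{
    assert(slot != NULL);
    slot->magic = STATE_MAGIC;
    slot->mode = (uint32_t)mode;
    slot->last_transition_s = last_transition_s;
    slot->checksum = state_checksum(slot);
}

/* Returns true if the record is valid.
 * Assumption: an invalid record restores SAFE with a zeroed timer. */
bool state_restore(const persisted_state_t *slot, flight_mode_t *mode,
                   uint32_t *last_transition_s)
{
    assert(slot != NULL && mode != NULL && last_transition_s != NULL);
    bool valid = slot->magic == STATE_MAGIC &&
                 slot->mode <= (uint32_t)MODE_SCIENCE &&
                 slot->checksum == state_checksum(slot);
    *mode = valid ? (flight_mode_t)slot->mode : MODE_SAFE;
    *last_transition_s = valid ? slot->last_transition_s : 0;
    return valid;
}
```

In the simulator, a reboot can be modeled by discarding the in-memory manager and rebuilding it from the restored record before replaying the rest of the trace.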

9. Real-World Connections

9.1 Industry Applications

Flight software mode managers in CubeSat missions.
Autonomous mode logic in deep-space probes.

9.2 Related Open Source Projects

NASA cFS: mode management and command gating.
OpenSatKit: operational command frameworks.

9.3 Interview Relevance

State machine design and safety gating are common systems interview topics.

10. Resources

10.1 Essential Reading

NASA GSFC-HDBK-8007 - fault management strategies.
Space Mission Engineering - operations and mode design.

10.2 Video Resources

Mission operations talks (CubeSat conferences).

10.3 Tools & Documentation

Graphviz: render state machine diagrams.
jq: inspect telemetry traces.

10.4 Related Projects In This Series

P11-fdir-watchdog-the-dead-mans-switch.md - fault triggers.
P13-full-mission-simulator-the-digital-twin.md - integration.

11. Self-Assessment Checklist

11.1 Understanding

I can explain why hysteresis is critical for mode stability.
I can describe the safety invariants enforced by SAFE mode.
I can reason about command gating rules.

11.2 Implementation

Deterministic transitions on all traces.
Mode logs include reasons and timestamps.
Gating rejects unsafe commands.

11.3 Growth

I can explain this FSM to a mission operator.
I can propose an improvement to reduce false transitions.

12. Submission / Completion Criteria

Minimum Viable Completion:

Implement SAFE and NOMINAL with threshold logic.
Log transitions with reason codes.
Pass the golden trace test.

Full Completion:

Include SCIENCE mode and command gating.
Add cooldown/hysteresis to prevent thrashing.

Excellence (Going Above & Beyond):

Persist mode state across reboot and implement manual override safeguards.