Project 2: The Flight State Machine (The Life Cycle)

Design and implement a deterministic spacecraft mode manager that enforces invariants, prevents oscillation, and coordinates subsystem behavior across SAFE, NOMINAL, and SCIENCE modes.

Quick Reference

| Attribute | Value |
|---|---|
| Difficulty | Level 2: Intermediate |
| Time Estimate | 1-2 weeks |
| Main Programming Language | C |
| Alternative Programming Languages | C++, Rust, Python |
| Coolness Level | Level 4: Systems Orchestrator |
| Business Potential | Level 3: Mission Reliability Tooling |
| Prerequisites | State machines, embedded scheduling, basic subsystem concepts |
| Key Topics | Mode logic, hysteresis, safety invariants, command gating |

1. Learning Objectives

By completing this project, you will:

  1. Build a formal mode/state machine for a CubeSat flight software stack.
  2. Implement safety invariants that gate commands and subsystem actions.
  3. Add hysteresis and cooldown timers to prevent mode thrashing.
  4. Design deterministic transition logic with clear telemetry events.
  5. Validate the state machine with fault injections and time-tagged transitions.

2. All Theory Needed (Per-Concept Breakdown)

Mode Management and Hysteresis in Safety-Critical Systems

Fundamentals

Spacecraft modes are not cosmetic states; they are operational contracts that define what the spacecraft is allowed to do. SAFE mode protects power and attitude, NOMINAL mode handles routine operations, and SCIENCE mode focuses on payload tasks. A mode system must be deterministic: given the same inputs, it must always pick the same next mode. Hysteresis is the practice of using different entry and exit thresholds (or cooldown timers) so the system does not oscillate. Without hysteresis, a battery voltage that hovers near a threshold could cause constant mode flipping, which is operationally disastrous.

Deep Dive into the concept

A mode manager sits at the top of the flight software hierarchy. Its job is to coordinate subsystem behavior and impose global invariants such as “do not transmit when battery is below X” or “do not enable payload in eclipse.” The most robust approach is to define a mode state machine where each mode has explicit entry and exit conditions, permissible actions, and required subsystem configurations. For example, SAFE mode might force ADCS into sun-pointing, disable the payload, reduce telemetry, and enable a beacon. NOMINAL mode might allow downlink and housekeeping, while SCIENCE mode might enable payload processing and more aggressive attitude maneuvers.

Hysteresis is implemented by defining separate thresholds for entering and exiting a mode. Suppose SAFE mode triggers at SoC < 30% and exits at SoC > 40%. This prevents flapping when SoC hovers around 30%. Another form of hysteresis is time-based: require that the triggering condition persist for N seconds before a transition, and then enforce a cooldown period before another transition. These timers must be part of the state machine and must be preserved across resets if you want stable behavior after reboot.
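Both forms of hysteresis described above can be sketched in a few lines of C. This is a minimal illustration, not a reference implementation: the 30%/40% thresholds come from the example above, while the function names, the `persist_t` type, and the use of whole-second timestamps are assumptions made for the sketch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Thresholds taken from the example above; tune them per mission. */
#define SAFE_ENTER_SOC_PCT 30.0f
#define SAFE_EXIT_SOC_PCT  40.0f

/* Threshold hysteresis: latch below 30%, release only above 40%,
 * and keep the previous answer anywhere in between. */
bool low_soc_latched(bool latched, float soc_pct)
{
    if (soc_pct < SAFE_ENTER_SOC_PCT) return true;
    if (soc_pct > SAFE_EXIT_SOC_PCT)  return false;
    return latched;  /* dead band: hold the previous decision */
}

/* Time-based hysteresis: tracks how long a condition has held.
 * (Hypothetical helper type for this sketch.) */
typedef struct {
    bool     active;   /* condition was true on the previous sample */
    uint32_t since_s;  /* timestamp at which it first became true */
} persist_t;

/* True once `cond` has been continuously true for at least hold_s seconds. */
bool persisted(persist_t *p, bool cond, uint32_t now_s, uint32_t hold_s)
{
    if (!cond) {
        p->active = false;  /* any false sample restarts the timer */
        return false;
    }
    if (!p->active) {
        p->active = true;
        p->since_s = now_s;
    }
    return (now_s - p->since_s) >= hold_s;
}
```

Feeding `low_soc_latched` its own previous result on every sample is what creates the dead band: a SoC of 35% keeps whatever decision was made last. To survive a reboot, the latched flag and the `persist_t` contents would have to be stored in non-volatile memory.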

Mode managers must also handle conflicting triggers. For example, a thermal overtemperature might require SAFE mode even if battery is healthy. You should prioritize triggers by severity and define a precedence order. A common pattern is a decision ladder: evaluate fault triggers from most severe to least, choose the highest-priority mode, and ignore lower-priority triggers. This ensures that you never accidentally leave SAFE mode because a less severe condition cleared.
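A decision ladder of this kind might look like the sketch below. The trigger set, the reason codes, and every type name are hypothetical and chosen only for illustration; the point is that the first matching rung wins, so a cleared low-SoC trigger cannot release SAFE mode while a higher-priority thermal trigger is still active.

```c
#include <stdbool.h>

typedef enum { MODE_SAFE, MODE_NOMINAL, MODE_SCIENCE } flight_mode_t;

/* Hypothetical reason codes, listed from most to least severe. */
typedef enum {
    REASON_HW_FAULT,
    REASON_HIGH_TEMP,
    REASON_LOW_SOC,
    REASON_SCIENCE_WINDOW,
    REASON_NO_TRIGGER
} reason_t;

typedef struct {
    bool hw_fault;
    bool high_temp;
    bool low_soc;         /* assumed to be hysteresis-filtered already */
    bool science_window;  /* window open and power margin available */
} triggers_t;

typedef struct {
    flight_mode_t mode;
    reason_t      reason;
} decision_t;

/* Walk the ladder top-down; the first rung that matches decides the mode. */
decision_t decide(const triggers_t *t)
{
    decision_t d = { MODE_NOMINAL, REASON_NO_TRIGGER };

    if (t->hw_fault) {
        d.mode = MODE_SAFE;    d.reason = REASON_HW_FAULT;
    } else if (t->high_temp) {
        d.mode = MODE_SAFE;    d.reason = REASON_HIGH_TEMP;
    } else if (t->low_soc) {
        d.mode = MODE_SAFE;    d.reason = REASON_LOW_SOC;
    } else if (t->science_window) {
        d.mode = MODE_SCIENCE; d.reason = REASON_SCIENCE_WINDOW;
    }
    return d;
}
```

Returning the reason alongside the mode keeps the reported cause tied to the highest-priority trigger, which is what an operator needs when several faults are active at once.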

Operationally, mode transitions must be observable. The state machine should emit telemetry events that include previous mode, new mode, and reason code (e.g., LOW_SOC, HIGH_TEMP). Ground operators rely on this to understand behavior. Your implementation should also allow explicit command overrides (if safe) and use an approval flag to prevent immediate re-entry into a risky mode. This is where command gating is essential: before executing any command, you check whether the current mode allows it.

When you test a mode manager, you should inject fault conditions in a deterministic script: drop SoC below threshold, simulate eclipse, then recover and verify mode exit. A correct design will never oscillate and will always respect invariants. This is the heart of operational safety.

How this fits into the project

This concept drives the mode transition requirements in Section 3.2, the architecture in Section 4, and the anti-thrashing debugging strategies in Section 7.

Definitions & key terms

  • Mode -> Operational state defining allowed behaviors and subsystem configurations.
  • Invariant -> A constraint that must always hold (e.g., payload off in eclipse).
  • Hysteresis -> Different entry/exit thresholds or time delays to prevent oscillation.
  • Cooldown -> Minimum time before another transition is allowed.

Mental model diagram (ASCII)

SAFE <----(low power / high temp)---- NOMINAL ----(science window)----> SCIENCE
  ^                    |                                      |
  |--------------------+-----------(fault trigger)------------+

How it works (step-by-step, with invariants and failure modes)

  1. Evaluate fault triggers in priority order.
  2. Decide candidate mode based on triggers and timers.
  3. If candidate differs from current mode and cooldown expired, transition.
  4. Apply mode-specific subsystem configuration and log reason code.

Invariants: SAFE mode disables payload; SCIENCE never runs in eclipse; comms limited below SoC threshold.

Failure modes: mode thrashing, conflicting triggers causing undefined states, missing telemetry events.
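The transition step with a cooldown can be sketched as follows. The `mode_mgr_t` structure and `mode_step` function are hypothetical names. The sketch also makes one design assumption that goes beyond the steps above: a transition into SAFE bypasses the cooldown, on the reasoning that a protective transition should never be delayed by an anti-thrashing timer.

```c
#include <stdbool.h>
#include <stdint.h>

typedef enum { MODE_SAFE, MODE_NOMINAL, MODE_SCIENCE } flight_mode_t;

/* Hypothetical manager state for this sketch. */
typedef struct {
    flight_mode_t mode;           /* current mode */
    uint32_t      last_change_s;  /* time of the most recent transition */
    uint32_t      cooldown_s;     /* minimum dwell time between transitions */
} mode_mgr_t;

/* Applies the candidate mode if allowed. Returns true when a transition
 * happened, so the caller knows to reconfigure subsystems and log an event. */
bool mode_step(mode_mgr_t *m, flight_mode_t target, uint32_t now_s)
{
    if (target == m->mode) return false;

    bool cooled = (now_s - m->last_change_s) >= m->cooldown_s;

    /* Assumption: entering SAFE is never blocked by the cooldown. */
    if (target != MODE_SAFE && !cooled) return false;

    m->mode = target;
    m->last_change_s = now_s;
    return true;
}
```

If the mission instead wants the cooldown to apply to every transition, as step 3 states literally, drop the `target != MODE_SAFE` condition.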

Minimal concrete example

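/* soc is in percent here; the trace schema in Section 3.5 stores a 0-1 fraction, so scale on input. */
if (fault_high_temp()) target = MODE_SAFE;                   /* most severe trigger first */
else if (soc < 30) target = MODE_SAFE;                       /* SAFE entry threshold */
else if (science_window && soc > 60) target = MODE_SCIENCE;
else target = MODE_NOMINAL;
/* This ladder only picks a candidate; exit thresholds and cooldown still decide whether to transition. */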

Common misconceptions

  • “Modes are UI states” -> Modes control hard constraints and safety actions.
  • “A single threshold is enough” -> Without hysteresis, thrashing is common.

Check-your-understanding questions

  1. Why should SAFE mode exit threshold be higher than entry threshold?
  2. How do you prioritize conflicting fault triggers?
  3. What telemetry should accompany mode transitions?

Check-your-understanding answers

  1. To prevent oscillation when metrics hover near the boundary.
  2. Use a severity ladder; choose the most severe trigger.
  3. Include previous mode, new mode, and a reason code.

Real-world applications

  • CubeSat safe mode control logic.
  • Fault-protection frameworks in larger spacecraft.

Where you’ll apply it

References

  • NASA GSFC-HDBK-8007 (Fault management)
  • Space Mission Engineering (Operations and modes)

Key insights

Mode management is the safety boundary between a healthy mission and a dead satellite.

Summary

A deterministic, hysteresis-aware state machine is required for stable operations.

Homework/Exercises to practice the concept

  • Define SAFE/NOMINAL/SCIENCE thresholds and draw a transition table.

Solutions to the homework/exercises

  • Use separate entry and exit thresholds 10-20 percentage points apart (for example, enter SAFE below 30% SoC and exit above 40%) plus a 60 s cooldown.

Command Gating and Subsystem Contracts

Fundamentals

Subsystems are independent components with their own constraints. Command gating is the practice of checking whether a command is allowed in the current mode and power/thermal state. Without gating, a single operator command could violate safety constraints (e.g., enabling payload in eclipse). Subsystem contracts are written promises: EPS promises to provide SoC and voltage; ADCS promises attitude; COMMS promises link status. The mode manager is the policy layer that ensures these contracts are respected.

Deep Dive into the concept

Flight software must mediate between operators and hardware. This means commands are not always safe to execute immediately. Command gating adds a validation layer that inspects the command type, the current mode, and relevant subsystem telemetry. For example, a “payload on” command might require: mode == SCIENCE, SoC > 60%, thermal < 50C, and not in eclipse. If any condition fails, the command is rejected or queued. This logic must be deterministic and explainable; every rejected command should return a reason code for operators.

Subsystem contracts are the interface definitions that make gating possible. They specify what telemetry is published, what commands are accepted, and what state transitions are safe. Without a contract, one subsystem might interpret “enable” as a latch, while another expects a timed pulse. The mode manager relies on contracts to know which actions are reversible and which are destructive. A contract also includes units and update rates. If the mode manager assumes temperatures in Celsius but receives Kelvin, gating fails. Therefore contracts must be explicit and validated.

Practically, you can implement gating as a set of policy rules: for each command, define allowed modes, required telemetry thresholds, and any additional constraints. Then evaluate rules before executing the command. This can be implemented as a table-driven policy (data-driven) or as code (procedural). Table-driven policies are easier to audit and modify; procedural policies allow richer logic but are harder to review. For safety-critical systems, auditability matters, so prefer a table-based approach with clearly defined fields.
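A table-driven policy could be sketched like this. The PAYLOAD_ON row uses the conditions given earlier (SCIENCE mode, SoC above 60%, temperature below 50 C, not in eclipse); the DOWNLINK_START row, the `CMD_BEACON_ON` identifier, and all type and field names are invented for illustration. Commands with no rule are rejected, which is an assumed default-deny policy.

```c
#include <stdbool.h>
#include <stddef.h>

typedef enum { MODE_SAFE, MODE_NOMINAL, MODE_SCIENCE } flight_mode_t;
typedef enum { CMD_PAYLOAD_ON, CMD_DOWNLINK_START, CMD_BEACON_ON } cmd_id_t;

typedef enum {
    GATE_OK, GATE_UNKNOWN_CMD, GATE_BAD_MODE,
    GATE_LOW_SOC, GATE_HIGH_TEMP, GATE_ECLIPSE
} gate_result_t;

typedef struct {
    float soc_pct;  /* state of charge in percent */
    float temp_c;   /* temperature in degrees Celsius */
    bool  eclipse;
} telemetry_t;

#define MODE_BIT(m) (1u << (m))

/* One auditable row per command. */
typedef struct {
    cmd_id_t cmd;
    unsigned allowed_modes;   /* bitmask built with MODE_BIT() */
    float    min_soc_pct;     /* SoC must be strictly above this */
    float    max_temp_c;      /* temperature must be strictly below this */
    bool     forbid_eclipse;
} gate_rule_t;

static const gate_rule_t RULES[] = {
    /* Values from the payload-on example in the text. */
    { CMD_PAYLOAD_ON, MODE_BIT(MODE_SCIENCE), 60.0f, 50.0f, true },
    /* Illustrative values only. */
    { CMD_DOWNLINK_START, MODE_BIT(MODE_NOMINAL) | MODE_BIT(MODE_SCIENCE),
      40.0f, 60.0f, false },
};

/* Returns GATE_OK or the first failed check, in a fixed order. */
gate_result_t gate_check(cmd_id_t cmd, flight_mode_t mode, const telemetry_t *t)
{
    for (size_t i = 0; i < sizeof RULES / sizeof RULES[0]; i++) {
        const gate_rule_t *r = &RULES[i];
        if (r->cmd != cmd) continue;
        if (!(r->allowed_modes & MODE_BIT(mode))) return GATE_BAD_MODE;
        if (t->soc_pct <= r->min_soc_pct)         return GATE_LOW_SOC;
        if (t->temp_c >= r->max_temp_c)           return GATE_HIGH_TEMP;
        if (r->forbid_eclipse && t->eclipse)      return GATE_ECLIPSE;
        return GATE_OK;
    }
    return GATE_UNKNOWN_CMD;  /* default deny: no rule, no execution */
}
```

Because the checks run in a fixed order, the same inputs always yield the same reason code, which keeps the policy deterministic and easy to review row by row.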

Another subtlety is time-tagged commands. If you accept a command for future execution, you must ensure the constraints are checked at execution time, not just acceptance time. For example, a command uploaded during a pass might execute in eclipse later. The gating logic must re-validate conditions at execution. If invalid, the command should be rejected and logged.
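Re-validation at execution time can be illustrated with a single queued command. In this sketch the `tt_cmd_t` and `snapshot_t` types and the `tt_poll` function are hypothetical, and the payload-on rule is hard-coded to keep the example short; a real implementation would call the same gating policy used for immediate commands.

```c
#include <stdbool.h>
#include <stdint.h>

/* Conditions sampled at the moment of the check (hypothetical type). */
typedef struct {
    float soc_pct;
    bool  eclipse;
    bool  science_mode;
} snapshot_t;

typedef enum { TT_WAITING, TT_EXECUTED, TT_REJECTED } tt_status_t;

/* A queued time-tagged command. */
typedef struct {
    uint32_t    exec_at_s;  /* scheduled execution time */
    tt_status_t status;
} tt_cmd_t;

/* Payload-on conditions from the text, minus the thermal check for brevity. */
static bool payload_on_allowed(const snapshot_t *s)
{
    return s->science_mode && s->soc_pct > 60.0f && !s->eclipse;
}

/* Call every tick with the CURRENT snapshot. The decision is made when the
 * command comes due, never from conditions recorded at upload time. */
tt_status_t tt_poll(tt_cmd_t *c, uint32_t now_s, const snapshot_t *now)
{
    if (c->status != TT_WAITING || now_s < c->exec_at_s) {
        return c->status;
    }
    c->status = payload_on_allowed(now) ? TT_EXECUTED : TT_REJECTED;
    return c->status;
}
```

A command uploaded in sunlight and scheduled for a time that falls in eclipse ends up `TT_REJECTED`, and it stays rejected even if conditions improve afterwards, so the log shows one unambiguous outcome.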

Finally, command gating is part of overall fault management. If a command is repeatedly rejected, that might indicate a deeper issue (e.g., power deficit or stuck mode). Therefore you should emit telemetry counters and logs for rejections.
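One possible shape for such counters is sketched below. The reason names, the `reject_stats_t` type, and the idea of raising an alarm on a streak of consecutive rejections are assumptions for illustration, not a prescribed design.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical rejection reasons; REJ_COUNT sizes the counter array. */
typedef enum {
    REJ_BAD_MODE, REJ_LOW_SOC, REJ_HIGH_TEMP, REJ_ECLIPSE, REJ_STALE_TLM,
    REJ_COUNT
} reject_reason_t;

typedef struct {
    uint16_t by_reason[REJ_COUNT];  /* lifetime count per reason */
    uint16_t consecutive;           /* rejections since the last accepted command */
} reject_stats_t;

/* Record a rejection; counters saturate instead of wrapping around. */
void note_reject(reject_stats_t *s, reject_reason_t r)
{
    if ((unsigned)r >= REJ_COUNT) return;  /* ignore out-of-range reasons */
    if (s->by_reason[r] < UINT16_MAX) s->by_reason[r]++;
    if (s->consecutive < UINT16_MAX)  s->consecutive++;
}

/* An accepted command ends the current streak but keeps lifetime totals. */
void note_accept(reject_stats_t *s)
{
    s->consecutive = 0;
}

/* True when the streak is long enough to deserve operator attention. */
bool reject_streak_alarm(const reject_stats_t *s, uint16_t limit)
{
    return s->consecutive >= limit;
}
```

Downlinking `by_reason` as housekeeping telemetry lets the ground distinguish a one-off operator mistake from a persistent condition such as a power deficit.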

How this fits into the project

This concept drives Section 3.2 (command validation) and Section 5.5 (design questions), and feeds into the P12 ground console design.

Definitions & key terms

  • Command gating -> Validating a command against mode and telemetry constraints.
  • Subsystem contract -> Documented interface: commands, telemetry, units, and constraints.
  • Reason code -> Structured explanation for rejection or acceptance.

Mental model diagram (ASCII)

Command -> Policy Check -> [Allowed?] -> Execute -> Telemetry Event
                       \-> Reject + Reason

How it works (step-by-step, with invariants and failure modes)

  1. Receive command (immediate or time-tagged).
  2. Check allowed modes and thresholds.
  3. If valid, execute and log success.
  4. If invalid, reject and log reason code.

Invariants: no unsafe command executes; every rejection has a reason; telemetry uses correct units.

Failure modes: missing unit conversions, stale telemetry, non-deterministic policy outcomes.

Minimal concrete example

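/* soc is in percent, temp_c in degrees Celsius; limits follow the payload-on rule above. */
if (cmd == CMD_PAYLOAD_ON) {
  if (mode == MODE_SCIENCE && soc > 60 && temp_c < 50 && !eclipse)
    accept();
  else
    reject(REASON_UNSAFE_MODE);  /* a fuller version reports which check failed */
}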

Common misconceptions

  • “Operators won’t send unsafe commands” -> Mistakes happen; software must guard.
  • “Check once at upload time” -> Conditions change; re-check at execution.

Check-your-understanding questions

  1. Why should command gating be data-driven when possible?
  2. What telemetry fields are critical for payload enablement?
  3. How do you handle stale telemetry in gating?

Check-your-understanding answers

  1. It is auditable and easier to review for safety.
  2. SoC, eclipse status, thermal limits, current mode.
  3. Reject or defer commands when telemetry age exceeds limit.

Real-world applications

  • Ground command approval systems.
  • Onboard autonomy scripts that must obey safety policies.

Where you’ll apply it

References

  • Spacecraft Systems Engineering (command and data handling)
  • NASA Mission Success Handbook (operational safety)

Key insights

Commands are not actions; they are requests filtered through safety policy.

Summary

Command gating turns a state machine into a safety system.

Homework/Exercises to practice the concept

  • Build a small table of commands and allowed modes.

Solutions to the homework/exercises

  • Example: PAYLOAD_ON allowed only in SCIENCE with SoC > 60%.

3. Project Specification

3.1 What You Will Build

A deterministic mode manager that transitions between SAFE, NOMINAL, and SCIENCE based on telemetry thresholds and fault triggers, with explicit command gating rules and reason-code telemetry.

3.2 Functional Requirements

  1. Mode state machine: define states, transitions, and entry/exit actions.
  2. Hysteresis & cooldown: implement thresholds and minimum dwell times.
  3. Command gating: validate commands against mode and telemetry.
  4. Telemetry events: emit mode changes and rejection reasons.

3.3 Non-Functional Requirements

  • Determinism: same telemetry trace yields identical mode transitions.
  • Reliability: no oscillation under noisy telemetry.
  • Traceability: every transition logs a reason code.

3.4 Example Usage / Output

$ ./mode_sim --trace traces/low_soc.json
[t=120s] MODE=SAFE reason=LOW_SOC
[t=480s] MODE=NOMINAL reason=SOC_RECOVERED

3.5 Data Formats / Schemas / Protocols

Input telemetry trace schema:

{"t":0,"soc":0.55,"temp":32.1,"eclipse":false,"fault":null}

Mode event schema:

{"t":120,"prev":"NOMINAL","next":"SAFE","reason":"LOW_SOC"}
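Emitting an event in this schema takes little more than one `snprintf` call. The sketch below assumes the reason is passed in as a plain identifier string containing no characters that need JSON escaping; `format_mode_event` and `mode_name` are hypothetical helper names.

```c
#include <stdint.h>
#include <stdio.h>

typedef enum { MODE_SAFE, MODE_NOMINAL, MODE_SCIENCE } flight_mode_t;

/* Maps a mode to the string used in the event schema. */
static const char *mode_name(flight_mode_t m)
{
    switch (m) {
    case MODE_SAFE:    return "SAFE";
    case MODE_NOMINAL: return "NOMINAL";
    case MODE_SCIENCE: return "SCIENCE";
    }
    return "UNKNOWN";
}

/* Writes one mode event as a JSON object into buf.
 * Returns the number of characters written, or -1 if buf is too small.
 * Assumes `reason` needs no JSON escaping (e.g. "LOW_SOC"). */
int format_mode_event(char *buf, size_t len, uint32_t t_s,
                      flight_mode_t prev, flight_mode_t next,
                      const char *reason)
{
    int n = snprintf(buf, len,
                     "{\"t\":%lu,\"prev\":\"%s\",\"next\":\"%s\",\"reason\":\"%s\"}",
                     (unsigned long)t_s, mode_name(prev), mode_name(next), reason);
    return (n >= 0 && (size_t)n < len) ? n : -1;
}
```

Checking the return value against the buffer length matters here: `snprintf` truncates silently, and a truncated event would be invalid JSON on the ground.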

3.6 Edge Cases

  • SoC oscillating around threshold.
  • Conflicting triggers (high temp + low SoC).
  • Missing telemetry samples (gaps).

3.7 Real World Outcome

A mission-quality mode manager that can be exercised with traces and shows stable behavior under faults.

3.7.1 How to Run (Copy/Paste)

./mode_sim --trace traces/low_soc.json --cooldown 60

3.7.2 Golden Path Demo (Deterministic)

  • Use traces/nominal.json with fixed timestamps.
  • Expect a single transition into NOMINAL at t=0 and no further transitions.

3.7.3 Failure Demo (Deterministic)

./mode_sim --trace traces/thrashing.json

Expected behavior: with hysteresis enabled, only one SAFE entry and one exit despite noisy SoC.
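The effect can be checked in isolation with a small counting function before the full simulator exists. This sketch covers only the SAFE/NOMINAL pair and ignores cooldown; `count_transitions` is a hypothetical helper, and passing the same value for both thresholds disables hysteresis for comparison.

```c
#include <stddef.h>

typedef enum { MODE_SAFE, MODE_NOMINAL } flight_mode_t;

/* Counts mode transitions over a SoC trace given in percent.
 * Starts in NOMINAL, enters SAFE below enter_pct, leaves above exit_pct. */
size_t count_transitions(const float *soc_pct, size_t n,
                         float enter_pct, float exit_pct)
{
    flight_mode_t mode = MODE_NOMINAL;
    size_t transitions = 0;

    for (size_t i = 0; i < n; i++) {
        flight_mode_t next = mode;
        if (mode == MODE_NOMINAL && soc_pct[i] < enter_pct) {
            next = MODE_SAFE;
        } else if (mode == MODE_SAFE && soc_pct[i] > exit_pct) {
            next = MODE_NOMINAL;
        }
        if (next != mode) {
            mode = next;
            transitions++;
        }
    }
    return transitions;
}
```

For the jittery trace 35, 29, 31, 29, 31, 29, 33, 38, 41, 45, a 30/40 threshold pair yields two transitions (one SAFE entry, one exit), while a single 30% threshold yields six.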

3.7.4 If CLI: Exact Terminal Transcript

$ ./mode_sim --trace traces/low_soc.json --cooldown 60
[t=120s] MODE=SAFE reason=LOW_SOC
[t=480s] MODE=NOMINAL reason=SOC_RECOVERED
ExitCode=0

4. Solution Architecture

4.1 High-Level Design

Telemetry Trace -> Evaluator -> Mode Decision -> Actions -> Event Log
                         ^           |
                         |-----------| (hysteresis + cooldown)

4.2 Key Components

| Component | Responsibility | Key Decisions |
|---|---|---|
| Evaluator | Reads telemetry and fault flags | Stale data handling |
| Mode FSM | Determines next mode | Priority order |
| Gating Engine | Approves commands | Rule format |
| Logger | Emits mode events | Reason codes |

4.3 Data Structures (No Full Code)

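/* Named flight_mode_t because POSIX already defines mode_t in <sys/types.h>. */
typedef enum { MODE_SAFE, MODE_NOMINAL, MODE_SCIENCE } flight_mode_t;

typedef struct {
  float soc;     /* state of charge as read from the trace (0-1 fraction); scale before comparing with percent thresholds */
  float temp_c;  /* temperature in degrees Celsius */
  int eclipse;   /* nonzero while in eclipse */
} telemetry_t;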

4.4 Algorithm Overview

Key Algorithm: Mode decision loop

  1. Evaluate trigger conditions (power, thermal, faults).
  2. Select target mode by severity.
  3. Apply hysteresis and cooldown.
  4. Transition and log if mode changes.

Complexity Analysis:

  • Time: O(1) per telemetry sample.
  • Space: O(1) state.

5. Implementation Guide

5.1 Development Environment Setup

cc -O2 -Wall -Wextra -o mode_sim src/*.c

5.2 Project Structure

project-root/
+-- src/
|   +-- main.c
|   +-- fsm.c
|   +-- gating.c
|   +-- trace.c
+-- traces/
|   +-- nominal.json
|   +-- low_soc.json
|   +-- thrashing.json
+-- README.md

5.3 The Core Question You’re Answering

“How does a spacecraft decide what it is allowed to do right now?”

5.4 Concepts You Must Understand First

  1. Deterministic state machines.
  2. Hysteresis and cooldown strategies.
  3. Safety invariants and command gating.

5.5 Questions to Guide Your Design

  1. What is the minimum SoC for SAFE entry and exit?
  2. Which faults override all other conditions?
  3. What is the operator-facing reason code set?

5.6 Thinking Exercise

Draw the state machine and annotate each transition with a threshold and a timer.

5.7 The Interview Questions They’ll Ask

  1. “How do you prevent mode thrashing?”
  2. “What must remain enabled in safe mode?”
  3. “How do you prioritize fault triggers?”

5.8 Hints in Layers

Hint 1: Start with a simple if/else ladder for mode selection.

Hint 2: Add a cooldown timer to block rapid oscillations.

Hint 3: Encode thresholds in a config file to simplify testing.

Hint 4: Log transitions with structured reason codes.


5.9 Books That Will Help

| Topic | Book | Chapter |
|---|---|---|
| Fault management | NASA GSFC-HDBK-8007 | Fault protection |
| Space operations | Space Mission Engineering | Modes and ops |
| FSM design | “Practical UML Statecharts” | State machine patterns |

5.10 Implementation Phases

Phase 1: FSM Skeleton (2-3 days)

Goals: implement modes and transitions.

Tasks:

  1. Define enums and transition logic.
  2. Add simple trace playback.

Checkpoint: transitions occur as expected on simple traces.

Phase 2: Hysteresis + Cooldown (3-4 days)

Goals: stabilize transitions.

Tasks:

  1. Implement entry/exit thresholds.
  2. Add cooldown timers.

Checkpoint: thrashing trace produces stable behavior.

Phase 3: Command Gating (3-4 days)

Goals: enforce mode-based command rules.

Tasks:

  1. Add a small policy table.
  2. Log rejection reasons.

Checkpoint: unsafe commands are rejected with reason codes.

5.11 Key Implementation Decisions

| Decision | Options | Recommendation | Rationale |
|---|---|---|---|
| Transition logic | table-driven / code | Table-driven | Easier to audit |
| Hysteresis | threshold gap / time delay | Both | Combats noise and delays |
| Command gating | rules engine / hard-coded | Rules engine | Auditability |

6. Testing Strategy

6.1 Test Categories

| Category | Purpose | Examples |
|---|---|---|
| Unit Tests | Trigger evaluation | low_soc, high_temp |
| Integration Tests | Full trace playback | nominal.json |
| Edge Case Tests | Thrashing | jittery SoC trace |

6.2 Critical Test Cases

  1. LOW_SOC entry: SoC drops below threshold -> SAFE.
  2. Recovery: SoC rises above exit threshold -> NOMINAL.
  3. Conflict: high temp + low SoC -> SAFE with high-temp reason.

6.3 Test Data

{"t":100,"soc":0.29,"temp":30,"eclipse":true}

7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

| Pitfall | Symptom | Solution |
|---|---|---|
| No hysteresis | Rapid mode flips | Add entry/exit thresholds |
| Missing reason codes | Operators confused | Log reason codes |
| Stale telemetry | Wrong transitions | Reject stale samples |

7.2 Debugging Strategies

  • Replay traces: deterministic playback reveals transitions.
  • Event logs: compare logs against expected timeline.

7.3 Performance Traps

None; logic is O(1). Focus on correctness.


8. Extensions & Challenges

8.1 Beginner Extensions

  • Add a CONFIG file for thresholds.

8.2 Intermediate Extensions

  • Add a fourth mode (COMMS) with its own gating rules.

8.3 Advanced Extensions

  • Persist state and timers across simulated reboot.

9. Real-World Connections

9.1 Industry Applications

  • Flight software mode managers in CubeSat missions.
  • Autonomous mode logic in deep-space probes.

9.2 Related Open Source Projects

  • NASA cFS: mode management and command gating.
  • OpenSatKit: operational command frameworks.

9.3 Interview Relevance

  • State machine design and safety gating are common systems interview topics.

10. Resources

10.1 Essential Reading

  • NASA GSFC-HDBK-8007 - fault management strategies.
  • Space Mission Engineering - operations and mode design.

10.2 Video Resources

  • Mission operations talks (CubeSat conferences).

10.3 Tools & Documentation

  • Graphviz: render state machine diagrams.
  • jq: inspect telemetry traces.

11. Self-Assessment Checklist

11.1 Understanding

  • I can explain why hysteresis is critical for mode stability.
  • I can describe the safety invariants enforced by SAFE mode.
  • I can reason about command gating rules.

11.2 Implementation

  • Deterministic transitions on all traces.
  • Mode logs include reasons and timestamps.
  • Gating rejects unsafe commands.

11.3 Growth

  • I can explain this FSM to a mission operator.
  • I can propose an improvement to reduce false transitions.

12. Submission / Completion Criteria

Minimum Viable Completion:

  • Implement SAFE and NOMINAL with threshold logic.
  • Log transitions with reason codes.
  • Pass the golden trace test.

Full Completion:

  • Include SCIENCE mode and command gating.
  • Add cooldown/hysteresis to prevent thrashing.

Excellence (Going Above & Beyond):

  • Persist mode state across reboot and implement manual override safeguards.