Project 8: Memory Scrubbing Simulator (Defeating Radiation)

Build a memory scrubbing simulator that injects SEU bit flips, detects them with Hamming codes, corrects single-bit errors, and logs uncorrectable faults.

Quick Reference

Attribute	Value
Difficulty	Level 3: Advanced
Time Estimate	1-2 weeks
Main Programming Language	C
Alternative Programming Languages	Python
Coolness Level	Level 3: The Bit Guardian
Business Potential	Level 1: Reliability Tooling
Prerequisites	Bitwise ops, basic coding theory, scheduling
Key Topics	SEU modeling, Hamming codes, scrub scheduling

1. Learning Objectives

By completing this project, you will:

Model single-event upsets (SEUs) as random bit flips in memory.
Implement Hamming-based error detection and correction (EDAC).
Design a scrub scheduler that scans memory periodically.
Log corrected vs uncorrectable errors deterministically.
Understand how radiation impacts flight software reliability.

2. All Theory Needed (Per-Concept Breakdown)

Radiation-Induced SEUs and Memory Error Models

Fundamentals In orbit, radiation can flip memory bits. These single-event upsets (SEUs) occur randomly and can corrupt data structures, state machines, and critical code. Flight software cannot assume memory is perfect. To maintain correctness, you must detect and correct errors before they accumulate. A memory scrubbing task periodically scans memory for corrupted bits and fixes them. Understanding SEU behavior and modeling is essential for designing scrub policies.

Deep Dive into the concept Radiation in space includes high-energy protons, electrons, and heavy ions. When these particles strike semiconductor devices, they can deposit enough charge to flip a memory bit. The result is a single-event upset, typically a random bit flip in SRAM or DRAM. The rate of SEUs depends on orbit, shielding, and solar activity. Low Earth orbit experiences lower rates than higher orbits, but SEUs still occur. During solar storms, SEU rates can increase dramatically.

For simulation, you can model SEUs as Poisson events: each bit has a small probability of flipping per time unit. A simple model is to inject a random bit flip at a configurable rate (e.g., 1e-5 per bit per second). You can implement this by selecting a random bit index and flipping it. The important part is determinism: use a fixed random seed so your tests are repeatable. The simulator should allow adjusting the SEU rate to explore worst-case scenarios.
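
The per-bit model above can be sketched as one Bernoulli trial per bit per time step. This is a minimal sketch, not a definitive implementation: the names `seu_rng_t` and `seu_inject_step` are hypothetical, and the xorshift32 generator is swapped in for `rand()` only because its sequence does not depend on the C library, which keeps seeded runs repeatable across platforms.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical PRNG wrapper: xorshift32. The state (seed) must be nonzero. */
typedef struct { uint32_t state; } seu_rng_t;

static uint32_t seu_rng_next(seu_rng_t *rng) {
    uint32_t x = rng->state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    rng->state = x;
    return x;
}

/* Uniform double in [0, 1). */
static double seu_rng_unit(seu_rng_t *rng) {
    return seu_rng_next(rng) / 4294967296.0;
}

/* Advance the memory by one step of dt time units.
 * Each bit flips independently with probability rate * dt (assumed << 1),
 * the discrete counterpart of a Poisson process with that rate per bit.
 * Returns the number of bits flipped in this step. */
size_t seu_inject_step(uint8_t *mem, size_t mem_bytes,
                       double rate, double dt, seu_rng_t *rng) {
    double p = rate * dt;
    size_t flips = 0;
    for (size_t bit = 0; bit < mem_bytes * 8; bit++) {
        if (seu_rng_unit(rng) < p) {
            mem[bit / 8] ^= (uint8_t)(1u << (bit % 8));
            flips++;
        }
    }
    return flips;
}
```

The loop costs O(bits) per step, which is fine for small simulated memories; for large ones, drawing the number of flips first and then picking random positions is a common alternative.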

SEUs can be single-bit or multi-bit. Single-bit flips are correctable with Hamming codes; multi-bit flips are not. Your simulator should allow injecting double-bit flips to test uncorrectable detection. Over time, multiple single-bit errors can accumulate in the same word if scrubbing is too infrequent. This is why scrub scheduling matters: you want to scrub frequently enough that the probability of multiple errors in one word is low.
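
To put a number on "frequently enough", the chance that one word collects two or more flips between scrubs can be computed from the binomial distribution. The function below is a sketch under stated assumptions: flips are independent, each bit flips with probability `rate * interval` during one scrub interval (valid while that product is much smaller than 1), and the name `p_multi_flip` is hypothetical.

```c
/* Probability that a word of word_bits bits suffers >= 2 flips in one
 * scrub interval. Sums the binomial terms k = 2..n directly, which avoids
 * the cancellation of computing 1 - P(0) - P(1) for tiny probabilities. */
double p_multi_flip(unsigned word_bits, double rate, double interval) {
    double p = rate * interval;          /* per-bit flip probability */
    if (p <= 0.0 || word_bits < 2) return 0.0;
    if (p >= 1.0) return 1.0;

    double term = 1.0;                   /* k = 0 term: (1 - p)^n */
    for (unsigned i = 0; i < word_bits; i++) term *= (1.0 - p);

    double sum = 0.0;
    for (unsigned k = 1; k <= word_bits; k++) {
        /* C(n,k) p^k (1-p)^(n-k) derived from the previous term */
        term *= (double)(word_bits - k + 1) / k * (p / (1.0 - p));
        if (k >= 2) sum += term;
    }
    return sum;
}
```

For small p the result is close to C(n,2) * p^2, so doubling the scrub interval roughly quadruples the per-word risk of an uncorrectable error.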

SEU modeling also connects to fault management. If you detect uncorrectable errors, you may need to reboot or enter safe mode. In this project, you will log these events and continue running. In real systems, uncorrectable errors can trigger more severe responses. Therefore, your simulator should classify errors and produce statistics that can inform policy decisions.

How this fits into the project This concept drives Section 3.2 (SEU injection) and Section 7 debugging. It connects with P11 fault management.

Definitions & key terms

SEU -> Single-event upset, random bit flip.
SCRUB -> Periodic scan of memory to detect/correct errors.
Poisson process -> Random event model with constant average rate.

Mental model diagram (ASCII)

Radiation -> Bit Flip -> EDAC -> Correct or Log Fault

How it works (step-by-step, with invariants and failure modes)

Randomly select a bit and flip it.
Read encoded word and compute syndrome.
If single-bit error, correct and log.
If multi-bit, log uncorrectable error.

Invariants: parity bits are recomputed on every write to a protected word; the scrub interval stays fixed during a run.

Failure modes: scrubbing too slowly, double-bit errors that go undetected or are miscorrected, nondeterministic tests.

Minimal concrete example

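/* mem is a uint8_t array holding mem_bits bits; call srand(seed) once at startup for repeatable runs */
uint32_t bit = (uint32_t)rand() % mem_bits;   /* pick one bit (RAND_MAX may be as small as 32767) */
mem[bit / 8] ^= (uint8_t)(1u << (bit % 8));   /* flip it */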

Common misconceptions

“SEUs are rare enough to ignore” -> Even a few can corrupt mission state.
“EDAC fixes everything” -> Double-bit errors remain uncorrectable.

Check-your-understanding questions

Why is scrub interval important?
What happens if two bits flip in the same word?
How does SEU rate affect design?

Check-your-understanding answers

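Longer intervals increase the chance that a second flip lands in a word before the first is corrected.
A plain Hamming code miscorrects it; with the extra SECDED parity bit it is detected but still cannot be corrected.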
Higher rates require more frequent scrubbing.

Real-world applications

Onboard memory protection in spacecraft computers.
Radiation-hardened system qualification.

Where you’ll apply it

See Section 3.2 (SEU injection) and Section 6.2 (fault tests).
Also used in: P11-fdir-watchdog-the-dead-mans-switch.md

References

NASA GSFC-HDBK-8007 (fault management)
Wertz, Spacecraft Reliability Engineering (SEU models)

Key insights SEU risk is statistical; scrubbing is a scheduling problem.

Summary Modeling SEUs gives you a quantitative way to design scrub policies.

Homework/Exercises to practice the concept

Compute expected SEU count in 1 hour for a 1 Mbit memory at 1e-6 flips/bit/hour.

Solutions to the homework/exercises

Expected flips = 1e6 bits * 1e-6 flips/bit/hour * 1 hour = 1 flip.

Hamming Codes and Error Detection/Correction

Fundamentals Hamming codes add parity bits to data so you can detect and correct errors. A Hamming(7,4) code, for example, stores 4 data bits and 3 parity bits. The parity bits are computed over specific bit positions. When you read back data, you compute a syndrome that tells you which bit (if any) flipped. A plain Hamming code corrects any single-bit error, but it cannot tell a double-bit error from a single-bit one and will miscorrect it. Adding one overall parity bit (SECDED) lets the decoder detect, but not correct, double-bit errors. This is the core of EDAC in many space systems.

Deep Dive into the concept Hamming codes are linear block codes. You arrange data and parity bits into a word such that each parity bit covers a specific set of bit positions. When you read a word, you recompute parity and compare to stored parity. The XOR of these yields a syndrome. The syndrome is an index of the bit that is incorrect; if the syndrome is zero, the word is clean. This allows single-bit correction by flipping the indicated bit.

For example, in Hamming(7,4), positions 1,2,4 are parity bits, and positions 3,5,6,7 are data bits. Parity bit 1 covers positions with LSB=1, parity bit 2 covers positions with the second bit set, parity bit 4 covers positions with the third bit set. This systematic structure means you can compute the syndrome by checking which parity constraints failed. The syndrome value directly maps to the erroneous bit index.
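
The position layout just described can be sketched in a few functions. This is an illustrative sketch, assuming bit i of a `uint8_t` holds code position i (bit 0 is left unused) and that data bits d0..d3 sit at positions 3, 5, 6, 7; the function names are hypothetical.

```c
#include <stdint.h>

/* Value of the bit stored at code position pos (1..7). */
static unsigned bit_at(uint8_t cw, unsigned pos) {
    return (cw >> pos) & 1u;
}

/* Encode the low 4 bits of data into a 7-bit Hamming code word. */
uint8_t hamming74_encode(uint8_t data) {
    unsigned d0 = data & 1u, d1 = (data >> 1) & 1u;
    unsigned d2 = (data >> 2) & 1u, d3 = (data >> 3) & 1u;
    unsigned p1 = d0 ^ d1 ^ d3;   /* covers positions 3, 5, 7 */
    unsigned p2 = d0 ^ d2 ^ d3;   /* covers positions 3, 6, 7 */
    unsigned p4 = d1 ^ d2 ^ d3;   /* covers positions 5, 6, 7 */
    return (uint8_t)((p1 << 1) | (p2 << 2) | (d0 << 3) | (p4 << 4) |
                     (d1 << 5) | (d2 << 6) | (d3 << 7));
}

/* Recheck the three parity constraints; the result is the 1-based
 * position of a single flipped bit, or 0 if all checks pass. */
unsigned hamming74_syndrome(uint8_t cw) {
    unsigned s1 = bit_at(cw, 1) ^ bit_at(cw, 3) ^ bit_at(cw, 5) ^ bit_at(cw, 7);
    unsigned s2 = bit_at(cw, 2) ^ bit_at(cw, 3) ^ bit_at(cw, 6) ^ bit_at(cw, 7);
    unsigned s4 = bit_at(cw, 4) ^ bit_at(cw, 5) ^ bit_at(cw, 6) ^ bit_at(cw, 7);
    return s1 | (s2 << 1) | (s4 << 2);
}

/* Correct at most one flipped bit and return the 4 data bits.
 * syndrome_out (optional) receives the syndrome that was observed. */
uint8_t hamming74_decode(uint8_t cw, unsigned *syndrome_out) {
    unsigned s = hamming74_syndrome(cw);
    if (s) cw ^= (uint8_t)(1u << s);
    if (syndrome_out) *syndrome_out = s;
    return (uint8_t)(bit_at(cw, 3) | (bit_at(cw, 5) << 1) |
                     (bit_at(cw, 6) << 2) | (bit_at(cw, 7) << 3));
}
```

With this layout, data 0xB encodes to 0xAA, and flipping position 6 yields syndrome 6, which is exactly the position to flip back.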

To detect double-bit errors, you can add an overall parity bit (SECDED: single-error correct, double-error detect). A single flip changes the overall parity, so a non-zero syndrome together with an overall parity mismatch indicates a correctable single-bit error. Two flips leave the overall parity unchanged, so a non-zero syndrome with matching overall parity indicates a double-bit error. A zero syndrome with a parity mismatch means the overall parity bit itself flipped. This is important in space systems because double-bit errors are rare but possible, especially if the scrub interval is too long. In this project, you can implement SECDED for each word to detect uncorrectable errors and log them.
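
A SECDED decoder classifies each word by combining the syndrome with the overall parity. The sketch below assumes an (8,4) layout in one byte: bits 1..7 are the Hamming positions and bit 0 is the overall parity bit. The type and function names are hypothetical.

```c
#include <stdint.h>

typedef enum { ECC_CLEAN, ECC_CORRECTED, ECC_UNCORRECTABLE } ecc_status_t;

/* 1 if an odd number of bits are set in v. */
static unsigned parity8(uint8_t v) {
    v ^= v >> 4; v ^= v >> 2; v ^= v >> 1;
    return v & 1u;
}

/* Hamming syndrome over positions 1..7 (bit 0 is excluded by the masks). */
static unsigned syndrome7(uint8_t cw) {
    return parity8(cw & 0xAA)            /* positions 1,3,5,7 */
         | (parity8(cw & 0xCC) << 1)     /* positions 2,3,6,7 */
         | (parity8(cw & 0xF0) << 2);    /* positions 4,5,6,7 */
}

uint8_t secded84_encode(uint8_t data) {
    /* d0 -> position 3, d1..d3 -> positions 5..7 */
    uint8_t cw = (uint8_t)(((data & 1u) << 3) | ((data & 0xEu) << 4));
    unsigned s = syndrome7(cw);          /* parity values that zero the checks */
    cw |= (uint8_t)(((s & 3u) << 1) | ((s & 4u) << 2));
    cw |= (uint8_t)parity8(cw);          /* bit 0 makes total parity even */
    return cw;
}

/* Classify and, where possible, repair cw. data_out always receives the
 * data bits as they stand after any correction; it is not trustworthy
 * when the status is ECC_UNCORRECTABLE. */
ecc_status_t secded84_decode(uint8_t cw, uint8_t *data_out) {
    unsigned s = syndrome7(cw);
    unsigned odd = parity8(cw);          /* 1 = overall parity mismatch */
    ecc_status_t st = ECC_CLEAN;
    if (odd) {
        cw ^= (uint8_t)(1u << s);        /* s == 0 repairs the parity bit itself */
        st = ECC_CORRECTED;
    } else if (s != 0) {
        st = ECC_UNCORRECTABLE;          /* even parity, failed checks: two flips */
    }
    *data_out = (uint8_t)(((cw >> 3) & 1u) | ((cw >> 4) & 0xEu));
    return st;
}
```

Three or more flips in one word can still be misclassified as a single-bit error; SECDED only guarantees its behavior for up to two flips.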

Implementing Hamming codes in software requires careful bit manipulation. You must pack and unpack parity bits correctly. It is easy to make off-by-one errors or misplace parity bits. The best practice is to write a reference encoder/decoder and test it with known vectors. You should also include an exhaustive test for small code sizes (e.g., all 16 possible 4-bit inputs) to verify correctness.
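
An exhaustive check for Hamming(7,4) is small enough to run on every build. The sketch below bundles a compact mask-based codec (an assumed layout: bit i of a byte is code position i, bit 0 unused) with a loop over all 16 inputs, all 7 single flips, and all 21 double flips per input. The names `h74_report_t` and `hamming74_exhaustive` are hypothetical.

```c
#include <stdint.h>

typedef struct {
    unsigned single_ok;            /* single flips decoded back to the input */
    unsigned double_miscorrected;  /* double flips decoded to the wrong data */
} h74_report_t;

static unsigned parity8(uint8_t v) {
    v ^= v >> 4; v ^= v >> 2; v ^= v >> 1;
    return v & 1u;
}

static unsigned h74_syndrome(uint8_t cw) {
    return parity8(cw & 0xAA) | (parity8(cw & 0xCC) << 1) | (parity8(cw & 0xF0) << 2);
}

static uint8_t h74_encode(uint8_t d) {
    uint8_t cw = (uint8_t)(((d & 1u) << 3) | ((d & 0xEu) << 4));
    unsigned s = h74_syndrome(cw);
    return (uint8_t)(cw | ((s & 3u) << 1) | ((s & 4u) << 2));
}

static uint8_t h74_decode(uint8_t cw) {
    unsigned s = h74_syndrome(cw);
    if (s) cw ^= (uint8_t)(1u << s);
    return (uint8_t)(((cw >> 3) & 1u) | ((cw >> 4) & 0xEu));
}

/* Try every input with every single- and double-bit error pattern. */
h74_report_t hamming74_exhaustive(void) {
    h74_report_t r = {0, 0};
    for (unsigned d = 0; d < 16; d++) {
        uint8_t cw = h74_encode((uint8_t)d);
        for (unsigned i = 1; i <= 7; i++) {
            if (h74_decode((uint8_t)(cw ^ (1u << i))) == d)
                r.single_ok++;
            for (unsigned j = i + 1; j <= 7; j++)
                if (h74_decode((uint8_t)(cw ^ (1u << i) ^ (1u << j))) != d)
                    r.double_miscorrected++;
        }
    }
    return r;
}
```

A correct codec reports 112 corrected single flips (16 x 7). It also reports 336 miscorrected double flips (16 x 21), which shows concretely why a plain Hamming code needs the extra SECDED parity bit before double-bit errors can be flagged.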

When integrating with the scrubber, you decode each word, compute the syndrome, correct if needed, and re-encode if you corrected a bit. This ensures the stored word remains consistent. The scrubber then logs correction counts and uncorrectable counts. These statistics can be used to adjust scrub frequency.
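
One scrub pass over an array of code words might look like the sketch below. It assumes one SECDED(8,4) code word per byte (bits 1..7 are Hamming positions, bit 0 is the overall parity bit); `scrub_stats_t`, `secded84_fix`, and `scrub_pass` are hypothetical names. Because the fix flips the bad bit inside the stored code word, the corrected word is already a valid encoding and no separate re-encode step is needed in this layout.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    unsigned long scanned;
    unsigned long corrected;
    unsigned long uncorrectable;
} scrub_stats_t;

static unsigned parity8(uint8_t v) {
    v ^= v >> 4; v ^= v >> 2; v ^= v >> 1;
    return v & 1u;
}

/* Returns 0 = clean, 1 = corrected in place, 2 = uncorrectable. */
static int secded84_fix(uint8_t *cw) {
    unsigned s = parity8(*cw & 0xAA) | (parity8(*cw & 0xCC) << 1)
               | (parity8(*cw & 0xF0) << 2);
    unsigned odd = parity8(*cw);
    if (s == 0 && !odd) return 0;
    if (!odd) return 2;              /* failed checks with even parity: two flips */
    *cw ^= (uint8_t)(1u << s);       /* s == 0 repairs the overall parity bit */
    return 1;
}

/* Scan every word once, repairing what can be repaired and counting the rest.
 * Counters accumulate across calls so the caller decides when to reset them. */
void scrub_pass(uint8_t *mem, size_t words, scrub_stats_t *stats) {
    for (size_t i = 0; i < words; i++) {
        int r = secded84_fix(&mem[i]);
        stats->scanned++;
        if (r == 1) stats->corrected++;
        else if (r == 2) stats->uncorrectable++;
    }
}
```

Uncorrectable words are left untouched here, so they are counted again on every later pass; a real scrubber would usually record the address and reinitialize or quarantine the word.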

How this fits into the project This concept drives Section 3.2 EDAC requirements and Section 6.2 unit tests, and informs P11 fault response.

Definitions & key terms

Syndrome -> Bit pattern indicating error location.
SECDED -> Single-error correct, double-error detect.
Parity bit -> Bit used to enforce XOR parity across data bits.

Mental model diagram (ASCII)

Data Bits -> Parity Bits -> Encoded Word -> Syndrome -> Correct/Detect

How it works (step-by-step, with invariants and failure modes)

Encode data bits with parity.
Read word and recompute parity.
Compute syndrome.
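If syndrome != 0 and overall parity mismatch -> single-bit error, correct.
If syndrome != 0 and overall parity ok -> double-bit error, uncorrectable.
If syndrome == 0 and overall parity mismatch -> the overall parity bit itself flipped, correct it.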

Invariants: encoding/decoding consistent; parity mapping fixed.

Failure modes: wrong parity mapping, false corrections, failure to detect double-bit errors.

Minimal concrete example

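/* s1, s2, s4: result of each parity check (stored parity XOR recomputed parity), 1 = check failed */
uint8_t syndrome = s1 | (s2 << 1) | (s4 << 2);
if (syndrome) flip_bit(word, syndrome);   /* syndrome is the 1-based position of the bad bit */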

Common misconceptions

“Hamming codes fix any error” -> They only fix single-bit errors.
“Double-bit errors are impossible” -> They are rare but real.

Check-your-understanding questions

How does the syndrome locate the wrong bit?
Why add an overall parity bit?
What happens if parity bits themselves flip?

Check-your-understanding answers

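Each parity check covers the positions whose index has one particular bit set, so the set of failed checks spells out the binary index of the flipped bit.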
To detect double-bit errors (SECDED).
The syndrome still points to the flipped parity bit.

Real-world applications

Onboard memory protection in spacecraft CPUs.
ECC in server memory and SSDs.

Where you’ll apply it

See Section 3.2 and Section 6.2 tests.
Also used in: P11-fdir-watchdog-the-dead-mans-switch.md

References

Lin & Costello, Error Control Coding
Peterson & Weldon, Error-Correcting Codes

Key insights A Hamming syndrome both detects a single-bit error and names its position, which makes per-word SEU correction cheap enough to run on every scrub pass.

Summary EDAC turns random bit flips into recoverable events.

Homework/Exercises to practice the concept

Implement Hamming(7,4) and verify all single-bit errors are corrected.

Solutions to the homework/exercises

For each of the 16 data values, flip each of the 7 code bits in turn and verify that the decoded data matches the original (112 cases).

3. Project Specification

3.1 What You Will Build

A simulator that models a memory array with ECC, injects SEUs at a controlled rate, scrubs memory at a fixed interval, and reports correction statistics.

3.2 Functional Requirements

SEU injection: random single- and double-bit flips with fixed seed.
ECC encoding/decoding: Hamming or SECDED for each word.
Scrub scheduler: periodic scan with configurable interval.
Logging: corrected vs uncorrectable errors, per interval statistics.

3.3 Non-Functional Requirements

Determinism: fixed RNG seed for repeatable runs.
Performance: scrub 1 MB memory at 1 Hz in simulation.
Reliability: no silent corruption; errors always logged.

3.4 Example Usage / Output

$ ./scrub_sim --memory 1024 --rate 1e-5 --seed 42
[SEU] Bit flip at address 0x1A4
[FIX] Corrected single-bit error
[STATS] corrected=12 uncorrectable=0

3.5 Data Formats / Schemas / Protocols

Stats JSON:

{"t":60,"corrected":12,"uncorrectable":1}
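
A line in this format can be produced with a single `snprintf` call. The sketch below assumes `t` is the elapsed simulation time and that all three fields are unsigned integers; the struct and function names are hypothetical.

```c
#include <stddef.h>
#include <stdio.h>

typedef struct {
    unsigned long t;              /* assumed: elapsed simulation time */
    unsigned long corrected;
    unsigned long uncorrectable;
} scrub_report_t;

/* Write one stats record into buf. Returns the number of characters
 * written (excluding the terminator), or -1 if buf is too small. */
int stats_to_json(const scrub_report_t *r, char *buf, size_t len) {
    int n = snprintf(buf, len,
                     "{\"t\":%lu,\"corrected\":%lu,\"uncorrectable\":%lu}",
                     r->t, r->corrected, r->uncorrectable);
    return (n >= 0 && (size_t)n < len) ? n : -1;
}
```

Emitting one record per line keeps the log easy to diff against a golden file.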

3.6 Edge Cases

Back-to-back flips in same word.
Scrub interval too long.
Memory size not divisible by ECC word size.

3.7 Real World Outcome

A deterministic simulation showing error detection, correction, and fault statistics.

3.7.1 How to Run (Copy/Paste)

./scrub_sim --memory 1024 --rate 1e-5 --seed 42 --interval 1
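
The four flags in this command line can be parsed with the standard `strtoul` and `strtod` functions. This is a sketch only: the `sim_config_t` struct and `parse_args` function are hypothetical, and the units of `--memory` and `--interval` are whatever the simulator defines, since the flags alone do not say.

```c
#include <stdlib.h>
#include <string.h>

typedef struct {
    unsigned long memory;     /* --memory   */
    double        rate;       /* --rate     */
    unsigned long seed;       /* --seed     */
    unsigned long interval;   /* --interval */
} sim_config_t;

/* Fill cfg from "--flag value" pairs. Fields not named on the command
 * line keep the defaults the caller stored in cfg beforehand.
 * Returns 0 on success, -1 on an unknown flag, a missing value,
 * or a value that is not a complete number. */
int parse_args(int argc, char **argv, sim_config_t *cfg) {
    for (int i = 1; i < argc; i += 2) {
        if (i + 1 >= argc) return -1;            /* flag without a value */
        const char *val = argv[i + 1];
        char *end = NULL;
        if (strcmp(argv[i], "--memory") == 0)
            cfg->memory = strtoul(val, &end, 10);
        else if (strcmp(argv[i], "--rate") == 0)
            cfg->rate = strtod(val, &end);
        else if (strcmp(argv[i], "--seed") == 0)
            cfg->seed = strtoul(val, &end, 10);
        else if (strcmp(argv[i], "--interval") == 0)
            cfg->interval = strtoul(val, &end, 10);
        else
            return -1;
        if (end == val || *end != '\0') return -1;   /* trailing junk or empty */
    }
    return 0;
}
```

Rejecting malformed input up front matters for determinism: a silently defaulted seed or rate would make two runs look comparable when they are not.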

3.7.2 Golden Path Demo (Deterministic)

Use seed 42 and rate 1e-5.
Compare logs to golden_scrub.log.

3.7.3 Failure Demo (Deterministic)

./scrub_sim --memory 1024 --rate 1e-3 --interval 60 --seed 42

Expected: multiple uncorrectable errors; exit code 3.

3.7.4 If CLI: Exact Terminal Transcript

$ ./scrub_sim --memory 1024 --rate 1e-5 --seed 42
[STATS] corrected=12 uncorrectable=0
ExitCode=0

4. Solution Architecture

4.1 High-Level Design

Memory -> ECC Encode -> SEU Injection -> Scrubber -> Logs

4.2 Key Components

Component	Responsibility	Key Decisions
SEU Injector	Random flips	Poisson rate
ECC Encoder	Add parity bits	Hamming/SECDED
Scrubber	Periodic scan	Interval tuning
Logger	Stats and events	Structured logs

4.3 Data Structures (No Full Code)

typedef struct { uint16_t word; } ecc_word_t;
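
One way to fill this 16-bit container is a shortened Hamming(12,8) code plus an overall parity bit, giving 13 used bits per 8 data bits. The layout here is an assumption, not something the struct prescribes: bit 0 is overall parity, bits 1..12 are Hamming positions with parity at 1, 2, 4, 8, and bits 13..15 are unused. The function names are hypothetical.

```c
#include <stdint.h>

typedef struct { uint16_t word; } ecc_word_t;
typedef enum { ECC_CLEAN, ECC_CORRECTED, ECC_UNCORRECTABLE } ecc_status_t;

/* XOR of the positions (1..12) of all set bits; 0 for a valid code word. */
static unsigned syndrome12(uint16_t w) {
    unsigned s = 0;
    for (unsigned pos = 1; pos <= 12; pos++)
        if ((w >> pos) & 1u) s ^= pos;
    return s;
}

/* 1 if an odd number of the 13 used bits are set. */
static unsigned parity13(uint16_t w) {
    unsigned p = 0;
    for (unsigned i = 0; i <= 12; i++) p ^= (w >> i) & 1u;
    return p;
}

static int is_pow2(unsigned x) { return (x & (x - 1)) == 0; }

ecc_word_t ecc_encode8(uint8_t data) {
    ecc_word_t cw = { 0 };
    unsigned d = 0;
    for (unsigned pos = 1; pos <= 12; pos++) {
        if (is_pow2(pos)) continue;                 /* reserved for parity */
        if ((data >> d) & 1u) cw.word |= (uint16_t)(1u << pos);
        d++;
    }
    unsigned s = syndrome12(cw.word);               /* parity bits that zero it */
    for (unsigned p = 1; p <= 8; p <<= 1)
        if (s & p) cw.word |= (uint16_t)(1u << p);
    cw.word |= (uint16_t)parity13(cw.word);         /* make total parity even */
    return cw;
}

/* Repairs cw in place when possible and extracts the data bits. */
ecc_status_t ecc_decode8(ecc_word_t *cw, uint8_t *data_out) {
    unsigned s = syndrome12(cw->word);
    unsigned odd = parity13(cw->word);
    ecc_status_t st = ECC_CLEAN;
    if (odd && s <= 12) {                           /* s == 0: parity bit itself */
        cw->word ^= (uint16_t)(1u << s);
        st = ECC_CORRECTED;
    } else if (odd || s != 0) {
        st = ECC_UNCORRECTABLE;                     /* two flips, or out-of-range syndrome */
    }
    uint8_t data = 0;
    unsigned d = 0;
    for (unsigned pos = 1; pos <= 12; pos++) {
        if (is_pow2(pos)) continue;
        data |= (uint8_t)(((cw->word >> pos) & 1u) << d);
        d++;
    }
    *data_out = data;
    return st;
}
```

With this layout the SEU injector should target only bits 0..12 of each word, since flips in the three unused bits are invisible to the decoder.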

4.4 Algorithm Overview

Key Algorithm: Scrub loop

Inject SEU events.
Scan memory words.
Decode, correct, re-encode.
Log statistics.
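
The loop above also needs to decide when a scrub pass is due. A tick-based scheduler is one simple way to do that; the sketch below assumes simulation time is an integer tick counter, and `scrub_sched_t` and its functions are hypothetical names.

```c
#include <stdint.h>

typedef struct {
    uint64_t interval_ticks;   /* ticks between scrub passes */
    uint64_t next_due;         /* tick at which the next pass should run */
} scrub_sched_t;

void scrub_sched_init(scrub_sched_t *s, uint64_t interval_ticks, uint64_t now) {
    s->interval_ticks = interval_ticks ? interval_ticks : 1;  /* avoid a zero period */
    s->next_due = now + s->interval_ticks;
}

/* Returns 1 if a scrub pass should run at tick `now`, and schedules the next.
 * The deadline advances from the previous deadline, not from `now`, so a
 * late pass does not shift the cadence; slots missed entirely are skipped. */
int scrub_sched_due(scrub_sched_t *s, uint64_t now) {
    if (now < s->next_due) return 0;
    s->next_due += s->interval_ticks;
    if (s->next_due <= now) s->next_due = now + s->interval_ticks;
    return 1;
}
```

In the main loop, each tick would inject SEUs first and then call `scrub_sched_due`; when it returns 1, run one scan over memory and log the statistics for that interval.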

Complexity Analysis:

Time: O(memory) per scrub cycle.
Space: O(memory).

5. Implementation Guide

5.1 Development Environment Setup

cc -O2 -Wall -Wextra -o scrub_sim src/*.c

5.2 Project Structure

project-root/
+-- src/
|   +-- ecc.c
|   +-- scrub.c
|   +-- main.c
+-- tests/
+-- README.md

5.3 The Core Question You’re Answering

“How do you keep memory correct when bits flip randomly?”

5.4 Concepts You Must Understand First

SEU models.
Hamming codes and syndromes.
Scheduling and scrub intervals.

5.5 Questions to Guide Your Design

What scrub interval keeps double-bit errors rare?
How do you detect uncorrectable errors?
How do you log corrections for ops use?

5.6 Thinking Exercise

Estimate how scrub interval affects probability of two flips in one word.

5.7 The Interview Questions They’ll Ask

“What is memory scrubbing?”
“Why use SECDED instead of plain Hamming?”
“How do you choose scrub frequency?”

5.8 Hints in Layers

Hint 1: Start with Hamming(7,4) for a small word.

Hint 2: Add SECDED for double-bit detection.

Hint 3: Implement a loop that scans memory each interval.

Hint 4: Add SEU injection to test correction.

5.9 Books That Will Help

Topic	Book	Chapter
Coding theory	Lin & Costello	Hamming codes
Reliability	NASA GSFC-HDBK-8007	SEU mitigation
Embedded systems	Elecia White, Making Embedded Systems	Reliability

5.10 Implementation Phases

Phase 1: ECC Encoder/Decoder (3-4 days)

Goals: correct single-bit errors. Tasks: implement Hamming code; test vectors. Checkpoint: all single-bit errors corrected.

Phase 2: SEU Injection (2-3 days)

Goals: simulate random flips. Tasks: RNG with fixed seed; flip bits. Checkpoint: logged flips appear at expected rate.

Phase 3: Scrubber + Stats (3-4 days)

Goals: periodic scanning and logging. Tasks: implement scrub loop and stats output. Checkpoint: corrected vs uncorrectable stats match golden log.

5.11 Key Implementation Decisions

Decision	Options	Recommendation	Rationale
ECC scheme	Hamming / SECDED	SECDED	Detects double-bit errors
Scrub interval	fixed / adaptive	fixed	Deterministic baseline
Injection	uniform / per-word	uniform	Simple and testable

6. Testing Strategy

6.1 Test Categories

Category	Purpose	Examples
Unit Tests	ECC correctness	all 4-bit inputs
Integration Tests	Scrub loop	golden_scrub.log
Edge Case Tests	Double-bit errors	forced flips

6.2 Critical Test Cases

Single-bit: corrected and logged.
Double-bit: detected and logged as uncorrectable.
Scrub interval: shorter interval reduces uncorrectable count.

6.3 Test Data

seed=42, rate=1e-5

7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

Pitfall	Symptom	Solution
Wrong parity mapping	False corrections	Verify with test vectors
No seed control	Non-repeatable results	Fix RNG seed
Scrub too slow	Many uncorrectables	Reduce interval

7.2 Debugging Strategies

Use deterministic seeds and small memory sizes.
Log syndromes and corrected bit positions.

7.3 Performance Traps

Scrubbing large memory too frequently can dominate CPU time; measure and adjust.

8. Extensions & Challenges

8.1 Beginner Extensions

Add CSV export of error stats.

8.2 Intermediate Extensions

Adaptive scrub interval based on SEU rate.
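
As a starting point for this extension, one possible policy halves the interval when a pass corrects many errors and doubles it when a pass finds none. Everything in this sketch is an assumption chosen for illustration: the policy itself, the `high_water` threshold, the clamping bounds, and the function name.

```c
#include <stdint.h>

/* Hypothetical policy: returns the interval to use before the next pass.
 *   corrected_last_pass > high_water -> scrub twice as often
 *   corrected_last_pass == 0         -> scrub half as often
 *   otherwise                        -> keep the current interval
 * The result is always clamped to [min_iv, max_iv]. */
uint32_t adapt_interval(uint32_t interval, uint32_t corrected_last_pass,
                        uint32_t high_water, uint32_t min_iv, uint32_t max_iv) {
    if (corrected_last_pass > high_water) {
        interval /= 2;
    } else if (corrected_last_pass == 0) {
        /* compare before doubling so the multiplication cannot overflow */
        interval = (interval > max_iv / 2) ? max_iv : interval * 2;
    }
    if (interval < min_iv) interval = min_iv;
    if (interval > max_iv) interval = max_iv;
    return interval;
}
```

Because the new interval depends only on logged counts, runs with a fixed seed stay repeatable even though the interval changes over time.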

8.3 Advanced Extensions

Simulate radiation storm events with burst error rates.

9. Real-World Connections

9.1 Industry Applications

Onboard EDAC in spacecraft computers.
Reliability analysis for memory subsystem design.

9.2 Related Open Source Projects

cFS memory scrubbing utilities.
RTEMS EDAC examples.

9.3 Interview Relevance

Demonstrates reliability engineering and error correction knowledge.

10. Resources

10.1 Essential Reading

Lin & Costello, Error Control Coding.
NASA GSFC-HDBK-8007 (fault management).

10.2 Video Resources

Reliability engineering lectures.

10.3 Tools & Documentation

Standard C bitwise operations.

10.4 Related Projects In This Series

P11-fdir-watchdog-the-dead-mans-switch.md

11. Self-Assessment Checklist

11.1 Understanding

I can explain Hamming codes and syndromes.
I can compute SEU rates for a given memory size.
I can justify scrub interval choices.

11.2 Implementation

Single-bit errors corrected.
Double-bit errors detected and logged.
Deterministic output with fixed seed.

11.3 Growth

I can design an adaptive scrub strategy.

12. Submission / Completion Criteria

Minimum Viable Completion:

Hamming code implementation with correction.
Basic SEU injection and scrub loop.

Full Completion:

SECDED detection and detailed stats logging.

Excellence (Going Above & Beyond):

Adaptive scrub intervals based on observed error rate.