Project 8: Memory Scrubbing Simulator (Defeating Radiation)
Build a memory scrubbing simulator that injects SEU bit flips, detects them with Hamming codes, corrects single-bit errors, and logs uncorrectable faults.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Level 3: Advanced |
| Time Estimate | 1-2 weeks |
| Main Programming Language | C |
| Alternative Programming Languages | Python |
| Coolness Level | Level 3: The Bit Guardian |
| Business Potential | Level 1: Reliability Tooling |
| Prerequisites | Bitwise ops, basic coding theory, scheduling |
| Key Topics | SEU modeling, Hamming codes, scrub scheduling |
1. Learning Objectives
By completing this project, you will:
- Model single-event upsets (SEUs) as random bit flips in memory.
- Implement Hamming-based error detection and correction (EDAC).
- Design a scrub scheduler that scans memory periodically.
- Log corrected vs uncorrectable errors deterministically.
- Understand how radiation impacts flight software reliability.
2. All Theory Needed (Per-Concept Breakdown)
Radiation-Induced SEUs and Memory Error Models
Fundamentals
In orbit, radiation can flip memory bits. These single-event upsets (SEUs) occur randomly and can corrupt data structures, state machines, and critical code. Flight software cannot assume memory is perfect. To maintain correctness, you must detect and correct errors before they accumulate. A memory scrubbing task periodically scans memory for corrupted bits and fixes them. Understanding SEU behavior and modeling is essential for designing scrub policies.
Deep Dive into the concept
Radiation in space includes high-energy protons, electrons, and heavy ions. When these particles strike semiconductor devices, they can deposit enough charge to flip a memory bit. The result is a single-event upset, typically a random bit flip in SRAM or DRAM. The rate of SEUs depends on orbit, shielding, and solar activity. Low Earth orbit experiences lower rates than higher orbits, but SEUs still occur. During solar storms, SEU rates can increase dramatically.
For simulation, you can model SEUs as Poisson events: each bit has a small probability of flipping per time unit. A simple model is to inject a random bit flip at a configurable rate (e.g., 1e-5 per bit per second). You can implement this by selecting a random bit index and flipping it. The important part is determinism: use a fixed random seed so your tests are repeatable. The simulator should allow adjusting the SEU rate to explore worst-case scenarios.
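A minimal sketch of seeded injection, assuming memory is a plain byte array. The names `rng_next`, `rng_uniform`, and `inject_seus` are hypothetical, and the xorshift32 generator is swapped in for `rand()` only so that the sequence is identical on every platform. Each time step is split into short slots with at most one flip per slot, which approximates a Poisson process as long as the per-slot probability is small.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: xorshift32 PRNG. The state must be non-zero. */
static uint32_t rng_next(uint32_t *state) {
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

/* Uniform double in (0, 1). */
static double rng_uniform(uint32_t *state) {
    return rng_next(state) / 4294967296.0;
}

/* Inject SEUs for one time step of dt_s seconds; returns the number of flips.
 * Assumption: flips are independent and uniformly spread over all bits. */
static unsigned inject_seus(uint8_t *mem, size_t mem_bytes,
                            double rate_per_bit_s, double dt_s,
                            uint32_t *state) {
    double mem_bits = (double)mem_bytes * 8.0;
    double expected = rate_per_bit_s * mem_bits * dt_s; /* mean flips this step */
    unsigned slots = (unsigned)(expected * 100.0) + 1u; /* per-slot probability < 0.01 */
    double p_slot = expected / (double)slots;
    unsigned flips = 0;

    for (unsigned i = 0; i < slots; i++) {
        if (rng_uniform(state) < p_slot) {
            size_t bit = (size_t)(rng_uniform(state) * mem_bits);
            mem[bit / 8] ^= (uint8_t)(1u << (bit % 8));
            flips++;
        }
    }
    return flips;
}
```

Calling `inject_seus(mem, sizeof mem, 1e-5, 1.0, &rng_state)` once per simulated second, with `rng_state` initialized from the seed, replays the same flips on every run.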
SEUs can be single-bit or multi-bit. Single-bit flips are correctable with Hamming codes; multi-bit flips are not. Your simulator should allow injecting double-bit flips to test uncorrectable detection. Over time, multiple single-bit errors can accumulate in the same word if scrubbing is too infrequent. This is why scrub scheduling matters: you want to scrub frequently enough that the probability of multiple errors in one word is low.
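The accumulation argument can be made quantitative. As a sketch, assume flips arrive independently at a constant per-bit rate, so the number of flips in one word during one scrub interval is Poisson distributed. The symbols here are introduced for illustration: $n$ is the codeword width in bits, $r$ the flip rate per bit per second, $T$ the scrub interval in seconds, and $\lambda$ the mean number of flips per word per interval.

```latex
\begin{align*}
\lambda &= n \, r \, T \\
P(\text{at least 2 flips in one word}) &= 1 - e^{-\lambda}\,(1 + \lambda)
  \;\approx\; \frac{\lambda^{2}}{2} \qquad (\lambda \ll 1)
\end{align*}
```

For a 16-bit codeword at $r = 10^{-5}$ per bit per second, $T = 1$ s gives $\lambda = 1.6 \times 10^{-4}$ and a probability of about $1.3 \times 10^{-8}$ per word per interval, while $T = 60$ s gives $\lambda = 9.6 \times 10^{-3}$ and about $4.6 \times 10^{-5}$. The risk grows roughly with the square of the interval. The estimate is slightly pessimistic because two flips of the same bit cancel out.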
SEU modeling also connects to fault management. If you detect uncorrectable errors, you may need to reboot or enter safe mode. In this project, you will log these events and continue running. In real systems, uncorrectable errors can trigger more severe responses. Therefore, your simulator should classify errors and produce statistics that can inform policy decisions.
How this fits into the project
This concept drives Section 3.2 (SEU injection) and the debugging guidance in Section 7. It connects to fault management in P11.
Definitions & key terms
- SEU -> Single-event upset, random bit flip.
- SCRUB -> Periodic scan of memory to detect/correct errors.
- Poisson process -> Random event model with constant average rate.
Mental model diagram (ASCII)
Radiation -> Bit Flip -> EDAC -> Correct or Log Fault
How it works (step-by-step, with invariants and failure modes)
- Randomly select a bit and flip it.
- Read encoded word and compute syndrome.
- If single-bit error, correct and log.
- If multi-bit, log uncorrectable error.
Invariants: parity bits are recomputed whenever a word is written; the scrub interval is fixed.
Failure modes: scrub too slow, double-bit errors undetected, nondeterministic tests.
Minimal concrete example
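/* Seed once with srand(seed) for repeatable runs. rand() only covers 0..RAND_MAX, which may be as small as 32767. */
uint32_t bit = (uint32_t)rand() % mem_bits;   /* pick a bit index in [0, mem_bits) */
mem[bit/8] ^= (uint8_t)(1u << (bit % 8));     /* flip that bit in the byte array */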
Common misconceptions
- “SEUs are rare enough to ignore” -> Even a few can corrupt mission state.
- “EDAC fixes everything” -> Double-bit errors remain uncorrectable.
Check-your-understanding questions
- Why is scrub interval important?
- What happens if two bits flip in the same word?
- How does SEU rate affect design?
Check-your-understanding answers
- Longer intervals increase chance of multi-bit errors.
- With SECDED the error is detected but cannot be corrected; a plain Hamming decoder would miscorrect it by flipping a third bit.
- Higher rates require more frequent scrubbing.
Real-world applications
- Onboard memory protection in spacecraft computers.
- Radiation-hardened system qualification.
Where you’ll apply it
- See Section 3.2 (SEU injection) and Section 6.2 (fault tests).
- Also used in: P11-fdir-watchdog-the-dead-mans-switch.md
References
- NASA GSFC-HDBK-8007 (fault management)
- Wertz, Spacecraft Reliability Engineering (SEU models)
Key insights
SEU risk is statistical; scrubbing is a scheduling problem.
Summary
Modeling SEUs gives you a quantitative way to design scrub policies.
Homework/Exercises to practice the concept
- Compute expected SEU count in 1 hour for a 1 Mbit memory at 1e-6 flips/bit/hour.
Solutions to the homework/exercises
- Expected flips = 1e6 * 1e-6 = 1 flip per hour.
Hamming Codes and Error Detection/Correction
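Fundamentals
Hamming codes add parity bits to data so you can detect and correct errors. A Hamming(7,4) code, for example, stores 4 data bits and 3 parity bits. Each parity bit is computed over a specific set of bit positions. When you read the word back, you compute a syndrome that tells you which bit (if any) flipped. A plain Hamming code corrects any single-bit error, but it cannot tell a double-bit error from a single-bit one: the syndrome is non-zero and points at a third bit, so a naive decoder miscorrects. Adding one overall parity bit (SECDED) lets the decoder detect, but not correct, double-bit errors. This combination is the core of EDAC in many space systems.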
Deep Dive into the concept
Hamming codes are linear block codes. You arrange data and parity bits into a word such that each parity bit covers a specific set of bit positions. When you read a word, you recompute parity and compare to stored parity. The XOR of these yields a syndrome. The syndrome is an index of the bit that is incorrect; if the syndrome is zero, the word is clean. This allows single-bit correction by flipping the indicated bit.
For example, in Hamming(7,4), positions 1,2,4 are parity bits, and positions 3,5,6,7 are data bits. Parity bit 1 covers positions with LSB=1, parity bit 2 covers positions with the second bit set, parity bit 4 covers positions with the third bit set. This systematic structure means you can compute the syndrome by checking which parity constraints failed. The syndrome value directly maps to the erroneous bit index.
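The layout above can be sketched in C as follows. This is one possible reference implementation, not the only valid bit packing: it assumes bit i of a `uint8_t` holds Hamming position i (bit 0 unused), and the function names are hypothetical. The self-test covers all 16 data values and all 7 single-bit flips.

```c
#include <stdint.h>

/* Assumed layout: bit i of the codeword holds Hamming position i (1..7). */
static unsigned bit_at(uint8_t w, unsigned pos) { return (w >> pos) & 1u; }

/* Recompute the three parity checks. The result is the 1-based position
 * of a single flipped bit, or 0 if every check passes. */
static unsigned hamming74_syndrome(uint8_t cw) {
    unsigned s1 = bit_at(cw, 1) ^ bit_at(cw, 3) ^ bit_at(cw, 5) ^ bit_at(cw, 7);
    unsigned s2 = bit_at(cw, 2) ^ bit_at(cw, 3) ^ bit_at(cw, 6) ^ bit_at(cw, 7);
    unsigned s4 = bit_at(cw, 4) ^ bit_at(cw, 5) ^ bit_at(cw, 6) ^ bit_at(cw, 7);
    return s1 | (s2 << 1) | (s4 << 2);
}

/* Encode the low 4 bits of data into positions 3, 5, 6, 7. */
static uint8_t hamming74_encode(uint8_t data) {
    uint8_t cw = (uint8_t)(((data & 1u) << 3) | ((data & 0xEu) << 4));
    /* With parity bits still zero, each failed check is exactly the
     * value the matching parity bit (positions 1, 2, 4) must take. */
    unsigned s = hamming74_syndrome(cw);
    cw |= (uint8_t)(((s & 1u) << 1) | ((s & 2u) << 1) | ((s & 4u) << 2));
    return cw;
}

/* Correct at most one flipped bit and return the 4 data bits. */
static uint8_t hamming74_decode(uint8_t cw, unsigned *syndrome_out) {
    unsigned s = hamming74_syndrome(cw);
    if (s) cw ^= (uint8_t)(1u << s);
    if (syndrome_out) *syndrome_out = s;
    return (uint8_t)(((cw >> 3) & 1u) | ((cw >> 4) & 0xEu));
}

/* Exhaustive check: returns 1 if every single-bit error is corrected. */
static int hamming74_self_test(void) {
    for (unsigned d = 0; d < 16; d++) {
        uint8_t cw = hamming74_encode((uint8_t)d);
        unsigned s;
        if (hamming74_decode(cw, &s) != d || s != 0) return 0;
        for (unsigned pos = 1; pos <= 7; pos++) {
            uint8_t bad = (uint8_t)(cw ^ (1u << pos));
            if (hamming74_decode(bad, &s) != d || s != pos) return 0;
        }
    }
    return 1;
}
```

With this packing, the data value 0xB encodes to 0xAA. Note that this plain decoder will miscorrect a double-bit error, since it has no way to distinguish it from a single-bit one.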
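To detect double-bit errors, you can add an overall parity bit covering the whole codeword (SECDED: single-error correct, double-error detect). A single flipped bit changes the overall parity, so a non-zero syndrome together with an overall parity mismatch indicates a correctable single-bit error. Two flipped bits leave the overall parity unchanged, so a non-zero syndrome with matching overall parity indicates a double-bit error, which is detected but cannot be corrected. A zero syndrome with an overall parity mismatch means the overall parity bit itself flipped. This matters in space systems because double-bit errors are rare but possible, especially if the scrub interval is too long. In this project, you can implement SECDED for each word to detect uncorrectable errors and log them.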
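A sketch of an (8,4) SECDED decoder, assuming the Hamming(7,4) positions 1..7 live in bits 1..7 of a byte and the overall parity bit lives in bit 0. The type `ecc_status_t` and the function names are hypothetical. The classification rests on one fact: an odd number of flips changes the overall parity, an even number does not.

```c
#include <stdint.h>

typedef enum { ECC_CLEAN, ECC_CORRECTED, ECC_UNCORRECTABLE } ecc_status_t;

/* XOR of all bits in a byte (1 if an odd number of bits are set). */
static unsigned parity8(uint8_t x) {
    x ^= x >> 4;
    x ^= x >> 2;
    x ^= x >> 1;
    return x & 1u;
}

/* Masks select positions with index bit 0, 1, or 2 set (bit 0 excluded). */
static unsigned secded_syndrome(uint8_t cw) {
    return parity8(cw & 0xAA) | (parity8(cw & 0xCC) << 1)
         | (parity8(cw & 0xF0) << 2);
}

static uint8_t secded84_encode(uint8_t data) {
    uint8_t cw = (uint8_t)(((data & 1u) << 3) | ((data & 0xEu) << 4));
    unsigned s = secded_syndrome(cw);
    cw |= (uint8_t)(((s & 1u) << 1) | ((s & 2u) << 1) | ((s & 4u) << 2));
    cw |= (uint8_t)parity8(cw);   /* bit 0 makes the parity of all 8 bits even */
    return cw;
}

/* Repairs *cw in place when possible; writes data only if it is trusted. */
static ecc_status_t secded84_decode(uint8_t *cw, uint8_t *data) {
    unsigned s = secded_syndrome(*cw);
    unsigned overall = parity8(*cw);   /* 1 = odd number of flips */
    ecc_status_t status = ECC_CLEAN;

    if (overall) {
        /* Single-bit error at position s; s == 0 is the overall parity bit. */
        *cw ^= (uint8_t)(1u << s);
        status = ECC_CORRECTED;
    } else if (s) {
        /* Even number of flips with a failed check: double-bit error. */
        return ECC_UNCORRECTABLE;
    }
    *data = (uint8_t)(((*cw >> 3) & 1u) | ((*cw >> 4) & 0xEu));
    return status;
}
```

Three or more flips in one word can still be misclassified; SECDED only guarantees correct handling of up to two.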
Implementing Hamming codes in software requires careful bit manipulation. You must pack and unpack parity bits correctly. It is easy to make off-by-one errors or misplace parity bits. The best practice is to write a reference encoder/decoder and test it with known vectors. You should also include an exhaustive test for small code sizes (e.g., all 16 possible 4-bit inputs) to verify correctness.
When integrating with the scrubber, you decode each word, compute the syndrome, correct if needed, and re-encode if you corrected a bit. This ensures the stored word remains consistent. The scrubber then logs correction counts and uncorrectable counts. These statistics can be used to adjust scrub frequency.
How this fits into the project
This concept drives the EDAC requirements in Section 3.2 and the unit tests in Section 6.2, and informs fault response in P11.
Definitions & key terms
- Syndrome -> Bit pattern indicating error location.
- SECDED -> Single-error correct, double-error detect.
- Parity bit -> Bit used to enforce XOR parity across data bits.
Mental model diagram (ASCII)
Data Bits -> Parity Bits -> Encoded Word -> Syndrome -> Correct/Detect
How it works (step-by-step, with invariants and failure modes)
- Encode data bits with parity.
- Read word and recompute parity.
- Compute syndrome.
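- If syndrome != 0 and overall parity mismatch -> single-bit error; correct.
- If syndrome == 0 and overall parity mismatch -> overall parity bit flipped; correct.
- If syndrome != 0 and overall parity ok -> double-bit error; uncorrectable.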
Invariants: encoding/decoding consistent; parity mapping fixed.
Failure modes: wrong parity mapping, false corrections, failure to detect double-bit errors.
Minimal concrete example
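/* s1, s2, s4: recomputed parity checks over positions 1, 2, 4 (1 = check failed) */
uint8_t syndrome = s1 | (s2<<1) | (s4<<2);
if (syndrome) flip_bit(word, syndrome);   /* syndrome is the 1-based position of the bad bit */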
Common misconceptions
- “Hamming codes fix any error” -> They only fix single-bit errors.
- “Double-bit errors are impossible” -> They are rare but real.
Check-your-understanding questions
- How does the syndrome locate the wrong bit?
- Why add an overall parity bit?
- What happens if parity bits themselves flip?
Check-your-understanding answers
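- Each parity check covers the positions whose index has one particular bit set, so the set of failed checks spells out the binary index of the flipped bit.
- To tell double-bit errors apart from single-bit errors (SECDED).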
- The syndrome still points to the flipped parity bit.
Real-world applications
- Onboard memory protection in spacecraft CPUs.
- ECC in server memory and SSDs.
Where you’ll apply it
- See Section 3.2 and Section 6.2 tests.
- Also used in: P11-fdir-watchdog-the-dead-mans-switch.md
References
- Lin & Costello, Error Control Coding
- Peterson & Weldon, Error-Correcting Codes
Key insights
Hamming codes are cheap to implement and handle the common case, a single flipped bit per word.
Summary
EDAC turns random bit flips into recoverable events.
Homework/Exercises to practice the concept
- Implement Hamming(7,4) and verify all single-bit errors are corrected.
Solutions to the homework/exercises
- For each of the 16 data values, flip each of the 7 codeword bits in turn and verify the decoded data matches the original.
3. Project Specification
3.1 What You Will Build
A simulator that models a memory array with ECC, injects SEUs at a controlled rate, scrubs memory at a fixed interval, and reports correction statistics.
3.2 Functional Requirements
- SEU injection: random single- and double-bit flips with fixed seed.
- ECC encoding/decoding: Hamming or SECDED for each word.
- Scrub scheduler: periodic scan with configurable interval.
- Logging: corrected vs uncorrectable errors, per interval statistics.
3.3 Non-Functional Requirements
- Determinism: fixed RNG seed for repeatable runs.
- Performance: scrub 1 MB memory at 1 Hz in simulation.
- Reliability: no silent corruption; errors always logged.
3.4 Example Usage / Output
$ ./scrub_sim --memory 1024 --rate 1e-5 --seed 42
[SEU] Bit flip at address 0x1A4
[FIX] Corrected single-bit error
[STATS] corrected=12 uncorrectable=0
3.5 Data Formats / Schemas / Protocols
Stats JSON:
{"t":60,"corrected":12,"uncorrectable":1}
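One way to emit that record from C, as a sketch. The struct `scrub_stats_t` and the function `format_stats` are hypothetical names; only the three field names come from the example above.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical counters for one reporting interval. */
typedef struct {
    unsigned long t;              /* simulation time in seconds */
    unsigned long corrected;      /* single-bit errors repaired */
    unsigned long uncorrectable;  /* errors detected but not repaired */
} scrub_stats_t;

/* Write one JSON object into buf. Returns the length snprintf reports,
 * which is >= n if the buffer was too small. */
static int format_stats(char *buf, size_t n, const scrub_stats_t *s) {
    return snprintf(buf, n,
                    "{\"t\":%lu,\"corrected\":%lu,\"uncorrectable\":%lu}",
                    s->t, s->corrected, s->uncorrectable);
}
```

Printing one such object per line keeps the log easy to diff against a golden file.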
3.6 Edge Cases
- Back-to-back flips in same word.
- Scrub interval too long.
- Memory size not divisible by ECC word size.
3.7 Real World Outcome
A deterministic simulation showing error detection, correction, and fault statistics.
3.7.1 How to Run (Copy/Paste)
./scrub_sim --memory 1024 --rate 1e-5 --seed 42 --interval 1
3.7.2 Golden Path Demo (Deterministic)
- Use seed 42 and rate 1e-5.
- Compare logs to golden_scrub.log.
3.7.3 Failure Demo (Deterministic)
./scrub_sim --memory 1024 --rate 1e-3 --interval 60 --seed 42
Expected: multiple uncorrectable errors; exit code 3.
3.7.4 If CLI: Exact Terminal Transcript
$ ./scrub_sim --memory 1024 --rate 1e-5 --seed 42
[STATS] corrected=12 uncorrectable=0
ExitCode=0
4. Solution Architecture
4.1 High-Level Design
Memory -> ECC Encode -> SEU Injection -> Scrubber -> Logs
4.2 Key Components
| Component | Responsibility | Key Decisions |
|---|---|---|
| SEU Injector | Random flips | Poisson rate |
| ECC Encoder | Add parity bits | Hamming/SECDED |
| Scrubber | Periodic scan | Interval tuning |
| Logger | Stats and events | Structured logs |
4.3 Data Structures (No Full Code)
typedef struct { uint16_t word; } ecc_word_t;
4.4 Algorithm Overview
Key Algorithm: Scrub loop
- Inject SEU events.
- Scan memory words.
- Decode, correct, re-encode.
- Log statistics.
Complexity Analysis:
- Time: O(memory) per scrub cycle.
- Space: O(memory).
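The scan, correct, and count steps can be sketched for the 16-bit `ecc_word_t` from Section 4.3. This assumes a Hamming(15,11) code in bits 1..15 plus an overall parity bit in bit 0; that choice, `scrub_stats_t`, and all function names are illustrative assumptions rather than a required design. The syndrome is computed as the XOR of the positions of all set bits, which is zero for a valid codeword.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct { uint16_t word; } ecc_word_t;   /* as in Section 4.3 */
typedef struct { unsigned long corrected, uncorrectable; } scrub_stats_t;

/* XOR of the indices (1..15) of all set bits. */
static unsigned syndrome16(uint16_t w) {
    unsigned s = 0;
    for (unsigned pos = 1; pos < 16; pos++)
        if ((w >> pos) & 1u) s ^= pos;
    return s;
}

/* 1 if an odd number of the 16 bits are set. */
static unsigned parity16(uint16_t w) {
    unsigned p = 0;
    for (; w; w >>= 1) p ^= w & 1u;
    return p;
}

/* Encode the low 11 bits of data: data in non-power-of-two positions,
 * parity at positions 1, 2, 4, 8, overall parity in bit 0. */
static ecc_word_t ecc_encode(uint16_t data) {
    ecc_word_t out;
    uint16_t w = 0;
    unsigned next = 0;
    for (unsigned pos = 3; pos < 16; pos++) {
        if ((pos & (pos - 1)) == 0) continue;   /* skip parity positions 4, 8 */
        if ((data >> next++) & 1u) w |= (uint16_t)(1u << pos);
    }
    unsigned s = syndrome16(w);
    for (unsigned b = 0; b < 4; b++)
        if ((s >> b) & 1u) w |= (uint16_t)(1u << (1u << b));
    w |= (uint16_t)parity16(w);
    out.word = w;
    return out;
}

/* One scrub pass: repair single-bit errors in place, count the rest. */
static void scrub_pass(ecc_word_t *mem, size_t n_words, scrub_stats_t *stats) {
    for (size_t i = 0; i < n_words; i++) {
        unsigned s = syndrome16(mem[i].word);
        unsigned overall = parity16(mem[i].word);
        if (overall) {
            /* Odd flip count: treat as single-bit; s == 0 means bit 0. */
            mem[i].word ^= (uint16_t)(1u << s);
            stats->corrected++;
        } else if (s) {
            stats->uncorrectable++;   /* word is left as found */
        }
    }
}
```

Flipping the indicated bit restores a valid codeword, so it has the same effect as decoding and re-encoding. An uncorrectable word is left in place here and will be counted again on the next pass; whether to rewrite, quarantine, or count it once is a policy choice for the simulator.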
5. Implementation Guide
5.1 Development Environment Setup
cc -O2 -Wall -Wextra -o scrub_sim src/*.c
5.2 Project Structure
project-root/
+-- src/
| +-- ecc.c
| +-- scrub.c
| +-- main.c
+-- tests/
+-- README.md
5.3 The Core Question You’re Answering
“How do you keep memory correct when bits flip randomly?”
5.4 Concepts You Must Understand First
- SEU models.
- Hamming codes and syndromes.
- Scheduling and scrub intervals.
5.5 Questions to Guide Your Design
- What scrub interval keeps double-bit errors rare?
- How do you detect uncorrectable errors?
- How do you log corrections for ops use?
5.6 Thinking Exercise
Estimate how scrub interval affects probability of two flips in one word.
5.7 The Interview Questions They’ll Ask
- “What is memory scrubbing?”
- “Why use SECDED instead of plain Hamming?”
- “How do you choose scrub frequency?”
5.8 Hints in Layers
Hint 1: Start with Hamming(7,4) for a small word.
Hint 2: Add SECDED for double-bit detection.
Hint 3: Implement a loop that scans memory each interval.
Hint 4: Add SEU injection to test correction.
5.9 Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Coding theory | Lin & Costello | Hamming codes |
| Reliability | NASA GSFC-HDBK-8007 | SEU mitigation |
| Embedded systems | Elecia White, Making Embedded Systems | Reliability |
5.10 Implementation Phases
Phase 1: ECC Encoder/Decoder (3-4 days)
Goals: correct single-bit errors. Tasks: implement Hamming code; test vectors. Checkpoint: all single-bit errors corrected.
Phase 2: SEU Injection (2-3 days)
Goals: simulate random flips. Tasks: RNG with fixed seed; flip bits. Checkpoint: logged flips appear at expected rate.
Phase 3: Scrubber + Stats (3-4 days)
Goals: periodic scanning and logging. Tasks: implement scrub loop and stats output. Checkpoint: corrected vs uncorrectable stats match golden log.
5.11 Key Implementation Decisions
| Decision | Options | Recommendation | Rationale |
|---|---|---|---|
| ECC scheme | Hamming / SECDED | SECDED | Detects double-bit errors |
| Scrub interval | fixed / adaptive | fixed | Deterministic baseline |
| Injection | uniform / per-word | uniform | Simple and testable |
6. Testing Strategy
6.1 Test Categories
| Category | Purpose | Examples |
|---|---|---|
| Unit Tests | ECC correctness | all 4-bit inputs |
| Integration Tests | Scrub loop | golden_scrub.log |
| Edge Case Tests | Double-bit errors | forced flips |
6.2 Critical Test Cases
- Single-bit: corrected and logged.
- Double-bit: detected and logged as uncorrectable.
- Scrub interval: shorter interval reduces uncorrectable count.
6.3 Test Data
seed=42, rate=1e-5
7. Common Pitfalls & Debugging
7.1 Frequent Mistakes
| Pitfall | Symptom | Solution |
|---|---|---|
| Wrong parity mapping | False corrections | Verify with test vectors |
| No seed control | Non-repeatable results | Fix RNG seed |
| Scrub too slow | Many uncorrectables | Reduce interval |
7.2 Debugging Strategies
- Use deterministic seeds and small memory sizes.
- Log syndromes and corrected bit positions.
7.3 Performance Traps
Scrubbing large memory too frequently can dominate CPU time; measure and adjust.
8. Extensions & Challenges
8.1 Beginner Extensions
- Add CSV export of error stats.
8.2 Intermediate Extensions
- Adaptive scrub interval based on SEU rate.
8.3 Advanced Extensions
- Simulate radiation storm events with burst error rates.
9. Real-World Connections
9.1 Industry Applications
- Onboard EDAC in spacecraft computers.
- Reliability analysis for memory subsystem design.
9.2 Related Open Source Projects
- cFS memory scrubbing utilities.
- RTEMS EDAC examples.
9.3 Interview Relevance
- Demonstrates reliability engineering and error correction knowledge.
10. Resources
10.1 Essential Reading
- Lin & Costello, Error Control Coding.
- NASA GSFC-HDBK-8007 (fault management).
10.2 Video Resources
- Reliability engineering lectures.
10.3 Tools & Documentation
- Standard C bitwise operations.
10.4 Related Projects in This Series
11. Self-Assessment Checklist
11.1 Understanding
- I can explain Hamming codes and syndromes.
- I can compute SEU rates for a given memory size.
- I can justify scrub interval choices.
11.2 Implementation
- Single-bit errors corrected.
- Double-bit errors detected and logged.
- Deterministic output with fixed seed.
11.3 Growth
- I can design an adaptive scrub strategy.
12. Submission / Completion Criteria
Minimum Viable Completion:
- Hamming code implementation with correction.
- Basic SEU injection and scrub loop.
Full Completion:
- SECDED detection and detailed stats logging.
Excellence (Going Above & Beyond):
- Adaptive scrub intervals based on observed error rate.