Project 13: Fault-Tolerant Driver System
Build a driver supervisor that detects crashes, restarts drivers, and replays requests.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Expert |
| Time Estimate | 3-4 weeks |
| Language | C (Alternatives: Rust) |
| Prerequisites | IPC, user-space drivers, process supervision |
| Key Topics | Fault detection, restart, state recovery |
1. Learning Objectives
By completing this project, you will:
- Detect driver crashes and restarts.
- Preserve state and replay in-flight requests.
- Build a supervision protocol for user-space drivers.
- Explain how microkernels achieve high reliability.
2. Theoretical Foundation
2.1 Core Concepts
- Reincarnation Server: MINIX 3’s recovery mechanism.
- Supervision Trees: Process monitoring and restart policies.
- State Checkpointing: Persist state across restarts.
- Request Replay: Reissue operations after failure.
2.2 Why This Matters
Driver bugs are the #1 cause of kernel crashes in monolithic systems. Microkernels isolate drivers so they can restart without rebooting.
2.3 Historical Context / Background
MINIX 3’s RS pioneered practical OS self-healing. QNX and seL4 systems rely on similar supervision strategies.
2.4 Common Misconceptions
- “Restarting is enough.” Without state recovery, you lose in-flight work.
- “Checkpointing is expensive.” It can be minimal if you checkpoint only metadata.
3. Project Specification
3.1 What You Will Build
A supervisor process that launches drivers, monitors them, restarts on crash, and replays queued requests. You will demonstrate recovery during a live workload.
3.2 Functional Requirements
- Driver registry: Track running drivers and endpoints.
- Crash detection: Use waitpid or heartbeats.
- Restart policy: Restart driver with backoff.
- Request queue: Save in-flight requests for replay.
- State restore: Optional checkpoint for stateful drivers.
3.3 Non-Functional Requirements
- Reliability: Recovery works under repeated crashes.
- Safety: Prevent restart storms.
- Transparency: Clients experience minimal disruption.
3.4 Example Usage / Output
$ ./supervisor
[SUP] starting storage_driver
[SUP] starting net_driver
3.5 Real World Outcome
$ ./supervisor &
[SUP] Launching storage_driver (pid=222)
$ dd if=/dev/sda of=/dev/null bs=1M count=10
10 MB read successfully
$ kill -SEGV 222
[SUP] storage_driver crashed (SIGSEGV)
[SUP] restarting storage_driver
[SUP] replaying 2 queued requests
4. Solution Architecture
4.1 High-Level Design
┌──────────────┐ requests ┌──────────────┐
│ Clients │ ─────────▶ │ Supervisor │
└──────────────┘ ◀───────── │ + queue │
└────┬───────┘
│
▼
┌──────────────┐
│ Driver │
└──────────────┘
4.2 Key Components
| Component | Responsibility | Key Decisions |
|---|---|---|
| Supervisor | Launch + restart | backoff policy |
| Request queue | Save in-flight ops | FIFO with IDs |
| Checkpoint | Store state | incremental vs full |
| Driver API | IPC contract | versioned protocol |
4.3 Data Structures
typedef struct {
uint64_t req_id;
message_t msg;
} pending_req_t;
typedef struct {
pid_t pid;
endpoint_t ep;
int restart_count;
} driver_t;
4.4 Algorithm Overview
Key Algorithm: Crash Recovery
- Detect crash (waitpid or heartbeat timeout).
- Restart driver process.
- Restore checkpointed state.
- Replay queued requests in order.
Complexity Analysis:
- Time: O(N) for replay of N pending requests
- Space: O(N) pending queue size
5. Implementation Guide
5.1 Development Environment Setup
cc -O2 -g -o supervisor *.c
5.2 Project Structure
supervisor/
├── src/
│ ├── supervisor.c
│ ├── queue.c
│ ├── checkpoint.c
│ └── main.c
└── tests/
└── test_recovery.c
5.3 The Core Question You’re Answering
“How can the OS keep running when a driver crashes?”
5.4 Concepts You Must Understand First
Stop and research these before coding:
- Process supervision
- IPC request semantics
- Checkpointing approaches
- At-least-once vs exactly-once semantics
5.5 Questions to Guide Your Design
- How will you identify in-flight requests?
- How will you prevent duplicate execution on replay?
- What is the restart backoff strategy?
5.6 Thinking Exercise
Design a Replay Protocol
If a driver crashes after partially completing a request, how will you avoid corrupting state on replay?
5.7 The Interview Questions They’ll Ask
- “How does MINIX recover from driver crashes?”
- “What is the difference between at-least-once and exactly-once?”
5.8 Hints in Layers
Hint 1: Add request IDs IDs help detect duplicates.
Hint 2: Persist checkpoints Store minimal state to resume.
Hint 3: Test with fault injection Crash driver in the middle of operations.
5.9 Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Fault tolerance | MINIX papers | RS sections |
| Supervision trees | Erlang docs | Supervision |
5.10 Implementation Phases
Phase 1: Foundation (1 week)
Goals:
- Supervisor launches drivers
Tasks:
- Fork/exec drivers.
- Detect crash with waitpid.
Checkpoint: Restart happens after crash.
Phase 2: Core Functionality (1-2 weeks)
Goals:
- Request queue and replay
Tasks:
- Store pending requests.
- Replay after restart.
Checkpoint: Requests complete after restart.
Phase 3: Polish & Edge Cases (1 week)
Goals:
- Checkpointing and backoff
Tasks:
- Add checkpoint protocol.
- Implement restart backoff.
Checkpoint: Repeated crashes handled gracefully.
5.11 Key Implementation Decisions
| Decision | Options | Recommendation | Rationale |
|---|---|---|---|
| Recovery semantics | at-least-once, exactly-once | at-least-once | Simpler to implement |
| Checkpoint frequency | periodic, on-demand | periodic | Predictable recovery |
| Restart policy | immediate, backoff | backoff | Avoid loops |
6. Testing Strategy
6.1 Test Categories
| Category | Purpose | Examples |
|---|---|---|
| Fault Tests | Crash recovery | SIGSEGV driver |
| Replay Tests | Request queue | 5 pending ops |
| Stress Tests | Repeated crashes | 10 restarts |
6.2 Critical Test Cases
- Crash during request triggers replay.
- Restart loop is throttled.
- Checkpoint restore restores state.
6.3 Test Data
Requests: read, write
Crash points: before reply, after reply
7. Common Pitfalls & Debugging
7.1 Frequent Mistakes
| Pitfall | Symptom | Solution |
|---|---|---|
| Duplicate replay | Data corruption | Track request IDs |
| No backoff | CPU spike | Exponential delay |
| Lost state | Inconsistent results | Checkpoint minimal state |
7.2 Debugging Strategies
- Log request lifecycle with IDs.
- Add a debug command to dump pending queue.
7.3 Performance Traps
Checkpointing too often can hurt throughput. Balance frequency with recovery needs.
8. Extensions & Challenges
8.1 Beginner Extensions
- Add a CLI to show driver status.
- Add per-driver restart limits.
8.2 Intermediate Extensions
- Persist checkpoints to disk.
- Add health checks instead of crash-only detection.
8.3 Advanced Extensions
- Exactly-once semantics with idempotent ops.
- Distributed supervision for multiple nodes.
9. Real-World Connections
9.1 Industry Applications
- Automotive: driver isolation for safety.
- Industrial control: recovery without reboot.
9.2 Related Open Source Projects
- MINIX 3: https://www.minix3.org/
- Erlang/OTP: supervision patterns.
9.3 Interview Relevance
Fault tolerance is a senior-level systems topic.
10. Resources
10.1 Essential Reading
- MINIX 3 reliability papers - RS design.
- Erlang supervision docs - restart policies.
10.2 Video Resources
- Reliability engineering talks.
10.3 Tools & Documentation
- gdb: debug crashes
- valgrind: detect memory bugs
10.4 Related Projects in This Series
- Project 3: User-space drivers.
- Project 7: MINIX exploration.
11. Self-Assessment Checklist
11.1 Understanding
- I can explain how RS restarts drivers.
- I can describe replay semantics.
11.2 Implementation
- Supervisor restarts crashed drivers.
- Pending requests are replayed.
11.3 Growth
- I can explain tradeoffs of checkpointing.
12. Submission / Completion Criteria
Minimum Viable Completion:
- Supervisor restarts drivers on crash.
- Basic request replay works.
Full Completion:
- Checkpoint restore implemented.
- Restart backoff works.
Excellence (Going Above & Beyond):
- Exactly-once semantics.
- Comprehensive reliability report.
This guide was generated from LEARN_MICROKERNELS.md. For the complete learning path, see the parent directory.