Project 13: Fault-Tolerant Driver System

Build a driver supervisor that detects crashes, restarts drivers, and replays requests.

Quick Reference

Attribute | Value
Difficulty | Expert
Time Estimate | 3-4 weeks
Language | C (Alternatives: Rust)
Prerequisites | IPC, user-space drivers, process supervision
Key Topics | Fault detection, restart, state recovery

1. Learning Objectives

By completing this project, you will:

  1. Detect driver crashes and restart failed drivers automatically.
  2. Preserve state and replay in-flight requests.
  3. Build a supervision protocol for user-space drivers.
  4. Explain how microkernels achieve high reliability.

2. Theoretical Foundation

2.1 Core Concepts

  • Reincarnation Server (RS): the MINIX 3 service that detects and restarts failed drivers and servers.
  • Supervision Trees: Process monitoring and restart policies.
  • State Checkpointing: Persist state across restarts.
  • Request Replay: Reissue operations after failure.

2.2 Why This Matters

Driver bugs are the #1 cause of kernel crashes in monolithic systems. Microkernels isolate drivers so they can restart without rebooting.

2.3 Historical Context / Background

MINIX 3’s RS pioneered practical OS self-healing. QNX and seL4 systems rely on similar supervision strategies.

2.4 Common Misconceptions

  • “Restarting is enough.” Without state recovery, you lose in-flight work.
  • “Checkpointing is expensive.” It can be minimal if you checkpoint only metadata.

3. Project Specification

3.1 What You Will Build

A supervisor process that launches drivers, monitors them, restarts on crash, and replays queued requests. You will demonstrate recovery during a live workload.

3.2 Functional Requirements

  1. Driver registry: Track running drivers and endpoints.
  2. Crash detection: Use waitpid or heartbeats (a heartbeat sketch follows this list).
  3. Restart policy: Restart driver with backoff.
  4. Request queue: Save in-flight requests for replay.
  5. State restore: Optional checkpoint for stateful drivers.
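
Requirement 2 can be met with waitpid alone, which catches outright crashes; heartbeats also catch hangs. A minimal sketch of the timeout check, assuming each driver sends a HEARTBEAT message roughly once per second and the supervisor records the arrival time (the HEARTBEAT message and last_heartbeat bookkeeping are assumptions of this sketch, not a fixed API):

#include <time.h>

#define HEARTBEAT_TIMEOUT 3   /* seconds of silence before a driver is presumed hung */

/* Called periodically from the supervisor's main loop. last_heartbeat is the
 * time the supervisor last received a HEARTBEAT message from this driver. */
int driver_looks_dead(time_t last_heartbeat) {
    return time(NULL) - last_heartbeat > HEARTBEAT_TIMEOUT;
}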

3.3 Non-Functional Requirements

  • Reliability: Recovery works under repeated crashes.
  • Safety: Prevent restart storms.
  • Transparency: Clients experience minimal disruption.

3.4 Example Usage / Output

$ ./supervisor
[SUP] starting storage_driver
[SUP] starting net_driver

3.5 Real World Outcome

$ ./supervisor &
[SUP] Launching storage_driver (pid=222)

$ dd if=/dev/sda of=/dev/null bs=1M count=10
10 MB read successfully

$ kill -SEGV 222
[SUP] storage_driver crashed (SIGSEGV)
[SUP] restarting storage_driver
[SUP] replaying 2 queued requests

4. Solution Architecture

4.1 High-Level Design

┌──────────────┐   requests   ┌──────────────┐
│   Clients    │ ───────────▶ │  Supervisor  │
└──────────────┘ ◀─────────── │   + queue    │
                              └──────┬───────┘
                                     │
                                     ▼
                              ┌──────────────┐
                              │    Driver    │
                              └──────────────┘

4.2 Key Components

Component | Responsibility | Key Decisions
Supervisor | Launch + restart | backoff policy
Request queue | Save in-flight ops | FIFO with IDs
Checkpoint | Store state | incremental vs full
Driver API | IPC contract | versioned protocol

4.3 Data Structures

#include <stdint.h>
#include <sys/types.h>

/* One in-flight request, kept until the driver confirms completion. */
typedef struct {
    uint64_t req_id;   /* monotonically increasing ID, used to detect duplicates on replay */
    message_t msg;     /* the original IPC message, reissued verbatim on replay */
} pending_req_t;

/* Supervisor-side record for each managed driver. */
typedef struct {
    pid_t pid;          /* process ID, used for waitpid() and signals */
    endpoint_t ep;      /* IPC endpoint that clients use to reach the driver */
    int restart_count;  /* feeds the backoff policy and restart limits */
} driver_t;

4.4 Algorithm Overview

Key Algorithm: Crash Recovery

  1. Detect crash (waitpid or heartbeat timeout).
  2. Restart driver process.
  3. Restore checkpointed state.
  4. Replay queued requests in order.
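
A minimal sketch of this loop in C, assuming the driver_t type from Section 4.3 and hypothetical helpers spawn_driver(), restore_checkpoint(), replay_pending(), and backoff_seconds() that stand in for code written in later phases:

#include <sys/wait.h>
#include <unistd.h>

/* Supervision loop for one driver; every helper below is a hypothetical placeholder. */
void supervise(driver_t *drv) {
    for (;;) {
        int status;
        if (waitpid(drv->pid, &status, 0) < 0)         /* 1. detect the crash */
            break;
        sleep(backoff_seconds(drv->restart_count++));  /* throttle restart storms */
        drv->pid = spawn_driver(drv);                  /* 2. restart the driver process */
        restore_checkpoint(drv);                       /* 3. push saved state back in */
        replay_pending(drv);                           /* 4. reissue queued requests in order */
    }
}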

Complexity Analysis:

  • Time: O(N) for replay of N pending requests
  • Space: O(N) pending queue size

5. Implementation Guide

5.1 Development Environment Setup

cc -O2 -g -o supervisor src/*.c

5.2 Project Structure

supervisor/
├── src/
│   ├── supervisor.c
│   ├── queue.c
│   ├── checkpoint.c
│   └── main.c
└── tests/
    └── test_recovery.c

5.3 The Core Question You’re Answering

“How can the OS keep running when a driver crashes?”

5.4 Concepts You Must Understand First

Stop and research these before coding:

  1. Process supervision
  2. IPC request semantics
  3. Checkpointing approaches
  4. At-least-once vs exactly-once semantics
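
Concept 4 is the crux of replay: at-least-once delivery means a request the driver already finished may be reissued after a restart, so replay must suppress duplicates. A minimal sketch, assuming the pending_req_t type from Section 4.3, IDs assigned in increasing order, and a hypothetical send_to_driver() helper:

/* Replay the saved queue, skipping anything the driver already completed.
 * last_completed is advanced whenever a reply is delivered back to a client. */
static uint64_t last_completed = 0;

void replay_queue(pending_req_t *queue, size_t count) {
    for (size_t i = 0; i < count; i++) {
        if (queue[i].req_id <= last_completed)
            continue;                       /* finished before the crash: do not redo it */
        send_to_driver(&queue[i].msg);      /* hypothetical IPC send to the restarted driver */
    }
}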

5.5 Questions to Guide Your Design

  1. How will you identify in-flight requests?
  2. How will you prevent duplicate execution on replay?
  3. What is the restart backoff strategy?

5.6 Thinking Exercise

Design a Replay Protocol

If a driver crashes after partially completing a request, how will you avoid corrupting state on replay?

5.7 The Interview Questions They’ll Ask

  1. “How does MINIX recover from driver crashes?”
  2. “What is the difference between at-least-once and exactly-once?”

5.8 Hints in Layers

Hint 1: Add request IDs. A monotonically increasing ID on each request lets you detect duplicates on replay.

Hint 2: Persist checkpoints. Store only the minimal state a driver needs to resume.

Hint 3: Test with fault injection. Crash the driver in the middle of operations, not just when idle.

5.9 Books That Will Help

Topic | Book | Chapter
Fault tolerance | MINIX papers | RS sections
Supervision trees | Erlang docs | Supervision

5.10 Implementation Phases

Phase 1: Foundation (1 week)

Goals:

  • Supervisor launches drivers

Tasks:

  1. Fork/exec drivers.
  2. Detect crash with waitpid.
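
A minimal sketch of both tasks together, assuming a single driver binary at a hypothetical path ./storage_driver:

#include <stdio.h>
#include <unistd.h>
#include <sys/wait.h>

/* Launch a driver binary and return its PID. */
pid_t launch(const char *path) {
    pid_t pid = fork();
    if (pid == 0) {
        execl(path, path, (char *)NULL);   /* child: become the driver */
        _exit(127);                        /* exec failed */
    }
    return pid;
}

int main(void) {
    pid_t pid = launch("./storage_driver");
    printf("[SUP] starting storage_driver (pid=%d)\n", pid);
    for (;;) {
        int status;
        if (waitpid(pid, &status, 0) < 0)             /* blocks until the driver exits */
            break;
        if (WIFSIGNALED(status))
            printf("[SUP] storage_driver crashed (signal %d)\n", WTERMSIG(status));
        printf("[SUP] restarting storage_driver\n");
        pid = launch("./storage_driver");
    }
    return 0;
}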

Checkpoint: Restart happens after crash.

Phase 2: Core Functionality (1-2 weeks)

Goals:

  • Request queue and replay

Tasks:

  1. Store pending requests.
  2. Replay after restart.
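
A minimal sketch of the pending queue, assuming the pending_req_t type from Section 4.3; a request is added when it is forwarded to the driver and removed once the driver's reply has been delivered, so whatever remains at crash time is exactly what needs replaying:

#include <stdint.h>
#include <string.h>

#define MAX_PENDING 64

/* Fixed-size FIFO of in-flight requests; a real queue would grow dynamically. */
static pending_req_t pending[MAX_PENDING];
static size_t pending_count = 0;

/* Record a request before forwarding it to the driver. */
int pending_add(const pending_req_t *req) {
    if (pending_count == MAX_PENDING)
        return -1;                          /* queue full: apply backpressure to clients */
    pending[pending_count++] = *req;
    return 0;
}

/* Drop a request once its reply has reached the client. */
void pending_complete(uint64_t req_id) {
    for (size_t i = 0; i < pending_count; i++) {
        if (pending[i].req_id == req_id) {
            memmove(&pending[i], &pending[i + 1],
                    (pending_count - i - 1) * sizeof(pending[0]));
            pending_count--;
            return;
        }
    }
}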

Checkpoint: Requests complete after restart.

Phase 3: Polish & Edge Cases (1 week)

Goals:

  • Checkpointing and backoff

Tasks:

  1. Add checkpoint protocol.
  2. Implement restart backoff.
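
Minimal sketches for both tasks: an atomic checkpoint write (temp file plus rename, so a crash mid-write never leaves a torn checkpoint) and an exponential restart backoff with a cap. The checkpoint path and state layout are left to your design:

#include <stdio.h>

/* Task 1: write the checkpoint to a temp file, then rename it over the old one
 * so readers always see either the previous or the new checkpoint, never a mix. */
int checkpoint_save(const char *path, const void *state, size_t len) {
    char tmp[256];
    snprintf(tmp, sizeof(tmp), "%s.tmp", path);
    FILE *f = fopen(tmp, "wb");
    if (!f)
        return -1;
    if (fwrite(state, 1, len, f) != len) {
        fclose(f);
        return -1;
    }
    fclose(f);
    return rename(tmp, path);               /* atomic replace on POSIX filesystems */
}

/* Task 2: delay before the Nth restart: 1s, 2s, 4s, ... capped at 60s. */
unsigned backoff_seconds(int restart_count) {
    unsigned delay = 1u << (restart_count < 6 ? restart_count : 6);
    return delay > 60 ? 60 : delay;
}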

Checkpoint: Repeated crashes handled gracefully.

5.11 Key Implementation Decisions

Decision | Options | Recommendation | Rationale
Recovery semantics | at-least-once, exactly-once | at-least-once | Simpler to implement
Checkpoint frequency | periodic, on-demand | periodic | Predictable recovery
Restart policy | immediate, backoff | backoff | Avoid loops

6. Testing Strategy

6.1 Test Categories

Category | Purpose | Examples
Fault Tests | Crash recovery | SIGSEGV driver
Replay Tests | Request queue | 5 pending ops
Stress Tests | Repeated crashes | 10 restarts

6.2 Critical Test Cases

  1. Crash during request triggers replay (see the sketch after this list).
  2. Restart loop is throttled by backoff.
  3. Checkpoint restore brings back the driver's saved state.
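
A sketch of test case 1, assuming hypothetical helpers client_read_async(), client_wait_reply(), and driver_pid() that your own test harness would provide; the point is to crash the driver while a request is in flight and assert that replay still completes it:

#include <assert.h>
#include <signal.h>
#include <stdint.h>

/* Crash-during-request test. All client_* and driver_pid() helpers are
 * hypothetical placeholders for this project's test harness. */
void test_crash_during_request(void) {
    uint64_t id = client_read_async("/dev/sda", 0, 4096);  /* leave a request in flight */
    kill(driver_pid("storage_driver"), SIGSEGV);            /* inject the fault */
    assert(client_wait_reply(id) == 0);                     /* replay must still finish it */
}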

6.3 Test Data

Requests: read, write
Crash points: before reply, after reply

7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

Pitfall | Symptom | Solution
Duplicate replay | Data corruption | Track request IDs
No backoff | CPU spike | Exponential delay
Lost state | Inconsistent results | Checkpoint minimal state

7.2 Debugging Strategies

  • Log request lifecycle with IDs.
  • Add a debug command to dump pending queue.
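
A minimal sketch of that dump command, assuming the pending[] array and pending_count from the Phase 2 sketch above:

#include <stdio.h>

/* Print every queued request so a stuck or duplicated request is visible at a glance. */
void dump_pending(void) {
    fprintf(stderr, "[SUP] %zu pending request(s)\n", pending_count);
    for (size_t i = 0; i < pending_count; i++)
        fprintf(stderr, "  req_id=%llu\n", (unsigned long long)pending[i].req_id);
}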

7.3 Performance Traps

Checkpointing too often can hurt throughput. Balance frequency with recovery needs.


8. Extensions & Challenges

8.1 Beginner Extensions

  • Add a CLI to show driver status.
  • Add per-driver restart limits.

8.2 Intermediate Extensions

  • Persist checkpoints to disk.
  • Add health checks instead of crash-only detection.

8.3 Advanced Extensions

  • Exactly-once semantics with idempotent ops.
  • Distributed supervision for multiple nodes.

9. Real-World Connections

9.1 Industry Applications

  • Automotive: driver isolation for safety.
  • Industrial control: recovery without reboot.
  • MINIX 3: https://www.minix3.org/
  • Erlang/OTP: supervision patterns.

9.3 Interview Relevance

Fault tolerance is a senior-level systems topic.


10. Resources

10.1 Essential Reading

  • MINIX 3 reliability papers - RS design.
  • Erlang supervision docs - restart policies.

10.2 Video Resources

  • Reliability engineering talks.

10.3 Tools & Documentation

  • gdb: debug crashes
  • valgrind: detect memory bugs
  • Project 3: User-space drivers.
  • Project 7: MINIX exploration.

11. Self-Assessment Checklist

11.1 Understanding

  • I can explain how RS restarts drivers.
  • I can describe replay semantics.

11.2 Implementation

  • Supervisor restarts crashed drivers.
  • Pending requests are replayed.

11.3 Growth

  • I can explain tradeoffs of checkpointing.

12. Submission / Completion Criteria

Minimum Viable Completion:

  • Supervisor restarts drivers on crash.
  • Basic request replay works.

Full Completion:

  • Checkpoint restore implemented.
  • Restart backoff works.

Excellence (Going Above & Beyond):

  • Exactly-once semantics.
  • Comprehensive reliability report.

This guide was generated from LEARN_MICROKERNELS.md. For the complete learning path, see the parent directory.