Project 13: Fault-Tolerant Driver System

Build a driver supervisor that detects crashes, restarts drivers, and replays requests.

Quick Reference

Attribute | Value
Difficulty | Expert
Time Estimate | 3-4 weeks
Language | C (Alternatives: Rust)
Prerequisites | IPC, user-space drivers, process supervision
Key Topics | Fault detection, restart, state recovery

1. Learning Objectives

By completing this project, you will:

  1. Detect driver crashes and restart failed drivers automatically.
  2. Preserve state and replay in-flight requests.
  3. Build a supervision protocol for user-space drivers.
  4. Explain how microkernels achieve high reliability.

2. Theoretical Foundation

2.1 Core Concepts

  • Reincarnation Server (RS): the MINIX 3 service that detects and restarts failed drivers and servers.
  • Supervision Trees: Process monitoring and restart policies.
  • State Checkpointing: Persist state across restarts.
  • Request Replay: Reissue operations after failure.

2.2 Why This Matters

Driver bugs are the #1 cause of kernel crashes in monolithic systems. Microkernels isolate drivers so they can restart without rebooting.

2.3 Historical Context / Background

MINIX 3’s RS pioneered practical OS self-healing. QNX and seL4 systems rely on similar supervision strategies.

2.4 Common Misconceptions

  • “Restarting is enough.” Without state recovery, you lose in-flight work.
  • “Checkpointing is expensive.” It can be minimal if you checkpoint only metadata.

3. Project Specification

3.1 What You Will Build

A supervisor process that launches drivers, monitors them, restarts on crash, and replays queued requests. You will demonstrate recovery during a live workload.

3.2 Functional Requirements

  1. Driver registry: Track running drivers and endpoints.
  2. Crash detection: Use waitpid or heartbeats (a heartbeat sketch follows this list).
  3. Restart policy: Restart driver with backoff.
  4. Request queue: Save in-flight requests for replay.
  5. State restore: Optional checkpoint for stateful drivers.
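
Requirement 2 can be met with waitpid alone, which catches outright crashes; heartbeats also catch hangs. A minimal sketch of the timeout check, assuming each driver sends a HEARTBEAT message roughly once per second and the supervisor records the arrival time (the HEARTBEAT message and last_heartbeat bookkeeping are assumptions of this sketch, not a fixed API):

#include <time.h>

#define HEARTBEAT_TIMEOUT 3   /* seconds of silence before a driver is presumed hung */

/* Called periodically from the supervisor's main loop. last_heartbeat is the
 * time the supervisor last received a HEARTBEAT message from this driver. */
int driver_looks_dead(time_t last_heartbeat) {
    return time(NULL) - last_heartbeat > HEARTBEAT_TIMEOUT;
}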

3.3 Non-Functional Requirements

  • Reliability: Recovery works under repeated crashes.
  • Safety: Prevent restart storms.
  • Transparency: Clients experience minimal disruption.

3.4 Example Usage / Output

$ ./supervisor
[SUP] starting storage_driver
[SUP] starting net_driver

3.5 Real World Outcome

$ ./supervisor &
[SUP] Launching storage_driver (pid=222)

$ dd if=/dev/sda of=/dev/null bs=1M count=10
10 MB read successfully

$ kill -SEGV 222
[SUP] storage_driver crashed (SIGSEGV)
[SUP] restarting storage_driver
[SUP] replaying 2 queued requests

4. Solution Architecture

4.1 High-Level Design

┌──────────────┐   requests   ┌──────────────┐
│   Clients    │ ───────────▶ │  Supervisor  │
└──────────────┘ ◀─────────── │   + queue    │
                              └──────┬───────┘
                                     │
                                     ▼
                              ┌──────────────┐
                              │    Driver    │
                              └──────────────┘

4.2 Key Components

Component | Responsibility | Key Decisions
Supervisor | Launch + restart | backoff policy
Request queue | Save in-flight ops | FIFO with IDs
Checkpoint | Store state | incremental vs full
Driver API | IPC contract | versioned protocol

4.3 Data Structures

#include <stdint.h>
#include <sys/types.h>

/* One in-flight request, kept until the driver confirms completion. */
typedef struct {
    uint64_t req_id;   /* monotonically increasing ID, used to detect duplicates on replay */
    message_t msg;     /* the original IPC message, reissued verbatim on replay */
} pending_req_t;

/* Supervisor-side record for each managed driver. */
typedef struct {
    pid_t pid;          /* process ID, used for waitpid() and signals */
    endpoint_t ep;      /* IPC endpoint that clients use to reach the driver */
    int restart_count;  /* feeds the backoff policy and restart limits */
} driver_t;

4.4 Algorithm Overview

Key Algorithm: Crash Recovery

  1. Detect crash (waitpid or heartbeat timeout).
  2. Restart driver process.
  3. Restore checkpointed state.
  4. Replay queued requests in order.
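
A minimal sketch of this loop in C, assuming the driver_t type from Section 4.3 and hypothetical helpers spawn_driver(), restore_checkpoint(), replay_pending(), and backoff_seconds() that stand in for code written in later phases:

#include <sys/wait.h>
#include <unistd.h>

/* Supervision loop for one driver; every helper below is a hypothetical placeholder. */
void supervise(driver_t *drv) {
    for (;;) {
        int status;
        if (waitpid(drv->pid, &status, 0) < 0)         /* 1. detect the crash */
            break;
        sleep(backoff_seconds(drv->restart_count++));  /* throttle restart storms */
        drv->pid = spawn_driver(drv);                  /* 2. restart the driver process */
        restore_checkpoint(drv);                       /* 3. push saved state back in */
        replay_pending(drv);                           /* 4. reissue queued requests in order */
    }
}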

Complexity Analysis:

  • Time: O(N) for replay of N pending requests
  • Space: O(N) pending queue size

5. Implementation Guide

5.1 Development Environment Setup

cc -O2 -g -o supervisor src/*.c

5.2 Project Structure

supervisor/
├── src/
│   ├── supervisor.c
│   ├── queue.c
│   ├── checkpoint.c
│   └── main.c
└── tests/
    └── test_recovery.c

5.3 The Core Question You’re Answering

“How can the OS keep running when a driver crashes?”

5.4 Concepts You Must Understand First

Stop and research these before coding:

  1. Process supervision
  2. IPC request semantics
  3. Checkpointing approaches
  4. At-least-once vs exactly-once semantics
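
Concept 4 is the crux of replay: at-least-once delivery means a request the driver already finished may be reissued after a restart, so replay must suppress duplicates. A minimal sketch, assuming the pending_req_t type from Section 4.3, IDs assigned in increasing order, and a hypothetical send_to_driver() helper:

/* Replay the saved queue, skipping anything the driver already completed.
 * last_completed is advanced whenever a reply is delivered back to a client. */
static uint64_t last_completed = 0;

void replay_queue(pending_req_t *queue, size_t count) {
    for (size_t i = 0; i < count; i++) {
        if (queue[i].req_id <= last_completed)
            continue;                       /* finished before the crash: do not redo it */
        send_to_driver(&queue[i].msg);      /* hypothetical IPC send to the restarted driver */
    }
}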

5.5 Questions to Guide Your Design

  1. How will you identify in-flight requests?
  2. How will you prevent duplicate execution on replay?
  3. What is the restart backoff strategy?

5.6 Thinking Exercise

Design a Replay Protocol

If a driver crashes after partially completing a request, how will you avoid corrupting state on replay?

5.7 The Interview Questions They’ll Ask

  1. “How does MINIX recover from driver crashes?”
  2. “What is the difference between at-least-once and exactly-once?”

5.8 Hints in Layers

Hint 1: Add request IDs. A monotonically increasing ID on each request lets you detect duplicates on replay.

Hint 2: Persist checkpoints. Store only the minimal state a driver needs to resume.

Hint 3: Test with fault injection. Crash the driver in the middle of operations, not just when idle.

5.9 Books That Will Help

Topic | Book | Chapter
Fault tolerance | MINIX papers | RS sections
Supervision trees | Erlang docs | Supervision

5.10 Implementation Phases

Phase 1: Foundation (1 week)

Goals:

  • Supervisor launches drivers

Tasks:

  1. Fork/exec drivers.
  2. Detect crash with waitpid.
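
A minimal sketch of both tasks together, assuming a single driver binary at a hypothetical path ./storage_driver:

#include <stdio.h>
#include <unistd.h>
#include <sys/wait.h>

/* Launch a driver binary and return its PID. */
pid_t launch(const char *path) {
    pid_t pid = fork();
    if (pid == 0) {
        execl(path, path, (char *)NULL);   /* child: become the driver */
        _exit(127);                        /* exec failed */
    }
    return pid;
}

int main(void) {
    pid_t pid = launch("./storage_driver");
    printf("[SUP] starting storage_driver (pid=%d)\n", pid);
    for (;;) {
        int status;
        if (waitpid(pid, &status, 0) < 0)             /* blocks until the driver exits */
            break;
        if (WIFSIGNALED(status))
            printf("[SUP] storage_driver crashed (signal %d)\n", WTERMSIG(status));
        printf("[SUP] restarting storage_driver\n");
        pid = launch("./storage_driver");
    }
    return 0;
}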

Checkpoint: Restart happens after crash.

Phase 2: Core Functionality (1-2 weeks)

Goals:

  • Request queue and replay

Tasks:

  1. Store pending requests.
  2. Replay after restart.
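
A minimal sketch of the pending queue, assuming the pending_req_t type from Section 4.3; a request is added when it is forwarded to the driver and removed once the driver's reply has been delivered, so whatever remains at crash time is exactly what needs replaying:

#include <stdint.h>
#include <string.h>

#define MAX_PENDING 64

/* Fixed-size FIFO of in-flight requests; a real queue would grow dynamically. */
static pending_req_t pending[MAX_PENDING];
static size_t pending_count = 0;

/* Record a request before forwarding it to the driver. */
int pending_add(const pending_req_t *req) {
    if (pending_count == MAX_PENDING)
        return -1;                          /* queue full: apply backpressure to clients */
    pending[pending_count++] = *req;
    return 0;
}

/* Drop a request once its reply has reached the client. */
void pending_complete(uint64_t req_id) {
    for (size_t i = 0; i < pending_count; i++) {
        if (pending[i].req_id == req_id) {
            memmove(&pending[i], &pending[i + 1],
                    (pending_count - i - 1) * sizeof(pending[0]));
            pending_count--;
            return;
        }
    }
}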

Checkpoint: Requests complete after restart.

Phase 3: Polish & Edge Cases (1 week)

Goals:

  • Checkpointing and backoff

Tasks:

  1. Add checkpoint protocol.
  2. Implement restart backoff.
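
Minimal sketches for both tasks: an atomic checkpoint write (temp file plus rename, so a crash mid-write never leaves a torn checkpoint) and an exponential restart backoff with a cap. The checkpoint path and state layout are left to your design:

#include <stdio.h>

/* Task 1: write the checkpoint to a temp file, then rename it over the old one
 * so readers always see either the previous or the new checkpoint, never a mix. */
int checkpoint_save(const char *path, const void *state, size_t len) {
    char tmp[256];
    snprintf(tmp, sizeof(tmp), "%s.tmp", path);
    FILE *f = fopen(tmp, "wb");
    if (!f)
        return -1;
    if (fwrite(state, 1, len, f) != len) {
        fclose(f);
        return -1;
    }
    fclose(f);
    return rename(tmp, path);               /* atomic replace on POSIX filesystems */
}

/* Task 2: delay before the Nth restart: 1s, 2s, 4s, ... capped at 60s. */
unsigned backoff_seconds(int restart_count) {
    unsigned delay = 1u << (restart_count < 6 ? restart_count : 6);
    return delay > 60 ? 60 : delay;
}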

Checkpoint: Repeated crashes handled gracefully.

5.11 Key Implementation Decisions

Decision | Options | Recommendation | Rationale
Recovery semantics | at-least-once, exactly-once | at-least-once | Simpler to implement
Checkpoint frequency | periodic, on-demand | periodic | Predictable recovery
Restart policy | immediate, backoff | backoff | Avoid loops

6. Testing Strategy

6.1 Test Categories

Category | Purpose | Examples
Fault Tests | Crash recovery | SIGSEGV driver
Replay Tests | Request queue | 5 pending ops
Stress Tests | Repeated crashes | 10 restarts

6.2 Critical Test Cases

  1. Crash during request triggers replay (see the sketch after this list).
  2. Restart loop is throttled by backoff.
  3. Checkpoint restore brings back the driver's saved state.
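
A sketch of test case 1, assuming hypothetical helpers client_read_async(), client_wait_reply(), and driver_pid() that your own test harness would provide; the point is to crash the driver while a request is in flight and assert that replay still completes it:

#include <assert.h>
#include <signal.h>
#include <stdint.h>

/* Crash-during-request test. All client_* and driver_pid() helpers are
 * hypothetical placeholders for this project's test harness. */
void test_crash_during_request(void) {
    uint64_t id = client_read_async("/dev/sda", 0, 4096);  /* leave a request in flight */
    kill(driver_pid("storage_driver"), SIGSEGV);            /* inject the fault */
    assert(client_wait_reply(id) == 0);                     /* replay must still finish it */
}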

6.3 Test Data

Requests: read, write
Crash points: before reply, after reply

7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

Pitfall | Symptom | Solution
Duplicate replay | Data corruption | Track request IDs
No backoff | CPU spike | Exponential delay
Lost state | Inconsistent results | Checkpoint minimal state

7.2 Debugging Strategies

  • Log request lifecycle with IDs.
  • Add a debug command to dump pending queue.
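
A minimal sketch of that dump command, assuming the pending[] array and pending_count from the Phase 2 sketch above:

#include <stdio.h>

/* Print every queued request so a stuck or duplicated request is visible at a glance. */
void dump_pending(void) {
    fprintf(stderr, "[SUP] %zu pending request(s)\n", pending_count);
    for (size_t i = 0; i < pending_count; i++)
        fprintf(stderr, "  req_id=%llu\n", (unsigned long long)pending[i].req_id);
}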

7.3 Performance Traps

Checkpointing too often can hurt throughput. Balance frequency with recovery needs.


8. Extensions & Challenges

8.1 Beginner Extensions

  • Add a CLI to show driver status.
  • Add per-driver restart limits.

8.2 Intermediate Extensions

  • Persist checkpoints to disk.
  • Add health checks instead of crash-only detection.

8.3 Advanced Extensions

  • Exactly-once semantics with idempotent ops.
  • Distributed supervision for multiple nodes.

9. Real-World Connections

9.1 Industry Applications

  • Automotive: driver isolation for safety.
  • Industrial control: recovery without reboot.
  • MINIX 3: https://www.minix3.org/
  • Erlang/OTP: supervision patterns.

9.3 Interview Relevance

Fault tolerance is a senior-level systems topic.


10. Resources

10.1 Essential Reading

  • MINIX 3 reliability papers - RS design.
  • Erlang supervision docs - restart policies.

10.2 Video Resources

  • Reliability engineering talks.

10.3 Tools & Documentation

  • gdb: debug crashes
  • valgrind: detect memory bugs
  • Project 3: User-space drivers.
  • Project 7: MINIX exploration.

11. Self-Assessment Checklist

11.1 Understanding

  • I can explain how RS restarts drivers.
  • I can describe replay semantics.

11.2 Implementation

  • Supervisor restarts crashed drivers.
  • Pending requests are replayed.

11.3 Growth

  • I can explain tradeoffs of checkpointing.

12. Submission / Completion Criteria

Minimum Viable Completion:

  • Supervisor restarts drivers on crash.
  • Basic request replay works.

Full Completion:

  • Checkpoint restore implemented.
  • Restart backoff works.

Excellence (Going Above & Beyond):

  • Exactly-once semantics.
  • Comprehensive reliability report.

This guide was generated from LEARN_MICROKERNELS.md. For the complete learning path, see the parent directory.