Project 3: Process Supervisor with Signal Forwarding

Build a supervisor that spawns, monitors, and gracefully restarts child processes without zombies.

Quick Reference

Attribute        Value
Difficulty       Intermediate-Advanced
Time Estimate    2-3 weeks
Language         C (Alternatives: Rust)
Prerequisites    fork/exec basics, signal handling, waitpid()
Key Topics       SIGCHLD, process groups, graceful shutdown

1. Learning Objectives

By completing this project, you will:

  1. Implement a robust supervisor using fork(), exec(), and waitpid().
  2. Handle signals safely with async-signal-safe patterns.
  3. Forward signals to child process groups.
  4. Avoid zombie processes and restart safely.

2. Theoretical Foundation

2.1 Core Concepts

  • Process lifecycle: child creation, exec replacement, and exit status inspection.
  • SIGCHLD handling: reliable reaping to prevent zombies.
  • Process groups: broadcasting signals to a service and its children.
  • Signal safety: only async-signal-safe calls inside handlers.

2.2 Why This Matters

Supervisors keep production services running. Incorrect signal handling leads to zombies, orphaned processes, or stuck shutdowns.

2.3 Historical Context / Background

Unix process control APIs have remained stable for decades. Most production init systems build on these primitives.

2.4 Common Misconceptions

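  • “Calling wait() once is enough.” Children can exit at any time, and several exits can collapse into a single SIGCHLD delivery, so you must call waitpid() in a loop until no exited children remain.
  • “Signal handlers can call printf.” printf() and most other libc functions are not async-signal-safe; calling them from a handler can deadlock or corrupt state when the signal interrupts the same function in the main program.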
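
The second point can be illustrated with a handler that does nothing but record the event. This is a minimal sketch, not a prescribed design: the names got_sigchld, on_sigchld, install_handler, and consume_sigchld are hypothetical, and it assumes a single-threaded supervisor.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>

/* Written by the handler, read by the main loop. */
static volatile sig_atomic_t got_sigchld = 0;

/* The handler only records the event: no printf(), no malloc(), no locks. */
static void on_sigchld(int signo)
{
    (void)signo;
    got_sigchld = 1;
}

/* Install `fn` for `signo`. Returns 0 on success, -1 on error (errno set). */
static int install_handler(int signo, void (*fn)(int))
{
    struct sigaction sa;

    assert(fn != NULL);
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = fn;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;   /* restart interrupted slow system calls */
    return sigaction(signo, &sa, NULL);
}

/* Main-loop side: clear the flag first, then let the caller do the work. */
static int consume_sigchld(void)
{
    if (!got_sigchld)
        return 0;
    got_sigchld = 0;
    return 1;   /* caller now reaps and logs, outside the handler */
}
```

Clearing the flag before reaping (rather than after) means a SIGCHLD that arrives during the reap pass sets the flag again and triggers another pass instead of being lost.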

3. Project Specification

3.1 What You Will Build

A small init-style supervisor that runs a command, restarts it on crash, and forwards termination signals to the entire process group.

3.2 Functional Requirements

  1. Spawn and monitor: Start a child process and track status.
  2. Restart policy: Restart on crash with backoff.
  3. Signal forwarding: Forward SIGTERM/SIGINT to child group.
  4. No zombies: Reap all children reliably.

3.3 Non-Functional Requirements

  • Reliability: Survive rapid crash loops without resource leaks.
  • Usability: CLI flags for restart backoff and max restarts.
  • Observability: Logs for state transitions.

3.4 Example Usage / Output

$ ./supervisor --restart=always --backoff=2 -- /usr/bin/myservice
[supervisor] started pid=2314
[supervisor] child exited status=1, restarting in 2s

3.5 Real World Outcome

You run the supervisor, kill the child, and watch it restart with clean signal handling. Example output:

$ ./supervisor --restart=always --backoff=1 -- ./worker
[supervisor] started pid=4021
[supervisor] SIGTERM received, forwarding to process group
[supervisor] child exited status=0, shutting down

4. Solution Architecture

4.1 High-Level Design

┌──────────────┐   ┌──────────────┐
│  Supervisor  │──▶│  Child Proc  │
└──────────────┘   └──────────────┘
        │                 ▲
        ▼                 │
   Signal Handler     waitpid loop

Process Supervisor Architecture

4.2 Key Components

Component        Responsibility    Key Decisions
Main Loop        spawn/restart     backoff policy
Signal Handler   set flags         self-pipe or atomic flag
Reaper           waitpid() loop    handle multiple exits

4.3 Data Structures

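#include <sys/types.h>   /* pid_t */

struct supervisor_state {
    pid_t child_pid;      /* PID of the currently running child */
    int restart_count;    /* restarts performed so far */
    int shutting_down;    /* nonzero once shutdown has been requested */
};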

4.4 Algorithm Overview

Key Algorithm: SIGCHLD reaping

  1. On SIGCHLD, set a flag.
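  2. In the main loop, call waitpid(-1, &status, WNOHANG) repeatedly until it returns 0 (children remain but none has exited) or -1 with errno set to ECHILD (no children left).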
  3. Restart if policy permits.

Complexity Analysis:

  • Time: O(k) per reap pass, where k is the number of exited children
  • Space: O(1)
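
Step 2 of the algorithm above can be sketched as follows. reap_children is a hypothetical name, and demo_reap exists only to exercise it; a real supervisor would run the loop when its SIGCHLD flag is set rather than polling with a sleep.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Reap every child that has already exited, without blocking.
 * Returns the number reaped, or -1 on an unexpected waitpid() error. */
static int reap_children(void)
{
    int reaped = 0;

    for (;;) {
        int status;
        pid_t pid = waitpid(-1, &status, WNOHANG);

        if (pid > 0) {          /* one child collected; look for more */
            reaped++;
            continue;
        }
        if (pid == 0)           /* children exist, none has exited yet */
            return reaped;
        if (errno == EINTR)     /* interrupted by a signal: retry */
            continue;
        if (errno == ECHILD)    /* no children left at all */
            return reaped;
        return -1;
    }
}

/* Demo helper (hypothetical): fork `n` short-lived children and poll
 * until all of them have been reaped. Returns the total reaped. */
static int demo_reap(int n)
{
    struct timespec nap = { 0, 1000000 };   /* 1 ms */
    int total = 0;

    assert(n >= 0);
    for (int i = 0; i < n; i++) {
        pid_t pid = fork();
        if (pid < 0)
            return -1;
        if (pid == 0)
            _exit(i);           /* child: exit immediately */
    }
    while (total < n) {
        int r = reap_children();
        if (r < 0)
            return -1;
        total += r;
        if (r == 0)
            nanosleep(&nap, NULL);
    }
    return total;
}
```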

5. Implementation Guide

5.1 Development Environment Setup

sudo apt-get install build-essential

5.2 Project Structure

supervisor/
├── src/
│   ├── main.c
│   ├── signals.c
│   └── spawn.c
├── tests/
│   └── test_restart.sh
├── Makefile
└── README.md

Supervisor Project Structure

5.3 The Core Question You’re Answering

“How do I keep a process alive without creating zombies or signal races?”

5.4 Concepts You Must Understand First

Stop and research these before coding:

  1. fork/exec/wait
    • What does exec() do to PID and memory?
    • How do you interpret exit status?
    • Book Reference: “APUE” Ch. 8-9
  2. Signals and safety
    • Which functions are async-signal-safe?
    • What is signal coalescing?
    • Book Reference: “TLPI” Ch. 20-21
  3. Process groups
    • Why does a daemon spawn children?
    • How do you signal a group?
    • Book Reference: “APUE” Ch. 9
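
For the exit status question, the value filled in by waitpid() is not the exit code itself; it has to be decoded with the W* macros. A minimal sketch, where the enum, decode_status, and the run_child demo helper are hypothetical names:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

enum child_fate { CHILD_EXITED, CHILD_SIGNALED, CHILD_OTHER };

/* Decode a waitpid() status. *value receives the exit code for
 * CHILD_EXITED or the terminating signal number for CHILD_SIGNALED. */
static enum child_fate decode_status(int status, int *value)
{
    assert(value != NULL);
    if (WIFEXITED(status)) {
        *value = WEXITSTATUS(status);
        return CHILD_EXITED;
    }
    if (WIFSIGNALED(status)) {
        *value = WTERMSIG(status);
        return CHILD_SIGNALED;
    }
    *value = status;            /* stopped/continued: not requested here */
    return CHILD_OTHER;
}

/* Demo helper (hypothetical): run a child that either calls _exit(code)
 * or, when sig != 0, kills itself with `sig`. Returns the raw status. */
static int run_child(int code, int sig)
{
    int status = 0;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        if (sig != 0)
            raise(sig);
        _exit(code);
    }
    while (waitpid(pid, &status, 0) < 0) {
        if (errno != EINTR)
            return -1;
    }
    return status;
}
```

A log line such as "child exited status=1" corresponds to CHILD_EXITED with a value of 1; a crash from SIGSEGV would instead report CHILD_SIGNALED.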

5.5 Questions to Guide Your Design

Before implementing, think through these:

  1. Where do you store state updated by signals?
  2. How do you avoid missing SIGCHLD?
  3. When should you stop restarting a crashing service?
  4. How do you forward signals to grandchildren?
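
One possible answer to question 2 is to block SIGCHLD before fork() and wait with sigsuspend(), which unblocks the signal and sleeps in one atomic step. This is a sketch of that pattern only, one option among several (the self-pipe trick is another); spawn_and_wait and child_event are hypothetical names.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static volatile sig_atomic_t child_event = 0;

static void on_sigchld(int signo) { (void)signo; child_event = 1; }

/* Fork a child that exits with `code` and wait for it with no window in
 * which SIGCHLD could arrive unnoticed. Returns the exit code, or -1. */
static int spawn_and_wait(int code)
{
    struct sigaction sa;
    sigset_t block, old, wait_mask;
    int status = 0;
    pid_t pid;

    assert(code >= 0 && code <= 255);
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sigchld;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGCHLD, &sa, NULL) < 0)
        return -1;

    /* 1. Block SIGCHLD before fork() so an early exit stays pending. */
    sigemptyset(&block);
    sigaddset(&block, SIGCHLD);
    if (sigprocmask(SIG_BLOCK, &block, &old) < 0)
        return -1;
    child_event = 0;
    wait_mask = old;
    sigdelset(&wait_mask, SIGCHLD);

    pid = fork();
    if (pid < 0) {
        sigprocmask(SIG_SETMASK, &old, NULL);
        return -1;
    }
    if (pid == 0) {
        sigprocmask(SIG_SETMASK, &old, NULL);   /* child: restore mask */
        _exit(code);
    }

    /* 2. sigsuspend() atomically installs wait_mask and sleeps, so the
     *    flag check and the wait cannot race with the signal. */
    while (!child_event)
        sigsuspend(&wait_mask);

    /* 3. Restore the original mask and reap the child. */
    sigprocmask(SIG_SETMASK, &old, NULL);
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```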

5.6 Thinking Exercise

Manual Process Tree

Draw a tree of:

  • supervisor
    • child
      • grandchild

Which PID receives SIGTERM if you send it to the process group?

5.7 The Interview Questions They’ll Ask

Prepare to answer these:

  1. “How do you avoid zombie processes?”
  2. “Why can’t you call printf() in a signal handler?”
  3. “What is a process group and when does it matter?”

5.8 Hints in Layers

Hint 1: Use sigaction(). Install handlers with sigaction() and SA_RESTART, and keep each handler minimal.

Hint 2: Self-pipe trick. Write a byte to a pipe in the handler to wake the main loop.

Hint 3: Backoff policy. Sleep briefly after each crash to avoid a tight restart loop.
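
Hint 2 can be sketched as below. The function and variable names are hypothetical, and the choice to send the signal number as the byte is an assumption made so the main loop can tell signals apart.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

static int wake_fd[2] = { -1, -1 };   /* [0] read end, [1] write end */

/* Handler: write the signal number as one byte. write() is
 * async-signal-safe; errno is saved so interrupted code is unaffected. */
static void on_signal(int signo)
{
    int saved = errno;
    unsigned char b = (unsigned char)signo;
    ssize_t r = write(wake_fd[1], &b, 1);   /* if the pipe is full, drop it */

    (void)r;
    errno = saved;
}

/* Make `fd` non-blocking and close-on-exec (children must not inherit it). */
static int set_flags(int fd)
{
    int fl = fcntl(fd, F_GETFL);

    if (fl < 0 || fcntl(fd, F_SETFL, fl | O_NONBLOCK) < 0)
        return -1;
    fl = fcntl(fd, F_GETFD);
    if (fl < 0 || fcntl(fd, F_SETFD, fl | FD_CLOEXEC) < 0)
        return -1;
    return 0;
}

/* Create the pipe and install the handler for `signo`. */
static int self_pipe_init(int signo)
{
    struct sigaction sa;

    if (pipe(wake_fd) < 0)
        return -1;
    if (set_flags(wake_fd[0]) < 0 || set_flags(wake_fd[1]) < 0)
        return -1;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_signal;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;
    return sigaction(signo, &sa, NULL);
}

/* Main-loop side: wait up to timeout_ms for a signal byte.
 * Returns the signal number, 0 on timeout, -1 on error. */
static int self_pipe_wait(int timeout_ms)
{
    struct pollfd pfd = { .fd = wake_fd[0], .events = POLLIN };
    unsigned char b;

    assert(wake_fd[0] >= 0);    /* self_pipe_init() must run first */
    for (;;) {
        int n = poll(&pfd, 1, timeout_ms);
        ssize_t r;

        if (n < 0 && errno == EINTR)
            continue;           /* interrupted by the handler: poll again */
        if (n <= 0)
            return n;
        r = read(wake_fd[0], &b, 1);
        if (r == 1)
            return b;
        if (r == 0 || (errno != EAGAIN && errno != EINTR))
            return -1;
    }
}
```

Both ends are non-blocking so that a burst of signals can never block the handler on a full pipe, and a spurious wakeup can never block the main loop on an empty one.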

5.9 Books That Will Help

Topic             Book     Chapter
Process control   “APUE”   Ch. 8-9
Signal handling   “TLPI”   Ch. 20-21
Process groups    “APUE”   Ch. 9

5.10 Implementation Phases

Phase 1: Foundation (3-4 days)

Goals:

  • Spawn child
  • Basic wait loop

Tasks:

  1. Implement fork() + exec().
  2. Capture child exit status.

Checkpoint: Supervisor restarts a crashing child.
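
A minimal sketch of the two Phase 1 tasks, assuming the command is passed as a NULL-terminated argv array. spawn_child and wait_for_exit are hypothetical names, and the exit code 127 for a failed exec follows the shell convention rather than anything this project mandates.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Start argv[0] as a child process. Returns the child's PID, or -1 if
 * fork() failed. If exec fails, the child exits with status 127. */
static pid_t spawn_child(char *const argv[])
{
    pid_t pid;

    assert(argv != NULL && argv[0] != NULL);
    pid = fork();
    if (pid == 0) {
        execvp(argv[0], argv);  /* returns only on failure */
        _exit(127);             /* _exit(): skip the parent's atexit/stdio state */
    }
    return pid;                 /* parent: child PID, or -1 on fork error */
}

/* Block until `pid` terminates. Returns its exit code, or -1 if it was
 * killed by a signal or waitpid() failed. */
static int wait_for_exit(pid_t pid)
{
    int status;

    while (waitpid(pid, &status, 0) < 0) {
        if (errno != EINTR)
            return -1;
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

For example, spawning { "sh", "-c", "exit 7", NULL } and passing the returned PID to wait_for_exit() yields 7.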

Phase 2: Core Functionality (5-7 days)

Goals:

  • Signal handling
  • Process group control

Tasks:

  1. Add sigaction() handlers.
  2. Forward SIGTERM/SIGINT to child group.

Checkpoint: Ctrl+C stops child and supervisor cleanly.
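
The process group part of Phase 2 might look like the sketch below: the child becomes the leader of a new group, and the supervisor signals the whole group with a negative PID. All names are hypothetical, and demo_child/demo_forward exist only to show that a grandchild is reached as well.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child that leads its own process group. Returns its PID (== PGID). */
static pid_t spawn_in_new_group(void (*child_main)(void))
{
    pid_t pid;

    assert(child_main != NULL);
    pid = fork();
    if (pid == 0) {
        setpgid(0, 0);          /* child side */
        child_main();
        _exit(0);
    }
    if (pid > 0)
        setpgid(pid, pid);      /* parent side too: no race on who runs first */
    return pid;
}

/* Forward `signo` to every process in the child's group. */
static int forward_signal(pid_t pgid, int signo)
{
    if (pgid <= 1) {            /* kill(-1, ...) would signal far too much */
        errno = EINVAL;
        return -1;
    }
    return kill(-pgid, signo);  /* negative PID addresses a process group */
}

static int ready_fd[2];

/* Demo child: fork a grandchild, then both sleep until signalled. */
static void demo_child(void)
{
    pid_t gc = fork();

    if (gc < 0)
        _exit(1);
    if (gc == 0 && write(ready_fd[1], "g", 1) != 1)
        _exit(1);               /* grandchild reports that it is running */
    for (;;)
        pause();
}

/* Returns 1 if SIGTERM sent to the group ended child and grandchild. */
static int demo_forward(void)
{
    char buf[2];
    int status;
    pid_t pid;

    if (pipe(ready_fd) < 0)
        return -1;
    pid = spawn_in_new_group(demo_child);
    if (pid < 0)
        return -1;
    close(ready_fd[1]);                         /* keep only the read end */
    if (read(ready_fd[0], buf, 1) != 1)         /* wait for the grandchild */
        return -1;
    if (forward_signal(pid, SIGTERM) < 0)
        return -1;
    while (read(ready_fd[0], buf, sizeof buf) > 0)
        ;                                       /* EOF: all writers are gone */
    close(ready_fd[0]);
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return WIFSIGNALED(status) && WTERMSIG(status) == SIGTERM;
}
```

Because the child sits in its own group, a terminal Ctrl+C reaches only the supervisor, which is why explicit forwarding is needed for the checkpoint above.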

Phase 3: Polish & Edge Cases (3-5 days)

Goals:

  • Backoff policy
  • Logging and metrics

Tasks:

  1. Add restart limits and exponential backoff.
  2. Log state transitions and timestamps.

Checkpoint: Stable under crash loops.
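
The exponential backoff from task 1 can be computed with a small pure function. This is a sketch under assumed semantics: the delay doubles per consecutive restart starting from the --backoff value and is capped; the cap parameter and the function name are hypothetical.

```c
#include <assert.h>

/* Delay in seconds before restart number `attempt` (0-based):
 * base * 2^attempt, capped at `cap`. */
static unsigned backoff_seconds(unsigned base, unsigned attempt, unsigned cap)
{
    unsigned delay = base;

    assert(base > 0 && cap >= base);
    while (attempt-- > 0) {
        if (delay > cap / 2)    /* doubling would exceed the cap or overflow */
            return cap;
        delay *= 2;
    }
    return delay;
}
```

With base 1 and cap 30 the delays run 1, 2, 4, 8, 16, 30, 30, and so on. Resetting the attempt counter after the child has stayed up for some time is a common refinement, left out here.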

5.11 Key Implementation Decisions

Decision          Options                    Recommendation        Rationale
Signal handling   simple flag vs self-pipe   self-pipe             avoids race with blocking wait
Restart policy    always vs capped           capped with backoff   prevents thrash
Process group     none vs setpgid()          use process group     signals reach grandchildren

6. Testing Strategy

6.1 Test Categories

Category            Purpose            Examples
Unit Tests          Signal helpers     handler sets flags
Integration Tests   Restart behavior   crash loop test
Edge Case Tests     Signal storms      repeated SIGCHLD

6.2 Critical Test Cases

  1. Crash loop: child exits quickly 10 times.
  2. Graceful shutdown: send SIGTERM and ensure child exits.
  3. Grandchild handling: child spawns grandchild, ensure it receives signals.

6.3 Test Data

./worker --exit-after 1

7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

Pitfall               Symptom                 Solution
Missing SIGCHLD       zombie processes        loop waitpid()
Signal handler work   crashes                 handler sets flags only
No process group      grandchildren survive   use setpgid()

7.2 Debugging Strategies

  • Use ps -o pid,ppid,stat to spot zombies; they show Z in the STAT column.
  • Log restart reasons with exit status.

7.3 Performance Traps

Tight restart loops can consume CPU. Add backoff and limits.


8. Extensions & Challenges

8.1 Beginner Extensions

  • Add --once to disable restarts.
  • Add JSON logging output.

8.2 Intermediate Extensions

  • Add a control socket for runtime commands.
  • Support configuration file loading.

8.3 Advanced Extensions

  • Implement health checks with heartbeat.
  • Add per-service dependency ordering.

9. Real-World Connections

9.1 Industry Applications

  • init systems: systemd, s6, runit.
  • process managers: supervisord, PM2.

9.2 Related Open Source Projects

  • runit: http://smarden.org/runit/ - Minimal supervisor
  • s6: https://skarnet.org/software/s6/ - Advanced process supervision

9.3 Interview Relevance

  • Demonstrates signal handling and process control depth.
  • Common debugging question: zombie processes.

10. Resources

10.1 Essential Reading

  • “Advanced Programming in the UNIX Environment” (APUE) by Stevens & Rago - Ch. 8-9
  • “The Linux Programming Interface” (TLPI) by Michael Kerrisk - Ch. 20-21

10.2 Video Resources

  • Unix signals deep dives - Conference talks
  • Process management debugging demos

10.3 Tools & Documentation

  • man 2 waitpid: Child reaping
  • man 2 sigaction: Signal handling

10.4 Related Projects in This Series

  • Project 1 teaches FD discipline.
  • Project 2 covers failure handling patterns.

11. Self-Assessment Checklist

11.1 Understanding

  • I can explain how SIGCHLD works.
  • I know why process groups matter.
  • I can describe async-signal-safe rules.

11.2 Implementation

  • No zombies appear under stress.
  • Signals are forwarded correctly.
  • Restart policy behaves as configured.

11.3 Growth

  • I documented a race condition I fixed.
  • I can explain this project in an interview.

12. Submission / Completion Criteria

Minimum Viable Completion:

  • Supervisor restarts a crashing child.
  • SIGTERM cleanly shuts down child and supervisor.

Full Completion:

  • Backoff and restart limits.
  • Logs for state transitions.

Excellence (Going Above & Beyond):

  • Runtime control socket and health checks.
  • Multi-service dependency management.

This guide was generated from SPRINT_5_SYSTEMS_INTEGRATION_PROJECTS.md. For the complete learning path, see the parent directory.