Project 3: Process Supervisor with Signal Forwarding
Build a supervisor that spawns, monitors, and gracefully restarts child processes without zombies.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Intermediate-Advanced |
| Time Estimate | 2-3 weeks |
| Language | C (Alternatives: Rust) |
| Prerequisites | fork/exec basics, signal handling, waitpid() |
| Key Topics | SIGCHLD, process groups, graceful shutdown |
1. Learning Objectives
By completing this project, you will:
- Implement a robust supervisor using fork(), exec(), and waitpid().
- Handle signals safely with async-signal-safe patterns.
- Forward signals to child process groups.
- Avoid zombie processes and restart safely.
2. Theoretical Foundation
2.1 Core Concepts
- Process lifecycle: child creation, exec replacement, and exit status inspection.
- SIGCHLD handling: reliable reaping to prevent zombies.
- Process groups: broadcasting signals to a service and its children.
- Signal safety: only async-signal-safe calls inside handlers.
2.2 Why This Matters
Supervisors keep production services running. Incorrect signal handling leads to zombies, orphaned processes, or stuck shutdowns.
2.3 Historical Context / Background
Unix process control APIs have remained stable for decades. Most production init systems build on these primitives.
2.4 Common Misconceptions
- “Calling wait() once is enough.” Children can exit at any time, and several may exit before you reap, so you must loop.
- “Signal handlers can call printf().” Most libc calls, printf() included, are not async-signal-safe.
3. Project Specification
3.1 What You Will Build
A small init-style supervisor that runs a command, restarts it on crash, and forwards termination signals to the entire process group.
3.2 Functional Requirements
- Spawn and monitor: Start a child process and track status.
- Restart policy: Restart on crash with backoff.
- Signal forwarding: Forward SIGTERM/SIGINT to child group.
- No zombies: Reap all children reliably.
3.3 Non-Functional Requirements
- Reliability: Survive rapid crash loops without resource leaks.
- Usability: CLI flags for restart backoff and max restarts.
- Observability: Logs for state transitions.
3.4 Example Usage / Output
$ ./supervisor --restart=always --backoff=2 -- /usr/bin/myservice
[supervisor] started pid=2314
[supervisor] child exited status=1, restarting in 2s
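One way to parse a command line shaped like the example is `getopt_long()`. The sketch below assumes a few things the source does not specify: the `--max-restarts` flag name, the `never` value for `--restart`, the default values, and the `sup_options` struct are all hypothetical. The leading `+` in the option string is a GNU extension that stops parsing at the first non-option argument.

```c
#include <getopt.h>
#include <stdlib.h>
#include <string.h>

struct sup_options {
    int restart_always;   /* 1 if --restart=always */
    long backoff_sec;     /* --backoff=N, in seconds */
    long max_restarts;    /* hypothetical --max-restarts=N; -1 means unlimited */
    char **child_argv;    /* command to run: everything after "--" */
};

/* Parse a non-negative decimal number. Returns 0 on success, -1 otherwise. */
static int parse_count(const char *s, long *out)
{
    char *end = NULL;
    long v = strtol(s, &end, 10);
    if (*s == '\0' || *end != '\0' || v < 0)
        return -1;
    *out = v;
    return 0;
}

/* Returns 0 on success, -1 on a bad flag, bad value, or missing command. */
int parse_options(int argc, char **argv, struct sup_options *out)
{
    static const struct option longopts[] = {
        {"restart",      required_argument, NULL, 'r'},
        {"backoff",      required_argument, NULL, 'b'},
        {"max-restarts", required_argument, NULL, 'm'},
        {0, 0, 0, 0},
    };
    out->restart_always = 0;
    out->backoff_sec = 1;        /* assumed default */
    out->max_restarts = -1;
    out->child_argv = NULL;

    optind = 1;                  /* restart scanning so repeated calls work */
    int c;
    while ((c = getopt_long(argc, argv, "+", longopts, NULL)) != -1) {
        switch (c) {
        case 'r':
            if (strcmp(optarg, "always") == 0)
                out->restart_always = 1;
            else if (strcmp(optarg, "never") != 0)
                return -1;
            break;
        case 'b':
            if (parse_count(optarg, &out->backoff_sec) == -1)
                return -1;
            break;
        case 'm':
            if (parse_count(optarg, &out->max_restarts) == -1)
                return -1;
            break;
        default:
            return -1;
        }
    }
    if (optind >= argc)
        return -1;               /* no command given after the flags */
    out->child_argv = &argv[optind];
    return 0;
}
```

Because `child_argv` points into the original `argv`, it is already NULL-terminated and can be handed to `execvp()` unchanged.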
3.5 Real World Outcome
You run the supervisor, kill the child to watch it restart, then send SIGTERM to the supervisor and watch it forward the signal and shut down cleanly. Example output for the shutdown path:
$ ./supervisor --restart=always --backoff=1 -- ./worker
[supervisor] started pid=4021
[supervisor] SIGTERM received, forwarding to process group
[supervisor] child exited status=0, shutting down
4. Solution Architecture
4.1 High-Level Design
┌──────────────┐   ┌──────────────┐
│  Supervisor  │──▶│  Child Proc  │
└──────────────┘   └──────────────┘
        │                  ▲
        ▼                  │
  Signal Handler     waitpid loop

4.2 Key Components
| Component | Responsibility | Key Decisions |
|---|---|---|
| Main Loop | spawn/restart | backoff policy |
| Signal Handler | set flags | self-pipe or atomic flag |
| Reaper | waitpid() loop | handle multiple exits |
4.3 Data Structures
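struct supervisor_state {
    pid_t child_pid;     /* PID of the running child (pid_t needs <sys/types.h>) */
    int restart_count;   /* restarts performed so far */
    int shutting_down;   /* nonzero once shutdown has been requested */
};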
4.4 Algorithm Overview
Key Algorithm: SIGCHLD reaping
- On SIGCHLD, set a flag.
- In main loop, call waitpid(-1, &status, WNOHANG) until it stops returning a PID (it returns 0 while children are still running, or -1 when none remain).
- Restart if policy permits.
Complexity Analysis:
- Time: O(k) per reap pass, where k is the number of exited children
- Space: O(1)
5. Implementation Guide
5.1 Development Environment Setup
sudo apt-get install build-essential
5.2 Project Structure
supervisor/
├── src/
│ ├── main.c
│ ├── signals.c
│ └── spawn.c
├── tests/
│ └── test_restart.sh
├── Makefile
└── README.md

5.3 The Core Question You’re Answering
“How do I keep a process alive without creating zombies or signal races?”
5.4 Concepts You Must Understand First
Stop and research these before coding:
- fork/exec/wait
  - What does exec() do to PID and memory?
  - How do you interpret exit status?
  - Book Reference: “APUE” Ch. 8-9
- Signals and safety
  - Which functions are async-signal-safe?
  - What is signal coalescing?
  - Book Reference: “TLPI” Ch. 20-21
- Process groups
  - Why does a daemon spawn children?
  - How do you signal a group?
  - Book Reference: “APUE” Ch. 9
5.5 Questions to Guide Your Design
Before implementing, think through these:
- Where do you store state updated by signals?
- How do you avoid missing SIGCHLD?
- When should you stop restarting a crashing service?
- How do you forward signals to grandchildren?
5.6 Thinking Exercise
Manual Process Tree
Draw a tree of:
- supervisor
- child
- grandchild
- child
Which PID receives SIGTERM if you send it to the process group?
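One way to check your answer is to ask the processes themselves. The sketch below assumes the child makes itself a process group leader with `setpgid(0, 0)` before forking a grandchild. Each process then reports its process group ID through a pipe. `probe_group` and `read_full` are hypothetical helper names.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Read exactly n bytes, retrying on EINTR. Returns 0 on success, -1 otherwise. */
static int read_full(int fd, void *buf, size_t n)
{
    char *p = buf;
    while (n > 0) {
        ssize_t r = read(fd, p, n);
        if (r == -1 && errno == EINTR)
            continue;
        if (r <= 0)
            return -1;
        p += r;
        n -= (size_t)r;
    }
    return 0;
}

/* Fork a child that leads a new process group and forks a grandchild.
 * Reports the group ID each of them sees. Returns 0 on success, -1 on error. */
int probe_group(pid_t *child_pid, pid_t *child_pgid, pid_t *grandchild_pgid)
{
    int fds[2];
    if (pipe(fds) == -1)
        return -1;

    pid_t child = fork();
    if (child == -1) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (child == 0) {
        close(fds[0]);
        if (setpgid(0, 0) == -1)         /* become leader of a new group */
            _exit(1);
        pid_t gc = fork();               /* grandchild inherits the group */
        if (gc == -1)
            _exit(1);
        if (gc == 0) {
            pid_t pg = getpgrp();
            _exit(write(fds[1], &pg, sizeof pg) == (ssize_t)sizeof pg ? 0 : 1);
        }
        if (waitpid(gc, NULL, 0) == -1)  /* grandchild reports first */
            _exit(1);
        pid_t pg = getpgrp();
        _exit(write(fds[1], &pg, sizeof pg) == (ssize_t)sizeof pg ? 0 : 1);
    }

    close(fds[1]);
    pid_t seen[2];                       /* [0] grandchild, [1] child */
    int rc = read_full(fds[0], seen, sizeof seen);
    close(fds[0]);
    int status;
    while (waitpid(child, &status, 0) == -1 && errno == EINTR) {
    }
    if (rc == -1)
        return -1;
    *child_pid = child;
    *grandchild_pgid = seen[0];
    *child_pgid = seen[1];
    return 0;
}
```

Both reported group IDs equal the child's PID, while the supervisor stays in its original group. So `kill(-child_pid, SIGTERM)` reaches the child and the grandchild but not the supervisor.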
5.7 The Interview Questions They’ll Ask
Prepare to answer these:
- “How do you avoid zombie processes?”
- “Why can’t you call printf() in a signal handler?”
- “What is a process group and when does it matter?”
5.8 Hints in Layers
Hint 1: Use sigaction()
Use sigaction() with SA_RESTART and a minimal handler.
Hint 2: Self-pipe trick
Write a byte to a pipe in the handler to wake the main loop.
Hint 3: Backoff policy
Sleep between restarts after a crash to avoid tight restart loops.
5.9 Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Process control | “APUE” | Ch. 8-9 |
| Signal handling | “TLPI” | Ch. 20-21 |
| Process groups | “APUE” | Ch. 9 |
5.10 Implementation Phases
Phase 1: Foundation (3-4 days)
Goals:
- Spawn child
- Basic wait loop
Tasks:
- Implement fork() + exec().
- Capture child exit status.
Checkpoint: Supervisor restarts a crashing child.
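A rough sketch of what this checkpoint might look like before any signal handling is added. It is a blocking loop that restarts the command until it exits cleanly or a start limit is hit. `run_until_success` is a hypothetical name, and treating exit status 0 as the only clean exit is an assumption about the restart policy.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Start argv repeatedly until it exits with status 0 or has been started
 * max_starts times. Returns the number of starts, or -1 on fork/wait failure.
 * *last_status (if non-NULL) receives the raw status of the final run. */
int run_until_success(char *const argv[], int max_starts, int *last_status)
{
    int starts = 0;
    while (starts < max_starts) {
        pid_t pid = fork();
        if (pid == -1)
            return -1;
        if (pid == 0) {
            execvp(argv[0], argv);
            _exit(127);          /* exec failed */
        }
        starts++;

        int status;
        pid_t r;
        do {
            r = waitpid(pid, &status, 0);   /* blocking wait: Phase 1 only */
        } while (r == -1 && errno == EINTR);
        if (r == -1)
            return -1;
        if (last_status)
            *last_status = status;

        if (WIFEXITED(status) && WEXITSTATUS(status) == 0)
            break;               /* clean exit: stop restarting */
    }
    return starts;
}
```

The blocking `waitpid()` is the part later phases replace, since it cannot react to SIGTERM while a child is running.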
Phase 2: Core Functionality (5-7 days)
Goals:
- Signal handling
- Process group control
Tasks:
- Add sigaction() handlers.
- Forward SIGTERM/SIGINT to child group.
Checkpoint: Ctrl+C stops child and supervisor cleanly.
Phase 3: Polish & Edge Cases (3-5 days)
Goals:
- Backoff policy
- Logging and metrics
Tasks:
- Add restart limits and exponential backoff.
- Log state transitions and timestamps.
Checkpoint: Stable under crash loops.
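The backoff and limit logic is small enough to keep as pure functions, which also makes it easy to unit test. The sketch below assumes a doubling schedule with a cap. The function names, and the convention that a negative limit means "unlimited", are illustrative choices rather than requirements.

```c
#include <stdbool.h>

/* Delay in seconds before restart number `attempt` (0-based):
 * base_sec * 2^attempt, clamped to cap_sec without overflowing. */
unsigned backoff_delay(unsigned base_sec, unsigned attempt, unsigned cap_sec)
{
    if (base_sec >= cap_sec)
        return cap_sec;
    unsigned delay = base_sec;
    for (unsigned i = 0; i < attempt; i++) {
        if (delay > cap_sec / 2)
            return cap_sec;      /* doubling would exceed the cap */
        delay *= 2;
    }
    return delay;
}

/* True if another restart is allowed. max_restarts < 0 means no limit. */
bool may_restart(unsigned restarts_so_far, int max_restarts)
{
    return max_restarts < 0 || restarts_so_far < (unsigned)max_restarts;
}
```

With a 2-second base and a 60-second cap, the delays run 2, 4, 8, 16, 32, then stay at 60.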
5.11 Key Implementation Decisions
| Decision | Options | Recommendation | Rationale |
|---|---|---|---|
| Signal handling | simple flag vs self-pipe | self-pipe | avoids race with blocking wait |
| Restart policy | always vs capped | capped with backoff | prevents thrash |
| Process group | none vs setpgid() | use process group | signals reach grandchildren |
6. Testing Strategy
6.1 Test Categories
| Category | Purpose | Examples |
|---|---|---|
| Unit Tests | Signal helpers | handler sets flags |
| Integration Tests | Restart behavior | crash loop test |
| Edge Case Tests | Signal storms | repeated SIGCHLD |
6.2 Critical Test Cases
- Crash loop: child exits quickly 10 times.
- Graceful shutdown: send SIGTERM and ensure child exits.
- Grandchild handling: child spawns grandchild, ensure it receives signals.
6.3 Test Data
./worker --exit-after 1
7. Common Pitfalls & Debugging
7.1 Frequent Mistakes
| Pitfall | Symptom | Solution |
|---|---|---|
| Missed (coalesced) SIGCHLD | zombie processes | loop waitpid() with WNOHANG until nothing is reaped |
| Doing real work in a signal handler | crashes or deadlocks | handler sets flags only |
| No process group | grandchildren survive | use setpgid() |
7.2 Debugging Strategies
- Use ps -o pid,ppid,stat to spot zombies (they show Z in the STAT column).
- Log restart reasons with exit status.
7.3 Performance Traps
Tight restart loops can consume CPU. Add backoff and limits.
8. Extensions & Challenges
8.1 Beginner Extensions
- Add --once to disable restarts.
- Add JSON logging output.
8.2 Intermediate Extensions
- Add a control socket for runtime commands.
- Support configuration file loading.
8.3 Advanced Extensions
- Implement health checks with heartbeat.
- Add per-service dependency ordering.
9. Real-World Connections
9.1 Industry Applications
- init systems: systemd, s6, runit.
- process managers: supervisord, PM2.
9.2 Related Open Source Projects
- runit: http://smarden.org/runit/ - Minimal supervisor
- s6: https://skarnet.org/software/s6/ - Advanced process supervision
9.3 Interview Relevance
- Demonstrates signal handling and process control depth.
- Common debugging question: zombie processes.
10. Resources
10.1 Essential Reading
- “APUE” by Stevens & Rago - Ch. 8-9
- “The Linux Programming Interface” by Michael Kerrisk - Ch. 20-21
10.2 Video Resources
- Unix signals deep dives - Conference talks
- Process management debugging demos
10.3 Tools & Documentation
- man 2 waitpid: Child reaping
- man 2 sigaction: Signal handling
10.4 Related Projects in This Series
- Project 1 teaches FD discipline.
- Project 2 covers failure handling patterns.
11. Self-Assessment Checklist
11.1 Understanding
- I can explain how SIGCHLD works.
- I know why process groups matter.
- I can describe async-signal-safe rules.
11.2 Implementation
- No zombies appear under stress.
- Signals are forwarded correctly.
- Restart policy behaves as configured.
11.3 Growth
- I documented a race condition I fixed.
- I can explain this project in an interview.
12. Submission / Completion Criteria
Minimum Viable Completion:
- Supervisor restarts a crashing child.
- SIGTERM cleanly shuts down child and supervisor.
Full Completion:
- Backoff and restart limits.
- Logs for state transitions.
Excellence (Going Above & Beyond):
- Runtime control socket and health checks.
- Multi-service dependency management.
This guide was generated from SPRINT_5_SYSTEMS_INTEGRATION_PROJECTS.md. For the complete learning path, see the parent directory.