Project 3: Process Supervisor with Signal Forwarding

Build a supervisor that launches a child process, forwards signals safely, restarts on crash, and never leaks zombies.

Quick Reference

Attribute | Value
Difficulty | Intermediate-Advanced
Time Estimate | 2-3 weeks
Main Language | C (alternatives: Rust, Go)
Coolness Level | Level 4 - Operationally essential
Business Potential | Level 3 - Service management tooling
Prerequisites | fork/exec/wait, signals, errno
Key Topics | SIGCHLD, async-signal-safety, process groups, restart policy

1. Learning Objectives

By completing this project, you will:

  1. Handle SIGCHLD safely and reap children without races.
  2. Forward signals to process groups and manage shutdowns cleanly.
  3. Implement restart policies with backoff to avoid crash loops.
  4. Manage FD inheritance and avoid leaking resources into children.
  5. Produce deterministic, testable supervisor behavior.

2. All Theory Needed (Per-Concept Breakdown)

2.1 Signals, Async-Signal Safety, and EINTR

Fundamentals

Signals are asynchronous notifications delivered to a process. They can interrupt execution at any point, which makes signal handlers dangerous. Only a small set of functions are async-signal-safe. Calling non-safe functions (like printf, malloc, or pthread_mutex_lock) inside a handler can cause deadlocks or undefined behavior. The correct pattern is to set a flag or write to a pipe in the handler, then handle the real work in the main loop. Signals also interrupt system calls, causing them to return -1 with errno=EINTR. A robust supervisor must retry interrupted calls or gracefully handle them.

Deep Dive into the concept

The kernel delivers signals by interrupting the thread’s execution and running a handler on its stack. This can happen between any two instructions, which is why handlers must be minimal and reentrant. Async-signal-safe functions are those that POSIX guarantees can be called from handlers without corrupting global state. The list is short: write, _exit, sigaction, waitpid, and a few dozen others. If you call printf, it can deadlock because printf might hold internal locks when the signal interrupts it. A deadlock in a signal handler is catastrophic because it can freeze the process in a non-recoverable state.

The safe pattern is: in the handler, set a volatile sig_atomic_t flag or write a byte to a self-pipe. Then the main loop detects the flag or reads from the pipe and performs the expensive work (like waitpid). This is called the self-pipe trick. It also allows the main loop to use poll() to wait on both I/O and signal events.

EINTR is another common pitfall. When a signal interrupts a blocking system call (like poll or read), the call returns early with -1 and errno set to EINTR. If you do not handle this, your supervisor can mistakenly interpret it as a fatal error. The correct response is usually to retry the call unless a shutdown flag is set. This is particularly important in a supervisor that uses waitpid in a loop; a blocking waitpid can fail with EINTR and must be retried.

Signal handling also interacts with process states. SIGCHLD is delivered when a child changes state (exits, stops, or continues). Although waitpid itself is async-signal-safe, the work that surrounds it (logging, restart decisions, bookkeeping) is not, so the handler should only set a flag and leave reaping to the main loop, which calls waitpid(-1, &status, WNOHANG) repeatedly until it returns 0.

How this fits in projects

Signals define how your supervisor observes child exits and forwards shutdown events. This is the core of Project 3 and also appears in Project 6 where the deployment pipeline manages supervised processes.

Definitions & key terms

  • SIGCHLD: Signal delivered when a child changes state.
  • Async-signal-safe: Functions safe to call from a signal handler.
  • EINTR: Interrupted system call error.
  • Self-pipe: Pipe used to notify the main loop of signals.

Mental model diagram (ASCII)

signal -> handler -> set flag / write byte to self-pipe
                          |
                          v
                     main loop -> waitpid, logging, restart

How it works (step-by-step, with invariants and failure modes)

  1. Install sigaction handlers for SIGCHLD, SIGTERM, SIGINT.
  2. Handler writes to a pipe or sets a flag.
  3. Main loop wakes, drains pipe, calls waitpid in loop.
  4. Restart logic runs outside handler.
  5. Failure mode: handler uses printf -> deadlock.

Minimal concrete example

volatile sig_atomic_t child_exited = 0;
void sigchld_handler(int sig) { (void)sig; child_exited = 1; }

Common misconceptions

  • “Any function is safe in a handler”: false.
  • “Signals are delivered only between syscalls”: false; they can arrive any time.
  • “Ignoring EINTR is fine”: it can cause spurious failures.

Check-your-understanding questions

  1. Why is printf unsafe inside a signal handler?
  2. What is the self-pipe trick and why use it?
  3. When should you retry a system call after EINTR?

Check-your-understanding answers

  1. It may hold internal locks and is not async-signal-safe.
  2. It converts an async signal into a readable event for the main loop.
  3. Usually immediately, unless a shutdown flag is set.

Real-world applications

  • Process supervisors like systemd and daemontools.
  • High-availability service managers.

References

  • man 7 signal, man 2 sigaction.
  • “Advanced Programming in the UNIX Environment” Ch. 10.

Key insights

Signal handlers must be minimal; the real work belongs in the main loop.

Summary

Signals are asynchronous and dangerous. Safe supervisors use minimal handlers, self-pipes, and robust EINTR handling.

Homework/exercises to practice the concept

  1. Write a program that sets a SIGUSR1 handler and observe EINTR on read().
  2. Implement the self-pipe trick and integrate with poll().
  3. Demonstrate a deadlock by using printf inside a handler (in a toy program).

Solutions to the homework/exercises

  1. read() returns -1 with errno set to EINTR when SIGUSR1 arrives (assuming the handler was installed without SA_RESTART).
  2. The main loop wakes on the pipe and handles the signal safely.
  3. The program can hang due to unsafe handler usage.

2.2 waitpid, Child States, and Zombie Reaping

Fundamentals

When a child process exits, the kernel keeps a small record of its exit status until the parent calls wait() or waitpid(). Until then, the child is a zombie. Zombies consume a process table entry and can accumulate if not reaped. A supervisor must call waitpid in a loop to reap all exited children. waitpid can also report other state changes, such as stop or continue events.

Deep Dive into the concept

waitpid(-1, &status, WNOHANG) is the standard pattern for non-blocking reaping. The -1 means “any child.” With WNOHANG, the call returns 0 if no child has exited. The supervisor should call waitpid in a loop until it returns 0 or -1 with ECHILD. This ensures all exited children are reaped. If you only call waitpid once, you can leave zombies if multiple children exited around the same time.

The status value encodes whether the process exited normally or was killed by a signal. Use WIFEXITED, WEXITSTATUS, WIFSIGNALED, and WTERMSIG macros to decode it. This data informs restart policy: a clean exit may indicate a controlled shutdown, while a SIGSEGV implies a crash.

A supervisor should also handle child stop/continue events if it cares about job control. For most service managers, you can ignore stop/continue unless you explicitly send SIGSTOP or SIGCONT. However, you should be aware that waitpid can return for these states if you use WUNTRACED or WCONTINUED flags.

Reaping is closely tied to signal handling. SIGCHLD is a standard signal and does not queue: if several children exit while one SIGCHLD is already pending, you may receive only a single signal. That is why you should always reap in a loop when the main loop detects a SIGCHLD flag, not just once.

How this fits in projects

This concept ensures your supervisor never leaks zombies and can correctly interpret exit status for restarts.

Definitions & key terms

  • Zombie: Process that exited but has not been reaped.
  • WNOHANG: waitpid option for non-blocking behavior.
  • WIFSIGNALED: Macro indicating termination by signal.

Mental model diagram (ASCII)

child exit -> zombie -> waitpid() -> reaped

How it works (step-by-step, with invariants and failure modes)

  1. Child exits; kernel records exit status.
  2. SIGCHLD is delivered to parent.
  3. Parent loops waitpid(-1, &status, WNOHANG).
  4. For each child, record exit cause and clear from table.
  5. Failure mode: no loop -> zombies remain.

Minimal concrete example

pid_t pid;
int status;
while ((pid = waitpid(-1, &status, WNOHANG)) > 0) {
    if (WIFSIGNALED(status)) { /* crashed */ }
}

Common misconceptions

  • “SIGCHLD is queued”: it is not; you can miss it.
  • “One waitpid is enough”: multiple children can exit.

Check-your-understanding questions

  1. Why do zombies exist at all?
  2. What does waitpid return when no children are waiting?
  3. How do you detect a SIGSEGV crash?

Check-your-understanding answers

  1. So the parent can collect the exit status.
  2. 0 when WNOHANG is used.
  3. WIFSIGNALED(status) and WTERMSIG(status) == SIGSEGV.

Real-world applications

  • Supervisors and init systems.
  • Container runtimes that manage child processes.

References

  • man 2 waitpid.
  • “Operating Systems: Three Easy Pieces” (Processes chapter).

Key insights

Reaping is a loop, not a single call.

Summary

A supervisor must reap children reliably to avoid zombies and must decode exit status to make restart decisions.

Homework/exercises to practice the concept

  1. Fork multiple children that exit quickly and verify all are reaped.
  2. Force a child to segfault and observe status decoding.
  3. Simulate missed SIGCHLD and confirm loop reaps all.

Solutions to the homework/exercises

  1. The loop should reap all children and leave none as zombies.
  2. WIFSIGNALED should be true with SIGSEGV.
  3. The loop works even if SIGCHLD signal was missed.

2.3 Process Groups, Sessions, and Signal Forwarding

Fundamentals

A process group is a collection of processes that can receive signals together. When a supervisor launches a child, it can place the child in a new process group and then send signals to the group to ensure all descendants are terminated. This is essential when the supervised process spawns children; otherwise, those children can outlive the supervisor and become orphaned. Proper signal forwarding ensures clean shutdowns and prevents resource leaks.

Deep Dive into the concept

Process groups are identified by a PGID. A process can become a group leader by calling setpgid(0, 0). If the supervisor launches a child and sets it as the leader of a new group, it can later send a signal to the entire group with kill(-pgid, SIGTERM). The negative PID indicates a process group. This is how systemd and many supervisors ensure all child processes stop on shutdown.

Sessions are higher-level groupings of process groups. A session is created by setsid() and typically used by daemons to detach from the controlling terminal. For a simple supervisor, you usually only need process groups. However, if you want the supervisor to manage a full service tree independent of a terminal, creating a new session for the child can be useful.

Signal forwarding policy must be explicit. When the supervisor receives SIGTERM, it should forward SIGTERM to the child process group and then wait for graceful shutdown. If the child does not exit within a timeout, the supervisor should escalate to SIGKILL. This two-phase shutdown is common in production to allow cleanup without hanging indefinitely.

How this fits in projects

Process groups enable correct shutdown behavior in Project 3 and are reused in Project 6 when restarting services during deployment.

Definitions & key terms

  • Process group: Set of processes with a shared PGID.
  • PGID: Process group ID.
  • Session: Collection of process groups.

Mental model diagram (ASCII)

Supervisor (PID 100)
  |
  +-- child (PID 200, PGID 200)
       |
       +-- grandchild (PID 201, PGID 200)

kill(-200, SIGTERM) => both child and grandchild

How it works (step-by-step, with invariants and failure modes)

  1. Fork child.
  2. In child, call setpgid(0, 0) to create a new group; the parent calls setpgid(pid, pid) as well, so the group exists before any signal is sent.
  3. Supervisor records PGID.
  4. On SIGTERM, supervisor calls kill(-pgid, SIGTERM).
  5. Wait for exit; after timeout, send SIGKILL.
  6. Failure mode: no process group -> orphaned grandchildren.

Minimal concrete example

pid_t pid = fork();
if (pid == 0) { setpgid(0, 0); execvp(argv[0], argv); _exit(127); }
setpgid(pid, pid); /* parent side too, to avoid a startup race */
// parent uses kill(-pid, SIGTERM)

Common misconceptions

  • “Killing the child kills all its children”: not always.
  • “SIGKILL is always fine”: it prevents graceful cleanup.

Check-your-understanding questions

  1. Why send signals to a process group instead of a single PID?
  2. When should you use SIGKILL?
  3. How do sessions differ from process groups?

Check-your-understanding answers

  1. To ensure all descendants receive the signal.
  2. Only after a graceful shutdown timeout.
  3. Sessions group process groups and detach from terminals.

Real-world applications

  • Systemd service managers.
  • Supervisors that control worker pools.

References

  • man 2 setpgid, man 2 kill.
  • APUE Chapter 9.

Key insights

Signal forwarding is about process groups, not individual PIDs.

Summary

Supervisors must manage process groups to ensure complete shutdown and avoid orphaned processes.

Homework/exercises to practice the concept

  1. Create a child that spawns a grandchild and observe behavior with and without process groups.
  2. Implement a two-phase SIGTERM then SIGKILL shutdown.
  3. Observe the difference between killing a PID and a PGID.

Solutions to the homework/exercises

  1. Without process groups, the grandchild can survive the supervisor.
  2. Two-phase shutdown allows cleanup before forced termination.
  3. Negative PID targets a group, which is more reliable for trees.

2.4 Restart Policies and Backoff

Fundamentals

A supervisor should restart crashed processes, but not endlessly in a tight loop. A restart policy defines when and how to restart and how long to wait between attempts. Backoff prevents CPU burn and log spam in crash loops. A common policy is exponential backoff with a cap, plus a reset when the process stays up for a certain time.

Deep Dive into the concept

Crash loops are a classic production failure. If a process crashes immediately, a naive supervisor will restart it instantly, creating a tight loop that consumes CPU and generates logs. Backoff mitigates this by increasing the delay between restarts. A typical sequence is 1s, 2s, 4s, 8s, capped at 60s. This reduces load and gives operators time to intervene.

A restart policy should also consider exit codes. Some exits are intentional (e.g., exit code 0 or a specific code indicating a clean shutdown). The supervisor should not restart in those cases unless configured to do so. You can define a “restart on crash” policy where only signal-terminated exits are restarted. Another option is “always restart,” which is appropriate for daemons that should run indefinitely.

A healthy run window is also useful. If the process stays up longer than a threshold (e.g., 10 minutes), you can reset the backoff to the minimum. This prevents long-lived services from being punished by previous failures.

How this fits in projects

Restart policy determines your supervisor’s resilience and operator experience. It also influences the deployment pipeline in Project 6.

Definitions & key terms

  • Crash loop: Rapid repeated crashes and restarts.
  • Backoff: Increasing delay between retries.
  • Restart policy: Rules for when to restart.

Mental model diagram (ASCII)

crash -> delay 1s -> crash -> delay 2s -> crash -> delay 4s
stable for 10m -> reset delay to 1s

How it works (step-by-step, with invariants and failure modes)

  1. Child exits; supervisor checks status.
  2. If restartable, compute backoff delay.
  3. Sleep for delay; restart child.
  4. If child stable for threshold, reset backoff.
  5. Failure mode: no backoff -> tight crash loop.

Minimal concrete example

int shift = attempts < 16 ? attempts : 16;  /* bound the shift */
int delay = base << shift;
if (delay > max_delay) delay = max_delay;   /* C has no min(); clamp */

Common misconceptions

  • “Always restart immediately”: this can DDoS your own machine.
  • “Exit code 0 means crash”: it usually means clean exit.

Check-your-understanding questions

  1. Why reset backoff after a stable run?
  2. What exit statuses should trigger restart?
  3. How do you cap backoff to avoid excessive delays?

Check-your-understanding answers

  1. To avoid penalizing long-running stable services.
  2. Typically signal-terminated or non-zero exit codes.
  3. Use a max delay and clamp the exponential growth.

Real-world applications

  • init systems and service managers.
  • Job runners that need resilience.

References

  • “Release It!” crash loop patterns.

Key insights

Backoff is an operational safety valve.

Summary

Restart policy prevents crash loops and gives operators control over recovery behavior.

Homework/exercises to practice the concept

  1. Simulate a crashing process and measure restart intervals.
  2. Implement a reset policy after a stable run.
  3. Add configurable restart modes (always, on-failure, never).

Solutions to the homework/exercises

  1. The delay should increase exponentially up to a cap.
  2. After the stable window, the delay resets to the minimum.
  3. Restart logic should follow the configured policy.

3. Project Specification

3.1 What You Will Build

A CLI supervisor that launches a child command, monitors it, forwards signals to its process group, restarts it on crash with backoff, and emits structured logs.

3.2 Functional Requirements

  1. Launch and monitor: Start a child command and monitor its exit.
  2. Signal forwarding: Forward SIGTERM and SIGINT to child process group.
  3. Restart policy: Restart on crash with configurable backoff.
  4. No zombies: Reap all children correctly.
  5. Deterministic logs: Use fixed timestamps in test mode.

3.3 Non-Functional Requirements

  • Reliability: No zombie processes after supervisor exit.
  • Usability: Clear log messages for restarts and exits.
  • Observability: Exit codes and signals are reported explicitly.

3.4 Example Usage / Output

$ ./supervisor -- ./worker --port 9000
[supervisor] started child pid=4412 pgid=4412
[supervisor] forwarding SIGTERM to process group
[supervisor] child exited status=0
[supervisor] restart in 2s

3.5 Data Formats / Schemas / Protocols

Config (optional JSON):

{
  "restart_policy": "on-failure",
  "backoff_base_ms": 1000,
  "backoff_max_ms": 30000,
  "graceful_shutdown_ms": 5000
}

3.6 Edge Cases

  • Child crashes immediately in a loop.
  • Supervisor receives SIGTERM while child is restarting.
  • Child spawns grandchildren and leaves them running.
  • waitpid interrupted by signals.

3.7 Real World Outcome

3.7.1 How to Run (Copy/Paste)

make
./supervisor -- ./worker --port 9000

3.7.2 Golden Path Demo (Deterministic)

  • Use a test worker that exits after 2 seconds with code 0.
  • Supervisor logs show clean exit and optional restart based on policy.

3.7.3 If CLI: exact terminal transcript

$ ./supervisor -- ./worker --port 9000
[supervisor] started child pid=4412
[supervisor] child exited status=0

Failure demo (crash loop):

$ ./supervisor -- ./crash
[supervisor] child died signal=11
[supervisor] restart in 2s
[supervisor] restart in 4s

Exit codes:

  • 0 on clean shutdown.
  • 5 on repeated crash loop beyond limit.

4. Solution Architecture

4.1 High-Level Design

+--------------------+
| Supervisor Main    |
+---------+----------+
          |
    signal pipe
          v
+---------+----------+
| Event Loop         |
+---------+----------+
          |
          v
+---------+----------+
| Child Manager      |
+--------------------+

4.2 Key Components

Component | Responsibility | Key Decisions
Signal handler | Set flags / write to pipe | Minimal handler
Event loop | Poll for signal events and child status | poll + waitpid loop
Child manager | fork/exec, restart policy | Exponential backoff
Logger | Emit structured logs | Deterministic test mode

4.3 Data Structures (No Full Code)

struct restart_policy {
    int base_ms;
    int max_ms;
    int stable_reset_ms;
};

4.4 Algorithm Overview

Key Algorithm: Signal + waitpid loop

  1. Wait on signal pipe with poll.
  2. On SIGCHLD, call waitpid in loop.
  3. If child crashed, apply restart policy.
  4. Forward signals to process group.

Complexity Analysis:

  • Time: O(1) per signal event.
  • Space: O(1).

5. Implementation Guide

5.1 Development Environment Setup

sudo apt-get install -y gcc make

5.2 Project Structure

supervisor/
├── src/
│   ├── main.c
│   ├── signals.c
│   ├── child.c
│   └── policy.c
├── include/
│   ├── signals.h
│   └── child.h
├── tests/
│   ├── crash.sh
│   └── clean_exit.sh
└── Makefile

5.3 The Core Question You’re Answering

“How do I manage processes safely when the OS can interrupt me at any time?”

5.4 Concepts You Must Understand First

  1. Signals and async-signal-safe rules.
  2. waitpid and exit status decoding.
  3. Process groups and forwarding.
  4. Restart policies and backoff.

5.5 Questions to Guide Your Design

  1. What logic must run in the handler vs main loop?
  2. How will you avoid race conditions in signal handling?
  3. When does the supervisor stop restarting?

5.6 Thinking Exercise

Crash Loop Scenario

Child exits every 1s with SIGSEGV.
What policy avoids CPU burn and log spam?

5.7 The Interview Questions They’ll Ask

  1. Why is printf unsafe in signal handlers?
  2. How do you avoid zombies?
  3. What does SA_SIGINFO provide?

5.8 Hints in Layers

Hint 1: Minimal handler

volatile sig_atomic_t sigchld = 0;

Hint 2: Reap in loop

while (waitpid(-1, &status, WNOHANG) > 0) { }

Hint 3: Process group

kill(-pgid, SIGTERM);

Hint 4: Backoff

1s -> 2s -> 4s -> 8s

5.9 Books That Will Help

Topic | Book | Chapter
Signals | Advanced Programming in the UNIX Environment | Ch. 10
Process control | APUE | Ch. 8-9
Process API | OSTEP | Process chapter

5.10 Implementation Phases

Phase 1: Signals + Reaping (3-4 days)

Goals:

  • Correct signal handling and zombie reaping.

Tasks:

  1. Install signal handlers using sigaction.
  2. Implement waitpid loop and logging.

Checkpoint: No zombies after child exits.

Phase 2: Process Groups (3-4 days)

Goals:

  • Forward signals to child process group.

Tasks:

  1. Set child PGID on fork.
  2. Implement forward-on-signal logic.

Checkpoint: SIGTERM kills all child processes.

Phase 3: Restart Policy (4-6 days)

Goals:

  • Backoff and restart decisions.

Tasks:

  1. Add restart policy configuration.
  2. Implement exponential backoff and reset logic.

Checkpoint: Crash loops back off correctly.

5.11 Key Implementation Decisions

Decision | Options | Recommendation | Rationale
Signal strategy | handlers only, self-pipe | self-pipe | Safe and testable
Restart policy | always, on-failure | on-failure | Avoid restarting clean exits
Shutdown timeout | 1s, 5s, 30s | 5s | Reasonable balance

6. Testing Strategy

6.1 Test Categories

Category | Purpose | Examples
Unit Tests | Restart policy logic | backoff calculation
Integration Tests | Crash and restart behavior | crash.sh
Edge Case Tests | Signal interruptions | kill during restart

6.2 Critical Test Cases

  1. SIGCHLD causes waitpid loop to reap all children.
  2. SIGTERM triggers signal forwarding to process group.
  3. Crash loop triggers exponential backoff.

6.3 Test Data

./worker exits with code 0
./crash segfaults immediately

7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

Pitfall | Symptom | Solution
Doing work in handler | Deadlock or crash | Use self-pipe and flags
No waitpid loop | Zombies accumulate | Reap until waitpid returns 0
No process group | Orphaned grandchildren | setpgid and kill(-pgid)

7.2 Debugging Strategies

  • Use ps -o pid,ppid,stat to spot zombies (state Z).
  • Send SIGTERM and observe forwarding behavior.
  • Log exit codes and signals explicitly.

7.3 Performance Traps

  • Busy loops when handling signals without blocking.

8. Extensions & Challenges

8.1 Beginner Extensions

  • Add a --once flag to disable restarts.
  • Add JSON logging output.

8.2 Intermediate Extensions

  • Add support for multiple children (worker pool).
  • Add watchdog timeouts (restart if no heartbeat).

8.3 Advanced Extensions

  • Implement a control socket for runtime commands.
  • Integrate with systemd notifications.

9. Real-World Connections

9.1 Industry Applications

  • systemd: Process supervision with cgroups and signals.
  • supervisord: Lightweight supervisor for services.
  • runit: Minimal supervisor with restart policy.
  • s6: Advanced process supervision suite.

9.2 Interview Relevance

  • Signal handling and process control are common systems questions.

10. Resources

10.1 Essential Reading

  • Advanced Programming in the UNIX Environment - Signal handling and process control.
  • Operating Systems: Three Easy Pieces - Process management chapter.

10.2 Video Resources

  • Talks on process supervision and init systems.

10.3 Tools & Documentation

  • man 2 waitpid, man 2 sigaction, man 2 setpgid.

11. Self-Assessment Checklist

11.1 Understanding

  • I can explain why zombies happen.
  • I can implement a signal-safe handler.
  • I can describe process group signaling.

11.2 Implementation

  • All functional requirements are met.
  • Crash loops back off correctly.
  • No zombies remain after shutdown.

11.3 Growth

  • I can justify my restart policy in production.
  • I can explain the self-pipe trick.

12. Submission / Completion Criteria

Minimum Viable Completion:

  • Launch child and handle SIGCHLD reaping.
  • Forward SIGTERM to process group.
  • Basic restart policy with backoff.

Full Completion:

  • Configurable policies and deterministic logging.
  • Integration tests for crash loops and signal forwarding.

Excellence (Going Above & Beyond):

  • Multi-child supervision with health checks.
  • Control socket for runtime management.