Project 4: Signal-Aware Process Supervisor

Build a supervisor that launches a child, forwards signals, and restarts it on failure.

Quick Reference

Attribute	Value
Difficulty	Level 3: Advanced
Time Estimate	1 week
Main Programming Language	C
Alternative Programming Languages	Rust, Go
Coolness Level	Level 3: Systems Hardening
Business Potential	2: The “Ops Utility”
Prerequisites	signals and handlers, waitpid, process groups
Key Topics	SIGCHLD, signal forwarding, restart policies

1. Learning Objectives

By completing this project, you will:

Build a working implementation of a signal-aware process supervisor and verify it with deterministic outputs.
Explain the underlying Unix and terminal primitives involved in the project.
Diagnose common failure modes with logs and targeted tests.
Extend the project with performance and usability improvements.

2. All Theory Needed (Per-Concept Breakdown)

Signal Propagation and Process Supervision

Fundamentals
Signal propagation and process supervision form the core contract of this project. The supervisor sits at the boundary between the operating system's asynchronous events (signals and child state changes) and its own structured state (which child is running, how many restarts remain), so you must treat it as both a protocol and a data model. The goal of the fundamentals is to understand what assumptions the system makes about ordering, delivery, and ownership, and how those assumptions surface as user-visible behavior. Key terms include: SIGCHLD, process group, waitpid, backoff, zombie. In practice, the fastest way to gain intuition is to trace a single signal from delivery to the child's exit and note where it can be delayed, coalesced, or lost. That exercise reveals why the supervisor needs explicit invariants and why even small mistakes can cascade into zombie processes, orphaned children, or a supervisor that never exits.
Deep Dive into the concept
A deep understanding of signal propagation and process supervision requires thinking in terms of state transitions and invariants. You are not just implementing functions; you are enforcing a contract between the supervisor and its child, and that contract persists across time. Most failures in this area are caused by violating ordering guarantees, dropping state updates, or misunderstanding how the operating system delivers events. This concept is built from the following pillars: SIGCHLD, process group, waitpid, backoff, zombie. A reliable implementation follows a deterministic flow: start the child in its own process group, install signal handlers that set flags, use waitpid to reap the child on SIGCHLD, forward SIGINT/SIGTERM to the child's process group, then restart or exit based on policy.
From a systems perspective, the tricky part is coordinating asynchronous events without introducing races. Even in a single-threaded loop, multiple events can arrive in the same iteration (for example, SIGTERM and SIGCHLD), so you need deterministic ordering. This is why many implementations keep a strict sequence in each iteration: check the signal flags, reap children, update state, then act (forward, restart, or exit).
Another subtlety is error handling and recovery. A robust design treats errors as part of the normal control flow: EINTR from an interrupted system call is expected, a failed exec is expected, and transient failures must be retried or handled gracefully. You also need a way to observe the system, because without logs and trace points you cannot reason about correctness.
When you design the project, treat each key term as a source of constraints. For example, if a term implies timing (backoff), decide the initial delay, how it grows, and where it is capped. If a term implies state, decide how that state is initialized, updated, and reset. Finally, validate your assumptions with deterministic fixtures so you can reproduce bugs.
How this fits in the project
This concept is the backbone of the project because it defines how signals and child state changes flow through the supervisor.
Definitions & key terms
- SIGCHLD -> signal delivered when a child changes state
- process group -> a job-control grouping for signal delivery
- waitpid -> system call to reap child status and avoid zombies
- backoff -> delay strategy between restarts to avoid crash loops
- zombie -> a terminated process that has not been reaped
Mental model diagram (ASCII)

[Input] -> [Signal Propagation and Process Supervision] -> [State] -> [Output]

How it works (step-by-step, with invariants and failure modes)
1. Start child in its own process group.
2. Install signal handlers that set flags.
3. Use waitpid to reap child on SIGCHLD.
4. Forward SIGINT/SIGTERM to child group.
5. Restart or exit based on policy.
Minimal concrete example

kill(-child_pgid, SIGTERM); // send to process group
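To see the one-liner above in context, here is a minimal sketch, assuming a POSIX system. The helper name `spawn_group_and_signal` is hypothetical, and the child simply calls `pause()` as a stand-in for real work. Both parent and child call `setpgid` so the group exists no matter which side runs first.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork a child into its own process group, send `sig`
 * to the whole group, and return the signal that terminated the child
 * (or -1 on error, or if the child was not killed by a signal). */
int spawn_group_and_signal(int sig)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        setpgid(0, 0);           /* child: become leader of a new group */
        for (;;)
            pause();             /* stand-in for real work */
    }
    setpgid(pid, pid);           /* parent: same call, closes the startup race */

    if (kill(-pid, sig) < 0)     /* negative pid targets the process group */
        return -1;

    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFSIGNALED(status) ? WTERMSIG(status) : -1;
}
```

Calling `spawn_group_and_signal(SIGTERM)` should report `SIGTERM` as the terminating signal, because the child keeps the default disposition. In a real supervisor the child would call an exec function instead of `pause()`.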

Common misconceptions
- “SIGCHLD always means the child exited” -> it can mean stopped/continued too.
- “waitpid once is enough” -> you must loop to reap all children.
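Both misconceptions can be illustrated with a short sketch, assuming POSIX wait macros. The `ChildEvent` enum and the function names are hypothetical; the point is that a wait status has several possible meanings and that reaping should loop until nothing is left.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>

/* Hypothetical classification of a wait status. */
typedef enum {
    CHILD_EXITED,      /* called exit() or returned from main */
    CHILD_KILLED,      /* terminated by a signal */
    CHILD_STOPPED,     /* stopped, e.g. by SIGSTOP (reported with WUNTRACED) */
    CHILD_CONTINUED,   /* resumed by SIGCONT (reported with WCONTINUED) */
    CHILD_UNKNOWN
} ChildEvent;

ChildEvent classify_status(int status)
{
    if (WIFEXITED(status))    return CHILD_EXITED;
    if (WIFSIGNALED(status))  return CHILD_KILLED;
    if (WIFSTOPPED(status))   return CHILD_STOPPED;
#ifdef WIFCONTINUED
    if (WIFCONTINUED(status)) return CHILD_CONTINUED;
#endif
    return CHILD_UNKNOWN;
}

/* Reap every child that has already terminated, without blocking.
 * Returns the number of children reaped. */
int reap_all(void)
{
    int reaped = 0;
    for (;;) {
        int status;
        pid_t pid = waitpid(-1, &status, WNOHANG);
        if (pid > 0) {
            reaped++;
            continue;
        }
        if (pid < 0 && errno == EINTR)
            continue;
        break;   /* 0: nothing changed state; -1 with ECHILD: no children */
    }
    return reaped;
}
```

The loop matters because several children can terminate while SIGCHLD is pending, and the supervisor may then see only one delivery of the signal.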
Check-your-understanding questions
- Why use a process group when forwarding signals?
- How do you avoid zombies?
- What is a safe action inside a signal handler?
Check-your-understanding answers
- It ensures the entire job receives the signal.
- Call waitpid in the main loop until no children remain.
- Set flags or write to a self-pipe; avoid malloc/printf.
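The "set flags" answer can be sketched as follows, assuming POSIX `sigaction`. The variable and function names are hypothetical. The handler only stores to `volatile sig_atomic_t` variables, which is among the few things guaranteed safe in a handler.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

/* Flags written by the handler and read by the main loop. */
static volatile sig_atomic_t got_sigchld = 0;
static volatile sig_atomic_t got_term = 0;   /* holds SIGINT or SIGTERM */

static void on_signal(int sig)
{
    if (sig == SIGCHLD)
        got_sigchld = 1;
    else
        got_term = sig;
}

/* Install one handler for SIGCHLD, SIGINT, and SIGTERM.
 * sa_flags is left at 0 (no SA_RESTART) so blocking calls in the main
 * loop return with EINTR and the loop can look at the flags promptly. */
int install_handlers(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_signal;
    sigemptyset(&sa.sa_mask);

    if (sigaction(SIGCHLD, &sa, NULL) < 0) return -1;
    if (sigaction(SIGINT, &sa, NULL) < 0)  return -1;
    if (sigaction(SIGTERM, &sa, NULL) < 0) return -1;
    return 0;
}
```

The main loop would test and clear these flags on each iteration, then do the real work (waitpid, logging, forwarding) outside the handler.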
Real-world applications
- daemons
- service managers
- tmux server child management
Where you’ll apply it
- See Section 3.2 Functional Requirements and Section 5.4 Concepts You Must Understand First.
- Also used in: Project 3: Unix Domain Socket Chat, Project 5: Event-Driven I/O Multiplexer.
References
- APUE Ch. 10
- TLPI Ch. 34
Key insights
Signal propagation and process supervision work best when you treat them as a stateful contract with explicit invariants.
Summary
You now have a concrete mental model for signal propagation and process supervision and can explain how it affects correctness and usability.
Homework/Exercises to practice the concept
- Write a supervisor that restarts /bin/false with backoff.
Solutions to the homework/exercises
- Use nanosleep and a restart counter.
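One possible shape for that solution is sketched below, assuming POSIX and that the command is given as an absolute path. The function names and the doubling policy are illustrative choices, not the only valid design.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Sleep for `ms` milliseconds, resuming if a signal interrupts the sleep. */
static void sleep_ms(long ms)
{
    struct timespec ts = { ms / 1000, (ms % 1000) * 1000000L };
    while (nanosleep(&ts, &ts) == -1 && errno == EINTR)
        ;   /* ts now holds the remaining time; keep sleeping */
}

/* Run `path` until it exits with status 0 or `max_restarts` restarts have
 * been used. The delay starts at `base_ms` and doubles after each failure.
 * Returns the number of times the child was launched, or -1 on error. */
int run_with_backoff(const char *path, int max_restarts, long base_ms)
{
    int launches = 0;
    long delay_ms = base_ms;

    for (;;) {
        pid_t pid = fork();
        if (pid < 0)
            return -1;
        if (pid == 0) {
            execl(path, path, (char *)NULL);
            _exit(127);              /* exec failed */
        }
        launches++;

        int status;
        if (waitpid(pid, &status, 0) < 0)
            return -1;
        if (WIFEXITED(status) && WEXITSTATUS(status) == 0)
            return launches;         /* normal exit: stop supervising */
        if (launches > max_restarts)
            return launches;         /* restart budget exhausted */

        sleep_ms(delay_ms);
        delay_ms *= 2;
    }
}
```

With `/bin/false` and `max_restarts` set to 2, the child is launched three times (one initial run plus two restarts) before the function gives up.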

3. Project Specification

3.1 What You Will Build

A CLI supervisor that launches a child command, forwards signals to its process group, restarts on crash with backoff, and exits with a clear status.

3.2 Functional Requirements

Requirement 1: Launch child process in its own process group.
Requirement 2: Forward SIGINT/SIGTERM to the child group.
Requirement 3: Detect child exit via SIGCHLD and waitpid.
Requirement 4: Restart on crash with backoff and max retries.
Requirement 5: Exit cleanly if child exits normally with status 0.
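Requirements 1 and 3 might be sketched like this, assuming POSIX. The names `spawn_child` and `wait_exit_code` are hypothetical, and mapping a signal death to 128 plus the signal number is the common shell convention (it matches the status=130 shown for SIGINT in the example output), not something the requirements mandate.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork, put the child in its own process group, and exec argv[0].
 * Returns the child's pid in the parent, or -1 if fork fails. */
pid_t spawn_child(char *const argv[])
{
    pid_t pid = fork();
    if (pid != 0) {
        if (pid > 0)
            setpgid(pid, pid);   /* may fail harmlessly if the child already exec'd */
        return pid;
    }
    setpgid(0, 0);               /* child side of the same race-free setup */
    execvp(argv[0], argv);
    _exit(127);                  /* only reached when exec fails */
}

/* Block until `pid` terminates and turn its status into an exit code. */
int wait_exit_code(pid_t pid)
{
    int status;
    while (waitpid(pid, &status, 0) < 0) {
        if (errno != EINTR)
            return -1;
    }
    if (WIFEXITED(status))
        return WEXITSTATUS(status);
    if (WIFSIGNALED(status))
        return 128 + WTERMSIG(status);
    return -1;
}
```

A command that cannot be executed surfaces as exit code 127, which the supervisor can treat as a launch failure rather than a crash worth retrying.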

3.3 Non-Functional Requirements

Performance: Avoid blocking I/O; batch writes when possible.
Reliability: Handle partial reads/writes and cleanly recover from disconnects.
Usability: Provide clear CLI errors, deterministic output, and helpful logs.

3.4 Example Usage / Output

    $ ./proc_supervisor -- ./long_task
    [supervisor] child pid=5012
    ^C
    [supervisor] forwarding SIGINT
    [supervisor] child exited status=130
    [supervisor] restarting after 500ms
    [exit code: 0]

    $ ./proc_supervisor -- ./missing_cmd
    [error] exec failed: no such file
    [exit code: 127]

3.5 Data Formats / Schemas / Protocols

    Supervisor log format: ISO timestamp, event, pid, status.
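A formatter for that line could look like the sketch below. The field order follows the format above; the space separators, the `pid=` and `status=` keys, and the function name are assumptions made for illustration. Passing the timestamp in as a parameter keeps the output deterministic in tests.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <sys/types.h>
#include <time.h>

/* Write "<ISO-8601 UTC time> <event> pid=<pid> status=<status>" into buf.
 * Returns the snprintf result, or -1 if the time cannot be formatted. */
int format_log_line(char *buf, size_t len, time_t when,
                    const char *event, pid_t pid, int status)
{
    struct tm tm_utc;
    char stamp[32];

    if (gmtime_r(&when, &tm_utc) == NULL)
        return -1;
    if (strftime(stamp, sizeof stamp, "%Y-%m-%dT%H:%M:%SZ", &tm_utc) == 0)
        return -1;
    return snprintf(buf, len, "%s %s pid=%ld status=%d",
                    stamp, event, (long)pid, status);
}
```

Using `gmtime_r` rather than `localtime_r` means the output does not depend on the TZ setting of the machine running the tests.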

3.6 Edge Cases

Child exec fails
Rapid crash loop
SIGTERM during restart sleep
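The third edge case can be handled with an interruptible sleep. The sketch below assumes POSIX and uses hypothetical names; it gives up the sleep as soon as a stop flag set by the SIGTERM handler is seen.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <time.h>

static volatile sig_atomic_t stop_requested = 0;

static void on_term(int sig)
{
    (void)sig;
    stop_requested = 1;
}

/* Install the SIGTERM handler without SA_RESTART so nanosleep is interrupted. */
int install_term_handler(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_term;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGTERM, &sa, NULL);
}

/* Sleep for `ms` milliseconds between restarts.
 * Returns 1 if the full delay elapsed, 0 if a stop request (or an
 * unexpected nanosleep error) cut it short. */
int backoff_sleep(long ms)
{
    struct timespec ts = { ms / 1000, (ms % 1000) * 1000000L };
    while (!stop_requested) {
        if (nanosleep(&ts, &ts) == 0)
            return 1;
        if (errno != EINTR)
            return 0;
        /* EINTR: ts holds the remaining time; loop re-checks the flag */
    }
    return 0;
}
```

There is still a small window between checking the flag and entering `nanosleep` in which a signal would not shorten the sleep. Blocking the signal and waiting with `ppoll` or `sigsuspend`, or using a self-pipe, closes that window.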

3.7 Real World Outcome

This section defines a deterministic, repeatable outcome. Use fixed inputs and set TZ=UTC where time appears.

3.7.1 How to Run (Copy/Paste)

    make
    ./proc_supervisor -- ./long_task

3.7.2 Golden Path Demo (Deterministic)

The “success” demo below is a fixed scenario with a known outcome. It should always match.

3.7.3 If CLI: provide an exact terminal transcript

    $ ./proc_supervisor -- ./long_task
    [supervisor] child pid=5012
    ^C
    [supervisor] forwarding SIGINT
    [supervisor] child exited status=130
    [supervisor] restarting after 500ms
    [exit code: 0]

Failure Demo (Deterministic)

    $ ./proc_supervisor -- ./missing_cmd
    [error] exec failed: no such file
    [exit code: 127]

3.7.8 If TUI

At least one ASCII layout for the UI:

    +---------------------------------+
    | Signal-Aware Process Supervisor |
    | [content area]                  |
    | [status / hints]                |
    +---------------------------------+

4. Solution Architecture

4.1 High-Level Design

    +----------------+     +-----------------+     +---------------------+
    | Signal handler | --> | Supervisor loop | <-> | Child process group |
    +----------------+     +-----------------+     +---------------------+

4.2 Key Components

Component	Responsibility	Key Decisions
Supervisor loop	Tracks child state and restarts.	Use a simple state machine.
Signal handler	Wakes main loop on SIGCHLD or SIGTERM.	Set atomic flags only.
Backoff controller	Implements restart delays.	Use monotonic clock for timing.

4.3 Data Structures (No Full Code)

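    typedef struct {
        pid_t child_pid;    /* PID of the current child */
        int restarts;       /* restarts performed so far */
        int max_restarts;   /* give up after this many restarts */
        int backoff_ms;     /* delay before the next restart, in milliseconds */
    } SupervisorState;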

4.4 Algorithm Overview

Key Algorithm: Supervision with exponential backoff

Fork child and setpgid.
Main loop waits for signals or timers.
On SIGCHLD, reap child and inspect status.
If crash, sleep backoff and restart.
If normal exit, terminate supervisor.

Complexity Analysis:

Time O(restarts); Space O(1).
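The delay calculation behind the "sleep backoff" step can be sketched as a pure function, which makes it easy to unit test. The cap parameter `max_ms` is an assumption added here; the algorithm overview does not specify one.

```c
/* Exponential backoff: base_ms * 2^attempt, capped at max_ms.
 * `attempt` is 0 for the first restart. Doubling stops before it could
 * exceed the cap, so the multiplication cannot overflow. */
long backoff_delay_ms(long base_ms, int attempt, long max_ms)
{
    long delay = base_ms;
    for (int i = 0; i < attempt; i++) {
        if (delay > max_ms / 2)
            return max_ms;       /* doubling would exceed the cap */
        delay *= 2;
    }
    return delay < max_ms ? delay : max_ms;
}
```

With a base of 500 ms and a cap of 8000 ms, successive attempts give 500, 1000, 2000, 4000, and then 8000 for every later attempt.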

5. Implementation Guide

5.1 Development Environment Setup

    cc --version
    make --version

5.2 Project Structure

    proc-supervisor/
    |-- src/
    |   |-- main.c
    |   `-- signals.c
    |-- include/
    |   `-- signals.h
    `-- Makefile

5.3 The Core Question You’re Answering

“How do you manage child lifecycles and signals in a long-lived server?”

5.4 Concepts You Must Understand First

SIGCHLD
- Why it matters and how it impacts correctness.
signal forwarding
- Why it matters and how it impacts correctness.
restart policies
- Why it matters and how it impacts correctness.

5.5 Questions to Guide Your Design

How do you avoid race conditions between SIGCHLD and waitpid?
What restart policy prevents tight crash loops?
Should SIGTERM stop restarts immediately?

5.6 Thinking Exercise

Write a state machine for running -> crashed -> restarting -> running.
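One way to start the exercise is a pure transition function. The sketch below is only one possible answer: the event names and the extra terminal state `ST_STOPPED` are assumptions added so the machine has a way to end.

```c
/* States from the exercise, plus a hypothetical terminal state. */
typedef enum { ST_RUNNING, ST_CRASHED, ST_RESTARTING, ST_STOPPED } SupState;

/* Hypothetical events that drive the machine. */
typedef enum {
    EV_CHILD_OK,           /* child exited with status 0 */
    EV_CHILD_FAILED,       /* child exited non-zero or was killed */
    EV_RETRY_ALLOWED,      /* restart budget not yet exhausted */
    EV_RETRIES_EXHAUSTED,  /* restart budget used up */
    EV_BACKOFF_ELAPSED,    /* restart delay finished */
    EV_TERM_REQUESTED      /* SIGINT/SIGTERM received by the supervisor */
} SupEvent;

/* Return the next state; unknown (state, event) pairs leave it unchanged. */
SupState next_state(SupState s, SupEvent e)
{
    if (e == EV_TERM_REQUESTED)
        return ST_STOPPED;

    switch (s) {
    case ST_RUNNING:
        if (e == EV_CHILD_OK)          return ST_STOPPED;
        if (e == EV_CHILD_FAILED)      return ST_CRASHED;
        break;
    case ST_CRASHED:
        if (e == EV_RETRY_ALLOWED)     return ST_RESTARTING;
        if (e == EV_RETRIES_EXHAUSTED) return ST_STOPPED;
        break;
    case ST_RESTARTING:
        if (e == EV_BACKOFF_ELAPSED)   return ST_RUNNING;
        break;
    case ST_STOPPED:
        break;
    }
    return s;
}
```

Keeping the transitions free of side effects lets you test the policy without forking any processes. Whether a termination request should stop restarts immediately, as it does here, is one of the design questions in Section 5.5.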

5.7 The Interview Questions They’ll Ask

Difference between SIGTERM and SIGKILL?
How do you reap zombies?

5.8 Hints in Layers

Use waitpid(pid, &status, WNOHANG) in a loop.
Use a self-pipe to wake poll.
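The self-pipe hint might be sketched as follows, assuming POSIX. The function names are hypothetical and only SIGCHLD is wired up. The handler writes one byte to a non-blocking pipe, and the main loop polls the read end.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

static int wake_fd[2] = { -1, -1 };   /* [0] read end, [1] write end */

static void on_sigchld(int sig)
{
    int saved_errno = errno;           /* the handler must not clobber errno */
    unsigned char b = (unsigned char)sig;
    ssize_t n = write(wake_fd[1], &b, 1);   /* write() is async-signal-safe */
    (void)n;                           /* a full pipe already means "wake up" */
    errno = saved_errno;
}

/* Create the pipe, make both ends non-blocking, install the handler. */
int self_pipe_init(void)
{
    if (pipe(wake_fd) < 0)
        return -1;
    for (int i = 0; i < 2; i++) {
        int fl = fcntl(wake_fd[i], F_GETFL);
        if (fl < 0 || fcntl(wake_fd[i], F_SETFL, fl | O_NONBLOCK) < 0)
            return -1;
    }
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sigchld;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGCHLD, &sa, NULL);
}

/* Wait up to timeout_ms for a wakeup.
 * Returns 1 if a signal arrived, 0 on timeout, -1 on error. */
int wait_for_wakeup(int timeout_ms)
{
    struct pollfd pfd = { .fd = wake_fd[0], .events = POLLIN };
    for (;;) {
        int n = poll(&pfd, 1, timeout_ms);
        if (n < 0 && errno == EINTR)
            continue;                  /* retry; the byte is already queued */
        if (n <= 0)
            return n;
        char buf[64];
        while (read(wake_fd[0], buf, sizeof buf) > 0)
            ;                          /* drain so the next poll blocks again */
        return 1;
    }
}
```

After `wait_for_wakeup` returns 1, the main loop would call waitpid with WNOHANG in a loop. A production version would also mark both descriptors close-on-exec so the child does not inherit them, and would track the remaining timeout across EINTR retries.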

5.9 Books That Will Help

Topic	Book	Chapter
Signals	Advanced Programming in the UNIX Environment	Ch. 10

5.10 Implementation Phases

Phase 1: Foundation (1 week)

Goals:

Establish the core data structures and loop.
Prove that launching a child and basic logging work.

Tasks:

Implement the core structs and minimal main loop.
Add logging for key events and errors.

Checkpoint: You can run the tool and see deterministic output.

Phase 2: Core Functionality (1 week)

Goals:

Implement the main requirements and pass basic tests.
Integrate with OS primitives.

Tasks:

Implement remaining functional requirements.
Add error handling and deterministic test fixtures.

Checkpoint: All functional requirements are met for the golden path.

Phase 3: Polish & Edge Cases (1 week)

Goals:

Handle edge cases and improve UX.
Make sure the wait loop never busy-polls.

Tasks:

Add edge-case handling and exit codes.
Improve logs and documentation.

Checkpoint: Failure demos behave exactly as specified.

5.11 Key Implementation Decisions

Decision	Options	Recommendation	Rationale
I/O model	blocking vs non-blocking	non-blocking	avoids stalls in multiplexed loops
Logging	text vs binary	text for v1	easier to inspect and debug

6. Testing Strategy

6.1 Test Categories

Category	Purpose	Examples
Unit Tests	Validate components	backoff delay calculation, exit-status decoding
Integration Tests	Validate interactions	end-to-end CLI flow
Edge Case Tests	Handle boundary conditions	exec failure, rapid crash loop, SIGTERM during restart sleep

6.2 Critical Test Cases

Child crash triggers restart.
SIGINT forwards to child group.
Normal exit stops supervisor.

6.3 Test Data

Run supervisor on a script that exits with status 1 every 2 seconds.

7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

7.2 Debugging Strategies

Log all state transitions with timestamps.
Use ps to verify process groups.

7.3 Performance Traps

Busy-looping on WNOHANG without sleep.

8. Extensions & Challenges

8.1 Beginner Extensions

Add --max-restarts flag.
Add --backoff-ms flag.

8.2 Intermediate Extensions

Add jitter to backoff to avoid sync storms.
Persist restart counters.

8.3 Advanced Extensions

Add health-check hooks before marking healthy.

9. Real-World Connections

9.1 Industry Applications

Service supervisors
Container init processes

9.2 Related Open Source Projects

systemd
runit

9.3 Interview Relevance

Signal handling, process lifecycle management, and state machines are common interview topics.

10. Resources

10.1 Essential Reading

Advanced Programming in the UNIX Environment by W. Richard Stevens - Ch. 10

10.2 Video Resources

Signals and process lifecycle (lecture).

10.3 Tools & Documentation

ps: inspect PIDs and process groups
kill: send signals to a process or process group
strace: trace system calls and signal delivery

10.4 Related Projects in This Series

Project 3: Unix Domain Socket Chat - Builds prerequisites
Project 5: Event-Driven I/O Multiplexer - Extends these ideas

11. Self-Assessment Checklist

11.1 Understanding

I can explain the core concept without notes
I can explain how input becomes output in this tool
I can explain the main failure modes

11.2 Implementation

All functional requirements are met
All test cases pass
Code is clean and well-documented
Edge cases are handled

11.3 Growth

I can identify one thing I’d do differently next time
I’ve documented lessons learned
I can explain this project in a job interview

12. Submission / Completion Criteria

Minimum Viable Completion:

Tool runs and passes the golden-path demo
Deterministic output matches expected snapshot
Failure demo returns the correct exit code

Full Completion:

All minimum criteria plus:
Edge cases handled and tested
Documentation covers usage and troubleshooting

Excellence (Going Above & Beyond):

Add at least one advanced extension
Provide a performance profile and improvement notes
