Project 4: Signal-Aware Process Supervisor
Build a supervisor that launches a child, forwards signals, and restarts it on failure.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Level 3: Advanced |
| Time Estimate | 1 week |
| Main Programming Language | C |
| Alternative Programming Languages | Rust, Go |
| Coolness Level | Level 3: Systems Hardening |
| Business Potential | 2: The “Ops Utility” |
| Prerequisites | signals and handlers, waitpid, process groups |
| Key Topics | SIGCHLD, signal forwarding, restart policies |
1. Learning Objectives
By completing this project, you will:
- Build a working implementation of a signal-aware process supervisor and verify it with deterministic outputs.
- Explain the underlying Unix and terminal primitives involved in the project.
- Diagnose common failure modes with logs and targeted tests.
- Extend the project with performance and usability improvements.
2. All Theory Needed (Per-Concept Breakdown)
Signal Propagation and Process Supervision
-
Fundamentals Signal propagation and process supervision form the core contract that makes the project behave like a real supervisor. The supervisor sits at the boundary between asynchronous operating-system events (signals and child state changes) and its own structured state, so you must treat the topic as both a protocol and a data model. The goal of the fundamentals is to understand what assumptions the system makes about ordering and ownership, and how those assumptions surface as user-visible behavior. Key terms include: SIGCHLD, process group, waitpid, backoff, zombie. In practice, the fastest way to gain intuition is to trace a single signal from delivery to its effect on the child and note where it can be delayed, coalesced, or lost. That exercise reveals why supervision needs explicit invariants and why even small mistakes can cascade into zombies, orphaned children, or crash loops.
-
Deep Dive into the concept A deep understanding of Signal Propagation and Process Supervision requires thinking in terms of state transitions and invariants. You are not just implementing functions; you are enforcing a contract between the supervisor, the child, and the kernel, and that contract persists across time. Most failures in this area are caused by violating ordering guarantees, dropping state updates, or misunderstanding how the operating system delivers events. This concept is built from the following pillars: SIGCHLD, process group, waitpid, backoff, zombie.
A reliable implementation follows a deterministic flow: start the child in its own process group -> install signal handlers that set flags -> use waitpid to reap the child on SIGCHLD -> forward SIGINT/SIGTERM to the child group -> restart or exit based on policy.
From a systems perspective, the tricky part is coordinating concurrency without introducing races. Even in a single-threaded loop, multiple events can arrive in the same tick, so you need deterministic ordering. This is why many implementations keep a strict sequence on every wakeup: check the signal flags, reap children, update state, then act (forward, restart, or exit).
Another subtlety is error handling and recovery. A robust design treats errors as part of the normal control flow: interrupted system calls (EINTR) are expected, a child that has already exited when you signal it is expected, and transient failures must be retried or gracefully handled. You also need a way to observe the system, because without logs and trace points you cannot reason about correctness.
When you design the project, treat each key term as a source of constraints. For example, if a term implies a delay (backoff), decide the initial delay, how it grows, and when it resets. If a term implies state (the child pid, the restart count), decide how that state is initialized, updated, and reset. Finally, validate your assumptions with deterministic fixtures so you can reproduce bugs.
-
How this fits into the project This concept is the backbone of the project because it defines how signals and child state changes move through the supervisor.
-
Definitions & key terms
- SIGCHLD -> signal delivered when a child changes state
- process group -> a job-control grouping for signal delivery
- waitpid -> system call to reap child status and avoid zombies
- backoff -> delay strategy between restarts to avoid crash loops
- zombie -> a terminated process that has not been reaped
-
Mental model diagram (ASCII)
[Input] -> [Signal Propagation and Process Supervision] -> [State] -> [Output]
-
How it works (step-by-step, with invariants and failure modes)
- Start child in its own process group.
- Install signal handlers that set flags.
- Use waitpid to reap child on SIGCHLD.
- Forward SIGINT/SIGTERM to child group.
- Restart or exit based on policy.
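The first step above (start the child in its own process group) can be sketched as follows. This is a minimal illustration, not the project's required design: `spawn_in_own_group` is a hypothetical helper name, and the choice of exit code 127 for a failed exec follows the failure demo in Section 3.4.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork, move the child into a new process group,
 * then exec argv. Returns the child's pid, or -1 if fork() fails. */
pid_t spawn_in_own_group(char *const argv[])
{
    assert(argv != NULL && argv[0] != NULL);

    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: become the leader of a new process group, then exec. */
        setpgid(0, 0);
        execvp(argv[0], argv);
        _exit(127); /* only reached if exec failed */
    }

    /* Parent: set the group as well, so it exists no matter which side
     * runs first. This call may fail once the child has already exec'd;
     * that is harmless because the child made the same call itself. */
    (void)setpgid(pid, pid);
    return pid;
}
```

Calling `setpgid` in both parent and child is a common way to avoid depending on scheduling order; after the call returns, the parent can safely signal the group.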
-
Minimal concrete example
kill(-child_pgid, SIGTERM); // send to process group
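A slightly fuller sketch of the one-liner above wraps it in a guard. The function name `forward_signal` is an assumption made for illustration. The guard exists because `kill(0, ...)` targets the caller's own process group and `kill(-1, ...)` targets every process the caller may signal, so an uninitialized pgid must never reach `kill`.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: forward signo to every process in child_pgid.
 * Returns 0 on success, -1 with errno set on failure. */
int forward_signal(pid_t child_pgid, int signo)
{
    /* Refuse pgid 0 and 1: negating them would signal our own group
     * or (for 1) turn into kill(-1, ...), which hits everything. */
    if (child_pgid <= 1) {
        errno = EINVAL;
        return -1;
    }
    return kill(-child_pgid, signo); /* negative pid = process group */
}
```

A return of -1 with `errno == ESRCH` means the group no longer exists, which a supervisor would normally treat as "child already gone" and not as a fatal error.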
-
Common misconceptions
- “SIGCHLD always means the child exited” -> it can mean stopped/continued too.
- “waitpid once is enough” -> you must loop to reap all children.
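Both misconceptions can be made concrete with a short sketch. The enum and function names below are hypothetical. `classify_status` shows that a wait status can describe a stop or a continue as well as an exit, and `reap_all` shows the drain loop, which is needed because several child state changes may be reported by a single SIGCHLD.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

typedef enum {
    CHILD_EXITED,    /* called exit() or returned from main */
    CHILD_KILLED,    /* terminated by a signal */
    CHILD_STOPPED,   /* stopped, e.g. by SIGSTOP or SIGTSTP */
    CHILD_CONTINUED, /* resumed by SIGCONT */
    CHILD_UNKNOWN
} ChildEvent;

/* Map a raw wait status to the kind of state change it describes. */
ChildEvent classify_status(int status)
{
    if (WIFEXITED(status))
        return CHILD_EXITED;
    if (WIFSIGNALED(status))
        return CHILD_KILLED;
    if (WIFSTOPPED(status))
        return CHILD_STOPPED;
#ifdef WIFCONTINUED
    if (WIFCONTINUED(status))
        return CHILD_CONTINUED;
#endif
    return CHILD_UNKNOWN;
}

/* Drain all pending state changes without blocking.
 * Returns how many children actually terminated and were reaped. */
int reap_all(void)
{
    int reaped = 0;
    int status;

    while (waitpid(-1, &status, WNOHANG | WUNTRACED) > 0) {
        ChildEvent ev = classify_status(status);
        if (ev == CHILD_EXITED || ev == CHILD_KILLED)
            reaped++;
    }
    return reaped;
}
```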
-
Check-your-understanding questions
- Why use a process group when forwarding signals?
- How do you avoid zombies?
- What is a safe action inside a signal handler?
-
Check-your-understanding answers
- It ensures the entire job receives the signal.
- Call waitpid in the main loop until no children remain.
- Set flags or write to a self-pipe; avoid malloc/printf.
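The "set flags" answer can be sketched like this. The flag names and `install_handlers` are assumptions for illustration; the essential points are that the handler only assigns to `volatile sig_atomic_t` variables and that handlers are installed with `sigaction`.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>

/* Flags written by the handler and read by the main loop. */
volatile sig_atomic_t got_sigchld = 0;
volatile sig_atomic_t got_sigterm = 0;
volatile sig_atomic_t got_sigint = 0;

/* Async-signal-safe handler: record which signal arrived, nothing else. */
static void on_signal(int signo)
{
    if (signo == SIGCHLD)
        got_sigchld = 1;
    else if (signo == SIGTERM)
        got_sigterm = 1;
    else if (signo == SIGINT)
        got_sigint = 1;
}

/* Install the same handler for the three signals the supervisor cares
 * about. Returns 0 on success, -1 on failure. */
int install_handlers(void)
{
    const int signals[] = { SIGCHLD, SIGTERM, SIGINT };
    struct sigaction sa;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_signal;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0; /* no SA_RESTART: blocking calls return EINTR */

    for (size_t i = 0; i < sizeof signals / sizeof signals[0]; i++) {
        if (sigaction(signals[i], &sa, NULL) < 0)
            return -1;
    }
    return 0;
}
```

Leaving out `SA_RESTART` is a design choice in this sketch: it lets a blocking call in the main loop return with `EINTR` so the loop can look at the flags promptly.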
-
Real-world applications
- daemons
- service managers
- tmux server child management
-
Where you’ll apply it
- See Section 3.2 Functional Requirements and Section 5.4 Concepts You Must Understand First.
- Also used in: Project 3: Unix Domain Socket Chat, Project 5: Event-Driven I/O Multiplexer.
-
References
- APUE Ch. 10
- TLPI Ch. 34
-
Key insights Signal Propagation and Process Supervision works best when you treat it as a stateful contract with explicit invariants.
-
Summary You now have a concrete mental model for Signal Propagation and Process Supervision and can explain how it affects correctness and usability.
-
Homework/Exercises to practice the concept
- Write a supervisor that restarts /bin/false with backoff.
-
Solutions to the homework/exercises
- Use nanosleep and a restart counter.
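One possible shape for that solution is sketched below. The function names, the doubling of the delay, and the return convention are all assumptions; the exercise only asks for nanosleep and a restart counter. Signal handling is left out to keep the sketch short.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Sleep for ms milliseconds, resuming if a signal interrupts the sleep. */
void sleep_ms(long ms)
{
    struct timespec ts = { ms / 1000, (ms % 1000) * 1000000L };
    while (nanosleep(&ts, &ts) == -1 && errno == EINTR)
        ; /* ts now holds the remaining time; keep sleeping */
}

/* Run argv, restarting it after each failure with a doubling delay.
 * Returns 0 if a run exited with status 0, 1 if max_restarts was used
 * up, -1 on a system error. *restarts_out receives the restart count. */
int restart_with_backoff(char *const argv[], int max_restarts,
                         long backoff_ms, int *restarts_out)
{
    int restarts = 0;

    for (;;) {
        pid_t pid = fork();
        if (pid < 0)
            return -1;
        if (pid == 0) {
            execvp(argv[0], argv);
            _exit(127); /* exec failed */
        }

        int status;
        while (waitpid(pid, &status, 0) < 0) {
            if (errno != EINTR)
                return -1;
        }

        *restarts_out = restarts;
        if (WIFEXITED(status) && WEXITSTATUS(status) == 0)
            return 0; /* clean exit: stop supervising */
        if (restarts >= max_restarts)
            return 1; /* give up */

        sleep_ms(backoff_ms);
        backoff_ms *= 2;
        restarts++;
    }
}
```

With `/bin/false` as the command and `max_restarts` set to 3, this sketch starts the command four times in total (the first run plus three restarts) and then gives up.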
3. Project Specification
3.1 What You Will Build
A CLI supervisor that launches a child command, forwards signals to its process group, restarts on crash with backoff, and exits with a clear status.
3.2 Functional Requirements
- Requirement 1: Launch child process in its own process group.
- Requirement 2: Forward SIGINT/SIGTERM to the child group.
- Requirement 3: Detect child exit via SIGCHLD and waitpid.
- Requirement 4: Restart on crash with backoff and max retries.
- Requirement 5: Exit cleanly if child exits normally with status 0.
3.3 Non-Functional Requirements
- Performance: Avoid blocking I/O; batch writes when possible.
- Reliability: Handle partial reads/writes and cleanly recover from disconnects.
- Usability: Provide clear CLI errors, deterministic output, and helpful logs.
3.4 Example Usage / Output
$ ./proc_supervisor -- ./long_task
[supervisor] child pid=5012
^C
[supervisor] forwarding SIGINT
[supervisor] child exited status=130
[supervisor] restarting after 500ms
[exit code: 0]
$ ./proc_supervisor -- ./missing_cmd
[error] exec failed: no such file
[exit code: 127]
3.5 Data Formats / Schemas / Protocols
Supervisor log format: ISO timestamp, event, pid, status.
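A sketch of a formatter for such a line follows. The source fixes only the four fields and their order; the `key=value` layout, the UTC "Z" suffix, and the name `format_log_line` are assumptions chosen for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <time.h>

/* Hypothetical formatter: writes
 *   <ISO-8601 UTC timestamp> event=<event> pid=<pid> status=<status>
 * into buf. Returns the snprintf result, or -1 if the time conversion
 * fails. */
int format_log_line(char *buf, size_t len, time_t when,
                    const char *event, pid_t pid, int status)
{
    struct tm tm_utc;
    char ts[32];

    if (gmtime_r(&when, &tm_utc) == NULL)
        return -1;
    if (strftime(ts, sizeof ts, "%Y-%m-%dT%H:%M:%SZ", &tm_utc) == 0)
        return -1;

    return snprintf(buf, len, "%s event=%s pid=%ld status=%d",
                    ts, event, (long)pid, status);
}
```

Passing the timestamp in as a parameter, instead of calling `time()` inside the formatter, keeps test output deterministic.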
3.6 Edge Cases
- Child exec fails
- Rapid crash loop
- SIGTERM during restart sleep
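The third edge case (SIGTERM during the restart sleep) can be handled by making the sleep interruptible. The sketch below assumes a flag named `stop_requested` that a SIGTERM handler would set; both names are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <time.h>

/* Assumed to be set to 1 by the SIGTERM handler. */
volatile sig_atomic_t stop_requested = 0;

/* Sleep for ms milliseconds unless a stop is requested.
 * Returns 0 if the full delay elapsed, 1 if a stop request cut the
 * sleep short, -1 on an unexpected error. */
int backoff_sleep(long ms)
{
    struct timespec ts = { ms / 1000, (ms % 1000) * 1000000L };

    while (!stop_requested) {
        if (nanosleep(&ts, &ts) == 0)
            return 0;  /* slept the whole delay */
        if (errno != EINTR)
            return -1;
        /* Interrupted by a signal: ts holds the remaining time.
         * Re-check the flag before sleeping again. */
    }
    return 1;
}
```

This sketch still has a small window: a signal that arrives after the flag check but before `nanosleep` starts will not shorten the sleep. Blocking the signal and waiting with a self-pipe or `sigsuspend` closes that window.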
3.7 Real World Outcome
This section defines a deterministic, repeatable outcome. Use fixed inputs and set TZ=UTC where time appears.
3.7.1 How to Run (Copy/Paste)
make
./proc_supervisor -- ./long_task
3.7.2 Golden Path Demo (Deterministic)
The “success” demo below is a fixed scenario with a known outcome. It should always match.
3.7.3 Exact Terminal Transcript (CLI)
$ ./proc_supervisor -- ./long_task
[supervisor] child pid=5012
^C
[supervisor] forwarding SIGINT
[supervisor] child exited status=130
[supervisor] restarting after 500ms
[exit code: 0]
Failure Demo (Deterministic)
$ ./proc_supervisor -- ./missing_cmd
[error] exec failed: no such file
[exit code: 127]
3.7.8 If TUI
At least one ASCII layout for the UI:
+---------------------------------+
| Signal-Aware Process Supervisor |
| [content area]                  |
| [status / hints]                |
+---------------------------------+
4. Solution Architecture
4.1 High-Level Design
+-----------+     +------------+     +-------------+
|  Signals  | --> | Supervisor | <-> | Child group |
+-----------+     +------------+     +-------------+
4.2 Key Components
| Component | Responsibility | Key Decisions |
|---|---|---|
| Supervisor loop | Tracks child state and restarts. | Use a simple state machine. |
| Signal handler | Wakes main loop on SIGCHLD or SIGTERM. | Set atomic flags only. |
| Backoff controller | Implements restart delays. | Use monotonic clock for timing. |
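4.3 Data Structures (No Full Code)
typedef struct {
    pid_t child_pid;   /* pid of the currently running child */
    int restarts;      /* restarts performed so far */
    int max_restarts;  /* give up after this many restarts */
    int backoff_ms;    /* delay before the next restart, in milliseconds */
} SupervisorState;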
4.4 Algorithm Overview
Key Algorithm: Supervision with exponential backoff
- Fork child and setpgid.
- Main loop waits for signals or timers.
- On SIGCHLD, reap child and inspect status.
- If crash, sleep backoff and restart.
- If normal exit, terminate supervisor.
Complexity Analysis:
- Time O(restarts); Space O(1).
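The backoff step of the algorithm can be sketched as a pure function over the state struct from the Data Structures section. The 30-second cap (`BACKOFF_CAP_MS`) and the function name are assumptions; the source asks only for exponential backoff and a retry limit.

```c
#include <assert.h>
#include <sys/types.h>

typedef struct {
    pid_t child_pid;
    int restarts;
    int max_restarts;
    int backoff_ms;
} SupervisorState;

#define BACKOFF_CAP_MS 30000 /* assumed ceiling, not from the spec */

/* Returns the delay (ms) to wait before the next restart and advances
 * the state, or -1 if the retry budget is exhausted. */
int next_restart_delay(SupervisorState *s)
{
    if (s->restarts >= s->max_restarts)
        return -1;

    int delay = s->backoff_ms > BACKOFF_CAP_MS ? BACKOFF_CAP_MS
                                               : s->backoff_ms;
    s->restarts++;
    /* Double for next time, clamping before the multiply so the
     * stored value can never overflow. */
    s->backoff_ms = delay > BACKOFF_CAP_MS / 2 ? BACKOFF_CAP_MS
                                               : delay * 2;
    return delay;
}
```

Starting from `backoff_ms = 500` with `max_restarts = 3`, successive calls return 500, 1000, 2000, and then -1.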
5. Implementation Guide
5.1 Development Environment Setup
cc --version
make --version
5.2 Project Structure
proc-supervisor/
|-- src/
| |-- main.c
| `-- signals.c
|-- include/
| `-- signals.h
`-- Makefile
5.3 The Core Question You’re Answering
“How do you manage child lifecycles and signals in a long-lived server?”
5.4 Concepts You Must Understand First
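- SIGCHLD
- It is how the supervisor learns that the child changed state; if you miss it, exits go unnoticed and zombies accumulate.
- signal forwarding
- A signal sent to the supervisor must reach the whole child process group; otherwise parts of the job keep running.
- restart policies
- Backoff and a retry limit decide whether a crash becomes a recovery or a tight crash loop.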
5.5 Questions to Guide Your Design
- How do you avoid race conditions between SIGCHLD and waitpid?
- What restart policy prevents tight crash loops?
- Should SIGTERM stop restarts immediately?
5.6 Thinking Exercise
Write a state machine for running -> crashed -> restarting -> running.
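One way to check your answer is to encode the cycle as a transition function. The state names follow the exercise; the event names are hypothetical, and a real supervisor would likely add a terminal "stopped" state, which this sketch leaves out.

```c
#include <assert.h>

typedef enum { ST_RUNNING, ST_CRASHED, ST_RESTARTING } State;

/* Hypothetical events that drive the cycle. */
typedef enum {
    EV_CHILD_CRASHED,   /* waitpid reported an abnormal exit */
    EV_BACKOFF_STARTED, /* restart delay has been scheduled */
    EV_CHILD_STARTED    /* new child was forked successfully */
} Event;

/* Pure transition function: events that do not apply in the current
 * state leave the state unchanged. */
State step(State s, Event e)
{
    switch (s) {
    case ST_RUNNING:
        return e == EV_CHILD_CRASHED ? ST_CRASHED : s;
    case ST_CRASHED:
        return e == EV_BACKOFF_STARTED ? ST_RESTARTING : s;
    case ST_RESTARTING:
        return e == EV_CHILD_STARTED ? ST_RUNNING : s;
    }
    return s;
}
```

Keeping the transition function free of side effects makes it easy to unit-test separately from fork, signals, and timers.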
5.7 The Interview Questions They’ll Ask
- Difference between SIGTERM and SIGKILL?
- How do you reap zombies?
5.8 Hints in Layers
- Use waitpid(pid, &status, WNOHANG) in a loop.
- Use a self-pipe to wake poll.
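The self-pipe hint can be sketched as follows. The names `wake_pipe_init` and `wake_pipe_wait` are assumptions, and the sketch handles a single signal number for brevity. The handler writes one byte to a non-blocking pipe, and the main loop polls the read end, so a signal turns into an ordinary readable file descriptor.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

static int wake_fds[2] = { -1, -1 }; /* [0] = read end, [1] = write end */

/* Handler: write(2) is async-signal-safe. A full pipe (EAGAIN) is fine
 * because a wakeup is already pending in that case. */
static void wake_handler(int signo)
{
    int saved_errno = errno;
    unsigned char b = (unsigned char)signo;
    ssize_t r = write(wake_fds[1], &b, 1);
    (void)r;
    errno = saved_errno;
}

/* Create the pipe, make both ends non-blocking, install the handler. */
int wake_pipe_init(int signo)
{
    struct sigaction sa;

    if (pipe(wake_fds) < 0)
        return -1;
    for (int i = 0; i < 2; i++) {
        int fl = fcntl(wake_fds[i], F_GETFL);
        if (fl < 0 || fcntl(wake_fds[i], F_SETFL, fl | O_NONBLOCK) < 0)
            return -1;
    }
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = wake_handler;
    sigemptyset(&sa.sa_mask);
    return sigaction(signo, &sa, NULL);
}

/* Wait up to timeout_ms. Returns the signal number that woke us,
 * 0 on timeout, -1 on error. */
int wake_pipe_wait(int timeout_ms)
{
    struct pollfd pfd = { .fd = wake_fds[0], .events = POLLIN };

    for (;;) {
        int rc = poll(&pfd, 1, timeout_ms);
        if (rc < 0 && errno == EINTR)
            continue; /* handler just ran; retry (restarts the timeout) */
        if (rc <= 0)
            return rc;
        unsigned char b;
        return read(wake_fds[0], &b, 1) == 1 ? (int)b : -1;
    }
}
```

Because the byte stays in the pipe until it is read, a signal that arrives just before `poll` is entered is not lost, which is the race this technique is meant to close.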
5.9 Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Signals | Advanced Programming in the UNIX Environment | Ch. 10 |
5.10 Implementation Phases
Phase 1: Foundation (1 week)
Goals:
- Establish the core data structures and loop.
- Prove that launching a child and basic logging work.
Tasks:
- Implement the core structs and minimal main loop.
- Add logging for key events and errors.
Checkpoint: You can run the tool and see deterministic output.
Phase 2: Core Functionality (1 week)
Goals:
- Implement the main requirements and pass basic tests.
- Integrate with OS primitives.
Tasks:
- Implement remaining functional requirements.
- Add error handling and deterministic test fixtures.
Checkpoint: All functional requirements are met for the golden path.
Phase 3: Polish & Edge Cases (1 week)
Goals:
- Handle edge cases and improve UX.
- Tune backoff timing and log output.
Tasks:
- Add edge-case handling and exit codes.
- Improve logs and documentation.
Checkpoint: Failure demos behave exactly as specified.
5.11 Key Implementation Decisions
| Decision | Options | Recommendation | Rationale |
|---|---|---|---|
| I/O model | blocking vs non-blocking | non-blocking | avoids stalls in multiplexed loops |
| Logging | text vs binary | text for v1 | easier to inspect and debug |
6. Testing Strategy
6.1 Test Categories
| Category | Purpose | Examples |
|---|---|---|
| Unit Tests | Validate components | backoff controller, wait-status decoding, state machine |
| Integration Tests | Validate interactions | end-to-end CLI flow |
| Edge Case Tests | Handle boundary conditions | exec failure, rapid crash loop, SIGTERM during restart sleep |
6.2 Critical Test Cases
- Child crash triggers restart.
- SIGINT forwards to child group.
- Normal exit stops supervisor.
6.3 Test Data
Run supervisor on a script that exits with status 1 every 2 seconds.
7. Common Pitfalls & Debugging
7.1 Frequent Mistakes
| Pitfall | Symptom | Solution |
|---|---|---|
| Zombie processes | Child remains defunct | Reap with waitpid in a loop. |
| Signal storms | Multiple SIGCHLDs | Use flags and drain waitpid. |
| Restart thrash | CPU spikes | Add exponential backoff. |
7.2 Debugging Strategies
- Log all state transitions with timestamps.
- Use ps to verify process groups.
7.3 Performance Traps
- Busy-looping on WNOHANG without sleep.
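One way to avoid that busy loop is to keep SIGCHLD blocked and wait for it with `sigsuspend`, polling `waitpid(..., WNOHANG)` only after a signal has actually arrived. The function names below are hypothetical, and this is a sketch of one technique, not the only option (a self-pipe with poll works too).

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static volatile sig_atomic_t sigchld_seen = 0;

static void on_sigchld(int signo)
{
    (void)signo;
    sigchld_seen = 1;
}

/* Install the handler and block SIGCHLD, so it is delivered only while
 * we sit inside sigsuspend(). *wait_mask receives the mask to pass to
 * sigsuspend(). Call this before forking the child. */
int sigchld_setup(sigset_t *wait_mask)
{
    struct sigaction sa;
    sigset_t block;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sigchld;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGCHLD, &sa, NULL) < 0)
        return -1;

    sigemptyset(&block);
    sigaddset(&block, SIGCHLD);
    if (sigprocmask(SIG_BLOCK, &block, wait_mask) < 0)
        return -1;
    sigdelset(wait_mask, SIGCHLD); /* make sure the wait mask lets it in */
    return 0;
}

/* Sleep until pid terminates, without spinning. Returns pid and fills
 * *status on success, -1 on error. */
pid_t wait_for_exit(pid_t pid, const sigset_t *wait_mask, int *status)
{
    for (;;) {
        pid_t r = waitpid(pid, status, WNOHANG);
        if (r != 0)
            return r; /* reaped, or error */
        while (!sigchld_seen)
            sigsuspend(wait_mask); /* atomically unblock and sleep */
        sigchld_seen = 0; /* SIGCHLD is blocked again here: no race */
    }
}
```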
8. Extensions & Challenges
8.1 Beginner Extensions
- Add --max-restarts flag.
- Add --backoff-ms flag.
8.2 Intermediate Extensions
- Add jitter to backoff to avoid sync storms.
- Persist restart counters.
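For the jitter extension, one common scheme keeps half of the delay fixed and randomizes the other half. The sketch below assumes that scheme and a hypothetical function name; it uses POSIX `rand_r` with a caller-owned seed so tests stay reproducible.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>

/* Return a delay in [base_ms / 2, base_ms]: half fixed, half random.
 * seed is caller-owned state for rand_r(). */
int jittered_delay_ms(int base_ms, unsigned int *seed)
{
    assert(seed != NULL);
    if (base_ms <= 0)
        return 0;

    int half = base_ms / 2;
    unsigned int spread = (unsigned int)(base_ms - half);

    /* rand_r() returns a value in [0, RAND_MAX]; reduce it to
     * [0, spread] and add it to the fixed half. */
    return half + (int)((unsigned int)rand_r(seed) % (spread + 1u));
}
```

Seeding each supervisor differently (for example from its pid) is what actually desynchronizes restarts across instances; a shared constant seed would make them all pick the same delays.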
8.3 Advanced Extensions
- Add health-check hooks before marking healthy.
9. Real-World Connections
9.1 Industry Applications
- Service supervisors
- Container init processes
9.2 Related Open Source Projects
- systemd
- runit
9.3 Interview Relevance
- Signal handling, process lifecycle management, and state machines are common interview topics.
10. Resources
10.1 Essential Reading
- Advanced Programming in the UNIX Environment by W. Richard Stevens - Ch. 10
10.2 Video Resources
- Signals and process lifecycle (lecture).
10.3 Tools & Documentation
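- ps: inspect process state and process groups
- kill: send signals to a process or process group
- strace: trace system calls and signal delivery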
10.4 Related Projects in This Series
- Project 3: Unix Domain Socket Chat - Builds prerequisites
- Project 5: Event-Driven I/O Multiplexer - Extends these ideas
11. Self-Assessment Checklist
11.1 Understanding
- I can explain the core concept without notes
- I can explain how input becomes output in this tool
- I can explain the main failure modes
11.2 Implementation
- All functional requirements are met
- All test cases pass
- Code is clean and well-documented
- Edge cases are handled
11.3 Growth
- I can identify one thing I’d do differently next time
- I’ve documented lessons learned
- I can explain this project in a job interview
12. Submission / Completion Criteria
Minimum Viable Completion:
- Tool runs and passes the golden-path demo
- Deterministic output matches expected snapshot
- Failure demo returns the correct exit code
Full Completion:
- All minimum criteria plus:
- Edge cases handled and tested
- Documentation covers usage and troubleshooting
Excellence (Going Above & Beyond):
- Add at least one advanced extension
- Provide a performance profile and improvement notes