Project 2: Mini Process Supervisor
Build a minimal init-like supervisor that loads unit-style configs, resolves dependencies, starts services, reaps children, and enforces restart policies.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Level 4: Advanced |
| Time Estimate | 2-4 weeks |
| Main Programming Language | C |
| Alternative Programming Languages | Rust, Go |
| Coolness Level | Level 5: Hardcore Systems Nerd |
| Business Potential | Level 3: Infrastructure Core |
| Prerequisites | fork/exec, signals, IPC, graphs, config parsing |
| Key Topics | process lifecycle, dependency graphs, supervision and restart logic |
1. Learning Objectives
By completing this project, you will:
- Implement a service state machine with explicit transitions.
- Resolve dependency graphs and build deterministic start transactions.
- Correctly handle SIGCHLD and reap children without zombies.
- Enforce restart policies with rate limiting and timeouts.
- Provide a CLI that explains service state and failures.
2. All Theory Needed (Per-Concept Breakdown)
Concept 1: Process Lifecycle, Signals, and Reaping
Fundamentals
A supervisor is a parent process responsible for starting and monitoring child processes. On Unix, this is built on fork(), execve(), and waitpid(). If the parent does not reap children, they become zombies and consume process table entries. Signals deliver asynchronous events such as child exit (SIGCHLD) or termination requests (SIGTERM). A reliable supervisor must install signal handlers, detect exits, and update state without missing events. Because signal handlers can run at any time, you must follow async-signal-safe rules and avoid unsafe operations inside handlers. If you understand process lifecycle, signals, and reaping, you can build a supervisor that is stable under load and behaves predictably during crashes and restarts.
Deep Dive into the Concept
The lifecycle begins with fork(), which creates a child process with a copy of the parent’s memory and file descriptor table. The child then calls execve() to replace itself with the service binary. The parent must record the child’s PID, set up bookkeeping, and continue running. The parent is responsible for handling SIGCHLD, which is delivered when a child exits or changes state. If a parent never calls waitpid(), the kernel retains the child’s exit status and PID in the process table, producing a zombie. An init-like supervisor may create and manage many children, so even a small leak is catastrophic.
Signal handling is subtle. A signal handler can interrupt a blocking system call and must only call async-signal-safe functions. The canonical design is the self-pipe or eventfd pattern: the handler writes a byte to a pipe, and the main loop uses poll() or select() to wake up and run safe logic. Another modern option is signalfd, which converts signals into file descriptor events and avoids handler restrictions. Either way, you must ensure that no SIGCHLD event is lost. The safe pattern is to reap in a loop: while ((pid = waitpid(-1, &st, WNOHANG)) > 0) { ... }. One SIGCHLD may represent multiple child exits.
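A minimal sketch of the main-loop side of this pattern, assuming the SIGCHLD handler writes a byte to the write end of a pipe and `sigpipe_rd` is the non-blocking read end; `sigpipe_rd` and `on_child_exit()` are illustrative names, not part of the spec:

```c
/* Main-loop side of the self-pipe pattern (sketch; names illustrative).
 * Assumes sigpipe_rd is the non-blocking read end of the pipe the
 * SIGCHLD handler writes to, and on_child_exit() updates bookkeeping. */
#include <poll.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

extern int sigpipe_rd;
void on_child_exit(pid_t pid, int status);

static void run_event_loop(void) {
    struct pollfd pfd = { .fd = sigpipe_rd, .events = POLLIN };
    for (;;) {
        if (poll(&pfd, 1, -1) < 0)
            continue;                       /* interrupted by a signal: retry */
        char buf[64];
        while (read(sigpipe_rd, buf, sizeof buf) > 0)
            ;                               /* drain wake-up bytes */
        /* One SIGCHLD can stand for several exits: reap until none remain. */
        int status;
        pid_t pid;
        while ((pid = waitpid(-1, &status, WNOHANG)) > 0)
            on_child_exit(pid, status);
    }
}
```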
Signal semantics also include process groups and session control. If your supervisor launches a service, you may want to place it in its own process group so you can send signals to the whole group for shutdown. If you ignore this, a service that spawns children may survive after you think it has stopped. For long-running daemons, you often want to send SIGTERM, wait with a timeout, then SIGKILL. This sequence must be deterministic and logged.
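A sketch of that stop sequence, assuming each child called setpgid(0, 0) between fork() and execve() so its process group ID equals its PID; the polling wait and the grace period are illustrative simplifications of what a real event loop would do with timers:

```c
/* Stopping a service's whole process group (sketch; the grace period and
 * polling loop are illustrative). Assumes the child called setpgid(0, 0)
 * before execve(), so its pgid equals its pid. */
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static void stop_service_group(pid_t pgid, int grace_seconds) {
    kill(-pgid, SIGTERM);                   /* negative pid signals the whole group */
    for (int i = 0; i < grace_seconds * 10; i++) {
        int status;
        if (waitpid(pgid, &status, WNOHANG) == pgid)
            return;                         /* leader exited within the grace period */
        usleep(100 * 1000);                 /* crude polling; a real loop uses timers */
    }
    kill(-pgid, SIGKILL);                   /* escalate deterministically and log it */
}
```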
PID reuse is a hidden failure mode. If a service exits and the PID is quickly reused by another process, a naive supervisor might send signals to the wrong process or misreport state. Linux provides pidfd_open to obtain a file descriptor that refers to a specific process instance. If available, using pidfds avoids PID reuse races. If pidfds are not available, you can store additional metadata like start time from /proc/<pid>/stat and verify it before acting.
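A sketch of the pidfd approach, assuming Linux 5.3 or later; the raw syscall is used in case the libc in use does not ship a pidfd_open() wrapper, and the helper names are illustrative:

```c
/* Referring to a child by pidfd to avoid PID-reuse races (sketch).
 * Assumes Linux >= 5.3; the raw syscall is used in case the libc
 * does not provide a pidfd_open() wrapper. */
#define _GNU_SOURCE
#include <poll.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

static int open_pidfd(pid_t pid) {
    return (int)syscall(SYS_pidfd_open, pid, 0);    /* -1 with errno on failure */
}

static int pidfd_exited(int pidfd) {
    /* A pidfd becomes readable once the process it refers to has exited. */
    struct pollfd pfd = { .fd = pidfd, .events = POLLIN };
    return poll(&pfd, 1, 0) == 1;
}
```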
Another subtlety is exec failure. fork() can succeed, but execve() can fail due to missing binaries, permission issues, or missing libraries. You must detect this and treat it as a start failure. A common strategy is to create a pipe between parent and child; the child writes an error code if execve() fails. If the parent does not receive a success signal in time, it marks the start as failed.
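A sketch of that strategy using a close-on-exec pipe; spawn_checked() and the single-integer error protocol are illustrative choices, not part of the spec:

```c
/* Reporting execve() failure back to the parent (sketch). The pipe is
 * close-on-exec, so a successful exec closes it and the parent reads EOF;
 * on failure the child writes errno before exiting. */
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

static pid_t spawn_checked(char *const argv[], int *exec_errno) {
    int efd[2];
    *exec_errno = 0;
    if (pipe2(efd, O_CLOEXEC) < 0)
        return -1;
    pid_t pid = fork();
    if (pid < 0) {
        close(efd[0]); close(efd[1]);
        return -1;
    }
    if (pid == 0) {                         /* child */
        close(efd[0]);
        execvp(argv[0], argv);
        int err = errno;                    /* exec failed: tell the parent why */
        (void)!write(efd[1], &err, sizeof err);
        _exit(127);
    }
    close(efd[1]);                          /* parent keeps only the read end */
    (void)!read(efd[0], exec_errno, sizeof *exec_errno);  /* EOF means exec succeeded */
    close(efd[0]);
    return pid;
}
```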
Finally, consider PID 1 semantics. The kernel does not apply default signal dispositions to PID 1, so it only receives signals for which it has installed handlers; if your supervisor runs as a normal process, you still need to mimic the important aspects: reaping, signal forwarding, and clean shutdown. A good supervisor has a clear state machine and logs each transition with a reason, making debugging practical. Also account for SIGPIPE and inherited signal masks so services start with predictable signal behavior.
How this fits into the project
This concept powers your core event loop and state transitions. You will use it in Section 3.2, Section 4.4, and Section 5.10 Phase 2.
Definitions & key terms
- Zombie -> Exited child process that has not been reaped.
- SIGCHLD -> Signal delivered when a child changes state.
- Process group -> A group of processes that can receive signals together.
- pidfd -> A file descriptor that refers to a specific process instance.
- Self-pipe -> A pattern to safely handle signals in an event loop.
Mental model diagram (ASCII)
supervisor (parent)
  |
  +-- fork --> child
  |              |
  |              +-- execve(service)
  |
  +-- SIGCHLD -> self-pipe -> event loop -> waitpid()
How it works (step-by-step)
- Parent forks a child and records the PID.
- Child execs the service binary.
- Parent enters event loop and waits for SIGCHLD events.
- On SIGCHLD, parent reaps children in a loop.
- Parent updates service state and applies restart policy.
Invariants: Every child exit is reaped exactly once; state transitions are logged.
Failure modes: missing exec, lost SIGCHLD events, PID reuse, or zombie accumulation.
Minimal concrete example
static void on_sigchld(int sig) {
(void)sig;
write(sigpipe_fd, "x", 1); /* async-signal-safe */
}
Common misconceptions
- “Ignoring SIGCHLD is fine” -> You lose exit status and leak zombies.
- “waitpid() once is enough” -> Multiple children can exit at once.
- “PIDs are stable” -> They can be reused quickly.
Check-your-understanding questions
- Why must you reap children in a loop?
- What operations are unsafe inside signal handlers?
- How does pidfd prevent PID reuse bugs?
- Why should a service be placed in a process group?
Check-your-understanding answers
- One SIGCHLD can represent multiple exits; loop reaps all.
- Memory allocation, stdio, and most non-async-safe functions.
- It ties actions to a specific process instance, not a numeric PID.
- To signal all child processes during shutdown.
Real-world applications
- systemd, runit, s6, supervisord, and container init processes.
- Service managers in embedded systems and appliances.
Where you’ll apply it
- This project: Section 3.2, Section 4.4, Section 5.10 Phase 2.
- Also used in: P06-container-runtime-systemd-integration.md for PID 1 behavior.
References
- “Advanced Programming in the UNIX Environment” (signals and process chapters).
- man 2 fork, man 2 execve, man 2 waitpid, man 2 signalfd.
Key insights
Supervision is mostly about correctly handling asynchronous child exits.
Summary
If you can fork, exec, and reap reliably under signal pressure, you can build a stable supervisor.
Homework/exercises to practice the concept
- Write a program that forks three children and reaps them correctly.
- Simulate an exec failure and propagate the error to the parent.
- Implement a self-pipe signal handler.
Solutions to the homework/exercises
- Use waitpid(-1, &st, WNOHANG) in a loop.
- Use a pipe and write errno from the child if exec fails.
- Create a pipe and write one byte in the handler.
Concept 2: Dependency Graphs and Transactions
Fundamentals
Services depend on each other. A dependency graph captures these relationships and allows a supervisor to start services in a safe order. If web.service depends on db.service, the database must start first. This is modeled as a directed edge. A topological sort converts the graph into an ordered list. If there is a cycle, ordering is impossible and the supervisor must refuse to start. A transaction is the validated plan: which services will be started, which are already running, and which will be skipped or failed. This concept makes your supervisor deterministic and explains start failures. It also guides clean shutdown ordering in reverse.
Deep Dive into the Concept
A dependency graph is a directed graph where each node is a service and each edge represents a requirement. The simplest model is a strict requires edge: if the dependency fails, the dependent fails. You can extend the model with soft dependencies (wants) that do not block startup on failure. For a minimal supervisor, start with strict dependencies and add soft dependencies later.
Topological sorting can be implemented with Kahn’s algorithm. Compute in-degree for each node, push all nodes with in-degree zero into a queue, and repeatedly remove nodes, decreasing in-degree of neighbors. If you process all nodes, you have a valid order. If not, there is a cycle. When you detect a cycle, you should report it with the chain of services involved. This is critical for usability; otherwise, users cannot fix their configs.
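A compact sketch of Kahn's algorithm over an adjacency matrix; the fixed-size arrays and the adj[i][j] convention ("service i requires service j") are simplifications for illustration:

```c
/* Kahn's algorithm over a small adjacency matrix (sketch; fixed-size arrays
 * for brevity). adj[i][j] = 1 means service i requires service j, so j must
 * appear before i in the start order. */
#define MAXN 64

/* Fills order[] with a valid start order and returns how many nodes were
 * ordered; a result smaller than n means the leftover nodes form a cycle. */
static int topo_order(int n, const int adj[MAXN][MAXN], int order[MAXN]) {
    int indeg[MAXN] = {0}, queue[MAXN], qh = 0, qt = 0, k = 0;
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            if (adj[i][j]) indeg[i]++;      /* one unmet requirement per edge */
    for (int i = 0; i < n; i++)
        if (indeg[i] == 0) queue[qt++] = i; /* dependency-free nodes start first */
    while (qh < qt) {
        int j = queue[qh++];
        order[k++] = j;                     /* all of j's requirements are satisfied */
        for (int i = 0; i < n; i++)
            if (adj[i][j] && --indeg[i] == 0)
                queue[qt++] = i;
    }
    return k;
}
```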
A transaction is more than an order. It is a plan that includes skip logic. If a dependency is already active, you may skip starting it. If a dependency is optional and fails, you continue; if it is required and fails, you stop. The transaction can also handle restart requests: if web.service is already running but a dependency is restarted, you may need to restart web.service depending on policy. This is optional for your project but helps you think like a real service manager.
Graph representation matters. Use adjacency lists for outgoing edges and a reverse adjacency list for incoming edges. The reverse edges allow you to answer “what depends on this service?” and are useful for error propagation or stop ordering. For deterministic results, sort node names before ordering so that the output is stable and testable.
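One possible node layout that keeps both edge directions; the field names and the fixed MAXDEPS bound are illustrative, not part of the spec:

```c
/* Graph node with forward and reverse adjacency lists (sketch). */
#define MAXDEPS 16

struct unit_node {
    char name[64];
    int  requires[MAXDEPS];      /* forward edges: indices of units this unit needs */
    int  n_requires;
    int  required_by[MAXDEPS];   /* reverse edges: who depends on this unit */
    int  n_required_by;
};
```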
Another important aspect is error reporting. If a dependency is missing, the supervisor should stop early with a clear error that names the missing unit and the unit that required it. If multiple independent subgraphs exist, you can start them in parallel, but you should still provide a deterministic ordering for logs and tests. A transaction can also incorporate policy decisions such as “skip if already active” or “restart if stale,” which are beyond minimal scope but help you model systemd behavior. Even in a simplified model, define explicit semantics for “optional” edges and make those semantics visible in output so users can reason about startup order.
Edge cases include missing nodes and cyclic dependencies. Missing nodes should be treated as configuration errors and surfaced clearly. Cycles should be reported with the chain, not just a generic error. Another edge case is independent subgraphs: two groups of services can be started in parallel. Your supervisor can start them in sorted order, but the algorithm should not require them to be in a single chain.
How this fits into the project
Dependency resolution is what makes your supervisor a real init-like system rather than a script. You will use it in Section 3.2, Section 4.4, and Section 5.10 Phase 1.
Definitions & key terms
- Topological sort -> An ordering where dependencies come first.
- In-degree -> Number of incoming edges for a node.
- Cycle -> A dependency loop that prevents ordering.
- Transaction -> A validated plan of start/stop jobs.
- Soft dependency -> A dependency that does not fail the parent.
Mental model diagram (ASCII)
web.service --> db.service --> network.target
     |
     +--------> cache.service
How it works (step-by-step)
- Parse configs and build graph nodes and edges.
- Collect the subgraph needed for the requested service.
- Run topological sort on that subgraph.
- Start services in order, skipping those already active.
- If a required dependency fails, abort dependents.
Invariants: Each dependency starts before its dependent in the transaction.
Failure modes: cycles, missing dependencies, or non-deterministic ordering.
Minimal concrete example
web.service: requires db.service, cache.service
cache.service: requires network.target
Common misconceptions
- “Ordering and dependency are the same” -> Ordering alone does not pull in services.
- “Cycles are rare” -> Misconfigured services commonly create them.
- “Start order does not matter” -> Race conditions appear immediately.
Check-your-understanding questions
- What does a topological sort guarantee?
- How do you detect a dependency cycle?
- Why should you keep reverse edges?
- What is the difference between a required and optional dependency?
Check-your-understanding answers
- Every dependency appears before its dependent.
- Kahn’s algorithm leaves unprocessed nodes when a cycle exists.
- It helps propagate failures and explain blast radius.
- Required dependencies fail the parent; optional ones do not.
Real-world applications
- Boot ordering and service orchestration in init systems.
- Build systems and CI pipelines with dependency graphs.
Where you’ll apply it
- This project: Section 3.2, Section 4.4, Section 5.10 Phase 1.
- Also used in: P01-service-health-dashboard.md for dependency visualization.
References
- “Algorithms” by Sedgewick (graph chapters).
- systemd.unit documentation (dependency directives).
Key insights
Deterministic startup requires explicit dependency ordering, not intuition.
Summary
Dependency graphs translate declarative configs into a safe start order and failure model.
Homework/exercises to practice the concept
- Implement Kahn’s algorithm for a 6-node graph.
- Create a cycle and confirm your algorithm detects it.
- Add optional dependencies and test that failures do not block startup.
Solutions to the homework/exercises
- Use a queue of zero in-degree nodes and remove edges iteratively.
- If processed nodes < total nodes, a cycle exists.
- Mark edges as soft and log warnings on failure.
Concept 3: Supervision, Restart Policies, and Rate Limiting
Fundamentals
A supervisor must decide when to restart a service. If a service crashes, Restart=on-failure might restart it; if it exits cleanly, it may remain stopped. Without limits, a failing service can restart in a tight loop, consuming CPU and spamming logs. systemd prevents this using start limits: a maximum number of restarts within an interval. Your mini supervisor should implement a similar mechanism. You also need timeouts so that a service that never becomes ready does not hang your start transaction forever. These policies turn raw exit codes into reliable behavior. They also prevent cascading failures when dependencies are unstable.
Deep Dive into the Concept
Restart policies translate exit status into actions. A basic policy set includes no, always, and on-failure. always restarts on any exit; on-failure restarts on non-zero exit or signal; no never restarts. The supervisor must record how the process ended using waitpid() status macros: WIFEXITED and WEXITSTATUS for normal exits, WIFSIGNALED and WTERMSIG for signal exits. This allows you to decide if the exit should trigger a restart.
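A sketch of that classification step; the enum and parameter names are illustrative, not mandated by the spec:

```c
/* Classifying a waitpid() status for the restart policy (sketch). */
#include <sys/wait.h>

enum exit_kind { EXIT_CLEAN, EXIT_CODE, EXIT_SIGNAL };

static enum exit_kind classify_exit(int status, int *detail) {
    if (WIFEXITED(status)) {
        *detail = WEXITSTATUS(status);
        return *detail == 0 ? EXIT_CLEAN : EXIT_CODE;
    }
    if (WIFSIGNALED(status)) {
        *detail = WTERMSIG(status);          /* e.g. SIGSEGV or SIGKILL */
        return EXIT_SIGNAL;
    }
    *detail = 0;                             /* stop/continue events are not exits */
    return EXIT_CODE;
}
```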
Rate limiting prevents restart storms. Implement a sliding window of restart timestamps. If the count in the last interval exceeds StartLimitBurst, mark the service as failed and stop restarting until manual intervention. This is similar to systemd’s StartLimitIntervalSec and StartLimitBurst. It should be deterministic: with a fixed clock input, the same failures produce the same restart decisions. For testability, allow a fake time source.
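A sketch of a sliding-window limiter with an injectable clock; the struct layout and MAX_BURST bound are illustrative, and the field names loosely mirror the start_limit_burst/start_limit_interval keys used later in the spec:

```c
/* Sliding-window start limit with an injectable time source (sketch;
 * burst must not exceed MAX_BURST). */
#include <time.h>

#define MAX_BURST 16

struct start_limit {
    time_t (*now)(void);          /* real clock in production, fake clock in tests */
    time_t stamps[MAX_BURST];     /* recent start attempts inside the window */
    int count, burst, interval_sec;
};

static int start_allowed(struct start_limit *sl) {
    time_t t = sl->now();
    int kept = 0;
    for (int i = 0; i < sl->count; i++)      /* drop entries older than the window */
        if (t - sl->stamps[i] < sl->interval_sec)
            sl->stamps[kept++] = sl->stamps[i];
    sl->count = kept;
    if (sl->count >= sl->burst)
        return 0;                            /* limit hit: caller enters cooldown/failed */
    sl->stamps[sl->count++] = t;             /* record this attempt */
    return 1;
}
```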
Timeouts protect against hung startups. When a service enters activating, start a timer. If the service does not report ready (or you have no readiness signal) before the timeout, treat it as failed. For a minimal supervisor, the timeout can simply be a maximum start time; for more advanced behavior, you can define readiness conditions like a pidfile or a socket check.
State machine design is crucial. Define explicit states such as inactive, activating, active, failed, and cooldown. Each event transitions between states. For example, “start requested” moves inactive -> activating; “exec succeeded” may move to active; “exit non-zero” moves to failed and may trigger a restart. A clear state machine makes the system testable and understandable.
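One way to make those transitions explicit is a table-like transition function; the state and event names below are illustrative and omit timers and cooldown expiry:

```c
/* Explicit state/event transitions (sketch). Unlisted pairs are rejected so
 * the caller can log the invalid event and keep the current state. */
enum svc_state { ST_INACTIVE, ST_ACTIVATING, ST_ACTIVE, ST_FAILED, ST_COOLDOWN };
enum svc_event { EV_START_REQ, EV_EXEC_OK, EV_EXIT_OK, EV_EXIT_FAIL, EV_LIMIT_HIT, EV_STOP_REQ };

static int next_state(enum svc_state s, enum svc_event e, enum svc_state *out) {
    switch (s) {
    case ST_INACTIVE:   if (e == EV_START_REQ) { *out = ST_ACTIVATING; return 1; } break;
    case ST_ACTIVATING: if (e == EV_EXEC_OK)   { *out = ST_ACTIVE;     return 1; }
                        if (e == EV_EXIT_FAIL) { *out = ST_FAILED;     return 1; } break;
    case ST_ACTIVE:     if (e == EV_EXIT_OK)   { *out = ST_INACTIVE;   return 1; }
                        if (e == EV_EXIT_FAIL) { *out = ST_FAILED;     return 1; }
                        if (e == EV_STOP_REQ)  { *out = ST_INACTIVE;   return 1; } break;
    case ST_FAILED:     if (e == EV_START_REQ) { *out = ST_ACTIVATING; return 1; }
                        if (e == EV_LIMIT_HIT) { *out = ST_COOLDOWN;   return 1; } break;
    case ST_COOLDOWN:   if (e == EV_START_REQ) { *out = ST_ACTIVATING; return 1; } break;
    }
    return 0;   /* invalid transition */
}
```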
Backoff strategies are another layer. Instead of restarting immediately, you can add a delay or exponential backoff to reduce thrashing. This also gives time for dependencies to recover. Record the last failure reason and expose it in status output. When a service is in cooldown, report the remaining delay so operators know when it will try again. These small details make your supervisor feel professional and predictable. They also reduce noise in logs and help you distinguish between a transient failure and a persistent misconfiguration.
Finally, consider manual stops. If a user requests stop, you should not treat the service as failed or attempt to restart it. This means you need a flag to distinguish crashes from operator-intended stops. Without this, your supervisor will fight the operator.
How this fits into the project
Restart logic is the reliability core of the supervisor. You will use it in Section 3.2, Section 4.4, Section 5.10 Phase 2, and Section 7.
Definitions & key terms
- Restart policy -> Rules for when to restart a service.
- Start limit -> Burst and interval that cap restart storms.
- Timeout -> Maximum allowed time for startup or shutdown.
- Flapping -> Repeated crash-restart cycles.
Mental model diagram (ASCII)
inactive --> activating --> active
                 ^   |          |
    auto-restart |   v          v
                 +-- failed <---+
                       |
                       +--> cooldown (start limit hit)
How it works (step-by-step)
- Start service and mark activating.
- If it exits with failure, record timestamp.
- If restart policy allows and start limit not exceeded, restart.
- If limit exceeded, enter cooldown or failed-permanent.
- If stopped manually, do not restart.
Invariants: Restart decisions are based on exit status and policy.
Failure modes: infinite restart loops, timeouts, or misclassifying manual stops.
Minimal concrete example
restart=on-failure
start_limit_burst=3
start_limit_interval=60s
Common misconceptions
- “Always restart is safe” -> It can create infinite loops.
- “Timeouts are only for slow services” -> They also catch deadlocks.
- “Exit 0 always means success” -> A long-running daemon that exits at all, even with status 0, may still be a failure.
Check-your-understanding questions
- Why do restart storms happen?
- How do you decide whether a crash should restart?
- How does start limit differ from restart policy?
- Why track manual stops separately?
Check-your-understanding answers
- A failing service restarts immediately without limits.
- Inspect exit status and apply policy rules.
- Restart policy decides if you restart; start limit caps how often.
- Manual stops should not be treated as failures.
Real-world applications
- Service reliability and high-availability systems.
- Container restarters and job runners.
Where you’ll apply it
- This project: Section 3.2, Section 4.4, Section 5.10 Phase 2.
- Also used in: P01-service-health-dashboard.md for health mapping.
References
- systemd.service documentation on Restart and StartLimit.
- “The Linux Programming Interface” (process control chapters).
Key insights
Supervision is safe only when restarts are controlled by policy and limits.
Summary
Restart policies, timeouts, and start limits transform raw exits into reliable behavior.
Homework/exercises to practice the concept
- Simulate a failing service and implement restart limits.
- Add a startup timeout and kill the process on timeout.
- Log the reason when restarts are disabled.
Solutions to the homework/exercises
- Track restart timestamps and stop after N failures.
- Set a timer when entering activating and enforce it.
- Log “StartLimit hit; manual intervention required”.
Concept 4: Supervision Loops, Restart Policies, and Rate Limiting
Fundamentals
A process supervisor is more than a launcher; it is a feedback loop that keeps services running and stable. That means you need to detect failure, decide whether to restart, and prevent restart storms that can melt a system. Restart policies encode when a service should be restarted (on-failure, always, on-abnormal), while rate limiting determines how many restarts are allowed within a time window. A robust supervisor also needs timeouts for startup and graceful shutdown so it can declare a service failed when it hangs. These concepts are core to systemd, and implementing them forces you to think like an operator: reliability is not only “can it start” but “can it recover safely without making things worse”.
Deep Dive into the Concept
A supervision loop monitors child process state, reacts to exits, and applies policy. The fundamental building block is the SIGCHLD signal (or waitpid in an event loop) that tells you when a child changes state. When a service exits, you must classify the exit: normal exit (status 0), non-zero exit, signal termination, or timeout. systemd uses this to decide whether the unit result is success, exit-code, signal, core-dump, or timeout. Your supervisor needs a similar classification to implement meaningful policy.
Restart policy is a decision function: given the exit classification, should the service be restarted, and if so, after what delay? Common policies include no (never restart), on-failure (restart on non-zero exit or signal), always (restart regardless), and on-abnormal (restart on signal or core dump). You can model this with an enum and a predicate. The delay is often a fixed value (e.g., 1s) but many supervisors implement exponential backoff to avoid rapid flapping. Backoff is especially important for services that fail immediately due to configuration errors; without it, your supervisor will spin, consuming CPU and filling logs.
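A sketch of the policy as a predicate over the exit classification; the enum values mirror the policies named in the text, and the flag parameters are illustrative:

```c
/* Restart policy as a small decision function (sketch). */
enum restart_policy { RESTART_NO, RESTART_ON_FAILURE, RESTART_ALWAYS, RESTART_ON_ABNORMAL };

static int should_restart(enum restart_policy p, int exited_cleanly, int killed_by_signal) {
    switch (p) {
    case RESTART_ALWAYS:      return 1;
    case RESTART_ON_FAILURE:  return !exited_cleanly;        /* non-zero exit or signal */
    case RESTART_ON_ABNORMAL: return killed_by_signal;       /* signals and core dumps only */
    case RESTART_NO:
    default:                  return 0;
    }
}
```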
Rate limiting adds an explicit guardrail: allow at most N restarts within a sliding window of M seconds. If the limit is exceeded, the service is marked failed and is not restarted until manually reset. This is how systemd prevents a restart storm. Implementing this requires tracking timestamps of recent restarts. A simple approach is a ring buffer of timestamps; on each restart attempt, drop entries older than M seconds and compare the count with N. If exceeded, you set the unit to failed and emit a clear reason such as “start limit hit”. In a real system, this is crucial: a crash loop can take down critical resources like CPU, disk, or journald itself.
Time-based supervision also includes startup and stop timeouts. A service might fork and hang without ever becoming ready. If you treat process spawn as “started” you can mask failures. systemd addresses this with Type=notify and watchdogs, but in a minimal supervisor you can implement a simple startup timeout: after spawning the process, wait for a readiness signal (e.g., a file or pipe message) or for the process to stay alive for a minimum period. If the timeout expires, treat it as failure. Stop timeouts are equally important: if a process ignores SIGTERM, you should escalate to SIGKILL after a grace period. This introduces a state machine with timers: stopping transitions to killed if the timeout expires. Without this, a supervisor can hang forever on a misbehaving service.
Supervision is also about observability. A supervisor should record restart counts, last exit reason, and the time since last start. This information is essential for debugging. It should also publish clear logs when it decides to restart or to stop retrying. In systemd, these are visible as unit properties and journal entries. In your mini supervisor, you can print to stderr or write to a structured log file. The key is that policy decisions should be transparent; operators need to know why a service did not restart.
Finally, the supervisor must be careful about race conditions. If a service exits while you are in the middle of a restart sequence, you might process stale state. Use a consistent state machine and ensure that child reaping logic updates the state atomically. One common approach is to funnel all events (SIGCHLD, timers, user commands) into a single event loop and apply a deterministic state transition function. This prevents inconsistent states like “active” with no child process. The loop itself is the heart of supervision: it is the engine that keeps the system converging toward the desired state.
How this fits into the project
This concept drives your state machine design and restart behavior. You will implement it in Section 3.2 (Functional Requirements: restart policies), Section 4.2 (Key Components: supervisor loop), Section 5.10 (Phase 2), and Section 7.1 (restart storms). It also informs the failure demo and exit codes in Section 3.7.3.
Definitions & key terms
- Restart policy -> Rules determining when a service should be restarted.
- Rate limiting -> Restricting restart frequency to prevent flapping.
- Start limit -> The maximum number of starts allowed in a time window.
- Backoff -> Increasing delay between retries to reduce load.
- Graceful stop -> Sending SIGTERM and waiting before SIGKILL.
Mental model diagram (ASCII)
              +------- failure -------+
              |                       v
inactive -> starting -> active ----> failed
                 ^          |           |
                 |          +-- stop -> inactive
                 +--- cooldown/retry <--+
How it works (step-by-step)
- Spawn the service and record start time.
- Wait for SIGCHLD or readiness signal.
- On exit, classify the result (success, signal, timeout).
- Apply policy: restart? backoff? or mark failed.
- Update restart counters and timestamps.
- If rate limit exceeded, stop restarting and report.
Invariants: A service in active state has a running process.
Failure modes: Crash loops, stuck shutdown, or missed SIGCHLD.
Minimal concrete example
if (exit_code != 0 || signaled) {
if (within_start_limit()) {
sleep(restart_delay);
restart();
} else {
mark_failed("start limit hit");
}
}
Common misconceptions
- “Restart always improves reliability” -> Blind restarts can hide bugs and cause storms.
- “A non-zero exit is always a failure” -> Some oneshot services use non-zero for control flow.
- “SIGKILL is fine for cleanup” -> It skips cleanup and can corrupt state.
Check-your-understanding questions
- Why is rate limiting required for restarts?
- How do you classify an exit caused by SIGSEGV?
- What is the purpose of a stop timeout?
- When would you use exponential backoff?
Check-your-understanding answers
- To prevent infinite restart loops that consume resources.
- It is a signal-based failure, typically classified as signal or core-dump.
- To avoid hanging forever when a service ignores SIGTERM.
- When repeated immediate failures would otherwise hammer the system.
Real-world applications
- Supervising daemons in embedded systems.
- Ensuring critical services restart safely after crashes.
- Building internal init systems for specialized appliances.
Where you’ll apply it
- This project: Section 3.2, Section 4.2, Section 5.10 Phase 2, Section 7.1.
- Also used in: P06-container-runtime-systemd-integration.md for restart behavior of container init processes.
References
- systemd.service(5) restart policy documentation.
- “The Linux Programming Interface” chapters on process control.
- “Operating Systems: Three Easy Pieces” supervision and failure chapters.
Key insights
Supervision is a control loop; restarts are a policy decision, not a reflex.
Summary
A reliable supervisor classifies exits, enforces restart limits, and uses timeouts to avoid hangs and storms.
Homework/exercises to practice the concept
- Implement a restart counter with a 5-restarts-per-minute limit.
- Add a stop timeout that escalates from SIGTERM to SIGKILL.
- Simulate a crash loop and verify your supervisor stops restarting.
Solutions to the homework/exercises
- Keep a list of timestamps and drop entries older than 60 seconds before counting.
- Use alarm or an event loop timer; send SIGKILL if the process does not exit after SIGTERM.
- Create a service that exits immediately; observe the restart limit trigger.
3. Project Specification
3.1 What You Will Build
A CLI tool called minisysd that:
- Reads service definitions from a config directory.
- Builds a dependency graph.
- Starts services in dependency order.
- Tracks process state and restarts on failure.
- Provides start, stop, status, and list commands.
Included: config parsing, dependency ordering, supervision, restart limits.
Excluded: full systemd unit syntax, socket/timer units, cgroups.
3.2 Functional Requirements
- Config Loader: Read service definitions with name, command, dependencies.
- Dependency Resolver: Topologically sort and detect cycles.
- Supervisor Loop: Fork/exec, signal handling, reaping.
- Restart Policies: on-failure, always, never.
- Start Limits: burst/interval with cooldown state.
- CLI: start/stop/status/list with clear messages.
- State Reporting: PID, state, last exit code, last restart time.
- Determinism: optional fake clock for tests.
3.3 Non-Functional Requirements
- Performance: Handle 100+ services with low overhead.
- Reliability: No zombie processes; all exits captured.
- Usability: Clear error output for cycles and failures.
3.4 Example Usage / Output
$ minisysd start webapp
[mini] starting db.service
[mini] starting cache.service
[mini] starting webapp.service
[mini] webapp active (pid 4821)
$ minisysd status
webapp.service: active (pid 4821)
cache.service: active (pid 4722)
3.5 Data Formats / Schemas / Protocols
Service config (YAML):
name: webapp.service
command: "/usr/local/bin/webapp"
requires:
- db.service
- cache.service
restart: on-failure
start_limit_burst: 3
start_limit_interval: 60
3.6 Edge Cases
- Dependency cycles.
- Services that exit instantly.
- PID reuse when services restart quickly.
- Permission errors when launching commands.
3.7 Real World Outcome
3.7.1 How to Run (Copy/Paste)
make
./minisysd start webapp.service
./minisysd status
3.7.2 Golden Path Demo (Deterministic)
- Set MINISYSD_FAKE_TIME=2026-01-01T12:00:00Z.
- Use test configs in examples/.
3.7.3 If CLI: exact terminal transcript
$ MINISYSD_FAKE_TIME=2026-01-01T12:00:00Z ./minisysd start webapp.service
[mini] starting db.service
[mini] starting cache.service
[mini] starting webapp.service
[mini] webapp active (pid 5001)
$ ./minisysd status
webapp.service: active (pid 5001)
Failure demo:
$ ./minisysd start badcycle.service
ERROR: dependency cycle detected: badcycle.service -> foo.service -> badcycle.service
exit code: 5
Exit codes:
- 0: success
- 2: usage error
- 4: exec failure
- 5: dependency cycle
- 6: start limit reached
4. Solution Architecture
4.1 High-Level Design
+-----------+ +--------------+ +----------------+
| CLI Parser|-->| Config Store |-->| Dependency |
+-----------+ +--------------+ | Resolver |
+--------+-------+
|
v
+----------------+
| Supervisor |
| Event Loop |
+----------------+
4.2 Key Components
| Component | Responsibility | Key Decisions |
|---|---|---|
| Config Loader | Parse service definitions | YAML vs INI |
| Dependency Resolver | Topological sort, cycles | Kahn vs DFS |
| Supervisor Loop | fork/exec, signals, reaping | self-pipe vs signalfd |
| State Store | Track PID/state/restart counts | in-memory structs |
4.3 Data Structures (No Full Code)
struct Service {
char name[64];
char command[256];
char** requires;
int restart_burst;
int restart_interval_sec;
pid_t pid;
enum state {INACTIVE, ACTIVATING, ACTIVE, FAILED, COOLDOWN} state;
};
4.4 Algorithm Overview
Key Algorithm: Start Transaction
- Resolve dependency subgraph for requested service.
- Topologically sort the subgraph.
- Start each service in order.
- If a required dependency fails, abort dependents.
Complexity Analysis:
- Time: O(N + E)
- Space: O(N)
5. Implementation Guide
5.1 Development Environment Setup
sudo apt-get install -y build-essential libyaml-dev
make
5.2 Project Structure
minisysd/
├── src/
│ ├── main.c
│ ├── config.c
│ ├── graph.c
│ ├── supervisor.c
│ └── signals.c
├── examples/
│ └── webapp.yaml
├── tests/
└── Makefile
5.3 The Core Question You’re Answering
“What is the minimal set of mechanisms that make systemd reliable?”
5.4 Concepts You Must Understand First
- Process lifecycle and reaping.
- Dependency graph ordering.
- Restart policy and rate limiting.
5.5 Questions to Guide Your Design
- How will you persist restart counters across restarts of minisysd?
- What does “healthy” mean for a oneshot service?
- How do you separate manual stop from crash?
5.6 Thinking Exercise
Draw a state machine for a service that restarts on failure but stops after 3 crashes in 60 seconds.
5.7 The Interview Questions They’ll Ask
- “Why is PID 1 special?”
- “How do you prevent zombie processes?”
- “What is a restart storm and how do you prevent it?”
5.8 Hints in Layers
Hint 1: Start with one service and no dependencies.
Hint 2: Add dependency ordering.
Hint 3: Add SIGCHLD handling.
Hint 4: Add restart limits.
5.9 Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Processes | “Advanced Programming in the UNIX Environment” | Process chapters |
| Signals | “The Linux Programming Interface” | Signals chapters |
| Graphs | “Algorithms” | Graph chapters |
5.10 Implementation Phases
Phase 1: Foundation (4-6 days)
Goals: parse config and order dependencies.
Checkpoint: minisysd start starts services in correct order.
Phase 2: Core Functionality (7-10 days)
Goals: supervision loop with SIGCHLD.
Checkpoint: crashes are detected and state updates correctly.
Phase 3: Reliability and Limits (4-6 days)
Goals: restart policies and timeouts.
Checkpoint: restart storms stop at configured limit.
5.11 Key Implementation Decisions
| Decision | Options | Recommendation | Rationale |
|---|---|---|---|
| Signal handling | self-pipe vs signalfd | self-pipe | portable and simple |
| Config format | INI vs YAML | YAML | easier dependency lists |
6. Testing Strategy
6.1 Test Categories
| Category | Purpose | Examples |
|---|---|---|
| Unit Tests | Graph ordering | detect cycles |
| Integration Tests | Process management | fork/exec and waitpid |
| Edge Case Tests | Restart storms | exceed burst limit |
6.2 Critical Test Cases
- Dependency cycle returns exit code 5.
- Service exits non-zero triggers restart.
- Restart limit stops restarts after 3 failures.
6.3 Test Data
A -> B -> C
C -> A (cycle)
7. Common Pitfalls and Debugging
7.1 Frequent Mistakes
| Pitfall | Symptom | Solution |
|---|---|---|
| No SIGCHLD handling | zombies accumulate | install handler + waitpid loop |
| Wrong graph ordering | services start too early | topological sort |
| No rate limiting | infinite restarts | implement burst/interval |
7.2 Debugging Strategies
- Use strace -f to confirm fork/exec flow.
- Add verbose logging for state transitions.
7.3 Performance Traps
Polling child processes instead of waiting on SIGCHLD wastes CPU.
8. Extensions and Challenges
8.1 Beginner Extensions
- Add a oneshot service type.
- Add a stop command with SIGTERM then SIGKILL.
8.2 Intermediate Extensions
- Implement After ordering separate from Requires.
- Persist state across supervisor restart.
8.3 Advanced Extensions
- Implement Type=notify readiness via UNIX socket.
- Add cgroup tracking for resource usage.
9. Real-World Connections
9.1 Industry Applications
- Minimal init systems for containers and appliances.
- Supervisors in embedded devices.
9.2 Related Open Source Projects
- runit, s6, supervisord (alternative supervisors).
9.3 Interview Relevance
- Discuss process lifecycle and restart policies confidently.
10. Resources
10.1 Essential Reading
- “The Linux Programming Interface” (signals, process control).
- “Advanced Programming in the UNIX Environment” (fork/exec/wait).
10.2 Video Resources
- FOSDEM talks on init systems and supervision.
10.3 Tools and Documentation
- man 2 fork, man 2 execve, man 2 waitpid.
10.4 Related Projects in This Series
- P01-service-health-dashboard.md for visualization.
- P06-container-runtime-systemd-integration.md for PID 1 behavior.
11. Self-Assessment Checklist
11.1 Understanding
- I can explain how SIGCHLD works.
- I can describe topological sorting.
- I can explain restart storms and rate limiting.
11.2 Implementation
- Services start in dependency order.
- Zombies do not accumulate.
- Restart limits prevent flapping.
11.3 Growth
- I can explain my state machine to someone else.
- I documented at least one tricky bug.
12. Submission / Completion Criteria
Minimum Viable Completion:
- Start and stop a service with dependencies.
- Correctly track service state and exit codes.
Full Completion:
- Restart policies and timeouts implemented.
- Cycle detection with clear error output.
Excellence (Going Above and Beyond):
- Implements readiness notification and cgroup tracking.