Systems Integration Projects: From Development to Production Reality
Goal: Build the instincts and technical depth required to make software survive in production. You will learn how operating system boundaries really behave: file descriptors, signals, sockets, resource limits, and memory mappings. You will practice diagnosing failures that only appear under rotation, load, timeouts, and process churn. By the end, you will be able to design tools and services that fail gracefully, remain observable, and recover predictably when the environment changes.
Introduction
Systems integration is the discipline of making independently correct components work together under real operating system constraints. The code does not fail because a loop is wrong. It fails because the OS, filesystem, network, and other processes interact in ways your local tests did not exercise.
What you will build (by the end of this guide):
- A log tailer that survives rotation, truncation, and multi-file ordering
- A resilient HTTP connection pool with failure injection and recovery logic
- A process supervisor that handles signals correctly and never leaks zombies
- A memory-mapped ring buffer IPC system with zero-copy data exchange
- An environment diff tool that explains “works on my machine” failures
- A final deployment pipeline tool that integrates all prior components
Scope (what is included):
- Linux system calls and behavior (file descriptors, signals, /proc, mmap)
- Integration failures at OS boundaries: rotation, timeouts, limits, orphaned children
- Defensive design: timeouts, retries, backoff, observability, and safe shutdown
Out of scope (for this guide):
- Distributed consensus, cluster orchestration, or cloud control planes
- Kernel development or driver-level integration
- Production-grade deployment pipelines for large multi-service systems
The Big Picture (Mental Model)
Your App Code
|
v
Syscall Boundary
(open/read/write/kill)
|
v
Kernel Objects
(FD table, sockets, VM)
|
v
External Reality Layer
(files rotate, net fails,
processes die, limits hit)
Key Terms You Will See Everywhere
- File descriptor (FD): Small integer handle to an open file or socket.
- Open file description: Kernel object that tracks file offset and flags.
- Inode: Identity of a file on disk, independent of its path name.
- SIGCHLD: Signal delivered when a child process changes state.
- Half-open connection: One side thinks TCP is alive, the other is dead.
- Resource limit (RLIMIT_NOFILE): Max number of file descriptors a process can open.
- MAP_SHARED: mmap flag that makes writes visible across processes.
How to Use This Guide
- Read the primer first. The projects are designed to punish missing mental models.
- Build one thin slice at a time. Get a minimal version working, then add resilience.
- Use evidence, not guesses. Every project should produce logs, traces, or counters.
- Treat failures as features. Inject timeouts, kill processes, rotate files on purpose.
Suggested cadence:
- 2 to 4 sessions per week
- 60 to 120 minutes per session
- One session to build, one to observe and debug
Prerequisites & Background Knowledge
Essential Prerequisites (Must Have)
Programming Skills:
- Solid C fundamentals: pointers, structs, memory allocation
- Ability to read compiler errors and warnings
- Comfort with basic Unix CLI tools
Systems Fundamentals:
- Processes and PIDs, what fork/exec/wait do at a high level
- File permissions and basic filesystem navigation
- System calls and errno-based error handling
Recommended Reading:
- “The Linux Programming Interface” by Michael Kerrisk - Ch. 4, 20
- “Advanced Programming in the UNIX Environment” by Stevens and Rago - Ch. 8
Helpful But Not Required
- TCP/IP basics (you will learn during Project 2)
- I/O multiplexing (select/poll/epoll)
- Memory ordering and atomics (Project 4 will introduce the basics)
Self-Assessment Questions
- Can you explain what a file descriptor is and why it is limited?
- Can you write a C program that opens a file, reads it, and handles errors?
- Have you used man 2 open or man 2 read to read syscall documentation?
- Do you know what SIGCHLD is and when it fires?
- Can you compile and link a multi-file C project with gcc or make?
If you answered “no” to 3 or more, spend a week on basics before starting.
Development Environment Setup
Required tools:
# Compiler and build tools
gcc --version || clang --version
make --version
# Debugging and tracing
strace -V || true
lsof -v || true
# Networking utilities
curl --version
nc -h || true
Recommended tools:
valgrind --version || true
tcpdump --version || true
perf --version || true
OS: Linux is strongly recommended (Ubuntu 22.04+, Debian 12+). WSL2 works with minor signal differences. macOS can work, but behavior differs for /proc and some signals.
Time Investment
- Projects 1 and 5: 1 to 2 weeks each
- Projects 2 to 4: 2 to 3 weeks each
- Final integration project: 3 to 4 weeks
Total sprint: 10 to 20 weeks depending on depth.
Important Reality Check
These projects are hard on purpose. The skills you are building are about what happens after your code compiles and passes unit tests. Expect to debug non-obvious failures and revisit the theory multiple times. That is the point.
Big Picture / Mental Model
Systems integration lives at boundaries. Each boundary has failure modes that are invisible in isolated unit tests.
[Your Process]
|
| (FDs, signals, syscalls)
v
[Kernel Boundary] <--- resource limits, scheduling, VM
|
| (filesystems, sockets)
v
[External World] <--- log rotation, network loss, process crashes
|
| (other processes)
v
[Integration Reality]
Key integration rules:
- A file path is not the file. An FD points to an open file description that survives renames.
- Signals can interrupt you at any instruction. Only async-signal-safe calls are safe in handlers.
- TCP connections have lifecycle states and can be half-open.
- mmap MAP_SHARED makes changes visible to other processes, but flush behavior is explicit.
Theory Primer
This is the mini-book. Read it before coding. Every concept here maps to at least one project.
Chapter 1: File Descriptors, Open File Descriptions, and Resource Limits
Fundamentals
File descriptors are small integers that index a per-process table of open file descriptors. They are not files. They are references to kernel objects called open file descriptions, which contain the current file offset and file status flags. When you call open(), you get the lowest unused descriptor number and a new open file description in the system-wide table. That descriptor number is limited by resource limits (RLIMIT_NOFILE), so even perfectly correct code can fail when you hit the limit. If you fork(), the child inherits the same descriptors and therefore the same open file descriptions, which means offsets are shared unless you create a new description. These details are not trivia. They determine whether your log tailer survives rotation, whether your process supervisor leaks FDs, and whether your connection pool collapses under load.
Deep Dive
An open file description is a kernel object created by open() and referenced by one or more file descriptors. A file descriptor is just an index into a process table. That distinction matters because the OS tracks the offset and flags in the open file description, not in your file descriptor value. This is why two descriptors can refer to the same underlying file description and share offsets (dup, fork), and why closing one descriptor does not necessarily close the file for another process. The open() man page describes that a file descriptor is a reference to an open file description and that this reference is unaffected if the pathname is removed or modified to refer to a different file. This is why naive log tailing breaks during log rotation: your FD still points to the old file identity even when the path points to a new file.
Resource limits are part of integration reality. The kernel enforces limits on how many descriptors you can open. RLIMIT_NOFILE is defined as one greater than the maximum FD number a process can open, and exceeding it yields EMFILE. This limit is often far lower in production environments than on developer machines. Services that open one socket per connection and forget to close will appear healthy until a quiet threshold is crossed. Then everything fails at once because open() or accept() starts returning EMFILE. This is not a bug in your socket code; it is an integration failure between your design and OS constraints.
Understanding open file descriptions also explains behavior across exec. By default, FDs remain open across exec, which allows accidental inheritance into child processes. This is a classic integration failure: a child process holds a log file open, preventing rotation, or it keeps a listening socket open, preventing graceful restarts. The design solution is to mark FDs with FD_CLOEXEC or use O_CLOEXEC when opening, and to explicitly close or pass only the FDs you intend. This will show up directly in Projects 1, 3, 5, and the final integration project.
Integration also includes error propagation. When open fails with EMFILE, you need a strategy: log, backoff, drop work, and recover later. If you just crash, your system becomes brittle. If you silently ignore, you lose data without detection. Designing for failure is part of integration. Your log tailer should emit a structured error and keep trying; your connection pool should shrink and retry; your supervisor should detect the difference between transient and fatal errors.
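A minimal sketch of that recovery posture, assuming a hypothetical accept_or_backoff() helper around a listening socket (the pause length is illustrative):
// Sketch: recover from EMFILE instead of crashing or silently dropping work.
#include <errno.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

int accept_or_backoff(int listen_fd) {
    int client = accept(listen_fd, NULL, NULL);
    if (client < 0) {
        if (errno == EMFILE || errno == ENFILE) {
            // FD limit hit: emit a structured error, shed load, retry later.
            fprintf(stderr, "accept: fd limit reached, backing off\n");
            usleep(100 * 1000); // 100 ms pause before the next attempt
            return -1;
        }
        perror("accept");
        return -1;
    }
    return client;
}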
There is also a practical API layer to master. The fcntl() call can set or clear FD_CLOEXEC and is a common way to harden inherited descriptors. dup() and dup2() create additional references to the same open file description, which is convenient for redirecting stdout/stderr but can unexpectedly share file offsets if you assume independent streams. Pipes and sockets are just FDs too, which means the same lifetime and limit rules apply. If you do not close unused pipe ends in parent and child, you can keep a pipe open forever, causing readers to block even after writers exit. These are the subtle integration bugs you will surface repeatedly in this sprint.
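A minimal hardening sketch, assuming a hypothetical set_cloexec() helper; opening with O_CLOEXEC avoids the race window between open() and fcntl():
// Sketch: mark an inherited FD close-on-exec so children do not keep it open.
#include <fcntl.h>

int set_cloexec(int fd) {
    int flags = fcntl(fd, F_GETFD);
    if (flags < 0) return -1;
    return fcntl(fd, F_SETFD, flags | FD_CLOEXEC);
}

// Equivalent at open() time, which avoids the race window before fcntl():
// int fd = open("/var/log/app.log", O_RDONLY | O_CLOEXEC);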
How This Fits in Projects
- Project 1 uses open file descriptions to follow log identity across rotations.
- Project 2 uses FDs for sockets and must handle EMFILE gracefully.
- Project 3 uses FD inheritance rules to avoid leaking FDs into child processes.
- Project 5 reads /proc/self/fd and /proc/self/limits to explain environment differences.
- Project 6 uses all of the above in one integrated tool.
Definitions & Key Terms
- File descriptor: Integer index into a process table of open FDs.
- Open file description: Kernel record storing file offset and status flags.
- RLIMIT_NOFILE: Max FD number + 1; exceeding it yields EMFILE.
- FD_CLOEXEC: Flag that closes an FD across exec.
Mental Model Diagram
Process FD Table System-Wide Open File Table
+-----+-------------+ +--------------+--------------+
| 3 | ------+ | | open desc A | offset=1024 |
| 4 | ------+----+------> | file flags | O_APPEND |
| 5 | ------+ | +--------------+--------------+
+-----+-------------+
Two FDs can point to the same open file description
(fork/dup), sharing offsets and flags.
How It Works (Step-by-Step)
- open(path, flags) allocates an open file description.
- The process FD table gets a new index pointing to that description.
- fork() copies the FD table, so child and parent share the same descriptions.
- dup() creates another FD pointing to the same description.
- close(fd) removes that FD entry; the description is freed only when no FDs reference it.
Minimal Concrete Example
int fd = open("/var/log/app.log", O_RDONLY);
if (fd < 0) perror("open");
pid_t pid = fork();
if (pid == 0) {
// Child shares the same open file description
char buf[64];
read(fd, buf, sizeof(buf));
_exit(0);
} else {
// Parent reads from same offset
char buf[64];
read(fd, buf, sizeof(buf));
wait(NULL);
}
Common Misconceptions
- “The file path identifies the file.” -> The FD identifies the open file description. The path can change.
- “Closing in the parent closes it for the child.” -> Only if no other FDs reference it.
- “FD numbers are unlimited.” -> RLIMIT_NOFILE is enforced and often low in production.
Check-Your-Understanding Questions
- Why does tail -f sometimes keep reading a deleted file?
- What is shared between two FDs obtained via dup()?
- Why can fork() cause hidden FD leaks in child processes?
Check-Your-Understanding Answers
- Because the FD still references the open file description even if the path is renamed or removed.
- They share the same open file description, including file offset and flags.
- The child inherits all open FDs unless explicitly closed or marked with FD_CLOEXEC.
Real-World Applications
- Log tailers and log shippers
- High-concurrency servers with many sockets
- Process supervisors and service managers
Where You Will Apply It
- Project 1, Project 2, Project 3, Project 5, Project 6
References
- open(2) man page (open file descriptions, FD behavior) - https://man7.org/linux/man-pages/man2/open.2.html
- getrlimit(2) man page (RLIMIT_NOFILE, EMFILE) - https://man7.org/linux/man-pages/man2/getrlimit.2.html
Key Insight
Open file descriptions, not paths, define file identity across rotations and forks.
Summary
File descriptors are small integers that point to kernel-managed open file descriptions. Resource limits and inheritance rules determine system behavior, and they are the source of many real integration failures.
Homework / Exercises
- Write a program that opens a file, forks, and shows shared offsets.
- Use ulimit -n to lower the FD limit and observe EMFILE errors.
Solutions
- Use lseek or read in both parent and child, then print offsets.
- Set ulimit -n 64, open files in a loop, and detect EMFILE.
Chapter 2: File Identity, Log Rotation, and I/O Multiplexing
Fundamentals
A log tailer that works in dev can fail in production because log files do not stay put. Rotation can rename a file, truncate it, or replace it with a new inode. If your tailer follows by file descriptor, it may keep reading an old file that no longer has a path, while missing new lines from the new log file. GNU tail -F follows by name and retries when the file is rotated, which is why it survives rotations better than tail -f. When rotation uses copytruncate, the inode stays the same but the file shrinks; your tailer must detect truncation and restart at the new start. I/O multiplexing adds another layer: when you watch multiple logs you must decide between select, poll, epoll, or inotify, each with different scalability and failure modes. select() is limited by FD_SETSIZE (often 1024), which is too small for large log sets.
Deep Dive
A log tailer must answer three questions reliably: (1) which file am I reading, (2) is it the same file as before, and (3) how do I wait for new data without blocking forever? File identity is about inodes and open file descriptions. When log rotation renames a file, the inode stays with the old file and the path now points to a new inode. If your tailer holds an open FD, it continues reading the old inode until EOF, which means it may miss new log entries written to the new file. This is why a tailer that follows the descriptor by default is wrong for rotated logs. The GNU coreutils documentation explains that --follow=descriptor (default for -f) keeps reading an unlinked file, while --follow=name (used by -F) reopens the file name when it is removed or renamed. That is exactly the behavior you need for log rotation.
Rotation strategies matter. With rename rotation (move log to .1 and create a new file), you must detect inode changes and reopen. With copytruncate, the original file is copied and then truncated in place. The inode remains, but the file size shrinks to zero. If your tailer only watches for new bytes at the old offset, it will get stuck. You must detect truncation (size decreased) and reset your offset. The logrotate documentation describes copytruncate as copying and truncating the original file in place. This is a distinct behavior from renaming and requires separate handling logic.
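A minimal truncation-detection sketch, assuming a last_offset variable that records how many bytes the tailer has already consumed:
// Sketch: detect copytruncate. The inode is unchanged, but the file is now
// smaller than the offset already consumed, so reading must restart at 0.
struct stat st;
if (fstat(fd, &st) == 0 && st.st_size < last_offset) {
    lseek(fd, 0, SEEK_SET);   // file was truncated in place
    last_offset = 0;
}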
Now consider multiplexing. If you read multiple log files, you need a single event loop. select() can monitor multiple FDs but has a fixed FD_SETSIZE limit and requires O(n) scans. The man page warns that select cannot monitor FDs >= FD_SETSIZE and that modern applications should use poll or epoll instead. That matters because a tailer that works on 10 logs may break at 2000. The correct integration decision depends on expected scale: use poll() for moderate numbers and epoll for many.
Correct multiplexing also requires non-blocking I/O. If any FD blocks, your event loop stalls and other logs are not read. A robust tailer sets O_NONBLOCK, uses poll or epoll, reads until EAGAIN, then returns to the loop. It also needs timers so that name reopens happen when no data arrives. The tail implementation in coreutils uses inotify when available, but falls back to polling with a sleep interval. That is a real world integration detail: if inotify is unavailable (e.g., on some network filesystems), you still need correctness.
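A minimal drain-until-EAGAIN sketch, assuming the FD was opened with O_NONBLOCK and that a line parser (not shown) consumes the bytes:
// Sketch: drain a non-blocking FD until EAGAIN so the event loop never stalls.
#include <errno.h>
#include <unistd.h>

void drain_fd(int fd) {
    char buf[4096];
    for (;;) {
        ssize_t n = read(fd, buf, sizeof(buf));
        if (n > 0) continue;                                // hand bytes to the line parser (not shown)
        if (n == 0) break;                                  // EOF: file may have been rotated
        if (errno == EAGAIN || errno == EWOULDBLOCK) break; // no more data right now
        if (errno == EINTR) continue;                       // interrupted: try again
        break;                                              // real error: report upstream
    }
}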
Edge cases are the real test. Consider a file that is rotated (rename), then the writer keeps writing to the old FD for a short time before reopening the new file. Your tailer may receive lines from both the old and new inode during a rotation window. If you care about ordering, you must decide which stream wins or tag entries with inode and timestamp so the consumer can decide. That level of correctness is rarely needed in toy examples but becomes critical for forensic log analysis.
How This Fits in Projects
- Project 1 is the primary application of this entire chapter.
- Project 6 uses these concepts when aggregating logs during deploys.
Definitions & Key Terms
- Log rotation: The process of renaming, truncating, or replacing log files.
- copytruncate: Rotation method that copies then truncates in place.
- Follow by name: Reopen the file when the path changes (tail -F).
- I/O multiplexing: Waiting for readiness on multiple FDs.
Mental Model Diagram
Rotation by rename Rotation by copytruncate
/app.log -> inode 100 /app.log -> inode 200
mv app.log app.log.1 copy app.log -> app.log.1
create new app.log inode 101 truncate inode 200 to size 0
FD still points to inode 100 FD still points to inode 200
Path now points to inode 101 Path still points to inode 200
How It Works (Step-by-Step)
- On startup, open the file and record (device, inode).
- Use poll/epoll to wait for readable events.
- On read, if size shrinks or inode changes, reopen by name.
- If the file disappears, retry with exponential backoff.
Minimal Concrete Example
struct stat st;
fstat(fd, &st);
uint64_t inode = st.st_ino;
// Later in the loop...
struct stat st2;
stat(path, &st2);
if (st2.st_ino != inode) {
close(fd);
fd = open(path, O_RDONLY | O_NONBLOCK);
inode = st2.st_ino;
}
Common Misconceptions
- “tail -f is enough” -> -F is needed for rotation by name.
- “If file size shrinks, it is an error” -> It may be copytruncate; reset the offset.
- “select works for any number of FDs” -> It is limited by FD_SETSIZE.
Check-Your-Understanding Questions
- What is the difference between follow-by-name and follow-by-descriptor?
- Why does copytruncate break naive tailers?
- Why does select not scale to large FD sets?
Check-Your-Understanding Answers
- Follow-by-name reopens the path if the file is replaced; follow-by-descriptor keeps the same FD even if the path changes.
- The inode stays the same but size shrinks; without detecting truncation you never read new content.
- select only tracks FDs < FD_SETSIZE and scans all FDs each time.
Real-World Applications
- Log shippers and monitoring agents
- Real-time ingestion pipelines
- File-watching build systems
Where You Will Apply It
- Project 1, Project 6
References
- GNU coreutils tail documentation (-F and --follow=name) - https://www.gnu.org/software/coreutils/manual/html_node/tail-invocation.html
- logrotate man page (copytruncate behavior) - https://manpages.org/logrotate/8
- select(2) man page (FD_SETSIZE limit) - https://man7.org/linux/man-pages/man2/select.2.html
Key Insight
If you cannot identify the file beyond its path, your tailer will fail under rotation.
Summary
Log tailing is an integration problem: paths, inodes, rotation strategy, and I/O multiplexing interact in ways that naive code misses. This chapter arms you with the correct mental model.
Homework / Exercises
- Build a small tailer that can detect copytruncate.
- Rotate a log file with rename and verify the tailer reopens by name.
Solutions
- Compare file size with the last offset; if smaller, reset to 0.
- Check inode changes with stat, then reopen.
Chapter 3: Process Lifecycle, Signals, and Supervision
Fundamentals
A process supervisor must handle life and death events that can happen at any time. When a child exits, the kernel sends SIGCHLD and the parent must call wait or waitpid to reap it. If you do not, the process becomes a zombie, and your system accumulates dead entries. Signals can interrupt your code at any instruction, and only a short list of async-signal-safe functions may be called safely in handlers. That means your signal handler should be minimal: set a flag, write to a pipe, or record a status. The rest of the work should be done in the main loop. This design is not optional; calling printf in a handler can deadlock or corrupt state.
Deep Dive
Process supervision is a coordination problem between two independent control flows: your supervisor and the kernel’s signal delivery. The kernel sends SIGCHLD whenever a child exits, stops, or continues. The handler runs asynchronously and can interrupt any system call, which means every line of code must be written as if it can be interrupted. The man page on signal safety explains why stdio functions like printf are unsafe in handlers: they use shared buffers and are not reentrant. Therefore, your handler must avoid non-async-signal-safe functions and instead defer work to a safe context.
A robust supervisor installs a SIGCHLD handler using sigaction with SA_SIGINFO so it can inspect details. The handler should call only async-signal-safe operations such as write to a self-pipe or set a volatile flag. In the main loop, you should call waitpid in a loop with WNOHANG until all dead children are reaped. This is important because multiple children can die before you get a chance to handle signals, and you must drain all waitable statuses to avoid zombies.
Signals also have interaction with blocking syscalls. If you call select, poll, or read, the syscall may be interrupted by a signal, returning EINTR. Your code must either restart the syscall or integrate signal handling into your event loop. The sigaction man page describes SA_RESTART and the behavior of signal handling with respect to system calls. You should understand what it means for your program: if you use SA_RESTART, certain syscalls will resume automatically; if not, you must handle EINTR explicitly.
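A minimal EINTR-retry sketch, assuming a hypothetical read_retry() wrapper for code paths that do not set SA_RESTART:
// Sketch: restart a read() interrupted by a signal (for paths without SA_RESTART).
#include <errno.h>
#include <unistd.h>

ssize_t read_retry(int fd, void *buf, size_t len) {
    ssize_t n;
    do {
        n = read(fd, buf, len);
    } while (n < 0 && errno == EINTR);
    return n;
}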
Process groups and session leadership matter in production. If you start a child process that creates its own process group, signals to the parent may not propagate to the group. Your supervisor should optionally create a new process group for the child and then forward SIGTERM and SIGINT to the entire group so all subprocesses shut down. This prevents orphaned grandchildren and ensures clean termination.
Supervision also includes restart policies. If a child crashes repeatedly, you should implement exponential backoff or a max-restart threshold to avoid a restart loop that burns CPU. That is not just a convenience; it is an integration defense. When the system is broken due to missing config or resource limits, constant restarting makes outages worse.
Finally, child processes inherit file descriptors. If you do not close or set FD_CLOEXEC on descriptors, the child may keep critical files or sockets open, preventing rotation or port reuse. This is a common cause of “graceful restart” failures, and it is directly tied to the open file description model in Chapter 1.
Signals also coalesce. The kernel does not queue a SIGCHLD for every child by default; multiple child exits can collapse into a single delivery. That is why the waitpid loop is non-negotiable. A common integration pattern is the self-pipe trick: the handler writes one byte to a pipe, and the main loop includes the read end in its poll set. This makes signal handling part of the same event loop as I/O. On Linux, signalfd provides a cleaner interface by delivering signals as readable events on a file descriptor, but it is not portable. Knowing these techniques helps you design supervisors that stay responsive even under heavy churn.
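A minimal self-pipe sketch; self_pipe and on_sigchld are illustrative names, and the pipe is assumed to be created non-blocking at startup:
// Sketch of the self-pipe trick: the handler does only an async-signal-safe
// write(); the event loop watches the read end alongside its other FDs.
#include <unistd.h>

static int self_pipe[2];   // created with pipe() at startup, both ends O_NONBLOCK

static void on_sigchld(int sig) {
    (void)sig;
    char b = 1;
    (void)write(self_pipe[1], &b, 1);   // async-signal-safe; ignore the result
}

// Main loop: add self_pipe[0] to the poll set. When it becomes readable,
// drain it, then reap all children with waitpid(-1, &status, WNOHANG) in a loop.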
How This Fits in Projects
- Project 3 is the primary application of this chapter.
- Project 6 uses the same supervision logic for service restarts.
Definitions & Key Terms
- SIGCHLD: Signal indicating child state change.
- Zombie: Exited process that has not been reaped by its parent.
- Async-signal-safe: Functions safe to call in handlers.
- SA_SIGINFO: sigaction flag that enables extended signal info.
Mental Model Diagram
Parent supervisor
|
| fork/exec
v
Child process ------------------> exits
^ |
| SIGCHLD v
+--------------------------- signal handler
(set flag)
|
v
main loop -> waitpid()
How It Works (Step-by-Step)
- Install SIGCHLD handler using sigaction.
- Start child with fork and exec.
- On SIGCHLD, handler writes to a self-pipe or sets a flag.
- Main loop notices the flag and reaps all children with waitpid(WNOHANG).
- Decide whether to restart child based on exit status and policy.
Minimal Concrete Example
static volatile sig_atomic_t child_exited = 0;
void sigchld_handler(int sig) {
child_exited = 1; // async-signal-safe
}
// main loop
if (child_exited) {
int status;
while (waitpid(-1, &status, WNOHANG) > 0) {
// handle exit status
}
child_exited = 0;
}
Common Misconceptions
- “I can printf inside a signal handler.” -> stdio is not async-signal-safe.
- “If I handle SIGCHLD once, all children are reaped.” -> SIGCHLD deliveries coalesce; reap in a waitpid loop.
- “EINTR means failure.” -> It often means you must retry or handle signals.
Check-Your-Understanding Questions
- Why is printf unsafe in a signal handler?
- Why should waitpid be called in a loop with WNOHANG?
- What is the role of SA_SIGINFO?
Check-Your-Understanding Answers
- stdio is not reentrant and can corrupt shared buffers.
- Multiple children can exit before you handle signals, and waitpid reaps one at a time.
- It lets you receive siginfo_t and ucontext for richer handling.
Real-World Applications
- Process supervisors, service managers, container runtimes
- Job schedulers and workers
- CLI tools that spawn subprocesses
Where You Will Apply It
- Project 3, Project 6
References
- signal-safety(7) man page (async-signal-safe functions) - https://man7.org/linux/man-pages/man7/signal-safety.7.html
- sigaction(2) man page (handler behavior, SA_SIGINFO) - https://man7.org/linux/man-pages/man2/sigaction.2.html
Key Insight
Signals are not “notifications” you can handle later; they are interruptions that shape your program design.
Summary
Process supervision is about correctness under asynchronous interruption. With the right handler strategy and wait logic, you prevent zombies and build reliable process control.
Homework / Exercises
- Write a supervisor that restarts a child with exponential backoff.
- Create a test child that crashes every second and ensure the supervisor does not spin.
Solutions
- Track restart count and sleep with increasing delays.
- Use sleep(1) then exit(1) in the child, verify backoff.
Chapter 4: TCP Connection Lifecycle and Pool Resilience
Fundamentals
TCP connections have a lifecycle defined by explicit states: LISTEN, SYN-SENT, ESTABLISHED, FIN-WAIT, TIME-WAIT, and others. A connection pool must manage these states across many sockets. If one side disappears, the other can remain in a half-open state, wasting resources until timeouts detect the failure. A robust pool sets timeouts for connect, read, and write operations, and it performs health checks to remove dead connections. This is not just networking trivia; it determines whether your service fails under intermittent packet loss or remote restarts. The TCP state machine is standardized and should guide your implementation decisions.
Deep Dive
The TCP state machine is the foundation of connection pool design. RFC 9293 defines the states and their meanings: LISTEN waits for connection requests, SYN-SENT waits for a response after an active open, ESTABLISHED is the normal data transfer state, and TIME-WAIT exists to prevent old segments from being misinterpreted as new ones. Understanding these states matters because your pool will see them through the OS view: sockets may appear writable but still be in a connect-in-progress state, or they may appear readable but only to deliver EOF.
Half-open connections occur when one side believes the connection is alive while the other has died. This can happen due to network partitions, crashes, or NAT device timeouts. The kernel will not always detect this immediately. If you write to such a connection, you may get EPIPE or ECONNRESET; if you only read, you might wait forever. That is why a pool must implement application-level timeouts and keepalive or ping logic. A key integration insight is that “connected” does not mean “healthy”. Health must be actively measured.
Non-blocking connect is another critical integration point. If you call connect on a non-blocking socket, it returns EINPROGRESS and you must use poll or select to wait for writability, then check SO_ERROR to confirm success. This is a common source of bugs because many developers assume a writable socket is connected. The correct pattern is: connect -> wait for writable -> getsockopt(SO_ERROR) -> success or failure. This pattern is essential for scalable connection pools.
Resilience design also includes backpressure and failure injection. A pool that always retries immediately can overwhelm a failing dependency and create cascading failure. Use exponential backoff and jitter. Use circuit breaker logic: if too many connections fail, stop attempting for a short window and surface the error upstream. When the pool is under load, you may need to limit the number of concurrent connection attempts to avoid exhausting file descriptors and CPU. This integrates with RLIMIT_NOFILE from Chapter 1 and with event loops from Chapter 2.
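A minimal backoff-with-jitter sketch; the base delay and cap are illustrative values, not requirements of the project:
// Sketch: exponential backoff with jitter for reconnect attempts.
#include <stdlib.h>

long backoff_ms(int attempt) {
    long base = 100;                                    // 100 ms initial delay
    long cap = 30000;                                   // never wait more than 30 s
    long delay = base << (attempt < 8 ? attempt : 8);   // exponential growth
    if (delay > cap) delay = cap;
    return delay / 2 + rand() % (delay / 2 + 1);        // at least half, plus jitter
}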
In real systems, timeouts must be layered: connect timeout, read/write timeout, idle timeout, and pool TTL. A connection might be alive but too old to reuse, especially across load balancers that reset idle connections. To make the pool safe, you should periodically validate idle connections or use a maximum reuse age. The design tradeoff is between performance (reuse) and correctness (freshness). In production, correctness wins.
The final piece is observability. A pool without metrics is a black box. You should track active connections, idle connections, connection failures by errno, average latency, and eviction reasons. These metrics provide the evidence you need during incidents. If you build them into the project now, you will later know how to debug real failures.
There is also a lifecycle cost to connection churn. Every closed connection can leave a TIME-WAIT entry, which consumes an ephemeral port for a period of time and can limit outbound capacity during spikes. A pool reduces churn by reusing live connections, which reduces TIME-WAIT pressure and latency. At the same time, reuse can amplify the impact of a dead connection if you do not detect it. That tradeoff is why pooling must be coupled with health checks and idle timeouts, not just reuse.
How This Fits in Projects
- Project 2 is entirely focused on these concepts.
- Project 6 uses the pool for SSH or HTTP-based deployment steps.
Definitions & Key Terms
- TCP state machine: Standard connection states (LISTEN, ESTABLISHED, TIME-WAIT).
- Half-open connection: One side thinks alive, the other dead.
- Connect timeout: Maximum time to establish a TCP connection.
- Circuit breaker: Pattern that stops requests after repeated failures.
Mental Model Diagram
Client Server
SYN ----------------------->
<------------------- SYN-ACK
ACK -----------------------> ESTABLISHED
...
FIN -----------------------> (client: FIN-WAIT-1)
<------------------- ACK (client: FIN-WAIT-2)
<------------------- FIN
ACK ----------------------->
TIME-WAIT (client holds)
How It Works (Step-by-Step)
- Create non-blocking socket.
- connect() -> EINPROGRESS.
- poll for writability.
- getsockopt(SO_ERROR) -> success or failure.
- Put connection into pool with idle timer.
- On failure, backoff and try again later.
Minimal Concrete Example
int fd = socket(AF_INET, SOCK_STREAM | SOCK_NONBLOCK, 0);
int rc = connect(fd, (struct sockaddr *)&addr, sizeof(addr));
if (rc < 0 && errno == EINPROGRESS) {
    struct pollfd pfd = { .fd = fd, .events = POLLOUT };
    poll(&pfd, 1, connect_timeout_ms);
    int err = 0; socklen_t len = sizeof(err);
    getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len);
    if (err != 0) { /* connection failed */ }
}
Common Misconceptions
- “If connect() returns EINPROGRESS, the socket is fine.” -> You must check SO_ERROR.
- “A connection in ESTABLISHED state is healthy.” -> It may be half-open.
- “Retries fix everything.” -> Aggressive retries can cause cascading failure.
Check-Your-Understanding Questions
- Why does TIME-WAIT exist, and how does it affect pools?
- How do you validate a non-blocking connect?
- What is the difference between connection timeout and read timeout?
Check-Your-Understanding Answers
- TIME-WAIT prevents old packets from being misinterpreted; it delays reuse of the same tuple.
- Wait for writability, then check SO_ERROR.
- Connection timeout bounds handshake time; read timeout bounds response wait time.
Real-World Applications
- HTTP connection pools and database drivers
- Service-to-service communication in microservices
- Health checks and load balancer integrations
Where You Will Apply It
- Project 2, Project 6
References
- RFC 9293: Transmission Control Protocol (state machine) - https://www.rfc-editor.org/rfc/rfc9293
Key Insight
TCP is stateful. Your pool must be stateful too, or it will lie about health.
Summary
A resilient connection pool is a failure-management system. Correctness comes from respecting TCP states, timeouts, and observability.
Homework / Exercises
- Build a small program that shows TIME-WAIT after closing sockets.
- Implement a connect timeout and verify it with a blackholed IP.
Solutions
- Use ss -tan and observe TIME-WAIT entries after a loop of connects.
- Set a short timeout and connect to a non-routable address.
Chapter 5: Environment Observability and /proc Diagnostics
Fundamentals
Many production failures are caused by environment differences, not code changes. The /proc filesystem exposes kernel and process state as files, including limits, memory maps, open file descriptors, and environment variables. It is the fastest way to explain “works on my machine” bugs. The /proc filesystem is a pseudo-filesystem that provides an interface to kernel data structures, and most of it is read-only. By reading /proc/self/limits and /proc/self/fd, you can discover why a program fails remotely. Resource limits like RLIMIT_NOFILE define the maximum FD number a process may open, and exceeding this limit yields EMFILE. A diagnostic tool that captures these details turns mystery outages into actionable diffs.
Deep Dive
The /proc filesystem is a live window into process and kernel state. It is not a log or a snapshot. When you read /proc/[pid]/fd, you see a directory of symlinks pointing to every open FD. When you read /proc/[pid]/limits, you see resource limits such as max open files, max processes, and locked memory. The proc man page describes it as a pseudo-filesystem interface to kernel data structures, typically mounted at /proc. This is why it is a critical integration tool: it lets you observe the environment exactly as the kernel sees it.
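A minimal sketch of the FD listing such a tool might capture, using opendir() and readlink() on /proc/self/fd (dump_fds is an illustrative helper):
// Sketch: list open FDs via /proc/self/fd, the same view the kernel has.
#include <dirent.h>
#include <stdio.h>
#include <unistd.h>

void dump_fds(void) {
    DIR *d = opendir("/proc/self/fd");
    if (!d) { perror("opendir"); return; }
    struct dirent *e;
    while ((e = readdir(d)) != NULL) {
        if (e->d_name[0] == '.') continue;
        char link[64], target[256];
        snprintf(link, sizeof(link), "/proc/self/fd/%s", e->d_name);
        ssize_t n = readlink(link, target, sizeof(target) - 1);
        if (n < 0) continue;                 // FD closed mid-scan or not readable
        target[n] = '\0';
        printf("fd %s -> %s\n", e->d_name, target);
    }
    closedir(d);
}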
For integration debugging, three classes of differences matter: resource limits, filesystem permissions, and dynamic linking. Resource limits include RLIMIT_NOFILE (file descriptors) and RLIMIT_NPROC (process count). If you deploy to a machine with a low RLIMIT_NOFILE, your service may accept connections until it hits EMFILE. The getrlimit man page defines RLIMIT_NOFILE as one greater than the maximum FD number and notes that attempts to exceed it yield EMFILE. That is not a bug in your program; it is a mismatch between expected capacity and actual environment.
Filesystem permissions and ownership are another common failure source. A program may run under a different UID or with different umask on a server, causing file creation to fail. You can detect this by capturing /proc/self/status (UID/GID) and verifying file ownership and permissions. Similarly, dynamic linking problems happen when LD_LIBRARY_PATH differs or when the system provides a different version of a shared library. You can read /proc/self/maps to see what libraries are actually loaded and compare across machines.
A good environment diff tool should capture: OS version, kernel parameters, ulimits, open FD counts, file permission masks (umask), environment variables, mount points, and library dependencies. Then it should produce a stable JSON snapshot and a diff. This shifts debugging from “guessing” to “comparing evidence”.
One subtle integration issue is that /proc access itself can be restricted. The proc filesystem supports mount options like hidepid that prevent access to other processes. Your tool should handle permission errors gracefully and report missing data explicitly. This is a real production constraint, especially in containerized environments.
You should also learn which /proc files are stable and useful for diagnostics. /proc/self/status contains UID/GID, memory usage summaries, and thread counts. /proc/self/cmdline shows the exact arguments used to start the process, which is often different in production. /proc/self/environ reveals environment variables that can change runtime behavior. /proc/self/mountinfo and /proc/mounts show filesystem types and mount options, which matter for features like inotify and file permissions. The /proc/sys tree exposes kernel tunables, including networking settings and file system limits. Capturing these values (with care for sensitive data) makes your environment diff tool far more effective because you can explain why the same binary behaves differently under different kernel settings.
In containerized environments, /proc/self/cgroup provides the cgroup path for the process, which lets you detect CPU or memory constraints that are invisible from inside the app. Capturing this data helps explain why a service is slower or is OOM-killed in production while it runs fine on a developer laptop.
How This Fits in Projects
- Project 5 is entirely about /proc-based environment diffing.
- Project 6 uses environment snapshots for deployment diagnostics.
Definitions & Key Terms
- /proc: Pseudo-filesystem exposing kernel and process data.
- RLIMIT_NOFILE: Max FD number + 1; exceeding yields EMFILE.
- /proc/self/fd: Symlinks for each open FD (identity and leaks).
- /proc/self/maps: Memory mappings including shared libraries.
Mental Model Diagram
Kernel state ----> /proc ----> Diagnostic tool ----> JSON snapshot ----> Diff
How It Works (Step-by-Step)
- Read /proc/self/limits and parse RLIMIT values.
- Read /proc/self/fd to count and label open descriptors.
- Read /proc/self/maps to capture shared library versions.
- Capture environment variables (PATH, LD_LIBRARY_PATH, etc.).
- Serialize into JSON and compare snapshots across machines.
Minimal Concrete Example
char line[256];
FILE *f = fopen("/proc/self/limits", "r");
if (f) {
    while (fgets(line, sizeof(line), f)) {
        // parse and store key/value pairs
    }
    fclose(f);
}
Common Misconceptions
- “/proc is only for kernel hackers” -> It is a standard debugging interface.
- “Environment differences are random” -> They are measurable and diffable.
- “If code passes tests, production will match” -> Ulimits and libraries differ.
Check-Your-Understanding Questions
- Why is /proc/self/fd useful for debugging FD leaks?
- What does RLIMIT_NOFILE define?
- Why might /proc access be incomplete in containers?
Check-Your-Understanding Answers
- It shows every open FD as the kernel sees it, including leaked files.
- It is one greater than the maximum FD number a process may open.
- Containers often restrict /proc access with hidepid or namespaces.
Real-World Applications
- CI vs production environment debugging
- Postmortem analysis for “works on my machine”
- Security audits of process state
Where You Will Apply It
- Project 5, Project 6
References
- proc(5) man page (pseudo-filesystem overview) - https://man7.org/linux/man-pages/man5/proc.5.html
- getrlimit(2) man page (resource limit semantics) - https://man7.org/linux/man-pages/man2/getrlimit.2.html
Key Insight
Environment state is part of your system. If you cannot measure it, you cannot debug it.
Summary
/proc turns production failures into observable evidence. It is your most powerful integration debugging tool.
Homework / Exercises
- Write a program that dumps /proc/self/fd to a JSON list.
- Compare RLIMIT_NOFILE across two machines and explain differences.
Solutions
- Use opendir and readlink on each /proc/self/fd entry.
- Run ulimit -n and parse /proc/self/limits, then diff results.
Chapter 6: Memory-Mapped I/O and Shared Memory Coordination
Fundamentals
Memory-mapped files let you treat a file as memory. With MAP_SHARED, changes are visible to other processes mapping the same region, and changes are carried through to the underlying file. This is the foundation of zero-copy IPC. It is also easy to get wrong because ordering, visibility, and durability are not automatic. If you want to guarantee that data reaches disk, you must call msync, because without it there is no guarantee that changes are written back before munmap. These details are crucial for building a shared ring buffer: you must design synchronization, detect partial writes, and handle crash recovery.
Deep Dive
mmap is a boundary crossing tool. It bypasses conventional read/write syscalls and maps file data into your address space. The mmap man page states that MAP_SHARED mappings propagate updates to other processes that map the same region and to the underlying file. This is the key integration mechanism for shared memory IPC: two processes map the same file and see each other’s updates without copying. But this behavior has constraints. First, visibility is about memory ordering. If one process writes a record and another reads it, you must use synchronization (atomics, memory fences, or lock-free protocol) to ensure the reader does not see a partially written record. Second, durability is not automatic. The msync man page says that without msync, there is no guarantee that changes are written back to the filesystem before munmap. That is a subtle but essential point: your ring buffer might appear to work during tests, then lose data after a crash because data was never flushed.
Shared memory also exposes cache behavior. If two processes run on different cores, the hardware cache coherence protocol must propagate writes, but it does not imply ordering for your data structures. You must design a ring buffer where the writer publishes data in a defined order: write payload, then write length, then publish tail with a release barrier. The reader must read tail with an acquire barrier, then read length, then read payload. Without this, you will see intermittent corruption that is almost impossible to reproduce.
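A minimal publication sketch using C11 atomics; the slot layout and field names are illustrative, and wrap-around and slot-reuse logic are omitted:
// Sketch: publish one record with release/acquire ordering. The slot layout
// is illustrative; a full ring buffer adds wrap-around and reuse handling.
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

struct slot {
    char payload[256];
    uint32_t len;
    _Atomic int ready;    // 0 = empty, 1 = published
};

void write_slot(struct slot *s, const void *data, uint32_t len) {
    memcpy(s->payload, data, len);                              // 1. payload first
    s->len = len;                                               // 2. then the length
    atomic_store_explicit(&s->ready, 1, memory_order_release);  // 3. publish last
}

int read_slot(struct slot *s, void *out, uint32_t *len) {
    if (!atomic_load_explicit(&s->ready, memory_order_acquire))
        return 0;                                               // nothing published yet
    *len = s->len;
    memcpy(out, s->payload, s->len);                            // ordered after the release
    return 1;
}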
The ring buffer design also has capacity and wrap-around semantics. You must decide whether to overwrite old data, block the writer when the buffer is full, or drop records. Each decision impacts backpressure and integration behavior with upstream systems. For example, if you drop records, you must report that drop rate explicitly. If you block the writer, you must ensure you do not deadlock the process when the reader is slow or dead. These are integration decisions, not purely data structure decisions.
Another integration detail is file sizing. If you mmap a file, you must ensure the file is large enough first (ftruncate). If you do not, writes may generate SIGBUS when the mapping extends beyond the file size. This failure only appears under certain write patterns and is a classic integration bug.
Finally, remember that memory-mapped I/O interacts with the page cache. Your writes may appear fast because they land in cache, not on disk. If your system needs durability, you must decide when to call msync and with what flags. MS_SYNC waits for the flush to complete; MS_ASYNC schedules it and returns. Neither is free: frequent msync calls can dominate performance, while never calling msync can lose data after a crash. A common compromise is to treat the ring buffer as ephemeral and only sync checkpoints, not every record. That tradeoff should be explicit in your design.
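A minimal checkpoint-flush sketch; checkpoint() is an illustrative helper, and the length is assumed to cover only the region you need durable:
// Sketch: flush only the region you need durable. MS_SYNC blocks until
// write-back completes; MS_ASYNC schedules it and returns immediately.
#include <stddef.h>
#include <sys/mman.h>

int checkpoint(void *map_base, size_t checkpoint_len) {
    return msync(map_base, checkpoint_len, MS_SYNC);
}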
Lastly, cleanup and crash recovery matter. If the writer crashes, the reader must detect incomplete records. A common technique is to use a sequence number or a record header with a checksum and a validity flag. On restart, the writer can scan forward to find the last valid record and resume without corrupting the buffer.
How This Fits in Projects
- Project 4 is focused on these concepts.
- Project 6 optionally uses mmap for fast diff or log aggregation.
Definitions & Key Terms
- mmap: Map a file or device into memory.
- MAP_SHARED: Updates are visible across processes and propagate to file.
- msync: Explicitly flush mapped changes to disk.
- SIGBUS: Signal generated on invalid memory access to mapping.
Mental Model Diagram
Process A Shared File Process B
[writer] ----mmap----> [ring buffer] <----mmap---- [reader]
How It Works (Step-by-Step)
- Create file and set size with ftruncate.
- mmap with MAP_SHARED into both processes.
- Writer writes data + metadata, then advances tail.
- Reader polls head/tail, reads in order, advances head.
- Optionally msync to flush critical data.
Minimal Concrete Example
int fd = open("/tmp/ringbuf", O_RDWR | O_CREAT, 0644);
ftruncate(fd, RING_SIZE);
void *buf = mmap(NULL, RING_SIZE, PROT_READ | PROT_WRITE,
MAP_SHARED, fd, 0);
Common Misconceptions
- “MAP_SHARED means data is durable” -> You need msync for guarantees.
- “Shared memory means no synchronization” -> You still need ordering rules.
- “mmap is faster by default” -> It can be slower if your access pattern is poor.
Check-Your-Understanding Questions
- What does MAP_SHARED guarantee about visibility?
- Why is msync needed for durability?
- How can SIGBUS occur in an mmap system?
Check-Your-Understanding Answers
- Updates are visible to other processes mapping the same region and to the file.
- Without msync, there is no guarantee changes are flushed before munmap.
- If you write beyond the mapped file size, SIGBUS can occur.
Real-World Applications
- High-frequency logging pipelines
- Shared memory queues in low-latency systems
- Memory-mapped databases and caches
Where You Will Apply It
- Project 4, Project 6
References
- mmap(2) man page (MAP_SHARED semantics) - https://man7.org/linux/man-pages/man2/mmap.2.html
- msync(2) man page (flush semantics) - https://man7.org/linux/man-pages/man2/msync.2.html
Key Insight
Shared memory removes copies, not complexity. Correctness still requires protocol design.
Summary
mmap is powerful, but only when you explicitly manage ordering, durability, and crash recovery.
Homework / Exercises
- Build a shared counter using mmap and atomics.
- Write a program that intentionally triggers SIGBUS by exceeding file size.
Solutions
- Use C11 atomics in a shared struct and verify increments across processes.
- mmap a small file, write beyond its length, observe SIGBUS.
Glossary
- Async-signal-safe: Functions safe to call inside a signal handler.
- EMFILE: Error when process exceeds RLIMIT_NOFILE.
- FD_SETSIZE: select() limit for file descriptors.
- Half-open connection: One side sees connection alive, other does not.
- Inode: File identity on disk, independent of its name.
- Open file description: Kernel entry storing offsets and flags for an open file.
- SIGCHLD: Signal when a child process changes state.
- TIME-WAIT: TCP state to prevent old packets from corrupting new connections.
Why Systems Integration Matters
The Modern Problem It Solves
Most outages are not caused by algorithms. They are caused by integration: configuration mistakes, resource limits, network errors, and dependency failures. The data center reliability literature consistently shows that outages are often due to IT/networking issues and human error, not “bugs” in isolation.
Recent industry findings (sources included):
- Uptime Institute reports that IT and networking issues totaled 23% of impactful outages in 2024. (Uptime Institute 2025 Outage Analysis press release: https://uptimeinstitute.com/about/press-releases/uptime-institute-2025-outage-analysis)
- The same report notes that nearly 40% of organizations experienced a major outage caused by human error over the past three years, and 85% of those were tied to process failures. (Uptime Institute 2025 Outage Analysis press release: https://uptimeinstitute.com/about/press-releases/uptime-institute-2025-outage-analysis)
- Uptime Institute also reports that more than half of respondents said their most recent significant outage cost more than $100,000, and one in five reported more than $1 million. (Uptime Institute 2025 Outage Analysis article: https://uptimeinstitute.com/resources/research-and-reports/2025-data-center-industry-survey-outs)
These numbers translate directly to your work: if you do not design for integration failures, your system will cause real downtime and financial loss.
| Old Assumption | Reality |
|---|---|
| "My code works" | OS boundaries fail |
| "Tests pass" | Environment differs |
| "Network is stable" | Timeouts and partitions exist |
Context & Evolution (Brief)
Unix established a clean abstraction: “everything is a file.” That abstraction is powerful, but it hides identity and lifecycle details. Modern systems are built on top of that model, which means integration failures today often trace back to those original boundary decisions.
Concept Summary Table
| Concept Cluster | What You Need to Internalize |
|---|---|
| File Descriptors & Limits | FDs are handles to open file descriptions; limits are enforced and must be handled. |
| File Identity & Log Rotation | Paths change; inodes and truncation semantics matter for correctness. |
| Signals & Process Supervision | Handlers must be safe, and wait logic prevents zombies. |
| TCP Lifecycle & Pool Resilience | TCP states, timeouts, and health checks drive pool correctness. |
| Environment Observability | /proc and resource limits are primary sources of truth. |
| Memory-Mapped I/O | MAP_SHARED visibility, ordering, and explicit flush are required. |
Project-to-Concept Map
| Project | What It Builds | Primer Chapters It Uses |
|---|---|---|
| Project 1: Multi-Source Log Tailer | Rotation-safe tailer | Ch. 1, Ch. 2 |
| Project 2: HTTP Connection Pool | Resilient socket pool | Ch. 1, Ch. 2, Ch. 4 |
| Project 3: Process Supervisor | Signal-safe supervisor | Ch. 1, Ch. 3 |
| Project 4: Memory-Mapped IPC | Shared ring buffer | Ch. 6 |
| Project 5: Environment Debugger | /proc-based diff tool | Ch. 5, Ch. 1 |
| Project 6: Deployment Pipeline | Integration capstone | Ch. 1-6 |
Deep Dive Reading by Concept
File Descriptors and I/O
| Concept | Book & Chapter | Why This Matters |
|---|---|---|
| Universal I/O model | “The Linux Programming Interface” - Ch. 4 | Fundamental FD behavior and syscalls |
| File attributes and inodes | “The Linux Programming Interface” - Ch. 15 | Identity vs path and stat semantics |
| Advanced I/O | “Advanced Programming in the UNIX Environment” - Ch. 14 | Non-blocking I/O and multiplexing |
Signals and Process Control
| Concept | Book & Chapter | Why This Matters |
|---|---|---|
| Process control | “Advanced Programming in the UNIX Environment” - Ch. 8, 9 | fork/exec/wait lifecycle |
| Signals | “The Linux Programming Interface” - Ch. 20, 21 | Correct signal handling |
| Process API | “Operating Systems: Three Easy Pieces” - Process chapter | Conceptual model |
Networking and TCP
| Concept | Book & Chapter | Why This Matters |
|---|---|---|
| Socket programming | “TCP/IP Sockets in C” - Ch. 2-4 | Practical socket design |
| TCP connection management | “TCP/IP Illustrated, Vol. 1” - Ch. 13 | TCP states and failure modes |
/proc and Environment
| Concept | Book & Chapter | Why This Matters |
|---|---|---|
| Process and resource visibility | “How Linux Works” - Ch. 8 | /proc interpretation |
| Linking and libraries | “Computer Systems: A Programmer’s Perspective” - Ch. 7 | Dynamic linking insights |
Memory Mapping
| Concept | Book & Chapter | Why This Matters |
|---|---|---|
| Virtual memory | “Computer Systems: A Programmer’s Perspective” - Ch. 9 | Paging and mappings |
| Memory mappings | “The Linux Programming Interface” - Ch. 49 | mmap and msync semantics |
Quick Start
First 48 hours (minimal path):
Day 1:
- Read TLPI Ch. 4.1-4.3 (Universal I/O model).
- Write a program that opens /var/log/syslog, reads 10 lines, and closes it.
- Run it under strace and observe open, read, and close.
Day 2:
- Read man 2 stat and man 2 open.
- Observe how the FD keeps reading the old inode.
By the end of day 2, you have the key mental model for Project 1.
Recommended Learning Paths
Path 1: New to Systems Programming
- Start with Project 5 (Environment Debugger)
- Then Project 1 (Log Tailer)
- Then Project 3 (Process Supervisor)
- Timeline: 8 to 12 weeks
Path 2: Want Production Skills
- Start with Project 2 (Connection Pool)
- Then Project 1 (Log Tailer)
- Then Project 5 (Environment Debugger)
- Timeline: 10 to 14 weeks
Path 3: Deep Low-Level Understanding
- Complete all projects in order
- Add extra tests and benchmarks
- Timeline: 16 to 20 weeks
Path 4: Interview Preparation
- Focus on Projects 1, 2, 3
- Emphasize “Interview Questions” sections
- Timeline: 6 to 8 weeks
Success Metrics
You have succeeded when you can:
- Explain why a log tailer misses lines after rotation and fix it.
- Diagnose EMFILE errors and relate them to RLIMIT_NOFILE.
- Write a signal handler that is async-signal-safe.
- Build a connection pool that survives timeouts and half-open connections.
- Use /proc to explain environment differences.
- Build an mmap-based IPC system that survives crashes without data corruption.
Project Overview Table
| # | Project | Difficulty | Time | Primary Concepts |
|---|---|---|---|---|
| 1 | Multi-Source Log Tailer | Intermediate | 1-2 weeks | FDs, inodes, rotation, multiplexing |
| 2 | HTTP Connection Pool | Intermediate-Advanced | 2-3 weeks | TCP states, timeouts, pooling |
| 3 | Process Supervisor | Intermediate-Advanced | 2-3 weeks | Signals, process lifecycle |
| 4 | Memory-Mapped IPC | Advanced | 2-3 weeks | mmap, shared memory, ordering |
| 5 | Environment Debugger | Intermediate | 1-2 weeks | /proc, resource limits |
| 6 | Deployment Pipeline Tool | Advanced | 3-4 weeks | Integration of all concepts |
Project List
Project 1: Multi-Source Log Tailer with Rotation Handling
- Main Programming Language: C
- Alternative Programming Languages: Rust, Go
- Coolness Level: Level 4 - Useful in production
- Business Potential: 3 - Observability tooling
- Difficulty: Intermediate
- Knowledge Area: File I/O, inodes, multiplexing
- Main Book: “The Linux Programming Interface” by Michael Kerrisk
What you will build: A log tailer that can follow multiple files, survive rotation (rename and copytruncate), and preserve ordering with timestamped output.
Why it teaches systems integration: It forces you to confront file identity, FD lifetimes, and I/O multiplexing in the presence of real-world log rotation.
Core challenges you will face:
- Track inode and device IDs to detect rotations
- Handle copytruncate and file shrink detection
- Multiplex multiple files without blocking
- Gracefully handle missing files and retries
Real World Outcome
When finished, running your tool looks like this:
$ ./tailer --config logs.yml
[tailer] watching 3 files with poll() interval=250ms
[tailer] /var/log/app.log inode=4198401
[tailer] /var/log/api.log inode=4198420
[tailer] /var/log/worker.log inode=4198451
2026-01-01T12:00:01Z app.log INFO started pid=23312
2026-01-01T12:00:02Z api.log WARN upstream timeout id=9f2c
2026-01-01T12:00:02Z worker.log INFO batch=17 processed=200
# rotation event
[tailer] app.log rotated (inode changed 4198401 -> 4201130), reopening
2026-01-01T12:00:10Z app.log INFO rotation detected, continuing
You can also force copytruncate:
$ logrotate -f /etc/logrotate.d/app
[tailer] app.log truncated (size decreased), reset offset
The Core Question You’re Answering
“How do I follow a file that does not stay the same file?”
This is the core integration problem: your program must track identity, not just names, and handle rotations without data loss.
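One way to make "identity, not names" concrete is to compare the device and inode numbers of the descriptor you already hold against whatever currently sits at the same path. The helper below is a minimal sketch; the function name and return convention are illustrative, not prescribed by the project.
/* Returns 1 if the path now refers to a different file than the open fd
 * (i.e. the log was rotated by rename), 0 if identity matches, -1 on error. */
#include <sys/stat.h>

int rotated(int fd, const char *path) {
    struct stat by_fd, by_name;
    if (fstat(fd, &by_fd) == -1) return -1;
    if (stat(path, &by_name) == -1) return -1;  /* path may be missing mid-rotation */
    return (by_fd.st_dev != by_name.st_dev) || (by_fd.st_ino != by_name.st_ino);
}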
Concepts You Must Understand First
- Open file descriptions and FD identity
- Why does the FD keep pointing to the old inode after rename?
- Book: TLPI Ch. 4
- Inodes and file attributes
- How do device/inode pairs identify files?
- Book: TLPI Ch. 15
- I/O multiplexing
- Why is select() limited by FD_SETSIZE?
- Book: APUE Ch. 14
Questions to Guide Your Design
- How will you detect rotation by rename vs copytruncate?
- What is your strategy for missing files? (retry, backoff, error)
- How will you preserve ordering across multiple logs?
- Do you use poll, epoll, or inotify? Why?
Thinking Exercise
Rotation Timeline
T0: app.log inode=100, size=10MB
T1: logrotate renames app.log -> app.log.1 (inode 100)
T2: new app.log inode=101, size=0
T3: writer continues to write to old FD for 2 seconds
Questions:
- What does a follow-by-descriptor tailer read at T2 to T3?
- What does a follow-by-name tailer read?
- How do you avoid duplicate lines or missed lines?
The Interview Questions They’ll Ask
- “Explain why tail -F behaves differently from tail -f.”
- “How do you detect a rotated file if you already have an FD?”
- “Why does select not scale for many log files?”
- “How do you handle copytruncate?”
- “What is the difference between inode and path?”
Hints in Layers
Hint 1: Track inode and size
struct stat st;
stat(path, &st);
uint64_t inode = st.st_ino;
off_t size = st.st_size;
Hint 2: Detect rotation
struct stat st2;
stat(path, &st2);
if (st2.st_ino != inode || st2.st_size < last_offset) {
// reopen
}
Hint 3: Use poll
struct pollfd fds[MAX_FILES];
int n = poll(fds, count, timeout_ms);
Hint 4: Handle missing files
# exponential backoff re-open
sleep 1 -> sleep 2 -> sleep 4
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| File I/O model | TLPI | Ch. 4 |
| File attributes | TLPI | Ch. 15 |
| I/O multiplexing | APUE | Ch. 14 |
| System call errors | Linux System Programming | Ch. 1 |
Common Pitfalls & Debugging
Problem 1: “Tailer stops after rotation”
- Why: Following by descriptor, never reopening path.
- Fix: Follow by name and reopen on inode change.
- Quick test: Rotate log and verify new inode is read.
Problem 2: “Tailer loops and rereads old data”
- Why: Not tracking file offset when truncation occurs.
- Fix: Reset offset if size shrinks.
Problem 3: “select fails with many files”
- Why: FD_SETSIZE limit.
- Fix: Use poll or epoll.
Definition of Done
- Follows at least 3 files simultaneously without blocking
- Handles rename rotation and copytruncate correctly
- Detects missing files and retries with backoff
- Logs rotation events with inode changes
- Passes a stress test with 100 rotations
Project 2: HTTP Connection Pool with Failure Injection
- Main Programming Language: C
- Alternative Programming Languages: Rust, Go
- Coolness Level: Level 4 - Production-grade networking
- Business Potential: 4 - Infrastructure library
- Difficulty: Intermediate-Advanced
- Knowledge Area: TCP, sockets, resilience patterns
- Main Book: “TCP/IP Illustrated, Vol. 1” by Stevens and Fall
What you will build: A reusable HTTP connection pool that supports timeouts, health checks, backoff, and failure injection. It should survive server restarts and network partitions without leaking FDs.
Why it teaches systems integration: It forces you to interact with TCP state, timeouts, and failure modes that do not appear in simple socket examples.
Core challenges you will face:
- Non-blocking connect with SO_ERROR verification
- Half-open connection detection and eviction
- Timeout layering (connect, read, idle)
- Failure injection and backoff
Real World Outcome
$ ./pooler --host 127.0.0.1 --port 8080 --size 10 --inject-failures
[pool] created 10 connections
[pool] request latency p50=3ms p95=12ms
[pool] failures: ECONNRESET=3 ETIMEDOUT=2
[pool] circuit breaker OPEN for 5s
[pool] recovered, pool size=10
When the server restarts, the pool should recover automatically:
[pool] detected dead connection (EPIPE), evicting
[pool] reconnect success after 800ms
The Core Question You’re Answering
“How do I keep connections reliable when the network is not?”
Concepts You Must Understand First
- TCP state machine (LISTEN, ESTABLISHED, TIME-WAIT)
- Non-blocking connect and SO_ERROR
- Timeout and retry strategies
Questions to Guide Your Design
- How do you detect half-open sockets?
- When do you evict a connection vs retry a request?
- How do you enforce pool size under failure?
- What metrics will you emit for observability?
Thinking Exercise
Half-Open Simulation
Client sends request -> server crashes after ACK
Client waits for response -> no data
What should the pool do after 5s? 30s? 2 minutes?
The Interview Questions They’ll Ask
- “What is TIME-WAIT and why does it exist?”
- “How do you validate a non-blocking connect?”
- “How do you detect half-open connections?”
- “Explain circuit breaker vs retry.”
Hints in Layers
Hint 1: Non-blocking connect
int fd = socket(AF_INET, SOCK_STREAM | SOCK_NONBLOCK, 0);
connect(fd, (struct sockaddr *)&addr, sizeof(addr));
Hint 2: Wait for connect completion
poll(&pfd, 1, timeout_ms);
getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len);
Hint 3: Health check
send(fd, "HEAD / HTTP/1.0\r\n\r\n", ...);
Hint 4: Backoff
sleep 100ms -> 200ms -> 400ms -> 800ms
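Putting Hints 1 and 2 together, a connect-with-timeout routine might look roughly like the sketch below. It assumes an IPv4 dotted-quad address; name resolution, retry policy, and pool bookkeeping are left to your design.
/* Non-blocking connect validated with poll() and SO_ERROR.
 * Returns a connected, still non-blocking fd, or -1 on timeout or error. */
#include <arpa/inet.h>
#include <errno.h>
#include <poll.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int connect_with_timeout(const char *ip, uint16_t port, int timeout_ms) {
    int fd = socket(AF_INET, SOCK_STREAM | SOCK_NONBLOCK, 0);
    if (fd == -1) return -1;

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    if (inet_pton(AF_INET, ip, &addr.sin_addr) != 1) { close(fd); return -1; }

    /* On a non-blocking socket, connect() normally returns -1/EINPROGRESS. */
    if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) == -1 &&
        errno != EINPROGRESS) { close(fd); return -1; }

    struct pollfd pfd = { .fd = fd, .events = POLLOUT };
    if (poll(&pfd, 1, timeout_ms) <= 0) { close(fd); return -1; }  /* timeout or poll error */

    int err = 0;
    socklen_t len = sizeof(err);
    if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len) == -1 || err != 0) {
        close(fd); return -1;               /* connect failed asynchronously */
    }
    return fd;
}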
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Socket programming | TCP/IP Sockets in C | Ch. 2-4 |
| TCP state machine | TCP/IP Illustrated Vol. 1 | Ch. 13 |
| Failure patterns | Release It! | Ch. 5 |
Common Pitfalls & Debugging
Problem 1: “Pool reports connected but requests hang”
- Why: Half-open connection not detected
- Fix: Use a read timeout plus periodic health checks (see the sketch after this list)
Problem 2: “Connections never succeed under load”
- Why: Exhausted RLIMIT_NOFILE
- Fix: Track FD usage and cap pool size.
Problem 3: “Retry storm”
- Why: No backoff or circuit breaker
- Fix: Implement exponential backoff and open circuit
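For Problem 1 above, one way to bound a read is to wait with poll() before calling recv(); if the deadline passes with no data, treat the connection as suspect and let the pool decide whether to evict it. The function name and the -2 timeout convention below are illustrative only.
/* Waits up to timeout_ms for data. Returns bytes read, 0 on orderly close,
 * -1 on error, -2 on timeout (caller decides whether to evict). Sketch only. */
#include <poll.h>
#include <sys/socket.h>
#include <sys/types.h>

ssize_t recv_with_deadline(int fd, void *buf, size_t len, int timeout_ms) {
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    int ready = poll(&pfd, 1, timeout_ms);
    if (ready == 0) return -2;        /* deadline hit: likely dead or very slow peer */
    if (ready < 0)  return -1;
    return recv(fd, buf, len, 0);     /* 0 here means the peer closed cleanly */
}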
Definition of Done
- Pool handles connect, read, and idle timeouts
- Detects and evicts half-open connections
- Exposes metrics for failures and latency
- Survives server restart without manual reset
- Failure injection mode demonstrates recovery
Project 3: Process Supervisor with Signal Forwarding
- Main Programming Language: C
- Alternative Programming Languages: Rust, Go
- Coolness Level: Level 4 - Operationally essential
- Business Potential: 3 - Service management tooling
- Difficulty: Intermediate-Advanced
- Knowledge Area: Signals, process lifecycle
- Main Book: “Advanced Programming in the UNIX Environment” by Stevens and Rago
What you will build: A supervisor that launches a child process, monitors its health, forwards signals, restarts on crash, and never leaves zombies.
Why it teaches systems integration: It forces you to handle signals correctly and manage child processes under crash conditions.
Core challenges you will face:
- SIGCHLD handling and reaping
- Signal-safe handlers
- Restart policies and backoff
- Process group management
Real World Outcome
$ ./supervisor -- ./worker --port 9000
[supervisor] started child pid=4412
[supervisor] forwarding SIGTERM to process group
[supervisor] child exited status=0
[supervisor] restart in 2s
When the child crashes:
[supervisor] child died signal=11 (SIGSEGV)
[supervisor] restart in 4s (backoff)
The Core Question You’re Answering
“How do I manage processes safely when the OS can interrupt me at any time?”
Concepts You Must Understand First
- SIGCHLD and waitpid
- Async-signal-safe functions
- Process groups and signal forwarding
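Before designing the handler logic, it helps to see the registration boilerplate. The sketch below installs a SIGCHLD handler that only sets a flag, mirroring Hint 1 further down; the flag name and flag choices are assumptions you can adapt.
/* Registers a SIGCHLD handler that only records "a child changed state".
 * SA_RESTART reduces (but does not eliminate) EINTR handling elsewhere;
 * SA_NOCLDSTOP skips notifications for stopped children. Sketch only. */
#include <signal.h>
#include <string.h>

static volatile sig_atomic_t child_exited = 0;

static void on_sigchld(int sig) {
    (void)sig;
    child_exited = 1;                 /* async-signal-safe: just set a flag */
}

int install_sigchld(void) {
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_sigchld;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART | SA_NOCLDSTOP;
    return sigaction(SIGCHLD, &sa, NULL);
}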
Questions to Guide Your Design
- What logic runs in the handler vs the main loop?
- How do you avoid signal handler races?
- How do you decide restart policy for crash loops?
Thinking Exercise
Crash Loop Scenario
Child exits every 1 second with SIGSEGV.
What restart policy avoids CPU burn and alert fatigue?
The Interview Questions They’ll Ask
- “Why is printf unsafe inside a signal handler?”
- “How do you avoid zombie processes?”
- “What is SA_SIGINFO and why use it?”
- “How do you handle EINTR?”
Hints in Layers
Hint 1: Minimal handler
volatile sig_atomic_t child_exited = 0;
void handler(int sig) { child_exited = 1; }
Hint 2: Reap in loop
while (waitpid(-1, &status, WNOHANG) > 0) { ... }
Hint 3: Process group
setpgid(0, 0); // child becomes leader
kill(-pgid, SIGTERM); // signal entire group
Hint 4: Backoff
sleep 1 -> 2 -> 4 -> 8
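Tying the hints together, a stripped-down supervise-once routine might look like the sketch below; restart policy, backoff, and logging format are placeholders for your own design. A real supervisor would call this in a loop and sleep between attempts.
/* Minimal supervise-once loop: fork/exec a child in its own process group,
 * block until it exits (retrying waitpid on EINTR), then report how it died. */
#include <errno.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int supervise_once(char *const argv[]) {
    pid_t pid = fork();
    if (pid == -1) return -1;
    if (pid == 0) {
        setpgid(0, 0);                  /* child leads its own process group */
        execvp(argv[0], argv);
        _exit(127);                     /* exec failed */
    }

    int status;
    while (waitpid(pid, &status, 0) == -1) {
        if (errno != EINTR) return -1;  /* retry only when interrupted by a signal */
    }
    if (WIFEXITED(status))
        fprintf(stderr, "[supervisor] child exited status=%d\n", WEXITSTATUS(status));
    else if (WIFSIGNALED(status))
        fprintf(stderr, "[supervisor] child died signal=%d\n", WTERMSIG(status));
    return 0;
}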
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Process control | APUE | Ch. 8-9 |
| Signals | TLPI | Ch. 20-21 |
| Process API | OSTEP | Process chapter |
Common Pitfalls & Debugging
Problem 1: “Zombies remain”
- Why: Not calling waitpid in a loop
- Fix: Reap in a loop until waitpid with WNOHANG returns 0 or -1
Problem 2: “Signal handler deadlocks”
- Why: Using printf or malloc inside handler
- Fix: Use only async-signal-safe operations.
Problem 3: “Child keeps running after supervisor dies”
- Why: Process group not managed
- Fix: Use process groups and forward signals
Definition of Done
- Starts and monitors a child process
- Reaps all children, no zombies
- Forwards SIGTERM and SIGINT correctly
- Implements restart backoff
- Handles EINTR gracefully
Project 4: Memory-Mapped Ring Buffer IPC
- Main Programming Language: C
- Alternative Programming Languages: Rust, C++
- Coolness Level: Level 5 - High-performance IPC
- Business Potential: 4 - Infrastructure building block
- Difficulty: Advanced
- Knowledge Area: mmap, shared memory, ordering
- Main Book: “The Linux Programming Interface” by Michael Kerrisk
What you will build: A ring buffer shared by two processes using a memory-mapped file. A writer inserts log entries, a reader consumes them with minimal latency.
Why it teaches systems integration: It moves data between processes without per-message system calls, forcing you to handle ordering, visibility, and crash recovery yourself.
Core challenges you will face:
- Correct MAP_SHARED setup and file sizing
- Ordering with atomics
- Detecting partial writes after crashes
- Optional durability with msync
Real World Outcome
$ ./writer --rate 100000
[writer] wrote 100k msgs/sec into ring
$ ./reader
[reader] msg=932183 latency=220us
[reader] msg=932184 latency=210us
After killing the writer:
[reader] writer gone, waiting for restart
The Core Question You’re Answering
“How do I share data between processes without syscalls, safely?”
Concepts You Must Understand First
- MAP_SHARED visibility rules
- msync durability semantics
- Memory ordering and atomics
Questions to Guide Your Design
- How will you detect a partially written record?
- Will you drop or block when the ring is full?
- How will you recover on restart?
Thinking Exercise
Partial Write Scenario
Writer writes header length=100
Writer crashes before payload
Reader sees header. What should it do?
The Interview Questions They’ll Ask
- “What does MAP_SHARED guarantee?”
- “Why do you need msync?”
- “How do you ensure reader sees a consistent record?”
Hints in Layers
Hint 1: File sizing
ftruncate(fd, RING_SIZE);
Hint 2: Map shared
void *buf = mmap(NULL, RING_SIZE, PROT_READ|PROT_WRITE,
MAP_SHARED, fd, 0);
Hint 3: Record format
struct rec { uint32_t len; uint32_t crc; char data[]; };
Hint 4: Publish order
write payload -> write header -> atomic_store_release(tail)
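Hint 4 expressed with C11 atomics might look roughly like the sketch below. It assumes the struct sits at the start of the MAP_SHARED region from Hint 2, and it deliberately omits wrap-around and free-space checks, which a real ring buffer needs.
/* Release/acquire pairing around the tail index: the writer publishes a
 * record by storing the new tail with release semantics only after the bytes
 * are in place; the reader loads tail with acquire semantics before reading.
 * Indices are not wrapped here; this only illustrates the ordering. */
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

struct ring {
    _Atomic uint64_t tail;            /* next free byte, advanced by the writer */
    _Atomic uint64_t head;            /* next unread byte, advanced by the reader */
    char data[1 << 20];
};

void publish(struct ring *r, const void *msg, uint32_t len) {
    uint64_t t = atomic_load_explicit(&r->tail, memory_order_relaxed);
    memcpy(&r->data[t], &len, sizeof(len));              /* header */
    memcpy(&r->data[t + sizeof(len)], msg, len);         /* payload */
    atomic_store_explicit(&r->tail, t + sizeof(len) + len,
                          memory_order_release);         /* publish */
}

int consume(struct ring *r, void *out, uint32_t *len) {
    uint64_t h = atomic_load_explicit(&r->head, memory_order_relaxed);
    uint64_t t = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (h == t) return 0;                                /* nothing published yet */
    memcpy(len, &r->data[h], sizeof(*len));
    memcpy(out, &r->data[h + sizeof(*len)], *len);
    atomic_store_explicit(&r->head, h + sizeof(*len) + *len,
                          memory_order_release);
    return 1;
}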
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Memory mappings | TLPI | Ch. 49 |
| Virtual memory | CS:APP | Ch. 9 |
| Memory ordering | Rust Atomics and Locks | Ch. 3 |
Common Pitfalls & Debugging
Problem 1: “Reader sees corrupted data”
- Why: Missing ordering barriers
- Fix: Use atomic release/acquire around head/tail
Problem 2: “SIGBUS on write”
- Why: File not sized before mmap
- Fix: ftruncate before mapping
Problem 3: “Data lost after crash”
- Why: No msync for critical checkpoints
- Fix: msync important segments.
Definition of Done
- Shared ring buffer works across two processes
- Writer and reader handle wrap-around correctly
- Partial write detection prevents corruption
- Optional durability with msync
- Throughput benchmark reported
Project 5: “Works On My Machine” Environment Debugger
- Main Programming Language: C
- Alternative Programming Languages: Python, Go
- Coolness Level: Level 3 - Deeply practical
- Business Potential: 3 - Support tooling
- Difficulty: Intermediate
- Knowledge Area: /proc, resource limits, env
- Main Book: “How Linux Works” by Brian Ward
What you will build: A tool that captures a machine fingerprint (limits, env, libs, open FDs) and produces a diff between two machines.
Why it teaches systems integration: It turns invisible environment differences into concrete evidence.
Core challenges you will face:
- Parsing /proc files reliably
- Normalizing environment output for diffs
- Capturing FD and library state
Real World Outcome
$ ./envdiff capture > local.json
$ ./envdiff capture --pid 2231 > server.json
$ ./envdiff diff local.json server.json
DIFF:
- RLIMIT_NOFILE: 1024 -> 65535
- LD_LIBRARY_PATH: (unset) -> /opt/lib
- /proc/self/fd count: 12 -> 84
- libc.so.6: 2.35 -> 2.31
The Core Question You’re Answering
“What is different between two environments that makes one fail?”
Concepts You Must Understand First
- /proc filesystem as kernel interface
- Resource limits and RLIMIT_NOFILE
- Dynamic library loading
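Besides parsing /proc/self/limits, you can read limits for the current process directly with getrlimit; a minimal sketch (for another PID you would fall back to parsing /proc/<pid>/limits):
/* Reads the open-file limit for the current process and prints it. */
#include <stdio.h>
#include <sys/resource.h>

int print_nofile_limit(void) {
    struct rlimit rl;
    if (getrlimit(RLIMIT_NOFILE, &rl) == -1) return -1;
    printf("RLIMIT_NOFILE soft=%llu hard=%llu\n",
           (unsigned long long)rl.rlim_cur,
           (unsigned long long)rl.rlim_max);
    return 0;
}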
Questions to Guide Your Design
- Which /proc files give the most debugging value?
- How will you normalize values for stable diffs?
- How will you handle permission errors gracefully?
Thinking Exercise
Diff Prioritization
Two machines differ in 50 variables.
Which 5 are most likely to explain "too many open files" errors?
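A common first check for that error is the process's own FD count versus its RLIMIT_NOFILE. The sketch below counts entries in /proc/self/fd; the function name is illustrative, and note the count includes the directory stream's own descriptor while it is open.
/* Counts open descriptors by listing /proc/self/fd. */
#include <dirent.h>

int count_open_fds(void) {
    DIR *d = opendir("/proc/self/fd");
    if (!d) return -1;
    int count = 0;
    struct dirent *e;
    while ((e = readdir(d)) != NULL) {
        if (e->d_name[0] != '.') count++;   /* skip "." and ".." */
    }
    closedir(d);
    return count;
}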
The Interview Questions They’ll Ask
- “What is /proc and what does it contain?”
- “How do you detect FD leaks in production?”
- “What does RLIMIT_NOFILE mean?”
Hints in Layers
Hint 1: Read limits
fopen("/proc/self/limits", "r");
Hint 2: Enumerate FDs
ls -l /proc/self/fd
Hint 3: Libraries
grep '\.so' /proc/self/maps
Hint 4: Diff JSON
jq --sort-keys . local.json > a.json
jq --sort-keys . server.json > b.json
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| /proc and processes | How Linux Works | Ch. 8 |
| Resource limits | TLPI | Ch. 36 |
| Linking | CS:APP | Ch. 7 |
Common Pitfalls & Debugging
Problem 1: “Tool fails on permission errors”
- Why: /proc access restricted by hidepid
- Fix: Detect EACCES and report missing data.
Problem 2: “Diff is noisy”
- Why: Unstable values like timestamps
- Fix: Normalize or exclude volatile keys
Definition of Done
- Captures limits, env, libs, and FDs
- Produces deterministic JSON snapshot
- Diffs two snapshots with clear output
- Handles missing permissions gracefully
Project 6: Deployment Pipeline Tool (Final Integration)
- Main Programming Language: C
- Alternative Programming Languages: Go, Rust
- Coolness Level: Level 5 - Real integration capstone
- Business Potential: 4 - DevOps tool
- Difficulty: Advanced
- Knowledge Area: Integration of all prior concepts
- Main Book: “The Linux Programming Interface” by Michael Kerrisk
What you will build: A deployment tool that watches a local directory, syncs changes to a remote host, restarts a supervised service, aggregates logs, and reports health.
Why it teaches systems integration: It forces you to orchestrate file watching, networking, process supervision, and environment diagnostics in one system.
Core challenges you will face:
- Coordinating multiple subsystems reliably
- Handling partial deployments and rollbacks
- Observability across local and remote
Real World Outcome
$ ./deploy watch ./src user@server:/opt/app
[deploy] detected change: main.c
[deploy] rsync complete in 420ms
[deploy] restarting service via supervisor
[deploy] service healthy in 1.2s
[deploy] log stream attached
The Core Question You’re Answering
“How do I make multiple systems act as one coherent tool under failure?”
Concepts You Must Understand First
- FD and rotation handling (Project 1)
- Connection pooling and timeouts (Project 2)
- Process supervision and signals (Project 3)
- /proc environment diagnostics (Project 5)
Questions to Guide Your Design
- What happens if sync succeeds but restart fails?
- How do you guarantee logs are not lost during restarts?
- How do you shut down gracefully on SIGTERM?
Thinking Exercise
Partial Deploy Scenario
Sync 60% of files, then network drops.
Should you roll back, retry, or fail fast?
The Interview Questions They’ll Ask
- “How would you design a tool that survives partial failure?”
- “How do you avoid FD leaks in a long-running deployment tool?”
- “How do you ensure logs keep flowing across restarts?”
Hints in Layers
Hint 1: File watching
// Use inotify or poll-based scanning
Hint 2: Reuse pool
// Use Project 2 pool for SSH/HTTP commands
Hint 3: Supervisor integration
// Use Project 3 logic to restart and forward signals
Hint 4: Log aggregation
// Use Project 1 tailer to follow remote logs
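For Hint 1, a minimal inotify sketch is shown below; it watches a single directory for completed writes and moved-in files. The path handling, event filtering, and output format are placeholders, and a real tool would also handle IN_Q_OVERFLOW and recursive watches.
/* Minimal inotify loop: reports close-after-write and moved-in events for
 * one directory, printing the affected file names. Sketch only. */
#include <stdio.h>
#include <sys/inotify.h>
#include <unistd.h>

int watch_dir(const char *path) {
    int fd = inotify_init1(IN_CLOEXEC);
    if (fd == -1) return -1;
    if (inotify_add_watch(fd, path, IN_CLOSE_WRITE | IN_MOVED_TO) == -1) {
        close(fd);
        return -1;
    }

    /* Buffer aligned for struct inotify_event, as the man page recommends. */
    char buf[4096] __attribute__((aligned(__alignof__(struct inotify_event))));
    for (;;) {
        ssize_t n = read(fd, buf, sizeof(buf));   /* blocks until events arrive */
        if (n <= 0) break;
        for (char *p = buf; p < buf + n; ) {
            struct inotify_event *ev = (struct inotify_event *)p;
            if (ev->len > 0)
                printf("[deploy] detected change: %s\n", ev->name);
            p += sizeof(*ev) + ev->len;
        }
    }
    close(fd);
    return 0;
}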
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| I/O and FDs | TLPI | Ch. 4 |
| Process control | APUE | Ch. 8-9 |
| Networking | TCP/IP Sockets in C | Ch. 2-4 |
Common Pitfalls & Debugging
Problem 1: “Deploy tool hangs on network loss”
- Why: No timeout on remote operations
- Fix: Use timeouts and backoff
Problem 2: “Logs stop after restart”
- Why: Tailer follows old inode
- Fix: Reopen logs on inode change (Project 1)
Problem 3: “Zombie child processes”
- Why: Supervisor not reaping children
- Fix: Use waitpid loop (Project 3)
Definition of Done
- Detects local file changes reliably
- Syncs to remote with retries and timeouts
- Restarts service and verifies health
- Aggregates logs across rotations
- Produces a final deployment report
Last updated: 2026-01-01 | Projects: 6 | Difficulty: Intermediate to Advanced | Timeline: 10-20 weeks