Understanding BSD vs Linux & Unix Variants: A Deep Dive Through Building

Goal: Master the fundamental architectural differences between Unix variants—BSD (FreeBSD, OpenBSD), Linux, and illumos—by building real systems that expose their distinct design philosophies. You’ll understand why these systems differ, not just how, enabling you to choose the right tool for each job and write truly portable systems code.


Why This Knowledge Matters

In 1969 at Bell Labs, Ken Thompson and Dennis Ritchie created Unix. From that single ancestor, an entire family of operating systems evolved—each branch making different design choices that echo through every system call you make today.

The professional reality:

Understanding these systems isn’t academic—it’s understanding why your containers work the way they do, why some firewalls are easier to configure than others, and why certain security models exist.

The core question: Why did these systems evolve differently from the same ancestor?

                                 ┌────────────────────────────────────────┐
                                 │        Original Unix (1969)            │
                                 │        Bell Labs (Thompson/Ritchie)    │
                                 └─────────────────┬──────────────────────┘
                                                   │
                      ┌────────────────────────────┴────────────────────────────┐
                      │                                                         │
                      ▼                                                         ▼
        ┌─────────────────────────────┐                          ┌──────────────────────────┐
        │      BSD (1977)             │                          │    System V (AT&T)       │
        │   UC Berkeley               │                          │                          │
        │   "Academic/Research"       │                          │    "Commercial Unix"     │
        └──────────────┬──────────────┘                          └────────────┬─────────────┘
                       │                                                      │
       ┌───────────────┼───────────────┬─────────────────┐                   │
       │               │               │                 │                   │
       ▼               ▼               ▼                 ▼                   ▼
  ┌─────────┐    ┌─────────┐    ┌─────────┐      ┌────────────┐      ┌────────────────┐
  │ FreeBSD │    │ OpenBSD │    │ NetBSD  │      │  Darwin/   │      │    Solaris     │
  │ (1993)  │    │ (1995)  │    │ (1993)  │      │   macOS    │      │    (1992)      │
  │         │    │         │    │         │      │            │      │                │
  │ Focus:  │    │ Focus:  │    │ Focus:  │      │ Mach +     │      │ DTrace, ZFS    │
  │ Perf,   │    │ Security│    │ Porta-  │      │ BSD        │      │                │
  │ Features│    │ Correct │    │ bility  │      │ userland   │      └────────┬───────┘
  └─────────┘    └─────────┘    └─────────┘      └────────────┘               │
                                                                              ▼
                                                                      ┌────────────────┐
        ┌─────────────────────────────────────────────────────────┐   │    illumos     │
        │                  Linux (1991)                            │   │    (2010)      │
        │           NOT Unix descendant—Unix-LIKE                  │   │                │
        │           Reimplementation of Unix ideas                 │   │ OpenSolaris    │
        │           Linux kernel + GNU userland                    │   │ fork           │
        └─────────────────────────────────────────────────────────┘   └────────────────┘

The key insight that will click once you build these projects:

  • Linux = A kernel with userland assembled from various sources (Lego blocks)
  • BSD = Complete, integrated operating systems (Finished product)
  • illumos = Enterprise Unix with native observability (DTrace) and storage (ZFS)

This fundamental difference shapes EVERYTHING: security models, container implementations, networking APIs, and more.


The Design Philosophy Deep Dive

Linux: “The Bazaar”

Linux follows the “bazaar” model from Eric Raymond’s essay The Cathedral and the Bazaar—many independent contributors, rapid iteration, features from everywhere. The kernel is separate from userland (GNU tools, systemd, etc.).

┌─────────────────────────────────────────────────────────────────────┐
│                         Linux System                                 │
├─────────────────────────────────────────────────────────────────────┤
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  ┌────────────┐ │
│  │   systemd   │  │ GNU coreutils│  │  glibc      │  │   bash     │ │
│  │ (Lennart P.)│  │  (FSF)      │  │  (FSF)      │  │  (FSF)     │ │
│  └─────────────┘  └─────────────┘  └─────────────┘  └────────────┘ │
│                    ▲ Different projects, different maintainers      │
├────────────────────┼────────────────────────────────────────────────┤
│                    │                                                 │
│                    │  Linux Kernel (Torvalds et al.)                │
│    ┌───────────────┴──────────────────────────────────────────┐     │
│    │ Monolithic kernel with loadable modules                   │     │
│    │ syscall interface is the stable API boundary              │     │
│    └──────────────────────────────────────────────────────────┘     │
└─────────────────────────────────────────────────────────────────────┘
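
The diagram’s note that the syscall interface is the stable boundary can be made concrete. The sketch below (illustrative and Linux-specific; syscall(2) and SYS_write come from the standard headers) performs the same write twice, once through the libc wrapper and once through the raw syscall interface that sits beneath any libc:

// Minimal sketch: the libc wrapper vs. the raw syscall underneath it.
// On Linux the syscall numbers are the stable kernel contract; glibc,
// musl, or any other libc is a separate, replaceable layer on top.
#include <sys/syscall.h>
#include <unistd.h>

int main(void) {
    const char msg1[] = "via the libc wrapper\n";
    const char msg2[] = "via the raw syscall\n";

    write(STDOUT_FILENO, msg1, sizeof(msg1) - 1);               // libc wrapper
    syscall(SYS_write, STDOUT_FILENO, msg2, sizeof(msg2) - 1);  // kernel ABI directly
    return 0;
}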

Implications:

  • Security features come from many sources (seccomp, SELinux, AppArmor, namespaces)
  • Containers are “assembled” from primitives (namespaces + cgroups + seccomp + …)
  • Updates can be partial (update kernel, keep userland or vice versa)

BSD: “The Cathedral”

BSD maintains the entire operating system as one project. Kernel, libc, core utilities, documentation—all versioned together.

┌─────────────────────────────────────────────────────────────────────┐
│                    FreeBSD/OpenBSD System                            │
├─────────────────────────────────────────────────────────────────────┤
│           Single source tree, single project                         │
│  ┌─────────────────────────────────────────────────────────────┐    │
│  │  /usr/src                                                     │    │
│  │   ├── sys/          (kernel source)                          │    │
│  │   ├── lib/          (libc, libm, etc.)                       │    │
│  │   ├── bin/          (core utilities: ls, cat, etc.)          │    │
│  │   ├── sbin/         (system utilities: mount, ifconfig)      │    │
│  │   ├── usr.bin/      (user utilities: grep, awk, etc.)        │    │
│  │   └── share/        (docs, man pages)                        │    │
│  │                                                               │    │
│  │   ALL maintained by the SAME project, versioned TOGETHER     │    │
│  └─────────────────────────────────────────────────────────────┘    │
│                                                                      │
│  Result: Tight integration, consistent coding style, unified docs   │
└─────────────────────────────────────────────────────────────────────┘

Implications:

  • Security features are built-in (jails, Capsicum on FreeBSD; pledge/unveil on OpenBSD)
  • Containers are “first-class” (jail(2) is a single system call)
  • Updates are atomic (upgrade entire base system together)

Security Model Comparison: A Critical Difference

The security philosophy differences are profound:

OpenBSD: Promise-Based Security (pledge/unveil)

// OpenBSD: Tell the kernel what you WILL do, reveal what you WILL see
#include <err.h>
#include <unistd.h>

int main(void) {
    // Only reveal these filesystem paths
    if (unveil("/var/log", "rw") == -1)
        err(1, "unveil");
    if (unveil(NULL, NULL) == -1)  // Lock it down: no more unveil calls
        err(1, "unveil");

    // After this, only these capabilities remain
    // (unveil must come first, or the pledge would need the "unveil" promise)
    if (pledge("stdio rpath wpath", NULL) == -1)
        err(1, "pledge");

    // Now the program runs with minimal privileges
    // Any violation = immediate SIGABRT (uncatchable)
    return 0;
}

Philosophy: “Surrender capabilities at runtime. Promise what you’ll do, reveal what you’ll see.” Simple, auditable, comprehensible by mortals.

FreeBSD: Capability-Based Security (Capsicum)

// FreeBSD: Limit capabilities on file descriptors
#include <fcntl.h>
#include <sys/capsicum.h>

int main(void) {
    int fd = open("/etc/passwd", O_RDONLY);

    cap_rights_t rights;
    cap_rights_init(&rights, CAP_READ, CAP_SEEK);
    cap_rights_limit(fd, &rights);  // This fd can now ONLY read/seek

    cap_enter();  // Enter capability mode - no more global namespace access

    // fd is now the ONLY way to access that file
    // Cannot open new files, cannot access network
    return 0;
}

Philosophy: “Capabilities are tokens attached to file descriptors.” Fine-grained control, but more complex.

Linux: Filter-Based Security (seccomp-bpf)

// Linux: Write BPF program to filter syscalls
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <stddef.h>
#include <sys/prctl.h>
#include <sys/syscall.h>

struct sock_filter filter[] = {
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_write, 0, 1),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL),
};
struct sock_fprog prog = { .len = 4, .filter = filter };
prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);  // required before installing a filter as non-root
prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);

Philosophy: “Maximum flexibility through programmability.” You write a BPF program that filters syscalls. Powerful but complex—easy to make mistakes.

┌───────────────────────────────────────────────────────────────────────────┐
│                    Security Model Comparison                               │
├────────────────────┬──────────────────────┬───────────────────────────────┤
│     OpenBSD        │      FreeBSD         │          Linux                │
│   pledge/unveil    │      Capsicum        │      seccomp-bpf              │
├────────────────────┼──────────────────────┼───────────────────────────────┤
│                    │                      │                               │
│ "I promise to only"│ "This fd can only"  │ "If syscall matches filter"  │
│                    │                      │                               │
│   Simple strings   │  Capability rights   │    BPF bytecode program      │
│   "stdio rpath"    │  CAP_READ, CAP_SEEK  │    Complex filter rules      │
│                    │                      │                               │
│   Easy to audit    │  Medium complexity   │    Hard to get right         │
│   ~10 lines code   │  ~30 lines code     │    ~100+ lines code          │
│                    │                      │                               │
└────────────────────┴──────────────────────┴───────────────────────────────┘

Isolation Architecture: Containers vs Jails vs Zones

This is where the “first-class concept” vs “building blocks” difference becomes crystal clear:

Linux: Assemble from Primitives

┌─────────────────────────────────────────────────────────────────────┐
│                    Linux "Container"                                 │
│            (NOT a kernel concept—assembled from parts)               │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  You must combine:                                                   │
│  ┌──────────────┐ ┌──────────────┐ ┌──────────────┐                 │
│  │ PID namespace│ │mount namespace│ │network ns   │                 │
│  │ clone(CLONE_ │ │ clone(CLONE_ │ │ clone(CLONE_│                 │
│  │    NEWPID)   │ │    NEWNS)    │ │    NEWNET)  │                 │
│  └──────────────┘ └──────────────┘ └──────────────┘                 │
│  ┌──────────────┐ ┌──────────────┐ ┌──────────────┐                 │
│  │ UTS namespace│ │ IPC namespace│ │ user ns     │                 │
│  │   (hostname) │ │ (semaphores, │ │ (uid/gid    │                 │
│  │              │ │  msg queues) │ │  mapping)   │                 │
│  └──────────────┘ └──────────────┘ └──────────────┘                 │
│  ┌──────────────┐ ┌──────────────┐                                  │
│  │    cgroups   │ │   seccomp    │  + AppArmor/SELinux + ...       │
│  │  (resource   │ │  (syscall    │                                  │
│  │   limits)    │ │   filter)    │                                  │
│  └──────────────┘ └──────────────┘                                  │
│                                                                      │
│  Result: ~500+ lines of C code to create a container                │
└─────────────────────────────────────────────────────────────────────┘
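
To make “assembled from parts” concrete, here is a minimal, hedged sketch (Linux-specific, needs root or a user namespace, and nowhere near a real container) that combines just two of those primitives, new UTS and PID namespaces, with a single clone() call:

// Minimal sketch, NOT a real container: put a child in new UTS and PID
// namespaces and run a shell there. A real runtime also sets up mount,
// network, and user namespaces, cgroups, a root filesystem, seccomp, etc.
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

static char child_stack[1024 * 1024];

static int child_main(void *arg)
{
    (void)arg;
    sethostname("demo", strlen("demo"));                      // visible only in the new UTS namespace
    printf("child sees itself as pid %d\n", (int)getpid());   // PID 1 in the new PID namespace
    execlp("/bin/sh", "sh", (char *)NULL);
    perror("execlp");
    return 1;
}

int main(void)
{
    // Each CLONE_NEW* flag adds one more isolation primitive.
    pid_t pid = clone(child_main, child_stack + sizeof(child_stack),
                      CLONE_NEWUTS | CLONE_NEWPID | SIGCHLD, NULL);
    if (pid == -1) {
        perror("clone (root or a user namespace is required)");
        return 1;
    }
    waitpid(pid, NULL, 0);
    return 0;
}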

FreeBSD: First-Class Jail

┌─────────────────────────────────────────────────────────────────────┐
│                      FreeBSD Jail                                    │
│               (First-class kernel concept)                           │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│   Single system call:                                                │
│   ┌─────────────────────────────────────────────────────────────┐   │
│   │  jail(2)                                                     │   │
│   │                                                              │   │
│   │  struct jail j = {                                           │   │
│   │      .version = JAIL_API_VERSION,                            │   │
│   │      .path = "/jails/myjail",                                │   │
│   │      .hostname = "myjail",                                   │   │
│   │      .jailname = "myjail",                                   │   │
│   │      .ip4s = 1,                                              │   │
│   │      .ip4 = &jail_ip,                                        │   │
│   │  };                                                          │   │
│   │  jail(&j);  // That's it. You're in a jail.                  │   │
│   └─────────────────────────────────────────────────────────────┘   │
│                                                                      │
│  + VNET for network virtualization                                  │
│  + rctl for resource limits                                         │
│  + ZFS clones for instant filesystem snapshots                      │
│                                                                      │
│  Result: ~100 lines of C code for equivalent isolation              │
└─────────────────────────────────────────────────────────────────────┘
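
For contrast, a minimal standalone sketch of that single call is below (FreeBSD-specific; it assumes /jails/myjail already holds a populated root filesystem, that you run it as root, and it skips the VNET/rctl/ZFS extras listed above):

// Minimal sketch of jail(2): one call creates the jail and confines us to it.
#include <sys/param.h>
#include <sys/jail.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <err.h>
#include <unistd.h>

int main(void)
{
    struct in_addr ip;
    inet_pton(AF_INET, "192.0.2.10", &ip);     // example address for the jail

    struct jail j = {
        .version  = JAIL_API_VERSION,
        .path     = "/jails/myjail",            // becomes the jail's root
        .hostname = "myjail",
        .jailname = "myjail",
        .ip4s     = 1,
        .ip4      = &ip,
    };

    if (jail(&j) == -1)                         // create + attach in one syscall
        err(1, "jail");

    // From here on, this process (and its children) live inside the jail.
    execl("/bin/sh", "sh", (char *)NULL);
    err(1, "execl");
}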

illumos: Zones with SMF Integration

┌─────────────────────────────────────────────────────────────────────┐
│                     illumos Zone                                     │
│              (Enterprise-grade isolation)                            │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  zone_create() / zonecfg + zoneadm                                  │
│  ┌─────────────────────────────────────────────────────────────┐   │
│  │                                                              │   │
│  │  - Full process isolation                                    │   │
│  │  - Delegated ZFS datasets                                    │   │
│  │  - Resource pools                                            │   │
│  │  - Network virtualization (crossbow)                         │   │
│  │  - SMF (Service Management Facility) integration             │   │
│  │  - DTrace visibility across zones                            │   │
│  │  - LX branded zones (run Linux binaries!)                    │   │
│  │                                                              │   │
│  └─────────────────────────────────────────────────────────────┘   │
│                                                                      │
│  Zones predated Docker by nearly a decade (2004)                    │
└─────────────────────────────────────────────────────────────────────┘

Event-Driven I/O: kqueue vs epoll

Both solve the C10K problem (handling 10,000+ concurrent connections), but with different degrees of elegance:

┌─────────────────────────────────────────────────────────────────────┐
│                    kqueue (BSD)                                      │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  // One call to register AND wait                                   │
│  struct kevent changes[100];    // What we want to monitor          │
│  struct kevent events[100];     // What happened                    │
│                                                                      │
│  // Register 100 file descriptors in ONE system call                │
│  kevent(kq, changes, 100, events, 100, NULL);                       │
│                                                                      │
│  Benefits:                                                           │
│  ✓ Batch updates (register many fds in one syscall)                 │
│  ✓ Generic (handles files, sockets, signals, processes, timers)    │
│  ✓ Cleaner API design                                               │
│                                                                      │
│  Filter types: EVFILT_READ, EVFILT_WRITE, EVFILT_VNODE,            │
│               EVFILT_PROC, EVFILT_SIGNAL, EVFILT_TIMER             │
└─────────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────┐
│                    epoll (Linux)                                     │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  // Separate calls for each modification                             │
│  for (int i = 0; i < 100; i++) {                                    │
│      epoll_ctl(epfd, EPOLL_CTL_ADD, fds[i], &event);  // 100 calls!│
│  }                                                                   │
│  epoll_wait(epfd, events, 100, -1);                                 │
│                                                                      │
│  Limitations:                                                        │
│  ✗ One syscall per modification                                     │
│  ✗ Socket-focused (need eventfd/signalfd/timerfd for other types)  │
│  ✗ More system calls under high churn                               │
│                                                                      │
│  But: Still O(1) and very fast in practice                          │
└─────────────────────────────────────────────────────────────────────┘

Core Concept Analysis

To truly understand BSD vs Linux (and other Unix-likes), you need to grasp these fundamental architectural differences:

  Design Philosophy
    Linux:   Modular kernel + GNU userland (pieces from everywhere)
    FreeBSD: Integrated “complete OS” (kernel + userland as one)
    OpenBSD: Security-first, minimal attack surface
    illumos: Enterprise features (DTrace, ZFS native)

  Isolation
    Linux:   Namespaces + cgroups (building blocks)
    FreeBSD: Jails (first-class kernel concept)
    OpenBSD: chroot + pledge/unveil
    illumos: Zones (first-class containers)

  Event I/O
    Linux:   epoll
    FreeBSD: kqueue
    OpenBSD: kqueue
    illumos: Event ports

  Packet Filter
    Linux:   nftables/iptables
    FreeBSD: pf (ported from OpenBSD)
    OpenBSD: pf (native)
    illumos: IPFilter

  Security Model
    Linux:   seccomp-bpf, SELinux, AppArmor
    FreeBSD: Capsicum, MAC Framework
    OpenBSD: pledge(2), unveil(2)
    illumos: Privileges, zones

  Tracing
    Linux:   eBPF, perf
    FreeBSD: DTrace (ported)
    OpenBSD: ktrace
    illumos: DTrace (native)

  Init System
    Linux:   systemd (mostly)
    FreeBSD: rc scripts
    OpenBSD: rc scripts
    illumos: SMF

Key insight: Linux is a kernel with userland assembled from various sources. BSDs are complete, integrated operating systems. This fundamental difference shapes everything else.


Prerequisites & Background Knowledge

Essential Prerequisites (Must Have)

Before diving into these projects, you should have:

  1. C Programming Proficiency
    • Comfortable writing multi-file C programs
    • Understanding of pointers, structs, and memory management
    • Experience with manual memory allocation (malloc/free)
    • Familiarity with header files and compilation
  2. Basic Unix/Linux Experience
    • Comfortable with command-line navigation
    • Basic shell scripting (bash/sh)
    • Understanding of file permissions and ownership
    • Experience using text editors (vi/vim, emacs, or nano)
  3. Systems Concepts
    • What processes are and how they execute
    • Understanding of file descriptors
    • Basic knowledge of system calls vs library functions
    • Concept of user space vs kernel space
  4. Networking Fundamentals
    • TCP/IP basics (what IP addresses, ports, and sockets are)
    • Understanding of client-server architecture
    • Basic familiarity with HTTP or other protocols

Helpful But Not Required

You’ll learn these during the projects, but having them beforehand accelerates understanding:

  • Advanced C: Function pointers, unions, bit manipulation
  • Operating Systems Theory: From a CS course or textbook
  • Assembly Language: Basic understanding helps with low-level debugging
  • Git: For version control across multiple OS environments
  • Virtualization: Experience with VirtualBox, VMware, or cloud VMs

Self-Assessment Questions

Before starting, ask yourself:

  • Can I write a C program that reads from a file and writes to another file?
  • Do I understand what fork() does at a conceptual level?
  • Can I explain the difference between a file descriptor and a FILE pointer?
  • Have I compiled and run programs on at least one Unix-like system?
  • Can I use man pages to look up function documentation?
  • Do I know what a socket is (even if I’ve never programmed one)?
  • Can I debug a segmentation fault using print statements or a debugger?

If you answered “no” to more than 2 questions, consider starting with:

  • “The C Programming Language” by Kernighan & Ritchie (K&R) - Chapters 1-6
  • “The Linux Programming Interface” by Kerrisk - Chapters 1-5
  • “Operating Systems: Three Easy Pieces” by Arpaci-Dusseau - Introduction through Chapter 13

Development Environment Setup

Required Software

You’ll need access to multiple operating systems. Options:

Option 1: Virtual Machines (Recommended for Beginners)

# Install VirtualBox or VMware
# Download ISOs for:
- FreeBSD 14.x (freebsd.org/where)
- OpenBSD 7.5+ (openbsd.org/ftp.html)
- Ubuntu/Debian Linux (ubuntu.com or debian.org)
- OmniOS or SmartOS (omnios.org or smartos.org) [for illumos]

# Allocate at least:
- 2 GB RAM per VM
- 20 GB disk per VM
- 2 CPU cores per VM

Option 2: Cloud VMs (For Faster Setup)

# Providers with BSD support:
- DigitalOcean: FreeBSD droplets available
- Vultr: FreeBSD, OpenBSD available
- AWS/GCP/Azure: Various Linux, some BSD support

# Cost: ~$5-10/month per VM (can destroy when not in use)

Option 3: Bare Metal (Advanced)

  • Dual/triple boot on physical hardware
  • Most educational, but highest setup complexity

Install on each OS:

# Development tools
- gcc or clang (compiler)
- make (build system)
- git (version control)
- gdb or lldb (debugger)
- valgrind (memory checker, where available)

# FreeBSD
pkg install git gcc make gdb

# OpenBSD
pkg_add git gcc gmake gdb

# Linux (Ubuntu/Debian)
apt install build-essential git gdb valgrind

# OmniOS (illumos)
pkg install git gcc

SSH Setup for Easy Access

# On your host machine, set up SSH keys
ssh-keygen -t ed25519

# Copy to each VM
ssh-copy-id user@freebsd-vm
ssh-copy-id user@openbsd-vm
ssh-copy-id user@linux-vm

# Configure ~/.ssh/config for easy access
Host freebsd
    HostName 192.168.1.10
    User youruser

Host openbsd
    HostName 192.168.1.11
    User youruser

Time Investment

Realistic time estimates per project:

  Project                         Setup Time   Coding Time   Debugging Time   Total
  Project 1 (Sandboxed Service)   4-6 hours    12-20 hours   8-12 hours       2-3 weeks
  Project 2 (Event Server)        2-4 hours    8-12 hours    4-8 hours        1-2 weeks
  Project 3 (Container/Jail)      6-10 hours   20-30 hours   10-20 hours      3-4 weeks
  Project 4 (Firewall Tool)       3-5 hours    10-15 hours   5-10 hours       2 weeks
  Project 5 (Tracer)              4-8 hours    15-25 hours   10-15 hours      2-3 weeks

Total for all projects: 10-14 weeks (3-4 months) working part-time (10-15 hours/week)

Important Reality Check

These projects are challenging. Expect to:

  • Read documentation extensively - Man pages, wikis, source code
  • Hit walls - Code that compiles on Linux but not FreeBSD
  • Debug obscure errors - “Segmentation fault” with no obvious cause
  • Restart VMs frequently - Kernel panics happen when you’re learning
  • Google error messages - And find forum posts from 2008
  • Question your life choices - “Why is this so hard?” (It’s worth it!)

The payoff:

  • Deep understanding of OS internals
  • Ability to write truly portable systems code
  • Skills that 99% of developers don’t have
  • Confidence debugging any Unix-like system
  • Real answers to “How does Docker actually work?”

What Success Looks Like

After completing these projects, you will:

  ✓ Understand why Netflix chose FreeBSD over Linux for their CDN
  ✓ Be able to explain the architectural differences between containers and jails
  ✓ Write security-hardened code using pledge, Capsicum, or seccomp
  ✓ Debug network performance issues using DTrace or eBPF
  ✓ Contribute to cross-platform open-source projects
  ✓ Architect systems that take advantage of each OS’s strengths

Support Resources

When you get stuck:

  • FreeBSD: forums.freebsd.org, #freebsd on Libera.Chat IRC
  • OpenBSD: misc@openbsd.org mailing list (read archives first!)
  • Linux: Stack Overflow, /r/linuxquestions
  • illumos: illumos.org/docs, #illumos on Libera.Chat
  • General Unix: comp.unix.programmer (Usenet, via Google Groups)

Concept Summary Table

  • Design Philosophy: Linux = bazaar (components from everywhere); BSD = cathedral (integrated system). This shapes everything.
  • Security Models: OpenBSD pledge/unveil = promise-based; FreeBSD Capsicum = capability-based; Linux seccomp = filter-based. Trade-offs between simplicity and flexibility.
  • Isolation Architecture: Jails/Zones are first-class kernel concepts; Linux containers are assembled from namespaces+cgroups. Complexity vs elegance.
  • Event I/O: kqueue is more elegant (batch ops, generic); epoll is socket-focused. Both solve C10K.
  • The Unix Heritage: BSD descends from original Unix; Linux is a reimplementation. This explains API differences.
  • Observability: DTrace (Solaris/illumos native, ported to BSD) vs eBPF (Linux). Both let you instrument running kernels.
  • Networking: BSD’s TCP/IP stack is the reference implementation. pf originated on OpenBSD.

Deep Dive Reading by Concept

Unix History & Design Philosophy

  • Unix origins and philosophy: The UNIX Programming Environment by Kernighan & Pike — Ch. 1: “UNIX for Beginners”
  • BSD history and development: The Design and Implementation of the FreeBSD Operating System by McKusick et al. — Ch. 1
  • Linux kernel architecture: Understanding the Linux Kernel, 3rd Edition by Bovet & Cesati — Ch. 1-2
  • System calls deep dive: Advanced Programming in the UNIX Environment, 3rd Edition by Stevens & Rago — Ch. 1-3

Security Models

  • OpenBSD security philosophy: Absolute OpenBSD by Michael W. Lucas — Ch. 1 & security chapters
  • FreeBSD Capsicum: Absolute FreeBSD, 3rd Edition by Michael W. Lucas — Ch. 8
  • Linux security mechanisms: The Linux Programming Interface by Michael Kerrisk — Ch. 23 (Timers & Seccomp)
  • General Unix security: Mastering FreeBSD and OpenBSD Security by Hope, Potter & Korff — Full book

Isolation & Containers

  • Linux namespaces: The Linux Programming Interface by Michael Kerrisk — Ch. 28-29 (Process Creation) + online resources
  • FreeBSD jails: Absolute FreeBSD, 3rd Edition by Michael W. Lucas — Ch. 12: “Jails”
  • Linux cgroups: How Linux Works, 3rd Edition by Brian Ward — Ch. 8
  • General process isolation: Operating Systems: Three Easy Pieces by Arpaci-Dusseau — Part II: “Virtualization”

Networking & I/O

  • Event-driven I/O: The Linux Programming Interface by Michael Kerrisk — Ch. 63: “Alternative I/O Models”
  • TCP/IP fundamentals: TCP/IP Illustrated, Volume 1 by W. Richard Stevens — Full book (BSD reference impl)
  • Socket programming: UNIX Network Programming, Volume 1 by Stevens, Fenner & Rudoff — Ch. 1-6
  • FreeBSD networking: The Design and Implementation of the FreeBSD Operating System by McKusick et al. — Ch. 12

System Tracing & Debugging

  • DTrace fundamentals: DTrace: Dynamic Tracing in Oracle Solaris, Mac OS X, and FreeBSD by Brendan Gregg — Full book
  • eBPF/BPF on Linux: BPF Performance Tools by Brendan Gregg — Ch. 1-5
  • General debugging: The Art of Debugging with GDB, DDD, and Eclipse by Matloff & Salzman — Ch. 1-3

The Unix Family Tree (Context)

Understanding the genealogy helps:

Original Unix (Bell Labs, 1970s)
├── BSD (Berkeley, 1977)
│   ├── FreeBSD (1993) → focus: performance, features, ZFS
│   ├── OpenBSD (1995) → focus: security, correctness, simplicity
│   ├── NetBSD (1993) → focus: portability
│   └── Darwin/macOS (2000) → Mach microkernel + BSD userland
│
├── System V (AT&T)
│   └── Solaris (Sun, 1992)
│       └── illumos (2010) → OpenSolaris fork, DTrace/ZFS native
│
└── Linux (1991) → NOT Unix lineage, but Unix-like
    └── GNU userland + Linux kernel

Linux is the “odd one out”—it’s a reimplementation of Unix ideas, not a descendant. This explains why it often does things differently.


Quick Start: First 48 Hours

Feeling overwhelmed? Start here:

Day 1: Setup & Exploration (4-6 hours)

Morning: Get Your VMs Running

# Install VirtualBox
# Download and install:
1. Ubuntu 22.04 LTS (easiest to start)
2. FreeBSD 14.x
3. OpenBSD 7.5

# Boot each one, create a user, enable SSH
# Verify you can ssh into each VM from your host

Afternoon: Your First Cross-Platform Program

Write a simple “Hello World” that shows OS differences:

// hello_unix.c
#include <stdio.h>
#include <sys/utsname.h>

int main() {
    struct utsname info;
    uname(&info);

    printf("Hello from %s!\n", info.sysname);
    printf("Version: %s\n", info.release);
    printf("Architecture: %s\n", info.machine);

    #ifdef __linux__
    printf("I'm Linux - assembled from parts\n");
    #elif defined(__FreeBSD__)
    printf("I'm FreeBSD - integrated & performant\n");
    #elif defined(__OpenBSD__)
    printf("I'm OpenBSD - secure by default\n");
    #endif

    return 0;
}

Compile and run on all three systems:

# On each OS:
cc hello_unix.c -o hello_unix
./hello_unix

# Notice: Same source, different output!

What you learned: Compile-time OS detection, uname struct, basic differences

Day 2: Experience a Real Difference (4-6 hours)

Morning: Try kqueue vs epoll

Pick one simple program that shows event I/O differences:

// Simple event watcher - see how syntax differs
// On BSD: kqueue
// On Linux: epoll
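
If you want something concrete to type in, here is a minimal sketch (one source file that should build on both families) which simply waits for stdin to become readable, using kqueue on the BSDs/macOS and epoll on Linux:

// Minimal sketch: wait until stdin is readable, then exit.
// Same goal everywhere; compare how much each API does per syscall.
#include <stdio.h>
#include <unistd.h>

#if defined(__linux__)
#include <sys/epoll.h>
#else
#include <sys/types.h>
#include <sys/event.h>
#include <sys/time.h>
#endif

int main(void)
{
#if defined(__linux__)
    int epfd = epoll_create1(0);
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = STDIN_FILENO };
    epoll_ctl(epfd, EPOLL_CTL_ADD, STDIN_FILENO, &ev);  // register: one syscall per fd
    struct epoll_event out;
    epoll_wait(epfd, &out, 1, -1);                      // wait: a separate syscall
#else
    int kq = kqueue();
    struct kevent ev, out;
    EV_SET(&ev, STDIN_FILENO, EVFILT_READ, EV_ADD, 0, 0, NULL);
    kevent(kq, &ev, 1, &out, 1, NULL);                  // register AND wait in one call
#endif
    printf("stdin is readable\n");
    return 0;
}

Run it, type a line, and it exits; the interesting part is how many syscalls each side needed to get there.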

Read the man pages:

  • Linux: man epoll
  • FreeBSD/OpenBSD: man kqueue

Notice the API design philosophy differences.

Afternoon: Explore Security Models

On OpenBSD:

# Read about pledge
man pledge
# See real programs using it
grep -r "pledge" /usr/src/usr.bin/

On Linux:

# Read about seccomp
man seccomp
# See Docker's use of it
cat /proc/1/status | grep Seccomp

What you learned: Different APIs for the same goal, philosophy emerges from usage

Weekend Goal

By the end of 48 hours, you should:

  • Have 3 VMs running and accessible via SSH
  • Compiled and run your first program on all 3
  • Read man pages for kqueue/epoll
  • Read man pages for pledge/seccomp
  • Noticed that BSD man pages are often clearer (it’s not just you!)

Next step: Dive into Project 2 (Event Server) - it builds on what you just learned.


Different backgrounds need different approaches. Pick the path that matches you:

Path 1: “I’m a Linux Developer Who Wants to Understand BSD”

Your advantage: You already know Linux syscalls and patterns
Your challenge: Unlearning “the Linux way is the only way”

Recommended sequence:

  1. Start with Project 2 (Event Server) - kqueue vs epoll is the clearest contrast
  2. Then Project 4 (Firewall Tool) - pf syntax will make you jealous of BSD
  3. Then Project 1 (Sandboxed Service) - you’ll appreciate pledge’s simplicity
  4. Then Project 3 (Container/Jail) - “wait, jail is just ONE syscall?”
  5. Finally Project 5 (Tracer) - port your eBPF knowledge to DTrace

Key insight to internalize: “Simple” and “powerful” aren’t opposites

Path 2: “I’m a BSD User Curious About Linux”

Your advantage: You understand integrated systems and clean APIs
Your challenge: Linux’s “Lego block” approach will seem chaotic

Recommended sequence:

  1. Start with Project 2 (Event Server) - see epoll’s socket focus
  2. Then Project 3 (Container/Jail) - this will shock you (7 namespace types!)
  3. Then Project 1 (Sandboxed Service) - seccomp-bpf is… complex
  4. Then Project 5 (Tracer) - eBPF’s power (and complexity)
  5. Finally Project 4 (Firewall Tool) - nftables vs pf

Key insight to internalize: “Flexible” comes at the cost of complexity

Path 3: “I’m New to Systems Programming”

Your advantage: No preconceptions or bad habits
Your challenge: Everything is new

Recommended sequence:

  1. Read first:
    • “The Linux Programming Interface” Ch. 1-5 (basics)
    • “Absolute FreeBSD” Ch. 1-3 (BSD intro)
  2. Start with the Quick Start above (get comfortable)
  3. Then Project 2 (Event Server) - simplest concepts, clear outcome
  4. Then Project 4 (Firewall Tool) - builds on networking knowledge
  5. Then Project 1 (Sandboxed Service) - introduces security concepts
  6. Then Project 3 (Container/Jail) - you’re ready for complexity now
  7. Finally Project 5 (Tracer) - advanced, but you’ll have the foundation

Key insight to internalize: Unix variants are a family, not enemies

Path 4: “I Want to Understand Containers Deeply”

Your focus: Container/isolation technologies
Your goal: Truly understand Docker/Kubernetes internals

Recommended sequence:

  1. Start with Project 3 (Container/Jail/Zone) - go straight to the core
  2. Then Project 1 (Sandboxed Service) - understand security primitives
  3. Then Project 2 (Event Server) - containers need networking
  4. Then Project 4 (Firewall Tool) - network isolation details
  5. Finally Project 5 (Tracer) - observe containers in production

Supplemental reading:

Key insight to internalize: Containers are “assembled” on Linux, “first-class” on BSD/Solaris

Path 5: “I’m Preparing for Interviews at Systems Companies”

Your goal: Be able to answer “How does X actually work?”
Your timeline: 2-3 months

Recommended sequence:

  1. Week 1-2: Project 2 (Event Server) - common interview topic
  2. Week 3-5: Project 3 (Container/Jail) - “Explain how Docker works”
  3. Week 6-7: Project 1 (Sandboxed Service) - security awareness
  4. Week 8-10: Project 5 (Tracer) - shows depth of knowledge
  5. Week 11-12: Project 4 (Firewall) - networking fundamentals

Interview prep focus:

  • Keep a “learning journal” for each project
  • Write blog posts explaining what you learned
  • Be ready to answer: “What surprised you most?”
  • Practice: “Design a container system from scratch”

Key insight to internalize: Deep understanding beats breadth of frameworks

Path 6: “I Want to Contribute to Open Source OS Projects”

Your goal: Understand codebases well enough to contribute
Your approach: Build confidence through projects first

Recommended sequence:

  1. Projects 1-3 in any order (build confidence)
  2. Pick an OS to focus on (FreeBSD, OpenBSD, or Linux kernel)
  3. Read that OS’s source code for the features you implemented
  4. Compare your implementation to the real one
  5. Find “good first issue” bugs in that OS
  6. Projects 4-5 to deepen knowledge
  7. Submit your first patch

Resources:

Key insight to internalize: OS developers are humans who appreciate good contributions


Project 1: “Cross-Platform Sandboxed Service”

  Language: C
  Difficulty: Advanced
  Time: 1-2 weeks
  Coolness: ★★★★☆ Hardcore
  Portfolio Value: Portfolio Piece

What you’ll build: A file-watching daemon that monitors directories for changes and logs events—implemented with native sandboxing on each OS (pledge/unveil on OpenBSD, Capsicum on FreeBSD, seccomp on Linux).

Why it teaches Unix differences: You can’t abstract away the security models—you must understand each one’s philosophy. OpenBSD’s “promise what you’ll do, reveal what you’ll see” model (pledge/unveil) is fundamentally different from Linux’s “filter syscalls at BPF level” (seccomp) or FreeBSD’s capability-based approach (Capsicum).

Core challenges you’ll face:

  • Challenge 1: Understanding pledge promises (“stdio rpath wpath”) vs seccomp BPF filters (maps to security model philosophy)
  • Challenge 2: Using unveil() vs Capsicum cap_rights_limit() for filesystem restriction (maps to capability models)
  • Challenge 3: Building without libc abstractions that hide OS differences (maps to syscall interface understanding)
  • Challenge 4: Handling graceful degradation when security features aren’t available

Key Concepts:

Prerequisites: C programming, basic Unix syscalls (open, read, write), comfort with man pages

Real world outcome:

  • A daemon that prints “CREATED: /path/to/file” or “MODIFIED: /path/to/file” to stdout/syslog
  • Running with minimal privileges on each OS—demonstrable by attempting forbidden operations and seeing them blocked
  • A single codebase with #ifdef __OpenBSD__, #ifdef __FreeBSD__, #ifdef __linux__ blocks showing the architectural differences

Learning milestones:

  1. Get basic file watching working on one OS → understand inotify (Linux) vs kqueue EVFILT_VNODE (BSD)
  2. Add sandboxing on OpenBSD with pledge/unveil → understand “promise-based” security
  3. Port sandboxing to FreeBSD Capsicum → understand capability-based security
  4. Port to Linux seccomp-bpf → understand filter-based security and why it’s “harder”
  5. Compare: which was easiest? Which is most secure? Why?

Real World Outcome

When you complete this project, you’ll have a security-hardened file-watching daemon that demonstrates the fundamental differences between Unix security models.

What you’ll see running on OpenBSD:

$ ./filewatcher /var/log

[filewatcher] Starting with pledge("stdio rpath wpath cpath") and unveil("/var/log", "rw")
[filewatcher] Security sandbox active. Attempting forbidden operation...
[filewatcher] BLOCKED: Cannot access /etc/passwd (unveil restriction)
[filewatcher] Monitoring /var/log for changes...

[2024-12-22 14:32:01] CREATED: /var/log/messages.1
[2024-12-22 14:32:05] MODIFIED: /var/log/auth.log
[2024-12-22 14:32:10] DELETED: /var/log/old.log

# If you try to violate pledge:
$ ./filewatcher_bad /var/log
[filewatcher] Starting...
[filewatcher] Attempting network connection (not pledged)...
Abort trap (core dumped)  # SIGABRT - pledge violation!

What you’ll see running on FreeBSD with Capsicum:

$ ./filewatcher /var/log

[filewatcher] Entering capability mode...
[filewatcher] File descriptor rights limited: CAP_READ, CAP_EVENT
[filewatcher] Capability mode active. Global namespace access revoked.
[filewatcher] Monitoring /var/log for changes...

[2024-12-22 14:32:01] CREATED: /var/log/messages.1

# Attempting to open new file after cap_enter():
[filewatcher] ERROR: open("/etc/passwd") failed: Not permitted in capability mode

What you’ll see running on Linux with seccomp:

$ ./filewatcher /var/log

[filewatcher] Installing seccomp-bpf filter...
[filewatcher] Allowed syscalls: read, write, inotify_add_watch, inotify_rm_watch, exit_group
[filewatcher] Filter installed. Monitoring...

[2024-12-22 14:32:01] CREATED: /var/log/messages.1

# Attempting forbidden syscall:
$ ./filewatcher_bad /var/log
[filewatcher] Attempting socket() syscall (not allowed)...
Bad system call (core dumped)  # SIGSYS - seccomp violation!

Your codebase will look like:

// Conditional compilation showing the architectural differences
#ifdef __OpenBSD__
    // ~15 lines: unveil() the paths, then pledge()
    unveil(watch_path, "rw");
    unveil(NULL, NULL);
    pledge("stdio rpath wpath", NULL);
#elif defined(__FreeBSD__)
    // ~30 lines: Capsicum capability mode
    cap_rights_t rights;
    cap_rights_init(&rights, CAP_READ, CAP_EVENT, CAP_FCNTL);
    cap_rights_limit(dir_fd, &rights);
    cap_enter();
#elif defined(__linux__)
    // ~100+ lines: seccomp-bpf filter program
    struct sock_filter filter[] = { /* BPF program */ };
    // ... complex filter setup
#endif

The Core Question You’re Answering

“Why do different Unix systems take such radically different approaches to application sandboxing, and what are the real-world trade-offs?”

This project forces you to confront a fundamental truth: security is a design philosophy, not just a feature list. OpenBSD’s pledge/unveil says “tell us what you need, we’ll kill you if you lie.” FreeBSD’s Capsicum says “capabilities are tokens on file descriptors.” Linux’s seccomp says “here’s a programmable filter—go wild.”

By implementing the same functionality on all three, you’ll viscerally understand why OpenBSD can sandbox their entire base system while Linux applications rarely use seccomp directly.


Concepts You Must Understand First

Stop and research these before coding:

  1. System Calls as the Security Boundary
    • What is a system call? How does it differ from a library function?
    • Why is the syscall interface the natural place to enforce security?
    • How does the kernel know which process is making the call?
    • Book Reference: Advanced Programming in the UNIX Environment, 3rd Edition by Stevens & Rago — Ch. 1-3
  2. The Principle of Least Privilege
    • What does it mean for a program to have “minimal privileges”?
    • Why should a file watcher not have network access?
    • What’s the difference between DAC (discretionary) and MAC (mandatory) access control?
    • Book Reference: Mastering FreeBSD and OpenBSD Security by Hope, Potter & Korff — Ch. 1-2
  3. OpenBSD pledge/unveil Model
  4. FreeBSD Capsicum Model
    • What is “capability mode” and why can’t you leave it?
    • How do cap_rights_t work? What’s CAP_READ vs CAP_WRITE?
    • What’s the difference between cap_rights_limit() and cap_enter()?
    • Book Reference: Absolute FreeBSD, 3rd Edition by Michael W. Lucas — Ch. 8
  5. Linux seccomp-bpf Model
    • What is BPF (Berkeley Packet Filter)? Why is it used for syscall filtering?
    • How do you write a BPF filter program?
    • What’s the difference between SECCOMP_RET_KILL, SECCOMP_RET_ERRNO, SECCOMP_RET_ALLOW?
    • Book Reference: The Linux Programming Interface by Michael Kerrisk — Ch. 23
  6. File System Event Notification
    • Linux: How does inotify work? What events can you watch?
    • BSD: How does kqueue EVFILT_VNODE work? What’s the kevent structure?
    • Why are these fundamentally different APIs?
    • Book Reference: The Linux Programming Interface by Michael Kerrisk — Ch. 19

Questions to Guide Your Design

Before implementing, think through these:

  1. What exactly needs sandboxing?
    • What system calls does a file watcher need? (open, read, stat, inotify_add_watch/kevent, write to log)
    • What system calls should be BLOCKED? (socket, execve, fork, ptrace, mount)
    • How do you enumerate the minimal set?
  2. How do you test the sandbox?
    • How can you verify that forbidden operations are actually blocked?
    • What happens when a sandboxed program tries a forbidden syscall?
    • How do you distinguish “sandbox blocked it” from “other error”?
  3. How do you handle initialization vs runtime?
    • Most programs need more privileges during startup (opening config files, binding ports)
    • How do pledge/Capsicum/seccomp handle the “initialize, then restrict” pattern?
    • When exactly should you “lock down”?
  4. What about error handling?
    • If pledge() fails, should you continue without sandboxing or exit?
    • How do you write code that gracefully degrades on systems without these features?
    • How do you log sandbox violations for debugging?
  5. Cross-platform abstraction?
    • Should you create a common API that hides the OS differences?
    • Or should you embrace the differences with #ifdef?
    • What are the trade-offs of each approach? (One possible shape for the common-API option is sketched right after this list.)
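
For question 5, one possible shape of the “common API” answer is sketched below. The function name sandbox_lockdown() is invented for illustration (OpenSSH’s privilege-separation sandbox uses a similar per-OS backend layout), and the Linux branch is left as a stub:

// Hypothetical wrapper (name invented for this sketch): hide the per-OS
// lockdown behind one call and degrade gracefully where no sandbox exists.
#include <unistd.h>
#if defined(__FreeBSD__)
#include <sys/capsicum.h>
#endif

static int sandbox_lockdown(const char *watch_path)
{
#if defined(__OpenBSD__)
    if (unveil(watch_path, "rw") == -1 || unveil(NULL, NULL) == -1)
        return -1;
    return pledge("stdio rpath wpath", NULL);   // promises, then done
#elif defined(__FreeBSD__)
    (void)watch_path;       // paths must already be open as descriptors by now
    return cap_enter();     // flip into capability mode
#elif defined(__linux__)
    (void)watch_path;       // install the seccomp filter here (omitted)
    return 0;
#else
    (void)watch_path;
    return 0;               // no sandbox available: run unsandboxed (or refuse to)
#endif
}

The trade-off shows up immediately: the wrapper reads nicely, but it hides the fact that Capsicum forces you to restructure startup (pre-open every descriptor) while pledge/unveil do not.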

Thinking Exercise

Before coding, trace this scenario by hand:

Your file watcher needs to:

  1. Open a directory for watching
  2. Read file system events
  3. Write events to a log file
  4. Optionally: send alerts over the network (for a “premium” version)

Map to each security model:

┌─────────────────────────────────────────────────────────────────────┐
│                    OpenBSD pledge/unveil                             │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  Step 1: What promises do we need?                                  │
│          stdio (read/write to fds)                                  │
│          rpath (read files)                                         │
│          wpath (write files)                                        │
│          cpath (create files - for log rotation?)                   │
│          inet (ONLY if network alerts enabled)                      │
│                                                                      │
│  Step 2: What paths do we reveal?                                   │
│          unveil("/var/log", "rw")  - watch and log here             │
│          unveil("/etc/filewatcher.conf", "r") - config file         │
│          unveil(NULL, NULL) - lock it down                          │
│                                                                      │
│  Step 3: What happens if we try socket() without "inet" promise?    │
│          → Process receives SIGABRT, core dump created              │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────┐
│                    FreeBSD Capsicum                                  │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  Step 1: Open all needed file descriptors BEFORE cap_enter()        │
│          int dir_fd = open("/var/log", O_RDONLY|O_DIRECTORY);       │
│          int log_fd = open("/var/log/filewatcher.log", O_WRONLY);   │
│          int kq = kqueue();                                          │
│                                                                      │
│  Step 2: Limit rights on each fd                                     │
│          cap_rights_limit(dir_fd, &(CAP_READ|CAP_EVENT|CAP_LOOKUP));│
│          cap_rights_limit(log_fd, &(CAP_WRITE|CAP_SEEK));           │
│                                                                      │
│  Step 3: Enter capability mode                                       │
│          cap_enter();  // No way back!                               │
│                                                                      │
│  Step 4: What happens if we try open("/etc/passwd")?                │
│          → Returns -1, errno = ECAPMODE                              │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────┐
│                    Linux seccomp-bpf                                 │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  Step 1: Enumerate all syscalls we need (this is the hard part!)    │
│          read, write, close, fstat, mmap, mprotect,                 │
│          inotify_init1, inotify_add_watch, inotify_rm_watch,        │
│          epoll_create1, epoll_ctl, epoll_wait,                      │
│          openat (with restrictions?), exit_group, ...               │
│                                                                      │
│  Step 2: Write BPF filter program                                   │
│          For each syscall: ALLOW if in whitelist, KILL otherwise    │
│          Must handle syscall arguments for openat restrictions!     │
│                                                                      │
│  Step 3: What happens if we try socket()?                           │
│          → Process receives SIGSYS, terminated                       │
│                                                                      │
│  Challenge: How do you restrict openat() to specific paths?         │
│             BPF can't easily inspect string arguments!              │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘

Key questions from this exercise:

  • Why is OpenBSD’s model so much simpler?
  • Why does Capsicum require pre-opening all file descriptors?
  • Why is Linux’s path restriction so much harder? (A partial workaround is sketched below.)
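
On that last question, a common partial answer looks like the sketch below (using libseccomp, assumed installed; link with -lseccomp). A BPF filter can compare only the integer arguments of a syscall, so you can constrain openat()’s flags but not its path string; path-level policy has to come from somewhere else (Landlock, a chroot, or an LSM):

// Sketch: allow openat() only for read-only access. The path argument is a
// pointer into user memory, which a seccomp BPF filter cannot dereference,
// so per-path rules are out of reach at this layer.
#include <errno.h>
#include <fcntl.h>
#include <seccomp.h>
#include <stdio.h>

int main(void)
{
    scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_ERRNO(EPERM));  // deny by default

    // The bare minimum to keep this demo alive.
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(read), 0);
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(write), 0);
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(exit_group), 0);

    // openat(dirfd, path, flags, mode): argument 2 is flags. Allow it only
    // when (flags & O_ACCMODE) == O_RDONLY.
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(openat), 1,
                     SCMP_A2(SCMP_CMP_MASKED_EQ, O_ACCMODE, O_RDONLY));

    if (seccomp_load(ctx) != 0) {
        fprintf(stderr, "seccomp_load failed\n");
        return 1;
    }

    // With glibc routing open() through the openat syscall, open(path, O_RDONLY)
    // now succeeds while open(path, O_WRONLY) fails with EPERM.
    return 0;
}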

The Interview Questions They’ll Ask

Prepare to answer these:

  1. “What’s the difference between pledge, Capsicum, and seccomp?”
    • pledge: Promise-based, operates on “promise” categories, simple strings
    • Capsicum: Capability-based, operates on file descriptors, fine-grained
    • seccomp: Filter-based, operates on syscalls with BPF, most flexible but complex
  2. “Why did OpenBSD choose the pledge model?”
    • Simplicity enables adoption (90%+ of base system is pledge’d)
    • Auditable by humans (you can read “stdio rpath” and understand it)
    • Fail-closed philosophy (violation = death, no recovery)
  3. “What are the limitations of each approach?”
    • pledge: Coarse-grained (can’t say “only read /etc/passwd”)
    • Capsicum: Requires restructuring code to pre-open descriptors
    • seccomp: Hard to restrict syscall arguments (e.g., which paths for open)
  4. “How would you sandbox a web browser?”
    • Chromium uses seccomp-bpf on Linux
    • Capsicum was designed with Chromium in mind (FreeBSD port exists)
    • This is a great real-world comparison point
  5. “What’s the attack surface reduction of each model?”
    • pledge: Reduces syscall surface to promised categories
    • Capsicum: Removes global namespace entirely after cap_enter()
    • seccomp: Reduces to explicit syscall whitelist
  6. “Can you escape these sandboxes?”
    • All have had vulnerabilities (nothing is perfect)
    • Complexity = more bugs (seccomp filters have had escapes)
    • OpenBSD’s simplicity has security benefits

Hints in Layers

Hint 1: Start with file watching (no sandbox)

Get the core functionality working first:

// Linux inotify
int fd = inotify_init1(IN_NONBLOCK);
inotify_add_watch(fd, "/var/log", IN_CREATE | IN_MODIFY | IN_DELETE);
// Read events in a loop

// BSD kqueue
int kq = kqueue();
struct kevent ev;
EV_SET(&ev, dir_fd, EVFILT_VNODE, EV_ADD | EV_ENABLE | EV_CLEAR,
       NOTE_WRITE | NOTE_DELETE | NOTE_RENAME, 0, NULL);
kevent(kq, &ev, 1, NULL, 0, NULL);
// Wait for events with kevent()
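
The two “read events” comments above hide where the APIs differ most, so here is a hedged sketch of each loop (error handling trimmed; note that EVFILT_VNODE only says the directory changed, so real BSD code rescans it to find out which file was affected):

#include <stdio.h>

#ifdef __linux__
#include <sys/inotify.h>
#include <unistd.h>

// Assumes a blocking inotify fd (drop IN_NONBLOCK for this sketch).
void watch_loop_linux(int fd)
{
    char buf[4096] __attribute__((aligned(8)));
    for (;;) {
        ssize_t len = read(fd, buf, sizeof(buf));   // one read returns many packed events
        if (len <= 0)
            break;
        for (char *p = buf; p < buf + len; ) {
            struct inotify_event *ev = (struct inotify_event *)p;
            if (ev->mask & IN_CREATE) printf("CREATED: %s\n", ev->name);
            if (ev->mask & IN_MODIFY) printf("MODIFIED: %s\n", ev->name);
            if (ev->mask & IN_DELETE) printf("DELETED: %s\n", ev->name);
            p += sizeof(struct inotify_event) + ev->len;
        }
    }
}

#else
#include <sys/types.h>
#include <sys/event.h>
#include <sys/time.h>

// kq from kqueue(), with the directory fd already registered via EV_SET.
void watch_loop_bsd(int kq)
{
    struct kevent ev;
    for (;;) {
        if (kevent(kq, NULL, 0, &ev, 1, NULL) < 1)   // block until one event arrives
            break;
        if (ev.fflags & NOTE_WRITE)  printf("directory contents changed\n");
        if (ev.fflags & NOTE_DELETE) printf("watched directory deleted\n");
        if (ev.fflags & NOTE_RENAME) printf("watched directory renamed\n");
    }
}
#endif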

Hint 2: Add OpenBSD pledge first (simplest)

#ifdef __OpenBSD__
#include <err.h>
#include <unistd.h>

// After opening the watch directory but before the main loop
// (keep "unveil" in the promise list so the unveil calls below are still allowed):
if (pledge("stdio rpath wpath unveil", NULL) == -1)
    err(1, "pledge");

// After all setup, lock down paths:
if (unveil(watch_path, "rw") == -1)
    err(1, "unveil");
if (unveil(NULL, NULL) == -1)  // No more unveil calls allowed
    err(1, "unveil");
#endif

Hint 3: FreeBSD Capsicum requires restructuring

#ifdef __FreeBSD__
#include <err.h>
#include <fcntl.h>
#include <sys/capsicum.h>

// Open EVERYTHING you need FIRST
int dir_fd = open(watch_path, O_RDONLY | O_DIRECTORY);
int log_fd = open(log_path, O_WRONLY | O_APPEND | O_CREAT, 0644);
int kq = kqueue();

// Limit capabilities on each
cap_rights_t rights;
cap_rights_init(&rights, CAP_READ, CAP_EVENT, CAP_FCNTL);
cap_rights_limit(dir_fd, &rights);

cap_rights_init(&rights, CAP_WRITE, CAP_SEEK);
cap_rights_limit(log_fd, &rights);

// Enter capability mode - no turning back!
if (cap_enter() == -1)
    err(1, "cap_enter");
#endif

Hint 4: Linux seccomp is the most complex

#ifdef __linux__
#include <sys/prctl.h>
#include <linux/seccomp.h>
#include <linux/filter.h>
#include <linux/audit.h>

// Use libseccomp for sane API:
#include <seccomp.h>

scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_KILL);

// Whitelist needed syscalls
seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(read), 0);
seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(write), 0);
seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(inotify_add_watch), 0);
// ... many more

seccomp_load(ctx);
#endif

Hint 5: Test sandbox violations

void test_sandbox(void) {
    // Try something we shouldn't be able to do.
    // Note: the failure mode differs by OS. Under OpenBSD pledge (without
    // "inet") or a seccomp KILL filter, the kernel terminates the process
    // inside socket(), so the printf below never runs. Under Capsicum,
    // creating a socket is still allowed in capability mode, so test there
    // with open("/etc/passwd", O_RDONLY) instead.
    int sock = socket(AF_INET, SOCK_STREAM, 0);
    if (sock == -1) {
        printf("GOOD: socket() blocked as expected\n");
    } else {
        printf("BAD: socket() succeeded, sandbox not working!\n");
        close(sock);
    }
}

Books That Will Help

  • System calls fundamentals: Advanced Programming in the UNIX Environment, 3rd Edition by Stevens & Rago, Ch. 1-3
  • OpenBSD security philosophy: Absolute OpenBSD by Michael W. Lucas, Ch. 1 + security chapters
  • FreeBSD Capsicum: Absolute FreeBSD, 3rd Edition by Michael W. Lucas, Ch. 8
  • Linux seccomp-bpf: The Linux Programming Interface by Michael Kerrisk, Ch. 23
  • File watching (inotify): The Linux Programming Interface by Michael Kerrisk, Ch. 19
  • BSD kqueue: The Design and Implementation of the FreeBSD Operating System by McKusick et al., Ch. 6
  • Security principles: Mastering FreeBSD and OpenBSD Security by Hope, Potter & Korff, Ch. 1-4
  • BPF internals: BPF Performance Tools by Brendan Gregg, Ch. 2 (BPF basics)

Project 2: “Event-Driven TCP Echo Server (kqueue vs epoll)”

  Language: C
  Difficulty: Advanced
  Time: 1-2 weeks
  Coolness: ★★★★☆ Hardcore
  Portfolio Value: Startup-Ready

What you’ll build: A high-performance TCP echo server handling 10,000+ concurrent connections using native event APIs—kqueue on BSD, epoll on Linux—with no abstraction libraries.

Why it teaches Unix differences: The kqueue vs epoll comparison reveals deep kernel design philosophy differences. kqueue is more general (handles files, signals, processes, timers—not just sockets) and allows batch updates. epoll is socket-focused and requires one syscall per change. Building the same server on both forces you to internalize these differences.

Core challenges you’ll face:

  • Challenge 1: kevent() batch operations vs epoll_ctl() single operations (maps to API design philosophy)
  • Challenge 2: Handling EVFILT_READ, EVFILT_WRITE vs EPOLLIN, EPOLLOUT (maps to event model differences)
  • Challenge 3: Edge-triggered vs level-triggered behavior on both systems
  • Challenge 4: Scaling to C10K connections and measuring performance differences

Key Concepts:

Prerequisites: Socket programming basics, understanding of file descriptors

Real world outcome:

  • A server that accepts connections and echoes back whatever clients send
  • Benchmark output: “Handled 10,000 concurrent connections, 50,000 req/sec on FreeBSD kqueue” vs “45,000 req/sec on Linux epoll”
  • Performance graphs comparing both implementations under load

Learning milestones:

  1. Build blocking echo server → understand why it doesn’t scale (a minimal starting sketch follows after this list)
  2. Convert to epoll on Linux → understand event-driven I/O
  3. Port to kqueue on FreeBSD/OpenBSD → notice the cleaner API
  4. Add benchmarking with wrk or custom client → quantify the differences
  5. Try macOS kqueue → understand Darwin’s BSD heritage
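
For milestone 1, a minimal blocking echo server sketch is below (error handling trimmed). It serves exactly one client at a time, which is precisely the limitation the kqueue/epoll versions remove:

// Minimal blocking echo server: one client at a time.
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int lfd = socket(AF_INET, SOCK_STREAM, 0);

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(8080);

    bind(lfd, (struct sockaddr *)&addr, sizeof(addr));
    listen(lfd, 128);

    for (;;) {
        int cfd = accept(lfd, NULL, NULL);      // blocks: nobody else gets served
        char buf[4096];
        ssize_t n;
        while ((n = read(cfd, buf, sizeof(buf))) > 0)
            write(cfd, buf, (size_t)n);         // echo straight back
        close(cfd);
    }
}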

Real World Outcome

When you complete this project, you’ll have a high-performance TCP echo server that handles thousands of concurrent connections using native OS event APIs.

What you’ll see running on FreeBSD with kqueue:

$ ./echo_server 8080
[echo_server] kqueue() created, fd=3
[echo_server] Listening on port 8080
[echo_server] Registered listener with EVFILT_READ
[echo_server] Entering event loop...

[14:32:01] Client connected from 192.168.1.10:52341 (fd=4)
[14:32:01] Client connected from 192.168.1.11:48923 (fd=5)
[14:32:01] Received 1024 bytes from fd=4, echoing back
[14:32:01] Client connected from 192.168.1.12:39847 (fd=6)
...
[14:32:05] Active connections: 847
[14:32:10] Active connections: 2,341
[14:32:15] Active connections: 5,892
[14:32:20] Active connections: 10,003  # C10K achieved!

# Performance stats:
[echo_server] kevent() calls: 15,234
[echo_server] Events processed: 1,247,892
[echo_server] Avg events per kevent(): 81.9
[echo_server] Throughput: 52,341 req/sec

What you’ll see running on Linux with epoll:

$ ./echo_server 8080
[echo_server] epoll_create1() returned fd=3
[echo_server] Listening on port 8080
[echo_server] Added listener to epoll with EPOLLIN
[echo_server] Entering event loop...

[14:32:01] Client connected from 192.168.1.10:52341 (fd=4)
[14:32:01] epoll_ctl(EPOLL_CTL_ADD, fd=4)  # One syscall per fd!
[14:32:01] Client connected from 192.168.1.11:48923 (fd=5)
[14:32:01] epoll_ctl(EPOLL_CTL_ADD, fd=5)
...
[14:32:20] Active connections: 10,003

# Performance stats:
[echo_server] epoll_wait() calls: 18,456
[echo_server] epoll_ctl() calls: 45,234  # More syscalls than kqueue!
[echo_server] Events processed: 1,198,234
[echo_server] Throughput: 48,721 req/sec

Benchmark comparison output:

$ ./benchmark_comparison.sh

========================================
   kqueue vs epoll Performance Test
========================================

Test: 10,000 concurrent connections, 60 seconds

BSD (FreeBSD 14) - kqueue:
  Requests/sec:     52,341
  Latency avg:      1.2ms
  Latency p99:      4.8ms
  Syscalls:         15,234 kevent()

Linux (Ubuntu 24.04) - epoll:
  Requests/sec:     48,721
  Latency avg:      1.4ms
  Latency p99:      5.2ms
  Syscalls:         63,690 (epoll_wait + epoll_ctl)

Analysis:
  - kqueue batches updates: ONE kevent() call for multiple changes
  - epoll requires one epoll_ctl() per fd modification
  - Under high connection churn, kqueue has fewer syscalls
  - Both handle C10K easily, but kqueue is more elegant

Your codebase comparison:

// BSD kqueue - batch register and wait in ONE call
struct kevent changes[MAX_EVENTS];  // What to change
struct kevent events[MAX_EVENTS];   // What happened
int nchanges = 0;

// Add multiple fds to changes array
EV_SET(&changes[nchanges++], client_fd, EVFILT_READ, EV_ADD, 0, 0, NULL);
EV_SET(&changes[nchanges++], another_fd, EVFILT_READ, EV_ADD, 0, 0, NULL);

// ONE syscall does everything!
int nevents = kevent(kq, changes, nchanges, events, MAX_EVENTS, NULL);

// ─────────────────────────────────────────────────────────────────

// Linux epoll - separate calls for modification and waiting
struct epoll_event ev, events[MAX_EVENTS];

// Each fd requires its own syscall
ev.events = EPOLLIN;
ev.data.fd = client_fd;
epoll_ctl(epfd, EPOLL_CTL_ADD, client_fd, &ev);  // Syscall 1

ev.data.fd = another_fd;
epoll_ctl(epfd, EPOLL_CTL_ADD, another_fd, &ev); // Syscall 2

// Then wait
int nevents = epoll_wait(epfd, events, MAX_EVENTS, -1);  // Syscall 3

The Core Question You’re Answering

“Why is kqueue considered technically superior to epoll, and what does this teach us about API design in operating systems?”

This project reveals a truth about Unix API design: elegance matters for performance. kqueue’s ability to batch operations into a single syscall means fewer context switches under load. But epoll works “well enough” and ships with the dominant server OS.

You’ll understand why Nginx, HAProxy, and other high-performance servers have different code paths for different OSes, and why some developers prefer BSD for networking workloads.


Concepts You Must Understand First

Stop and research these before coding:

  1. The C10K Problem
    • What is the C10K problem and why was it revolutionary in 1999?
    • Why don’t traditional threading models scale to 10K connections?
    • How do event-driven architectures solve this?
    • Reference: Dan Kegel’s C10K Paper
  2. Blocking vs Non-Blocking I/O
    • What happens when you call read() on a blocking socket with no data?
    • How does O_NONBLOCK change socket behavior?
    • What does EAGAIN/EWOULDBLOCK mean?
    • Book Reference: The Linux Programming Interface by Michael Kerrisk — Ch. 63
  3. Level-Triggered vs Edge-Triggered
    • Level-triggered: “notify while condition exists”
    • Edge-triggered: “notify when condition changes”
    • Why does edge-triggered require draining the buffer completely?
    • Which is the default for kqueue? For epoll? (see the sketch after this list)
    • Book Reference: Linux System Programming, 2nd Edition by Robert Love — Ch. 4
  4. File Descriptors and the Kernel
    • What is a file descriptor really? (index into per-process table)
    • How does the kernel track which fds to monitor?
    • Why is select() O(n) while epoll/kqueue are O(1)?
    • Book Reference: Advanced Programming in the UNIX Environment by Stevens & Rago — Ch. 3
  5. kqueue Architecture
    • What is a kevent structure?
    • What are filters? (EVFILT_READ, EVFILT_WRITE, EVFILT_VNODE, EVFILT_TIMER…)
    • Why can kqueue batch changes?
    • Reference: Kernel Queue: Complete Guide
  6. epoll Architecture
    • What do epoll_create, epoll_ctl, epoll_wait do?
    • Why separate calls for modification and waiting?
    • What is EPOLLONESHOT and when would you use it?
    • Book Reference: The Linux Programming Interface by Michael Kerrisk — Ch. 63
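
A minimal sketch of how each API opts in to edge-triggered delivery (both default to level-triggered). It assumes kq, epfd, and client_fd already exist, as in the skeletons later in this project:

/* kqueue: EV_CLEAR makes the filter edge-triggered (the event state resets once delivered) */
struct kevent kev;
EV_SET(&kev, client_fd, EVFILT_READ, EV_ADD | EV_CLEAR, 0, 0, NULL);
kevent(kq, &kev, 1, NULL, 0, NULL);

/* epoll: EPOLLET requests edge-triggered delivery; you must then read until EAGAIN */
struct epoll_event eev = { .events = EPOLLIN | EPOLLET, .data.fd = client_fd };
epoll_ctl(epfd, EPOLL_CTL_ADD, client_fd, &eev);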

Questions to Guide Your Design

Before implementing, think through these:

  1. Server Architecture
    • Will you use a single-threaded event loop or multiple threads with separate event loops?
    • How will you handle the accept() of new connections?
    • Should accept() be level-triggered or edge-triggered?
  2. Event Handling
    • When a read event fires, how much data should you read?
    • What if the client sends more data than your buffer size?
    • How do you handle partial writes when the send buffer is full?
  3. Connection Lifecycle
    • How do you detect client disconnection?
    • When should you remove a fd from the event set?
    • How do you avoid use-after-free when closing connections?
  4. Performance Measurement
    • How will you count syscalls to compare the APIs?
    • How will you generate load for benchmarking?
    • What metrics matter: throughput, latency, syscall count?
  5. Error Handling
    • What happens if kqueue()/epoll_create() fails?
    • How do you handle EINTR during kevent()/epoll_wait()? (see the sketch after this list)
    • What if a client causes an error—crash the server or just close that connection?
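
One of these questions has a standard answer worth sketching now: a signal can interrupt the blocking wait call, so treat EINTR as "try again", not as a fatal error. A minimal sketch, assuming the variables from the skeletons below (needs <errno.h>); the kevent() call is handled the same way:

int n;
do {
    n = epoll_wait(epfd, events, MAX_EVENTS, -1);
    /* on BSD: n = kevent(kq, NULL, 0, events, MAX_EVENTS, NULL); */
} while (n == -1 && errno == EINTR);

if (n == -1) {
    perror("epoll_wait");   /* a real failure: log it and decide whether to exit */
}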

Thinking Exercise

Before coding, trace this scenario by hand:

You have 1000 clients connected. 100 of them send data simultaneously.

┌─────────────────────────────────────────────────────────────────────┐
│                    kqueue Event Processing                          │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  State: 1000 fds registered with EVFILT_READ                        │
│                                                                      │
│  Step 1: 100 clients send data simultaneously                       │
│                                                                      │
│  Step 2: kevent(kq, NULL, 0, events, 1000, NULL)                    │
│          Returns: 100 events (only the ready ones)                  │
│          Syscalls so far: 1                                          │
│                                                                      │
│  Step 3: Process all 100 events, read data, echo back               │
│                                                                      │
│  Step 4: 50 clients disconnect                                       │
│          We need to remove them from kqueue                          │
│          Build array: changes[50] = {EV_DELETE for each fd}         │
│                                                                      │
│  Step 5: kevent(kq, changes, 50, events, 1000, NULL)                │
│          Removes 50 fds AND waits for new events in ONE call        │
│          Syscalls so far: 2                                          │
│                                                                      │
│  Total syscalls for this cycle: 2                                    │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────┐
│                    epoll Event Processing                            │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  State: 1000 fds registered with EPOLLIN                            │
│                                                                      │
│  Step 1: 100 clients send data simultaneously                       │
│                                                                      │
│  Step 2: epoll_wait(epfd, events, 1000, -1)                         │
│          Returns: 100 events                                         │
│          Syscalls so far: 1                                          │
│                                                                      │
│  Step 3: Process all 100 events, read data, echo back               │
│                                                                      │
│  Step 4: 50 clients disconnect                                       │
│          We need to remove them from epoll                           │
│          epoll_ctl(epfd, EPOLL_CTL_DEL, fd1, NULL) // syscall 2     │
│          epoll_ctl(epfd, EPOLL_CTL_DEL, fd2, NULL) // syscall 3     │
│          ... 48 more times ...                                       │
│          Syscalls so far: 51                                         │
│                                                                      │
│  Step 5: epoll_wait() for next batch                                │
│          Syscalls so far: 52                                         │
│                                                                      │
│  Total syscalls for this cycle: 52                                   │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘

Key insight: Under high connection churn (many connects/disconnects), kqueue’s batching advantage becomes significant. Under stable connection pools, the difference is minimal.


The Interview Questions They’ll Ask

Prepare to answer these:

  1. “Why is kqueue technically superior to epoll?”
    • Batch updates: one syscall for multiple changes
    • More generic: handles files, signals, processes, not just sockets
    • Cleaner API: changes and events can be in the same call
  2. “If kqueue is better, why does everyone use Linux?”
    • epoll is “good enough” for most workloads
    • Linux has better hardware support, more developers, larger ecosystem
    • Most applications use abstraction layers (libevent, libuv) anyway
  3. “What’s the difference between level-triggered and edge-triggered?”
    • Level: kernel keeps notifying as long as fd is ready
    • Edge: kernel notifies once when state changes from not-ready to ready
    • Edge requires you to drain the buffer completely or you’ll miss data
  4. “How would you handle 1 million connections?”
    • C10K is 20+ years old; C1M is the new challenge
    • Need multiple event loops (one per core)
    • Need to think about memory per connection
    • SO_REUSEPORT helps distribute accept() load (see the sketch after this list)
  5. “What do Nginx and HAProxy use?”
    • Both have epoll and kqueue backends
    • Code is mostly the same, event API is abstracted
    • They prove the performance difference is measurable but not critical
  6. “Why didn’t Linux just implement kqueue?”
    • NIH (Not Invented Here) syndrome
    • Different kernel architecture made direct porting hard
    • By the time kqueue was proven, epoll was already deployed
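
A minimal sketch for the C1M answer above. SO_REUSEPORT (Linux 3.9+; FreeBSD has SO_REUSEPORT and, for load balancing, SO_REUSEPORT_LB) lets several processes bind the same port so the kernel spreads new connections across their listeners. listen_fd is assumed to be a fresh, not-yet-bound TCP socket:

int one = 1;
setsockopt(listen_fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));  /* set before bind() */
/* then bind() and listen() as usual; run one such listener per event loop/core */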

Hints in Layers

Hint 1: Start with a blocking echo server

Understand the baseline before optimization:

int client_fd = accept(listen_fd, NULL, NULL);
while (1) {
    ssize_t n = read(client_fd, buf, sizeof(buf));
    if (n <= 0) break;
    write(client_fd, buf, n);  // Echo back
}
close(client_fd);
// Problem: only handles ONE client at a time!
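
Every hint here assumes a listening socket already exists. A minimal sketch of creating one (make_listener is just a name used here; error checks mostly omitted):

#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

int make_listener(int port) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    int one = 1;
    setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));   /* allow fast restarts */

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(port);

    bind(fd, (struct sockaddr *)&addr, sizeof(addr));
    listen(fd, SOMAXCONN);
    return fd;
}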

Hint 2: Non-blocking sockets are essential

int flags = fcntl(fd, F_GETFL, 0);
fcntl(fd, F_SETFL, flags | O_NONBLOCK);

// Now read() returns -1 with errno=EAGAIN instead of blocking

Hint 3: kqueue skeleton

int kq = kqueue();
struct kevent ev;

// Register listener
EV_SET(&ev, listen_fd, EVFILT_READ, EV_ADD, 0, 0, NULL);
kevent(kq, &ev, 1, NULL, 0, NULL);

while (1) {
    struct kevent events[64];
    int n = kevent(kq, NULL, 0, events, 64, NULL);

    for (int i = 0; i < n; i++) {
        int fd = events[i].ident;
        if (fd == listen_fd) {
            // Accept new connection, add to kqueue
        } else if (events[i].filter == EVFILT_READ) {
            // Read from client, echo back
        }
    }
}
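
Both skeletons leave "read from client, echo back" as a comment. A minimal sketch of that handler for non-blocking sockets (echo_once is just a name used here); closing the fd also drops its event registration in kqueue and, as long as no duplicate descriptors exist, in epoll:

#include <errno.h>
#include <unistd.h>

void echo_once(int fd) {
    char buf[4096];
    for (;;) {
        ssize_t n = read(fd, buf, sizeof(buf));
        if (n > 0) {
            write(fd, buf, (size_t)n);        /* a real server must also handle short writes */
        } else if (n == 0) {
            close(fd);                        /* client closed the connection */
            return;
        } else if (errno == EAGAIN || errno == EWOULDBLOCK) {
            return;                           /* drained; wait for the next event */
        } else if (errno != EINTR) {
            close(fd);                        /* real error: drop this client */
            return;
        }
    }
}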

Hint 4: epoll skeleton

int epfd = epoll_create1(0);
struct epoll_event ev, events[64];

// Register listener
ev.events = EPOLLIN;
ev.data.fd = listen_fd;
epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev);

while (1) {
    int n = epoll_wait(epfd, events, 64, -1);

    for (int i = 0; i < n; i++) {
        int fd = events[i].data.fd;
        if (fd == listen_fd) {
            // Accept new connection
            int client = accept(listen_fd, NULL, NULL);
            ev.events = EPOLLIN;
            ev.data.fd = client;
            epoll_ctl(epfd, EPOLL_CTL_ADD, client, &ev);  // Extra syscall!
        } else {
            // Read from client, echo back
        }
    }
}

Hint 5: Benchmark with a simple load generator

# Using netcat and yes for simple load
for i in $(seq 1 1000); do
    (yes "hello" | nc localhost 8080 &)
done

# Or use wrk for HTTP if you add HTTP parsing
wrk -t4 -c10000 -d30s http://localhost:8080/

Books That Will Help

Topic Book Chapter
I/O multiplexing fundamentals The Linux Programming Interface by Michael Kerrisk Ch. 63: “Alternative I/O Models”
kqueue deep dive The Design and Implementation of the FreeBSD Operating System by McKusick et al. Ch. 6
epoll internals Linux System Programming, 2nd Edition by Robert Love Ch. 4
Non-blocking sockets UNIX Network Programming, Volume 1 by Stevens, Fenner & Rudoff Ch. 16
High-performance networking TCP/IP Illustrated, Volume 1 by W. Richard Stevens Ch. 17-18
The C10K problem Dan Kegel’s C10K paper (online) Full document
Event loop design Network Programming with Go by Adam Woodbeck Ch. 3 (concepts transfer to C)

Project 3: “Build Your Own Container/Jail/Zone”

Attribute Value
Language C (alt: Rust, Go)
Difficulty Advanced
Time 1-2 weeks
Coolness ★★★★☆ Hardcore
Portfolio Value Portfolio Piece

What you’ll build: A minimal container runtime from scratch—using Linux namespaces+cgroups, FreeBSD jails, and illumos zones—to understand how OS-level virtualization differs fundamentally across Unix systems.

Why it teaches Unix differences: As Jessie Frazelle explains, “Jails and Zones are first-class kernel concepts. Containers are NOT—they’re just a term for combining Linux namespaces and cgroups.” Building all three reveals why FreeBSD’s jail(2) is a single syscall while Linux requires orchestrating 7+ namespace types plus cgroups.

Core challenges you’ll face:

  • Challenge 1: Linux—combining mount, PID, network, user, UTS, IPC namespaces manually (maps to “building blocks” philosophy)
  • Challenge 2: FreeBSD—single jail() syscall with jailparams (maps to “first-class concept” philosophy)
  • Challenge 3: Networking inside containers—veth pairs (Linux) vs VNET jails (FreeBSD)
  • Challenge 4: Filesystem isolation—overlay/bind mounts (Linux) vs ZFS clones (FreeBSD/illumos)
  • Challenge 5: Resource limits—cgroups v2 (Linux) vs rctl (FreeBSD)

Key Concepts:

Prerequisites: C programming, basic understanding of processes and filesystems

Real world outcome:

  • Run ./mycontainer /bin/sh and get an isolated shell with its own PID 1, network stack, and filesystem view
  • Demonstrate isolation: processes inside can’t see host processes; networking is separate
  • Show the difference in complexity: ~500 lines for Linux namespace container vs ~100 lines for FreeBSD jail wrapper

Learning milestones:

  1. Linux: Create PID namespace, see process isolation → understand namespace concept
  2. Linux: Add mount namespace, overlay filesystem → understand filesystem isolation
  3. Linux: Add network namespace with veth pair → understand network virtualization
  4. FreeBSD: Create jail with single syscall → notice the dramatic simplicity difference
  5. FreeBSD: Add VNET networking to jail → understand VNET architecture
  6. Compare codebase sizes and complexity → internalize the design philosophy difference

Real World Outcome

When you complete this project, you’ll have built minimal container runtimes that demonstrate the fundamental design philosophy differences between Linux and FreeBSD.

What you’ll see on Linux (your container runtime):

$ sudo ./mycontainer run /bin/sh

[mycontainer] Creating namespaces...
[mycontainer]   PID namespace: clone(CLONE_NEWPID) - PID 1 inside!
[mycontainer]   Mount namespace: clone(CLONE_NEWNS) - isolated filesystem
[mycontainer]   UTS namespace: clone(CLONE_NEWUTS) - new hostname
[mycontainer]   Network namespace: clone(CLONE_NEWNET) - isolated network
[mycontainer]   User namespace: clone(CLONE_NEWUSER) - uid mapping
[mycontainer]   IPC namespace: clone(CLONE_NEWIPC) - isolated semaphores

[mycontainer] Setting up cgroups v2...
[mycontainer]   Memory limit: 256MB
[mycontainer]   CPU shares: 512

[mycontainer] Setting up root filesystem...
[mycontainer]   pivot_root() to /containers/alpine

[mycontainer] Setting up network...
[mycontainer]   Created veth pair: veth0 <-> container0
[mycontainer]   Container IP: 10.0.0.2/24
[mycontainer]   Host bridge: 10.0.0.1/24

[mycontainer] Dropping capabilities...
[mycontainer] Entering container...

/ # hostname
container-12345
/ # ps aux
PID   USER     TIME   COMMAND
    1 root       0:00 /bin/sh     <-- We are PID 1!
    2 root       0:00 ps aux
/ # cat /proc/1/cgroup
0::/mycontainer                   <-- Our cgroup
/ # ip addr
1: lo: <LOOPBACK,UP> mtu 65536
    inet 127.0.0.1/8
2: eth0: <BROADCAST,UP> mtu 1500
    inet 10.0.0.2/24              <-- Isolated network!
/ # exit

[mycontainer] Container exited with status 0
[mycontainer] Cleaning up namespaces and cgroups...

What you’ll see on FreeBSD (your jail runtime):

$ sudo ./myjail run /bin/sh

[myjail] Creating jail...
[myjail]   jail_set(2) with:
[myjail]     path = /jails/alpine
[myjail]     hostname = jail-12345
[myjail]     ip4.addr = 10.0.0.2

[myjail] That's it. One syscall. Jail created.
[myjail] Entering jail...

$ hostname
jail-12345
$ ps aux
USER   PID  %CPU %MEM   VSZ   RSS  TT  STAT STARTED    TIME COMMAND
root     1   0.0  0.1  4788  1524  -   SJ   14:32   0:00.01 /bin/sh
root     2   0.0  0.1  4788  1496  -   R+J  14:32   0:00.00 ps aux
$ ifconfig
lo0: flags=8049<UP,LOOPBACK,RUNNING,MULTICAST> mtu 16384
     inet 127.0.0.1 netmask 0xff000000
jail0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> mtu 1500
     inet 10.0.0.2 netmask 0xffffff00
$ exit

[myjail] Jail exited with status 0

Comparing your codebases:

$ wc -l linux_container.c freebsd_jail.c

   547 linux_container.c    # 500+ lines for Linux namespaces + cgroups
    98 freebsd_jail.c       # ~100 lines for FreeBSD jail

# Breakdown of Linux container complexity:
$ grep -c 'clone\|unshare' linux_container.c
12    # Many namespace operations
$ grep -c 'cgroup' linux_container.c
35    # cgroup setup is verbose
$ grep -c 'veth\|netlink' linux_container.c
48    # Network namespace setup is complex

# FreeBSD jail simplicity:
$ grep -c 'jail' freebsd_jail.c
8     # jail_set, jail_attach, jailparam_*

The visual difference:

┌────────────────────────────────────────────────────────────────────────────┐
│                        Linux Container Creation                             │
├────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  clone(CLONE_NEWPID)   ─┐                                                  │
│  clone(CLONE_NEWNS)    ─┤                                                  │
│  clone(CLONE_NEWUTS)   ─┼─► "Assemble the parts"                          │
│  clone(CLONE_NEWNET)   ─┤    7+ syscalls just for namespaces              │
│  clone(CLONE_NEWUSER)  ─┤                                                  │
│  clone(CLONE_NEWIPC)   ─┤                                                  │
│  clone(CLONE_NEWCGROUP)─┘                                                  │
│           +                                                                 │
│  cgroup_create()       ─┐                                                  │
│  write(memory.max)     ─┼─► Setup resource limits                         │
│  write(cpu.weight)     ─┘    More file operations                          │
│           +                                                                 │
│  veth_create()         ─┐                                                  │
│  netlink_addaddr()     ─┼─► Network setup via netlink                     │
│  netlink_addroute()    ─┘    Complex socket programming                    │
│           +                                                                 │
│  pivot_root()          ───► Filesystem isolation                           │
│           +                                                                 │
│  seccomp_load()        ───► Optional syscall filtering                     │
│                                                                             │
│  Result: ~500 lines of C, deeply understanding 5+ subsystems               │
│                                                                             │
└────────────────────────────────────────────────────────────────────────────┘

┌────────────────────────────────────────────────────────────────────────────┐
│                        FreeBSD Jail Creation                                │
├────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  struct jailparam params[] = {                                             │
│      { "path", "/jails/myjail" },                                          │
│      { "hostname", "myjail" },                                             │
│      { "ip4.addr", "10.0.0.2" },                                           │
│      { "vnet", "new" },  // VNET for network isolation                     │
│  };                                                                         │
│                                                                             │
│  jail_set(params, nparams, JAIL_CREATE | JAIL_ATTACH);                     │
│                                                                             │
│  // That's it. ONE syscall. You're in the jail.                            │
│                                                                             │
│  Result: ~100 lines of C, understanding ONE subsystem                      │
│                                                                             │
└────────────────────────────────────────────────────────────────────────────┘

The Core Question You’re Answering

“Why is a ‘container’ not a thing on Linux, and what are the real implications of this design choice?”

This project will burn into your brain the most important insight about Unix system design: Linux containers are a term for a combination of primitives. FreeBSD jails and Solaris zones are first-class kernel concepts.

As Jessie Frazelle famously wrote: “Jails and Zones are first-class concepts. Containers are NOT.” This difference explains:

  • Why container escapes happen on Linux
  • Why Docker needed years to become stable
  • Why FreeBSD jails were production-ready in 2000
  • Why illumos Zones can run Linux binaries (LX branded zones)

Concepts You Must Understand First

Stop and research these before coding:

  1. Process Isolation Fundamentals
    • What does a process see? (memory space, file descriptors, PID space)
    • What is chroot() and why isn’t it enough for isolation?
    • What is “escaping” a chroot and how is it done?
    • Book Reference: Operating Systems: Three Easy Pieces by Arpaci-Dusseau — Part II: “Virtualization”
  2. Linux Namespaces (The Building Blocks)
    • PID namespace: What does it mean to have PID 1?
    • Mount namespace: How does the filesystem view differ?
    • Network namespace: What is a network stack?
    • User namespace: How do UID/GID mappings work?
    • UTS namespace: Just the hostname, but important! (see the sketch after this list)
    • IPC namespace: Semaphores, message queues, shared memory
    • Book Reference: The Linux Programming Interface by Michael Kerrisk — Ch. 28-29
  3. Linux cgroups (Resource Limits)
    • What is a cgroup hierarchy?
    • cgroups v1 vs v2: Why did Linux redesign this?
    • How do you limit memory, CPU, I/O?
    • Book Reference: How Linux Works, 3rd Edition by Brian Ward — Ch. 8
  4. FreeBSD Jails (The Integrated Approach)
    • What is the jail(2) system call?
    • What’s a jailparam and how do you set them?
    • What is VNET and why does it make jails more powerful?
    • What is rctl (resource control)?
    • Book Reference: Absolute FreeBSD, 3rd Edition by Michael W. Lucas — Ch. 12: “Jails”
  5. Filesystem Isolation
    • Linux: What is pivot_root() vs chroot()?
    • Linux: What is an overlay filesystem?
    • FreeBSD: How do nullfs mounts work?
    • Both: How does ZFS make container storage better?
    • Book Reference: The Linux Programming Interface by Michael Kerrisk — Ch. 18
  6. Network Virtualization
    • Linux: What is a veth pair? What is a bridge?
    • Linux: How does netlink work?
    • FreeBSD: What is VNET? How is it different from IP-based jails?
    • Reference: FreeBSD Handbook Ch. 17: Jails
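
A minimal sketch for the UTS item above, since it is the smallest namespace you can experiment with. Run it as root (or inside a user namespace); the host's hostname is untouched:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    if (unshare(CLONE_NEWUTS) == -1) {           /* give this process its own UTS namespace */
        perror("unshare");
        return 1;
    }
    sethostname("inside-ns", 9);                 /* visible only inside the namespace */
    char name[64];
    gethostname(name, sizeof(name));
    printf("hostname in namespace: %s\n", name);
    return 0;
}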

Questions to Guide Your Design

Before implementing, think through these:

  1. What defines “isolation”?
    • From the container’s view: what should it be unable to see/do?
    • From the host’s view: what should be protected?
    • What’s the threat model?
  2. How do you set up the root filesystem?
    • Where do you get a minimal rootfs? (Alpine, busybox)
    • Should changes persist or be discarded? (overlay vs bind mount)
    • How do you mount /proc, /sys, /dev inside?
  3. How do you handle networking?
    • Does the container need network access?
    • How does traffic get routed between container and host?
    • Do you need NAT for outbound connections?
  4. What about resource limits?
    • How much memory should the container have?
    • Should it have limited CPU?
    • What happens when limits are exceeded?
  5. How do you enter the container?
    • Linux: clone() with flags vs unshare() + fork() (see the sketch after this list)
    • FreeBSD: jail_attach() vs starting a new process in jail
    • What happens to the child process?
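
A minimal sketch of the unshare() + fork() alternative mentioned above: after unshare(CLONE_NEWPID) the calling process keeps its old PID, and the first child it forks becomes PID 1 of the new namespace (needs root or a user namespace):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    if (unshare(CLONE_NEWPID) == -1) {           /* affects children created from now on */
        perror("unshare");
        return 1;
    }
    pid_t pid = fork();
    if (pid == 0) {
        printf("child sees itself as PID %d\n", getpid());   /* prints 1 */
        return 0;
    }
    waitpid(pid, NULL, 0);
    return 0;
}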

Thinking Exercise

Before coding, trace what Docker does on Linux:

$ strace -f docker run --rm alpine echo hello 2>&1 | grep -E 'clone|unshare|mount|pivot|cgroup'

# You'll see something like:
clone(child_stack=0x..., flags=CLONE_NEWNS|CLONE_NEWPID|...)
mount("none", "/", NULL, MS_REC|MS_PRIVATE, NULL)
mount("overlay", "/var/lib/docker/.../merged", "overlay", ...)
pivot_root(".", ".")
mount("proc", "/proc", "proc", ...)
openat(AT_FDCWD, "/sys/fs/cgroup/memory/.../memory.max", ...)
write(3, "268435456", 9)  # 256MB memory limit
clone(child_stack=0x..., flags=CLONE_NEWNET|...)

Now trace what FreeBSD does with a jail:

$ truss jail -c name=test path=/jails/test command=/bin/sh 2>&1 | grep jail

# You'll see:
jail_set(0x..., 5, 0x3)  # THAT'S IT. One syscall.

Map the complexity:

┌─────────────────────────────────────────────────────────────────────┐
│                    What Docker Does on Linux                         │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  Layer 1: Namespace creation (7 different namespaces)               │
│                                                                      │
│  Layer 2: cgroup creation and configuration                         │
│           - Create cgroup directory                                  │
│           - Write limits to pseudo-files                             │
│           - Add process to cgroup                                    │
│                                                                      │
│  Layer 3: Filesystem setup                                           │
│           - Create overlay mount                                     │
│           - pivot_root to new root                                   │
│           - Mount /proc, /sys, /dev                                  │
│           - Mask sensitive paths                                     │
│                                                                      │
│  Layer 4: Network setup                                              │
│           - Create veth pair                                         │
│           - Move one end to container namespace                      │
│           - Configure IP addresses                                   │
│           - Set up routing                                           │
│           - Configure iptables rules                                 │
│                                                                      │
│  Layer 5: Security                                                   │
│           - Drop capabilities                                        │
│           - Install seccomp filter                                   │
│           - Set up AppArmor/SELinux profile                         │
│                                                                      │
│  TOTAL: 50+ syscalls, configuration across 5+ subsystems            │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────┐
│                    What FreeBSD Does with Jails                      │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  Layer 1: jail_set() with parameters                                 │
│           - path: root filesystem                                    │
│           - hostname: container name                                 │
│           - ip4.addr / vnet: network config                         │
│           - (optional) resource limits via rctl                     │
│                                                                      │
│  TOTAL: 1-3 syscalls, everything is integrated                      │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘

The philosophical difference:

  • Linux: “Here are Lego blocks. Assemble them yourself. Maximum flexibility!”
  • FreeBSD: “Here’s a finished product. It works. Limited flexibility, but it’s correct.”

The Interview Questions They’ll Ask

Prepare to answer these:

  1. “What’s the difference between a container and a VM?”
    • VM: Separate kernel, hardware virtualization (hypervisor)
    • Container: Shared kernel, OS-level virtualization (namespaces/jails)
    • Container is lighter but has weaker isolation
  2. “How does Docker work under the hood on Linux?”
    • Uses clone() with namespace flags
    • Uses cgroups for resource limits
    • Uses overlay filesystem for copy-on-write layers
    • Uses pivot_root() for filesystem isolation
  3. “What is a container escape?”
    • Attacker inside container gains access to host
    • Usually through kernel vulnerabilities (shared kernel!)
    • Or misconfiguration (privileged containers, mounted docker socket)
  4. “Why are FreeBSD jails considered more secure?”
    • Single, audited subsystem vs. assembled primitives
    • Less complexity = fewer bugs
    • Jails existed since 2000, battle-tested
  5. “What is the difference between Docker and LXC/LXD?”
    • Docker: Application containers, immutable images, microservices
    • LXC/LXD: System containers, more like lightweight VMs
    • Both use the same Linux primitives underneath
  6. “How would you debug a container networking issue?”
    • Check namespace with nsenter
    • Check veth pairs with ip link
    • Check routing with ip route
    • Check iptables rules

Hints in Layers

Hint 1: Start with just PID namespace

The simplest isolation—see a different PID space:

// Linux: fork into a new PID namespace (needs root or CAP_SYS_ADMIN)
static char stack[1024 * 1024];                  // child stack; clone() takes a pointer to its top
int child_pid = clone(child_func, stack + sizeof(stack),
                      CLONE_NEWPID | SIGCHLD, NULL);

// In child_func:
printf("I am PID %d\n", getpid());  // Will print "I am PID 1"!

Hint 2: Add mount namespace for filesystem isolation

// After clone with CLONE_NEWNS:
mount("none", "/", NULL, MS_REC | MS_PRIVATE, NULL);   // Keep mount changes from propagating to the host
mount("/containers/alpine", "/containers/alpine", NULL, MS_BIND, NULL);  // pivot_root needs a mount point
chdir("/containers/alpine");
syscall(SYS_pivot_root, ".", ".");   // glibc has no pivot_root() wrapper; needs <sys/syscall.h>
umount2(".", MNT_DETACH);            // Unmount old root

Hint 3: FreeBSD jail is dramatically simpler

#include <sys/param.h>
#include <sys/jail.h>
#include <jail.h>                      // libjail; link with -ljail

struct jailparam params[4];
jailparam_init(&params[0], "path");
jailparam_import(&params[0], "/jails/myjail");   // values are set with jailparam_import()
jailparam_init(&params[1], "host.hostname");
jailparam_import(&params[1], "myjail");
jailparam_init(&params[2], "ip4.addr");
jailparam_import(&params[2], "10.0.0.2");
jailparam_init(&params[3], "persist");
jailparam_import(&params[3], NULL);              // no value needed for the boolean "persist"

int jid = jailparam_set(params, 4, JAIL_CREATE | JAIL_ATTACH);
// That's it! You're now in the jail.

Hint 4: Linux network namespace needs veth pair

// This requires netlink programming or shelling out to `ip`:
system("ip link add veth0 type veth peer name container0");
system("ip link set container0 netns <pid>");
system("ip addr add 10.0.0.1/24 dev veth0");
system("ip link set veth0 up");

// Inside container namespace:
system("ip addr add 10.0.0.2/24 dev container0");
system("ip link set container0 up");
system("ip route add default via 10.0.0.1");

Hint 5: cgroups v2 setup

// Create cgroup
mkdir("/sys/fs/cgroup/mycontainer", 0755);

// Set memory limit (256MB)
int fd = open("/sys/fs/cgroup/mycontainer/memory.max", O_WRONLY);
write(fd, "268435456", 9);
close(fd);

// Add process to cgroup
fd = open("/sys/fs/cgroup/mycontainer/cgroup.procs", O_WRONLY);
char pid_str[16];
sprintf(pid_str, "%d", child_pid);
write(fd, pid_str, strlen(pid_str));
close(fd);

Books That Will Help

Topic Book Chapter
Linux namespaces The Linux Programming Interface by Michael Kerrisk Ch. 28-29 (Process Creation)
Linux cgroups How Linux Works, 3rd Edition by Brian Ward Ch. 8
FreeBSD jails Absolute FreeBSD, 3rd Edition by Michael W. Lucas Ch. 12: “Jails”
Container internals Container Security by Liz Rice Full book (O’Reilly)
Process isolation theory Operating Systems: Three Easy Pieces by Arpaci-Dusseau Part II: “Virtualization”
Filesystem namespaces The Linux Programming Interface by Michael Kerrisk Ch. 18: “Directories and Links”
Network namespaces Linux Network Internals by Christian Benvenuti Ch. 1-3
illumos Zones illumos Documentation Online

Project 4: “Packet Filter Firewall Configuration Tool”

Attribute Value
Language C
Difficulty Intermediate
Time Weekend
Coolness ★★★☆☆ Genuinely Clever
Portfolio Value Portfolio Piece

What you’ll build: A command-line tool that generates and applies firewall rules—using pf on OpenBSD/FreeBSD and nftables on Linux—from a common configuration format.

Why it teaches Unix differences: OpenBSD’s pf (packet filter) is legendary for its clean syntax and powerful features. Linux’s nftables (replacing iptables) has different semantics. Building a tool that targets both forces you to understand network stack differences at the kernel level.

Core challenges you’ll face:

  • Challenge 1: pf’s stateful inspection model vs nftables’ table/chain/rule hierarchy
  • Challenge 2: pf anchors vs nftables sets for dynamic rules
  • Challenge 3: NAT handling differences
  • Challenge 4: Loading rules atomically vs incrementally

Key Concepts:

  • pf fundamentals: “Absolute OpenBSD” Ch. 7 - Michael W. Lucas
  • OpenBSD pf FAQ: OpenBSD official documentation
  • nftables design: Linux nftables wiki
  • BSD networking: “TCP/IP Illustrated Vol. 1” - W. Richard Stevens (the BSD reference implementation)
  • Packet filtering theory: “Mastering FreeBSD and OpenBSD Security” Ch. 4-5 - Hope, Potter & Korff

Prerequisites: Basic networking (TCP/IP), understanding of firewalls conceptually

Real world outcome:

  • A tool that reads YAML like allow: {port: 22, from: 10.0.0.0/8} and outputs valid pf.conf or nftables rules
  • Apply rules and demonstrate: blocked connections fail, allowed connections succeed
  • Show the same logical policy expressed in both syntaxes

Learning milestones:

  1. Write pf rules manually on OpenBSD → understand pf syntax and concepts
  2. Write equivalent nftables rules on Linux → notice the structural differences
  3. Build parser for common config format → abstract the similarities
  4. Generate native rules for each OS → encode the differences
  5. Test with real traffic → verify correctness

Real World Outcome

You’ll have a working command-line firewall configuration tool that abstracts away the differences between pf and nftables. Here’s exactly what you’ll see:

Example Input (config.yaml):

default_policy: drop
rules:
  - action: allow
    protocol: tcp
    port: 22
    from: 192.168.1.0/24
    comment: "SSH from local network"

  - action: allow
    protocol: tcp
    ports: [80, 443]
    comment: "Web traffic"

  - action: allow
    state: established
    comment: "Allow return traffic"

On OpenBSD/FreeBSD (pf output):

$ ./firewall-tool --generate pf config.yaml

# Generated pf.conf
set skip on lo
block all
pass in on egress proto tcp from 192.168.1.0/24 to port 22  # SSH from local network
pass in on egress proto tcp to port { 80, 443 }  # Web traffic
pass out all keep state  # Allow return traffic

$ sudo ./firewall-tool --apply pf config.yaml
Loading pf rules...
Rules loaded successfully
Active rules: 3

On Linux (nftables output):

$ ./firewall-tool --generate nftables config.yaml

# Generated nftables rules
table inet filter {
    chain input {
        type filter hook input priority 0; policy drop;
        iif lo accept
        ct state established,related accept
        tcp dport 22 ip saddr 192.168.1.0/24 accept comment "SSH from local network"
        tcp dport { 80, 443 } accept comment "Web traffic"
    }
    chain forward {
        type filter hook forward priority 0; policy drop;
    }
    chain output {
        type filter hook output priority 0; policy accept;
    }
}

$ sudo ./firewall-tool --apply nftables config.yaml
Flushing existing nftables rules...
Loading new ruleset...
Ruleset applied successfully

Testing the firewall:

# From allowed IP (192.168.1.10):
$ ssh user@firewall-host
Connected!

# From denied IP (10.0.0.5):
$ ssh user@firewall-host
Connection refused

# HTTP works from anywhere:
$ curl http://firewall-host
<html>...</html>

# Show active rules:
$ ./firewall-tool --status
Firewall: pf (OpenBSD)
Status: active
Rules loaded: 3
Packets blocked: 42
Packets allowed: 1,523

You’re seeing the exact same logical firewall policy expressed in two completely different syntaxes, and understanding why each OS chose its approach.


The Core Question You’re Answering

“Why does OpenBSD’s pf syntax feel so much cleaner than iptables/nftables? Is it just aesthetics, or does it reflect deeper architectural differences in how packet filtering works?”

Before writing any code, sit with this question. The answer isn’t “BSD is better”—it’s that pf was designed as a unified language for a stable kernel API, while Linux’s firewall evolved through three generations (ipchains → iptables → nftables) to add flexibility. This is the “cathedral vs bazaar” philosophy expressed in firewall syntax.


Concepts You Must Understand First

Stop and research these before coding:

  1. Stateful vs Stateless Packet Filtering
    • What does “keep state” mean in pf?
    • How does connection tracking work in nftables?
    • Why is stateful filtering essential for modern firewalls?
    • Book Reference: “Mastering FreeBSD and OpenBSD Security” Ch. 4 - Hope, Potter & Korff
  2. Packet Filter Placement in Network Stack
    • Where does packet filtering happen in the kernel?
    • What’s the difference between INPUT, FORWARD, OUTPUT chains?
    • How does the filter see packets before vs after NAT?
    • Book Reference: “TCP/IP Illustrated Vol. 1” Ch. 7 - W. Richard Stevens
  3. pf Architecture
    • What are pf tables and anchors?
    • How does pf handle rule atomicity?
    • What’s the syntax for matching criteria (proto, from, to, port)?
    • Book Reference: “Absolute OpenBSD” Ch. 7 - Michael W. Lucas
  4. nftables Architecture
    • What’s the table/chain/rule hierarchy?
    • How do nftables sets work?
    • What’s the nft family concept (ip, ip6, inet, arp)?
    • Book Reference: Linux nftables wiki, netfilter documentation

Questions to Guide Your Design

Before implementing, think through these:

  1. Configuration Format
    • YAML, JSON, or custom syntax?
    • How do you represent port ranges?
    • How do you handle comments?
    • Should you support variable substitution?
  2. Rule Translation
    • What’s the minimal common subset of features?
    • How do you handle features unique to one system?
    • Do you translate states directly or use common abstractions?
    • What happens when a feature isn’t supported?
  3. Application Strategy
    • Do you flush existing rules or merge?
    • How do you handle rollback on errors?
    • Should rules persist across reboots?
    • How do you test rules before applying?
  4. Error Handling
    • What if pf.conf syntax is invalid?
    • What if nft command fails?
    • How do you validate rules before loading?
    • Should you test connectivity after applying?

Thinking Exercise

Manual Rule Writing

Before coding, manually write equivalent firewall rules for this scenario:

Scenario: A web server that:

  • Accepts SSH (port 22) only from 10.0.1.0/24
  • Accepts HTTP (80) and HTTPS (443) from anywhere
  • Drops all other incoming traffic
  • Allows all outgoing traffic
  • Maintains state for connections

Write the rules in:

  1. pf syntax (on OpenBSD/FreeBSD)
  2. nftables syntax (on Linux)

Questions while writing:

  • Which syntax feels more intuitive?
  • How many lines did each take?
  • Which has better handling of port lists?
  • How is “state” expressed in each?
  • Which would be easier to generate programmatically?

Compare your answers to the pf(5) and nft(8) man pages.


The Interview Questions They’ll Ask

Prepare to answer these:

  1. “Explain the difference between iptables and nftables. Why did Linux move to nftables?”
  2. “What’s the advantage of pf’s ‘quick’ keyword and how does it compare to iptables’ rule ordering?”
  3. “How would you design a firewall rule validation system that catches errors before applying rules?”
  4. “Describe how stateful packet inspection works. What state information is tracked?”
  5. “What happens when you reload firewall rules on a production server? How do you minimize disruption?”
  6. “Explain NAT (Network Address Translation) and how it interacts with packet filtering rules.”

Hints in Layers

Hint 1: Start with Rule Parsing

Don't jump into pf/nftables syntax generation immediately. First, build a solid internal representation of firewall rules. Parse your config format (YAML/JSON) into C structs that represent abstract concepts like "allow TCP port 22 from 10.0.0.0/8". This data structure should be OS-agnostic.

Hint 2: Study Real Configurations

Before generating rules, collect examples:

# On OpenBSD/FreeBSD
cat /etc/pf.conf
pfctl -sr  # Show current rules

# On Linux
nft list ruleset
iptables-save  # See legacy rules

Notice patterns: how are ports specified? How are networks represented? Where do comments go?

Hint 3: Code Generation Strategy

// Internal representation (needs <stdbool.h>):
struct firewall_rule {
    enum action   { ALLOW, DENY }         action;
    enum protocol { TCP, UDP, ICMP, ANY } protocol;
    char *source_ip;
    int   source_port;
    char *dest_ip;
    int   dest_port;
    bool  stateful;
};

// Then write generators:
void generate_pf(struct firewall_rule *rules, int count);
void generate_nftables(struct firewall_rule *rules, int count);
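
A minimal sketch of what one generator might look like, under stated assumptions (the struct above, TCP-only rules, single ports, no interface handling); the output format follows the pf example earlier in this project:

#include <stdio.h>

void generate_pf(struct firewall_rule *rules, int count) {
    printf("block all\n");                                   /* default-deny, as in the example */
    for (int i = 0; i < count; i++) {
        struct firewall_rule *r = &rules[i];
        printf("%s in proto tcp from %s to port %d%s\n",
               r->action == ALLOW ? "pass" : "block",
               r->source_ip ? r->source_ip : "any",
               r->dest_port,
               r->stateful ? " keep state" : "");
    }
}

generate_nftables() would walk the same rule array and emit table/chain/rule syntax instead.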

Hint 4: Use System Commands

Don't try to write binary netlink messages or ioctl calls initially. Shell out to system commands:

// For pf:
system("pfctl -f /tmp/generated.pf.conf");

// For nftables:
system("nft -f /tmp/generated.nft");

Later, you can optimize by using native APIs.


Books That Will Help

Topic Book Chapter
pf fundamentals “Absolute OpenBSD” by Michael W. Lucas Ch. 7: “Packet Filter”
pf advanced features “The Book of PF, 3rd Edition” by Peter N.M. Hansteen Full book
BSD network stack “The Design and Implementation of the FreeBSD Operating System” by McKusick Ch. 12-13
TCP/IP for filtering “TCP/IP Illustrated Vol. 1” by W. Richard Stevens Ch. 1-11
Linux netfilter “Linux Firewalls” by Steve Suehring Ch. 3-5 (iptables/nftables)
Packet filtering theory “Mastering FreeBSD and OpenBSD Security” by Hope, Potter, Korff Ch. 4-5
Network security concepts “Network Security with OpenBSD and PF” by Jacek Artymiak Full book

Common Pitfalls & Debugging

Problem 1: “Rules load but don’t work”

  • Why: Rule order matters! Most firewalls use “first match wins” (or “last match” for pf without quick)
  • Fix: Add logging to see which rule is matching:
    # pf: add 'log' keyword
    pass in log proto tcp to port 22
    
    # nftables: add counter
    tcp dport 22 counter accept
    
  • Quick test: tcpdump -i pflog0 (OpenBSD) or nft monitor (Linux)

Problem 2: “Syntax error when loading pf.conf”

  • Why: pf syntax is picky about whitespace and order
  • Fix: Use pfctl -nf /etc/pf.conf to test without loading
  • Quick test: Check man pf.conf examples, compare line-by-line

Problem 3: “nftables rules disappear after reboot”

  • Why: nftables doesn’t auto-persist (unlike some iptables setups)
  • Fix: Save rules: nft list ruleset > /etc/nftables.conf
  • Quick test: Load the file at boot: add nft -f /etc/nftables.conf to a boot script or enable the distribution's nftables systemd service

Problem 4: “Can’t SSH after applying firewall”

  • Why: You locked yourself out! You forgot to allow established connections or your SSH port
  • Fix: Always test with a second SSH session open. Use pfctl -d (disable) or nft flush ruleset as escape hatch
  • Quick test: Add a console/serial access to your VM before testing

Problem 5: “Port ranges work differently on pf vs nftables”

  • Why: pf uses { port1, port2 } for lists and port1:port2 for ranges; nftables uses { port1, port2 } for sets and port1-port2 for ranges
  • Fix: Abstract port specifications in your internal representation:
    struct port_spec {
        bool is_range;
        int start, end;
        int *list;
        int list_count;
    };
    
  • Quick test: Verify with pfctl -sr or nft list ruleset

Project 5: “DTrace/eBPF System Tracer”

Attribute Value
Language C (alt: D (DTrace), Rust, Python (BCC))
Difficulty Advanced
Time 1-2 weeks
Coolness ★★★★☆ Hardcore
Portfolio Value Portfolio Piece

What you’ll build: A system tracing tool that shows function call latencies in running processes—using DTrace on FreeBSD/illumos/macOS and eBPF on Linux.

Why it teaches Unix differences: DTrace originated in Solaris (now illumos) and was ported to FreeBSD and macOS. Linux created eBPF as a “competitor.” Both let you instrument a running kernel without rebooting, but their models differ significantly. DTrace uses the D language; eBPF uses C compiled to bytecode with complex verifier rules.

Core challenges you’ll face:

  • Challenge 1: D language scripts vs eBPF C programs (maps to “language design” philosophy)
  • Challenge 2: DTrace probes (fbt, syscall, pid) vs eBPF attach points (kprobe, tracepoint, uprobe)
  • Challenge 3: DTrace aggregations (@count, @quantize) vs eBPF maps
  • Challenge 4: Safety models—DTrace’s interpreter vs eBPF’s verifier

Key Concepts:

Prerequisites: Understanding of kernel/userspace boundary, basic C

Real world outcome:

  • Run ./mytrace -p <pid> and see output like: read() latency: min=1μs avg=50μs max=2ms histogram: [1-10μs: 500] [10-100μs: 200]
  • Same tool works on FreeBSD (DTrace) and Linux (eBPF) with different backends
  • Demonstrate tracing a real application (like nginx) to find performance bottlenecks

Learning milestones:

  1. Write simple DTrace one-liner on FreeBSD → understand probe concept
  2. Convert to D script with aggregations → understand D language
  3. Port to eBPF/BCC on Linux → notice the complexity increase
  4. Add histogram output → understand both aggregation models
  5. Trace real application → apply knowledge practically

Real World Outcome

You’ll build a cross-platform system tracer that reveals function call latencies in running processes. Here’s exactly what you’ll see:

On FreeBSD/illumos (DTrace):

$ sudo ./mytrace -p 1234
Tracing PID 1234 (nginx worker process)...
Press Ctrl-C to stop

^C
=== Function Call Latency Report ===

read() calls:
  Count: 1,247
  Min:   823 ns
  Avg:   45.3 μs
  Max:   2.1 ms

  Latency distribution (μs):
       value  ------------- Distribution ------------- count
          1 |                                          0
          2 |@@                                        42
          4 |@@@@@@                                    127
          8 |@@@@@@@@@@@@@@                            312
         16 |@@@@@@@@@@@@@@@@@@@@@                    485
         32 |@@@@@@@@@@                                201
         64 |@@@@                                      68
        128 |@                                         11
        256 |                                          1
        512 |                                          0

write() calls:
  Count: 892
  Min:   1.2 μs
  Avg:   67.8 μs
  Max:   5.4 ms

Top functions by time:
  1. read()           - 45.3 μs avg (56.4 ms total)
  2. write()          - 67.8 μs avg (60.5 ms total)
  3. kevent()         - 1.2 ms avg   (34.2 ms total)
  4. accept()         - 156 μs avg   (12.4 ms total)

On Linux (eBPF):

$ sudo ./mytrace -p 1234
Attaching eBPF probes to PID 1234...
Tracing function calls (Ctrl-C to stop)...

^C
Detaching probes...

=== Function Call Latency Report ===

Syscall latencies:
  read():       count=1247  min=0.8μs  avg=45.1μs  max=2.0ms
  write():      count=892   min=1.1μs  avg=68.2μs  max=5.5ms
  epoll_wait(): count=28    min=1.0ms  avg=1.2ms   max=3.4ms

Histogram for read() (μs):
  [0-2):    42  |@@                                    |
  [2-4):    127 |@@@@@@                                |
  [4-8):    312 |@@@@@@@@@@@@@@                        |
  [8-16):   485 |@@@@@@@@@@@@@@@@@@@@@                 |
  [16-32):  201 |@@@@@@@@@@                            |
  [32-64):  68  |@@@@                                  |
  [64-128): 11  |@                                     |
  [128+):   1   |                                      |

What you’re seeing:

  • DTrace uses native @quantize for elegant histograms
  • eBPF requires manual histogram buckets in BPF maps
  • Same insights, different programming models
  • Both reveal performance bottlenecks in real time

The Core Question You’re Answering

“How can you safely instrument a production system running at full speed without restarting it, and why did Solaris and Linux take such different approaches to solving this problem?”

Before writing any code, sit with this question. The answer reveals a fundamental divide: DTrace was designed from the ground up as a unified observability framework (the “cathedral”), while eBPF evolved from packet filtering into a general-purpose kernel instrumentation tool (the “bazaar”). Both work, but the design philosophies are polar opposites.


Concepts You Must Understand First

Stop and research these before coding:

  1. Kernel Probes and Instrumentation Points
    • What’s the difference between a kprobe and a tracepoint?
    • How does DTrace’s fbt (Function Boundary Tracing) work?
    • Why are some probe points “stable” and others not?
    • Book Reference: “DTrace: Dynamic Tracing in Oracle Solaris…” by Brendan Gregg — Ch. 2-3
  2. Safety in Kernel Space
    • How does DTrace’s interpreter prevent crashes?
    • What does the eBPF verifier check for?
    • Why can’t you use arbitrary pointers in eBPF programs?
    • Book Reference: “BPF Performance Tools” by Brendan Gregg — Ch. 2
  3. Aggregations and Data Collection
    • What are DTrace aggregations (@count, @avg, @quantize)?
    • How do eBPF maps work?
    • How do you get data from kernel space to user space?
    • Book Reference: “DTrace: Dynamic Tracing…” Ch. 4-5
  4. The D Language vs C-to-BPF
    • What restrictions does the D language have?
    • How is eBPF C compiled to bytecode?
    • What are the helper functions available in each?
    • Book Reference: “BPF Performance Tools” Ch. 3-4

Questions to Guide Your Design

Before implementing, think through these:

  1. Probe Selection
    • Which system calls should you trace?
    • Do you use syscall probes or function probes?
    • How do you filter for specific PIDs?
    • Should you trace kernel functions or just syscalls?
  2. Data Collection Strategy
    • Timestamps: monotonic clock or real-time clock?
    • Aggregations: keep raw data or compute in-kernel?
    • How often do you read data from kernel to userspace?
    • What’s the memory overhead?
  3. Cross-Platform Abstraction
    • Can you share data structure definitions?
    • Should you generate D scripts or write them directly?
    • How do you handle features one system has but the other doesn’t?
    • Can you provide a unified CLI for both?
  4. Performance Impact
    • What’s the overhead of each probe?
    • How do you minimize impact on the traced process?
    • Should you sample or trace every event?
    • How do you handle high-frequency functions?

Thinking Exercise

Manual DTrace vs eBPF Comparison

Before coding, write equivalent tracing scripts for this task:

Task: Count how many times the read() syscall is called by PID 1234, and show the average latency.

Write it in:

  1. DTrace (D language)
  2. eBPF (BCC Python)

Questions while writing:

  • Which is more concise?
  • Which handles aggregations more elegantly?
  • Which requires more boilerplate?
  • Which gives you more control?
  • Which is easier to debug?

DTrace answer (peek if stuck):

syscall::read:entry /pid == 1234/ {
    self->ts = timestamp;
}

syscall::read:return /self->ts/ {
    @latency = avg(timestamp - self->ts);
    @count = count();
    self->ts = 0;    /* free the thread-local variable */
}

eBPF answer (peek if stuck):

# Requires BCC library, much longer!

The Interview Questions They’ll Ask

Prepare to answer these:

  1. “Explain how DTrace ensures it doesn’t crash the kernel when running arbitrary D scripts.”
  2. “What’s the difference between a kprobe and a tracepoint in Linux? When would you use each?”
  3. “Why can’t you have unbounded loops in eBPF programs? How does the verifier enforce this?”
  4. “Describe how you’d use tracing to find the cause of a performance regression in a production system.”
  5. “What are the trade-offs between DTrace and eBPF? When would you choose one over the other?”
  6. “Explain what a ‘probe effect’ is and how you minimize it when tracing high-frequency events.”

Hints in Layers

Hint 1: Start with DTrace One-Liners

Don't jump into complex scripts. Start with DTrace command-line one-liners:

# Count syscalls by process
dtrace -n 'syscall:::entry { @[execname] = count(); }'

# Show read latency
dtrace -n 'syscall::read:entry { self->ts = timestamp; } syscall::read:return /self->ts/ { @["latency"] = avg(timestamp - self->ts); }'

Get comfortable with the syntax before writing full .d scripts.

Hint 2: Understand Probe Syntax

Both systems use a similar pattern for probe specifications:

DTrace:  provider:module:function:name
eBPF:    attachment_type:target

# Examples:
DTrace:  syscall::read:entry
eBPF:    tracepoint:syscalls:sys_enter_read

Learn the probe naming conventions from documentation before coding.

Hint 3: Use BCC for eBPF (Initially)

Don't write raw eBPF in C initially. Use the BCC Python framework:

from bcc import BPF

prog = """
int trace_read(struct pt_regs *ctx) {
    u64 pid = bpf_get_current_pid_tgid() >> 32;
    if (pid != TARGET_PID) return 0;
    // ... BPF code here
}
"""

b = BPF(text=prog.replace("TARGET_PID", str(target_pid)))

BCC handles a lot of boilerplate for you.

Hint 4: Data Aggregation Strategies

DTrace approach:
  @histogram = quantize(value);  // Built-in histogram
  @avg = avg(value);              // Built-in average

eBPF approach:
  BPF_HISTOGRAM(dist, u64);      // BCC macro
  dist.increment(bpf_log2l(value)); // Manual bucketing

Notice DTrace has richer built-in aggregations.
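
One consequence: DTrace prints its aggregations automatically when the script exits, while BCC leaves rendering to your Python process. Assuming a table declared as BPF_HISTOGRAM(dist) has been filled from a probe (b being the BPF object from Hint 3), the user-space half is just:

# Render the in-kernel log2 histogram from userspace; BCC does not print it
# for you the way DTrace prints aggregations at exit.
b["dist"].print_log2_hist("latency (ns)")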


Books That Will Help

Topic | Book | Chapter
DTrace fundamentals | “DTrace: Dynamic Tracing in Oracle Solaris…” by Brendan Gregg and Jim Mauro | Ch. 1-5
D language syntax | “DTrace: Dynamic Tracing…” by Brendan Gregg and Jim Mauro | Ch. 4
DTrace on FreeBSD | “Absolute FreeBSD, 3rd Edition” by Michael W. Lucas | Ch. 19
eBPF fundamentals | “BPF Performance Tools” by Brendan Gregg | Ch. 1-4
Linux tracing | “Systems Performance, 2nd Edition” by Brendan Gregg | Ch. 13-15
Kernel internals | “Understanding the Linux Kernel, 3rd Edition” by Bovet & Cesati | Ch. 1-3
Performance analysis | “Systems Performance” by Brendan Gregg | Full book

Common Pitfalls & Debugging

Problem 1: “DTrace probe doesn’t fire”

  • Why: Probe name is wrong, module not loaded, or function inlined
  • Fix: List available probes with dtrace -l | grep <function>
  • Quick test: Start with syscall::: probes (always available)

Problem 2: “eBPF verifier rejects my program”

  • Why: Unbounded loop, invalid pointer dereference, or stack too large
  • Fix: Check dmesg for verifier error. Common issues:
    // BAD: unbounded loop
    for (int i = 0; i < n; i++) { }
    
    // GOOD: bounded loop
    #pragma unroll
    for (int i = 0; i < 10; i++) { }
    
  • Quick test: Simplify your BPF program to minimal version, add complexity incrementally

Problem 3: “Timestamps are wrong or negative”

  • Why: Using wrong clock source or overflow in subtraction
  • Fix: DTrace: use timestamp (nanoseconds). eBPF: use bpf_ktime_get_ns()
  • Quick test: Print raw timestamp values to debug

Problem 4: “High overhead, system slows down”

  • Why: Tracing high-frequency events without filtering
  • Fix: Add PID filter early:
    // DTrace
    syscall:::entry /pid == $1/ { }
    
    // eBPF
    u64 pid = bpf_get_current_pid_tgid() >> 32;
    if (pid != TARGET_PID) return 0;
    
  • Quick test: Start with low-frequency probes (once/second), increase gradually

Problem 5: “Can’t attach to user-space functions”

  • Why: No debug symbols, ASLR randomization, or wrong probe type
  • Fix: DTrace: use pid$target:::entry. eBPF: use uprobes with symbol resolution
  • Quick test: Verify binary has symbols: nm <binary> | grep <function>
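
On the eBPF side, the uprobe attachment itself is a single BCC call once symbols resolve; a minimal sketch (counting calls to libc's malloc; the names and the 5-second window are illustrative):

from bcc import BPF
import ctypes, time

# Sketch: attach a uprobe to a user-space function and count hits.
b = BPF(text="""
#include <uapi/linux/ptrace.h>
BPF_ARRAY(calls, u64, 1);
int count_malloc(struct pt_regs *ctx) {
    int key = 0;
    u64 *val = calls.lookup(&key);
    if (val) __sync_fetch_and_add(val, 1);
    return 0;
}
""")
b.attach_uprobe(name="c", sym="malloc", fn_name="count_malloc")

time.sleep(5)
print("malloc calls in 5s: %d" % b["calls"][ctypes.c_int(0)].value)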

Problem 6: “Data loss: ‘events dropped’”

  • Why: Perf buffer too small, events generated faster than consumed
  • Fix: Increase buffer size:
    # BCC Python
    b = BPF(text=prog)
    b["events"].open_perf_buffer(callback, page_cnt=512)  # Increase from default 8
    
  • Quick test: Monitor drop count, adjust buffer until drops == 0

Project Comparison Table

Project | Difficulty | Time | Depth of Understanding | Fun Factor | OSes Covered
Sandboxed Service | Intermediate | 2-3 weeks | ⭐⭐⭐⭐⭐ (security models) | ⭐⭐⭐ | OpenBSD, FreeBSD, Linux
Event-Driven Server | Intermediate | 1-2 weeks | ⭐⭐⭐⭐ (I/O architecture) | ⭐⭐⭐⭐ | FreeBSD, Linux, macOS
Container/Jail/Zone | Advanced | 3-4 weeks | ⭐⭐⭐⭐⭐ (isolation architecture) | ⭐⭐⭐⭐⭐ | Linux, FreeBSD, illumos
Packet Filter Tool | Intermediate | 2 weeks | ⭐⭐⭐⭐ (networking) | ⭐⭐⭐ | OpenBSD, FreeBSD, Linux
DTrace/eBPF Tracer | Advanced | 2-3 weeks | ⭐⭐⭐⭐⭐ (kernel internals) | ⭐⭐⭐⭐ | FreeBSD, illumos, Linux

Recommendation

Given that you want to deeply understand the differences, I recommend starting with Project 3: Build Your Own Container/Jail/Zone.

Why this project first:

  1. Maximum contrast: The difference between Linux’s 7 namespace types + cgroups vs FreeBSD’s single jail() syscall is the clearest demonstration of “building blocks” vs “first-class concept” philosophy
  2. Practical relevance: Containers are everywhere; understanding them at the kernel level makes you dangerous
  3. Forces multi-OS work: You literally cannot complete it without running multiple operating systems
  4. Foundation for others: Once you understand isolation, the security sandbox project (Project 1) becomes much clearer

Setup recommendation:

  • Use VirtualBox/VMware with FreeBSD 14, OpenBSD 7.5, and Linux (any distro)
  • Or use cloud VMs (several providers offer stock FreeBSD images; OpenBSD support varies, so you may need to install from a custom ISO)
  • illumos: Use OmniOS or SmartOS in VM

Final Comprehensive Project: Cross-Platform Unix Compatibility Layer

What you’ll build: A userspace compatibility library that allows programs written for one Unix to run on another—implementing syscall translation, filesystem abstraction, and API shimming. Think: a minimal “Wine for BSD” or “BSD personality for Linux.”

Why it teaches everything: This project forces you to confront EVERY difference between Unix systems:

  • Different syscall numbers and semantics
  • Different ioctl interfaces
  • Different signal behaviors
  • Different filesystem layouts and conventions
  • Different library ABIs

What you’ll build specifically:

  • A preloadable shared library (LD_PRELOAD) that intercepts the libc wrappers around syscalls
  • Translation layer for key differences (e.g., translate kqueue calls to epoll on Linux)
  • ABI compatibility for basic programs (get ls from FreeBSD running on Linux, or vice versa)

Core challenges you’ll face:

  • Challenge 1: Syscall number mapping (same name, different numbers across OSes; see the sketch after this list)
  • Challenge 2: Struct layout differences (even struct stat differs)
  • Challenge 3: Signal semantics variations
  • Challenge 4: Implementing kqueue in terms of epoll (or vice versa)
  • Challenge 5: Path translation (/usr/local conventions, /proc vs /compat/linux/proc)
  • Challenge 6: Dynamic linker differences
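
To make Challenge 1 tangible before you touch C, here is a toy sketch in Python (the mapping table covers only a handful of calls, the helper name is hypothetical, and a real shim would live in a preloaded C library, not a script):

import ctypes

libc = ctypes.CDLL(None, use_errno=True)

# FreeBSD keeps the traditional Unix numbering; Linux x86-64 renumbered everything.
FREEBSD_TO_LINUX_X86_64 = {
    3: 0,    # read
    4: 1,    # write
    5: 2,    # open
    6: 3,    # close
    20: 39,  # getpid
}

def emulate_freebsd_syscall(freebsd_nr, *args):
    """Map a FreeBSD syscall number to its Linux equivalent and invoke it."""
    return libc.syscall(FREEBSD_TO_LINUX_X86_64[freebsd_nr], *args)

print(emulate_freebsd_syscall(20))   # getpid by its FreeBSD number, executed on Linux

The real work is everything this toy skips: struct layouts, errno conventions, and syscalls that have no one-to-one counterpart on the other side.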

Key Concepts:

  • Syscall interfaces: “The Linux Programming Interface” Ch. 3 - Kerrisk + BSD man pages comparison
  • ABI compatibility: “Computer Systems: A Programmer’s Perspective” Ch. 7 - Bryant & O’Hallaron
  • Dynamic linking: “Advanced Programming in the UNIX Environment” Ch. 17 - Stevens & Rago
  • FreeBSD Linux emulation: FreeBSD Handbook - Linux Binary Compatibility
  • illumos LX zones: illumos LX branded zones - how they run Linux binaries

Difficulty: Expert. Time estimate: 2-3 months. Prerequisites: complete at least 2-3 of the projects above; strong C; understanding of the ELF format.

Real world outcome:

  • Run a simple FreeBSD binary on Linux (or vice versa): ./mycompat /path/to/freebsd/ls -la
  • See output showing which syscalls were translated
  • Demonstrate: “This program uses kqueue, but we’re translating it to epoll on Linux”

Learning milestones:

  1. Build syscall interception framework → understand how syscalls work at machine level
  2. Implement basic syscall translation (open, read, write, close) → understand “same but different”
  3. Implement struct translation layer → understand ABI differences
  4. Port kqueue→epoll (or reverse) → deep understanding of both
  5. Get a real program running → validate your understanding is complete

Summary

This learning path covers BSD vs Linux & Unix Variants through 5 hands-on projects plus a comprehensive final project. Here’s the complete list:

# | Project Name | Main Language | Difficulty | Time Estimate | OSes Covered
1 | Cross-Platform Sandboxed Service | C | Advanced | 2-3 weeks | OpenBSD, FreeBSD, Linux
2 | Event-Driven TCP Echo Server | C | Intermediate | 1-2 weeks | FreeBSD, Linux, macOS
3 | Build Your Own Container/Jail/Zone | C | Advanced | 3-4 weeks | Linux, FreeBSD, illumos
4 | Packet Filter Firewall Configuration Tool | C | Intermediate | 2 weeks | OpenBSD, FreeBSD, Linux
5 | DTrace/eBPF System Tracer | C / D / Python | Advanced | 2-3 weeks | FreeBSD, illumos, Linux
Final | Cross-Platform Unix Compatibility Layer | C | Expert | 2-3 months | All Unix variants

For beginners (new to systems programming):

  • Start with Project 2 (Event Server) - simplest concepts, clear outcome
  • Then Project 4 (Firewall Tool) - builds on networking knowledge
  • Then Project 1 (Sandboxed Service) - introduces security concepts
  • Save Projects 3 and 5 for when you have solid foundation

For Linux developers wanting to understand BSD:

  • Start with Project 2 (kqueue vs epoll is clearest contrast)
  • Then Project 4 (pf syntax will impress you)
  • Then Project 1 (appreciate pledge’s simplicity)
  • Then Project 3 (jail is just ONE syscall!)
  • Finally Project 5 (port eBPF knowledge to DTrace)

For BSD users curious about Linux:

  • Start with Project 2 (see epoll’s socket focus)
  • Then Project 3 (7 namespace types will shock you)
  • Then Project 1 (seccomp-bpf complexity)
  • Then Project 5 (eBPF’s power and complexity)
  • Finally Project 4 (nftables vs pf)

For container/Docker deep understanding:

  • Jump directly to Project 3 (Container/Jail/Zone)
  • Then Project 1 (security primitives)
  • Then Project 2 (networking fundamentals)
  • Then Project 4 (network isolation)
  • Finally Project 5 (observe containers in production)

Expected Outcomes

After completing these projects, you will:

Understand design philosophies deeply

  • Linux’s “bazaar” approach (components from everywhere)
  • BSD’s “cathedral” approach (integrated, unified system)
  • How these philosophies affect every design decision

Master security model differences

  • OpenBSD’s pledge/unveil (promise-based, simple)
  • FreeBSD’s Capsicum (capability-based, granular)
  • Linux’s seccomp-bpf (filter-based, flexible but complex)

Explain container technologies from first principles

  • Why Linux containers need 7+ namespace types + cgroups
  • Why FreeBSD jails are a single syscall
  • Why Solaris Zones predate Docker by nearly a decade
  • How these architectural choices reflect OS philosophies

Architect network systems intelligently

  • Choose between kqueue and epoll based on requirements
  • Write high-performance event-driven servers
  • Understand the C10K problem and modern solutions

Design secure, hardened systems

  • Apply principle of least privilege correctly
  • Choose appropriate security mechanisms for each OS
  • Audit and validate security configurations

Instrument production systems safely

  • Use DTrace to observe illumos/FreeBSD systems
  • Use eBPF to trace Linux kernel behavior
  • Find performance bottlenecks in running applications
  • Minimize probe overhead

Write truly portable code

  • Handle syscall differences across Unix variants
  • Abstract OS-specific features cleanly
  • Use conditional compilation effectively
  • Test on multiple platforms

Contribute to open source OS projects

  • Read and understand kernel code
  • Follow OS-specific contribution guidelines
  • Submit patches with confidence

Real-World Applications

These skills directly apply to:

  • Infrastructure Engineering: Netflix uses FreeBSD for CDN (800 Gb/s throughput)
  • Cloud Computing: Understanding container internals for Kubernetes/Docker
  • Security Engineering: Implementing sandboxing and privilege separation
  • Performance Engineering: Using DTrace/eBPF for production debugging
  • Systems Programming: Writing portable, high-performance servers
  • DevOps: Managing firewalls, containers, and observability across platforms

Key Insights You’ll Internalize

  1. “Simple” and “powerful” aren’t opposites - OpenBSD’s pledge proves this
  2. Abstraction has costs - Linux’s flexibility comes with complexity
  3. Design philosophy echoes everywhere - From syscalls to init systems
  4. Security is a design choice - Not a feature you add later
  5. Performance requires understanding - Can’t optimize what you don’t measure
  6. Portability requires discipline - Test on multiple platforms early

Total Time Investment

  • Part-time (10-15 hrs/week): 10-14 weeks (3-4 months)
  • Full-time (40 hrs/week): 6-8 weeks (1.5-2 months)
  • Final project adds: 2-3 months additional

What Comes Next

After mastering these projects, you’ll be ready to:

  1. Contribute to OS development - FreeBSD, OpenBSD, or Linux kernel
  2. Build system-level tools - Like Docker, systemd, or nginx
  3. Architect infrastructure - With deep understanding of trade-offs
  4. Teach others - Write blog posts, give talks, mentor
  5. Tackle advanced topics - Kernel development, driver writing, OS design

Sources