BSD LINUX UNIX VARIANTS LEARNING PROJECTS
Understanding BSD vs Linux & Unix Variants: A Deep Dive Through Building
Goal: Master the fundamental architectural differences between Unix variants - BSD (FreeBSD, OpenBSD), Linux, and illumos - by building real systems that expose their distinct design philosophies. You'll understand why these systems differ, not just how, enabling you to choose the right tool for each job and write truly portable systems code.
Why This Knowledge Matters
In 1969 at Bell Labs, Ken Thompson and Dennis Ritchie created Unix. From that single ancestor, an entire family of operating systems evolved - each branch making different design choices that echo through every system call you make today.
The professional reality:
- Netflix runs on FreeBSD for its content delivery (their CDN serves ~40% of North American internet traffic)
- OpenBSD pioneered security features now standard everywhere (ASLR, W^X, pledge/unveil)
- Linux dominates servers, cloud, and containers (96%+ of top 1M web servers)
- illumos (Solaris heritage) gave us DTrace and ZFS - technologies now ported everywhere
Understanding these systems isn't academic - it's understanding why your containers work the way they do, why some firewalls are easier to configure than others, and why certain security models exist.
The core question: Why did these systems evolve differently from the same ancestor?
Original Unix (1969) - Bell Labs (Thompson/Ritchie)
├── BSD (1977) - UC Berkeley, "Academic/Research"
│   ├── FreeBSD (1993) - focus: performance, features
│   ├── OpenBSD (1995) - focus: security, correctness
│   ├── NetBSD (1993) - focus: portability
│   └── Darwin/macOS - Mach kernel + BSD userland
└── System V (AT&T) - "Commercial Unix"
    └── Solaris (1992) - DTrace, ZFS
        └── illumos (2010) - OpenSolaris fork

Linux (1991) - NOT a Unix descendant, but Unix-LIKE:
a reimplementation of Unix ideas, Linux kernel + GNU userland
The key insight that will click once you build these projects:
- Linux = A kernel with userland assembled from various sources (Lego blocks)
- BSD = Complete, integrated operating systems (Finished product)
- illumos = Enterprise Unix with native observability (DTrace) and storage (ZFS)
This fundamental difference shapes EVERYTHING: security models, container implementations, networking APIs, and more.
The Design Philosophy Deep Dive
Linux: "The Bazaar"
Linux follows the "bazaar" model from Eric Raymond's The Cathedral and the Bazaar - many independent contributors, rapid iteration, features from everywhere. The kernel is separate from userland (GNU tools, systemd, etc.).
Linux System (assembled from separate projects)

Userland (different projects, different maintainers):
- systemd (Lennart Poettering et al.)
- GNU coreutils (FSF)
- glibc (FSF)
- bash (FSF)

Kernel: Linux kernel (Torvalds et al.)
- monolithic kernel with loadable modules
- the syscall interface is the stable API boundary between kernel and userland
Implications:
- Security features come from many sources (seccomp, SELinux, AppArmor, namespaces)
- Containers are "assembled" from primitives (namespaces + cgroups + seccomp + ...)
- Updates can be partial (update kernel, keep userland or vice versa)
BSD: "The Cathedral"
BSD maintains the entire operating system as one project. Kernel, libc, core utilities, documentation - all versioned together.
FreeBSD/OpenBSD System: single source tree, single project

/usr/src
├── sys/      (kernel source)
├── lib/      (libc, libm, etc.)
├── bin/      (core utilities: ls, cat, etc.)
├── sbin/     (system utilities: mount, ifconfig)
├── usr.bin/  (user utilities: grep, awk, etc.)
└── share/    (docs, man pages)

ALL maintained by the SAME project, versioned TOGETHER.
Result: tight integration, consistent coding style, unified docs.
Implications:
- Security features are built-in (jails, Capsicum on FreeBSD; pledge/unveil on OpenBSD)
- Containers are "first-class" (jail(2) is a single system call)
- Updates are atomic (upgrade entire base system together)
Security Model Comparison: A Critical Difference
The security philosophy differences are profound:
OpenBSD: Promise-Based Security (pledge/unveil)
// OpenBSD: reveal what you WILL see, then promise what you WILL do
#include <unistd.h>
#include <err.h>

int main(void) {
    // Only reveal these filesystem paths
    if (unveil("/var/log", "rw") == -1)
        err(1, "unveil");
    if (unveil(NULL, NULL) == -1)   // lock it down: no further unveil calls
        err(1, "unveil");

    // After this, only these capabilities remain
    if (pledge("stdio rpath wpath", NULL) == -1)
        err(1, "pledge");

    // Now the program runs with minimal privileges.
    // Any violation = immediate SIGABRT (uncatchable).
    return 0;
}
Philosophy: "Surrender capabilities at runtime. Promise what you'll do, reveal what you'll see." Simple, auditable, comprehensible by mortals.
FreeBSD: Capability-Based Security (Capsicum)
// FreeBSD: limit capabilities on file descriptors
#include <sys/capsicum.h>
#include <fcntl.h>

int main(void) {
    int fd = open("/etc/passwd", O_RDONLY);

    cap_rights_t rights;
    cap_rights_init(&rights, CAP_READ, CAP_SEEK);
    cap_rights_limit(fd, &rights);   // this fd can now ONLY read/seek

    cap_enter();   // enter capability mode: no more global namespace access

    // fd is now the ONLY way to access that file.
    // Cannot open new files, cannot access the network.
    return 0;
}
Philosophy: "Capabilities are tokens attached to file descriptors." Fine-grained control, but more complex.
Linux: Filter-Based Security (seccomp-bpf)
// Linux: write a BPF program that filters syscalls
#include <stddef.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
struct sock_filter filter[] = {
    // load the syscall number (a full filter also checks seccomp_data.arch)
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_write, 0, 1),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL),
};
struct sock_fprog prog = { .len = sizeof(filter) / sizeof(filter[0]), .filter = filter };
prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);   // required before installing a filter as non-root
prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
Philosophy: "Maximum flexibility through programmability." You write a BPF program that filters syscalls. Powerful but complex - easy to make mistakes.
Security Model Comparison

| | OpenBSD pledge/unveil | FreeBSD Capsicum | Linux seccomp-bpf |
|---|---|---|---|
| Mental model | "I promise to only..." | "This fd can only..." | "If the syscall matches the filter..." |
| Expressed as | Simple strings ("stdio rpath") | Capability rights (CAP_READ, CAP_SEEK) | BPF bytecode program |
| Audit effort | Easy to audit | Medium complexity | Hard to get right |
| Typical code | ~10 lines | ~30 lines | ~100+ lines |
Isolation Architecture: Containers vs Jails vs Zones
This is where the "first-class concept" vs "building blocks" difference becomes crystal clear:
Linux: Assemble from Primitives
Linux "Container" (NOT a kernel concept - assembled from parts)

You must combine:
- PID namespace       clone(CLONE_NEWPID)
- mount namespace     clone(CLONE_NEWNS)
- network namespace   clone(CLONE_NEWNET)
- UTS namespace       (hostname)
- IPC namespace       (semaphores, message queues)
- user namespace      (uid/gid mapping)
- cgroups             (resource limits)
- seccomp             (syscall filter)
- plus AppArmor/SELinux and more

Result: ~500+ lines of C code to create a container
FreeBSD: First-Class Jail
FreeBSD Jail (first-class kernel concept)

Single system call: jail(2)

struct jail j = {
    .version  = JAIL_API_VERSION,
    .path     = "/jails/myjail",
    .hostname = "myjail",
    .jailname = "myjail",
    .ip4s     = 1,
    .ip4      = &jail_ip,
};
jail(&j);   // That's it. You're in a jail.

Plus:
- VNET for network virtualization
- rctl for resource limits
- ZFS clones for instant filesystem snapshots

Result: ~100 lines of C code for equivalent isolation
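To make that snippet concrete, here is a minimal sketch of the same jail(2) call as a complete program. It assumes a populated jail root already exists at /jails/myjail and that the program runs as root; the IP address and paths are illustrative.

// Minimal jail(2) sketch (assumes a populated root at /jails/myjail, run as root)
#include <sys/param.h>
#include <sys/jail.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <unistd.h>
#include <err.h>

int main(void) {
    struct in_addr jail_ip;
    inet_aton("10.0.0.2", &jail_ip);          // address the jail may use

    struct jail j = {
        .version  = JAIL_API_VERSION,
        .path     = "/jails/myjail",           // new filesystem root
        .hostname = "myjail",
        .jailname = "myjail",
        .ip4s     = 1,                         // number of IPv4 addresses
        .ip4      = &jail_ip,
    };

    if (jail(&j) == -1)                        // create and attach in one call
        err(1, "jail");

    // From here on, the process is confined to the jail.
    execl("/bin/sh", "sh", (char *)NULL);
    err(1, "execl");
}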
illumos: Zones with SMF Integration
illumos Zone (enterprise-grade isolation)

zone_create() / zonecfg + zoneadm
- Full process isolation
- Delegated ZFS datasets
- Resource pools
- Network virtualization (Crossbow)
- SMF (Service Management Facility) integration
- DTrace visibility across zones
- LX branded zones (run Linux binaries!)

Zones (2004) predate Docker (2013) by nearly a decade.
Event-Driven I/O: kqueue vs epoll
Both solve the C10K problem (handling 10,000+ concurrent connections), but with different elegance:
kqueue (BSD)

// One call to register AND wait
struct kevent changes[100];   // what we want to monitor
struct kevent events[100];    // what happened

// Register 100 file descriptors in ONE system call
kevent(kq, changes, 100, events, 100, NULL);

Benefits:
- Batch updates (register many fds in one syscall)
- Generic (handles files, sockets, signals, processes, timers)
- Cleaner API design

Filter types: EVFILT_READ, EVFILT_WRITE, EVFILT_VNODE, EVFILT_PROC, EVFILT_SIGNAL, EVFILT_TIMER
epoll (Linux)

// Separate calls for each modification
for (int i = 0; i < 100; i++) {
    epoll_ctl(epfd, EPOLL_CTL_ADD, fds[i], &event);   // 100 calls!
}
epoll_wait(epfd, events, 100, -1);

Limitations:
- One syscall per modification
- Socket-focused (need eventfd/signalfd/timerfd for other event types)
- More system calls under high churn

But: still O(1) and very fast in practice.
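To make the "generic vs socket-focused" point concrete, here is an illustrative sketch (not from any particular codebase) of registering a 1-second periodic timer as an event source in each API: on the BSDs the timer is just another kqueue filter, while on Linux it has to be wrapped in a timerfd so epoll can treat it like a socket.

#include <time.h>
#ifdef __linux__
#include <sys/timerfd.h>
#include <sys/epoll.h>
#else
#include <sys/types.h>
#include <sys/event.h>
#include <sys/time.h>
#endif

int main(void) {
#ifdef __linux__
    // Linux: wrap the timer in a file descriptor so epoll can watch it
    int epfd = epoll_create1(0);
    int tfd = timerfd_create(CLOCK_MONOTONIC, 0);
    struct itimerspec its = { .it_interval = {1, 0}, .it_value = {1, 0} };  // fire every second
    timerfd_settime(tfd, 0, &its, NULL);
    struct epoll_event ev;
    ev.events = EPOLLIN;
    ev.data.fd = tfd;
    epoll_ctl(epfd, EPOLL_CTL_ADD, tfd, &ev);
#else
    // BSD: a timer is just another kqueue filter, no extra fd type needed
    int kq = kqueue();
    struct kevent ev;
    EV_SET(&ev, 1, EVFILT_TIMER, EV_ADD | EV_ENABLE, 0, 1000 /* ms */, NULL);
    kevent(kq, &ev, 1, NULL, 0, NULL);
#endif
    return 0;   // a real program would now loop on epoll_wait()/kevent()
}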
Core Concept Analysis
To truly understand BSD vs Linux (and other Unix-likes), you need to grasp these fundamental architectural differences:
| Concept Area | Linux | FreeBSD | OpenBSD | illumos |
|---|---|---|---|---|
| Design Philosophy | Modular kernel + GNU userland (pieces from everywhere) | Integrated "complete OS" (kernel + userland as one) | Security-first, minimal attack surface | Enterprise features (DTrace, ZFS native) |
| Isolation | Namespaces + cgroups (building blocks) | Jails (first-class kernel concept) | chroot + pledge/unveil | Zones (first-class containers) |
| Event I/O | epoll | kqueue | kqueue | Event ports |
| Packet Filter | nftables/iptables | pf (ported from OpenBSD) | pf (native) | IPFilter |
| Security Model | seccomp-bpf, SELinux, AppArmor | Capsicum, MAC Framework | pledge(2), unveil(2) | Privileges, zones |
| Tracing | eBPF, perf | DTrace (ported) | ktrace | DTrace (native) |
| Init System | systemd (mostly) | rc scripts | rc scripts | SMF |
Key insight: Linux is a kernel with userland assembled from various sources. BSDs are complete, integrated operating systems. This fundamental difference shapes everything else.
Concept Summary Table
| Concept Cluster | What You Need to Internalize |
|---|---|
| Design Philosophy | Linux = bazaar (components from everywhere); BSD = cathedral (integrated system). This shapes everything. |
| Security Models | OpenBSD pledge/unveil = promise-based; FreeBSD Capsicum = capability-based; Linux seccomp = filter-based. Trade-offs between simplicity and flexibility. |
| Isolation Architecture | Jails/Zones are first-class kernel concepts; Linux containers are assembled from namespaces+cgroups. Complexity vs elegance. |
| Event I/O | kqueue is more elegant (batch ops, generic); epoll is socket-focused. Both solve C10K. |
| The Unix Heritage | BSD descends from original Unix; Linux is a reimplementation. This explains API differences. |
| Observability | DTrace (Solaris/illumos native, ported to BSD) vs eBPF (Linux). Both let you instrument running kernels. |
| Networking | BSD's TCP/IP stack is the reference implementation. pf originated on OpenBSD. |
Deep Dive Reading by Concept
Unix History & Design Philosophy
| Concept | Book & Chapter |
|---|---|
| Unix origins and philosophy | The UNIX Programming Environment by Kernighan & Pike - Ch. 1: "UNIX for Beginners" |
| BSD history and development | The Design and Implementation of the FreeBSD Operating System by McKusick et al. - Ch. 1 |
| Linux kernel architecture | Understanding the Linux Kernel, 3rd Edition by Bovet & Cesati - Ch. 1-2 |
| System calls deep dive | Advanced Programming in the UNIX Environment, 3rd Edition by Stevens & Rago - Ch. 1-3 |
Security Models
| Concept | Book & Chapter |
|---|---|
| OpenBSD security philosophy | Absolute OpenBSD by Michael W. Lucas - Ch. 1 & security chapters |
| FreeBSD Capsicum | Absolute FreeBSD, 3rd Edition by Michael W. Lucas - Ch. 8 |
| Linux security mechanisms | The Linux Programming Interface by Michael Kerrisk - Ch. 23 (Timers & Seccomp) |
| General Unix security | Mastering FreeBSD and OpenBSD Security by Hope, Potter & Korff - Full book |
Isolation & Containers
| Concept | Book & Chapter |
|---|---|
| Linux namespaces | The Linux Programming Interface by Michael Kerrisk - Ch. 28-29 (process creation) + online resources |
| FreeBSD jails | Absolute FreeBSD, 3rd Edition by Michael W. Lucas - Ch. 12: "Jails" |
| Linux cgroups | How Linux Works, 3rd Edition by Brian Ward - Ch. 8 |
| General process isolation | Operating Systems: Three Easy Pieces by Arpaci-Dusseau - Part II: "Virtualization" |
Networking & I/O
| Concept | Book & Chapter |
|---|---|
| Event-driven I/O | The Linux Programming Interface by Michael Kerrisk - Ch. 63: "Alternative I/O Models" |
| TCP/IP fundamentals | TCP/IP Illustrated, Volume 1 by W. Richard Stevens - Full book (BSD reference implementation) |
| Socket programming | UNIX Network Programming, Volume 1 by Stevens, Fenner & Rudoff - Ch. 1-6 |
| FreeBSD networking | The Design and Implementation of the FreeBSD Operating System by McKusick et al. - Ch. 12 |
System Tracing & Debugging
| Concept | Book & Chapter |
|---|---|
| DTrace fundamentals | DTrace: Dynamic Tracing in Oracle Solaris, Mac OS X, and FreeBSD by Brendan Gregg - Full book |
| eBPF/BPF on Linux | BPF Performance Tools by Brendan Gregg - Ch. 1-5 |
| General debugging | The Art of Debugging with GDB, DDD, and Eclipse by Matloff & Salzman - Ch. 1-3 |
The Unix Family Tree (Context)
Understanding the genealogy helps:
Original Unix (Bell Labs, 1970s)
├── BSD (Berkeley, 1977)
│   ├── FreeBSD (1993) - focus: performance, features, ZFS
│   ├── OpenBSD (1995) - focus: security, correctness, simplicity
│   ├── NetBSD (1993) - focus: portability
│   └── Darwin/macOS (2000) - Mach microkernel + BSD userland
│
├── System V (AT&T)
│   └── Solaris (Sun, 1992)
│       └── illumos (2010) - OpenSolaris fork, DTrace/ZFS native
│
└── Linux (1991) - NOT Unix lineage, but Unix-like
    └── GNU userland + Linux kernel
Linux is the "odd one out" - it's a reimplementation of Unix ideas, not a descendant. This explains why it often does things differently.
Project 1: Cross-Platform Sandboxed Service
- File: BSD_LINUX_UNIX_VARIANTS_LEARNING_PROJECTS.md
- Programming Language: C
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 3. The "Service & Support" Model
- Difficulty: Level 3: Advanced
- Knowledge Area: OS Security / Systems Programming
- Software or Tool: pledge / unveil / seccomp / capsicum
- Main Book: "Advanced Programming in the UNIX Environment" by Stevens & Rago
What you'll build: A file-watching daemon that monitors directories for changes and logs events - implemented with native sandboxing on each OS (pledge/unveil on OpenBSD, Capsicum on FreeBSD, seccomp on Linux).
Why it teaches Unix differences: You can't abstract away the security models - you must understand each one's philosophy. OpenBSD's "promise what you'll do, reveal what you'll see" model (pledge/unveil) is fundamentally different from Linux's "filter syscalls at the BPF level" (seccomp) or FreeBSD's capability-based approach (Capsicum).
Core challenges you'll face:
- Challenge 1: Understanding pledge promises ("stdio rpath wpath") vs seccomp BPF filters (maps to security model philosophy)
- Challenge 2: Using unveil() vs Capsicum cap_rights_limit() for filesystem restriction (maps to capability models)
- Challenge 3: Building without libc abstractions that hide OS differences (maps to syscall interface understanding)
- Challenge 4: Handling graceful degradation when security features aren't available
Key Concepts:
- System calls & POSIX: "Advanced Programming in the UNIX Environment, 3rd Edition" Ch. 1-3 - Stevens & Rago
- OpenBSD pledge/unveil: pledge(2) man page and unveil(2) man page
- FreeBSD Capsicum: "Absolute FreeBSD, 3rd Edition" Ch. 8 - Michael W. Lucas
- Linux seccomp: "The Linux Programming Interface" Ch. 23 - Michael Kerrisk
- Security philosophy comparison: Jessie Frazelle's "Containers vs. Zones vs. Jails vs. VMs"
Difficulty: Intermediate. Time estimate: 2-3 weeks. Prerequisites: C programming, basic Unix syscalls (open, read, write), comfort with man pages.
Real world outcome:
- A daemon that prints "CREATED: /path/to/file" or "MODIFIED: /path/to/file" to stdout/syslog
- Running with minimal privileges on each OS - demonstrable by attempting forbidden operations and seeing them blocked
- A single codebase with #ifdef __OpenBSD__, #ifdef __FreeBSD__, and #ifdef __linux__ blocks showing the architectural differences
Learning milestones:
- Get basic file watching working on one OS → understand inotify (Linux) vs kqueue EVFILT_VNODE (BSD)
- Add sandboxing on OpenBSD with pledge/unveil → understand "promise-based" security
- Port sandboxing to FreeBSD Capsicum → understand capability-based security
- Port to Linux seccomp-bpf → understand filter-based security and why it's "harder"
- Compare: which was easiest? Which is most secure? Why?
Real World Outcome
When you complete this project, you'll have a security-hardened file-watching daemon that demonstrates the fundamental differences between Unix security models.
What you'll see running on OpenBSD:
$ ./filewatcher /var/log
[filewatcher] Starting with pledge("stdio rpath wpath cpath") and unveil("/var/log", "rw")
[filewatcher] Security sandbox active. Attempting forbidden operation...
[filewatcher] BLOCKED: Cannot access /etc/passwd (unveil restriction)
[filewatcher] Monitoring /var/log for changes...
[2024-12-22 14:32:01] CREATED: /var/log/messages.1
[2024-12-22 14:32:05] MODIFIED: /var/log/auth.log
[2024-12-22 14:32:10] DELETED: /var/log/old.log
# If you try to violate pledge:
$ ./filewatcher_bad /var/log
[filewatcher] Starting...
[filewatcher] Attempting network connection (not pledged)...
Abort trap (core dumped) # SIGABRT - pledge violation!
What you'll see running on FreeBSD with Capsicum:
$ ./filewatcher /var/log
[filewatcher] Entering capability mode...
[filewatcher] File descriptor rights limited: CAP_READ, CAP_EVENT
[filewatcher] Capability mode active. Global namespace access revoked.
[filewatcher] Monitoring /var/log for changes...
[2024-12-22 14:32:01] CREATED: /var/log/messages.1
# Attempting to open new file after cap_enter():
[filewatcher] ERROR: open("/etc/passwd") failed: Not permitted in capability mode
What you'll see running on Linux with seccomp:
$ ./filewatcher /var/log
[filewatcher] Installing seccomp-bpf filter...
[filewatcher] Allowed syscalls: read, write, inotify_add_watch, inotify_rm_watch, exit_group
[filewatcher] Filter installed. Monitoring...
[2024-12-22 14:32:01] CREATED: /var/log/messages.1
# Attempting forbidden syscall:
$ ./filewatcher_bad /var/log
[filewatcher] Attempting socket() syscall (not allowed)...
Bad system call (core dumped) # SIGSYS - seccomp violation!
Your codebase will look like:
// Conditional compilation showing the architectural differences
#ifdef __OpenBSD__
// ~15 lines: unveil() + pledge()
unveil(watch_path, "rw");
unveil(NULL, NULL);
pledge("stdio rpath wpath", NULL);
#elif defined(__FreeBSD__)
// ~30 lines: Capsicum capability mode
cap_rights_t rights;
cap_rights_init(&rights, CAP_READ, CAP_EVENT, CAP_FCNTL);
cap_rights_limit(dir_fd, &rights);
cap_enter();
#elif defined(__linux__)
// ~100+ lines: seccomp-bpf filter program
struct sock_filter filter[] = { /* BPF program */ };
// ... complex filter setup
#endif
The Core Question You're Answering
"Why do different Unix systems take such radically different approaches to application sandboxing, and what are the real-world trade-offs?"
This project forces you to confront a fundamental truth: security is a design philosophy, not just a feature list. OpenBSD's pledge/unveil says "tell us what you need, we'll kill you if you lie." FreeBSD's Capsicum says "capabilities are tokens on file descriptors." Linux's seccomp says "here's a programmable filter - go wild."
By implementing the same functionality on all three, you'll viscerally understand why OpenBSD can sandbox their entire base system while Linux applications rarely use seccomp directly.
Concepts You Must Understand First
Stop and research these before coding:
- System Calls as the Security Boundary
- What is a system call? How does it differ from a library function?
- Why is the syscall interface the natural place to enforce security?
- How does the kernel know which process is making the call?
- Book Reference: Advanced Programming in the UNIX Environment, 3rd Edition by Stevens & Rago - Ch. 1-3
- The Principle of Least Privilege
- What does it mean for a program to have "minimal privileges"?
- Why should a file watcher not have network access?
- What's the difference between DAC (discretionary) and MAC (mandatory) access control?
- Book Reference: Mastering FreeBSD and OpenBSD Security by Hope, Potter & Korff - Ch. 1-2
- OpenBSD pledge/unveil Model
- What are "promises" in pledge? (stdio, rpath, wpath, cpath, inet, dns, etc.)
- How does unveil() complement pledge()?
- Why is violation = SIGABRT with no recovery?
- Reference: pledge(2) man page and Bob Beck's BSDCan 2018 talk
- FreeBSD Capsicum Model
- What is "capability mode" and why can't you leave it?
- How do cap_rights_t work? What's CAP_READ vs CAP_WRITE?
- What's the difference between cap_rights_limit() and cap_enter()?
- Book Reference: Absolute FreeBSD, 3rd Edition by Michael W. Lucas - Ch. 8
- Linux seccomp-bpf Model
- What is BPF (Berkeley Packet Filter)? Why is it used for syscall filtering?
- How do you write a BPF filter program?
- What's the difference between SECCOMP_RET_KILL, SECCOMP_RET_ERRNO, SECCOMP_RET_ALLOW?
- Book Reference: The Linux Programming Interface by Michael Kerrisk - Ch. 23
- File System Event Notification
- Linux: How does inotify work? What events can you watch?
- BSD: How does kqueue EVFILT_VNODE work? What's the kevent structure?
- Why are these fundamentally different APIs?
- Book Reference: The Linux Programming Interface by Michael Kerrisk - Ch. 19
Questions to Guide Your Design
Before implementing, think through these:
- What exactly needs sandboxing?
- What system calls does a file watcher need? (open, read, stat, inotify_add_watch/kevent, write to log)
- What system calls should be BLOCKED? (socket, execve, fork, ptrace, mount)
- How do you enumerate the minimal set?
- How do you test the sandbox?
- How can you verify that forbidden operations are actually blocked?
- What happens when a sandboxed program tries a forbidden syscall?
- How do you distinguish "sandbox blocked it" from "other error"?
- How do you handle initialization vs runtime?
- Most programs need more privileges during startup (opening config files, binding ports)
- How do pledge/Capsicum/seccomp handle the โinitialize, then restrictโ pattern?
- When exactly should you "lock down"?
- What about error handling?
- If pledge() fails, should you continue without sandboxing or exit?
- How do you write code that gracefully degrades on systems without these features?
- How do you log sandbox violations for debugging?
- Cross-platform abstraction?
- Should you create a common API that hides the OS differences?
- Or should you embrace the differences with #ifdef?
- What are the trade-offs of each approach? (See the sketch below.)
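One possible answer, sketched below with a hypothetical sandbox_init() helper (the name, return convention, and error policy are illustrative, not from any existing library): keep the per-OS code behind one small function, abort when a sandbox mechanism exists but fails, and fall back to running unsandboxed, with a warning, only on platforms that have no mechanism at all.

#include <stdio.h>
#if defined(__OpenBSD__)
#include <unistd.h>
#include <err.h>
#elif defined(__FreeBSD__)
#include <sys/capsicum.h>
#include <err.h>
#elif defined(__linux__)
#include <seccomp.h>   /* libseccomp; link with -lseccomp */
#endif

// Hypothetical helper: returns 0 if a sandbox is active, -1 if we degraded to no sandbox.
static int sandbox_init(const char *watch_path) {
#if defined(__OpenBSD__)
    if (unveil(watch_path, "rw") == -1 || unveil(NULL, NULL) == -1)
        err(1, "unveil");
    if (pledge("stdio rpath wpath", NULL) == -1)
        err(1, "pledge");
    return 0;
#elif defined(__FreeBSD__)
    (void)watch_path;                  // fds must be opened and rights-limited by the caller
    if (cap_enter() == -1)
        err(1, "cap_enter");
    return 0;
#elif defined(__linux__)
    (void)watch_path;
    scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_KILL);
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(read), 0);
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(write), 0);
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(exit_group), 0);
    // ...the real allow-list is much longer
    return seccomp_load(ctx) == 0 ? 0 : -1;
#else
    (void)watch_path;
    fprintf(stderr, "warning: no sandbox available on this platform\n");
    return -1;                         // graceful degradation: run unsandboxed
#endif
}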
Thinking Exercise
Before coding, trace this scenario by hand:
Your file watcher needs to:
- Open a directory for watching
- Read file system events
- Write events to a log file
- Optionally: send alerts over the network (for a "premium" version)
Map to each security model:
OpenBSD pledge/unveil

Step 1: What promises do we need?
    stdio  (read/write to already-open fds)
    rpath  (read files)
    wpath  (write files)
    cpath  (create files - for log rotation?)
    inet   (ONLY if network alerts are enabled)

Step 2: What paths do we reveal?
    unveil("/var/log", "rw")              - watch and log here
    unveil("/etc/filewatcher.conf", "r")  - config file
    unveil(NULL, NULL)                    - lock it down

Step 3: What happens if we try socket() without the "inet" promise?
    -> Process receives SIGABRT, core dump created
FreeBSD Capsicum

Step 1: Open all needed file descriptors BEFORE cap_enter()
    int dir_fd = open("/var/log", O_RDONLY | O_DIRECTORY);
    int log_fd = open("/var/log/filewatcher.log", O_WRONLY);
    int kq = kqueue();

Step 2: Limit rights on each fd
    cap_rights_t rights;
    cap_rights_init(&rights, CAP_READ, CAP_EVENT, CAP_LOOKUP);
    cap_rights_limit(dir_fd, &rights);
    cap_rights_init(&rights, CAP_WRITE, CAP_SEEK);
    cap_rights_limit(log_fd, &rights);

Step 3: Enter capability mode
    cap_enter();   // No way back!

Step 4: What happens if we try open("/etc/passwd")?
    -> Returns -1, errno = ECAPMODE
Linux seccomp-bpf

Step 1: Enumerate all syscalls we need (this is the hard part!)
    read, write, close, fstat, mmap, mprotect,
    inotify_init1, inotify_add_watch, inotify_rm_watch,
    epoll_create1, epoll_ctl, epoll_wait,
    openat (with restrictions?), exit_group, ...

Step 2: Write the BPF filter program
    For each syscall: ALLOW if in the whitelist, KILL otherwise.
    Must handle syscall arguments for openat restrictions!

Step 3: What happens if we try socket()?
    -> Process receives SIGSYS, terminated

Challenge: How do you restrict openat() to specific paths?
    BPF can't easily inspect string arguments!
Key questions from this exercise:
- Why is OpenBSD's model so much simpler?
- Why does Capsicum require pre-opening all file descriptors?
- Why is Linux's path restriction so much harder?
The Interview Questions They'll Ask
Prepare to answer these:
- "What's the difference between pledge, Capsicum, and seccomp?"
- pledge: Promise-based, operates on "promise" categories, simple strings
- Capsicum: Capability-based, operates on file descriptors, fine-grained
- seccomp: Filter-based, operates on syscalls with BPF, most flexible but complex
- "Why did OpenBSD choose the pledge model?"
- Simplicity enables adoption (90%+ of the base system is pledged)
- Auditable by humans (you can read "stdio rpath" and understand it)
- Fail-closed philosophy (violation = death, no recovery)
- "What are the limitations of each approach?"
- pledge: Coarse-grained (can't say "only read /etc/passwd")
- Capsicum: Requires restructuring code to pre-open descriptors
- seccomp: Hard to restrict syscall arguments (e.g., which paths for open)
- "How would you sandbox a web browser?"
- Chromium uses seccomp-bpf on Linux
- Capsicum was designed with Chromium in mind (FreeBSD port exists)
- This is a great real-world comparison point
- "What's the attack surface reduction of each model?"
- pledge: Reduces syscall surface to promised categories
- Capsicum: Removes global namespace entirely after cap_enter()
- seccomp: Reduces to explicit syscall whitelist
- "Can you escape these sandboxes?"
- All have had vulnerabilities (nothing is perfect)
- Complexity = more bugs (seccomp filters have had escapes)
- OpenBSD's simplicity has security benefits
Hints in Layers
Hint 1: Start with file watching (no sandbox)
Get the core functionality working first:
// Linux inotify
int fd = inotify_init1(IN_NONBLOCK);
inotify_add_watch(fd, "/var/log", IN_CREATE | IN_MODIFY | IN_DELETE);
// Read events in a loop
// BSD kqueue
int kq = kqueue();
struct kevent ev;
EV_SET(&ev, dir_fd, EVFILT_VNODE, EV_ADD | EV_ENABLE | EV_CLEAR,
NOTE_WRITE | NOTE_DELETE | NOTE_RENAME, 0, NULL);
kevent(kq, &ev, 1, NULL, 0, NULL);
// Wait for events with kevent()
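The "// Read events in a loop" step on the Linux side is where most first attempts stumble, because inotify returns variable-length records packed into one buffer. A minimal sketch of that loop (the handle_events() name is just illustrative; buffer size and error handling kept deliberately simple):

// Sketch of the inotify read loop (Linux); assumes fd and the watch were set up as above.
#include <stdio.h>
#include <unistd.h>
#include <sys/inotify.h>

void handle_events(int fd) {
    char buf[4096] __attribute__((aligned(__alignof__(struct inotify_event))));
    for (;;) {
        ssize_t len = read(fd, buf, sizeof(buf));   // one read may return several events
        if (len <= 0)
            break;                                   // EAGAIN on a non-blocking fd, or error
        for (char *p = buf; p < buf + len; ) {
            struct inotify_event *ev = (struct inotify_event *)p;
            if (ev->mask & IN_CREATE) printf("CREATED: %s\n", ev->name);
            if (ev->mask & IN_MODIFY) printf("MODIFIED: %s\n", ev->name);
            if (ev->mask & IN_DELETE) printf("DELETED: %s\n", ev->name);
            p += sizeof(struct inotify_event) + ev->len;  // records are variable length
        }
    }
}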
Hint 2: Add OpenBSD pledge first (simplest)
#ifdef __OpenBSD__
#include <unistd.h>
#include <err.h>
// After opening the watch directory, reveal only the paths you need:
if (unveil(watch_path, "rw") == -1)
    err(1, "unveil");
if (unveil(NULL, NULL) == -1)   // no more unveil calls allowed
    err(1, "unveil");
// Then, before the main loop, drop to the minimal promise set
// (calling unveil() after this pledge would require the "unveil" promise):
if (pledge("stdio rpath wpath", NULL) == -1)
    err(1, "pledge");
Hint 3: FreeBSD Capsicum requires restructuring
#ifdef __FreeBSD__
#include <sys/capsicum.h>
#include <sys/event.h>
#include <fcntl.h>
#include <err.h>
// Open EVERYTHING you need FIRST
int dir_fd = open(watch_path, O_RDONLY | O_DIRECTORY);
int log_fd = open(log_path, O_WRONLY | O_APPEND | O_CREAT, 0644);
int kq = kqueue();
// Limit capabilities on each
cap_rights_t rights;
cap_rights_init(&rights, CAP_READ, CAP_EVENT, CAP_FCNTL);
cap_rights_limit(dir_fd, &rights);
cap_rights_init(&rights, CAP_WRITE, CAP_SEEK);
cap_rights_limit(log_fd, &rights);
// Enter capability mode - no turning back!
if (cap_enter() == -1)
    err(1, "cap_enter");
#endif
Hint 4: Linux seccomp is the most complex
#ifdef __linux__
// Use libseccomp for a sane API (link with -lseccomp); the raw interface
// would need <linux/filter.h>, <linux/seccomp.h> and hand-written BPF.
#include <seccomp.h>
scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_KILL);   // default action: kill the process
// Whitelist needed syscalls
seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(read), 0);
seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(write), 0);
seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(inotify_add_watch), 0);
// ... many more
seccomp_load(ctx);
#endif
Hint 5: Test sandbox violations
#include <stdio.h>
#include <unistd.h>
#include <sys/socket.h>

void test_sandbox(void) {
    // Try something we shouldn't be able to do.
    // Note: under pledge or a SECCOMP_RET_KILL filter the process is killed on this
    // call instead of seeing -1; adjust the test (or the filter action) accordingly.
    int sock = socket(AF_INET, SOCK_STREAM, 0);
    if (sock == -1) {
        printf("GOOD: socket() blocked as expected\n");
    } else {
        printf("BAD: socket() succeeded, sandbox not working!\n");
        close(sock);
    }
}
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| System calls fundamentals | Advanced Programming in the UNIX Environment, 3rd Edition by Stevens & Rago | Ch. 1-3 |
| OpenBSD security philosophy | Absolute OpenBSD by Michael W. Lucas | Ch. 1 + security chapters |
| FreeBSD Capsicum | Absolute FreeBSD, 3rd Edition by Michael W. Lucas | Ch. 8 |
| Linux seccomp-bpf | The Linux Programming Interface by Michael Kerrisk | Ch. 23 |
| File watching (inotify) | The Linux Programming Interface by Michael Kerrisk | Ch. 19 |
| BSD kqueue | The Design and Implementation of the FreeBSD Operating System by McKusick et al. | Ch. 6 |
| Security principles | Mastering FreeBSD and OpenBSD Security by Hope, Potter & Korff | Ch. 1-4 |
| BPF internals | BPF Performance Tools by Brendan Gregg | Ch. 2 (BPF basics) |
Project 2: Event-Driven TCP Echo Server (kqueue vs epoll)
- File: BSD_LINUX_UNIX_VARIANTS_LEARNING_PROJECTS.md
- Programming Language: C
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 4. The "Open Core" Infrastructure
- Difficulty: Level 3: Advanced
- Knowledge Area: High Performance Networking
- Software or Tool: epoll / kqueue
- Main Book: "The Linux Programming Interface" by Michael Kerrisk
What you'll build: A high-performance TCP echo server handling 10,000+ concurrent connections using native event APIs - kqueue on BSD, epoll on Linux - with no abstraction libraries.
Why it teaches Unix differences: The kqueue vs epoll comparison reveals deep kernel design philosophy differences. kqueue is more general (handles files, signals, processes, timers - not just sockets) and allows batch updates. epoll is socket-focused and requires one syscall per change. Building the same server on both forces you to internalize these differences.
Core challenges you'll face:
- Challenge 1: kevent() batch operations vs epoll_ctl() single operations (maps to API design philosophy)
- Challenge 2: Handling EVFILT_READ, EVFILT_WRITE vs EPOLLIN, EPOLLOUT (maps to event model differences)
- Challenge 3: Edge-triggered vs level-triggered behavior on both systems
- Challenge 4: Scaling to C10K connections and measuring performance differences
Key Concepts:
- Non-blocking I/O fundamentals: "The Linux Programming Interface" Ch. 63 - Michael Kerrisk
- kqueue design and API: Kernel Queue Complete Guide
- epoll internals: "Linux System Programming, 2nd Edition" Ch. 4 - Robert Love
- C10K problem context: Dan Kegel's C10K paper
- Cross-platform event loops: Scalable Event Multiplexing: epoll vs kqueue
Difficulty: Intermediate. Time estimate: 1-2 weeks. Prerequisites: Socket programming basics, understanding of file descriptors.
Real world outcome:
- A server that accepts connections and echoes back whatever clients send
- Benchmark output: "Handled 10,000 concurrent connections, 50,000 req/sec on FreeBSD kqueue" vs "45,000 req/sec on Linux epoll"
- Performance graphs comparing both implementations under load
Learning milestones:
- Build blocking echo server → understand why it doesn't scale
- Convert to epoll on Linux → understand event-driven I/O
- Port to kqueue on FreeBSD/OpenBSD → notice the cleaner API
- Add benchmarking with wrk or a custom client → quantify the differences
- Try macOS kqueue → understand Darwin's BSD heritage
Real World Outcome
When you complete this project, you'll have a high-performance TCP echo server that handles thousands of concurrent connections using native OS event APIs.
What you'll see running on FreeBSD with kqueue:
$ ./echo_server 8080
[echo_server] kqueue() created, fd=3
[echo_server] Listening on port 8080
[echo_server] Registered listener with EVFILT_READ
[echo_server] Entering event loop...
[14:32:01] Client connected from 192.168.1.10:52341 (fd=4)
[14:32:01] Client connected from 192.168.1.11:48923 (fd=5)
[14:32:01] Received 1024 bytes from fd=4, echoing back
[14:32:01] Client connected from 192.168.1.12:39847 (fd=6)
...
[14:32:05] Active connections: 847
[14:32:10] Active connections: 2,341
[14:32:15] Active connections: 5,892
[14:32:20] Active connections: 10,003 # C10K achieved!
# Performance stats:
[echo_server] kevent() calls: 15,234
[echo_server] Events processed: 1,247,892
[echo_server] Avg events per kevent(): 81.9
[echo_server] Throughput: 52,341 req/sec
What you'll see running on Linux with epoll:
$ ./echo_server 8080
[echo_server] epoll_create1() returned fd=3
[echo_server] Listening on port 8080
[echo_server] Added listener to epoll with EPOLLIN
[echo_server] Entering event loop...
[14:32:01] Client connected from 192.168.1.10:52341 (fd=4)
[14:32:01] epoll_ctl(EPOLL_CTL_ADD, fd=4) # One syscall per fd!
[14:32:01] Client connected from 192.168.1.11:48923 (fd=5)
[14:32:01] epoll_ctl(EPOLL_CTL_ADD, fd=5)
...
[14:32:20] Active connections: 10,003
# Performance stats:
[echo_server] epoll_wait() calls: 18,456
[echo_server] epoll_ctl() calls: 45,234 # More syscalls than kqueue!
[echo_server] Events processed: 1,198,234
[echo_server] Throughput: 48,721 req/sec
Benchmark comparison output:
$ ./benchmark_comparison.sh
========================================
kqueue vs epoll Performance Test
========================================
Test: 10,000 concurrent connections, 60 seconds
BSD (FreeBSD 14) - kqueue:
Requests/sec: 52,341
Latency avg: 1.2ms
Latency p99: 4.8ms
Syscalls: 15,234 kevent()
Linux (Ubuntu 24.04) - epoll:
Requests/sec: 48,721
Latency avg: 1.4ms
Latency p99: 5.2ms
Syscalls: 63,690 (epoll_wait + epoll_ctl)
Analysis:
- kqueue batches updates: ONE kevent() call for multiple changes
- epoll requires one epoll_ctl() per fd modification
- Under high connection churn, kqueue has fewer syscalls
- Both handle C10K easily, but kqueue is more elegant
Your codebase comparison:
// BSD kqueue - batch register and wait in ONE call
struct kevent changes[MAX_EVENTS]; // What to change
struct kevent events[MAX_EVENTS]; // What happened
int nchanges = 0;
// Add multiple fds to changes array
EV_SET(&changes[nchanges++], client_fd, EVFILT_READ, EV_ADD, 0, 0, NULL);
EV_SET(&changes[nchanges++], another_fd, EVFILT_READ, EV_ADD, 0, 0, NULL);
// ONE syscall does everything!
int nevents = kevent(kq, changes, nchanges, events, MAX_EVENTS, NULL);
// -----------------------------------------------------------------
// Linux epoll - separate calls for modification and waiting
struct epoll_event ev, events[MAX_EVENTS];
// Each fd requires its own syscall
ev.events = EPOLLIN;
ev.data.fd = client_fd;
epoll_ctl(epfd, EPOLL_CTL_ADD, client_fd, &ev); // Syscall 1
ev.data.fd = another_fd;
epoll_ctl(epfd, EPOLL_CTL_ADD, another_fd, &ev); // Syscall 2
// Then wait
int nevents = epoll_wait(epfd, events, MAX_EVENTS, -1); // Syscall 3
The Core Question You're Answering
"Why is kqueue considered technically superior to epoll, and what does this teach us about API design in operating systems?"
This project reveals a truth about Unix API design: elegance matters for performance. kqueue's ability to batch operations into a single syscall means fewer context switches under load. But epoll works "well enough" and ships with the dominant server OS.
You'll understand why Nginx, HAProxy, and other high-performance servers have different code paths for different OSes, and why some developers prefer BSD for networking workloads.
Concepts You Must Understand First
Stop and research these before coding:
- The C10K Problem
- What is the C10K problem and why was it revolutionary in 1999?
- Why don't traditional threading models scale to 10K connections?
- How do event-driven architectures solve this?
- Reference: Dan Kegel's C10K Paper
- Blocking vs Non-Blocking I/O
- What happens when you call read() on a blocking socket with no data?
- How does O_NONBLOCK change socket behavior?
- What happens when you call
- Level-Triggered vs Edge-Triggered
- Level-triggered: "notify while the condition exists"
- Edge-triggered: "notify when the condition changes"
- Why does edge-triggered require draining the buffer completely? (See the drain-loop sketch after this list.)
- Which is default for kqueue? For epoll?
- Book Reference: Linux System Programming, 2nd Edition by Robert Love - Ch. 4
- File Descriptors and the Kernel
- What is a file descriptor really? (index into per-process table)
- How does the kernel track which fds to monitor?
- Why is select() O(n) while epoll/kqueue are O(1)?
- Book Reference: Advanced Programming in the UNIX Environment by Stevens & Rago - Ch. 3
- kqueue Architecture
- What is a kevent structure?
- What are filters? (EVFILT_READ, EVFILT_WRITE, EVFILT_VNODE, EVFILT_TIMERโฆ)
- Why can kqueue batch changes?
- Reference: Kernel Queue: Complete Guide
- epoll Architecture
- What do epoll_create, epoll_ctl, epoll_wait do?
- Why separate calls for modification and waiting?
- What is EPOLLONESHOT and when would you use it?
- Book Reference: The Linux Programming Interface by Michael Kerrisk - Ch. 63
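Edge-triggered mode (EPOLLET on Linux, EV_CLEAR on kqueue) only notifies on the transition to readable, so the read handler has to drain the socket until EAGAIN or it will never hear about the leftover bytes. A minimal sketch of that drain loop, assuming a non-blocking client_fd:

// Drain an edge-triggered, non-blocking socket completely before returning to the event loop.
#include <errno.h>
#include <unistd.h>

void drain_and_echo(int client_fd) {
    char buf[4096];
    for (;;) {
        ssize_t n = read(client_fd, buf, sizeof(buf));
        if (n > 0) {
            write(client_fd, buf, (size_t)n);   // echo back (a real server must handle short writes)
        } else if (n == 0) {
            close(client_fd);                    // peer closed the connection
            return;
        } else if (errno == EAGAIN || errno == EWOULDBLOCK) {
            return;                              // buffer drained; safe to wait for the next edge
        } else if (errno == EINTR) {
            continue;                            // interrupted, retry the read
        } else {
            close(client_fd);                    // real error
            return;
        }
    }
}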
Questions to Guide Your Design
Before implementing, think through these:
- Server Architecture
- Will you use a single-threaded event loop or multiple threads with separate event loops?
- How will you handle the accept() of new connections?
- Should accept() be level-triggered or edge-triggered?
- Event Handling
- When a read event fires, how much data should you read?
- What if the client sends more data than your buffer size?
- How do you handle partial writes when the send buffer is full?
- Connection Lifecycle
- How do you detect client disconnection?
- When should you remove a fd from the event set?
- How do you avoid use-after-free when closing connections?
- Performance Measurement
- How will you count syscalls to compare the APIs?
- How will you generate load for benchmarking?
- What metrics matter: throughput, latency, syscall count?
- Error Handling
- What happens if kqueue()/epoll_create() fails?
- How do you handle EINTR during kevent()/epoll_wait()? (See the retry sketch below.)
- What if a client causes an error - crash the server or just close that connection?
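On the EINTR question, the conventional pattern is simply to retry the wait. A minimal sketch (the same structure works for kevent()):

// Retry the wait when a signal interrupts it.
#include <errno.h>
#include <sys/epoll.h>

int wait_for_events(int epfd, struct epoll_event *events, int maxevents) {
    for (;;) {
        int n = epoll_wait(epfd, events, maxevents, -1);
        if (n >= 0)
            return n;                 // number of ready events
        if (errno != EINTR)
            return -1;                // real error; the caller decides whether to abort
        // EINTR: a signal arrived mid-wait; just call epoll_wait() again
    }
}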
Thinking Exercise
Before coding, trace this scenario by hand:
You have 1000 clients connected. 100 of them send data simultaneously.
kqueue Event Processing

State: 1000 fds registered with EVFILT_READ

Step 1: 100 clients send data simultaneously

Step 2: kevent(kq, NULL, 0, events, 1000, NULL)
        Returns: 100 events (only the ready ones)
        Syscalls so far: 1

Step 3: Process all 100 events, read data, echo back

Step 4: 50 clients disconnect
        We need to remove them from kqueue
        Build array: changes[50] = {EV_DELETE for each fd}

Step 5: kevent(kq, changes, 50, events, 1000, NULL)
        Removes 50 fds AND waits for new events in ONE call
        Syscalls so far: 2

Total syscalls for this cycle: 2
epoll Event Processing

State: 1000 fds registered with EPOLLIN

Step 1: 100 clients send data simultaneously

Step 2: epoll_wait(epfd, events, 1000, -1)
        Returns: 100 events
        Syscalls so far: 1

Step 3: Process all 100 events, read data, echo back

Step 4: 50 clients disconnect
        We need to remove them from epoll
        epoll_ctl(epfd, EPOLL_CTL_DEL, fd1, NULL)   // syscall 2
        epoll_ctl(epfd, EPOLL_CTL_DEL, fd2, NULL)   // syscall 3
        ... 48 more times ...
        Syscalls so far: 51

Step 5: epoll_wait() for the next batch
        Syscalls so far: 52

Total syscalls for this cycle: 52
Key insight: Under high connection churn (many connects/disconnects), kqueue's batching advantage becomes significant. Under stable connection pools, the difference is minimal.
The Interview Questions They'll Ask
Prepare to answer these:
- "Why is kqueue technically superior to epoll?"
- Batch updates: one syscall for multiple changes
- More generic: handles files, signals, processes, not just sockets
- Cleaner API: changes and events can be in the same call
- "If kqueue is better, why does everyone use Linux?"
- epoll is "good enough" for most workloads
- Linux has better hardware support, more developers, larger ecosystem
- Most applications use abstraction layers (libevent, libuv) anyway
- "What's the difference between level-triggered and edge-triggered?"
- Level: kernel keeps notifying as long as fd is ready
- Edge: kernel notifies once when state changes from not-ready to ready
- Edge requires you to drain the buffer completely or you'll miss data
- "How would you handle 1 million connections?"
- C10K is 20+ years old; C1M is the new challenge
- Need multiple event loops (one per core)
- Need to think about memory per connection
- SO_REUSEPORT helps distribute accept() load
- "What do Nginx and HAProxy use?"
- Both have epoll and kqueue backends
- Code is mostly the same, event API is abstracted
- They prove the performance difference is measurable but not critical
- "Why didn't Linux just implement kqueue?"
- NIH (Not Invented Here) syndrome
- Different kernel architecture made direct porting hard
- By the time kqueue was proven, epoll was already deployed
Hints in Layers
Hint 1: Start with a blocking echo server
Understand the baseline before optimization:
int client_fd = accept(listen_fd, NULL, NULL);
while (1) {
ssize_t n = read(client_fd, buf, sizeof(buf));
if (n <= 0) break;
write(client_fd, buf, n); // Echo back
}
close(client_fd);
// Problem: only handles ONE client at a time!
Hint 2: Non-blocking sockets are essential
int flags = fcntl(fd, F_GETFL, 0);
fcntl(fd, F_SETFL, flags | O_NONBLOCK);
// Now read() returns -1 with errno=EAGAIN instead of blocking
Hint 3: kqueue skeleton
int kq = kqueue();
struct kevent ev;
// Register listener
EV_SET(&ev, listen_fd, EVFILT_READ, EV_ADD, 0, 0, NULL);
kevent(kq, &ev, 1, NULL, 0, NULL);
while (1) {
struct kevent events[64];
int n = kevent(kq, NULL, 0, events, 64, NULL);
for (int i = 0; i < n; i++) {
int fd = events[i].ident;
if (fd == listen_fd) {
// Accept new connection, add to kqueue
} else if (events[i].filter == EVFILT_READ) {
// Read from client, echo back
}
}
}
Hint 4: epoll skeleton
int epfd = epoll_create1(0);
struct epoll_event ev, events[64];
// Register listener
ev.events = EPOLLIN;
ev.data.fd = listen_fd;
epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev);
while (1) {
int n = epoll_wait(epfd, events, 64, -1);
for (int i = 0; i < n; i++) {
int fd = events[i].data.fd;
if (fd == listen_fd) {
// Accept new connection
int client = accept(listen_fd, NULL, NULL);
ev.events = EPOLLIN;
ev.data.fd = client;
epoll_ctl(epfd, EPOLL_CTL_ADD, client, &ev); // Extra syscall!
} else {
// Read from client, echo back
}
}
}
Hint 5: Benchmark with a simple load generator
# Using netcat and yes for simple load
for i in $(seq 1 1000); do
(yes "hello" | nc localhost 8080 &)
done
# Or use wrk for HTTP if you add HTTP parsing
wrk -t4 -c10000 -d30s http://localhost:8080/
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| I/O multiplexing fundamentals | The Linux Programming Interface by Michael Kerrisk | Ch. 63: "Alternative I/O Models" |
| kqueue deep dive | The Design and Implementation of the FreeBSD Operating System by McKusick et al. | Ch. 6 |
| epoll internals | Linux System Programming, 2nd Edition by Robert Love | Ch. 4 |
| Non-blocking sockets | UNIX Network Programming, Volume 1 by Stevens, Fenner & Rudoff | Ch. 16 |
| High-performance networking | TCP/IP Illustrated, Volume 1 by W. Richard Stevens | Ch. 17-18 |
| The C10K problem | Dan Kegel's C10K paper (online) | Full document |
| Event loop design | Network Programming with Go by Adam Woodbeck | Ch. 3 (concepts transfer to C) |
Project 3: Build Your Own Container/Jail/Zone
- File: BSD_LINUX_UNIX_VARIANTS_LEARNING_PROJECTS.md
- Main Programming Language: C
- Alternative Programming Languages: Rust, Go
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: Level 1: The "Resume Gold"
- Difficulty: Level 3: Advanced (The Engineer)
- Knowledge Area: OS Virtualization, Namespaces
- Software or Tool: Linux, FreeBSD, Docker
- Main Book: "The Linux Programming Interface" by Michael Kerrisk
What you'll build: A minimal container runtime from scratch - using Linux namespaces+cgroups, FreeBSD jails, and illumos zones - to understand how OS-level virtualization differs fundamentally across Unix systems.
Why it teaches Unix differences: As Jessie Frazelle explains, "Jails and Zones are first-class kernel concepts. Containers are NOT - they're just a term for combining Linux namespaces and cgroups." Building all three reveals why FreeBSD's jail(2) is a single syscall while Linux requires orchestrating 7+ namespace types plus cgroups.
Core challenges you'll face:
- Challenge 1: Linux - combining mount, PID, network, user, UTS, and IPC namespaces manually (maps to the "building blocks" philosophy)
- Challenge 2: FreeBSD - a single jail() syscall with jailparams (maps to the "first-class concept" philosophy)
- Challenge 3: Networking inside containers - veth pairs (Linux) vs VNET jails (FreeBSD)
- Challenge 4: Filesystem isolation - overlay/bind mounts (Linux) vs ZFS clones (FreeBSD/illumos)
- Challenge 5: Resource limits - cgroups v2 (Linux) vs rctl (FreeBSD); see the cgroup sketch below
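For the cgroups side of Challenge 5, the kernel interface is just files. A hedged sketch (the directory names follow the cgroup v2 convention; the exact mount point and limits are illustrative):

// Create a cgroup, cap its memory, and move the current process into it (cgroup v2).
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

static void write_file(const char *path, const char *value) {
    FILE *f = fopen(path, "w");
    if (!f) { perror(path); return; }
    fputs(value, f);
    fclose(f);
}

int main(void) {
    char pid[32];

    mkdir("/sys/fs/cgroup/mycontainer", 0755);                         // new cgroup
    write_file("/sys/fs/cgroup/mycontainer/memory.max", "268435456");  // 256 MB limit
    write_file("/sys/fs/cgroup/mycontainer/cpu.weight", "512");

    snprintf(pid, sizeof(pid), "%d", getpid());
    write_file("/sys/fs/cgroup/mycontainer/cgroup.procs", pid);        // join the cgroup
    return 0;
}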
Key Concepts:
- Linux namespaces: "The Linux Programming Interface" Ch. 44 - Michael Kerrisk
- FreeBSD jails: FreeBSD Handbook Ch. 17: Jails
- illumos zones: Getting Started with Zones on OmniOS
- Containers deep dive: Klara Systems: OpenShift vs FreeBSD Jails
- cgroups v2: "How Linux Works, 3rd Edition" Ch. 8 - Brian Ward
Difficulty: Advanced. Time estimate: 3-4 weeks. Prerequisites: C programming, basic understanding of processes and filesystems.
Real world outcome:
- Run ./mycontainer /bin/sh and get an isolated shell with its own PID 1, network stack, and filesystem view
- Demonstrate isolation: processes inside can't see host processes; networking is separate
- Show the difference in complexity: ~500 lines for Linux namespace container vs ~100 lines for FreeBSD jail wrapper
Learning milestones:
- Linux: Create PID namespace, see process isolation → understand the namespace concept
- Linux: Add mount namespace, overlay filesystem → understand filesystem isolation
- Linux: Add network namespace with veth pair → understand network virtualization
- FreeBSD: Create jail with a single syscall → notice the dramatic simplicity difference
- FreeBSD: Add VNET networking to the jail → understand VNET architecture
- Compare codebase sizes and complexity → internalize the design philosophy difference
Real World Outcome
When you complete this project, you'll have built minimal container runtimes that demonstrate the fundamental design philosophy differences between Linux and FreeBSD.
What you'll see on Linux (your container runtime):
$ sudo ./mycontainer run /bin/sh
[mycontainer] Creating namespaces...
[mycontainer] PID namespace: clone(CLONE_NEWPID) - PID 1 inside!
[mycontainer] Mount namespace: clone(CLONE_NEWNS) - isolated filesystem
[mycontainer] UTS namespace: clone(CLONE_NEWUTS) - new hostname
[mycontainer] Network namespace: clone(CLONE_NEWNET) - isolated network
[mycontainer] User namespace: clone(CLONE_NEWUSER) - uid mapping
[mycontainer] IPC namespace: clone(CLONE_NEWIPC) - isolated semaphores
[mycontainer] Setting up cgroups v2...
[mycontainer] Memory limit: 256MB
[mycontainer] CPU shares: 512
[mycontainer] Setting up root filesystem...
[mycontainer] pivot_root() to /containers/alpine
[mycontainer] Setting up network...
[mycontainer] Created veth pair: veth0 <-> container0
[mycontainer] Container IP: 10.0.0.2/24
[mycontainer] Host bridge: 10.0.0.1/24
[mycontainer] Dropping capabilities...
[mycontainer] Entering container...
/ # hostname
container-12345
/ # ps aux
PID USER TIME COMMAND
1 root 0:00 /bin/sh <-- We are PID 1!
2 root 0:00 ps aux
/ # cat /proc/1/cgroup
0::/mycontainer <-- Our cgroup
/ # ip addr
1: lo: <LOOPBACK,UP> mtu 65536
inet 127.0.0.1/8
2: eth0: <BROADCAST,UP> mtu 1500
inet 10.0.0.2/24 <-- Isolated network!
/ # exit
[mycontainer] Container exited with status 0
[mycontainer] Cleaning up namespaces and cgroups...
What you'll see on FreeBSD (your jail runtime):
$ sudo ./myjail run /bin/sh
[myjail] Creating jail...
[myjail] jail_set(2) with:
[myjail] path = /jails/alpine
[myjail] hostname = jail-12345
[myjail] ip4.addr = 10.0.0.2
[myjail] That's it. One syscall. Jail created.
[myjail] Entering jail...
$ hostname
jail-12345
$ ps aux
USER PID %CPU %MEM VSZ RSS TT STAT STARTED TIME COMMAND
root 1 0.0 0.1 4788 1524 - SJ 14:32 0:00.01 /bin/sh
root 2 0.0 0.1 4788 1496 - R+J 14:32 0:00.00 ps aux
$ ifconfig
lo0: flags=8049<UP,LOOPBACK,RUNNING,MULTICAST> mtu 16384
inet 127.0.0.1 netmask 0xff000000
jail0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> mtu 1500
inet 10.0.0.2 netmask 0xffffff00
$ exit
[myjail] Jail exited with status 0
Comparing your codebases:
$ wc -l linux_container.c freebsd_jail.c
547 linux_container.c # 500+ lines for Linux namespaces + cgroups
98 freebsd_jail.c # ~100 lines for FreeBSD jail
# Breakdown of Linux container complexity:
$ grep -c 'clone\|unshare' linux_container.c
12 # Many namespace operations
$ grep -c 'cgroup' linux_container.c
35 # cgroup setup is verbose
$ grep -c 'veth\|netlink' linux_container.c
48 # Network namespace setup is complex
# FreeBSD jail simplicity:
$ grep -c 'jail' freebsd_jail.c
8 # jail_set, jail_attach, jailparam_*
The visual difference:
Linux Container Creation ("assemble the parts")

Namespaces (7+ clone flags):
    clone(CLONE_NEWPID), clone(CLONE_NEWNS), clone(CLONE_NEWUTS),
    clone(CLONE_NEWNET), clone(CLONE_NEWUSER), clone(CLONE_NEWIPC),
    clone(CLONE_NEWCGROUP)
+ Resource limits (more file operations):
    create cgroup, write(memory.max), write(cpu.weight)
+ Network setup via netlink (complex socket programming):
    create veth pair, add address, add route
+ Filesystem isolation:
    pivot_root()
+ Optional syscall filtering:
    seccomp_load()

Result: ~500 lines of C, deeply understanding 5+ subsystems
FreeBSD Jail Creation

struct jailparam params[] = {
    { "path",     "/jails/myjail" },
    { "hostname", "myjail" },
    { "ip4.addr", "10.0.0.2" },
    { "vnet",     "new" },          // VNET for network isolation
};

jail_set(params, nparams, JAIL_CREATE | JAIL_ATTACH);

// That's it. ONE syscall. You're in the jail.

Result: ~100 lines of C, understanding ONE subsystem
The Core Question You're Answering
"Why is a 'container' not a single, first-class thing on Linux, and what are the real implications of this design choice?"
This project will burn into your brain the most important insight about Unix system design: on Linux, "container" is just a name for a bundle of separate kernel primitives, while FreeBSD jails and Solaris zones are first-class kernel concepts.
As Jessie Frazelle famously wrote: "Jails and Zones are first-class concepts. Containers are NOT." This difference explains:
- Why container escapes happen on Linux
- Why Docker needed years to become stable
- Why FreeBSD jails were production-ready in 2000
- Why illumos Zones can run Linux binaries (LX branded zones)
Concepts You Must Understand First
Stop and research these before coding:
- Process Isolation Fundamentals
- What does a process see? (memory space, file descriptors, PID space)
- What is chroot() and why isn't it enough for isolation?
- What is "escaping" a chroot and how is it done? (see the break-out sketch after this list)
- Book Reference: Operating Systems: Three Easy Pieces by Arpaci-Dusseau, Part II: "Virtualization"
- Linux Namespaces (The Building Blocks)
- PID namespace: What does it mean to have PID 1?
- Mount namespace: How does the filesystem view differ?
- Network namespace: What is a network stack?
- User namespace: How do UID/GID mappings work?
- UTS namespace: Just the hostname, but important!
- IPC namespace: Semaphores, message queues, shared memory
- Book Reference: The Linux Programming Interface by Michael Kerrisk, Ch. 28-29
- Linux cgroups (Resource Limits)
- What is a cgroup hierarchy?
- cgroups v1 vs v2: Why did Linux redesign this?
- How do you limit memory, CPU, I/O?
- Book Reference: How Linux Works, 3rd Edition by Brian Ward, Ch. 8
- FreeBSD Jails (The Integrated Approach)
- What is the jail(2) system call?
- What's a jailparam and how do you set one?
- What is VNET and why does it make jails more powerful?
- What is rctl (resource control)?
- Book Reference: Absolute FreeBSD, 3rd Edition by Michael W. Lucas, Ch. 12: "Jails"
- Filesystem Isolation
- Linux: What is pivot_root() vs chroot()?
- Linux: What is an overlay filesystem?
- FreeBSD: How do nullfs mounts work?
- Both: How does ZFS make container storage better?
- Book Reference: The Linux Programming Interface by Michael Kerrisk, Ch. 18
- Network Virtualization
- Linux: What is a veth pair? What is a bridge?
- Linux: How does netlink work?
- FreeBSD: What is VNET? How is it different from IP-based jails?
- Reference: FreeBSD Handbook Ch. 17: Jails
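To make the chroot question concrete, here is a minimal sketch of the classic break-out referenced in the "escaping a chroot" bullet above (my own illustration, not taken from any of the books). It assumes the process already runs as root inside the chroot; the breakout directory name is arbitrary.
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>
int main(void) {
    // chroot() changes the root directory but NOT the working directory.
    mkdir("breakout", 0755);                 // any subdirectory will do
    if (chroot("breakout") != 0) {           // re-root below our CWD...
        perror("chroot");
        return 1;
    }
    for (int i = 0; i < 64; i++)             // ...so CWD is now OUTSIDE the root,
        chdir("..");                         // and ".." can walk up to the real /
    chroot(".");                             // re-root at the real filesystem root
    execl("/bin/sh", "sh", (char *)NULL);    // this shell sees the host filesystem
    perror("execl");
    return 1;
}
This trick is precisely why the projects above lean on jails and on pivot_root() plus mount namespaces rather than on bare chroot().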
Questions to Guide Your Design
Before implementing, think through these:
- What defines "isolation"?
- From the container's view: what should it be unable to see/do?
- From the host's view: what should be protected?
- What's the threat model?
- How do you set up the root filesystem?
- Where do you get a minimal rootfs? (Alpine, busybox)
- Should changes persist or be discarded? (overlay vs bind mount)
- How do you mount /proc, /sys, /dev inside?
- How do you handle networking?
- Does the container need network access?
- How does traffic get routed between container and host?
- Do you need NAT for outbound connections?
- What about resource limits?
- How much memory should the container have?
- Should it have limited CPU?
- What happens when limits are exceeded?
- How do you enter the container?
- Linux: clone() with flags vs unshare() + fork() (see the sketch after this list)
- FreeBSD: jail_attach() vs starting a new process in jail
- What happens to the child process?
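For the clone() vs unshare() question, here is a minimal Linux-only sketch of the unshare() route (an illustration, not the project's required structure). It needs root or an existing user namespace, and error handling is kept to a minimum.
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
int main(void) {
    // unshare() changes which PID namespace our *future children* join;
    // the calling process itself keeps its original PID.
    if (unshare(CLONE_NEWPID) != 0) { perror("unshare"); return 1; }
    pid_t child = fork();
    if (child == 0) {
        printf("inside the namespace I am PID %d\n", getpid());   // prints 1
        execl("/bin/sh", "sh", (char *)NULL);
        _exit(127);
    }
    waitpid(child, NULL, 0);   // fork() returned the child's PID as the host sees it
    return 0;
}
The practical difference from clone(): with unshare() the parent can set namespaces up step by step before forking, which is what the unshare(1) utility does with --fork.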
Thinking Exercise
Before coding, trace what Docker does on Linux:
$ strace -f docker run --rm alpine echo hello 2>&1 | grep -E 'clone|unshare|mount|pivot|cgroup'
# You'll see something like:
clone(child_stack=0x..., flags=CLONE_NEWNS|CLONE_NEWPID|...)
mount("none", "/", NULL, MS_REC|MS_PRIVATE, NULL)
mount("overlay", "/var/lib/docker/.../merged", "overlay", ...)
pivot_root(".", ".")
mount("proc", "/proc", "proc", ...)
openat(AT_FDCWD, "/sys/fs/cgroup/.../memory.max", ...)
write(3, "268435456", 9) # 256MB memory limit
clone(child_stack=0x..., flags=CLONE_NEWNET|...)
Now trace what FreeBSD does with a jail:
$ truss jail -c name=test path=/jails/test command=/bin/sh 2>&1 | grep jail
# You'll see:
jail_set(0x..., 5, 0x3) # THAT'S IT. One syscall.
Map the complexity:
┌──────────────────────────────────────────────────────────────────
│ What Docker Does on Linux
├──────────────────────────────────────────────────────────────────
│
│   Layer 1: Namespace creation (7 different namespaces)
│
│   Layer 2: cgroup creation and configuration
│            - Create cgroup directory
│            - Write limits to pseudo-files
│            - Add process to cgroup
│
│   Layer 3: Filesystem setup
│            - Create overlay mount
│            - pivot_root to new root
│            - Mount /proc, /sys, /dev
│            - Mask sensitive paths
│
│   Layer 4: Network setup
│            - Create veth pair
│            - Move one end to container namespace
│            - Configure IP addresses
│            - Set up routing
│            - Configure iptables rules
│
│   Layer 5: Security
│            - Drop capabilities
│            - Install seccomp filter
│            - Set up AppArmor/SELinux profile
│
│   TOTAL: 50+ syscalls, configuration across 5+ subsystems
│
└──────────────────────────────────────────────────────────────────

┌──────────────────────────────────────────────────────────────────
│ What FreeBSD Does with Jails
├──────────────────────────────────────────────────────────────────
│
│   Layer 1: jail_set() with parameters
│            - path: root filesystem
│            - hostname: container name
│            - ip4.addr / vnet: network config
│            - (optional) resource limits via rctl
│
│   TOTAL: 1-3 syscalls, everything is integrated
│
└──────────────────────────────────────────────────────────────────
The philosophical difference:
- Linux: "Here are Lego blocks. Assemble them yourself. Maximum flexibility!"
- FreeBSD: "Here's a finished product. It works. Limited flexibility, but it's correct."
The Interview Questions They'll Ask
Prepare to answer these:
- "What's the difference between a container and a VM?"
- VM: Separate kernel, hardware virtualization (hypervisor)
- Container: Shared kernel, OS-level virtualization (namespaces/jails)
- Container is lighter but has weaker isolation
- "How does Docker work under the hood on Linux?"
- Uses clone() with namespace flags
- Uses cgroups for resource limits
- Uses overlay filesystem for copy-on-write layers
- Uses pivot_root() for filesystem isolation
- "What is a container escape?"
- Attacker inside container gains access to host
- Usually through kernel vulnerabilities (shared kernel!)
- Or misconfiguration (privileged containers, mounted docker socket)
- "Why are FreeBSD jails considered more secure?"
- Single, audited subsystem vs. assembled primitives
- Less complexity = fewer bugs
- Jails have been in production since 2000 and are battle-tested
- "What is the difference between Docker and LXC/LXD?"
- Docker: Application containers, immutable images, microservices
- LXC/LXD: System containers, more like lightweight VMs
- Both use the same Linux primitives underneath
- "How would you debug a container networking issue?"
- Check the container's namespaces with `nsenter` (see the `setns()` sketch after this list)
- Check veth pairs with `ip link`
- Check routing with `ip route`
- Check iptables rules
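The nsenter answer has a programmatic equivalent worth knowing for that interview: open the target process's namespace file under /proc and join it with setns(2). A minimal sketch follows (the choice of running `ip addr` is just for illustration); it requires CAP_SYS_ADMIN.
#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <unistd.h>
int main(int argc, char *argv[]) {
    if (argc != 2) { fprintf(stderr, "usage: %s <pid>\n", argv[0]); return 1; }
    char path[64];
    snprintf(path, sizeof(path), "/proc/%s/ns/net", argv[1]);
    int fd = open(path, O_RDONLY);                 // handle to the target's net namespace
    if (fd < 0 || setns(fd, CLONE_NEWNET) != 0) {  // join it (CAP_SYS_ADMIN required)
        perror("setns");
        return 1;
    }
    execlp("ip", "ip", "addr", (char *)NULL);      // now sees the container's interfaces
    perror("execlp");
    return 1;
}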
Hints in Layers
Hint 1: Start with just PID namespace
The simplest isolation, a separate PID space:
// Linux: fork into a new PID namespace (needs root or CAP_SYS_ADMIN)
int child_pid = clone(child_func, stack + STACK_SIZE,
CLONE_NEWPID | SIGCHLD, NULL);
// In child_func:
printf("I am PID %d\n", getpid()); // Will print "I am PID 1"!
Hint 2: Add mount namespace for filesystem isolation
// After clone() with CLONE_NEWNS:
mount("none", "/", NULL, MS_REC | MS_PRIVATE, NULL); // Make our mounts private
mount("/containers/alpine", "/containers/alpine", NULL, MS_BIND, NULL);
chdir("/containers/alpine");
syscall(SYS_pivot_root, ".", "."); // glibc has no pivot_root() wrapper; needs <sys/syscall.h>
umount2(".", MNT_DETACH); // Unmount the old root
Hint 3: FreeBSD jail is dramatically simpler
// FreeBSD: values are set with jailparam_import(); jailparam_set() then applies
// them all in one jail_set(2) call. Link with -ljail.
#include <sys/param.h>
#include <sys/jail.h>
#include <jail.h>
struct jailparam params[4];
jailparam_init(&params[0], "path");
jailparam_import(&params[0], "/jails/myjail");
jailparam_init(&params[1], "host.hostname");
jailparam_import(&params[1], "myjail");
jailparam_init(&params[2], "ip4.addr");
jailparam_import(&params[2], "10.0.0.2");
jailparam_init(&params[3], "persist");
jailparam_import(&params[3], NULL);   // NULL means "true" for boolean parameters
int jid = jailparam_set(params, 4, JAIL_CREATE | JAIL_ATTACH);
// That's it! You're now in the jail.
Hint 4: Linux network namespace needs veth pair
// This requires netlink programming or shelling out to `ip`:
system("ip link add veth0 type veth peer name container0");
system("ip link set container0 netns <pid>");
system("ip addr add 10.0.0.1/24 dev veth0");
system("ip link set veth0 up");
// Inside container namespace:
system("ip addr add 10.0.0.2/24 dev container0");
system("ip link set container0 up");
system("ip route add default via 10.0.0.1");
Hint 5: cgroups v2 setup
// Create cgroup
mkdir("/sys/fs/cgroup/mycontainer", 0755);
// Set memory limit (256MB)
int fd = open("/sys/fs/cgroup/mycontainer/memory.max", O_WRONLY);
write(fd, "268435456", 9);
close(fd);
// Add process to cgroup
fd = open("/sys/fs/cgroup/mycontainer/cgroup.procs", O_WRONLY);
char pid_str[16];
sprintf(pid_str, "%d", child_pid);
write(fd, pid_str, strlen(pid_str));
close(fd);
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Linux namespaces | The Linux Programming Interface by Michael Kerrisk | Ch. 28-29 (Process Creation) |
| Linux cgroups | How Linux Works, 3rd Edition by Brian Ward | Ch. 8 |
| FreeBSD jails | Absolute FreeBSD, 3rd Edition by Michael W. Lucas | Ch. 12: "Jails" |
| Container internals | Container Security by Liz Rice | Full book (O'Reilly) |
| Process isolation theory | Operating Systems: Three Easy Pieces by Arpaci-Dusseau | Part II: "Virtualization" |
| Filesystem namespaces | The Linux Programming Interface by Michael Kerrisk | Ch. 18: "Directories and Links" |
| Network namespaces | Linux Network Internals by Christian Benvenuti | Ch. 1-3 |
| illumos Zones | illumos Documentation | Online |
Project 4: Packet Filter Firewall Configuration Tool
- File: BSD_LINUX_UNIX_VARIANTS_LEARNING_PROJECTS.md
- Programming Language: C
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 3. The "Service & Support" Model
- Difficulty: Level 2: Intermediate
- Knowledge Area: Networking / Security
- Software or Tool: pf / nftables
- Main Book: โAbsolute OpenBSDโ by Michael W. Lucas
What you'll build: A command-line tool that generates and applies firewall rules, using pf on OpenBSD/FreeBSD and nftables on Linux, from a common configuration format.
Why it teaches Unix differences: OpenBSD's pf (packet filter) is legendary for its clean syntax and powerful features. Linux's nftables (replacing iptables) has different semantics. Building a tool that targets both forces you to understand network stack differences at the kernel level.
Core challenges you'll face:
- Challenge 1: pf's stateful inspection model vs nftables' table/chain/rule hierarchy
- Challenge 2: pf anchors vs nftables sets for dynamic rules
- Challenge 3: NAT handling differences
- Challenge 4: Loading rules atomically vs incrementally
Key Concepts:
- pf fundamentals: "Absolute OpenBSD" Ch. 7 - Michael W. Lucas
- OpenBSD pf FAQ: OpenBSD official documentation
- nftables design: Linux nftables wiki
- BSD networking: "TCP/IP Illustrated Vol. 1" - W. Richard Stevens (the BSD reference implementation)
- Packet filtering theory: "Mastering FreeBSD and OpenBSD Security" Ch. 4-5 - Hope, Potter & Korff
Difficulty: Intermediate. Time estimate: 2 weeks. Prerequisites: Basic networking (TCP/IP), understanding of firewalls conceptually.
Real world outcome:
- A tool that reads YAML like `allow: {port: 22, from: 10.0.0.0/8}` and outputs valid pf.conf or nftables rules (see the sketch after this list)
- Apply rules and demonstrate: blocked connections fail, allowed connections succeed
- Show the same logical policy expressed in both syntaxes
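To get a feel for the generator's core, here is a small sketch: one policy struct with two backend emitters. The names (allow_rule, emit_pf, emit_nft) are invented for illustration, and the nftables output assumes an existing `inet filter` table with an `input` chain; the pf line can go into pf.conf as-is.
#include <stdio.h>
struct allow_rule {
    const char *from;   // source CIDR, e.g. "10.0.0.0/8"
    int         port;   // destination TCP port
};
// OpenBSD/FreeBSD pf syntax
void emit_pf(const struct allow_rule *r, FILE *out) {
    fprintf(out, "pass in proto tcp from %s to any port %d\n", r->from, r->port);
}
// Linux nftables syntax (as an `nft` command)
void emit_nft(const struct allow_rule *r, FILE *out) {
    fprintf(out, "add rule inet filter input ip saddr %s tcp dport %d accept\n",
            r->from, r->port);
}
int main(void) {
    struct allow_rule ssh = { "10.0.0.0/8", 22 };
    emit_pf(&ssh, stdout);
    emit_nft(&ssh, stdout);
    return 0;
}
Even this toy version surfaces the design gap: pf rules read like sentences loaded as one ruleset, while nftables rules are statements addressed to a table/chain hierarchy you must create first.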
Learning milestones:
- Write pf rules manually on OpenBSD → understand pf syntax and concepts
- Write equivalent nftables rules on Linux → notice the structural differences
- Build parser for common config format → abstract the similarities
- Generate native rules for each OS → encode the differences
- Test with real traffic → verify correctness
Project 5: DTrace/eBPF System Tracer
- File: BSD_LINUX_UNIX_VARIANTS_LEARNING_PROJECTS.md
- Main Programming Language: C
- Alternative Programming Languages: D (DTrace), Rust, Python (BCC)
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: Level 1: The "Resume Gold"
- Difficulty: Level 3: Advanced (The Engineer)
- Knowledge Area: System Tracing, Performance
- Software or Tool: DTrace, eBPF, BCC
- Main Book: "BPF Performance Tools" by Brendan Gregg
What you'll build: A system tracing tool that shows function call latencies in running processes, using DTrace on FreeBSD/illumos/macOS and eBPF on Linux.
Why it teaches Unix differences: DTrace originated in Solaris (now illumos) and was ported to FreeBSD and macOS. Linux created eBPF as a "competitor." Both let you instrument a running kernel without rebooting, but their models differ significantly. DTrace uses the D language; eBPF uses C compiled to bytecode with complex verifier rules.
Core challenges you'll face:
- Challenge 1: D language scripts vs eBPF C programs (maps to "language design" philosophy)
- Challenge 2: DTrace probes (fbt, syscall, pid) vs eBPF attach points (kprobe, tracepoint, uprobe)
- Challenge 3: DTrace aggregations (@count, @quantize) vs eBPF maps
- Challenge 4: Safety models - DTrace's interpreter vs eBPF's verifier
Key Concepts:
- DTrace fundamentals: Brendan Gregg's DTrace Tools
- DTrace scripts: DTraceToolkit
- eBPF/BCC on Linux: Brendan Gregg's "BPF Performance Tools"
- illumos DTrace deep dive: illumos Features
- FreeBSD DTrace: "Absolute FreeBSD, 3rd Edition" Ch. 19 - Michael W. Lucas
Difficulty: Advanced. Time estimate: 2-3 weeks. Prerequisites: Understanding of the kernel/userspace boundary, basic C.
Real world outcome:
- Run `./mytrace -p <pid>` and see output like: `read() latency: min=1μs avg=50μs max=2ms histogram: [1-10μs: 500] [10-100μs: 200]`
- Same tool works on FreeBSD (DTrace) and Linux (eBPF) with different backends
- Demonstrate tracing a real application (like nginx) to find performance bottlenecks
Learning milestones:
- Write simple DTrace one-liner on FreeBSD → understand probe concept
- Convert to D script with aggregations → understand D language
- Port to eBPF/BCC on Linux → notice the complexity increase
- Add histogram output → understand both aggregation models
- Trace real application → apply knowledge practically
Project Comparison Table
| Project | Difficulty | Time | Depth of Understanding | Fun Factor | OSes Covered |
|---|---|---|---|---|---|
| Sandboxed Service | Intermediate | 2-3 weeks | ⭐⭐⭐⭐⭐ (security models) | ⭐⭐⭐ | OpenBSD, FreeBSD, Linux |
| Event-Driven Server | Intermediate | 1-2 weeks | ⭐⭐⭐⭐ (I/O architecture) | ⭐⭐⭐⭐ | FreeBSD, Linux, macOS |
| Container/Jail/Zone | Advanced | 3-4 weeks | ⭐⭐⭐⭐⭐ (isolation architecture) | ⭐⭐⭐⭐⭐ | Linux, FreeBSD, illumos |
| Packet Filter Tool | Intermediate | 2 weeks | ⭐⭐⭐⭐ (networking) | ⭐⭐⭐ | OpenBSD, FreeBSD, Linux |
| DTrace/eBPF Tracer | Advanced | 2-3 weeks | ⭐⭐⭐⭐⭐ (kernel internals) | ⭐⭐⭐⭐ | FreeBSD, illumos, Linux |
Recommendation
Given that you want to deeply understand the differences, I recommend starting with Project 3: Build Your Own Container/Jail/Zone.
Why this project first:
- Maximum contrast: The difference between Linux's 7 namespace types + cgroups vs FreeBSD's single `jail()` syscall is the clearest demonstration of the "building blocks" vs "first-class concept" philosophy
- Practical relevance: Containers are everywhere; understanding them at the kernel level makes you dangerous
- Forces multi-OS work: You literally cannot complete it without running multiple operating systems
- Foundation for others: Once you understand isolation, the security sandbox project (Project 1) becomes much clearer
Setup recommendation:
- Use VirtualBox/VMware with FreeBSD 14, OpenBSD 7.5, and Linux (any distro)
- Or use cloud VMs (Vultr/DigitalOcean have FreeBSD; OpenBSD requires ISO install)
- illumos: Use OmniOS or SmartOS in VM
Final Comprehensive Project: Cross-Platform Unix Compatibility Layer
What you'll build: A userspace compatibility library that allows programs written for one Unix to run on another, implementing syscall translation, filesystem abstraction, and API shimming. Think: a minimal "Wine for BSD" or "BSD personality for Linux."
Why it teaches everything: This project forces you to confront EVERY difference between Unix systems:
- Different syscall numbers and semantics
- Different ioctl interfaces
- Different signal behaviors
- Different filesystem layouts and conventions
- Different library ABIs
What you'll build specifically:
- A preloadable shared library (`LD_PRELOAD`) that intercepts syscalls (see the interposer sketch after this list)
- Translation layer for key differences (e.g., translate `kqueue` calls to `epoll` on Linux)
- ABI compatibility for basic programs (get `ls` from FreeBSD running on Linux, or vice versa)
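The first bullet relies on the standard LD_PRELOAD plus dlsym(RTLD_NEXT) trick, so here is a minimal interposer sketch (my illustration; the shim.so name is arbitrary). One caveat the project will force on you anyway: this intercepts libc wrappers, so statically linked binaries and raw syscall(2) calls slip past it.
#define _GNU_SOURCE
#include <dlfcn.h>
#include <fcntl.h>
#include <stdarg.h>
#include <stdio.h>
// Build: cc -shared -fPIC shim.c -o shim.so -ldl
// Run:   LD_PRELOAD=./shim.so ls /etc
typedef int (*open_fn)(const char *, int, ...);
int open(const char *path, int flags, ...) {
    static open_fn real_open = NULL;
    if (!real_open)
        real_open = (open_fn)dlsym(RTLD_NEXT, "open");  // next "open" in link order
    mode_t mode = 0;
    if (flags & O_CREAT) {              // the mode argument only exists with O_CREAT
        va_list ap;
        va_start(ap, flags);
        mode = va_arg(ap, mode_t);
        va_end(ap);
    }
    fprintf(stderr, "[shim] open(\"%s\", 0x%x)\n", path, flags);
    return real_open(path, flags, mode);
}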
Core challenges youโll face:
- Challenge 1: Syscall number mapping (same name, different numbers across OSes)
- Challenge 2: Struct layout differences (even `struct stat` differs)
- Challenge 3: Signal semantics variations
- Challenge 4: Implementing kqueue in terms of epoll (or vice versa) - see the sketch after this list
- Challenge 5: Path translation (`/usr/local` conventions, `/proc` vs `/compat/linux/proc`)
- Challenge 6: Dynamic linker differences
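For Challenge 4, the mapping in the simplest case (watch an fd for readability) looks roughly like the sketch below. The shim_* struct and constants are local stand-ins so the snippet compiles on Linux, which has no <sys/event.h>; real kqueue semantics (udata, EV_ONESHOT, EVFILT_WRITE, kevent's combined changelist/eventlist call, ...) need far more mapping than this.
#include <stdint.h>
#include <string.h>
#include <sys/epoll.h>
// Minimal stand-in for the BSD struct kevent fields we care about.
struct shim_kevent {
    uintptr_t ident;     // on BSD: the fd being watched
    short     filter;    // EVFILT_READ, EVFILT_WRITE, ...
    unsigned  flags;     // EV_ADD, EV_DELETE, ...
};
#define SHIM_EVFILT_READ (-1)      // value FreeBSD uses for EVFILT_READ
#define SHIM_EV_ADD      0x0001    // maps to EPOLL_CTL_ADD
#define SHIM_EV_DELETE   0x0002    // maps to EPOLL_CTL_DEL
// Register or remove one read-interest, kqueue-style, on an epoll instance.
int shim_register(int epfd, const struct shim_kevent *kev) {
    struct epoll_event ev;
    memset(&ev, 0, sizeof(ev));
    if (kev->filter == SHIM_EVFILT_READ)
        ev.events = EPOLLIN;       // EVFILT_READ roughly corresponds to EPOLLIN
    ev.data.fd = (int)kev->ident;
    int op = (kev->flags & SHIM_EV_DELETE) ? EPOLL_CTL_DEL : EPOLL_CTL_ADD;
    return epoll_ctl(epfd, op, (int)kev->ident, &ev);
}
Going the other direction, epoll's EPOLLET maps fairly naturally onto kqueue's EV_CLEAR, which is part of why this challenge is tractable at all.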
Key Concepts:
- Syscall interfaces: "The Linux Programming Interface" Ch. 3 - Kerrisk + BSD man pages comparison
- ABI compatibility: "Computer Systems: A Programmer's Perspective" Ch. 7 - Bryant & O'Hallaron
- Dynamic linking: "Advanced Programming in the UNIX Environment" Ch. 17 - Stevens & Rago
- FreeBSD Linux emulation: FreeBSD Handbook - Linux Binary Compatibility
- illumos LX zones: illumos LX branded zones - how they run Linux binaries
Difficulty: Expert. Time estimate: 2-3 months. Prerequisites: Complete at least 2-3 projects above; strong C; understanding of the ELF format.
Real world outcome:
- Run a simple FreeBSD binary on Linux (or vice versa): `./mycompat /path/to/freebsd/ls -la`
- See output showing which syscalls were translated
- Demonstrate: "This program uses kqueue, but we're translating it to epoll on Linux"
Learning milestones:
- Build syscall interception framework → understand how syscalls work at machine level
- Implement basic syscall translation (open, read, write, close) → understand "same but different"
- Implement struct translation layer → understand ABI differences
- Port kqueue→epoll (or reverse) → deep understanding of both
- Get a real program running → validate your understanding is complete
Sources
- pledge(2) - OpenBSD manual pages
- unveil(2) - OpenBSD manual pages
- Jessie Frazelle: Containers vs. Zones vs. Jails vs. VMs
- FreeBSD Handbook: Jails and Containers
- Klara Systems: OpenShift vs FreeBSD Jails
- FreeBSD kqueue vs Linux epoll comparison
- Kernel Queue: Complete Guide
- Scalable Event Multiplexing: epoll vs kqueue
- The C10K problem
- Brendan Gregg's DTrace Tools
- DTraceToolkit
- illumos ZFS Administration Guide
- illumos Features
- Getting Started with Zones on OmniOS