Operating Systems from First Principles
Goal: Build a precise mental model of how an operating system turns raw hardware into a reliable, multi-program environment. You will understand the kernel/user boundary, process and thread scheduling, virtual memory, storage, and I/O, and you will know why each abstraction exists. You will implement core subsystems (boot flow, allocator, scheduler, syscall interface, file system, shell, containers) and validate them with real traces and tests. By the end, you can trace any syscall end to end, reason about performance and security trade-offs, and design small kernels and OS subsystems from scratch.
Introduction
Operating systems are the privileged software layer that virtualizes hardware resources, enforces isolation, and exposes stable abstractions (processes, files, sockets, memory) to user programs. In practice, the OS is both a resource manager and a contract: it decides who gets CPU time, how memory is mapped, how data persists, and how devices are accessed.
What you will build (by the end of this guide):
- A bootable bare-metal “Hello” that runs without any OS
- A bootloader that loads a kernel and switches CPU modes
- Core OS simulators (scheduler, virtual memory, allocator)
- User-space OS tools (shell, syscall tracer, IPC tools)
- A file system in a disk image (mountable via FUSE)
- A minimal container runtime (namespaces + cgroups)
- A kernel module and a network stack exploration project
Scope (what is included):
- x86-64 Linux and Unix-style APIs as the primary learning target
- Core OS abstractions: processes, threads, memory, files, devices, IPC
- Practical implementations and simulators you can run and test
Out of scope (for this guide):
- Building a full production kernel for real hardware
- GUI subsystems, window servers, and graphics stacks
- Deep device-specific driver development beyond basic modules
The Big Picture (Mental Model)
Firmware -> Bootloader -> Kernel -> User Space
| | | |
| | | +--> Processes, shells, servers
| | +--> Scheduler, VM, FS, IPC, drivers
| +--> Load kernel, set CPU mode, pass boot info
+--> Initialize hardware, provide boot services
Key Terms You Will See Everywhere
- Kernel: The privileged core that manages CPU, memory, devices, and protection.
- System call: The controlled entry from user space into the kernel.
- Process: A running program with its own address space and resources.
- Virtual memory: The per-process illusion of a private address space.
- File system: The abstraction that turns blocks into named, durable files.
- Interrupt: An asynchronous signal from hardware that preempts execution.
How to Use This Guide
- Read the Theory Primer first. It is your mini-book and mental model.
- Then do projects in order or follow a learning path that matches your goals.
- For low-level projects, always use QEMU or a VM. Never test kernel code on your host OS.
- Each project has a Definition of Done. Treat it as a contract.
- Keep a lab notebook. Write down invariants, surprises, and failures.
- Revisit the primer after each project. You will understand more the second time.
Prerequisites & Background Knowledge
Essential Prerequisites (Must Have)
Programming Skills:
- Confident C programming (pointers, structs, memory layout, bitwise ops)
- Comfort with the command line, Make, and basic toolchains
- Ability to read and write basic assembly (x86-64 syntax preferred)
Computer Architecture Fundamentals:
- CPU registers, stack, calling conventions, instruction execution
- Basic memory hierarchy: caches, RAM, storage
- Recommended Reading: “Computer Systems: A Programmer’s Perspective” Ch. 1-3
OS Fundamentals:
- What a process is and why system calls exist
- Basic file I/O concepts (open, read, write, close)
- Recommended Reading: “Operating Systems: Three Easy Pieces” Ch. 1-6
Helpful But Not Required
Advanced Topic:
- Detailed x86 boot and mode switching
- Can learn during: Projects 1-2
Advanced Topic:
- Kernel debugging and tracing tools
- Can learn during: Projects 8, 15, 16
Advanced Topic:
- Networking internals (TCP/IP, sockets)
- Can learn during: Project 16
Self-Assessment Questions
- Can you explain what a system call is and why user code cannot access hardware directly?
- Can you compile a C program and inspect its symbols with nm or objdump?
- Can you explain the difference between virtual and physical memory?
- Can you read simple assembly and understand stack usage?
- Can you use GDB to set breakpoints and inspect memory?
If you answered “no” to questions 1-3, spend 1-2 weeks on OSTEP and CS:APP before starting.
Development Environment Setup
Required Tools:
- Linux host or VM (Ubuntu 22.04+ or Debian 12+ recommended)
- GCC or Clang, Make, binutils (ld, objdump, nm)
- QEMU (x86_64) for boot and kernel testing
- GDB for debugging (plus gdb-multiarch if needed)
Recommended Tools:
- NASM or GAS for assembly
- strace, ltrace, perf, bpftrace for tracing
- dd, hexdump, xxd, objdump for binary inspection
- FUSE development headers for the file system project
Time Investment
- Foundations and early projects (1-6): 8-12 weeks
- Mid-stack projects (7-12): 10-14 weeks
- Advanced projects (13-16): 8-12 weeks
- Total: 6-9 months at a steady pace
Important Reality Check
OS work is detail-heavy and error-prone. You will spend real time debugging boot loaders, memory corruption, and deadlocks. The payoff is deep systems intuition that transfers to every other area of software.
Big Picture / Mental Model
At runtime, the OS sits between user programs and hardware. It multiplexes CPU time, virtualizes memory, translates I/O into device commands, and enforces security boundaries. A useful mental model is to think in terms of three loops: CPU scheduling, memory translation, and I/O completion.
User Program
| syscalls
v
+--------------------+
| Kernel |
| - Scheduler |<-- timer interrupts
| - Virtual Memory |<-- page faults
| - File System |<-- block I/O
| - IPC |<-- pipes/sockets
| - Drivers |<-- device interrupts
+--------------------+
| privileged ops
v
Hardware (CPU, RAM, Disk, NIC, GPU)
Key invariants:
- User code never touches kernel memory directly.
- Every syscall validates user inputs.
- Every runnable thread eventually runs (fairness).
- Storage updates either succeed or fail safely across crashes.
Theory Primer
C1: Bootstrapping and CPU Modes
Fundamentals
Bootstrapping is the process of turning powered-off hardware into a running kernel. When the CPU resets, it starts executing firmware code (BIOS or UEFI), not your OS. The firmware initializes hardware just enough to load a bootloader. The bootloader then loads the kernel image into memory, sets up a stack and CPU mode, and jumps to the kernel entry point. On x86, the CPU begins in 16-bit real mode with severe limits; modern kernels require 32-bit protected mode and eventually 64-bit long mode with paging enabled. Understanding this flow is critical because every later OS feature assumes the boot stage has established a correct memory map, stack, and CPU state. If boot is wrong, everything fails.
Deep Dive into the Concept
Bootstrapping starts at the CPU reset vector. On legacy BIOS systems, the firmware runs a power-on self-test, enumerates devices, and then loads the first 512-byte sector of a bootable disk to memory at 0x7C00. That sector is a tiny program with a magic signature (0x55AA) at the end. Because 512 bytes is tiny, real systems use a multi-stage boot: stage 1 loads stage 2, and stage 2 loads the kernel. In UEFI systems, the firmware can read filesystems directly and load a PE/COFF executable from an EFI System Partition, but the core idea is the same: a small, trusted loader prepares the environment for the kernel.
In real mode, the CPU uses segmented addressing and only has access to the first 1 MB of memory. Code must fit and use 16-bit instructions. A key early task is enabling the A20 line so memory addresses above 1 MB work correctly. The bootloader then builds a Global Descriptor Table (GDT) and sets the PE bit in CR0 to enter 32-bit protected mode. This is not just a flag flip: you must perform a far jump to flush the instruction pipeline and load the new code segment. From there, the bootloader can set up a flat memory model where segments map to the same linear address space.
Modern 64-bit kernels require long mode. That transition requires enabling Physical Address Extension (PAE), building identity-mapped page tables, setting CR3 to point at them, enabling paging (CR0.PG), and then setting the long mode bit in the EFER MSR. Only then can the CPU execute 64-bit instructions. This sequence is fragile: if page tables are incorrect or not aligned, the CPU will triple fault and reset. That is why many teaching kernels begin in 32-bit mode and only later switch to 64-bit after bringing up minimal diagnostics.
A bootloader also handles disk I/O. In BIOS mode, this means using INT 0x13 services to read sectors, which can be CHS or LBA. The loader must know how much to read and where to put it. That implies a kernel image format: raw binary, or a structured format like ELF. If you use ELF, the loader reads the ELF header, copies each loadable segment into memory, and then jumps to the entry address. This is the same mechanism used by modern OS loaders, just simplified.
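To make the ELF step concrete, here is a minimal C sketch of the copy loop a stage-2 loader might run. It assumes a 64-bit ELF image already sits in memory, that segment target addresses are safe to write, and that memcpy/memset are available (a freestanding loader supplies its own); the struct fields follow the ELF specification, but load_elf itself is an illustrative name.

#include <stdint.h>
#include <string.h>
/* Just the ELF64 header fields a loader needs. */
typedef struct {
    uint8_t  e_ident[16];
    uint16_t e_type, e_machine;
    uint32_t e_version;
    uint64_t e_entry, e_phoff, e_shoff;
    uint32_t e_flags;
    uint16_t e_ehsize, e_phentsize, e_phnum, e_shentsize, e_shnum, e_shstrndx;
} Elf64_Ehdr;
typedef struct {
    uint32_t p_type, p_flags;
    uint64_t p_offset, p_vaddr, p_paddr, p_filesz, p_memsz, p_align;
} Elf64_Phdr;
#define PT_LOAD 1
/* Copy each loadable segment to its target address; return the entry point. */
void *load_elf(const uint8_t *image) {
    const Elf64_Ehdr *eh = (const Elf64_Ehdr *)image;
    const Elf64_Phdr *ph = (const Elf64_Phdr *)(image + eh->e_phoff);
    for (uint16_t i = 0; i < eh->e_phnum; i++) {
        if (ph[i].p_type != PT_LOAD) continue;
        memcpy((void *)ph[i].p_vaddr, image + ph[i].p_offset, ph[i].p_filesz);
        /* BSS: the part of memsz beyond filesz must be zeroed. */
        memset((void *)(ph[i].p_vaddr + ph[i].p_filesz), 0,
               ph[i].p_memsz - ph[i].p_filesz);
    }
    return (void *)eh->e_entry;
}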
Early kernel code is freestanding: there is no libc, no dynamic memory, and no interrupts yet. The kernel must set up its own stack, zero the BSS, and establish a minimal console or serial output so you can debug. Only after the kernel initializes interrupt handlers, memory management, and device drivers does it become a full OS. This early phase is where most boot bugs live, and the only visibility you have is whatever you print (VGA memory, serial, or QEMU debug).
Boot is also where hardware differences first appear. BIOS vs UEFI, legacy PIC vs APIC, and different memory maps change what the kernel must do. UEFI provides services for memory maps and drivers, while BIOS expects you to work with legacy interfaces. Real kernels parse boot parameters (command line, memory map, initrd location) and validate them before continuing. That validation step is a miniature trust boundary: your kernel must assume firmware or the bootloader can be wrong. In this guide, you build small, controlled boot paths to internalize the invariants: you must know exactly where code is loaded, which CPU mode is active, and how control passes to the kernel. Once you trust the boot path, everything else in the OS becomes explainable and debuggable.
How this fits into the projects
- Project 1 builds a raw boot sector and forces you to work in real mode.
- Project 2 implements a multi-stage bootloader and mode switch.
- Project 14 (xv6 study) lets you inspect a real teaching OS boot path.
Definitions & key terms
- Firmware: The first code that runs after reset (BIOS or UEFI).
- Bootloader: A small program that loads and starts the kernel.
- Real mode: 16-bit x86 mode with segmented addressing and 1 MB limit.
- Protected mode: 32-bit mode with privilege levels and memory protection.
- Long mode: 64-bit mode enabled via paging and EFER MSR.
- GDT: Global Descriptor Table defining memory segments.
- A20 line: Gate that allows access above 1 MB in legacy systems.
- ELF: Executable and Linkable Format used by many Unix kernels.
Mental model diagram
Power -> Firmware -> Stage1 -> Stage2 -> Kernel
| | | |
| | +--> load ELF segments
| +--> read sectors from disk
+--> hardware init, boot order
How it works (step by step)
- CPU resets and begins executing firmware.
- Firmware loads boot sector or EFI loader.
- Bootloader loads kernel image into memory.
- Bootloader sets CPU mode and stack.
- Bootloader jumps to kernel entry.
- Kernel initializes memory, interrupts, and devices.
Minimal concrete example
; 16-bit boot sector (conceptual)
org 0x7C00
mov ah, 0x0E
mov al, 'H'
int 0x10 ; BIOS teletype output
jmp $
times 510-($-$$) db 0
dw 0xAA55
Common misconceptions
- “The kernel starts in 64-bit mode by default.” (It does not on BIOS systems.)
- “The bootloader is optional.” (You always need something to load the kernel.)
- “Switching CPU modes is just a flag.” (It requires correct tables and jumps.)
Check-your-understanding questions
- Why must the boot sector end with 0x55AA?
- What is the purpose of the A20 line?
- Why do you need a far jump after enabling protected mode?
Check-your-understanding answers
- The BIOS checks for 0x55AA to decide the sector is bootable.
- It enables access above 1 MB; without it, addresses wrap.
- The far jump reloads CS and flushes the pipeline for the new mode.
Real-world applications
- Secure boot chains in laptops and servers
- Embedded systems with custom boot ROMs
- Virtual machine firmware (QEMU/OVMF) boot paths
Where you’ll apply it
- Projects 1, 2, 14
References
- Operating Systems: Three Easy Pieces, Ch. 36
- Write Great Code, Vol 2, Ch. 3
- The Art of Assembly Language, Ch. 13
- Intel 64 and IA-32 SDM, Vol 3 (system programming)
Key insights
Boot is a series of strict invariants; if you control the invariants, you control the system.
Summary
Bootstrapping is the forced path from reset to a running kernel. It is constrained, fragile, and deterministic. Understanding it gives you the first mental model of how hardware becomes software.
Homework/Exercises to practice the concept
- Draw the memory map from 0x0000 to 0x20000 during boot.
- Modify the boot sector to print a second line using VGA memory.
- Implement a two-stage loader that reads 4 sectors.
Solutions to the homework/exercises
- Include a 0x7C00 boot sector and show reserved regions like IVT/BDA.
- Use 0xB8000 and increment by 2 bytes per character.
- Stage 1 reads N sectors into 0x1000, then jumps to stage 2.
C2: Kernel Boundary and System Call Interface
Fundamentals
The kernel boundary is the line that separates trusted code from untrusted code. User programs run in an unprivileged CPU mode and cannot directly access devices, page tables, or other processes’ memory. Instead, they request services through system calls. A system call is a controlled transition into the kernel, where the OS validates arguments, performs the requested action, and returns to user space. This boundary is the core safety and stability mechanism of any OS. If you understand how a syscall enters the kernel, how the kernel validates user input, and how control returns, you understand the OS contract. Every project you build later depends on this boundary behaving correctly.
Deep Dive into the Concept
Modern CPUs provide hardware support for privilege separation. On x86, user code runs at ring 3 and kernel code runs at ring 0. The transition happens through instructions like syscall/sysenter or through exceptions and interrupts. When the CPU executes a syscall instruction, it switches to kernel mode, loads a kernel stack pointer from a model-specific register (MSR), saves user context, and jumps to a well-defined entry point. That entry point is usually a hand-written assembly stub that saves registers, switches to a safe stack, and calls a C function that dispatches to the correct system call handler.
The system call ABI defines how arguments are passed. On Linux x86-64, the syscall number is placed in rax and arguments are passed in rdi, rsi, rdx, r10, r8, and r9. The return value is placed in rax, and negative values are used to signal errors. The kernel must treat all user pointers as untrusted. Functions like copy_from_user and copy_to_user exist specifically to validate memory access and handle faults if the user passes a bad pointer. This is why the syscall boundary is a security boundary: a single unchecked pointer can crash or compromise the kernel.
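To see how thin the interface is, here is a sketch of a raw write(2) wrapper for Linux x86-64 using GCC/Clang inline assembly (a minimal sketch, assuming that compiler syntax; syscall number 1 is write, and rcx/r11 are clobbered by the syscall instruction itself):

#include <stddef.h>
/* Raw Linux x86-64 write(2): number in rax; args in rdi, rsi, rdx. */
static long raw_write(int fd, const void *buf, size_t len) {
    long ret;
    __asm__ volatile (
        "syscall"
        : "=a"(ret)
        : "a"(1), "D"((long)fd), "S"(buf), "d"(len)
        : "rcx", "r11", "memory"    /* syscall overwrites rcx and r11 */
    );
    return ret;                      /* -4095..-1 encodes -errno */
}
int main(void) {
    raw_write(1, "hello from a raw syscall\n", 25);
    return 0;
}

Everything libc normally adds (errno translation, retries, cancellation points) is absent here, which is exactly the point.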
System call entry is designed to be fast but safe. The kernel entry path follows a tight sequence: save registers, switch stacks, validate arguments, dispatch to the handler, and return to user space. To reduce overhead, modern kernels expose certain data (such as time) through a vDSO page mapped into user space. But anything that touches privileged state still requires the full transition.
The boundary is also a policy boundary. The kernel decides which processes can perform which system calls through permissions, capability checks, and security modules. For example, a process might be able to read a file but not change its permissions, or it might be prevented from creating new namespaces. Every syscall handler must check these policies because the kernel is the final gatekeeper. This is why the syscall table is a central audit point in OS security.
Debugging and tracing syscalls is one of the most practical ways to understand OS behavior. Tools like strace use ptrace to intercept syscalls at entry and exit, showing you exactly which kernel services a program uses. This reveals the runtime behavior that libraries hide. When you implement a syscall wrapper in assembly, you see exactly how small the interface is and how much libc does for you.
Syscalls also interact with signals and error handling. A blocking syscall can return early with EINTR if a signal handler runs; robust programs must restart or handle partial results. The kernel may mark some syscalls as restartable so it can resume them transparently after signal handling. This behavior matters for I/O, sleep, and synchronization. Security layers such as seccomp, audit, and LSM hooks intercept syscalls as additional filters. They can allow, deny, or log calls, which adds overhead but provides observability and control in production.
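A minimal sketch of the standard EINTR idiom for a blocking read; robust programs wrap most blocking syscalls this way (or install handlers with SA_RESTART):

#include <errno.h>
#include <unistd.h>
/* Restart a read that a signal handler interrupted. */
ssize_t read_retry(int fd, void *buf, size_t len) {
    ssize_t n;
    do {
        n = read(fd, buf, len);
    } while (n < 0 && errno == EINTR);
    return n;
}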
ABI stability is another hidden constraint. Once a syscall number is exposed, it cannot change without breaking binaries. Linux therefore adds new syscalls rather than changing existing ones and maintains compatibility layers for 32-bit applications on 64-bit kernels. Some syscalls are split into families (open vs openat) to add new semantics without breaking old code. Legacy vsyscall and modern vDSO pages illustrate the balance between speed and compatibility. When you build raw syscall wrappers, you see why the ABI is small but extremely rigid.
The same boundary appears in every OS. Windows has NT syscalls, macOS has Mach traps, and Unix-like systems have POSIX system calls. The specific ABI differs, but the concepts are identical: controlled privilege transition, argument validation, resource accounting, and a return to user space. Understanding the kernel boundary is the fastest way to predict performance (syscalls are expensive), security (bad syscalls are dangerous), and portability (different OSes expose different syscall sets).
C3: Execution Model - Processes, Threads, Scheduling, and Synchronization
Fundamentals
The execution model describes how the OS creates, runs, and coordinates work. A process is a protected container for resources (address space, files, credentials). A thread is a schedulable unit of execution within a process. The OS scheduler decides which thread runs next, for how long, and on which CPU. Because threads share memory, the OS must provide synchronization primitives so that concurrent access does not corrupt state. Understanding processes, threads, scheduling, and synchronization gives you the ability to predict latency, throughput, and correctness in real systems.
Deep Dive into the Concept
A process begins when the OS creates a new address space and loads a program into it. On Unix, this often happens through fork and exec: fork creates a near-exact copy of the parent, and exec replaces the child image with a new program. This separation allows the child to adjust its environment (file descriptors, environment variables, working directory) before executing. Internally, the OS builds a Process Control Block (PCB) that stores PID, state, registers, memory map, and scheduling metadata. Every context switch updates this structure.
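A minimal fork/exec/wait sketch showing that separation; the child could remap descriptors or change directory between fork and exec:

#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>
int main(void) {
    pid_t pid = fork();               /* duplicate the calling process */
    if (pid == 0) {
        execlp("ls", "ls", "-l", (char *)NULL);  /* replace child image */
        perror("execlp");             /* reached only if exec fails */
        exit(1);
    }
    int status;
    waitpid(pid, &status, 0);         /* parent blocks until child exits */
    printf("child exited with %d\n", WEXITSTATUS(status));
    return 0;
}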
Threads share the same address space but have separate stacks and registers. This allows fast communication but makes correctness harder. Preemptive scheduling means the OS can interrupt a running thread at almost any time, save its context, and run another thread. The timer interrupt drives this. A context switch is expensive: it flushes CPU pipelines, may evict cache lines, and disrupts branch prediction. The scheduler must balance fairness (everyone runs) with performance (minimize switches) and responsiveness (interactive tasks stay snappy).
Scheduling policies encode those trade-offs. FIFO (first in, first out) maximizes simplicity but can starve short jobs. Shortest Job First optimizes turnaround but is unfair. Round Robin improves responsiveness by giving each process a time slice. Multi-Level Feedback Queue (MLFQ) adapts based on observed behavior, promoting interactive jobs and demoting CPU hogs. Linux’s Completely Fair Scheduler (CFS) uses a virtual runtime to approximate ideal fairness, maintaining a red-black tree of runnable tasks. Each policy needs metrics: turnaround time, response time, waiting time, and throughput.
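As a sketch of how such policies can be compared, here is a toy Round Robin loop over made-up burst lengths; it reports per-job turnaround time and ignores context-switch cost and arrival times:

#include <stdio.h>
int main(void) {
    int remaining[] = {5, 3, 8};       /* CPU bursts in ticks (invented) */
    int n = 3, slice = 2, clock = 0, done = 0;
    int finish[3] = {0};
    while (done < n) {
        for (int i = 0; i < n; i++) {
            if (remaining[i] == 0) continue;
            int run = remaining[i] < slice ? remaining[i] : slice;
            clock += run;              /* job i occupies the CPU */
            remaining[i] -= run;
            if (remaining[i] == 0) { finish[i] = clock; done++; }
        }
    }
    for (int i = 0; i < n; i++)
        printf("job %d turnaround: %d ticks\n", i, finish[i]);
    return 0;
}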
Synchronization is the second half of the execution model. When threads share memory, they must coordinate access to avoid races. Locks provide mutual exclusion; semaphores control access to a limited resource; condition variables allow threads to sleep until a condition becomes true. These primitives rely on atomic operations like compare-and-swap. The OS provides low-level primitives and can also detect or mitigate issues like priority inversion (where a low-priority thread holds a lock needed by a high-priority thread). Deadlock is a risk when multiple locks are acquired in different orders, so the OS and your code must maintain ordering rules or use deadlock detection.
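To make the atomic foundation concrete, here is a minimal C11 spinlock sketch built on atomic exchange; real implementations add backoff, fairness, and a path that sleeps instead of spinning:

#include <stdatomic.h>
typedef struct { atomic_int held; } spinlock_t;
static void spin_lock(spinlock_t *l) {
    /* Atomically set held=1; keep looping while the old value was already 1. */
    while (atomic_exchange_explicit(&l->held, 1, memory_order_acquire))
        ;
}
static void spin_unlock(spinlock_t *l) {
    atomic_store_explicit(&l->held, 0, memory_order_release);
}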
The OS also manages thread states: running, ready, blocked, and terminated. A thread blocks on I/O or synchronization and later becomes ready. The scheduler moves ready threads onto a CPU. A well-designed system has clear invariants: a blocked thread is not runnable, only one thread owns a lock at a time, and a thread cannot run on two CPUs simultaneously. These invariants are small, but the state space is huge, which is why concurrency bugs are so hard.
Finally, the execution model touches performance isolation. The scheduler must protect interactive tasks from CPU-bound tasks and avoid starvation. In multi-core systems, there is also a NUMA dimension: where a thread runs affects memory latency. Sophisticated schedulers try to keep threads near their data while still balancing load. Your projects implement simplified versions of these ideas so you can see their trade-offs in a controlled environment.
How this fits into the projects
- Project 5 builds a scheduler simulator and forces you to measure fairness.
- Projects 6 and 7 exercise fork/exec and process relationships.
- Projects 11 and 15 use synchronization primitives to avoid races.
Definitions & key terms
- Process Control Block (PCB): Kernel data structure describing a process.
- Context switch: Saving and restoring CPU state between threads.
- Preemption: Forcibly stopping a thread to run another.
- Time slice: Fixed time the scheduler gives a thread before switching.
- Mutex: A mutual exclusion lock.
- Condition variable: A wait/notify synchronization primitive.
Mental model diagram
Ready Queue -> [Scheduler] -> CPU
^ | |
| v |
Blocked <------ I/O, locks ----+
How it works (step by step)
- A process is created and placed in the ready queue.
- The scheduler selects a thread and runs it on a CPU.
- The thread runs until it blocks, yields, or is preempted.
- The OS saves context and picks the next runnable thread.
- Synchronization primitives ensure shared memory stays consistent.
Minimal concrete example
#include <pthread.h>
#include <stddef.h>
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
int counter = 0;
void *worker(void *arg) {
    (void)arg;                        /* unused */
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);
        counter++;                    /* protected: no lost updates */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}
Common misconceptions
- “Threads are always faster than processes.” (They are often faster but risk races.)
- “A context switch is cheap.” (It is measurable and can dominate latency.)
- “Locks guarantee correctness.” (Only if you use them consistently.)
Check-your-understanding questions
- Why can Round Robin improve responsiveness compared to FIFO?
- What makes a thread blocked versus ready?
- How does priority inversion occur?
Check-your-understanding answers
- It limits how long any one process can run without yielding.
- Blocked threads wait for I/O or synchronization; ready threads can run.
- A low-priority thread holds a lock needed by a high-priority thread.
Real-world applications
- Interactive desktop scheduling vs batch workloads
- High-frequency trading systems tuned for low latency
- Web servers with thread pools and backpressure
Where you’ll apply it
- Projects 5, 6, 7, 11, 15
References
- Operating Systems: Three Easy Pieces, Ch. 4, 7-9, 26-31
- Linux Kernel Development, Ch. 4
- Advanced Programming in the UNIX Environment, Ch. 8-9
- docs.kernel.org: scheduler and CFS overview
Key insights
Scheduling and synchronization are policy choices built on small but fragile invariants.
Summary
The execution model defines how the OS creates work, runs it, and keeps it correct. Scheduling policies trade fairness against throughput, while synchronization keeps shared memory safe under preemption.
Homework/Exercises to practice the concept
- Simulate FIFO, SJF, and Round Robin for a 4-process workload.
- Build a toy lock and measure how contention changes runtime.
- Create a deadlock with two locks and then fix it by ordering.
Solutions to the homework/exercises
- Use a timeline table and compute response/turnaround by hand.
- Add timestamps and compare unlocked vs locked increments.
- Acquire locks in a single global order to prevent cycles.
C4: Virtual Memory and Paging
Fundamentals
Virtual memory gives each process the illusion of a large, private address space. The OS and hardware cooperate to translate virtual addresses into physical addresses using page tables. This abstraction provides isolation (one process cannot access another’s memory), flexibility (the OS can move pages around), and efficiency (only needed pages must be resident). Page faults are the mechanism by which the OS brings pages into memory on demand. Understanding address translation, page tables, and page faults is essential for reasoning about performance and security.
Deep Dive into the Concept
A virtual address is split into a page number and an offset. The page number indexes into a page table, which yields a physical frame number and permissions. The CPU then combines the frame number with the offset to produce a physical address. On modern systems, page tables are multi-level to reduce memory overhead. Instead of a single huge table, the OS creates intermediate tables only when needed. This allows sparse address spaces and efficient memory use.
The Translation Lookaside Buffer (TLB) is a hardware cache of page table entries. When the CPU needs to translate an address, it checks the TLB first. A TLB hit is fast; a TLB miss requires walking the page tables in memory, which is slower. A page fault is different: it occurs when the page is not present or the access violates permissions. On a page fault, the CPU traps into the kernel, which decides whether to load the page, allocate a new page, or kill the process. This trap path is one of the most important performance and correctness paths in the OS.
Demand paging uses this mechanism to load pages only when they are accessed. The OS can map a file into memory and only load the pages that are actually used. This makes large applications feasible and enables features like memory-mapped files and copy-on-write. Copy-on-write is how fork is efficient: the parent and child share the same physical pages until one writes, at which point the OS copies the page and updates the page tables.
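A small demonstration of demand paging through a memory-mapped file; the path is arbitrary, and each page is faulted in on first touch rather than at mmap time:

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
int main(void) {
    int fd = open("/etc/hostname", O_RDONLY);  /* any readable file */
    if (fd < 0) return 1;
    struct stat st;
    fstat(fd, &st);
    char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) return 1;
    fwrite(p, 1, st.st_size, stdout);  /* first access triggers page faults */
    munmap(p, st.st_size);
    close(fd);
    return 0;
}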
When physical memory fills up, the OS must evict pages. Page replacement algorithms like FIFO, LRU, and Clock decide which pages to evict. The working set model helps explain why some workloads thrash: if the working set is larger than physical memory, the OS spends most of its time servicing page faults. The OS can also use swap space on disk to extend memory, but swapping is far slower than RAM, so a system with heavy swap tends to stall.
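To experiment with replacement policy, here is a tiny FIFO simulation over an invented reference string; swapping in LRU changes only the eviction choice:

#include <stdio.h>
#include <string.h>
int main(void) {
    int refs[] = {1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5};  /* made-up trace */
    int nrefs = 12, nframes = 3;
    int frames[3]; memset(frames, -1, sizeof frames);   /* -1 = empty */
    int next = 0, faults = 0;
    for (int i = 0; i < nrefs; i++) {
        int hit = 0;
        for (int f = 0; f < nframes; f++)
            if (frames[f] == refs[i]) { hit = 1; break; }
        if (!hit) {
            frames[next] = refs[i];        /* evict the oldest-loaded page */
            next = (next + 1) % nframes;
            faults++;
        }
    }
    printf("faults: %d of %d accesses\n", faults, nrefs);
    return 0;
}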
Virtual memory is also a protection mechanism. Each page has permissions: read, write, execute. The OS sets these to enforce isolation and security. Modern defenses like non-executable stacks (NX) and address space layout randomization (ASLR) are built on top of paging. The OS can also use huge pages (2 MB or 1 GB) to reduce TLB pressure for large, contiguous memory regions, trading off flexibility for speed.
In a simulator, you can see these concepts clearly. When you control page size, number of frames, and replacement policy, you can observe how locality of reference drives performance. A small change in access pattern can shift the miss rate dramatically. This is why understanding virtual memory is so important for systems performance work: it explains why programs slow down when they exceed cache or RAM, and why memory access patterns matter even when you have “enough” RAM.
Virtual memory also enables sharing. Shared libraries map the same physical pages into many processes, reducing memory usage. Memory-mapped files let the OS treat file I/O as paging, which simplifies applications and enables zero-copy reads. Kernel space itself is mapped into every process (usually in a protected region), which allows fast transitions but requires strict access control.
How this fits into the projects
- Project 4 builds a VM simulator and page replacement algorithms.
- Project 3 (allocator) depends on paging behavior for sbrk/mmap.
- Project 14 (xv6 study) exposes real page table code.
Definitions & key terms
- Page: Fixed-size block of memory (commonly 4 KB).
- Page table: Data structure mapping virtual pages to physical frames.
- TLB: Cache of recent translations.
- Page fault: Trap when a page is missing or access is illegal.
- Copy-on-write: Share pages until a write triggers a copy.
- Working set: Set of pages a process actively uses.
Mental model diagram
Virtual Addr -> [Page Table] -> Physical Frame -> RAM
| ^
v |
TLB (cache) ----+
How it works (step by step)
- CPU splits virtual address into page number + offset.
- TLB is checked for a cached translation.
- On miss, CPU walks page tables in memory.
- If page not present, a page fault traps to kernel.
- Kernel loads/allocates page, updates tables, resumes.
Minimal concrete example
uint32_t page_num = vaddr >> 12; // 4 KB pages
uint32_t offset = vaddr & 0xFFF;
if (!pte[page_num].present) page_fault(page_num);
phys = (pte[page_num].frame << 12) | offset;
Common misconceptions
- “Page faults are always errors.” (They are often normal demand paging.)
- “More RAM means no VM overhead.” (TLB misses still matter.)
- “TLB miss == page fault.” (A TLB miss can still be a valid page.)
Check-your-understanding questions
- Why are page tables multi-level on 64-bit systems?
- What is the difference between a TLB miss and a page fault?
- Why does locality improve performance?
Check-your-understanding answers
- A flat table would be enormous and waste memory for sparse spaces.
- A TLB miss requires a table walk; a page fault means the page is absent.
- Locality keeps recent pages in cache and reduces fault rate.
Real-world applications
- Memory-mapped databases and file I/O
- Security hardening via NX and ASLR
- High-performance systems using huge pages
Where you’ll apply it
- Projects 4, 3, 14
References
- Operating Systems: Three Easy Pieces, Ch. 13-22
- Computer Systems: A Programmer’s Perspective, Ch. 9
- Computer Architecture (Hennessy and Patterson), Ch. 5
Key insights
Virtual memory is both a performance tool and a security boundary.
Summary
Paging gives each process a private, flexible view of memory. The cost is translation overhead and page faults, which the OS manages with caching and replacement policies.
Homework/Exercises to practice the concept
- Simulate FIFO and LRU on a short access trace.
- Draw a two-level page table for a small address space.
- Measure page fault rate for a random vs sequential access pattern.
Solutions to the homework/exercises
- Use a 4-frame memory and track evictions by hand.
- Split virtual address into directory, table, and offset bits.
- Sequential access shows fewer faults due to spatial locality.
C5: Dynamic Memory Allocation and Fragmentation
Fundamentals
Dynamic memory allocation turns raw pages into flexible blocks that programs can request and free at runtime. The OS provides primitives like brk/sbrk and mmap, and user-space allocators (malloc) manage the heap on top of those primitives. The allocator must track free and used blocks, satisfy alignment requirements, and minimize fragmentation. Fragmentation comes in two forms: internal (wasted space inside allocated blocks) and external (unused gaps between blocks). Understanding allocators is essential for performance, stability, and security because memory bugs often originate here.
Deep Dive into the Concept
The simplest allocator is a bump pointer: keep a pointer to the end of the heap and move it forward for each allocation. This is fast but cannot free memory, so it only works in special cases. Real allocators maintain metadata for each block, typically a header that stores size and free/used status. Free blocks are organized in lists or trees. When a request arrives, the allocator finds a suitable free block, splits it if necessary, and returns a pointer to the payload. When memory is freed, the allocator marks the block as free and coalesces with neighboring free blocks to reduce fragmentation.
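As a sketch of those mechanics, here is a deliberately naive first-fit allocator on sbrk; it omits splitting, coalescing, alignment guarantees for the header itself, and thread safety, all of which a real allocator needs (my_malloc/my_free are illustrative names):

#include <stddef.h>
#include <unistd.h>
typedef struct header {
    size_t size;                        /* payload size */
    int free;                           /* 1 if reusable */
    struct header *next;                /* linked list of all blocks */
} header_t;
static header_t *head;
void *my_malloc(size_t size) {
    size = (size + 15) & ~(size_t)15;   /* round up to a 16-byte multiple */
    for (header_t *h = head; h; h = h->next)
        if (h->free && h->size >= size) { h->free = 0; return h + 1; }
    header_t *h = sbrk(sizeof *h + size);  /* no fit: grow the heap */
    if (h == (void *)-1) return NULL;
    h->size = size; h->free = 0; h->next = head; head = h;
    return h + 1;                       /* payload starts after the header */
}
void my_free(void *p) {
    if (p) ((header_t *)p - 1)->free = 1;  /* real allocators also coalesce */
}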
The allocator must respect alignment: most platforms require certain data types to be aligned to 8 or 16 bytes. This affects header size and block splitting. If blocks are not aligned, the CPU may raise exceptions or silently degrade performance. Many allocators use boundary tags or footers to make coalescing efficient. Others use segregated free lists, where blocks are grouped by size class to speed up allocation.
The OS provides two primary sources of memory: the heap via sbrk (contiguous growth) and memory mappings via mmap (arbitrary regions). Modern allocators often use mmap for large allocations and sbrk or mmap for the main heap. This reduces fragmentation and allows the OS to return large blocks to the system. Threaded programs complicate things: locking the whole heap on every allocation is expensive, so many allocators use per-thread caches or arenas to reduce contention.
Kernel allocators face similar problems but with different constraints. The kernel often uses slab or buddy allocators to manage fixed-size objects and page-level blocks. Slab allocators reduce fragmentation and improve cache locality by reusing objects of the same size. Buddy allocators manage pages in power-of-two sizes and can quickly split and coalesce. Understanding these approaches helps you reason about kernel memory and performance.
Fragmentation is the primary trade-off. First-fit allocation is fast but can leave small unusable gaps. Best-fit reduces wasted space but can be slow. The allocator must also consider security: use-after-free, double-free, and buffer overflows are common vulnerabilities. Many allocators add canaries, quarantines, or randomized layouts to mitigate attacks. These safety measures often trade performance for security.
Building your own allocator in a project forces you to make these trade-offs explicit. You will decide how to track metadata, how to search for free blocks, how to split and coalesce, and how to handle realloc. This experience turns malloc from a black box into a mental model you can use when debugging real programs.
In production systems, allocator behavior can be the difference between stable performance and catastrophic pauses. Many runtimes expose tuning knobs for arena counts, size classes, and background trimming. Instrumentation helps: track total allocated bytes, peak usage, fragmentation ratios, and allocation latency. These metrics let you correlate allocator choices with application behavior, which is a key skill for debugging memory-heavy services.
Allocator design also interacts with CPU caches: contiguous allocations improve locality, while scattered blocks increase cache misses and latency.
How this fits into the projects
- Project 3 builds a user-space allocator from scratch.
- Project 4 relies on understanding page sizes and alignment.
- Project 14 (xv6 study) exposes kernel allocators.
Definitions & key terms
- Heap: Region of memory used for dynamic allocation.
- Internal fragmentation: Wasted space inside allocated blocks.
- External fragmentation: Wasted space between allocated blocks.
- Free list: Data structure that tracks free blocks.
- Coalescing: Merging adjacent free blocks.
- Arena: A per-thread or per-heap allocation region.
Mental model diagram
[free block][alloc][free][free] -> split -> [free][alloc][alloc][free]
How it works (step by step)
- Request size is rounded up to alignment.
- Allocator finds a free block large enough.
- If block is larger, it is split into used + free.
- On free, allocator marks block and coalesces neighbors.
- Large requests may use mmap directly.
Minimal concrete example
struct block {
    size_t size;          /* payload size in bytes */
    int free;             /* 1 if the block is available */
    struct block *next;   /* next block in the free list */
};
Common misconceptions
- “malloc just asks the OS for memory every time.” (Allocators reuse blocks.)
- “Fragmentation is only about wasted bytes.” (It can prevent large allocs.)
- “realloc always moves memory.” (It may expand in place.)
Check-your-understanding questions
- Why must allocations be aligned?
- What is the difference between internal and external fragmentation?
- Why use mmap for large allocations?
Check-your-understanding answers
- Misalignment can break CPU access rules and slow performance.
- Internal is wasted space inside blocks; external is wasted gaps.
- It avoids large heap growth and allows returning memory to the OS.
Real-world applications
- High-performance allocators in databases and game engines
- Memory debugging tools that detect leaks and corruption
- Kernel slab caches for network buffers and file descriptors
Where you’ll apply it
- Projects 3, 4, 14
References
- Operating Systems: Three Easy Pieces, Ch. 17
- Computer Systems: A Programmer’s Perspective, Ch. 9
- C Interfaces and Implementations, Ch. 5-6
- The Linux Programming Interface, Ch. 7
Key insights
Allocators are policy engines: the data structures you choose define the trade-offs.
Summary
Dynamic allocation is the bridge between page-level memory and flexible program needs. Fragmentation, alignment, and concurrency are the core challenges.
Homework/Exercises to practice the concept
- Implement a bump allocator and measure its speed.
- Add free lists and coalescing, then test fragmentation.
- Compare first-fit and best-fit on a synthetic workload.
Solutions to the homework/exercises
- Use a single pointer and return aligned slices.
- Add headers and test with a randomized allocate/free sequence.
- Track free space and average allocation time for each strategy.
C6: Storage - File Systems and Crash Consistency
Fundamentals
File systems turn raw blocks into named, durable files. They provide directories, permissions, metadata, and a stable namespace. The OS caches file data in memory and must ensure that updates remain consistent across crashes and power loss. File systems use structures like superblocks, inodes, and allocation bitmaps to track what is stored where. Understanding file system layout and crash consistency helps you reason about persistence, performance, and reliability.
Deep Dive into the Concept
A block device exposes a linear array of fixed-size blocks. The file system organizes these blocks into metadata and data. The superblock describes the file system geometry (block size, number of blocks, free space). Inodes describe files: size, owner, permissions, and pointers to data blocks. Directories are special files that map names to inode numbers. The OS translates a path like /home/user/doc.txt by walking directories, reading inodes, and following data block pointers.
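To make the layout concrete, here is a hypothetical set of on-disk structures for a tiny file system; every field name and width here is a design choice for illustration, not a standard:

#include <stdint.h>
struct superblock {
    uint32_t magic;          /* identifies the file system */
    uint32_t block_size;     /* e.g. 4096 bytes */
    uint32_t nblocks;        /* total blocks on the device */
    uint32_t inode_start;    /* first block of the inode table */
    uint32_t bitmap_start;   /* first block of the free-space bitmap */
    uint32_t data_start;     /* first data block */
};
struct inode {
    uint32_t mode;           /* file type + permission bits */
    uint32_t size;           /* length in bytes */
    uint32_t direct[12];     /* direct data block pointers */
    uint32_t indirect;       /* block of further pointers for big files */
};
struct dirent {
    uint32_t inum;           /* inode number; 0 marks an unused slot */
    char     name[28];       /* fixed-width entry name */
};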
Free space management is central. Many file systems use bitmaps to track which blocks are free, while others use free lists or extent trees. Allocation strategies trade off locality and fragmentation. If you place file data near its inode, you improve cache locality. But if you always allocate the first free block, you may fragment the disk. This is why file systems often include heuristics or delayed allocation.
Crash consistency is the hardest part. A file system update often touches multiple blocks: directory entry, inode, data blocks, and allocation bitmap. If the system crashes mid-update, the on-disk structure can become inconsistent. Journaling solves this by writing intent to a log before making changes. After a crash, the file system replays the log to complete or roll back operations. Copy-on-write file systems (like ZFS or btrfs) avoid in-place updates entirely, writing new structures and flipping pointers atomically. Both approaches trade space and write amplification for consistency.
Caching and buffering are critical for performance. The page cache stores recently accessed file data in RAM. Write-back caches allow the OS to batch writes, improving throughput but increasing the risk window for data loss unless you call fsync. The OS also merges adjacent writes to reduce disk seeks. These caching decisions matter because the performance of a file system is often dominated by I/O scheduling and cache hit rates, not raw disk speed.
File systems must also handle permissions and durability semantics. Unix permissions (rwx for user/group/other) are a simple model, while ACLs provide more granularity. Durability is controlled by fsync and by mount options that choose between data journaling, ordered writes, and writeback modes. The OS must balance speed against safety, and it exposes those options so applications can choose.
A minimal file system project shows all of these choices. When you implement your own file system in a disk image, you choose the layout, allocation policy, and consistency guarantees. You will observe how small structural choices affect both performance and the ability to recover from crashes.
Modern storage adds another layer of complexity. SSDs reorder writes internally and use erase blocks, so the OS cannot assume on-disk ordering without explicit barriers. File systems add checksums and sequence numbers to detect silent corruption. Some systems use copy-on-write to avoid in-place updates entirely, trading write amplification for safety. These design decisions are not just academic; they determine whether your data survives power loss and how well your system performs under heavy write loads.
Checksums, scrubbers, and redundancy (such as RAID) complement file system consistency by detecting and correcting latent corruption over time.
How this fits into the projects
- Project 9 builds a file system from scratch.
- Project 10 explores device files and the “everything is a file” model.
- Project 12 builds Unix tools that depend on file semantics.
Definitions & key terms
- Superblock: Metadata describing the file system layout.
- Inode: Metadata record for a file.
- Extent: A contiguous range of blocks assigned to a file.
- Journaling: Logging updates to ensure crash recovery.
- fsync: System call that forces data to durable storage.
- Page cache: Memory cache for file data and metadata.
Mental model diagram
[superblock][inode table][bitmap][data blocks]
| | | |
+--> metadata free map file contents
How it works (step by step)
- Parse path into directory components.
- Read directory entries to find inode numbers.
- Read inode to find data block pointers.
- Read or write data blocks.
- Update metadata and journal changes if enabled.
Minimal concrete example
/ (inode 2)
|-- home (inode 10)
|-- user (inode 11)
|-- notes.txt (inode 42)
Common misconceptions
- “A file is stored in one contiguous block.” (Most files are fragmented.)
- “fsync is optional.” (Without it, data may be lost after a crash.)
- “Journaling guarantees no data loss.” (It guarantees metadata consistency.)
Check-your-understanding questions
- Why do file systems need a separate inode table?
- What problem does journaling solve?
- Why does caching improve performance but risk durability?
Check-your-understanding answers
- It separates metadata from file names and supports hard links.
- It makes multi-block updates recoverable after crashes.
- Cached writes may not reach disk immediately unless flushed.
Real-world applications
- Databases that rely on fsync for durability
- Log-structured systems optimized for SSDs
- Container images layered on top of file systems
Where you’ll apply it
- Projects 9, 10, 12
References
- Operating Systems: Three Easy Pieces, Ch. 39-45
- Understanding the Linux Kernel, Ch. 12
- The Linux Programming Interface, Ch. 14
Key insights
File systems are where correctness and performance collide; crash consistency is the core challenge.
Summary
File systems map names to blocks, maintain metadata, and protect consistency across failures. The details of allocation, caching, and journaling define both speed and safety.
Homework/Exercises to practice the concept
- Design a block layout for a 10 MB disk image.
- Implement a bitmap allocator for blocks.
- Simulate a crash mid-write and define recovery steps.
Solutions to the homework/exercises
- Reserve blocks for superblock, bitmap, inodes, and data.
- Use a bit array and scan for free bits.
- Replay a journal entry or roll back incomplete metadata.
C7: I/O Subsystem - Interrupts, Drivers, and Devices
Fundamentals
The I/O subsystem connects the OS to physical devices. Devices are slow and unpredictable compared to the CPU, so the OS relies on interrupts, buffering, and drivers to manage them. A device driver translates OS requests into hardware commands and reports completion via interrupts. The OS must schedule I/O, avoid blocking the kernel on slow devices, and provide a uniform interface to programs. Understanding interrupts and drivers is essential for explaining responsiveness, throughput, and device failures.
Deep Dive into the Concept
I/O devices are controlled through registers mapped into memory or I/O ports. A driver writes commands to these registers, the device performs work, and then it interrupts the CPU when done. The interrupt triggers a handler in the kernel, which acknowledges the device, queues any new work, and wakes waiting processes. This interrupt-driven model allows the CPU to do useful work while devices operate. Polling is an alternative where the CPU repeatedly checks device status, which is simpler but wastes cycles.
Modern systems rely on DMA (Direct Memory Access) for performance. DMA lets devices read or write memory directly without copying through the CPU. The OS sets up DMA descriptors, and the device transfers data in the background. This improves throughput but creates complexity: the OS must ensure the memory is mapped and pinned, and it must handle cache coherency on some architectures.
Drivers are often split into a fast “top half” (interrupt handler) and a slower “bottom half” or deferred work, so the OS can keep interrupts short. Linux uses mechanisms like softirqs and workqueues for this. Drivers expose devices as block devices, character devices, or network devices. Block devices (disks) have request queues and can be optimized by reordering I/O requests to minimize seeks. Character devices (serial ports) are stream-oriented and simpler.
User programs typically see devices through file descriptors and syscalls like open, read, write, and ioctl. This is the “everything is a file” model: disks, terminals, and sockets all look like files. The OS can multiplex I/O using select, poll, or epoll, letting a program wait for multiple devices at once. This is how high-performance servers handle many connections without dedicating a thread to each.
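A minimal sketch of waiting on two descriptors with select; poll and epoll follow the same idea with better scaling properties:

#include <stdio.h>
#include <sys/select.h>
#include <unistd.h>
int main(void) {
    int fds[2];
    pipe(fds);
    write(fds[1], "x", 1);              /* make the pipe readable */
    fd_set rset;
    FD_ZERO(&rset);
    FD_SET(STDIN_FILENO, &rset);
    FD_SET(fds[0], &rset);
    int maxfd = fds[0] > STDIN_FILENO ? fds[0] : STDIN_FILENO;
    if (select(maxfd + 1, &rset, NULL, NULL, NULL) > 0) {  /* block until ready */
        if (FD_ISSET(STDIN_FILENO, &rset)) printf("stdin is readable\n");
        if (FD_ISSET(fds[0], &rset))       printf("pipe is readable\n");
    }
    return 0;
}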
I/O subsystems are also where many failures occur: lost interrupts, misbehaving drivers, or slow devices can stall the system. The OS must be defensive, using timeouts and watchdogs to detect failures. In kernels, a single bad driver can crash the entire system, which is why kernel modules are carefully controlled and tested.
Your projects expose these ideas through device files, kernel modules, and network stack exploration. By simulating I/O and using tracing tools, you see how interrupt latency and buffering affect performance, and why driver correctness is as important as application correctness.
Modern kernels use hybrid techniques to balance latency and throughput. For example, network drivers may switch between interrupt-driven mode and polling (NAPI style) under heavy load to reduce interrupt storms. Storage stacks may merge and reorder requests to optimize device queues. Asynchronous I/O interfaces let applications submit many requests without blocking, improving throughput at the cost of complexity. These optimizations all rely on the same core ideas: explicit queues, careful synchronization, and clear ownership of buffers.
Latency-sensitive workloads care about interrupt latency and scheduling. Real-time systems may pin interrupts to specific CPUs, raise handler priorities, or avoid deferred work to meet deadlines. These techniques show that the I/O subsystem is not just about throughput; it is also about predictability under load.
How this fits into the projects
- Project 10 builds device files and demonstrates file-based device I/O.
- Project 15 builds a kernel module and a simple driver.
- Project 16 explores network I/O and packet handling.
Definitions & key terms
- Interrupt: Asynchronous signal from hardware to the CPU.
- DMA: Device-initiated memory transfer without CPU copying.
- Driver: Kernel code that controls a hardware device.
- Block device: Random-access device like disks.
- Character device: Stream device like a serial port.
- ioctl: Device-specific control call.
Mental model diagram
User -> syscall -> Driver -> Device
^ |
| v
wakeup interrupt
How it works (step by step)
- User program issues read/write on a device file.
- Kernel calls the driver with a request.
- Driver programs device registers or DMA.
- Device signals completion via interrupt.
- Kernel handler wakes waiting process.
Minimal concrete example
#include <fcntl.h>
#include <unistd.h>
int fd = open("/dev/ttyS0", O_RDWR);  /* serial port as a character device */
write(fd, "ping\n", 5);
Common misconceptions
- “Interrupts are always faster than polling.” (Polling can win for busy devices.)
- “Drivers are just user-space libraries.” (Drivers run in kernel mode.)
- “All devices are block devices.” (Many are streams or packets.)
Check-your-understanding questions
- Why is DMA faster than copying through the CPU?
- Why do interrupt handlers do minimal work?
- What is the difference between block and character devices?
Check-your-understanding answers
- DMA transfers data without CPU involvement and avoids extra copies.
- Long handlers block other interrupts and increase latency.
- Block devices are random access; character devices are streams.
Real-world applications
- High-throughput storage systems using NVMe
- Low-latency networking with interrupt coalescing
- Embedded devices with limited CPU budgets
Where you’ll apply it
- Projects 10, 15, 16
References
- Operating Systems: Three Easy Pieces, Ch. 36
- Linux Kernel Development, Ch. 13
- The Linux Programming Interface, Ch. 13-14
- docs.kernel.org: device drivers and interrupt handling
Key insights
I/O is the boundary between deterministic CPU code and nondeterministic devices.
Summary
The I/O subsystem coordinates devices using drivers, interrupts, buffering, and DMA. Correctness and performance depend on careful sequencing and defensive design.
Homework/Exercises to practice the concept
- Trace disk I/O for a file copy with strace and iostat.
- Write a user-space program that uses select on two file descriptors.
- Explain why a buggy driver can crash the system.
Solutions to the homework/exercises
- Observe open/read/write calls and correlate with disk stats.
- Use select to wait on stdin and a pipe simultaneously.
- Drivers run in kernel mode and can corrupt kernel memory.
C8: IPC and Composition - Pipes, Sockets, Signals, Networking
Fundamentals
Inter-process communication (IPC) lets processes exchange data and coordinate work. Unix provides multiple IPC mechanisms: pipes, signals, shared memory, message queues, and sockets. The Unix philosophy treats many of these as file descriptors, enabling composition with simple read/write operations. Networking extends IPC beyond a single machine through sockets and the TCP/IP stack. Understanding IPC is critical for building shells, servers, and scalable systems.
Deep Dive into the Concept
A pipe is a unidirectional byte stream connecting two processes. The parent creates it before fork; after the fork, each process closes the end it does not use, leaving one writer and one reader. This simple mechanism enables the classic Unix pipeline. Sockets generalize pipes: they are bidirectional endpoints that can communicate locally (Unix domain sockets) or across networks (INET sockets). Unix domain sockets behave like files, support file descriptor passing, and are fast for local IPC. INET sockets carry TCP or UDP traffic across networks and introduce reliability, congestion control, and ordering concerns.
Signals are asynchronous notifications delivered by the kernel. They are used for process control (SIGINT, SIGTERM), timers, and fault reporting (SIGSEGV). Signals interrupt a running thread and run a signal handler, so they require careful design to avoid race conditions. Shared memory is the fastest IPC method because it avoids copying, but it requires explicit synchronization to avoid races. Message queues and pipes include synchronization implicitly but add overhead and limited message structure.
The networking stack adds several layers. At the bottom, the network interface card (NIC) receives frames. The kernel processes them into IP packets, then into TCP or UDP segments, then delivers data to sockets. TCP adds reliability with sequence numbers, acknowledgments, retransmission, and congestion control. UDP is connectionless and faster but requires application-level reliability if needed. Socket APIs expose this complexity through functions like socket, bind, connect, send, and recv. The OS also provides select, poll, and epoll so a single process can manage many sockets concurrently.
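A minimal TCP echo server sketch showing the socket/bind/listen/accept sequence; the port is arbitrary and error handling is stripped for brevity:

#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
int main(void) {
    int srv = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr = {0};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(9000);           /* arbitrary port */
    bind(srv, (struct sockaddr *)&addr, sizeof addr);
    listen(srv, 8);
    for (;;) {
        int c = accept(srv, NULL, NULL);   /* one client at a time */
        char buf[512];
        ssize_t n;
        while ((n = read(c, buf, sizeof buf)) > 0)
            write(c, buf, n);              /* echo bytes back unchanged */
        close(c);
    }
}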
Composition is where IPC becomes a design philosophy. By exposing uniform file descriptor interfaces, Unix allows small programs to be composed into pipelines. A shell uses fork/exec to create processes, then wires them together with pipes. The same mechanisms build web servers, log processors, and data pipelines. The kernel enforces boundaries, but IPC provides the glue for collaboration.
Performance considerations are critical. Copying data between processes can dominate runtime, so many OSes provide zero-copy mechanisms like splice and sendfile. For networking, interrupt coalescing and offload features can reduce CPU overhead but add latency. These are trade-offs that high-performance systems must understand.
Flow control and backpressure are the hidden backbone of IPC. Pipes have finite buffers, and writes can block when the buffer is full, forcing producers to slow down. Sockets add more layers: TCP uses window sizes and acknowledgments to regulate send rate, while UDP leaves flow control to the application. Non-blocking I/O and event loops (select, poll, epoll) let a single process handle many endpoints, but they require careful message framing so one stream does not starve others. These mechanics are why “simple” IPC often becomes a system design exercise.
IPC also requires framing. Pipes and TCP streams are byte streams, so applications must define message boundaries with delimiters or length prefixes. Half-closed connections (shutdown write) can signal end-of-stream without closing the read side. These details determine whether protocols are robust, especially under partial reads and writes and when clients disconnect unexpectedly.
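A sketch of length-prefixed framing over a byte stream, assuming a 4-byte big-endian length header; the helper names are illustrative, and the short-read loop is the part most toy code gets wrong:

#include <arpa/inet.h>
#include <stddef.h>
#include <stdint.h>
#include <unistd.h>
/* Read exactly n bytes, tolerating short reads on a stream. */
static int read_full(int fd, void *buf, size_t n) {
    char *p = buf;
    while (n > 0) {
        ssize_t r = read(fd, p, n);
        if (r <= 0) return -1;         /* EOF or error mid-message */
        p += r; n -= r;
    }
    return 0;
}
/* One frame: 4-byte big-endian length, then the payload. */
static int read_message(int fd, void *buf, size_t cap) {
    uint32_t len;
    if (read_full(fd, &len, 4) < 0) return -1;
    len = ntohl(len);
    if (len > cap) return -1;          /* refuse oversized frames */
    return read_full(fd, buf, len) < 0 ? -1 : (int)len;
}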
These edge cases are where many IPC bugs live in practice.
How this fits into the projects
- Project 7 builds a shell with pipes and redirection.
- Project 11 builds IPC mechanisms and a small chat system.
- Project 16 explores network sockets and packet flow.
Definitions & key terms
- Pipe: Unidirectional byte stream between processes.
- Socket: Communication endpoint for local or network communication.
- Signal: Asynchronous notification from the kernel.
- Shared memory: Memory region mapped into multiple processes.
- TCP: Reliable, ordered byte stream protocol.
- UDP: Connectionless datagram protocol.
Mental model diagram
Process A -> pipe/socket -> Process B
| ^
+-- signal/notify ------+
How it works (step by step)
- Create IPC endpoint (pipe, socket, shared memory).
- Fork or connect processes to the endpoint.
- Read/write data through file descriptors.
- Kernel queues data and wakes receivers.
- For sockets, kernel routes packets through TCP/IP.
Minimal concrete example
int fd[2];
pipe(fd);
if (fork() == 0) {
    dup2(fd[1], STDOUT_FILENO);  /* child: stdout -> pipe write end */
    close(fd[0]); close(fd[1]);  /* drop the originals after dup2 */
    execlp("ls", "ls", NULL);
} else {
    dup2(fd[0], STDIN_FILENO);   /* parent: stdin <- pipe read end */
    close(fd[0]); close(fd[1]);  /* an open write end would keep wc waiting */
    execlp("wc", "wc", "-l", NULL);
}
Common misconceptions
- “Signals are reliable.” (Signals can be coalesced or missed.)
- “Shared memory is always the best.” (It requires careful synchronization.)
- “Sockets are only for networks.” (Unix domain sockets are local IPC.)
Check-your-understanding questions
- Why are pipes unidirectional?
- Why does TCP need congestion control?
- Why can a signal handler be dangerous?
Check-your-understanding answers
- A single direction keeps the kernel buffer and blocking semantics simple; two-way flows just use two pipes or a socket pair.
- Without congestion control, networks collapse under load.
- It can interrupt code at unsafe points, causing races.
Real-world applications
- Shell pipelines and data processing chains
- Microservices communicating over sockets
- Event-driven servers using epoll
Where you’ll apply it
- Projects 7, 11, 16
References
- The Linux Programming Interface, Ch. 44-56
- Advanced Programming in the UNIX Environment, Ch. 14-16
- TCP/IP Illustrated, Vol 1, Ch. 1-6
- UNIX Network Programming, Vol 1, Ch. 1-6
Key insights
IPC is the OS mechanism that turns isolated processes into systems.
Summary
IPC mechanisms let processes cooperate. Pipes and sockets provide simple composition, while shared memory offers speed with higher synchronization cost.
Homework/Exercises to practice the concept
- Build a pipeline of three commands and trace their file descriptors.
- Implement a simple echo server using TCP sockets.
- Compare throughput of pipes vs shared memory for a large buffer.
Solutions to the homework/exercises
- Use `lsof` or `/proc/<pid>/fd` to inspect the pipeline.
- Create a socket, bind, listen, accept, recv, send.
- Measure copy overhead in a microbenchmark.
C9: Security, Isolation, and Virtualization
Fundamentals
The OS enforces security by isolating processes and controlling access to resources. Traditional Unix permissions (user, group, other) are the baseline, but modern systems add capabilities, ACLs, mandatory access control, and sandboxing. Virtualization extends isolation across entire OS instances using hypervisors or containers. Containers rely on namespaces and cgroups to isolate processes while sharing the kernel. Understanding these mechanisms explains how multi-tenant systems remain safe and how containers differ from virtual machines.
Deep Dive into the Concept
Access control starts with user and group IDs. The kernel checks permissions on every file access and system call that modifies privileged state. Capabilities split the all-powerful root privilege into smaller pieces, so a process can be allowed to bind to low ports without full root privileges. ACLs add fine-grained permissions beyond the simple rwx model. Mandatory Access Control systems like SELinux add a policy layer that can deny access even if traditional permissions allow it. These mechanisms form a layered defense, and each is enforced at syscall boundaries.
Namespaces provide isolation by creating separate views of system resources. PID namespaces give processes their own process tree, mount namespaces give separate filesystem views, and network namespaces give isolated network stacks. A container is a set of processes that share a kernel but live in their own namespaces. Cgroups (control groups) impose resource limits and accounting: CPU shares, memory caps, I/O throttling. Together, namespaces and cgroups are the core of containerization. They are not virtualization in the hardware sense; they are isolation within a single kernel.
Virtual machines, by contrast, emulate hardware. A hypervisor runs at a higher privilege level and presents virtual CPUs, memory, and devices to guest OSes. Type 1 hypervisors run directly on hardware; type 2 hypervisors run on a host OS. Hardware virtualization extensions (Intel VT-x, AMD-V) make this efficient by allowing the guest to run near-native speeds while trapping privileged operations to the hypervisor. The trade-off is overhead and complexity: VMs provide stronger isolation but require a full OS per guest.
Security also involves sandboxing untrusted code. Mechanisms like seccomp filter syscalls, reducing the attack surface. Combined with namespaces, seccomp can turn a normal process into a tightly constrained sandbox. This is how browsers and container runtimes limit damage from compromised processes. The kernel is the ultimate enforcement point, so kernel bugs are high-impact, and defense-in-depth is essential.
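As a minimal taste of seccomp, the original strict mode can be enabled with a single prctl call; real sandboxes use the more flexible BPF filter mode instead. A sketch (on modern glibc, getpid() here issues a real syscall, so the kernel kills the process at that line):

#include <unistd.h>
#include <sys/prctl.h>
#include <linux/seccomp.h>

int main(void) {
    prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT);
    write(STDOUT_FILENO, "ok\n", 3);   /* read/write/_exit stay allowed */
    getpid();                          /* any other syscall: SIGKILL */
    return 0;                          /* never reached */
}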
Isolation is not just about security; it is also about reliability. Resource limits prevent one process from starving others. Namespaces allow multiple applications to run with conflicting configurations. Cgroups provide observability by letting you see resource usage per container or process group. When you build a minimal container in this guide, you will see how small the underlying mechanism is compared to the heavy ecosystem built on top of it.
Security in this space is not automatic. Misconfigured namespaces or overly broad capabilities can allow container escapes. Kernel attack surface grows when you enable many subsystems inside containers. This is why production systems combine cgroups, namespaces, seccomp filters, and sometimes additional isolation like user namespaces or VM-based containers. Understanding these layers helps you decide which isolation level is appropriate for a workload and why containers are fast but not always equivalent to VMs in threat models.
Virtual machines add overhead but provide stronger isolation boundaries, especially when combined with hardware-assisted virtualization and device passthrough. The trade-off is a heavier footprint and slower startup compared to containers.
How this fits on projects
- Project 10 uses file permissions and device access control.
- Project 13 builds a container using namespaces and cgroups.
- Project 15 explores kernel modules and privileged code.
Definitions & key terms
- Capabilities: Split root privileges into discrete rights.
- ACL: Access Control List for fine-grained permissions.
- Namespace: Isolated view of a system resource.
- cgroup: Resource control and accounting for process groups.
- Hypervisor: Software that runs virtual machines.
- seccomp: Syscall filtering mechanism.
Mental model diagram
Container = namespaces + cgroups + processes
VM = hypervisor + virtual hardware + guest OS
How it works (step by step)
- User requests a privileged action via syscall.
- Kernel checks UID/GID, capabilities, and policies.
- Namespace and cgroup limits are applied.
- If allowed, kernel performs the action.
- If denied, kernel returns an error (EPERM/EACCES).
Minimal concrete example
unshare(CLONE_NEWNS | CLONE_NEWPID | CLONE_NEWNET);
// Needs privilege (CAP_SYS_ADMIN). Mount and network namespaces apply to the
// caller immediately; the new PID namespace applies only to children forked
// after this call, not to the caller itself.
Common misconceptions
- “Containers are just lightweight VMs.” (They share the host kernel.)
- “Root inside a container is real root.” (It is namespaced and constrained.)
- “Capabilities are rarely used.” (They are core to modern security.)
Check-your-understanding questions
- What is the key difference between a VM and a container?
- Why do capabilities exist when root already exists?
- How do cgroups enforce memory limits?
Check-your-understanding answers
- VMs virtualize hardware; containers isolate processes in one kernel.
- Capabilities reduce the power of privileged processes to a minimum.
- The kernel tracks usage and denies allocations past the limit.
Real-world applications
- Kubernetes and container orchestration
- Multi-tenant servers with strict resource limits
- Sandboxing untrusted plugins and code
Where you’ll apply it
- Projects 10, 13, 15
References
- Operating Systems: Three Easy Pieces, Ch. 14
- Linux man-pages: namespaces(7), capabilities(7)
- docs.kernel.org: cgroup v2 documentation
- Android Open Source Project: Kernel Overview
Key insights
Isolation is enforced at syscall boundaries, and containers are policy on top of kernel primitives.
Summary
Security and virtualization are built from small, enforceable kernel mechanisms. Namespaces and cgroups isolate processes, while hypervisors virtualize hardware for stronger boundaries.
Homework/Exercises to practice the concept
- Use `unshare` to create a new PID namespace and observe process IDs.
- Create a cgroup with a memory limit and test enforcement.
- Compare a container and VM by listing running processes inside each.
Solutions to the homework/exercises
- Run `unshare -p -f --mount-proc bash` and observe PID 1 inside.
- Write a small program that allocates memory until it is killed.
- In a VM you see a full OS; in a container you see host kernel processes.
Glossary
- ABI: Application Binary Interface; calling conventions and binary formats.
- Bootloader: Program that loads the kernel and prepares CPU state.
- Context switch: Saving/restoring CPU state between threads.
- cgroup: Kernel mechanism for resource limits and accounting.
- Copy-on-write: Share memory until a write triggers a copy.
- DMA: Direct Memory Access; device transfers without CPU copies.
- ELF: Executable and Linkable Format for binaries.
- FUSE: Filesystem in Userspace; build a FS without kernel code.
- GDT: Table defining memory segments in protected mode.
- Hypervisor: Software layer that runs virtual machines.
- Inode: File metadata record in Unix file systems.
- Interrupt: Asynchronous hardware signal to the CPU.
- IPC: Inter-process communication (pipes, sockets, shared memory).
- Kernel: Privileged core managing CPU, memory, and devices.
- Namespace: Isolated view of a system resource (PID, mount, net).
- Page fault: Trap when a page is missing or access is invalid.
- PCB: Process Control Block; kernel data for a process.
- Pipe: One-way byte stream between processes.
- Syscall: Controlled entry into the kernel for services.
- TLB: Cache of virtual-to-physical translations.
- vDSO: Shared page for fast kernel data without syscalls.
Why Operating Systems Matter
The Modern Problem It Solves
Modern systems run many programs at once, across local machines, clouds, and mobile devices. Without an OS, every program would need to manage hardware directly, leading to chaos, poor security, and fragile software. The OS provides isolation, fairness, and a uniform interface that makes software portable and reliable.
Recent real-world context (with sources and year):
- StatCounter global mobile OS share (Nov 2025): Android about 71.9% and iOS about 27.6%.
- StatCounter global desktop OS share (Nov 2025): Windows about 69.5% worldwide.
- POSIX.1-2024 is the latest standardized interface for portable Unix-style systems.
- Android is built on the upstream Linux kernel and depends on standard kernel subsystems.
Without OS With OS
+------------+ +-----------------------+
| Program A | | Scheduler + VM + FS |
| Program B | | Isolation + Security |
| Program C | | Stable APIs |
+------------+ +-----------------------+
| |
v v
Hardware chaos Reliable computing
Context & Evolution (History)
- 1969-1973: Unix emerges at Bell Labs and is rewritten in C.
- 1980s-1990s: POSIX standardizes a portable OS interface.
- 1991: Linux introduces a Unix-like kernel with open development.
- 2000s-2020s: Containers and cloud workloads push OS isolation features.
The lesson: OS abstractions that are simple and composable tend to survive. The Unix model is still visible inside modern Linux, macOS, and Android.
Concept Summary Table
| Code | Concept | What you must internalize |
|---|---|---|
| C1 | Bootstrapping and CPU Modes | Boot flow, CPU modes, loader invariants |
| C2 | Kernel Boundary and Syscalls | Privilege transitions, ABI, validation |
| C3 | Execution Model | Processes, threads, scheduling, synchronization |
| C4 | Virtual Memory and Paging | Translation, TLB, page faults, replacement |
| C5 | Dynamic Allocation | Heap design, fragmentation, alignment |
| C6 | File Systems | Inodes, allocation, journaling, crash safety |
| C7 | I/O Subsystem | Interrupts, drivers, DMA, device models |
| C8 | IPC and Composition | Pipes, sockets, signals, networking |
| C9 | Security and Virtualization | Permissions, namespaces, cgroups, VMs |
Project-to-Concept Map
| Project | Concepts Applied |
|---|---|
| 1. Bare Metal “Hello” | C1, C7 |
| 2. Bootloader + Kernel | C1, C2 |
| 3. Memory Allocator | C5, C4 |
| 4. VM Simulator | C4 |
| 5. Scheduler Simulator | C3 |
| 6. Fork and Exec | C2, C3 |
| 7. Unix Shell | C2, C3, C8 |
| 8. Syscall Interface | C2 |
| 9. File System | C6 |
| 10. Device Files | C6, C7, C9 |
| 11. IPC Toolkit | C3, C8 |
| 12. Unix Tools | C6, C8 |
| 13. Containers | C9, C2 |
| 14. xv6 Study | C1-C6 |
| 15. Kernel Module | C2, C7, C9 |
| 16. Network Stack | C7, C8 |
Deep Dive Reading by Concept
| Concept | Book & Chapter | Why It Helps |
|---|---|---|
| C1 Bootstrapping | OSTEP Ch. 36; Write Great Code Vol 2 Ch. 3 | Boot flow and CPU modes |
| C2 Syscalls | TLPI Ch. 3; APUE Ch. 2-3 | Syscall ABI and process APIs |
| C3 Execution Model | OSTEP Ch. 4, 7-9, 26-31 | Scheduling and concurrency |
| C4 Virtual Memory | OSTEP Ch. 13-22; CS:APP Ch. 9 | Paging and address translation |
| C5 Allocation | OSTEP Ch. 17; C Interfaces and Implementations Ch. 5 | Allocator design |
| C6 File Systems | OSTEP Ch. 39-45; Understanding the Linux Kernel Ch. 12 | Layout and journaling |
| C7 I/O | OSTEP Ch. 36; Linux Kernel Development Ch. 13 | Interrupts and drivers |
| C8 IPC/Networking | TLPI Ch. 44-56; UNIX Network Programming Ch. 1-6 | IPC and sockets |
| C9 Security/Virtualization | Modern Operating Systems Ch. 9-10; Linux docs | Permissions and isolation |
Quick Start
Day 1
- Read OSTEP Ch. 1-2 and skim Ch. 36 (boot overview).
- Install QEMU, build tools, and run a minimal C program.
- Draw the user/kernel boundary and annotate where syscalls occur.
Day 2
- Build Project 1 and boot it in QEMU.
- Use `strace` on a simple command and categorize syscalls.
- Write a 1-page summary: “What does the OS provide that hardware does not?”
Recommended Learning Paths
1) Foundations Path: Projects 1 -> 2 -> 3 -> 4 -> 5 -> 6
2) Unix Tooling Path: Projects 6 -> 7 -> 11 -> 12 -> 8
3) Kernel Focus Path: Projects 14 -> 15 -> 16 -> 13
4) Performance Path: Projects 3 -> 4 -> 5 -> 9 -> 16
Success Metrics
- I can explain the full boot chain from firmware to kernel entry.
- I can trace a syscall from user code to kernel handler and back.
- I can explain why a program slows down when its working set exceeds RAM.
- I can implement a scheduler and measure fairness metrics.
- I can build a file system image and recover it after a simulated crash.
- I can build a basic container with namespaces and cgroups.
Appendix: Tooling and Debugging Cheatsheet
- Boot and kernel debugging: `qemu-system-x86_64 -s -S`, `gdb`, `objdump -d`
- Syscall tracing: `strace -f`, `ltrace`
- Process inspection: `ps`, `top`, `htop`, `/proc/<pid>/status`
- File system inspection: `hexdump`, `xxd`, `debugfs`, `fsck`
- I/O profiling: `iostat`, `iotop`, `perf`
- Network inspection: `tcpdump`, `ss`, `ip`
Project Overview Table
| # | Project | Focus | Difficulty | Outcome |
|---|---|---|---|---|
| 1 | Bare Metal “Hello” | Boot, real mode | Expert | Bootable 512B image |
| 2 | Bootloader + Kernel | Mode switch, loader | Expert | Protected-mode kernel stub |
| 3 | Memory Allocator | Heap, fragmentation | Advanced | malloc/free library |
| 4 | VM Simulator | Paging, faults | Expert | Trace-driven VM simulator |
| 5 | Scheduler Simulator | Policies, metrics | Advanced | Scheduler comparison tool |
| 6 | Fork and Exec | Process model | Intermediate | Process explorer |
| 7 | Unix Shell | Pipes, jobs | Advanced | Featureful shell |
| 8 | Syscall Interface | ABI, tracing | Expert | Raw syscall + tracer |
| 9 | File System | Persistence | Expert | FUSE-mounted FS |
| 10 | Device Files | I/O model | Advanced | Custom /dev devices |
| 11 | IPC Toolkit | Pipes, shm | Intermediate | IPC demos + chat |
| 12 | Unix Tools | Composition | Intermediate | Mini coreutils |
| 13 | Containers | Namespaces, cgroups | Expert | Minimal container runtime |
| 14 | xv6 Study | Kernel architecture | Advanced | Annotated xv6 + changes |
| 15 | Kernel Module | Drivers | Expert | Loadable module |
| 16 | Network Stack | Packets, sockets | Expert | Packet analyzer/mini stack |
Project List
Project 1: Bare Metal “Hello World” (No OS)
Real World Outcome
You boot a raw 512-byte image in QEMU and see your message printed on the virtual screen, with no OS, no libc, and no runtime. The CPU is executing your instructions directly.
$ nasm -f bin boot.asm -o boot.bin
$ ls -l boot.bin
-rw-r--r-- 1 user staff 512 Jan 01 12:00 boot.bin
$ qemu-system-x86_64 -drive format=raw,file=boot.bin
[QEMU window]
Hello, OS!
_
The Core Question You Are Answering
What is the smallest program that can make real hardware do visible work without any OS support, and why must it be structured in a precise way?
Concepts You Must Understand First
- C1 Bootstrapping and CPU modes (OSTEP Ch. 36)
- BIOS boot signature and 0x7C00 load address (Write Great Code Vol 2 Ch. 3)
- Memory-mapped I/O (CS:APP Ch. 6)
- Instruction size limits and padding (Art of Assembly Ch. 13)
Questions to Guide Your Design
- Why must the boot sector be exactly 512 bytes?
- Do you want BIOS interrupts or direct VGA memory writes?
- What registers are safe to assume on entry?
- How do you position the cursor or print multiple lines?
- How will you halt the CPU cleanly after output?
Thinking Exercise
Draw a 25x80 text screen. Compute the memory offset for row 3, column 10 in VGA text mode and show the 2-byte value for a white “A” on black.
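One worked check for this exercise, assuming zero-indexed rows and columns (shift by one if you count from 1):

/* Each text cell is 2 bytes: low byte = ASCII, high byte = attribute. */
unsigned offset = (3 * 80 + 10) * 2;       /* = 500 = 0x1F4 bytes past 0xB8000 */
unsigned short cell = (0x0F << 8) | 'A';   /* 0x0F41: white-on-black 'A' */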
The Interview Questions They Will Ask
- What does the 0x55AA signature do?
- Why is `org 0x7C00` required?
- What is the difference between BIOS output and VGA memory output?
- What CPU mode are you in when BIOS loads the boot sector?
Hints in Layers
Layer 1 - Minimal boot
- Write an infinite loop and confirm QEMU does not reboot.
Layer 2 - BIOS teletype
- Use `int 0x10` with AH=0x0E to print characters.
Layer 3 - Direct VGA
- Write word values to 0xB8000 for text mode output.
Layer 4 - Polished output
- Clear the screen and set the cursor position for readability.
Books That Will Help
| Book | Chapters | Why It Helps |
|---|---|---|
| OSTEP | 36 | Boot sequence overview |
| Write Great Code Vol 2 | 3 | Real mode and segmentation |
| Art of Assembly | 13 | BIOS interrupts |
| CS:APP | 6 | Memory-mapped I/O |
Common Pitfalls & Debugging
Problem: “QEMU reboots instantly”
- Why: Missing or incorrect 0x55AA signature.
- Fix: Pad to 510 bytes and append `0x55 0xAA`.
- Quick test: `tail -c 2 boot.bin | hexdump -C` shows `55 aa`.
Problem: “No text appears”
- Why: Writing to wrong address or using 32-bit instructions in real mode.
- Fix: Use 16-bit instructions and VGA base 0xB8000.
- Quick test: Write a single character to offset 0 and observe.
Definition of Done
- `boot.bin` is exactly 512 bytes and ends with 0x55AA
- QEMU boots and prints your message reliably
- Output appears without using any OS or libc
- You can explain each instruction in your boot sector
Project 2: Bootloader that Loads a Kernel
Real World Outcome
You build a two-stage bootloader that loads a kernel image from disk, switches to protected mode, and transfers control to C code. The screen shows a boot log and a kernel banner.
$ make
nasm -f bin stage1.asm -o stage1.bin
nasm -f bin stage2.asm -o stage2.bin
gcc -m32 -ffreestanding -c kernel.c -o kernel.o
ld -m elf_i386 -T linker.ld -o kernel.bin kernel.o
cat stage1.bin stage2.bin kernel.bin > os.bin
$ qemu-system-x86_64 -drive format=raw,file=os.bin
Bootloader: reading kernel...
Bootloader: enabling A20...
Bootloader: entering protected mode...
Kernel: hello from 32-bit C
The Core Question You Are Answering
How do you safely transition from firmware to your own kernel, set up CPU protection, and transfer control to higher-level code?
Concepts You Must Understand First
- C1 Bootstrapping and CPU modes (OSTEP Ch. 36)
- GDT and segmentation basics (Write Great Code Vol 2 Ch. 3)
- Disk reads via BIOS INT 0x13 (Art of Assembly Ch. 13)
- Kernel image layout and linker scripts (CS:APP Ch. 7)
Questions to Guide Your Design
- Where will you load the kernel so you do not overwrite the bootloader?
- How many sectors must you read, and how do you compute that?
- What GDT entries are required for a flat memory model?
- How will you set up the stack before entering C?
- What happens if the A20 line is not enabled?
Thinking Exercise
Sketch a memory map from 0x0000 to 0x20000 showing the boot sector at 0x7C00 and a kernel loaded at 0x1000 or 0x100000. Identify overlaps.
The Interview Questions They Will Ask
- Why is a far jump required after setting CR0.PE?
- What is the purpose of the GDT?
- How does the bootloader decide where the kernel starts?
- Why is `-ffreestanding` used in kernel builds?
Hints in Layers
Layer 1 - Stage 1 loads stage 2
- Read a fixed number of sectors into memory and jump.
Layer 2 - Enter protected mode
- Load a minimal GDT, set CR0.PE, far jump.
Layer 3 - Call C code
- Set up a stack and call a C function from assembly.
Layer 4 - ELF-aware loader
- Parse ELF headers and load segments to correct addresses.
Books That Will Help
| Book | Chapters | Why It Helps |
|---|---|---|
| OSTEP | 36 | Boot flow overview |
| Art of Assembly | 8, 13 | Mode switching and BIOS I/O |
| CS:APP | 7 | Linking and object layout |
| Linux Kernel Development | 13 | Low-level I/O context |
Common Pitfalls & Debugging
Problem: “Bootloader hangs after enabling protected mode”
- Why: GDT descriptor or far jump is incorrect.
- Fix: Verify selector values and GDT layout.
- Quick test: Use QEMU `-d int` and step through the transition.
Problem: “Kernel prints garbage”
- Why: Stack not initialized or wrong segment selectors.
- Fix: Set SS:ESP before calling C.
- Quick test: Print a constant after the mode switch.
Definition of Done
- Stage 1 loads stage 2 and the kernel reliably
- CPU enters protected mode without triple fault
- Kernel prints a C message after boot
- You can explain the GDT and mode switch sequence
Project 3: Memory Allocator (malloc from Scratch)
Real World Outcome
You implement malloc, free, and realloc with a free list and coalescing, and you can preload your allocator into real programs.
$ gcc -fPIC -shared -o libmymalloc.so mymalloc.c
$ LD_PRELOAD=./libmymalloc.so ./alloc_test
[mymalloc] malloc(1024) -> 0x7f1234000010
[mymalloc] malloc(2048) -> 0x7f1234000420
[mymalloc] free(0x7f1234000010)
[mymalloc] malloc(512) -> 0x7f1234000010 (reused)
[mymalloc] coalesce: merged 2 blocks
Heap stats:
total: 65536 bytes
in-use: 2560 bytes
free: 62976 bytes
fragmentation: 12%
The Core Question You Are Answering
How does a runtime convert raw pages into safe, reusable allocations with minimal overhead and fragmentation?
Concepts You Must Understand First
- C5 Dynamic allocation (OSTEP Ch. 17)
- Heap vs stack and alignment (C Interfaces and Implementations Ch. 5)
- sbrk vs mmap (TLPI Ch. 7)
- Fragmentation metrics (OSTEP Ch. 17)
Questions to Guide Your Design
- Will your size field include the header or only payload?
- How will you align allocations (8 or 16 bytes)?
- When do you split a free block, and how small can the remainder be?
- How will you coalesce neighbors efficiently?
- What is your strategy for large allocations?
Thinking Exercise
Simulate this sequence on paper: malloc(32), malloc(64), free(first), malloc(16), free(second). Draw the free list after each step.
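To make the design questions concrete, here is one possible block header layout; the field names and the 16-byte alignment choice are illustrative, not prescriptive:

#include <stddef.h>

typedef struct header {
    size_t size;           /* payload bytes, rounded up to alignment */
    int    is_free;        /* nonzero if the block is on the free list */
    struct header *next;   /* next block in address order (eases coalescing) */
} header_t;

/* Round a request up to a 16-byte boundary. */
#define ALIGN16(n) (((n) + 15) & ~(size_t)15)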
The Interview Questions They Will Ask
- What is the difference between internal and external fragmentation?
- Why does `free` not need a size argument?
- How would you make the allocator thread-safe?
- Why is alignment required for correctness?
Hints in Layers
Layer 1 - Bump allocator
- Allocate by moving a pointer forward; ignore free.
Layer 2 - Free list
- Add block headers and support freeing blocks.
Layer 3 - Split and coalesce
- Split large blocks and merge adjacent free blocks.
Layer 4 - realloc and stats
- Add realloc, fragmentation metrics, and debug logging.
Books That Will Help
| Book | Chapters | Why It Helps |
|---|---|---|
| OSTEP | 17 | Allocator fundamentals |
| CS:APP | 9 | Heap internals |
| TLPI | 7 | sbrk and mmap |
| C Interfaces and Implementations | 5 | Alignment and layout |
Common Pitfalls & Debugging
Problem: “malloc returns misaligned pointers”
- Why: Header size not aligned to required boundary.
- Fix: Round block sizes up to alignment.
- Quick test: Allocate a `double` and check address % 8 == 0.
Problem: “free causes crashes later”
- Why: Coalescing merges wrong neighbors or corrupts list.
- Fix: Add boundary tags or verify with heap walking.
- Quick test: Run with AddressSanitizer and compare.
Definition of Done
- Alloc/free works on randomized tests without corruption
- Coalescing reduces fragmentation on a sample workload
- realloc passes shrink and grow cases
- You can explain your allocator policy choices
Project 4: Virtual Memory Simulator
Real World Outcome
You build a trace-driven simulator that translates virtual addresses, handles page faults, and compares replacement algorithms.
$ ./vm_sim workload.trace --pages=64 --frames=8 --algo=lru
VM Simulator
============
Pages: 64 (256 KB) Frames: 8 (32 KB)
Algorithm: LRU
Access 0x1234 -> page 1 -> frame 0 [fault]
Access 0x1240 -> page 1 -> frame 0 [hit]
Access 0x9ABC -> page 9 -> frame 1 [fault]
...
Stats:
accesses: 10000
page faults: 812 (8.12%)
TLB hits: 8234 (82.34%)
FIFO faults: 1012
LRU faults: 812
Clock faults: 855
The Core Question You Are Answering
How does the OS translate addresses and decide what stays in memory when RAM is smaller than the working set?
Concepts You Must Understand First
- C4 Virtual memory and paging (OSTEP Ch. 13-22)
- TLB and caching (Hennessy and Patterson Ch. 5)
- Replacement policies (OSTEP Ch. 22)
- Locality of reference (CS:APP Ch. 6)
Questions to Guide Your Design
- How many bits are page number vs offset?
- What fields must each PTE contain (present, dirty, accessed)?
- How will you approximate LRU efficiently?
- How will you validate translations against expected output?
- What metrics matter most for comparisons?
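The first question has a mechanical answer once you fix the page size. A sketch assuming 4 KB pages, with vaddr and frame as the assumed inputs:

#define PAGE_SIZE  4096UL
#define PAGE_SHIFT 12                               /* log2(PAGE_SIZE) */

unsigned long vpn    = vaddr >> PAGE_SHIFT;         /* virtual page number */
unsigned long offset = vaddr & (PAGE_SIZE - 1);     /* byte within the page */
unsigned long paddr  = (frame << PAGE_SHIFT) | offset;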
Thinking Exercise
Given frames=4 and trace: 1,2,3,4,1,2,5,1,2,3,4,5. Compute FIFO and LRU faults by hand.
The Interview Questions They Will Ask
- Why are page tables multi-level on 64-bit systems?
- What is the difference between a TLB miss and a page fault?
- Why does locality make LRU outperform FIFO?
- What is the working set model?
Hints in Layers
Layer 1 - Direct page table
- Implement page number extraction and translation.
Layer 2 - FIFO
- Add a queue of frames and evict the oldest.
Layer 3 - LRU or Clock
- Track access times or use an accessed bit.
Layer 4 - TLB simulation
- Add a small translation cache and track hit rate.
Books That Will Help
| Book | Chapters | Why It Helps |
|---|---|---|
| OSTEP | 13-22 | Paging, faults, replacement |
| CS:APP | 9 | VM fundamentals |
| Hennessy and Patterson | 5 | TLB and caching |
Common Pitfalls & Debugging
Problem: “All translations are wrong”
- Why: Page size mask or shift is incorrect.
- Fix: Verify page size and bit math on small examples.
- Quick test: Translate a known address by hand.
Problem: “LRU behaves like FIFO”
- Why: Access timestamps not updated on hits.
- Fix: Update access metadata for every access.
- Quick test: Run a locality-heavy trace.
Definition of Done
- Translation works for manual test cases
- Replacement algorithms produce distinct fault rates
- TLB hit rate is computed and reported
- You can explain why LRU beats FIFO on locality traces
Project 5: Process Scheduler Simulator
Real World Outcome
You build a scheduler simulator that compares FIFO, SJF, Round Robin, MLFQ, and CFS on real workloads and outputs fairness metrics.
$ ./schedsim workload.txt --algo=mlfq
Scheduler Simulator
===================
Workload: 5 processes
Algorithm: MLFQ (3 queues, quanta 10/20/40 ms)
Timeline:
0ms P1
10ms P2
20ms P3
30ms P4
40ms P5
...
Results:
PID Turnaround Response Wait
P1 180ms 0ms 80ms
P2 110ms 10ms 40ms
P3 290ms 20ms 90ms
P4 40ms 30ms 30ms
P5 120ms 40ms 70ms
Average response: 20ms
The Core Question You Are Answering
How should the OS decide which process runs next so the system is both fair and responsive?
Concepts You Must Understand First
- C3 Execution model and scheduling (OSTEP Ch. 4, 7-9)
- Preemption and time slicing (OSTEP Ch. 7)
- Scheduling metrics (response, turnaround, wait)
- CFS overview (Linux Kernel Development Ch. 4)
Questions to Guide Your Design
- How will you model CPU vs I/O bursts?
- When two events occur at the same time, which wins?
- How do you account for context switch overhead?
- How will you compute response time for Round Robin?
- What does fairness mean in your simulator?
Thinking Exercise
For processes P1(5), P2(2), P3(8) arriving at time 0, compute FIFO, SJF, and RR (q=2) by hand.
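The standard metric definitions, useful for both the hand exercise and the simulator (all times are per process; variable names are illustrative):

/* t_arrive: arrival, t_first: first time scheduled,
   t_done: completion, t_burst: total CPU time required. */
turnaround = t_done  - t_arrive;    /* total time in the system */
response   = t_first - t_arrive;    /* delay before first CPU time */
waiting    = turnaround - t_burst;  /* time spent ready but not running */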
The Interview Questions They Will Ask
- Why can SJF starve long-running jobs?
- What is the difference between preemptive and non-preemptive scheduling?
- Why does MLFQ use priority boosts?
- How does CFS approximate fairness?
Hints in Layers
Layer 1 - FIFO
- Implement a simple queue and no preemption.
Layer 2 - SJF and RR
- Add burst times and a time quantum.
Layer 3 - Metrics
- Track start and completion times for each process.
Layer 4 - MLFQ or CFS
- Add multiple queues or a virtual runtime tree.
Books That Will Help
| Book | Chapters | Why It Helps |
|---|---|---|
| OSTEP | 4, 7-9 | Scheduling fundamentals |
| Linux Kernel Development | 4 | CFS design |
Common Pitfalls & Debugging
Problem: “Response time is negative”
- Why: Start time overwritten on reschedule.
- Fix: Record first-run time only once.
- Quick test: Validate with a 2-process workload.
Problem: “Blocked processes never resume”
- Why: I/O completion events are not re-queued.
- Fix: Move blocked processes back to ready at completion time.
- Quick test: Add a single I/O burst and verify.
Definition of Done
- FIFO, SJF, and RR produce correct metrics
- MLFQ or CFS is implemented and compared
- Simulator outputs a timeline and averages
- You can explain trade-offs for each policy
Project 6: Fork and Exec (Process Creation)
Real World Outcome
You build a process explorer that demonstrates fork/exec/wait, shows process trees, and surfaces zombies and orphans.
$ ./fork_explorer
Parent PID: 1201
fork() -> child PID 1202
Child: exec /bin/ls
file1.txt file2.txt fork_explorer
Parent: child 1202 exited with status 0
$ ./fork_explorer tree
init(1)
+-- bash(1180)
+-- fork_explorer(1201)
+-- ls(1202)
The Core Question You Are Answering
Why did Unix split process creation into fork and exec, and how does that enable composition and redirection?
Concepts You Must Understand First
- C2 Syscall boundary (TLPI Ch. 3)
- C3 Process model (OSTEP Ch. 4-5)
- fork/exec semantics (APUE Ch. 8)
- Wait and exit status (TLPI Ch. 26)
Questions to Guide Your Design
- How will you clearly show parent vs child output ordering?
- How will you demonstrate zombies and orphans safely?
- How will you parse `/proc` to build a tree?
- How do you capture child exit status accurately?
- How will you show file descriptor inheritance?
Thinking Exercise
Draw the process tree created by a shell pipeline `cat file | grep foo | wc -l` and label who calls fork and exec.
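A minimal fork/exec/wait skeleton to anchor the exercise (error handling omitted for brevity):

#include <stdio.h>
#include <unistd.h>
#include <sys/wait.h>

pid_t pid = fork();
if (pid == 0) {                       /* child: replace the image */
    execlp("ls", "ls", (char *)NULL);
    _exit(127);                       /* reached only if exec fails */
}
int status;
waitpid(pid, &status, 0);             /* parent: reap the child */
if (WIFEXITED(status))
    printf("child %d exited with %d\n", (int)pid, WEXITSTATUS(status));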
The Interview Questions They Will Ask
- What is the difference between fork and exec?
- What is a zombie and how is it cleaned up?
- Why is fork efficient despite copying memory?
- How does a shell implement output redirection?
Hints in Layers
Layer 1 - Basic fork
- Print different messages in parent and child.
Layer 2 - exec
- In the child, exec `/bin/ls` and observe output.
Layer 3 - wait
- Parent waits and prints exit status.
Layer 4 - /proc tree
- Parse `/proc/<pid>/stat` and build a tree.
Books That Will Help
| Book | Chapters | Why It Helps |
|---|---|---|
| TLPI | 24-26 | fork, exec, wait |
| APUE | 8 | Process relationships |
| OSTEP | 5 | Fork and exec model |
Common Pitfalls & Debugging
Problem: “Output ordering is confusing”
- Why: Parent and child run concurrently.
- Fix: Use fflush or sleep to make ordering clear.
- Quick test: Add timestamps to each line.
Problem: “Zombie stays forever”
- Why: Parent never calls wait.
- Fix: Call waitpid or handle SIGCHLD.
- Quick test: `ps -el | grep Z` before and after wait.
Definition of Done
- fork/exec/wait demo works and shows exit status
- Process tree is printed from /proc data
- Zombies and orphans are demonstrated and cleaned
- You can explain the fork+exec design rationale
Project 7: Unix Shell from Scratch
Real World Outcome
You build a functional shell with parsing, pipes, redirection, and job control.
$ ./myshell
myshell> echo hello
hello
myshell> ls -la | grep .c | wc -l
5
myshell> sort < unsorted.txt | uniq > sorted.txt
myshell> sleep 100 &
[1] 4321
myshell> jobs
[1] Running sleep 100
myshell> fg 1
sleep 100
^C
myshell> exit
The Core Question You Are Answering
How do fork, exec, pipes, and signals combine to implement the Unix command model?
Concepts You Must Understand First
- C3 Processes and scheduling (OSTEP Ch. 4)
- C8 Pipes and IPC (TLPI Ch. 44)
- File descriptor duplication and redirection (APUE Ch. 3)
- Signals and job control (TLPI Ch. 20-22)
Questions to Guide Your Design
- How will you parse quoted strings and escapes?
- How will you implement pipelines of arbitrary length?
- How will you handle built-in commands like `cd`?
- How will you manage process groups for job control?
- How will you handle SIGINT and SIGTSTP correctly?
Thinking Exercise
Explain how `grep foo < in.txt | sort | uniq -c > out.txt` is executed at the process and FD level.
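A sketch of how one stage of that pipeline sets up input redirection before exec; it is the same dup2 trick used for pipes, applied to a file:

#include <fcntl.h>
#include <unistd.h>

int in = open("in.txt", O_RDONLY);
dup2(in, STDIN_FILENO);   /* grep's stdin now reads from the file */
close(in);                /* the original descriptor is no longer needed */
execlp("grep", "grep", "foo", (char *)NULL);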
The Interview Questions They Will Ask
- How do pipes work under the hood?
- Why must `cd` be a shell built-in?
- How does the shell implement `&` and `fg`/`bg`?
- What is a process group and why does it matter?
Hints in Layers
Layer 1 - Read/exec loop
- Implement a loop that execs single commands.
Layer 2 - Pipes and redirection
- Use pipe(), dup2(), and close() to wire processes.
Layer 3 - Built-ins
- Implement `cd`, `exit`, `jobs`, and `fg`.
Layer 4 - Job control
- Use setpgid and tcsetpgrp for foreground control.
Books That Will Help
| Book | Chapters | Why It Helps |
|---|---|---|
| TLPI | 20-22, 44 | Signals and pipes |
| APUE | 3, 9 | Redirection and job control |
| OSTEP | 5 | Fork/exec mental model |
Common Pitfalls & Debugging
Problem: “Pipelines hang”
- Why: Parent forgot to close pipe ends.
- Fix: Close unused ends in all processes.
- Quick test: Use `lsof -p <pid>` to inspect FDs.
Problem: “Ctrl+C kills the shell”
- Why: Shell did not separate process groups.
- Fix: Put child in its own process group and forward signals.
- Quick test: Run `sleep 100` and press Ctrl+C.
Definition of Done
- Commands execute with correct exit status
- Pipes and redirection match bash for simple cases
- Built-ins work and do not spawn child processes
- Job control handles background and foreground tasks
Project 8: System Call Interface
Real World Outcome
You implement raw syscalls in assembly, compare with libc, and build a small tracer that logs syscalls of a target program.
$ ./syscall_lab raw
[raw] write(1, "Hello\n", 6) -> 6
[raw] getpid() -> 5123
$ ./syscall_lab trace /bin/ls
execve("/bin/ls", ["ls"], [/* env */]) = 0
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY) = 3
mmap(NULL, 12345, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f...
write(1, "file1\nfile2\n", 12) = 12
exit_group(0) = ?
The Core Question You Are Answering
What exactly happens when user code crosses the kernel boundary, and how small is the real OS API?
Concepts You Must Understand First
- C2 Syscall boundary and ABI (TLPI Ch. 3)
- User/kernel transition (Linux Kernel Development Ch. 5)
- ptrace basics (TLPI Ch. 26)
- Error handling (errno conventions)
Questions to Guide Your Design
- How will you encode syscall numbers for your architecture?
- How will you pass 6 arguments safely?
- How will you capture syscall entry and exit with ptrace?
- How will you decode syscall numbers into names?
- How will you handle multi-threaded targets?
Thinking Exercise
Pick a simple program (e.g., cat file) and predict the first 10 syscalls it will make.
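Before writing inline assembly, you can cross-check your expectations with libc's generic syscall(2) wrapper; a sketch (Linux-specific):

#define _GNU_SOURCE
#include <unistd.h>
#include <sys/syscall.h>

long n   = syscall(SYS_write, 1, "Hello\n", (size_t)6);  /* bytes written */
long pid = syscall(SYS_getpid);                          /* same value as getpid() */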
The Interview Questions They Will Ask
- Why does the kernel validate user pointers?
- What registers are used for syscalls on x86-64?
- What is the difference between a syscall and a function call?
- How does `strace` work internally?
Hints in Layers
Layer 1 - Raw syscall
- Implement `getpid` and `write` in inline asm.
Layer 2 - Syscall table
- Add a small map from numbers to names.
Layer 3 - Tracing
- Use `ptrace(PTRACE_SYSCALL)` to stop at entry/exit.
Layer 4 - Formatter
- Decode arguments and print human-readable output.
Books That Will Help
| Book | Chapters | Why It Helps |
|---|---|---|
| TLPI | 3, 26 | Syscall basics and tracing |
| Low-Level Programming | 8 | x86-64 syscall ABI |
| Linux Kernel Development | 5 | Kernel entry path |
Common Pitfalls & Debugging
Problem: “Raw syscall returns -1”
- Why: Wrong syscall number or argument register.
- Fix: Verify ABI and use `errno` mapping.
- Quick test: Compare with libc call.
Problem: “Tracer misses syscalls”
- Why: You did not wait for both entry and exit stops.
- Fix: Alternate between entry and exit and track state.
- Quick test: Log a counter and ensure it increments.
Definition of Done
- Raw syscalls work for at least two functions
- Tracer logs syscall entry and exit reliably
- Output matches `strace` for a simple program
- You can explain the syscall ABI for x86-64
Project 9: File System from Scratch
Real World Outcome
You implement a simple file system in a disk image, build mkfs and fsck tools, and mount it with FUSE.
$ ./mkfs.myfs disk.img 10M
Created myfs: 10 MB, 256 inodes, 2560 blocks
$ ./myfs_fuse disk.img /mnt/myfs
$ echo "hello" > /mnt/myfs/notes.txt
$ ls -l /mnt/myfs
-rw-r--r-- 1 user user 6 Jan 01 12:00 notes.txt
$ ./fsck.myfs disk.img
myfs: clean
The Core Question You Are Answering
How do raw blocks become durable, named files, and how do you recover after crashes?
Concepts You Must Understand First
- C6 File systems and crash consistency (OSTEP Ch. 39-45)
- Inodes and directories (Understanding the Linux Kernel Ch. 12)
- Block allocation and bitmaps (OSTEP Ch. 40)
- FUSE interface basics (TLPI Ch. 14)
Questions to Guide Your Design
- What is your on-disk layout (superblock, inode table, bitmap, data)?
- How will you map filenames to inode numbers?
- How will you allocate and free blocks efficiently?
- How will you implement fsck and detect inconsistencies?
- What is your crash recovery strategy?
Thinking Exercise
Design a 1 MB file system with 4 KB blocks. How many inodes can you fit if each inode is 128 bytes and you reserve 10% for metadata?
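One possible on-disk superblock to make the layout question concrete; the field names and ordering are illustrative:

#include <stdint.h>

struct superblock {
    uint32_t magic;          /* identifies your FS type */
    uint32_t block_size;     /* e.g., 4096 */
    uint32_t block_count;    /* total blocks in the image */
    uint32_t inode_count;    /* entries in the inode table */
    uint32_t inode_bitmap;   /* block number of the inode bitmap */
    uint32_t data_bitmap;    /* block number of the data bitmap */
    uint32_t inode_table;    /* first block of the inode table */
    uint32_t data_start;     /* first data block */
};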
The Interview Questions They Will Ask
- What is the difference between a directory and a file in Unix?
- Why does journaling improve crash recovery?
- What is the purpose of an inode?
- How does the OS resolve a path like /a/b/c?
Hints in Layers
Layer 1 - Flat files
- Implement fixed-size files with a simple table.
Layer 2 - Inodes and directories
- Add inode structures and directory entries.
Layer 3 - Allocation bitmap
- Track free blocks and implement allocation.
Layer 4 - FUSE mount
- Expose your file system via FUSE and test with shell tools.
Books That Will Help
| Book | Chapters | Why It Helps |
|---|---|---|
| OSTEP | 39-45 | File system design |
| Understanding the Linux Kernel | 12 | Inode internals |
| TLPI | 14 | File I/O and FUSE |
Common Pitfalls & Debugging
Problem: “Files disappear after remount”
- Why: Metadata not flushed or inconsistent.
- Fix: Ensure superblock/inodes are written and fsync as needed.
- Quick test: Mount, create file, unmount, remount, verify.
Problem: “Directory traversal fails”
- Why: Incorrect directory entry parsing.
- Fix: Validate entry sizes and offsets.
- Quick test: Dump raw directory blocks with hexdump.
Definition of Done
- mkfs creates a valid disk image
- create/read/write/delete works via FUSE
- fsck detects and repairs simple inconsistencies
- You can explain your on-disk layout
Project 10: Everything Is a File (Device Files)
Real World Outcome
You build a set of custom device files (via a kernel module or FUSE) that behave like real devices: a random generator, a counter, and a logging sink.
$ sudo ./mydevctl load
$ ls -l /dev/myrand /dev/mycount /dev/mylog
crw-rw---- 1 root root 240, 0 Jan 01 12:00 /dev/myrand
crw-rw---- 1 root root 240, 1 Jan 01 12:00 /dev/mycount
crw-rw---- 1 root root 240, 2 Jan 01 12:00 /dev/mylog
$ head -c 8 /dev/myrand | hexdump -C
00000000 4f 12 a9 7c 2b 90 33 01 |O..|+.3.|
$ cat /dev/mycount
1
$ cat /dev/mycount
2
$ echo "hello" > /dev/mylog
$ dmesg | tail -n 2
[12345.67] mylog: hello
The Core Question You Are Answering
How does the OS present hardware and kernel services as simple files with read/write semantics?
Concepts You Must Understand First
- C7 I/O subsystem (interrupts and drivers)
- C6 File system semantics and device nodes
- C9 Permissions and access control
- Character vs block devices (Linux kernel docs)
Questions to Guide Your Design
- Should your device be character or block?
- What does read/write mean for each device?
- How will you handle concurrency and multiple readers?
- What permissions should the device nodes have?
- How will you expose device state to user space?
Thinking Exercise
Design a device /dev/uptime that returns a string with seconds since boot. What kernel data would you use?
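If you take the kernel-module route, the read side of /dev/mycount could look roughly like this sketch (kernel-side C; locking and length clamping omitted, names illustrative):

#include <linux/fs.h>        /* struct file */
#include <linux/uaccess.h>   /* copy_to_user */
#include <linux/kernel.h>    /* snprintf */

static int counter;          /* shared state; a real driver guards this with a lock */

static ssize_t mycount_read(struct file *f, char __user *buf,
                            size_t len, loff_t *off)
{
    char tmp[16];
    int n;

    if (*off > 0)
        return 0;                         /* EOF on the second read */
    n = snprintf(tmp, sizeof tmp, "%d\n", ++counter);
    if (copy_to_user(buf, tmp, n))
        return -EFAULT;                   /* never trust user pointers */
    *off = n;
    return n;
}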
The Interview Questions They Will Ask
- What is the difference between a character and block device?
- Why does Unix treat devices as files?
- How does a kernel module register a device?
- What does `mknod` do?
Hints in Layers
Layer 1 - Stub device
- Register a char device that returns a fixed string.
Layer 2 - Stateful reads
- Add internal counters or a ring buffer.
Layer 3 - Permissions
- Create udev rules or set permissions on /dev nodes.
Layer 4 - Concurrency
- Use locks to prevent races on shared state.
Books That Will Help
| Book | Chapters | Why It Helps |
|---|---|---|
| Linux Kernel Development | 6-7 | Device registration |
| TLPI | 13 | Device files and I/O |
| APUE | 4 | File system interfaces |
Common Pitfalls & Debugging
Problem: “read blocks forever”
- Why: Driver never returns EOF or data length.
- Fix: Return 0 when no more data is available.
- Quick test: Use `strace` to see read loop behavior.
Problem: “permission denied”
- Why: Device node permissions are too strict.
- Fix: Adjust udev rules or chmod the node.
- Quick test: `ls -l /dev/myrand` and fix mode bits.
Definition of Done
- Device nodes appear under /dev with correct major/minor
- Read/write semantics behave as documented
- Concurrency is safe under parallel reads
- Permissions and ownership are correct
Project 11: Inter-Process Communication Toolkit
Real World Outcome
You build a toolkit that demonstrates pipes, shared memory, and Unix domain sockets, including a small local chat program.
$ ./ipc_demo pipe
parent -> child: hello via pipe
child -> parent: ack
$ ./ipc_demo shm
writer: wrote 4096 bytes to shared segment
reader: verified checksum OK
$ ./chat_server
[server] listening on /tmp/chat.sock
$ ./chat_client
> hello
[server] broadcast: hello
The Core Question You Are Answering
When processes are isolated, how do they communicate efficiently and safely?
Concepts You Must Understand First
- C8 IPC mechanisms (pipes, sockets, shared memory)
- C3 Synchronization primitives (locks, condition variables)
- File descriptor inheritance (APUE Ch. 3)
- Unix domain sockets (UNP Vol 1 Ch. 3)
Questions to Guide Your Design
- Which IPC mechanism is best for each demo and why?
- How will you synchronize shared memory writes?
- How will you handle client disconnects in the chat server?
- What buffer sizes are safe for your IPC messages?
- How will you benchmark latency between mechanisms?
Thinking Exercise
Compare the data path of a pipe vs shared memory. Where does copying occur in each?
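A sketch of the shared-memory data path with POSIX shm (error checks omitted; the segment name is illustrative):

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int fd = shm_open("/ipc_demo", O_CREAT | O_RDWR, 0600);
ftruncate(fd, 4096);                            /* size the segment once */
char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
               MAP_SHARED, fd, 0);              /* both sides map the same pages */
/* Writes through p are visible to every process mapping "/ipc_demo";
   nothing is copied, which is why synchronization is on you. */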
The Interview Questions They Will Ask
- When would you choose sockets over pipes?
- Why is shared memory fast but dangerous?
- How does the OS wake a process waiting on a pipe?
- What is the difference between Unix and TCP sockets?
Hints in Layers
Layer 1 - Pipe demo
- Use pipe(), fork(), write/read.
Layer 2 - Shared memory
- Use shm_open/mmap and add a simple lock.
Layer 3 - Unix socket server
- Bind to /tmp/socket and accept clients.
Layer 4 - Benchmark
- Measure round-trip time for each mechanism.
Books That Will Help
| Book | Chapters | Why It Helps |
|---|---|---|
| TLPI | 44-56 | IPC mechanisms |
| APUE | 14 | IPC overview |
| UNIX Network Programming | 3-5 | Unix domain sockets |
Common Pitfalls & Debugging
Problem: “Shared memory data is corrupted”
- Why: Missing synchronization or incorrect offsets.
- Fix: Add a mutex or atomic flag.
- Quick test: Add checksums and verify.
Problem: “Server stops accepting clients”
- Why: File descriptor leak or not handling EINTR.
- Fix: Close client FDs and restart accept on EINTR.
- Quick test: Use `lsof` to monitor open sockets.
Definition of Done
- Pipe and shared memory demos work reliably
- Unix socket chat handles multiple clients
- Benchmarks show relative latency for each mechanism
- You can explain the trade-offs among IPC types
Project 12: Unix Philosophy Tools (Mini Coreutils)
Real World Outcome
You build small Unix tools (mini cat, grep, wc, head, tail) and verify that they compose correctly in pipelines.
$ ./mcat notes.txt | ./mgrep error | ./mwc -l
42
$ ./mhead -n 3 notes.txt
line1
line2
line3
$ ./mtail -n 2 notes.txt
line99
line100
The Core Question You Are Answering
Why do small, composable tools scale into powerful systems?
Concepts You Must Understand First
- C8 Pipes and composition (TLPI Ch. 44)
- C6 File I/O semantics (APUE Ch. 3)
- C2 Syscall boundary and errors
- Line buffering and streams
Questions to Guide Your Design
- How will your tools behave with stdin vs file input?
- How will you handle large files without loading into memory?
- How will you parse options and flags consistently?
- How will you handle binary data vs text?
- How will you test pipeline behavior?
Thinking Exercise
Design a pipeline that extracts the top 5 IPs from a log using only your tools.
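Whatever the pipeline, each tool's inner loop looks the same. The heart of mcat is a plain read/write loop over a fixed buffer (fd is stdin or an opened file; short-write handling omitted):

#include <unistd.h>

char buf[4096];
ssize_t n;
while ((n = read(fd, buf, sizeof buf)) > 0)
    write(STDOUT_FILENO, buf, (size_t)n);   /* stream; never load the whole file */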
The Interview Questions They Will Ask
- Why is “everything is a file” useful for tool composition?
- What is the difference between buffered and unbuffered I/O?
- Why should tools read from stdin by default?
- What does it mean for a tool to be stream-friendly?
Hints in Layers
Layer 1 - Single tool
- Implement mcat with read/write loops.
Layer 2 - Filtering
- Implement mgrep with simple substring search.
Layer 3 - Counting
- Implement mwc with line/word/byte counts.
Layer 4 - Robustness
- Add flags, error messages, and usage output.
Books That Will Help
| Book | Chapters | Why It Helps |
|---|---|---|
| APUE | 3 | File I/O patterns |
| TLPI | 4-5 | File descriptor basics |
| The Linux Command Line | 9-12 | Unix tool behavior |
Common Pitfalls & Debugging
Problem: “Pipelines hang”
- Why: Tool never closes stdout or reads stdin fully.
- Fix: Close outputs and handle EOF correctly.
- Quick test: Run with `strace` and verify read/write loops.
Problem: “Counts are wrong”
- Why: Incorrect line ending handling or buffer logic.
- Fix: Normalize line endings and test with known files.
- Quick test: Compare with system `wc` on the same file.
Definition of Done
- Tools behave like their Unix counterparts on basic tests
- Tools compose correctly in pipelines
- Tools handle stdin and file arguments
- You can explain how composition relies on file descriptors
Project 13: Containers from Scratch
Real World Outcome
You build a minimal container runtime that creates a new PID, mount, and network namespace, applies cgroup limits, and runs a command inside.
$ sudo ./mini_container /bin/bash
[container] pid: 1
[container] hostname: mini
[container] /proc shows only container processes
# inside container
$ hostname
mini
$ ps
PID TTY TIME CMD
1 pts/0 00:00:00 bash
The Core Question You Are Answering
How do namespaces and cgroups isolate processes without virtualizing hardware?
Concepts You Must Understand First
- C9 Security and isolation (namespaces, cgroups)
- C2 Syscall boundary and privilege checks
- C8 Networking and IPC (network namespaces)
- C6 Mount namespaces and filesystem views
Questions to Guide Your Design
- Which namespaces do you need for isolation (PID, mount, net)?
- How will you set up a minimal root filesystem?
- How will you configure cgroup limits (memory, CPU)?
- How will you handle UID/GID mapping for safety?
- How will you attach and clean up namespaces?
Thinking Exercise
Compare a container and a VM in terms of kernel, filesystem, and process isolation.
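It also helps to see how little code the container side needs. A sketch of the child-side setup, assuming the process was cloned with CLONE_NEWPID, CLONE_NEWNS, and CLONE_NEWUTS (error checks omitted):

#include <sys/mount.h>
#include <unistd.h>

/* Keep mount events from propagating back to the host. */
mount("none", "/", NULL, MS_REC | MS_PRIVATE, NULL);
/* Fresh procfs so ps shows only this PID namespace. */
mount("proc", "/proc", "proc", 0, NULL);
/* Safe only inside a UTS namespace; otherwise this renames the host. */
sethostname("mini", 4);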
The Interview Questions They Will Ask
- Why is a container not a VM?
- What does a PID namespace change?
- What is a cgroup and why is it needed?
- How does a container runtime start a process?
Hints in Layers
Layer 1 - Unshare namespaces
- Use `unshare` or clone with namespace flags.
Layer 2 - Mount /proc
- Mount a new procfs in the container namespace.
Layer 3 - cgroup limits
- Write memory and CPU limits to cgroup files.
Layer 4 - Rootfs
- Use pivot_root or chroot for filesystem isolation.
Books That Will Help
| Book | Chapters | Why It Helps |
|---|---|---|
| Modern Operating Systems | 9-10 | Protection and virtualization |
| TLPI | 40, 42 | Namespaces and resource limits |
| Linux Kernel Development | 7 | Process management context |
Common Pitfalls & Debugging
Problem: “Container sees host processes”
- Why: Missing PID namespace or /proc mount.
- Fix: Use CLONE_NEWPID and mount a new /proc.
- Quick test: `ps` inside the container should show PID 1 only.
Problem: “Memory limit not enforced”
- Why: cgroup not configured or wrong controller.
- Fix: Enable cgroup v2 and write to memory.max.
- Quick test: Allocate memory until OOM and observe.
Definition of Done
- Container has isolated PID and mount namespaces
- /proc shows only container processes
- cgroup limits are enforced
- Container runs arbitrary commands and exits cleanly
Project 14: xv6 Operating System Study
Real World Outcome
You read and annotate the xv6 codebase, run it in QEMU, and implement small extensions (e.g., a new syscall, scheduler tweak, or file system change).
$ make qemu
xv6...
$ ls
README cat echo grep init sh ls mkdir rm wc
$ uptime
1 1 1 100
The Core Question You Are Answering
How do real kernel subsystems fit together in a minimal, understandable OS?
Concepts You Must Understand First
- C1 Bootstrapping and CPU modes
- C2 Syscall boundary and ABI
- C3 Scheduling and process model
- C4 Virtual memory and paging
- C6 File system layout
Questions to Guide Your Design
- Which files implement boot and trap entry?
- How does xv6 represent processes and scheduling queues?
- Where are page tables created and switched?
- How does the file system map paths to inodes?
- Where would you add a new syscall safely?
Thinking Exercise
Trace the syscall path for ls from user space to kernel and back. Write the function names in order.
The Interview Questions They Will Ask
- Why is xv6 a good teaching OS?
- How does xv6 handle copy-on-write or lack it?
- Where are locks used to protect shared data?
- How do you add a syscall in xv6?
Hints in Layers
Layer 1 - Boot and traps
- Identify entry.S and trap handlers.
Layer 2 - Process model
- Read proc.c and scheduler loop.
Layer 3 - VM
- Read vm.c and map a page by hand.
Layer 4 - Extend
- Add a syscall that returns system uptime.
Books That Will Help
| Book | Chapters | Why It Helps |
|---|---|---|
| OSTEP | 4-6, 13-15 | Processes and memory |
| Modern Operating Systems | 1-3 | Kernel structures |
| Understanding the Linux Kernel | 1-3 | Real kernel architecture |
Common Pitfalls & Debugging
Problem: “Kernel panic after change”
- Why: Modified code breaks an invariant or missed a lock.
- Fix: Revert to last known working state and diff.
- Quick test: Use QEMU with GDB to inspect trap state.
Problem: “New syscall not found”
- Why: Syscall table or user stub not updated.
- Fix: Add entry in syscall.c and user header.
- Quick test: Verify syscall number and rebuild.
Definition of Done
- xv6 builds and boots in QEMU
- You can explain at least 3 core subsystems
- You implement a small, working kernel change
- You document the syscall path for a command
Project 15: Kernel Module Development
Real World Outcome
You write a loadable kernel module that registers a character device, exposes a /dev node, and logs kernel events.
$ make
$ sudo insmod mymod.ko
$ dmesg | tail -n 3
[12345.00] mymod: loaded
[12345.01] mymod: registered /dev/mymod
$ echo "ping" > /dev/mymod
$ dmesg | tail -n 1
[12345.50] mymod: ping
$ sudo rmmod mymod
$ dmesg | tail -n 1
[12346.00] mymod: unloaded
The Core Question You Are Answering
How do you safely extend a running kernel without crashing the system?
Concepts You Must Understand First
- C2 Syscall boundary and kernel mode rules
- C7 I/O subsystem and device registration
- C9 Security and privilege checks
- Kernel logging and debugging
Questions to Guide Your Design
- What init and exit hooks are required?
- How will you register and unregister a device cleanly?
- How will you handle concurrency in read/write callbacks?
- How will you expose module parameters safely?
- How will you test failure paths?
Thinking Exercise
List three reasons kernel code is more dangerous than user code, and how modules mitigate them.
The Interview Questions They Will Ask
- Why can a kernel module crash the whole system?
- How do you debug kernel code?
- What is the difference between a built-in driver and a module?
- How does module reference counting work?
Hints in Layers
Layer 1 - Hello module
- Print a message in init and exit.
Layer 2 - Device node
- Register a char device and implement read/write.
Layer 3 - Concurrency
- Add a mutex to protect shared buffers.
Layer 4 - Robustness
- Handle errors and unregister cleanly on failure.
Books That Will Help
| Book | Chapters | Why It Helps |
|---|---|---|
| Linux Kernel Development | 6-7 | Modules and devices |
| TLPI | 13 | Device files |
| Understanding the Linux Kernel | 1-2 | Kernel structure |
Common Pitfalls & Debugging
Problem: “Kernel oops on write”
- Why: User pointer not validated or buffer overflow.
- Fix: Use copy_from_user and bounds checks.
- Quick test: Write a long string and verify safe handling.
Problem: “Module will not unload”
- Why: Reference count still held.
- Fix: Ensure file descriptors are closed and release callbacks run.
- Quick test: `lsmod` and check the refcount.
Definition of Done
- Module loads and unloads cleanly
- /dev node works with read and write
- Concurrency is safe under parallel access
- You can explain kernel/user safety constraints
Project 16: Network Stack Exploration
Real World Outcome
You build a packet capture tool and a minimal user-space TCP/IP stack simulator, then compare its behavior with the kernel stack.
$ sudo ./pktcap eth0
[12:00:01] RX TCP 192.168.1.10:443 -> 192.168.1.50:51432 len=74
[12:00:01] TX TCP 192.168.1.50:51432 -> 192.168.1.10:443 len=66
[12:00:02] RX DNS 192.168.1.1 -> 192.168.1.50 len=92
$ ./mini_tcp_sim
SYN -> SYN-ACK -> ACK
state: ESTABLISHED
send: GET / HTTP/1.1
recv: HTTP/1.1 200 OK
The Core Question You Are Answering
How does the OS move bytes from a socket API call to packets on the wire, and back again?
Concepts You Must Understand First
- C7 I/O subsystem and interrupts
- C8 Networking and sockets
- Buffering and queueing (socket buffers)
- TCP state machine basics (TCP/IP Illustrated Ch. 1-6)
Questions to Guide Your Design
- What layers will your simulator implement (Ethernet, IP, TCP)?
- How will you handle checksum calculation and validation?
- How will you model retransmission and timeouts?
- How will you compare kernel vs user-space behavior?
- What metrics will you collect (latency, throughput, drops)?
Thinking Exercise
Draw the TCP three-way handshake and label the state transitions.
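Once frames are captured, decoding is pointer arithmetic, and the SYN/ACK flags you extract feed directly into the handshake diagram. A sketch assuming frame points at an Ethernet + IPv4 frame with no VLAN tag (Linux header definitions):

#include <netinet/ip.h>
#include <netinet/tcp.h>
#include <arpa/inet.h>

struct iphdr  *ip  = (struct iphdr *)(frame + 14);  /* skip the 14-byte Ethernet header */
struct tcphdr *tcp = (struct tcphdr *)((unsigned char *)ip + ip->ihl * 4);
uint16_t sport = ntohs(tcp->source);
uint16_t dport = ntohs(tcp->dest);
int syn = tcp->syn, ack = tcp->ack;                 /* handshake flags for the state machine */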
The Interview Questions They Will Ask
- What is the difference between TCP and UDP?
- Why does TCP need sequence numbers?
- What is the role of the NIC driver in packet handling?
- How does the OS wake a process waiting on a socket?
Hints in Layers
Layer 1 - Packet capture
- Use raw sockets or libpcap to capture packets.
Layer 2 - Decode headers
- Parse Ethernet, IP, and TCP headers.
Layer 3 - Simulated TCP
- Implement a tiny state machine for SYN/SYN-ACK/ACK.
Layer 4 - Compare
- Run a real HTTP request and compare with your simulator logs.
Books That Will Help
| Book | Chapters | Why It Helps |
|---|---|---|
| TCP/IP Illustrated | 1-6 | Protocol fundamentals |
| UNIX Network Programming | 1-5 | Socket APIs |
| Understanding Linux Network Internals | 1-3 | Kernel stack design |
Common Pitfalls & Debugging
Problem: “Captured packets look wrong”
- Why: Incorrect header offsets or endianness.
- Fix: Use htons/ntohs and verify header sizes.
- Quick test: Compare against tcpdump output.
Problem: “Simulator does not retransmit”
- Why: No timer or state tracking.
- Fix: Add a retransmission timer and state machine.
- Quick test: Drop a packet and ensure resend occurs.
Definition of Done
- Packet capture tool decodes Ethernet/IP/TCP
- Simulator implements a basic TCP handshake
- Output matches tcpdump for a sample flow
- You can explain how socket calls map to packets