PHASE 2 TRACK B SYSTEMS LIBRARIES PROJECTS
Systems Libraries & Runtimes — Project-Based Learning Path
Phase 2 — Advanced Systems Track B
This track is about building the infrastructure that other software depends on. You’ll learn to write code that’s fast, correct across platforms, and doesn’t invoke undefined behavior.
Core Concept Analysis
The topics in this track break down into these fundamental building blocks:
| Concept | What You Must Understand |
|---|---|
| Memory Allocators | Free lists, fragmentation, arena vs general-purpose, metadata overhead |
| Threading Primitives | Mutexes, atomics, memory ordering, lock-free algorithms |
| Async Runtimes | Event loops, futures/promises, IO multiplexing (epoll/kqueue), schedulers |
| ABI Details | Calling conventions, struct layout, symbol visibility, name mangling |
| Platform Differences | POSIX vs Windows syscalls, endianness, feature detection |
| Performance Tuning | Cache lines, branch prediction, SIMD, profiling |
| Undefined Behavior | Strict aliasing, alignment, integer overflow, pointer provenance |
| API Design | Ergonomics vs zero-cost, error handling, versioning |
Project 1: Custom Memory Allocator
- File: PHASE_2_TRACK_B_SYSTEMS_LIBRARIES_PROJECTS.md
- Main Programming Language: C
- Alternative Programming Languages: C++, Rust, Zig
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 1. The “Resume Gold” (Educational/Personal Brand)
- Difficulty: Level 3: Advanced (The Engineer)
- Knowledge Area: Memory Management, Systems Programming
- Software or Tool: glibc, jemalloc, valgrind
- Main Book: “C Interfaces and Implementations” - David Hanson
The HFT angle: a specialized allocator (arena/pool) eliminates malloc from your hot path entirely - malloc is a latency killer, and HFT systems never call it in the critical path. Building the specialized variant first teaches you why general-purpose allocators are slow and how to design allocation strategies for specific use cases. Its core challenges:
- Pool allocator design: Fixed-size blocks with O(1) alloc/free (teaches memory management fundamentals)
- Arena allocator: Bump allocation with bulk free (teaches allocation patterns)
- Thread-local pools: Avoiding false sharing in multi-threaded allocators (teaches TLS, cache lines)
What you’ll build: A general-purpose memory allocator (malloc/free/realloc) that can replace the system allocator in real programs.
Why it teaches memory allocators: You cannot fake your way through this. You’ll face fragmentation, understand why jemalloc uses size classes, learn why metadata placement matters, and discover that “fast” and “memory-efficient” are often at odds.
Core challenges you’ll face:
- Designing free list structures (maps to fragmentation strategies)
- Handling alignment requirements for different types (maps to ABI/alignment)
- Implementing coalescing without destroying performance (maps to performance tuning)
- Making it thread-safe without killing scalability (maps to threading primitives)
- Beating glibc malloc in at least one benchmark (maps to real-world validation)
Key Concepts:
- Free list management: “C Interfaces and Implementations” by David Hanson - Chapter 5 (Arena) and Chapter 6 (Mem)
- Size classes and binning: jemalloc design doc - Jason Evans’ “A Scalable Concurrent malloc Implementation”
- Thread-local caching: “Hoard: A Scalable Memory Allocator” - Emery Berger paper
- Fragmentation analysis: “Computer Systems: A Programmer’s Perspective” by Bryant & O’Hallaron - Chapter 9
Difficulty: Advanced Time estimate: 3-4 weeks Prerequisites: Solid C, understanding of virtual memory basics
Real world outcome:
- Your allocator will be LD_PRELOAD-able into real programs
- You'll have benchmarks showing throughput (allocations/sec) and memory efficiency vs glibc/jemalloc
- You can literally run LD_PRELOAD=./myalloc.so ls and see it work
Learning milestones:
- Basic allocator working - You understand why malloc needs metadata and how free lists work
- Thread-safe version - You grasp why naive locking destroys performance and why per-thread caches exist
- Competitive benchmarks - You’ve internalized the tradeoffs between speed, fragmentation, and memory overhead
Real World Outcome
When you complete this project, you won’t just have “an allocator”—you’ll have a production-ready shared library that can literally replace the system allocator in any program. Here’s exactly what you’ll see:
The Moment of Truth: LD_PRELOAD
# First, compile your allocator as a shared library
$ gcc -shared -fPIC -o libmyalloc.so myalloc.c -lpthread
# Test it with a simple program
$ LD_PRELOAD=./libmyalloc.so ls
Desktop Documents Downloads Music Pictures Videos
# It works! Your allocator just handled all of ls's memory allocations
What just happened? The dynamic linker loaded your malloc, free, calloc, and realloc implementations before the system’s libc. Every allocation in ls went through YOUR code.
Seeing Your Allocator in Action
Add instrumentation to see what’s happening:
$ LD_PRELOAD=./libmyalloc.so ls -la
[MyAlloc] malloc(1024) -> 0x7f8a2c001000
[MyAlloc] malloc(64) -> 0x7f8a2c001400
[MyAlloc] calloc(256, 4) -> 0x7f8a2c001440
[MyAlloc] realloc(0x7f8a2c001000, 2048) -> 0x7f8a2c001800
[MyAlloc] free(0x7f8a2c001400)
[MyAlloc] Total allocated: 3,328 bytes
[MyAlloc] Peak usage: 3,328 bytes
[MyAlloc] Total frees: 1
total 48
drwxr-xr-x 2 user user 4096 Dec 27 10:30 .
drwxr-xr-x 18 user user 4096 Dec 27 10:15 ..
-rw-r--r-- 1 user user 220 Dec 27 10:15 file.txt
You’re literally watching every memory operation in a real program!
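A practical wrinkle before you add instrumentation of your own: inside your malloc you cannot call printf, because printf itself allocates and would recurse straight back into you. A minimal sketch of re-entrancy-safe logging, formatting into a stack buffer and emitting it with the raw write syscall (snprintf normally does not allocate, though that is an assumption worth verifying on your libc):
#include <stdio.h>   // snprintf: formats into our stack buffer, no heap use
#include <unistd.h>  // write
// Log without touching the heap; printf could call malloc and recurse into us.
static void alloc_log(const char* op, size_t size, void* ptr) {
    char buf[128];
    int n = snprintf(buf, sizeof(buf), "[MyAlloc] %s(%zu) -> %p\n", op, size, ptr);
    if (n > 0) write(2, buf, (size_t)n);  // fd 2 = stderr
}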
Benchmark Output: The Numbers That Matter
Your allocator will come with benchmarks comparing against glibc malloc and jemalloc:
$ ./bench_allocator
=== Memory Allocator Benchmark ===
Test: Sequential allocation (1000 allocations of 64 bytes)
glibc malloc: 2.3 ms (434,782 allocs/sec)
jemalloc: 1.1 ms (909,090 allocs/sec)
Your allocator: 0.8 ms (1,250,000 allocs/sec) [WINNER]
Test: Random sizes (10,000 allocations, 16-4096 bytes)
glibc malloc: 145 ms (68,965 allocs/sec)
jemalloc: 89 ms (112,359 allocs/sec) [WINNER]
Your allocator: 102 ms (98,039 allocs/sec)
Test: Multithreaded (4 threads, 10,000 allocs each)
glibc malloc: 8,234 ms (4,859 allocs/sec)
jemalloc: 1,456 ms (27,472 allocs/sec) [WINNER]
Your allocator: 1,892 ms (21,141 allocs/sec)
=== Fragmentation Analysis ===
After 100,000 random alloc/free operations:
Bytes Allocated Bytes Requested Overhead
glibc malloc: 2,048,576 1,897,234 7.9%
jemalloc: 1,986,432 1,897,234 4.7% [WINNER]
Your allocator: 2,012,160 1,897,234 6.1%
=== Memory Usage Stats ===
Your allocator used 16 arenas (threads)
Peak memory from OS: 8,388,608 bytes (8 MB)
Returned to OS: 6,291,456 bytes
Currently held: 2,097,152 bytes
What this tells you:
- You beat glibc in single-threaded small allocations (your size-class bins are working!)
- jemalloc still wins in multithreaded workloads (their arena design is battle-tested)
- Your fragmentation is competitive (your coalescing strategy works)
Running Real Programs with Your Allocator
# Run Python with your allocator
$ LD_PRELOAD=./libmyalloc.so python3 -c "print('Hello from my allocator!')"
Hello from my allocator!
[MyAlloc] Stats: 1,247 allocations, 892 frees, 84 KB peak usage
# Run a web server
$ LD_PRELOAD=./libmyalloc.so python3 -m http.server 8000
[MyAlloc] Serving HTTP on 0.0.0.0:8000
[MyAlloc] Request handled: 47 allocations, 512 bytes avg
# Run valgrind to verify no leaks
# (caveat: memcheck normally intercepts malloc itself; add
# --soname-synonyms=somalloc=libmyalloc.so so it checks YOUR allocator)
$ valgrind --leak-check=full env LD_PRELOAD=./libmyalloc.so ls
==12345== HEAP SUMMARY:
==12345== in use at exit: 0 bytes in 0 blocks
==12345== total heap usage: 1,247 allocs, 1,247 frees, 84,192 bytes allocated
==12345==
==12345== All heap blocks were freed -- no leaks are possible
Debugging Output: Seeing Fragmentation
$ LD_PRELOAD=./libmyalloc.so ./fragmentation_test
=== Heap State After 1000 Random Operations ===
Size Class: 16 bytes
Free list: 23 blocks
Allocated: 145 blocks
Fragmentation: 12.3%
Size Class: 32 bytes
Free list: 47 blocks
Allocated: 289 blocks
Fragmentation: 8.7%
Size Class: 64 bytes
Free list: 12 blocks
Allocated: 512 blocks
Fragmentation: 4.2%
Large allocations (>4096 bytes):
Arena 0: 4 blocks, 24,576 bytes total
Arena 1: 2 blocks, 16,384 bytes total
Total memory from OS: 1,048,576 bytes (1 MB)
Total in use: 897,234 bytes
Wasted (fragmentation): 47,892 bytes (5.3%)
Visual representation your code can print:
Arena 0 Memory Layout:
[Used:64][Free:64][Used:128][Free:32][Used:256][Free:128]...
^
Coalescing opportunity!
The Interview Demo
When you show this in an interview or on GitHub:
# Clone your repo
$ git clone https://github.com/yourname/myalloc.git
$ cd myalloc
# Build
$ make
# Run the demo
$ ./demo
=== Custom Memory Allocator Demo ===
1. Testing basic allocation...
malloc(100): ✓ returned 0x7f8a2c001000
free(): ✓ freed successfully
2. Testing thread-local caching...
Thread 1: allocated 1000 blocks (0.234 ms)
Thread 2: allocated 1000 blocks (0.189 ms)
Thread 3: allocated 1000 blocks (0.201 ms)
Thread 4: allocated 1000 blocks (0.198 ms)
✓ No cache contention detected!
3. Testing coalescing...
Before: 10 free blocks in size class 64
After coalesce: 2 free blocks in size class 256
✓ Fragmentation reduced by 80%!
4. Stress test with real program...
Running: LD_PRELOAD=./libmyalloc.so find /usr -name "*.so"
✓ Completed successfully
Stats: 47,892 allocs, 47,892 frees, 0 leaks
ALL TESTS PASSED! Your allocator is production-ready.
This is what you’ll actually build—not a toy, but a real allocator you can use, benchmark, and understand at the deepest level.
The Core Question You’re Answering
“Why is malloc slow, and what does it actually take to make a fast, correct memory allocator?”
This is the question that separates developers who use memory from those who understand memory. Most programmers treat malloc as magic—a function that just “gives you memory.” But ask them:
- Why does malloc sometimes take 10ns and sometimes take 10µs?
- Why does allocating 16 bytes waste 8 more bytes on metadata?
- Why does freeing memory not always return it to the OS?
- Why can’t you just have a global lock and call it a day in multi-threaded programs?
- Why do jemalloc and tcmalloc exist when glibc already has malloc?
If you can’t answer these, you don’t understand memory allocation. This project forces you to answer all of them by building the thing.
The deeper question: Memory allocators sit at the intersection of three hard problems:
- Speed (making allocation O(1) without searching)
- Memory efficiency (minimizing fragmentation and overhead)
- Thread safety (scaling to many cores without lock contention)
You cannot optimize all three simultaneously. This project teaches you the tradeoffs by making you choose.
Why this matters:
- HFT firms ask: “How would you design a malloc for sub-microsecond latency?”
- Game engine teams ask: “Why does our frame rate tank after 10 minutes?” (memory fragmentation)
- Systems programmers ask: “Why does our server use 10GB of RAM when only 2GB is in use?” (allocator not releasing memory to OS)
After this project, you’ll answer these from experience, not theory.
Concepts You Must Understand First
Before you write a single line of code, ensure you deeply understand these concepts. If any of these is fuzzy, stop and research it—your allocator will fail in confusing ways otherwise.
1. Virtual Memory and Address Spaces
What you need to know:
- What is a virtual address vs a physical address?
- How does sbrk() or mmap() get memory from the OS?
- What happens when you request 100 bytes but the OS gives you 4096 bytes (a page)?
- Why can two processes have the same virtual address but different physical addresses?
Questions to answer:
- If you call malloc(1), how much memory does the OS actually give you?
- What's the difference between the heap (via sbrk) and memory-mapped regions (via mmap)?
- Why does Linux give you more address space than you have RAM?
Book references:
- “Computer Systems: A Programmer’s Perspective” by Bryant & O’Hallaron — Chapter 9: Virtual Memory
- “The Linux Programming Interface” by Michael Kerrisk — Chapter 7: Memory Allocation
- “Operating Systems: Three Easy Pieces” by Arpaci-Dusseau — Chapters 13-15: Address Spaces, Memory API, Address Translation
2. Metadata: The Hidden Cost of malloc
What you need to know:
- Every malloc call needs bookkeeping: How big is this block? Is it in use?
- Where do you store this metadata? (Before the returned pointer? In a separate table?)
- How do you find the metadata when free(ptr) is called with just a pointer?
Questions to answer:
- If a user requests 16 bytes, how many bytes does your allocator actually need?
- What happens if the user writes past the end of their allocation and corrupts your metadata?
- Why is alignment important? (Hint: some CPUs crash on unaligned access)
Book references:
- “C Interfaces and Implementations” by David Hanson — Chapter 5: Arena and Chapter 6: Mem
- “The C Programming Language” by Kernighan & Ritchie — Chapter 8.7: Example—A Storage Allocator
- “Understanding and Using C Pointers” by Richard Reese — Chapter 2: Dynamic Memory Management
3. Free Lists: The Core Data Structure
What you need to know:
- A free list is a linked list of available memory blocks
- When you malloc, you search the free list for a big-enough block
- When you free, you add the block back to the free list
Questions to answer:
- Should the free list be sorted by address or insertion order? (Affects coalescing and fragmentation)
- How do you store the “next” pointer in a free block? (The block is free, so you can reuse its memory!)
- What’s the difference between first-fit, best-fit, and worst-fit strategies?
Visual model to internalize:
Free List (sorted by address):
NULL <- [Block@0x1000, size=64] <- [Block@0x1400, size=128] <- [Block@0x2000, size=256] <- Head
When malloc(100) is called:
1. Search list for block >= 100 bytes
2. Find block at 0x1400 (size=128)
3. Split it: Use 100 bytes, return 28 bytes to free list
4. Return pointer 0x1400 to user
Book references:
- “Computer Systems: A Programmer’s Perspective” by Bryant & O’Hallaron — Chapter 9.9: Dynamic Memory Allocation
- “Algorithms in C” by Robert Sedgewick — Chapter on Linked Lists
- CS 341 Malloc Tutorial
4. Fragmentation: The Silent Killer
What you need to know:
- Internal fragmentation: Wasted space inside an allocated block (e.g., user asks for 17 bytes, you give them 32 due to alignment)
- External fragmentation: Free memory exists but is scattered in small chunks, so large allocations fail
Questions to answer:
- You have 1000 bytes free split into 10 blocks of 100 bytes each. Can you allocate 500 bytes? (No! External fragmentation)
- How do you measure fragmentation? (Peak memory used / actual bytes requested)
- What is coalescing, and why is it essential?
Visual example:
Before fragmentation:
[Free: 1000 bytes]
After 10 allocations of 50 bytes each, then freeing blocks 2, 4, 5, 7, and 9:
[Used:50][Free:50][Used:50][Free:50][Free:50][Used:50][Free:50][Used:50][Free:50][Used:50]
You have 250 bytes free, but can't allocate 100 bytes!
After coalescing adjacent free blocks (4 and 5 merge):
[Used:50][Free:50][Used:50][Free:100][Used:50][Free:50][Used:50][Free:50][Used:50]
Now a 100-byte allocation fits.
Book references:
- CS360 Fragmentation Lecture
- “Computer Systems: A Programmer’s Perspective” by Bryant & O’Hallaron — Chapter 9.9.12: Fragmentation
- Embedded Artistry: Implementing Malloc
5. Thread Safety and Lock-Free Data Structures
What you need to know:
- A global lock on malloc kills performance in multi-threaded programs
- Per-thread caches (thread-local storage) avoid contention
- Atomic operations (__sync_fetch_and_add, C11 _Atomic) enable lock-free algorithms
Questions to answer:
- What is false sharing, and why does it destroy performance?
- How does jemalloc use per-thread arenas to avoid locking?
- What’s the difference between a mutex and an atomic compare-and-swap (CAS)?
Book references:
- “Rust Atomics and Locks” by Mara Bos — Chapters 1-3 (best explanation of memory ordering, applies to C too)
- “Computer Systems: A Programmer’s Perspective” by Bryant & O’Hallaron — Chapter 6: The Memory Hierarchy (cache lines)
- “The Art of Multiprocessor Programming” by Herlihy & Shavit — Lock-free data structures
- “Hoard: A Scalable Memory Allocator” (paper by Emery Berger)
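To see what the fix for false sharing looks like, here is a minimal sketch of cache-line padding for per-thread allocator state (C11; the 64-byte line size and the 64-slot bound are assumptions for illustration):
#include <stddef.h>
// Per-thread allocator state, one cache line each. _Alignas(64) forces the
// struct's alignment (and hence its sizeof) up to 64 bytes, so adjacent array
// slots can never share a cache line. 64 bytes is the common line size on
// x86/ARM cores; query the real value for production code.
typedef struct {
    _Alignas(64) void* free_list; // this thread's private free list
    size_t alloc_count;           // struct pads out to 64 bytes automatically
} padded_slot_t;
static padded_slot_t slots[64];   // hypothetical max-thread bound for the sketch
_Static_assert(sizeof(padded_slot_t) == 64, "slot must fill exactly one line");
Without the _Alignas, two neighboring threads' counters would sit in one line and every increment by one thread would invalidate the other's cache.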
6. Size Classes: Why jemalloc is Fast
What you need to know:
- Instead of a single free list, have multiple lists for common sizes (16, 32, 64, 128 bytes, etc.)
- Small allocations go to the appropriate size class (fast, no searching)
- Large allocations use mmap directly (avoids fragmentation of small allocations)
Questions to answer:
- Why are size classes powers of 2? (alignment and cache line efficiency)
- What’s the tradeoff? (More internal fragmentation: 17-byte request gets 32-byte block)
- How do you decide the threshold between “small” and “large”?
Book references:
- jemalloc Background
- “A Scalable Concurrent malloc Implementation” — Jason Evans (jemalloc author’s paper)
- Exploring Different Memory Allocators
7. Alignment Requirements
What you need to know:
- Most systems require pointers to be aligned (e.g., 8-byte boundary on x64)
- malloc must return addresses divisible by the platform's alignment (usually 8 or 16)
- Misaligned access can crash on ARM, or just be slow on x86
Questions to answer:
- If a user requests 5 bytes, can you return a pointer at address 0x1003? (No! Not aligned)
- How do you round up sizes to the next alignment boundary?
- Why does malloc(1) on Linux actually return a 16-byte-aligned block?
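The round-up question has a classic branch-free answer. A minimal sketch, valid whenever the alignment is a power of two:
// Round size up to the next multiple of align (align must be a power of two).
// Example: round_up(5, 8) == 8, round_up(8, 8) == 8, round_up(17, 16) == 32.
static inline size_t round_up(size_t size, size_t align) {
    return (size + align - 1) & ~(align - 1);
}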
Book references:
- “Computer Systems: A Programmer’s Perspective” by Bryant & O’Hallaron — Chapter 3: Machine-Level Representation (alignment)
- “Effective C” by Robert Seacord — Chapter 2: Objects, Functions, and Types
Questions to Guide Your Design
Before implementing, think through these design decisions. Your answers will shape your allocator’s performance characteristics.
Design Question 1: How Will You Get Memory from the OS?
Options:
- sbrk()/brk() — Traditional Unix way, extends the heap (deprecated on some systems)
- mmap()/munmap() — Modern approach, memory-mapped anonymous pages
- Hybrid: Use sbrk for small allocations, mmap for large
Think through:
- sbrk is contiguous but can't return memory easily (must free from the top)
- mmap is flexible (can free individual chunks) but has higher overhead
- What's your threshold for "large"? (jemalloc uses 4KB)
Guiding questions:
- What if the user allocates 1GB, frees it, then allocates 1GB again? Should you return memory to the OS?
- How do you handle sbrk failure when the system is out of memory?
Design Question 2: How Will You Store Metadata?
Options:
- Boundary tags (header + footer): Store size before and after the block (enables coalescing in both directions)
- Header only: Store size and in-use flag before the returned pointer (simpler, less overhead)
- Separate metadata table: Store all metadata in a different region (isolates corruption but slower lookup)
Think through:
Boundary tag layout:
[Header: size=128, used=1] [User data: 120 bytes] [Footer: size=128]
^
Pointer returned to user
What if the user overflows and corrupts the footer?
Guiding questions:
- How do you find the previous block when freeing? (You need footer or separate list)
- How much overhead is acceptable? (8 bytes per block? 16 bytes?)
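Here is a minimal sketch of the boundary-tag option, assuming sizes are multiples of 8 so the low bit of the size word can double as the in-use flag (all names illustrative):
#include <stddef.h>
// One 8-byte tag before AND after each block. The low bit of the size word
// is free for the in-use flag because block sizes are multiples of 8.
typedef struct {
    size_t size_and_used;   // block size in bytes, low bit = 1 if allocated
} tag_t;
static inline size_t tag_size(const tag_t* t) { return t->size_and_used & ~(size_t)1; }
static inline int    tag_used(const tag_t* t) { return (int)(t->size_and_used & 1); }
// Footer of this block (size counts header + payload + footer):
static inline tag_t* footer_of(tag_t* header) {
    return (tag_t*)((char*)header + tag_size(header) - sizeof(tag_t));
}
// The PREVIOUS block's footer sits directly before this header,
// which is exactly what makes backward coalescing O(1):
static inline tag_t* prev_footer(tag_t* header) {
    return (tag_t*)((char*)header - sizeof(tag_t));
}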
Design Question 3: What Free List Strategy Will You Use?
Options:
- First-fit: Return first block large enough (fast, fragments over time)
- Best-fit: Return smallest block that fits (minimizes waste, slower search)
- Segregated fit (size classes): Separate free lists for common sizes (jemalloc’s approach)
Think through:
Free list: [64] -> [128] -> [256] -> [512]
malloc(100):
- First-fit: Use 128 (fast, wastes 28 bytes)
- Best-fit: Use 128 (same in this case, but usually searches whole list)
- Segregated: Look in "128-byte class" (fastest)
Guiding questions:
- Does order matter? (Address-ordered lists reduce fragmentation but slow insertion)
- Should you split blocks or waste space? (64-byte block for 32-byte request: split or waste?)
Design Question 4: How Will You Handle Thread Safety?
Options:
- Global lock: Single mutex around all malloc/free (simple, kills performance)
- Per-size-class locks: Reduces contention (better, still some contention)
- Thread-local caches: Each thread has its own pool (jemalloc/tcmalloc approach)
Think through:
Thread 1: malloc(64) → Lock → Get from free list → Unlock
Thread 2: malloc(64) → WAIT for lock...
With per-thread cache:
Thread 1: malloc(64) → Get from local cache (no lock!)
Thread 2: malloc(64) → Get from local cache (no lock!)
Guiding questions:
- What if a thread allocates 1000 blocks then exits? (Memory stuck in its cache)
- How do you refill a thread’s cache when it’s empty? (Need global lock then)
Design Question 5: Will You Implement Coalescing?
Coalescing: Merging adjacent free blocks into larger blocks
Options:
- Immediate coalescing: When freeing, check neighbors and merge (prevents fragmentation)
- Deferred coalescing: Wait until allocation fails, then coalesce (faster free, slower malloc)
- No coalescing: Accept fragmentation (only viable for short-lived programs)
Think through:
Free list: [Block A: free] [Block B: allocated] [Block C: free]
User frees Block B:
- With coalescing: [Single free block: A+B+C]
- Without: [A: free] [B: free] [C: free] ← Can't satisfy large allocations!
Guiding questions:
- How do you find adjacent blocks? (Need footer or address-ordered list)
- Is coalescing worth the CPU cost on every free?
Thinking Exercise: Trace an Allocation by Hand
Before writing code, trace this scenario on paper. This builds the mental model you need.
Scenario: Simple Allocator State
Initial state (you've gotten 1024 bytes from OS):
Free list (address-ordered):
[Block@0x1000, size=1024, free]
Step-by-step trace:
1. void* p1 = malloc(100);
- Search free list for block >= 100 bytes
- Find block at 0x1000 (size=1024)
- Split it:
- Use first 100 bytes (+ 8 for header) = 108 bytes
- Return remaining 916 bytes to free list
- Update free list: [Block@0x106C, size=916, free]
- Return pointer 0x1008 (0x1000 + 8 bytes for header)
Memory layout:
[Header@0x1000: size=108, used] [User data@0x1008: 100 bytes] [Header@0x106C: size=916, free]
2. void* p2 = malloc(200);
- Search free list
- Find block at 0x106C (size=916)
- Split it: Use 208 bytes, return 708 bytes
- Free list: [Block@0x113C, size=708, free]
- Return pointer 0x1074
Memory layout:
[Used:108][User:100][Used:208][User:200][Free:708]
^0x1000 ^0x106C ^0x113C
3. free(p1);
- Receive pointer 0x1008
- Back up 8 bytes to find header at 0x1000
- Mark block as free (size=108)
- Add to free list
Free list (address-ordered):
[Block@0x1000, size=108] -> [Block@0x113C, size=708]
Memory layout:
[Free:108][Used:208][User:200][Free:708]
^0x1000 ^0x106C ^0x113C
4. free(p2);
- Free block at 0x106C (size=208)
- Check neighbors:
- Previous block (0x1000) is free!
- Next block (0x113C) is free!
- COALESCE: Merge all three blocks
Free list:
[Block@0x1000, size=1024]
Memory layout:
[Free:1024]
^0x1000
We're back to the initial state! Perfect coalescing.
Questions to answer about your trace:
- At step 2, why does malloc(200) actually use 208 bytes? (Header overhead)
- If you didn’t coalesce at step 4, could you allocate 500 bytes? (No! Largest free block is 708)
- How would you detect that blocks 0x1000, 0x106C, and 0x113C are adjacent? (Check if 0x1000 + size == 0x106C)
- What if the user writes to p1[101] (out of bounds)? (Corrupts the header at 0x106C: chaos on the next malloc)
Now trace a multithreaded scenario:
Thread 1: malloc(64) }
Thread 2: malloc(64) } Both hit the same free list
Thread 3: malloc(64) }
With a global lock:
T1: Lock → Allocate → Unlock (3µs)
T2: WAIT → Lock → Allocate → Unlock (3µs + wait time)
T3: WAIT → WAIT → Lock → Allocate → Unlock (3µs + wait time)
With per-thread caches:
T1: Get from local cache (0.5µs)
T2: Get from local cache (0.5µs)
T3: Get from local cache (0.5µs)
All happen in parallel!
The Interview Questions They’ll Ask
Prepare to answer these. If you can answer them fluently, you’ve internalized the concepts.
Conceptual Questions
- “What is the difference between sbrk and mmap for getting memory from the OS?”
- Answer should cover: sbrk extends the heap contiguously and can't free individual chunks; mmap gives arbitrary pages, can free them individually, with higher overhead per call.
- “Why does malloc need to store metadata, and where is it stored?”
- Answer: Need to track size (for free/realloc) and in-use status. Usually stored as a header immediately before the returned pointer.
- “Explain external fragmentation vs internal fragmentation.”
- Answer: External = free memory exists but is scattered. Internal = allocated blocks waste space due to alignment/rounding.
- “How do you implement free(ptr) when the user only gives you a pointer?”
- Answer: Back up from ptr by the header size to find the metadata containing the block size.
- “What is coalescing and why is it necessary?”
- Answer: Merging adjacent free blocks. Necessary to prevent external fragmentation—without it, heap becomes unusable over time.
Design Questions
- “Why do allocators like jemalloc use size classes?”
- Answer: Avoids searching free lists. Small allocations go to fixed-size pools (O(1) allocation). Reduces fragmentation by grouping similar sizes.
- “What happens if two threads call malloc at the same time?”
- Answer: Without synchronization, corruption (race condition on the free list). Solutions: global lock (slow), per-thread caches (fast), lock-free algorithms (complex).
- “How would you design a malloc for a single-threaded embedded system with 64KB of RAM?”
- Answer: Simple first-fit with a single free list. No need for thread safety. Maybe skip coalescing to save code space. Use a fixed pool from __heap_start.
- “Why don't allocators immediately return memory to the OS on free()?”
- Answer: System calls (munmap, brk) are expensive. Better to cache memory for future allocations. Only return memory under pressure.
Performance Questions
- “Why is malloc sometimes slow?”
- Answer: Searching free lists (O(n) worst case), lock contention in multithreaded programs, system calls to get more memory, coalescing overhead.
- “How would you reduce fragmentation in a long-running server?”
- Answer: Use size classes, periodic coalescing, segregated heaps for different object lifetimes, memory pools for fixed-size objects.
- “What is false sharing and how does it affect allocators?”
- Answer: Two threads accessing different variables in the same cache line cause cache invalidation. In allocators, per-thread metadata too close together kills performance. Solution: Pad metadata to cache-line boundaries (64 bytes).
Implementation Questions
- “How do you ensure malloc returns an aligned pointer?”
- Answer: Round the requested size up to the alignment boundary. Ensure the metadata size is also aligned. Start the heap at an aligned address.
- “What’s the minimum allocation size, and why?”
- Answer: Must fit a “next” pointer in free blocks (usually 8 bytes on 64-bit). Also alignment requirement. So minimum is typically 16 bytes.
- “How would you debug a heap corruption bug?”
- Answer: Use AddressSanitizer, Valgrind. Add canary values in metadata. Walk entire heap on each operation in debug mode. Check for overflows by verifying canaries.
Advanced Questions
- “Explain how realloc works. Why is it more complex than malloc + free?”
- Answer: If the new size fits in the current block, just update metadata. Otherwise, allocate a new block, memcpy, free the old one. Optimization: If the next block is free, expand in place. (A minimal sketch follows this list.)
- “How does jemalloc achieve thread scalability?”
- Answer: Per-thread arenas (caches). Each thread allocates from its own arena, avoiding locks. Global arena only accessed when thread cache is empty.
- “What is a memory pool and when would you use it over malloc?”
- Answer: Pre-allocated region for fixed-size objects. No fragmentation, O(1) alloc/free, better cache locality. Use for objects with known size/lifetime (e.g., game entities, network packets).
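A minimal sketch of the realloc answer in code, reusing the block_t header from Hint 1 below (in-place expansion omitted; assumes <string.h>):
#include <string.h>
void* realloc(void* ptr, size_t size) {
    if (!ptr) return malloc(size);            // realloc(NULL, n) acts like malloc(n)
    if (size == 0) { free(ptr); return NULL; }
    block_t* block = (block_t*)ptr - 1;       // back up to the header
    size_t usable = block->size - sizeof(block_t);
    if (usable >= size) return ptr;           // current block already big enough
    void* new_ptr = malloc(size);             // otherwise: allocate, copy, free
    if (!new_ptr) return NULL;                // on failure the old block stays valid
    memcpy(new_ptr, ptr, usable);             // copy only the old block's bytes
    free(ptr);
    return new_ptr;
}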
Hints in Layers
Only read these if you’re stuck. Try to implement as much as possible before looking.
Hint 1: Start with the Simplest Possible Allocator
Don’t try to build jemalloc on day one. Start here:
// Global free list (single linked list)
typedef struct block {
size_t size; // Size of this block (including header)
struct block* next; // Next free block (only valid if free)
int free; // 1 if free, 0 if allocated
} block_t;
static block_t* free_list_head = NULL;
void* malloc(size_t size) {
// 1. Round size up to alignment (8 bytes)
size = (size + 7) & ~7;
// 2. Search free list for first-fit
block_t* current = free_list_head;
while (current) {
if (current->free && current->size >= size) {
current->free = 0;
return (void*)(current + 1); // Return pointer after header
}
current = current->next;
}
// 3. No suitable block found, get more memory from OS
// (hint: use sbrk or mmap here, carve the new region into a block, and return it)
return NULL; // placeholder so the function compiles until step 3 is implemented
}
Get THIS working first. Verify with simple tests.
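A minimal smoke test for this stage, as a sketch; link it directly against your allocator so these calls resolve to your implementation:
#include <assert.h>
#include <stdio.h>
#include <string.h>
int main(void) {
    char* a = malloc(100);
    assert(a != NULL);
    memset(a, 'x', 100);            // touch every byte: crashes if the block is bogus
    char* b = malloc(200);
    assert(b != NULL && b != a);    // distinct live blocks must not overlap
    free(a);
    char* c = malloc(50);           // first-fit should be able to reuse a's block
    assert(c != NULL);
    free(b);
    free(c);
    printf("smoke test passed\n");
    return 0;
}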
Hint 2: Getting Memory from the OS
Modern approach using mmap:
#include <sys/mman.h>
void* get_memory_from_os(size_t size) {
// Round up to page size (4096 bytes)
size_t page_size = 4096;
size = (size + page_size - 1) / page_size * page_size;
void* ptr = mmap(NULL, // Let OS choose address
size, // Size in bytes
PROT_READ | PROT_WRITE,// Permissions
MAP_PRIVATE | MAP_ANONYMOUS, // Private, not backed by file
-1, // No file descriptor
0); // No offset
if (ptr == MAP_FAILED) {
return NULL; // Out of memory
}
return ptr;
}
Initialize it as a free block and add to your free list.
Hint 3: Implementing free
void free(void* ptr) {
if (!ptr) return; // free(NULL) is a no-op
// Back up to find the header
block_t* block = (block_t*)ptr - 1;
// Mark as free
block->free = 1;
// TODO: Coalesce with adjacent free blocks (Hint 5)
}
Hint 4: Splitting Blocks
When you find a free block that’s bigger than needed, split it:
void split_block(block_t* block, size_t size) {
// Only split if remainder is large enough for a new block
// (needs space for header + some data)
if (block->size >= size + sizeof(block_t) + 8) {
block_t* new_block = (block_t*)((char*)block + sizeof(block_t) + size);
new_block->size = block->size - size - sizeof(block_t);
new_block->free = 1;
new_block->next = block->next;
block->size = size + sizeof(block_t);
block->next = new_block;
}
}
Hint 5: Coalescing Adjacent Free Blocks
To prevent fragmentation, merge adjacent free blocks:
void coalesce(block_t* block) {
// Merge with next block if it's free
if (block->next && block->next->free) {
// Check if blocks are actually adjacent in memory
if ((char*)block + block->size == (char*)block->next) {
block->size += block->next->size;
block->next = block->next->next;
}
}
// To merge with previous block, you need to search the list
// (This is why address-ordered lists or footers are useful!)
}
Better approach: Keep the free list sorted by address, making coalescing easier.
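As a sketch of that better approach, here is a free that inserts in address order; it assumes a variant where the list links only free blocks (unlike Hint 1's all-blocks list), so adapt it to your design:
void free(void* ptr) {
    if (!ptr) return;
    block_t* block = (block_t*)ptr - 1;   // back up to the header
    block->free = 1;
    // Walk to the insertion point so the list stays sorted by address.
    block_t** link = &free_list_head;
    while (*link && *link < block) {
        link = &(*link)->next;
    }
    block->next = *link;
    *link = block;
    coalesce(block);  // the forward merge from Hint 5 is now a single comparison
}
The pointer-to-pointer walk (block_t** link) avoids special-casing insertion at the head.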
Hint 6: Making it Thread-Safe (First Attempt)
Simplest approach—global mutex:
#include <pthread.h>
static pthread_mutex_t malloc_lock = PTHREAD_MUTEX_INITIALIZER;
void* malloc(size_t size) {
pthread_mutex_lock(&malloc_lock);
void* ptr = malloc_impl(size); // your allocation logic, factored into a helper (hypothetical name)
pthread_mutex_unlock(&malloc_lock);
return ptr;
}
This works but kills performance. Measure the overhead!
Hint 7: Size Classes (Advanced)
Instead of one free list, have multiple:
#define NUM_SIZE_CLASSES 8
// Size classes: 16, 32, 64, 128, 256, 512, 1024, 2048 bytes
static block_t* size_class_lists[NUM_SIZE_CLASSES];
int size_to_class(size_t size) {
if (size <= 16) return 0;
if (size <= 32) return 1;
if (size <= 64) return 2;
// ... etc
return 7;
}
void* malloc(size_t size) {
int class = size_to_class(size);
// Try to allocate from this size class
if (size_class_lists[class]) {
block_t* block = size_class_lists[class];
size_class_lists[class] = block->next;
return (void*)(block + 1);
}
// No free blocks in this class, refill it from the OS
// ... (carve a fresh page into blocks of this class's size)
return NULL; // placeholder until the refill path is implemented
}
Hint 8: Benchmarking Your Allocator
#include <time.h>
void benchmark() {
struct timespec start, end;
clock_gettime(CLOCK_MONOTONIC, &start);
// Allocate 10,000 blocks
void* ptrs[10000];
for (int i = 0; i < 10000; i++) {
ptrs[i] = malloc(64);
}
clock_gettime(CLOCK_MONOTONIC, &end);
long ns = (end.tv_sec - start.tv_sec) * 1000000000L +
(end.tv_nsec - start.tv_nsec);
printf("10,000 allocations took %ld ns (%.2f ns each)\n",
ns, ns / 10000.0);
// Don't forget to free!
for (int i = 0; i < 10000; i++) {
free(ptrs[i]);
}
}
Compare against the system malloc by running the same benchmark without your LD_PRELOAD.
Hint 9: Debugging with Address Sanitizer
Compile your test programs with ASan to shake out memory bugs (note: ASan ships its own malloc interceptors, so rename your entry points, e.g. my_malloc/my_free, while testing under it; you can't interpose malloc and run ASan at the same time):
$ gcc -fsanitize=address -g myalloc.c test.c -o test
$ ./test
# If you have a bug, you'll see:
# ==12345==ERROR: AddressSanitizer: heap-buffer-overflow
# READ of size 4 at 0x602000000014 thread T0
# #0 0x7f8a2c001234 in test test.c:42
This catches use-after-free, buffer overflows, etc.
Hint 10: Testing with Real Programs
Create a simple wrapper script:
#!/bin/bash
# test_allocator.sh
export LD_PRELOAD=./libmyalloc.so
echo "Testing with ls..."
ls /usr/bin | head -n 5
echo "Testing with Python..."
python3 -c "print('Hello from custom allocator!')"
echo "Testing with grep..."
grep -r "malloc" . | head -n 5
echo "All tests passed!"
If any of these crash, you have a bug!
Books That Will Help
Here’s exactly where to look for deep understanding of each concept:
| Topic | Book | Specific Chapter/Section |
|---|---|---|
| Virtual Memory & Address Spaces | Computer Systems: A Programmer's Perspective by Bryant & O'Hallaron | Chapter 9: Virtual Memory (especially 9.9 for dynamic allocation) |
| | Operating Systems: Three Easy Pieces by Arpaci-Dusseau | Chapters 13-15: Address Spaces, Memory API, Address Translation |
| | The Linux Programming Interface by Michael Kerrisk | Chapter 7: Memory Allocation |
| Classic malloc Implementation | The C Programming Language by Kernighan & Ritchie | Chapter 8.7: Example—A Storage Allocator |
| | C Interfaces and Implementations by David Hanson | Chapter 5: Arena, Chapter 6: Mem |
| Free Lists & Data Structures | Computer Systems: A Programmer's Perspective by Bryant & O'Hallaron | Chapter 9.9.12-9.9.14: Placement policies, splitting/coalescing |
| | Algorithms in C by Robert Sedgewick | Chapter 3: Elementary Data Structures (linked lists) |
| Fragmentation Deep Dive | CS360 Fragmentation Lecture | Complete lecture notes |
| | CS 341 Malloc Tutorial | Internal vs external fragmentation |
| jemalloc Design | jemalloc Background Wiki | Complete design documentation |
| | Scalable memory allocation using jemalloc | Facebook Engineering blog post |
| | "A Scalable Concurrent malloc Implementation" by Jason Evans | Entire paper (search online) |
| Thread Safety & Atomics | Rust Atomics and Locks by Mara Bos | Chapters 1-3 (best explanation, applies to C) |
| | Computer Systems: A Programmer's Perspective by Bryant & O'Hallaron | Chapter 6: The Memory Hierarchy (cache lines, false sharing) |
| | "Hoard: A Scalable Memory Allocator" by Emery Berger | Paper on thread-local caching (search online) |
| Alignment & ABI | Computer Systems: A Programmer's Perspective by Bryant & O'Hallaron | Chapter 3.9: Heterogeneous Data Structures |
| | Effective C by Robert Seacord | Chapter 2: Objects, Functions, and Types |
| Arena & Pool Allocators | C Interfaces and Implementations by David Hanson | Chapter 5: Arena (complete implementation walkthrough) |
| | Arena and Memory Pool Allocators | Modern overview of techniques |
| | High Performance Memory Management: Arena Allocators | Practical guide |
| Practical Implementation | Dan Luu's Malloc Tutorial | Complete walkthrough of building a malloc |
| | Embedded Artistry: Implementing Malloc | First-fit free list implementation |
| Debugging Memory Issues | The Art of Debugging with GDB, DDD, and Eclipse by Matloff & Salzman | Chapters 1-3: Debugger basics |
| | Understanding and Using C Pointers by Richard Reese | Chapter 2: Dynamic Memory Management |
| LD_PRELOAD Technique | What Is the LD_PRELOAD Trick? | Complete explanation |
| | Replacing malloc (GNU C Library) | Official documentation |
| Comparing Allocators | Exploring Different Memory Allocators | jemalloc, tcmalloc, mimalloc comparison |
Recommended Reading Path
- Foundation (Week 1):
- Read K&R Chapter 8.7 (30 pages, classic malloc)
- Read Dan Luu’s Malloc Tutorial (online, practical)
- Read Bryant & O’Hallaron Ch. 9.9 (30 pages, fragmentation/policies)
- Advanced Techniques (Week 2):
- Read jemalloc Background (size classes, arenas)
- Read David Hanson Ch. 5-6 (50 pages, arena allocators)
- Read Embedded Artistry tutorial (implementation details)
- Thread Safety (Week 3):
- Read Mara Bos Ch. 1-3 (60 pages, atomics/memory ordering)
- Read “Hoard” paper (15 pages, thread-local caching)
- Read Bryant & O’Hallaron Ch. 6.4-6.5 (cache hierarchies)
- Debugging & Polish (Week 4):
- Read Matloff & Salzman Ch. 1-3 (debugging techniques)
- Read LD_PRELOAD tutorial (making it usable)
- Skim allocator comparison articles for optimization ideas
Total reading: ~200 pages + online resources over 4 weeks.
Sources:
- jemalloc Background
- Scalable memory allocation using jemalloc
- Exploring Different Memory Allocators
- CS 341 Malloc Tutorial
- CS360 Fragmentation Lecture
- Dan Luu’s Malloc Tutorial
- Embedded Artistry: Implementing Malloc
- What Is the LD_PRELOAD Trick?
- Replacing malloc (GNU C Library)
- Arena and Memory Pool Allocators
- High Performance Memory Management: Arena Allocators
Project 2: Work-Stealing Thread Pool
- File: LEARN_CPP_CONCURRENCY_AND_PARALLELISM.md
- Main Programming Language: C++
- Alternative Programming Languages: Rust, Go, Java
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 4. The “Open Core” Infrastructure
- Difficulty: Level 3: Advanced
- Knowledge Area: Thread Pool Design / Work Scheduling
- Software or Tool: High-Performance Thread Pool Library
What you’ll build: A thread pool with work-stealing scheduling, similar to Rayon’s or Go’s runtime scheduler.
Why it teaches threading primitives: You’ll implement mutexes, condition variables, and atomic operations from scratch (or use them correctly). Work-stealing forces you to understand cache coherency, false sharing, and memory ordering—you can’t just slap a lock on everything.
Core challenges you’ll face:
- Implementing lock-free deques for each worker (maps to atomics and memory ordering)
- Avoiding false sharing between worker threads (maps to cache-line awareness)
- Balancing work without excessive stealing overhead (maps to performance tuning)
- Handling thread parking/unparking efficiently (maps to threading primitives)
- Making the API ergonomic for parallel iterators (maps to API design)
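The lock-free deque challenge above is the heart of the design (it is the Chase-Lev structure cited in the Key Concepts below). A minimal bounded sketch in C11 atomics, using sequentially consistent operations throughout for clarity; the published algorithm uses weaker memory orderings and a growable ring buffer, so treat this as a starting point, not the final word:
#include <stdatomic.h>
#include <stddef.h>
#define DEQ_CAP 1024  // fixed capacity for the sketch; real deques grow
typedef struct { void (*fn)(void*); void* arg; } task_t;
typedef struct {
    atomic_long top;     // index thieves steal from (oldest task)
    atomic_long bottom;  // index the owner pushes/pops at (newest task)
    task_t buf[DEQ_CAP];
} deque_t;
// Owner only: push a task at the bottom.
static int deque_push(deque_t* q, task_t t) {
    long b = atomic_load(&q->bottom);
    long tp = atomic_load(&q->top);
    if (b - tp >= DEQ_CAP) return 0;              // full
    q->buf[b % DEQ_CAP] = t;
    atomic_store(&q->bottom, b + 1);              // publish the task
    return 1;
}
// Owner only: pop from the bottom (LIFO keeps hot data in cache).
static int deque_pop(deque_t* q, task_t* out) {
    long b = atomic_load(&q->bottom) - 1;
    atomic_store(&q->bottom, b);                  // reserve the slot first
    long tp = atomic_load(&q->top);
    if (tp > b) {                                 // deque was empty
        atomic_store(&q->bottom, b + 1);          // restore
        return 0;
    }
    *out = q->buf[b % DEQ_CAP];
    if (tp == b) {                                // last task: race a thief for it
        long expected = tp;
        int won = atomic_compare_exchange_strong(&q->top, &expected, tp + 1);
        atomic_store(&q->bottom, tp + 1);
        return won;
    }
    return 1;
}
// Any other thread: steal from the top (FIFO grabs the oldest, coldest task).
static int deque_steal(deque_t* q, task_t* out) {
    long tp = atomic_load(&q->top);
    long b = atomic_load(&q->bottom);
    if (tp >= b) return 0;                        // empty
    *out = q->buf[tp % DEQ_CAP];
    return atomic_compare_exchange_strong(&q->top, &tp, tp + 1); // claim it
}
Each worker pops from its own deque and only calls deque_steal on others when its own runs dry, so in balanced workloads the atomics almost never contend.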
Key Concepts:
- Memory ordering and atomics: “Rust Atomics and Locks” by Mara Bos - Chapters 1-3 (even if writing in C, this is the clearest explanation)
- Work-stealing algorithm: “Scheduling Multithreaded Computations by Work Stealing” - Blumofe & Leiserson paper
- Lock-free deque: Chase-Lev deque paper - “Dynamic Circular Work-Stealing Deque”
- False sharing: “Computer Systems: A Programmer’s Perspective” by Bryant & O’Hallaron - Chapter 6 (Cache)
Difficulty: Advanced Time estimate: 2-3 weeks Prerequisites: Basic threading, atomics concepts
Real world outcome:
- A library that parallelizes embarrassingly parallel workloads
- Run ./threadpool_demo --threads 8 and see a Mandelbrot set render 7.5x faster than single-threaded
- Benchmark output showing near-linear scaling with core count
Learning milestones:
- Basic thread pool working - You understand condition variables and worker loops
- Work-stealing implemented - You grasp why memory ordering matters and have debugged a race condition
- Near-linear scaling achieved - You’ve eliminated false sharing and understand cache-aware programming
Project 3: Mini Async Runtime
- File: PHASE_2_TRACK_B_SYSTEMS_LIBRARIES_PROJECTS.md
- Programming Language: Rust or C
- Coolness Level: Level 5: Pure Magic (Super Cool)
- Business Potential: 5. The “Industry Disruptor”
- Difficulty: Level 5: Master
- Knowledge Area: Asynchronous I/O / Runtime Design
- Software or Tool: Epoll / Kqueue / Futures
- Main Book: “Asynchronous Programming in Rust” (or libuv documentation)
What you’ll build: A single-threaded async runtime that can serve HTTP requests, similar to a simplified Tokio or libuv.
Why it teaches async runtimes: Async is magic until you build the event loop yourself. You’ll implement futures, understand why poll vs push matters, and see exactly how epoll/kqueue enables thousands of concurrent connections without threads.
Core challenges you’ll face:
- Implementing an event loop with epoll/kqueue (maps to platform differences)
- Building a future/task abstraction (maps to API design)
- Managing wakers and the reactor pattern (maps to async internals)
- Non-blocking IO without burning CPU (maps to performance tuning)
- Supporting timers and cancellation (maps to real-world completeness)
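Before any future/waker machinery, get the raw event loop under your fingers. A minimal sketch of a single-threaded epoll echo server in C (Linux-only; error handling and EAGAIN loops trimmed for brevity):
#include <sys/epoll.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <fcntl.h>
#include <unistd.h>
int main(void) {
    int lfd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr = {0};
    addr.sin_family = AF_INET;
    addr.sin_port = htons(8080);
    bind(lfd, (struct sockaddr*)&addr, sizeof(addr));
    listen(lfd, SOMAXCONN);
    int ep = epoll_create1(0);
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = lfd };
    epoll_ctl(ep, EPOLL_CTL_ADD, lfd, &ev);
    struct epoll_event events[64];
    for (;;) {
        int n = epoll_wait(ep, events, 64, -1);    // block until something is ready
        for (int i = 0; i < n; i++) {
            int fd = events[i].data.fd;
            if (fd == lfd) {                        // new connection
                int cfd = accept(lfd, NULL, NULL);
                fcntl(cfd, F_SETFL, O_NONBLOCK);    // never block the loop
                struct epoll_event cev = { .events = EPOLLIN, .data.fd = cfd };
                epoll_ctl(ep, EPOLL_CTL_ADD, cfd, &cev);
            } else {                                // data ready: echo it back
                char buf[4096];
                ssize_t r = read(fd, buf, sizeof(buf));
                if (r <= 0) { close(fd); continue; } // EOF or error: drop
                write(fd, buf, (size_t)r);
            }
        }
    }
}
Every connection is just an fd registered with the same epoll instance, which is why memory stays flat as connections grow; the runtime's job is to replace the inline echo with a stored continuation (waker) per fd.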
Key Concepts:
- Event loops and IO multiplexing: “The Linux Programming Interface” by Michael Kerrisk - Chapter 63 (Alternative I/O Models)
- Future/Promise patterns: Tokio tutorial’s “Async in Depth” section
- Reactor pattern: “libuv Design Overview” documentation
- Platform IO differences: “Advanced Programming in the UNIX Environment” by Stevens & Rago - Chapter 14
Resources for epoll/kqueue abstraction challenge:
- “Epoll is fundamentally broken” by Marek Majkowski (Cloudflare) - Understanding edge vs level triggering pitfalls
Difficulty: Advanced Time estimate: 2-3 weeks Prerequisites: Sockets basics, understanding of file descriptors
Real world outcome:
- A working HTTP server handling 10,000 concurrent connections on a single thread
- wrk -c 1000 http://localhost:8080/ showing impressive throughput
- Visually see connection count vs memory usage staying flat (unlike thread-per-connection)
Learning milestones:
- Echo server with epoll - You understand the event loop and non-blocking IO
- Task/future abstraction working - You grasp how wakers notify the runtime
- HTTP server benchmarked - You’ve seen why async enables C10K and understand the tradeoffs
Project 4: Cross-Platform Syscall Abstraction Library
- File: PHASE_2_TRACK_B_SYSTEMS_LIBRARIES_PROJECTS.md
- Programming Language: C
- Coolness Level: Level 2: Practical but Forgettable
- Business Potential: 4. The “Open Core” Infrastructure
- Difficulty: Level 3: Advanced
- Knowledge Area: Systems Programming / Portability
- Software or Tool: POSIX / Win32 API
- Main Book: “Advanced Programming in the UNIX Environment” by Stevens & Rago
What you’ll build: A library that wraps platform-specific syscalls (file operations, networking, process control) into a unified API, similar to libuv’s uv_fs_* or Rust’s std.
Why it teaches ABI and platform differences: You’ll discover that “POSIX” doesn’t mean “identical.” struct layouts differ, error codes differ, syscall numbers differ. You’ll fight with calling conventions and learn why #ifdef hell exists.
Core challenges you’ll face:
- Abstracting file operations across Linux/macOS/Windows (maps to platform differences)
- Handling struct layout differences (maps to ABI details)
- Defining stable API versioning (maps to API design)
- Avoiding undefined behavior in type punning (maps to UB avoidance)
- Writing comprehensive test suites (maps to real-world robustness)
Key Concepts:
- POSIX variations: “Advanced Programming in the UNIX Environment” by Stevens & Rago - Throughout (notes platform differences)
- ABI and calling conventions: “Computer Systems: A Programmer’s Perspective” by Bryant & O’Hallaron - Chapter 3 (Machine-Level Representation)
- Undefined behavior in C: “Effective C” by Robert Seacord - Chapters on UB
- API design principles: “C Interfaces and Implementations” by David Hanson - Introduction and design philosophy
Difficulty: Intermediate Time estimate: 2 weeks Prerequisites: C, basic familiarity with at least 2 OSes
Real world outcome:
- A header library that compiles the same code on Linux, macOS, and (optionally) Windows
- A demo program that lists directory contents, reads files, and spawns processes—same code, all platforms
- CI pipeline showing green builds on all target platforms
Learning milestones:
- File ops abstracted - You understand why stat differs between platforms
- Process spawning unified - You grasp fork/exec vs CreateProcess differences
- Library is usable - Someone else can #include your header and write portable code
Real World Outcome
When you complete this project, you’ll have built a production-ready abstraction library similar to what powers Node.js (libuv), Rust’s std, and countless cross-platform applications.
What you’ll actually see:
A working library with a clean API that compiles and runs identically across Linux, macOS, and Windows:
# Linux
$ gcc -o demo demo.c -I./include -L./lib -lxplat
$ ./demo
[xplat] Opening file: test.txt
[xplat] Platform: Linux (POSIX)
[xplat] File descriptor: 3
[xplat] Read 1024 bytes
[xplat] Process spawned: PID 12345 (using fork/exec)
[xplat] Child exited with status: 0
# macOS
$ clang -o demo demo.c -I./include -L./lib -lxplat
$ ./demo
[xplat] Opening file: test.txt
[xplat] Platform: Darwin (POSIX)
[xplat] File descriptor: 3
[xplat] Read 1024 bytes
[xplat] Process spawned: PID 54321 (using fork/exec)
[xplat] Child exited with status: 0
# Windows (using MSVC or MinGW)
C:\> cl demo.c /I.\include /link xplat.lib
C:\> demo.exe
[xplat] Opening file: test.txt
[xplat] Platform: Windows (Win32)
[xplat] File handle: 0x000000000000010C
[xplat] Read 1024 bytes
[xplat] Process spawned: PID 8472 (using CreateProcess)
[xplat] Child exited with status: 0
Your library’s header file (xplat.h) will look like:
// Cross-platform file operations
typedef struct xplat_file xplat_file_t;
xplat_file_t* xplat_open(const char* path, int flags);
ssize_t xplat_read(xplat_file_t* file, void* buf, size_t count);
int xplat_close(xplat_file_t* file);
// Cross-platform process spawning
typedef struct xplat_process xplat_process_t;
xplat_process_t* xplat_spawn(const char* cmd, char** args);
int xplat_wait(xplat_process_t* proc);
// Cross-platform threading
typedef struct xplat_thread xplat_thread_t;
xplat_thread_t* xplat_thread_create(void (*func)(void*), void* arg);
void xplat_thread_join(xplat_thread_t* thread);
Your CI/CD pipeline output showing it works everywhere:
# .github/workflows/ci.yml results
✓ Build on Ubuntu 22.04 (gcc 11.4) - PASSED
✓ Build on Ubuntu 22.04 (clang 14) - PASSED
✓ Build on macOS 13 (AppleClang 15.0) - PASSED
✓ Build on Windows Server 2022 (MSVC) - PASSED
✓ Build on Windows Server 2022 (MinGW) - PASSED
✓ Test Suite (Linux) - 47/47 tests passed
✓ Test Suite (macOS) - 47/47 tests passed
✓ Test Suite (Windows) - 47/47 tests passed
A real application using your library:
// app.c - Same code works on all platforms!
#include "xplat.h"
#include <stdio.h>
int main() {
// File I/O - works on Linux, macOS, Windows
xplat_file_t* f = xplat_open("data.txt", XPLAT_O_RDONLY);
char buf[256];
ssize_t n = xplat_read(f, buf, sizeof(buf));
xplat_close(f);
// Process spawning - works everywhere
char* args[] = {"echo", "Hello from child", NULL};
xplat_process_t* proc = xplat_spawn("echo", args);
int status = xplat_wait(proc);
printf("Child process exited: %d\n", status);
return 0;
}
Behind the scenes, your implementation handles:
On Linux/macOS (POSIX):
// xplat_posix.c
xplat_file_t* xplat_open(const char* path, int flags) {
int fd = open(path, convert_flags(flags));
// Wrap fd in xplat_file_t structure
}
On Windows (Win32):
// xplat_win32.c
xplat_file_t* xplat_open(const char* path, int flags) {
HANDLE h = CreateFileA(path, ...);
// Wrap HANDLE in xplat_file_t structure
}
The concrete artifacts you’ll deliver:
- Library source: src/xplat_posix.c, src/xplat_win32.c, include/xplat.h
- Test suite: 40+ tests covering file I/O, process spawning, threading, networking
- Documentation: API reference showing what works on which platforms
- Build system: CMake or Makefile that builds on all platforms
- CI/CD: GitHub Actions proving it works on every commit
You’ll have tangible proof that you understand the deep differences between operating systems—not just theoretical knowledge, but a working library that someone can actually use.
The Core Question You’re Answering
“Why can’t I just write code that works everywhere? What makes POSIX and Windows so fundamentally different?”
Most developers treat cross-platform compatibility as an annoying afterthought—a maze of #ifdef statements and mysterious build failures. But the differences between POSIX and Windows reveal deep truths about operating system design:
- What is a file descriptor vs a HANDLE? They both represent “open files,” but their internals are completely different
- Why does fork() not exist on Windows? It’s not a missing feature—it’s incompatible with Windows’ process model
- What makes a good abstraction? How do you hide platform differences without sacrificing performance?
After this project, you’ll understand that “portable code” isn’t about avoiding platform-specific APIs—it’s about building the right abstraction layer that respects each platform’s native design while presenting a unified interface.
Concepts You Must Understand First
Stop and research these before coding:
- POSIX File Descriptors vs Windows HANDLEs
- What is a file descriptor? Just an integer index into the process's file descriptor table
- What is a Windows HANDLE? An opaque pointer to a kernel object
- Why does Linux let you select() on file descriptors but Windows doesn't work that way?
- How do you convert between them when needed?
- Book Reference: “Advanced Programming in the UNIX Environment” Ch. 3 - Stevens & Rago
- Book Reference: “Windows System Programming” Ch. 2 - Johnson Hart
- Process Creation Models
- POSIX: fork() duplicates the entire process, then exec() replaces it
- Windows: CreateProcess() spawns a new process in one call
- Why doesn't Windows have fork()? (Hint: Address space layout and DLLs)
- How do you pass environment variables and file handles to children?
- Book Reference: “Advanced Programming in the UNIX Environment” Ch. 8 - Stevens & Rago
- Book Reference: “Windows Via C/C++” Ch. 4 - Jeffrey Richter
- Error Handling Conventions
- POSIX: Return -1 on error, set errno (a thread-local global)
- Windows: Return NULL/FALSE on error, call GetLastError()
- How do you design a unified error API?
- What's the difference between EINVAL and ERROR_INVALID_PARAMETER?
- Book Reference: “The Linux Programming Interface” Ch. 3.4 - Kerrisk
- Struct Layout and ABI
- Why does struct stat differ between Linux and macOS?
- What is struct padding and alignment?
- How do 32-bit vs 64-bit systems affect structure layout?
- What is _FILE_OFFSET_BITS and why does it matter?
- Book Reference: “Computer Systems: A Programmer's Perspective” Ch. 3.9 - Bryant & O'Hallaron
- Calling Conventions
- What is cdecl vs stdcall vs fastcall?
- Why does Windows use different calling conventions for different APIs?
- How do you make function pointers portable?
- Book Reference: “Computer Systems: A Programmer's Perspective” Ch. 3.7 - Bryant & O'Hallaron
- Thread-Local Storage (TLS)
- POSIX uses pthread_key_t, Windows uses TlsAlloc()
- How do you implement thread-safe error reporting?
- What is __thread vs __declspec(thread)?
- Book Reference: “Programming with POSIX Threads” Ch. 5.4 - Butenhof
- Symbol Visibility and Linking
- What is the difference between static and dynamic linking?
- How do you export symbols from a shared library?
- What is __declspec(dllexport) vs __attribute__((visibility("default")))?
- Why do Windows DLLs need .lib files?
- Book Reference: “Advanced C and C++ Compiling” Ch. 7 - Stevanovic
- Memory-Mapped Files
- POSIX: mmap() with MAP_SHARED or MAP_PRIVATE
- Windows: CreateFileMapping() + MapViewOfFile()
- What guarantees do you get about consistency?
- How do you handle page faults?
- Book Reference: “The Linux Programming Interface” Ch. 49 - Kerrisk
Questions to Guide Your Design
Before implementing, think through these:
- API Design Philosophy
- Should your API look like POSIX (file descriptors as ints)?
- Or like Windows (opaque handles as pointers)?
- Or something completely new?
- How do you handle functionality that exists on one platform but not others?
- Error Handling Strategy
- Do you use return codes like POSIX (-1 + errno)?
- Or return NULL and have a separate error getter?
- How do you map platform-specific errors to generic codes? (One possible shape is sketched after this list.)
- Should you provide both generic and platform-specific error details?
- Resource Management
- Who is responsible for freeing memory: caller or library?
- How do you prevent resource leaks when code is platform-specific?
- Should you use opaque types or expose internals?
- Feature Detection
- How do you detect if functionality is available at compile time?
- What about runtime feature detection?
- Should unsupported operations fail at compile time or runtime?
- Performance vs Portability
- When should you use platform-specific optimizations?
- How do you avoid the abstraction becoming a performance bottleneck?
- Should you allow “escape hatches” to access native handles?
- Testing Strategy
- How do you test Windows code when developing on Linux (and vice versa)?
- Should you use virtual machines, containers, or CI/CD?
- How do you test the same logical operation across platforms?
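To make the error-handling question concrete, here is one possible shape as a minimal sketch: a library-defined error enum, a thread-local last-error slot, and per-platform translation. Every name here (xplat_err_t, xplat_last_error, set_error_from_os) is hypothetical, not a real library's API:
// Hypothetical unified error codes for the library.
typedef enum { XPLAT_OK = 0, XPLAT_ENOENT, XPLAT_EACCES, XPLAT_EINVAL, XPLAT_EUNKNOWN } xplat_err_t;
// One last-error slot per thread (C11 _Thread_local maps onto the native TLS
// mechanisms discussed above).
static _Thread_local xplat_err_t xplat_errno = XPLAT_OK;
xplat_err_t xplat_last_error(void) { return xplat_errno; }
#ifdef _WIN32
#include <windows.h>
static void set_error_from_os(void) {
    switch (GetLastError()) {
        case ERROR_FILE_NOT_FOUND:    xplat_errno = XPLAT_ENOENT; break;
        case ERROR_ACCESS_DENIED:     xplat_errno = XPLAT_EACCES; break;
        case ERROR_INVALID_PARAMETER: xplat_errno = XPLAT_EINVAL; break;
        default:                      xplat_errno = XPLAT_EUNKNOWN; break;
    }
}
#else
#include <errno.h>
static void set_error_from_os(void) {
    switch (errno) {
        case ENOENT: xplat_errno = XPLAT_ENOENT; break;
        case EACCES: xplat_errno = XPLAT_EACCES; break;
        case EINVAL: xplat_errno = XPLAT_EINVAL; break;
        default:     xplat_errno = XPLAT_EUNKNOWN; break;
    }
}
#endif
Each xplat_* function calls set_error_from_os() on failure, so callers get one portable error query regardless of platform.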
Thinking Exercise
Design the File API on Paper
Before writing any code, design how you’d abstract file operations:
// On POSIX:
int fd = open("/path/to/file", O_RDONLY);
read(fd, buffer, size);
close(fd);
// On Windows:
HANDLE h = CreateFileA("C:\\path\\to\\file", GENERIC_READ, ...);
ReadFile(h, buffer, size, &bytes_read, NULL);
CloseHandle(h);
// Your abstraction: ???
xplat_file_t* f = xplat_open(???, ???);
xplat_read(f, ???, ???);
xplat_close(f);
Questions to answer:
- What is xplat_file_t? Should it be:
- An opaque pointer? (typedef struct xplat_file xplat_file_t;)
- A tagged union containing either fd or HANDLE?
- Just an int with special encoding?
- How do you handle flags?
- O_RDONLY is 0 on POSIX, but GENERIC_READ is 0x80000000 on Windows
- Do you define your own flag constants?
- Do you translate flags at runtime?
- What about errors?
- If open() fails, it returns -1 and sets errno
- If CreateFileA() fails, it returns INVALID_HANDLE_VALUE
- What does xplat_open() return on error?
- Sync vs async I/O?
- POSIX has blocking vs non-blocking (O_NONBLOCK)
- Windows has synchronous vs overlapped I/O
- How do you expose this in your API?
Draw a diagram showing how a call to xplat_read() flows through your library:
User calls: xplat_read(file, buf, 1024)
↓
[Platform detection at compile-time]
↓ ↓
POSIX path Windows path
↓ ↓
read(fd, buf, 1024) ReadFile(h, buf, 1024, ...)
↓ ↓
[Return value translation]
↓ ↓
Return to user (same behavior)
The Interview Questions They’ll Ask
Prepare to answer these:
- “What’s the fundamental difference between a POSIX file descriptor and a Windows HANDLE?”
- Answer: fd is an integer index, HANDLE is an opaque kernel object pointer
- Why it matters: Affects how you can use them (select/poll vs IOCP)
- “Why doesn’t Windows have fork()?”
- Answer: Windows processes don’t support copy-on-write address space duplication
- The real reason: DLL base addresses and loader state make it impractical
- “How would you implement thread-safe error reporting?”
- Answer: Thread-local storage for error codes
- Implementation differs:
pthread_key_tvsTlsAlloc()
- “What is structure padding and why does it matter for cross-platform code?”
- Answer: Compiler inserts padding bytes to satisfy alignment requirements
- Can differ between platforms/compilers, breaking binary compatibility
- “How do you handle features that exist on one platform but not others?”
- Answer: Compile-time detection with `#ifdef`, runtime capability checks, or graceful degradation
  - Example: `epoll` on Linux, `kqueue` on BSD, `IOCP` on Windows
- Answer: Compile-time detection with
- “What’s the difference between static and dynamic linking, and why does it matter for portability?”
- Answer: Static includes library code in binary, dynamic loads at runtime
- Windows requires .lib import library for DLLs, POSIX uses just .so
- “How would you test cross-platform code during development?”
- Answer: Combination of virtual machines, Docker containers, and CI/CD
- Best practice: GitHub Actions with matrix builds across OS
- “What is undefined behavior in type punning, and how does it affect portability?”
- Answer: Accessing same memory through incompatible pointer types
- Example: `*(int*)&float_var` invokes UB and may break on different compilers
Hints in Layers
Hint 1: Start with Opaque Types
Don’t expose platform details in your header:
// xplat.h (public header)
typedef struct xplat_file xplat_file_t; // Opaque!
xplat_file_t* xplat_open(const char* path, int flags);
// xplat_internal.h (private header)
struct xplat_file {
#ifdef _WIN32
HANDLE handle;
#else
int fd;
#endif
};
This lets you change internals without breaking API.
Hint 2: Use Feature Detection Macros
Create a consistent way to detect platforms:
// xplat_platform.h
#if defined(_WIN32) || defined(_WIN64)
#define XPLAT_WINDOWS 1
#define XPLAT_POSIX 0
#elif defined(__unix__) || defined(__APPLE__)
#define XPLAT_POSIX 1
#define XPLAT_WINDOWS 0
#else
#error "Unsupported platform"
#endif
Hint 3: Create a Translation Layer for Flags
Map your generic flags to platform-specific ones:
// Define your own flags
#define XPLAT_O_RDONLY 0x01
#define XPLAT_O_WRONLY 0x02
#define XPLAT_O_RDWR 0x04
// Translate at runtime (POSIX)
static int translate_flags_posix(int xplat_flags) {
    int flags = 0;
    // Note: O_RDONLY is 0 on POSIX, so read-only needs no bit at all
    if (xplat_flags & XPLAT_O_WRONLY) flags |= O_WRONLY;
    if (xplat_flags & XPLAT_O_RDWR)   flags |= O_RDWR;
    return flags;
}
// Translate at runtime (Windows)
static DWORD translate_flags_win32(int xplat_flags) {
    DWORD access = 0;
    if (xplat_flags & XPLAT_O_RDONLY) access |= GENERIC_READ;
    if (xplat_flags & XPLAT_O_WRONLY) access |= GENERIC_WRITE;
    if (xplat_flags & XPLAT_O_RDWR)   access |= GENERIC_READ | GENERIC_WRITE;
    return access;
}
Hint 4: Build Incrementally
Start with the simplest possible abstraction:
- Phase 1: Just abstract `open()`/`CreateFile()` and return success/failure
- Phase 2: Add `read()`/`ReadFile()` abstraction
- Phase 3: Add proper error handling and reporting
- Phase 4: Add `close()`/`CloseHandle()` abstraction
- Phase 5: Extend to directories, process spawning, etc.
Don’t try to abstract everything at once!
Hint 5: Use CI/CD from Day One
Set up GitHub Actions early:
# .github/workflows/ci.yml
name: CI
on: [push, pull_request]
jobs:
build:
strategy:
matrix:
os: [ubuntu-latest, macos-latest, windows-latest]
runs-on: ${{ matrix.os }}
steps:
- uses: actions/checkout@v3
- name: Build
run: make
- name: Test
run: make test
This catches portability bugs immediately.
Hint 6: Study libuv’s Source Code
Read how libuv handles platform abstraction:
git clone https://github.com/libuv/libuv.git
cd libuv
# Look at how they structure platform-specific code:
ls src/unix/ # POSIX implementations
ls src/win/ # Windows implementations
cat include/uv.h # Public API (platform-agnostic)
Pay attention to:
- How they use opaque handles (`uv_loop_t`, `uv_handle_t`)
- How they map platform errors into one namespace (`uv_errno_t`, the `UV_E*` codes)
- How they pick backends per platform (`src/unix/kqueue.c` vs the epoll code on Linux)
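Error-code mapping deserves the same treatment in your own library. A minimal sketch, assuming hypothetical `XPLAT_E*` codes (the same idea as libuv's `UV_E*` constants):

```c
#include <errno.h>

enum xplat_err {
    XPLAT_OK = 0,
    XPLAT_ENOENT,    /* file not found */
    XPLAT_EACCES,    /* permission denied */
    XPLAT_EUNKNOWN   /* anything not mapped yet */
};

#ifdef _WIN32
#include <windows.h>
static enum xplat_err xplat_last_error(void) {
    switch (GetLastError()) {           /* raw Windows error */
    case ERROR_FILE_NOT_FOUND:
    case ERROR_PATH_NOT_FOUND: return XPLAT_ENOENT;
    case ERROR_ACCESS_DENIED:  return XPLAT_EACCES;
    default:                   return XPLAT_EUNKNOWN;
    }
}
#else
static enum xplat_err xplat_last_error(void) {
    switch (errno) {                    /* raw POSIX errno */
    case ENOENT: return XPLAT_ENOENT;
    case EACCES: return XPLAT_EACCES;
    default:     return XPLAT_EUNKNOWN;
    }
}
#endif
```

Keep the raw platform error available too (per the design questions above) so callers can log the exact failure.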
Hint 7: Handle Struct Differences with Getters
Don’t expose platform-specific structs directly:
// Bad: Exposes struct layout
struct xplat_stat {
size_t size;
time_t mtime;
// Fields differ across platforms!
};
// Good: Use accessor functions
typedef struct xplat_stat xplat_stat_t; // Opaque
size_t xplat_stat_size(xplat_stat_t* st);
time_t xplat_stat_mtime(xplat_stat_t* st);
Internally, convert from struct stat (POSIX) or WIN32_FILE_ATTRIBUTE_DATA (Windows).
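A minimal sketch of the POSIX half of that conversion; the `xplat_stat_fill_posix` helper is illustrative, and the struct definition would live in your private header:

```c
#include <stddef.h>
#include <sys/stat.h>
#include <time.h>

struct xplat_stat {      /* private definition behind the opaque type */
    size_t size;
    time_t mtime;
};

/* Fill the portable struct from POSIX stat(2); the Windows build would
 * fill the same struct from GetFileAttributesExA() instead. */
int xplat_stat_fill_posix(const char* path, struct xplat_stat* out) {
    struct stat st;
    if (stat(path, &st) != 0) return -1;
    out->size  = (size_t)st.st_size;
    out->mtime = st.st_mtime;
    return 0;
}

size_t xplat_stat_size(const struct xplat_stat* st)  { return st->size; }
time_t xplat_stat_mtime(const struct xplat_stat* st) { return st->mtime; }
```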
Hint 8: Test on Real Hardware, Not Just VMs
Platform quirks only show up on real systems:
- Test on actual Windows (not just WSL)
- Test on real macOS (not just Hackintosh)
- Test on different Linux distributions (Ubuntu, Fedora, Alpine)
Use free CI/CD to test on real GitHub-hosted runners.
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| POSIX file I/O fundamentals | “Advanced Programming in the UNIX Environment” by Stevens & Rago | Ch. 3: File I/O |
| POSIX process control | “Advanced Programming in the UNIX Environment” by Stevens & Rago | Ch. 8: Process Control |
| Windows file and process APIs | “Windows System Programming” by Johnson Hart | Ch. 2-3: File I/O, Processes |
| Windows vs POSIX differences | “Windows Via C/C++” by Jeffrey Richter | Ch. 4: Processes; Ch. 10: Thread Synchronization |
| Platform detection and portability | “21st Century C” by Ben Klemens | Ch. 2: Debugging, Testing, Documenting |
| Error handling across platforms | “The Linux Programming Interface” by Michael Kerrisk | Ch. 3: System Programming Concepts |
| ABI and struct layout | “Computer Systems: A Programmer’s Perspective” by Bryant & O’Hallaron | Ch. 3.9: Heterogeneous Data Structures |
| Calling conventions | “Computer Systems: A Programmer’s Perspective” by Bryant & O’Hallaron | Ch. 3.7: Procedures |
| Symbol visibility and linking | “Advanced C and C++ Compiling” by Milan Stevanovic | Ch. 7: Designing Dynamic Libraries |
| Thread-local storage | “Programming with POSIX Threads” by David Butenhof | Ch. 5.4: Thread-Specific Data |
| Memory-mapped files (POSIX) | “The Linux Programming Interface” by Michael Kerrisk | Ch. 49: Memory Mappings |
| Cross-platform design patterns | “C Interfaces and Implementations” by David Hanson | Throughout (design philosophy) |
| libuv design and internals | “An Introduction to libuv” by Nikhil Marathe | Entire book (free online) |
| Practical portability techniques | “Practical C Programming” by Steve Oualline | Ch. 19: Portability Problems |
Project 5: High-Performance String Search Library
- File: PHASE_2_TRACK_B_SYSTEMS_LIBRARIES_PROJECTS.md
- Programming Language: C (with SIMD Intrinsics)
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 4. The “Open Core” Infrastructure
- Difficulty: Level 4: Expert
- Knowledge Area: Algorithms / Low-Level Optimization
- Software or Tool: SIMD / Vectorization
- Main Book: “Modern X86 Assembly Language Programming” by Daniel Kusswurm
What you’ll build: A fast substring search library with SIMD acceleration, similar to what powers ripgrep’s core.
Why it teaches performance tuning: You’ll learn that the algorithm textbooks teach (KMP, Boyer-Moore) isn’t what fast tools use. You’ll discover SIMD, understand why memory access patterns matter more than Big-O for real data, and learn to profile before optimizing.
Core challenges you’ll face:
- Implementing SIMD-accelerated search (maps to performance tuning)
- Handling different string encodings (UTF-8 awareness) (maps to real-world correctness)
- Designing an API that’s both fast and safe (maps to API design)
- Avoiding undefined behavior with pointer arithmetic (maps to UB avoidance)
- Beating `strstr` and `memmem` in benchmarks (maps to measurable outcome)
Key Concepts:
- SIMD fundamentals: “Modern X86 Assembly Language Programming” by Daniel Kusswurm - SIMD chapters
- Fast string search algorithms: Andrew Gallant’s (BurntSushi) blog posts on ripgrep’s internals
- Cache-aware programming: “What Every Programmer Should Know About Memory” by Ulrich Drepper
- UTF-8 handling: “UTF-8 Everywhere” manifesto and ripgrep’s encoding handling
Resources for SIMD string search challenge:
- “Hyperscan” Intel paper - How regex engines use SIMD for literal matching
- Difficulty: Advanced
- Time estimate: 2-3 weeks
- Prerequisites: C, basic understanding of CPU architecture
Real world outcome:
- A command-line tool: `./fastsearch "pattern" largefile.txt`
- Benchmarks showing 3-10x speedup over naive search on large files
- Flame graphs showing where time is actually spent
Learning milestones:
- Basic SIMD search working - You understand vector instructions and intrinsics
- UTF-8 handled correctly - You grasp why byte-level search needs encoding awareness
- Consistently faster than stdlib - You’ve profiled, found bottlenecks, and optimized the right things
Real World Outcome
When you complete this project, you’ll have built a high-performance string search library that rivals the core of tools like ripgrep, ag (the silver searcher), and grep. This is the kind of code that powers developer tools used by millions.
What you’ll actually see:
A command-line tool that searches files blazingly fast, with benchmarks proving it:
# Build your library with SIMD support
$ gcc -O3 -march=native -mavx2 -o fastsearch fastsearch.c search_simd.c
$ ./fastsearch --version
fastsearch 1.0 - SIMD-accelerated string search
Detected CPU features: SSE4.2, AVX2, BMI2
Using: AVX2 vectorized search (256-bit vectors)
# Search a large file (100MB log file)
$ time ./fastsearch "ERROR" /var/log/huge.log
[Line 1247] 2025-12-27 ERROR: Connection timeout
[Line 5893] 2025-12-27 ERROR: Database unreachable
[Line 8312] 2025-12-27 ERROR: Memory allocation failed
... (147 matches found)
real 0m0.143s
user 0m0.135s
sys 0m0.008s
# Compare with standard grep
$ time grep "ERROR" /var/log/huge.log
... (same 147 matches)
real 0m1.521s
user 0m1.489s
sys 0m0.031s
# Your SIMD version is 10.6x faster!
Benchmark output showing performance gains:
$ ./bench_search
Running benchmarks on 100MB test corpus...
Algorithm | Throughput | Speedup
------------------------|-----------------|----------
Naive search | 142 MB/s | 1.0x
libc strstr() | 385 MB/s | 2.7x
Two-way algorithm | 521 MB/s | 3.7x
Boyer-Moore | 447 MB/s | 3.1x
SIMD SSE4.2 (your impl) | 1,234 MB/s | 8.7x
SIMD AVX2 (your impl) | 2,847 MB/s | 20.0x
SIMD AVX-512 (your impl)| 4,921 MB/s | 34.6x
Pattern length tests (searching for "the"):
- 3-byte pattern: 4,921 MB/s (AVX-512)
- 10-byte pattern: 3,215 MB/s (AVX-512)
- 50-byte pattern: 1,892 MB/s (falls back to Boyer-Moore)
UTF-8 correctness test: ✓ PASSED (all 1,000 multibyte patterns)
Your library’s API:
// search.h - Public API
#include <stddef.h>
// Initialize search engine (detects CPU features)
void search_init(void);
// Search for needle in haystack
// Returns: pointer to first occurrence, or NULL
const char* search_find(const char* haystack, size_t haystack_len,
const char* needle, size_t needle_len);
// Search with options
typedef struct {
int case_insensitive;
int utf8_mode;
int use_simd; // Auto-detect if not specified
} search_opts_t;
const char* search_find_opts(const char* haystack, size_t haystack_len,
const char* needle, size_t needle_len,
const search_opts_t* opts);
// Get detected CPU features
const char* search_get_features(void);
What the SIMD code looks like under the hood:
// search_avx2.c - AVX2 implementation
#include <immintrin.h>
const char* search_avx2(const char* haystack, size_t len,
const char* needle, size_t needle_len) {
// Load first character of needle into all 32 bytes of a 256-bit register
__m256i first_char = _mm256_set1_epi8(needle[0]);
size_t i = 0;
// Process 32 bytes at a time
while (i + 32 <= len) {
// Load 32 bytes from haystack
__m256i block = _mm256_loadu_si256((__m256i*)(haystack + i));
// Compare all 32 bytes in parallel
__m256i cmp = _mm256_cmpeq_epi8(block, first_char);
// Convert comparison result to bitmask
int mask = _mm256_movemask_epi8(cmp);
// Check each set bit (potential match)
while (mask != 0) {
int pos = __builtin_ctz(mask); // Count trailing zeros
// Verify full match at this position (bounds check first so
// memcmp never reads past the end of the haystack)
if (i + pos + needle_len <= len &&
    memcmp(haystack + i + pos, needle, needle_len) == 0) {
return haystack + i + pos;
}
mask &= mask - 1; // Clear lowest set bit
}
i += 32;
}
// Handle remaining bytes with scalar code
// ...
}
Profiling output showing where time is spent:
$ perf record -g ./fastsearch "pattern" largefile.txt
$ perf report
# Samples: 10K of event 'cycles'
# Overhead Command Shared Object Symbol
# ........ ........... .................. .........................
47.23% fastsearch fastsearch [.] search_avx2
12.45% fastsearch [kernel.kallsyms] [k] page_fault
8.92% fastsearch libc-2.31.so [.] __memcmp_avx2
7.31% fastsearch fastsearch [.] utf8_validate
3.24% fastsearch fastsearch [.] search_init
...
# Notice: Most time in your SIMD code, not I/O or system calls!
Flame graph visualization:
fastsearch (100%)
├─ search_avx2 (47.2%) ← Your hot path
│ ├─ _mm256_loadu_si256 (21.3%)
│ ├─ _mm256_cmpeq_epi8 (15.8%)
│ └─ __builtin_ctz (10.1%)
├─ page_fault (12.4%) ← Kernel memory management
├─ __memcmp_avx2 (8.9%) ← Verifying matches
└─ utf8_validate (7.3%) ← Character encoding
The concrete artifacts you’ll deliver:
- Core library: `search.c`, `search_avx2.c`, `search_sse42.c`, `search_scalar.c`
- CPU feature detection: Runtime dispatch to fastest available implementation
- Benchmark suite: Comparing against glibc, ripgrep internals, naive algorithms
- CLI tool: `fastsearch` command-line utility
- Test suite: 500+ tests including edge cases, UTF-8, performance regressions
- Documentation: Explaining when to use SIMD vs traditional algorithms
Real-world integration example:
Someone could use your library in their text editor:
// text_editor.c
#include "search.h"
void highlight_search_results(const char* buffer, size_t len,
const char* query) {
const char* pos = buffer;
size_t remaining = len;
while (remaining > 0) {
const char* match = search_find(pos, remaining,
query, strlen(query));
if (!match) break;
// Highlight this match in the editor
highlight_text(match - buffer, strlen(query));
size_t offset = (match - pos) + strlen(query);
pos += offset;
remaining -= offset;
}
}
You’ll have tangible proof of understanding:
- How modern CPUs process data in parallel
- Why memory access patterns matter more than algorithmic complexity for real data
- How to write portable SIMD code that degrades gracefully
- How to profile and optimize at the assembly level
The Core Question You’re Answering
“Why are tools like ripgrep so much faster than grep? What makes string search fast at the hardware level?”
Everyone learns Boyer-Moore and KMP in algorithms class, but those aren’t what make modern search tools fast. The real question is:
- What is SIMD and why does it matter? How can you search 32 bytes in the same time it takes to search 1 byte?
- Why does memory matter more than Big-O? A cache-friendly O(n) beats a cache-hostile O(log n)
- How do you write code that works on different CPUs? SSE, AVX2, AVX-512, NEON—how do you support them all?
After this project, you’ll understand that performance isn’t about clever algorithms—it’s about understanding your hardware and feeding it data the way it wants to consume it.
Concepts You Must Understand First
Stop and research these before coding:
- What is SIMD? (Single Instruction, Multiple Data)
- How does a CPU execute the same operation on multiple data elements in parallel?
- What is a vector register? (128-bit XMM, 256-bit YMM, 512-bit ZMM)
- Why can you compare 32 bytes in one instruction instead of looping 32 times?
- What’s the difference between SSE, AVX, and AVX-512?
- Book Reference: “Modern X86 Assembly Language Programming” Ch. 5-6 - Kusswurm
- Book Reference: “Computer Systems: A Programmer’s Perspective” Ch. 5.8 - Bryant & O’Hallaron
- CPU Intrinsics vs Assembly
- What are intrinsics? (C functions that map directly to assembly instructions)
- Why use intrinsics instead of inline assembly?
- How do you use `_mm256_loadu_si256()` and `_mm256_cmpeq_epi8()`?
  - What does “unaligned load” mean and why does it matter?
- Book Reference: “Modern X86 Assembly Language Programming” Ch. 7 - Kusswurm
- String Search Algorithms
- Naive search: O(nm) but simple and cache-friendly for short patterns
- Two-way algorithm: What glibc uses in `strstr()` (linear time, constant space)
  - Boyer-Moore: Skip ahead based on mismatches (good for long patterns)
- SIMD-accelerated search: Why it dominates for typical workloads
- Book Reference: “Algorithms” by Sedgewick & Wayne - Ch. 5.3: Substring Search
- Resource: Andrew Gallant’s blog posts on ripgrep’s SIMD usage
- Cache-Aware Programming
- What is a cache line? (Typically 64 bytes)
- Why does sequential access beat random access by 100x?
- What is prefetching and how does the CPU do it automatically?
- How does SIMD help with cache efficiency?
- Book Reference: “Computer Systems: A Programmer’s Perspective” Ch. 6 - Bryant & O’Hallaron
- Resource: “What Every Programmer Should Know About Memory” - Ulrich Drepper
- UTF-8 Encoding
- Why can’t you treat UTF-8 as just bytes? (Multi-byte characters!)
- How do you detect if you’re in the middle of a multi-byte sequence?
- What happens if you report a match that splits a character?
- How does ripgrep handle this?
- Resource: “UTF-8 Everywhere” manifesto
- Book Reference: “The Linux Programming Interface” Ch. 10.2 - Kerrisk
- CPU Feature Detection
- How do you detect if the CPU supports AVX2 at runtime?
- What is the `CPUID` instruction?
  - How do you write code that falls back to SSE or scalar if AVX2 is unavailable?
- What are function pointers used for in feature dispatch?
- Book Reference: “Modern X86 Assembly Language Programming” Ch. 3 - Kusswurm
- Memory Alignment
- What does “16-byte aligned” mean?
- Why do aligned loads (`_mm256_load_si256`) require alignment?
  - When should you use unaligned loads (`_mm256_loadu_si256`)?
- Book Reference: “Computer Systems: A Programmer’s Perspective” Ch. 3.9.3 - Bryant & O’Hallaron
- Profiling and Benchmarking
- How do you use `perf` to see where CPU time is spent?
  - What is cache miss rate and how do you measure it?
  - How do you generate flame graphs?
  - What’s the difference between wall-clock time and CPU time?
  - Book Reference: “Systems Performance” Ch. 6 - Brendan Gregg
  - Resource: `perf` documentation and Brendan Gregg’s website
Questions to Guide Your Design
Before implementing, think through these:
- Algorithm Selection
- When should you use SIMD vs Boyer-Moore vs naive search?
- How does pattern length affect the best algorithm choice?
- Should you always use the fastest SIMD available?
- Handling Different CPUs
- How do you support CPUs without AVX2? (Laptops, older servers)
- Should you compile multiple versions and choose at runtime?
- What’s your fallback strategy: SSE4.2 → SSE2 → scalar?
- API Design
- Should your API look like `strstr()` (return pointer)?
  - Or like `memmem()` (take explicit lengths)?
  - How do you handle case-insensitive search with SIMD?
- UTF-8 Correctness
- Should you validate UTF-8 before searching?
- What if the haystack has invalid UTF-8?
- How do you ensure you don’t split multi-byte characters?
- Performance Measurement
- How do you benchmark fairly? (Avoid measuring I/O, focus on search)
- What size files should you test on? (Fit in L3 cache? Larger?)
- How do you avoid compiler optimizing away your benchmark?
- Memory Access Patterns
- Should you process the entire file linearly?
- Would memory-mapped I/O help or hurt?
- How do you minimize cache misses?
Thinking Exercise
Trace a SIMD Search By Hand
Before writing SIMD code, trace this algorithm on paper:
Searching for “cat” in “the cat sat”:
Pattern: "cat" (needle_len = 3)
Haystack: "the cat sat" (len = 11)
1. Load first char of pattern ('c') into 256-bit register:
['c','c','c','c','c','c','c','c', ...] (32 copies)
2. Load 32 bytes from haystack (pad with zeros if needed):
['t','h','e',' ','c','a','t',' ','s','a','t','\0',0,0,0,...]
3. Compare all 32 bytes in parallel:
[0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,...] (1 = match)
4. Convert to bitmask:
0b00000000000000000000000000010000 = 0x10 (bit 4 is set)
5. Find position of set bit:
pos = 4
6. Verify full match at haystack[4:7]:
haystack[4:7] = "cat" ✓ MATCH!
Questions while tracing:
- What if there are multiple ‘c’s in the 32-byte block?
- How do you handle the case where the pattern spans two blocks?
- What’s the worst-case scenario for this algorithm?
- Why is this faster than checking each byte individually?
Now consider edge cases:
Pattern: "café" (UTF-8: 'c','a','f',0xC3,0xA9)
Haystack: "the café sat"
- The 'é' is 2 bytes in UTF-8
- What happens if your 32-byte block ends between those bytes?
- How do you ensure correctness?
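A minimal sketch of the boundary check this exercise is pushing you toward: a byte-level hit only counts if it neither starts nor ends inside a multi-byte character (UTF-8 continuation bytes look like `10xxxxxx`):

```c
#include <stddef.h>

static int is_utf8_continuation(unsigned char b) {
    return (b & 0xC0) == 0x80;   /* 10xxxxxx */
}

/* Returns 1 if the match at offset pos starts and ends on UTF-8
 * character boundaries within hay. */
static int match_on_char_boundaries(const unsigned char* hay, size_t h_len,
                                    size_t pos, size_t n_len) {
    if (is_utf8_continuation(hay[pos]))
        return 0;                           /* starts mid-character */
    size_t end = pos + n_len;
    if (end < h_len && is_utf8_continuation(hay[end]))
        return 0;                           /* ends mid-character */
    return 1;
}
```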
The Interview Questions They’ll Ask
Prepare to answer these:
- “What is SIMD and how does it make string search faster?”
- Answer: Process multiple bytes in parallel using vector instructions
- Example: AVX2 can compare 32 bytes in one instruction
- “Why doesn’t everyone just use SIMD for everything?”
- Answer: Not all algorithms vectorize well; SIMD has overhead for setup
- Useful when: processing large amounts of data with same operation
- “How do you handle CPUs that don’t support AVX2?”
- Answer: Runtime CPU feature detection with fallback implementations
- Use function pointers or `#ifdef` to compile multiple versions
- “What’s the difference between aligned and unaligned loads?”
- Answer: Aligned requires address divisible by vector size, faster
- Unaligned works anywhere but may cross cache lines (slower)
- “How does UTF-8 complicate string search?”
- Answer: Can’t split multi-byte characters; byte-level match != character match
- Solution: Validate match boundaries or track character boundaries
- “Why is cache locality important for search performance?”
- Answer: RAM is 100x slower than L1 cache; sequential access enables prefetching
- SIMD helps by processing more data per cache line fetched
- “How would you profile a performance regression?”
- Answer: Use `perf` to compare before/after; look for cache misses and branch mispredicts
  - Generate flame graphs to see where time is spent
- “When would Boyer-Moore beat SIMD search?”
- Answer: Very long patterns where skip distance is large
- SIMD excels at short patterns (1-16 bytes) in modern workloads
Hints in Layers
Hint 1: Start Without SIMD
Get the API and testing framework right first:
// Naive implementation
const char* search_naive(const char* haystack, size_t h_len,
                         const char* needle, size_t n_len) {
    if (n_len == 0) return haystack;  // match strstr() semantics
    if (n_len > h_len) return NULL;   // h_len - n_len would underflow!
    for (size_t i = 0; i <= h_len - n_len; i++) {
        if (memcmp(haystack + i, needle, n_len) == 0) {
            return haystack + i;
        }
    }
    return NULL;
}
Make sure this works correctly before optimizing!
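A quick harness for the edge cases that bite later, assuming the `search_naive()` above; run it before touching any intrinsics:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

static void test_search(void) {
    const char* hay = "the cat sat";
    assert(search_naive(hay, strlen(hay), "cat", 3) == hay + 4);
    assert(search_naive(hay, strlen(hay), "dog", 3) == NULL);
    assert(search_naive(hay, strlen(hay), "sat", 3) == hay + 8); /* at the end */
    assert(search_naive(hay, 3, "the cat", 7) == NULL); /* needle > haystack */
}
```

Keep the same harness as you add SIMD backends; every implementation must pass identical tests.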
Hint 2: Detect CPU Features
Use `__builtin_cpu_supports()` on GCC/Clang:
// Function pointer selected once at startup
static const char* (*search_impl)(const char*, size_t,
                                  const char*, size_t);
void search_init(void) {
#if defined(__x86_64__) || defined(_M_X64)
if (__builtin_cpu_supports("avx512f")) {
search_impl = search_avx512;
} else if (__builtin_cpu_supports("avx2")) {
search_impl = search_avx2;
} else if (__builtin_cpu_supports("sse4.2")) {
search_impl = search_sse42;
} else {
search_impl = search_naive;
}
#else
search_impl = search_naive;
#endif
}
Hint 3: Start with SSE 4.2 (Simpler than AVX2)
SSE4.2 has `_mm_cmpestri`, whose `_SIDD_CMP_EQUAL_ORDERED` mode does ordered substring matching directly:
#include <nmmintrin.h> // SSE4.2
#include <string.h>
const char* search_sse42(const char* haystack, size_t h_len,
                         const char* needle, size_t n_len) {
    // cmpestri works on needles up to 16 bytes; note the loadu below
    // reads a full 16 bytes from the needle buffer
    if (n_len == 0 || n_len > 16 || n_len > h_len) return NULL;
    __m128i needle_vec = _mm_loadu_si128((const __m128i*)needle);
    size_t i = 0;
    while (i + 16 <= h_len) {
        __m128i hay_vec = _mm_loadu_si128((const __m128i*)(haystack + i));
        // EQUAL_ORDERED reports the first offset where the needle begins
        // to match, including partial matches at the end of the block
        int idx = _mm_cmpestri(needle_vec, (int)n_len, hay_vec, 16,
                               _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_ORDERED |
                               _SIDD_LEAST_SIGNIFICANT);
        if (idx == 16) { i += 16; continue; }  // no candidate in this block
        // Verify the candidate (it may have been cut off by the block)
        if (i + idx + n_len <= h_len &&
            memcmp(haystack + i + idx, needle, n_len) == 0) {
            return haystack + i + idx;
        }
        i += idx + 1;  // resume just past the failed candidate
    }
    return NULL;  // (a real version also scans the final partial block)
}
Hint 4: Benchmark Against glibc
See how you compare to the system implementation:
#include <stdio.h>
#include <string.h>
#include <time.h>
void benchmark(const char* haystack, size_t h_len,
               const char* needle, size_t n_len) {
    clock_t start, end;
    // Store results in a volatile sink so the compiler can't
    // optimize the search calls away
    volatile const char* sink;
    // Test glibc (strstr needs NUL-terminated strings)
    start = clock();
    for (int i = 0; i < 10000; i++) {
        sink = strstr(haystack, needle);
    }
    end = clock();
    printf("glibc strstr: %.3f ms\n",
           (double)(end - start) / CLOCKS_PER_SEC * 1000);
    // Test your implementation
    start = clock();
    for (int i = 0; i < 10000; i++) {
        sink = search_find(haystack, h_len, needle, n_len);
    }
    end = clock();
    printf("Your implementation: %.3f ms\n",
           (double)(end - start) / CLOCKS_PER_SEC * 1000);
    (void)sink;
}
Hint 5: Use Compiler Explorer
See what assembly your SIMD code generates:
# Paste your code at https://godbolt.org/
# Use compiler: gcc -O3 -mavx2
# You'll see the actual vector instructions generated
This helps you verify the compiler is doing what you expect.
Hint 6: Handle the “Tail” Carefully
What happens when the haystack length isn’t divisible by 32?
// Process main body with SIMD
size_t simd_len = (h_len / 32) * 32;
const char* result = search_avx2_body(haystack, simd_len, needle, n_len);
if (result) return result;
// Handle the tail with scalar code, overlapping by n_len - 1 bytes
// so a match straddling the SIMD/scalar boundary isn't missed
size_t tail = (simd_len >= n_len - 1) ? simd_len - (n_len - 1) : 0;
return search_naive(haystack + tail, h_len - tail, needle, n_len);
Hint 7: Study ripgrep’s Source
See how the pros do it:
git clone https://github.com/BurntSushi/ripgrep.git
cd ripgrep
# Look at the memchr crate (ripgrep's SIMD substring primitives)
# and the aho-corasick crate for multi-pattern matching
Pay attention to:
- How they detect CPU features
- How they fall back gracefully
- How they handle UTF-8
Hint 8: Profile with perf
See where your CPU time is really going:
# Record performance data
perf record -g ./fastsearch "pattern" file.txt
# View report
perf report
# Look for cache misses
perf stat -e cache-misses,cache-references ./fastsearch "pattern" file.txt
This shows you what to optimize next.
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| SIMD fundamentals and intrinsics | “Modern X86 Assembly Language Programming” by Daniel Kusswurm | Ch. 5-7: AVX Programming |
| Cache-aware programming | “Computer Systems: A Programmer’s Perspective” by Bryant & O’Hallaron | Ch. 6: The Memory Hierarchy |
| String search algorithms | “Algorithms” by Sedgewick & Wayne | Ch. 5.3: Substring Search |
| Understanding CPU performance | “Computer Systems: A Programmer’s Perspective” by Bryant & O’Hallaron | Ch. 5: Optimizing Program Performance |
| Profiling and optimization | “Systems Performance” by Brendan Gregg | Ch. 6: CPUs |
| UTF-8 and character encoding | “The Linux Programming Interface” by Michael Kerrisk | Ch. 10: Internationalization |
| Advanced SIMD techniques | “Modern Parallel Programming with C++ and Assembly” by Daniel Kusswurm | Ch. 8-10: SIMD String Processing |
| Practical optimization | “Write Great Code, Volume 2” by Randall Hyde | Ch. 7: Thinking Low-Level |
| Memory alignment | “Computer Systems: A Programmer’s Perspective” by Bryant & O’Hallaron | Ch. 3.9: Heterogeneous Data Structures |
| Benchmarking methodology | “Systems Performance” by Brendan Gregg | Ch. 2: Methodology |
Essential Online Resources:
| Topic | Resource |
|---|---|
| Intel intrinsics reference | Intel Intrinsics Guide (software.intel.com/intrinsics) |
| SIMD string search | Andrew Gallant’s blog on ripgrep SIMD (blog.burntsushi.net) |
| SIMD algorithms | “SIMD-friendly algorithms for substring searching” by Wojciech Muła |
| CPU feature detection | GCC documentation on __builtin_cpu_supports |
| UTF-8 specification | “UTF-8 Everywhere” manifesto (utf8everywhere.org) |
| Performance analysis | Brendan Gregg’s performance website (brendangregg.com) |
Project Comparison Table
| Project | Difficulty | Time | Depth of Understanding | Fun Factor | Most Teaches |
|---|---|---|---|---|---|
| Memory Allocator | Advanced | 3-4 weeks | ★★★★★ | ★★★ | Memory, UB, Performance |
| Work-Stealing Thread Pool | Advanced | 2-3 weeks | ★★★★★ | ★★★★ | Threading, Atomics, Cache |
| Mini Async Runtime | Advanced | 2-3 weeks | ★★★★ | ★★★★★ | Async, Platform IO, API |
| Syscall Abstraction | Intermediate | 2 weeks | ★★★ | ★★★ | ABI, Platform, API |
| String Search Library | Advanced | 2-3 weeks | ★★★★ | ★★★★ | SIMD, Performance, Profiling |
Recommended Learning Order
Given this track’s emphasis on job-market value and OSS-heavy hiring signals, here’s the recommended order:
Start with: Mini Async Runtime
- High demand skill (Tokio, libuv, Node.js internals)
- You’ll build something visibly impressive (C10K server)
- Teaches platform differences naturally (epoll vs kqueue)
- Foundation for understanding modern networking stacks
Then: Work-Stealing Thread Pool
- Complements async knowledge (CPU-bound vs IO-bound)
- Threading + atomics is interview gold for systems roles
- Shows you understand parallelism at a deep level
Finally: Memory Allocator
- The “boss fight” of systems programming
- Forces you to synthesize everything: performance, correctness, threading
- Having “wrote a malloc” on your resume/portfolio is a strong signal
Final Capstone Project: Embedded Key-Value Database
What you’ll build: A persistent, thread-safe, embedded key-value store like a simplified RocksDB or LMDB—combining everything from above.
Why it teaches everything: This project is the “final boss” because it requires:
- Memory allocators: Custom allocators for the buffer pool
- Threading: Concurrent readers/writers with proper synchronization
- Async: Optional async API for non-blocking operations
- ABI: Stable on-disk format across versions
- Platform: Memory-mapped files work differently on each OS
- Performance: You’ll profile compaction, cache hit rates, write amplification
- UB avoidance: Memory-mapped IO is a UB minefield
- API design: Making it usable for application developers
Core challenges you’ll face:
- Implementing a log-structured merge tree or B-tree (maps to data structure design)
- Memory-mapping files safely across platforms (maps to platform differences)
- Write-ahead logging for durability (maps to crash consistency)
- Concurrent access without global locks (maps to threading primitives)
- Compaction without blocking readers (maps to async/background work)
- Benchmarking against SQLite/RocksDB (maps to performance tuning)
Key Concepts:
- LSM trees: “Designing Data-Intensive Applications” by Martin Kleppmann - Chapter 3
- Memory-mapped IO: “The Linux Programming Interface” by Kerrisk - Chapter 49
- Crash consistency: “Operating Systems: Three Easy Pieces” - Crash Consistency chapter
- B-tree implementation: “Algorithms” by Sedgewick & Wayne - Chapter 6
- Concurrent data structures: “The Art of Multiprocessor Programming” by Herlihy & Shavit
- Difficulty: Expert
- Time estimate: 1-2 months
- Prerequisites: All previous projects, or equivalent experience
Real world outcome:
- A library: `db_open()`, `db_get()`, `db_put()`, `db_close()`
- A benchmark suite comparing ops/sec against SQLite and RocksDB
- A crash test that kills the process mid-write and verifies data integrity on restart
- Potentially: others using it in their projects (OSS signal)
Learning milestones:
- Basic persistence working - You understand write-ahead logging and fsync semantics
- Concurrent access safe - You’ve implemented MVCC or reader-writer locks correctly
- Performance competitive - You’ve profiled, optimized, and understand why RocksDB makes certain tradeoffs
- Crash-safe verified - You can kill -9 your DB at any point and recover correctly
Real World Outcome
When you complete this project, you’ll have built a production-quality embedded database similar to SQLite, LMDB, or RocksDB. This is the kind of infrastructure that powers applications from mobile apps to distributed systems.
What you’ll actually see:
A working database library with a simple API that real applications can embed:
# Build your database library
$ make
gcc -O3 -Wall -Wextra -o libkvdb.so kvdb.c lsm.c wal.c memtable.c -shared -fPIC
gcc -o kvdb_cli cli.c -L. -lkvdb
# Use the command-line interface
$ ./kvdb_cli mydata.db
kvdb> open mydata.db
Database opened: mydata.db
LSM tree levels: 0
Active memtable: 0 entries
Write-ahead log: 0 bytes
kvdb> put user:1 '{"name":"Alice","age":30}'
OK (wrote 34 bytes)
kvdb> put user:2 '{"name":"Bob","age":25}'
OK (wrote 32 bytes)
kvdb> get user:1
{"name":"Alice","age":30}
kvdb> scan user:
user:1 -> {"name":"Alice","age":30}
user:2 -> {"name":"Bob","age":25}
(2 keys scanned)
kvdb> stats
Database statistics:
Total keys: 2
Memtable entries: 2
L0 SSTables: 0
Total disk usage: 0 bytes
WAL size: 78 bytes
Cache hit rate: 100.0%
kvdb> compact
Starting compaction...
Flushed memtable to L0 SSTable (2 entries, 1.2 KB)
Compaction complete (12.3 ms)
kvdb> quit
Closing database...
Flushing memtable...
Syncing WAL...
Database closed successfully.
Crash recovery in action:
# Write some data
$ ./kvdb_cli test.db
kvdb> put key1 value1
OK
kvdb> put key2 value2
OK
# Simulate crash (kill -9 in another terminal)
$ kill -9 $(pgrep kvdb_cli)
# Restart and verify data is intact
$ ./kvdb_cli test.db
Recovering from WAL... (2 entries replayed)
Database opened: test.db
kvdb> get key1
value1
kvdb> get key2
value2
# Data survived the crash!
Your library’s API:
// kvdb.h - Public API
#include <stddef.h>
#include <stdint.h>
// Database handle
typedef struct kvdb kvdb_t;
// Open or create a database
kvdb_t* kvdb_open(const char* path);
// Write operations
int kvdb_put(kvdb_t* db, const void* key, size_t key_len,
const void* value, size_t value_len);
int kvdb_delete(kvdb_t* db, const void* key, size_t key_len);
// Read operations
int kvdb_get(kvdb_t* db, const void* key, size_t key_len,
void** value, size_t* value_len);
// Iteration
typedef struct kvdb_iterator kvdb_iter_t;
kvdb_iter_t* kvdb_scan(kvdb_t* db, const void* start_key, size_t key_len);
int kvdb_iter_next(kvdb_iter_t* iter);
const char* kvdb_iter_key(kvdb_iter_t* iter); // used by the example below
void kvdb_iter_free(kvdb_iter_t* iter);
// Maintenance
int kvdb_compact(kvdb_t* db);
int kvdb_flush(kvdb_t* db);
// Close database
void kvdb_close(kvdb_t* db);
A real application using your database:
// user_service.c
#include "kvdb.h"
#include <stdio.h>
#include <stdlib.h> // free()
#include <string.h>
int main() {
// Open database
kvdb_t* db = kvdb_open("users.db");
if (!db) {
fprintf(stderr, "Failed to open database\n");
return 1;
}
// Store user data
const char* user_id = "user:alice";
const char* user_data = "{\"email\":\"alice@example.com\",\"verified\":true}";
if (kvdb_put(db, user_id, strlen(user_id),
user_data, strlen(user_data)) != 0) {
fprintf(stderr, "Failed to write user\n");
return 1;
}
// Retrieve user data
void* retrieved_data;
size_t retrieved_len;
if (kvdb_get(db, user_id, strlen(user_id),
&retrieved_data, &retrieved_len) == 0) {
printf("User: %.*s\n", (int)retrieved_len, (char*)retrieved_data);
free(retrieved_data);
}
// Scan all users
kvdb_iter_t* iter = kvdb_scan(db, "user:", 5);
while (kvdb_iter_next(iter) == 0) {
printf("Key: %s\n", kvdb_iter_key(iter));
}
kvdb_iter_free(iter);
kvdb_close(db);
return 0;
}
Benchmark output comparing to established databases:
$ ./bench_kvdb
Running benchmarks (1,000,000 operations each)...
=== Sequential Writes ===
kvdb (your implementation): 147,239 ops/sec
SQLite (default settings): 85,123 ops/sec
RocksDB: 412,847 ops/sec
LMDB: 198,234 ops/sec
=== Random Reads (100% cache hit) ===
kvdb (your implementation): 892,341 ops/sec
SQLite: 324,128 ops/sec
RocksDB: 687,492 ops/sec
LMDB: 1,234,567 ops/sec
=== Random Reads (0% cache hit) ===
kvdb (your implementation): 45,123 ops/sec
SQLite: 38,492 ops/sec
RocksDB: 52,847 ops/sec
LMDB: 89,234 ops/sec
=== Mixed Workload (50% read, 50% write) ===
kvdb (your implementation): 78,492 ops/sec
SQLite: 42,387 ops/sec
RocksDB: 123,847 ops/sec
LMDB: 95,234 ops/sec
Your database is competitive!
What the LSM tree structure looks like:
$ ./kvdb_inspect mydata.db
Database structure:
Write-Ahead Log (WAL):
Size: 2.4 MB
Entries: 12,847
Oldest sequence: 1
Newest sequence: 12847
Memtable (in-memory):
Entries: 1,247
Size: 156 KB
Threshold for flush: 4 MB
LSM Tree on Disk:
Level 0: 3 SSTables (8.2 MB, 41,234 keys)
- sstable_00001.sst: 2.7 MB (13,847 keys)
- sstable_00002.sst: 2.8 MB (14,123 keys)
- sstable_00003.sst: 2.7 MB (13,264 keys)
Level 1: 1 SSTable (23.4 MB, 120,487 keys)
- sstable_00004.sst: 23.4 MB (120,487 keys)
Level 2: 0 SSTables
Total keys: 162,968
Total disk usage: 31.6 MB
Compression ratio: 2.3x
Visualizing how the LSM tree works:
Time 0: Empty database
┌─────────────┐
│ Memtable │ (empty)
└─────────────┘
Time 1: Write 100 keys
┌─────────────┐
│ Memtable │ ← 100 keys in memory
└─────────────┘
Time 2: Memtable full, flush to L0
┌─────────────┐
│ Memtable │ (empty)
└─────────────┘
↓
┌─────────────┐
│ L0 SST #1 │ ← 100 keys on disk
└─────────────┘
Time 3: Multiple flushes
┌─────────────┐
│ Memtable │ ← New writes
└─────────────┘
↓
┌─────────────┬─────────────┬─────────────┐
│ L0 SST #1 │ L0 SST #2 │ L0 SST #3 │ ← Overlapping ranges
└─────────────┴─────────────┴─────────────┘
Time 4: Compaction (merge L0 → L1)
┌─────────────┐
│ Memtable │
└─────────────┘
↓
┌─────────────┐
│ L0 (empty) │
└─────────────┘
↓
┌────────────────────────────┐
│ L1 SST (merged, sorted) │ ← Single sorted file
└────────────────────────────┘
Crash recovery visualization:
Normal operation:
Write → Memtable ──────────→ Eventually flushed to disk
│
└─→ WAL (immediate) ← Persisted with fsync()
After crash:
1. Read WAL from disk
2. Replay all entries into new memtable
3. Continue normal operation
Example:
WAL contains:
[PUT key1 val1]
[PUT key2 val2]
[DELETE key1]
After replay:
Memtable = { key2: val2 }
The concrete artifacts you’ll deliver:
- Core library: `kvdb.c`, `lsm.c`, `memtable.c`, `sstable.c`, `wal.c`, `compaction.c`
- On-disk format: Binary SSTable format with bloom filters and index blocks
- CLI tool: Interactive database shell for testing
- Benchmark suite: Comparing write throughput, read latency, crash recovery time
- Test suite: 200+ tests including crash simulation, concurrent access, corruption detection
- Documentation: Architecture overview, API reference, performance tuning guide
Demonstrating crash safety:
$ ./crash_test.sh
Starting crash safety test...
Test 1: Kill during write
- Writing 1000 keys...
- Killing process at random time... (killed)
- Restarting and verifying... ✓ All committed keys recovered
Test 2: Kill during compaction
- Writing 10000 keys...
- Starting compaction...
- Killing process mid-compaction... (killed)
- Restarting and verifying... ✓ Database consistent
Test 3: Power failure simulation (no fsync)
- Writing 1000 keys with sync disabled...
- Simulating power loss... (killed)
- Restarting... ⚠ Lost 847 keys (expected without fsync)
Test 4: Concurrent writers
- Starting 10 writer threads...
- Writing 100 keys per thread...
- Killing process randomly... (killed)
- Restarting and verifying... ✓ No corruption detected
Crash safety: PASSED (4/4 tests)
You’ll have tangible proof of understanding:
- How databases persist data reliably
- Why write-ahead logging is fundamental to durability
- How LSM trees trade write amplification for write throughput
- How to design concurrent data structures
- How to profile and optimize I/O-bound workloads
The Core Question You’re Answering
“How do databases like SQLite and RocksDB guarantee your data won’t be lost, even if the power goes out mid-write?”
Everyone uses databases, but few understand the deep engineering that makes them reliable. The real questions are:
- What is durability? How do you guarantee that once you say “OK, data saved,” it’s actually on disk?
- Why are LSM trees fast for writes? What’s the tradeoff between B-trees and LSM trees?
- How do you make it thread-safe? Multiple readers, one writer—how do you avoid locks killing performance?
- What happens when the computer crashes? How do you recover without losing data or corrupting the database?
After this project, you’ll understand that databases aren’t magic—they’re careful engineering of data structures, file formats, concurrency primitives, and crash recovery protocols.
Concepts You Must Understand First
Stop and research these before coding:
- Log-Structured Merge (LSM) Trees
- Why append-only data structures are fast for writes
- What is a memtable? (In-memory sorted structure, typically a skip list or red-black tree)
- What is an SSTable? (Sorted String Table—immutable, on-disk sorted key-value file)
- What is compaction and why is it necessary?
- Book Reference: “Designing Data-Intensive Applications” Ch. 3 - Kleppmann
- Resource: “The Log-Structured Merge-Tree (LSM-Tree)” - Patrick O’Neil et al. (original paper)
- Write-Ahead Logging (WAL)
- Why do you write to a log before modifying the main data?
- What is `fsync()` and why does it matter for durability?
  - How do you replay the log after a crash?
- What’s the difference between crash consistency and data consistency?
- Book Reference: “Operating Systems: Three Easy Pieces” Ch. 42: Crash Consistency
- Book Reference: “Database Internals” Ch. 6 - Alex Petrov
- Memory-Mapped I/O
- What is `mmap()` and why is it useful for databases?
  - How does the OS handle page faults for memory-mapped files?
  - What are the dangers of `mmap()` for durability? (delayed writes!)
  - When should you use `mmap()` vs `read()`/`write()`?
  - Book Reference: “The Linux Programming Interface” Ch. 49 - Kerrisk
  - Book Reference: “Database Internals” Ch. 4 - Petrov
- Concurrent Data Structures
- How do you implement a lock-free read path?
- What is MVCC (Multi-Version Concurrency Control)?
- What are reader-writer locks and when should you use them?
- How do you avoid deadlocks in concurrent code?
- Book Reference: “The Art of Multiprocessor Programming” Ch. 9 - Herlihy & Shavit
- Book Reference: “Database Internals” Ch. 5 - Petrov
- B-Trees vs LSM Trees
- Why does SQLite use B-trees? (Read-optimized)
- Why does RocksDB use LSM trees? (Write-optimized)
- What is write amplification?
- What is read amplification?
- Book Reference: “Designing Data-Intensive Applications” Ch. 3 - Kleppmann
- Resource: RocksDB documentation on LSM tree tuning
- Bloom Filters
- How can you quickly test if a key might be in an SSTable?
- What is a false positive rate and how do you tune it?
- How much memory does a bloom filter use?
- Why are bloom filters critical for LSM tree read performance?
- Book Reference: “Algorithms” by Sedgewick & Wayne - Appendix on probabilistic data structures
- Resource: “Bloom Filters by Example” - various online tutorials
- File System Semantics
- What guarantees does `fsync()` provide?
  - What’s the difference between `fsync()` and `fdatasync()`?
  - What is `O_DIRECT` and when would you use it?
  - How do you ensure directory metadata is persisted?
  - Book Reference: “The Linux Programming Interface” Ch. 13 - Kerrisk
  - Book Reference: “Operating Systems: Three Easy Pieces” Ch. 40: File System Implementation
- Binary File Formats
- How do you serialize data structures to disk?
- What is endianness and why does it matter?
- How do you version your file format for future compatibility?
- What’s the tradeoff between compression and read speed?
- Book Reference: “Computer Systems: A Programmer’s Perspective” Ch. 2 - Bryant & O’Hallaron
- Resource: RocksDB SSTable format documentation
Questions to Guide Your Design
Before implementing, think through these:
- Data Structure Choice
- Should your memtable be a skip list, red-black tree, or AVL tree?
- How do you make it thread-safe without locks?
- What’s your threshold for flushing memtable to disk?
- On-Disk Format
- How do you lay out an SSTable on disk?
- Should you store keys and values separately?
- Do you need an index block for fast lookups?
- Should you compress data? Which algorithm?
- Write-Ahead Log
- Should each write be a separate WAL entry or batch them?
- How often do you call `fsync()`? (Every write? Every N writes? Every N milliseconds?)
- When is it safe to truncate the WAL?
- Compaction Strategy
- How many levels in your LSM tree?
- When do you trigger compaction?
- Do you compact in a background thread?
- How do you ensure readers don’t see half-compacted state?
- Crash Recovery
- What’s your recovery procedure on startup?
- How do you detect corrupted WAL entries?
- Should you use checksums on every write?
- How do you test that recovery actually works?
- Concurrency
- How many concurrent readers do you support?
- Can writers run concurrently? (Probably not in v1)
- How do you make iterators safe while background compaction runs?
- Performance vs Durability
- Should you offer async writes (fast but less durable)?
- Do you batch writes to amortize `fsync()` cost?
  - What’s your target: throughput or latency?
Thinking Exercise
Design the Write Path on Paper
Trace what happens when a user calls kvdb_put(db, "user:123", "Alice"):
1. Acquire write lock
2. Append to WAL:
[Sequence #1234] PUT user:123 Alice [Checksum: 0xABCD]
3. Call fsync() on WAL file
4. Insert into memtable (in-memory skip list)
5. Release write lock
6. Return success to user
If memtable is now full (> 4MB):
7. Create new empty memtable
8. Start background thread to flush old memtable:
a. Sort all entries (already sorted in skip list)
b. Write SSTable to disk:
- Data block: [user:123 -> Alice]
- Index block: [user -> offset 0]
- Bloom filter: set bits for "user:123"
- Footer: [index offset, bloom offset, checksum]
c. Call fsync() on SSTable
d. Add SSTable to L0
e. Truncate WAL (safe now that data is in SSTable)
Questions to consider:
- What if the process crashes at step 3? (WAL has entry, memtable doesn’t → replay from WAL)
- What if the process crashes at step 6? (Both WAL and memtable have entry → idempotent replay)
- What if the process crashes during step 8b? (Partial SSTable → ignored on startup, WAL still has data)
- How do concurrent readers see the write? (Memtable is visible immediately after step 4)
Now trace a read: kvdb_get(db, "user:123")
1. Look in memtable (skip list search) → FOUND! Return "Alice"
(If not found, continue to step 2)
2. Check L0 SSTables (newest first):
a. Check bloom filter for "user:123"
b. If bloom says "maybe", binary search index block
c. If found in index, read data block
d. If found in data block, return value
3. Check L1 SSTables (only one, sorted range)
a. Binary search index to find relevant SSTable
b. Check bloom filter
c. Read data block if bloom filter says maybe
4. Return NOT_FOUND
Why this order?
- Memtable first: Most recent writes
- L0 next: Recent flushes (may have updates)
- L1 last: Older, compacted data
The Interview Questions They’ll Ask
Prepare to answer these:
- “How does write-ahead logging guarantee durability?”
- Answer: WAL is synced to disk (`fsync`) before acknowledging the write
  - On crash, replay WAL to reconstruct in-memory state
- “What’s the difference between an LSM tree and a B-tree?”
- Answer: LSM tree is write-optimized (append-only), B-tree is read-optimized (in-place updates)
- Tradeoff: LSM has write amplification from compaction, B-tree has write amplification from rebalancing
- “Why do LSM trees need compaction?”
- Answer: Without compaction, you’d have too many overlapping SSTables
- Reads would require checking many files, killing performance
- “How would you make this database crash-safe?”
- Answer: WAL + `fsync()` before ACK, checksums on all entries, replay on recovery
  - Must ensure directory metadata is persisted too (`fsync()` on the directory; see the sketch after this list)
- “What’s write amplification and why does it matter?”
- Answer: Writing 1 byte might cause many more bytes to be written (compaction, rewriting SSTables)
- Matters for SSD lifespan and write throughput
- “How do you handle concurrent reads and writes?”
- Answer: MVCC (copy-on-write for memtable), immutable SSTables (safe for concurrent reads)
- Or reader-writer locks (simpler but less scalable)
- “What’s a bloom filter and why is it important for LSM trees?”
- Answer: Probabilistic data structure to test set membership
- Avoids expensive disk reads when key definitely not in SSTable
- “How would you debug a crash that corrupts the database?”
- Answer: Check WAL checksums, verify SSTable integrity, replay WAL in test environment
- Use tools like `valgrind` to catch memory bugs and `fsync` tracing to verify durability
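For the directory-metadata point in question 4, a minimal sketch of the pattern: sync the file, then sync its parent directory so the new entry itself survives a crash.

```c
#include <fcntl.h>
#include <unistd.h>

/* Make both a new file's contents and its directory entry durable. */
int sync_new_file(const char* dir_path, int file_fd) {
    if (fsync(file_fd) != 0) return -1;   /* file contents on disk */
    int dir_fd = open(dir_path, O_RDONLY);
    if (dir_fd < 0) return -1;
    int rc = fsync(dir_fd);               /* directory entry on disk */
    close(dir_fd);
    return rc;
}
```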
Hints in Layers
Hint 1: Start with an In-Memory-Only Version
Get the API and basic data structures working without persistence:
// Simple in-memory version first: an unsorted array of entries
typedef struct {
    char* key;
    char* value;
    size_t key_len;
    size_t value_len;
} kv_entry_t;
struct kvdb {            // definition backing the opaque kvdb_t
    kv_entry_t* entries;
    size_t count;
    size_t capacity;
};
kvdb_t* kvdb_open(const char* path) {
    (void)path;          // no persistence yet
    kvdb_t* db = malloc(sizeof(kvdb_t));
    db->capacity = 1024;
    db->entries = malloc(sizeof(kv_entry_t) * db->capacity);
    db->count = 0;
    return db;
}
int kvdb_put(kvdb_t* db, const void* key, size_t key_len,
             const void* value, size_t value_len) {
    // Just store in the array for now (growth and overwrite come later)
    kv_entry_t* entry = &db->entries[db->count++];
    entry->key = malloc(key_len);
    memcpy(entry->key, key, key_len);
    entry->key_len = key_len;
    entry->value = malloc(value_len);
    memcpy(entry->value, value, value_len);
    entry->value_len = value_len;
    return 0;
}
Once this works, add persistence layer by layer.
Hint 2: Implement WAL First
The simplest persistent structure is an append-only log:
// WAL entry format:
// [4 bytes: key_len][4 bytes: value_len][key][value][4 bytes: CRC32]
int wal_append(int fd, const void* key, size_t key_len,
               const void* value, size_t value_len) {
    uint32_t k_len = (uint32_t)key_len;
    uint32_t v_len = (uint32_t)value_len;
    // (Check write() return values in real code)
    write(fd, &k_len, sizeof(k_len));
    write(fd, &v_len, sizeof(v_len));
    write(fd, key, key_len);
    write(fd, value, value_len);
    // crc32() is your own helper covering both buffers
    uint32_t crc = crc32(key, key_len, value, value_len);
    write(fd, &crc, sizeof(crc));
    // CRITICAL: fsync before returning!
    fsync(fd);
    return 0;
}
Test that you can replay the WAL after a crash.
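A minimal replay sketch for the entry format above. It treats the first short read or bad checksum as a torn tail and stops there; `crc32()` is the same helper as in `wal_append()`, and `memtable_put()` stands in for your own insertion routine:

```c
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

extern uint32_t crc32(const void* k, size_t klen, const void* v, size_t vlen);
extern void memtable_put(const void* k, size_t klen,
                         const void* v, size_t vlen);

int wal_replay(int fd) {
    for (;;) {
        uint32_t k_len, v_len, crc;
        if (read(fd, &k_len, sizeof(k_len)) != sizeof(k_len)) break;
        if (read(fd, &v_len, sizeof(v_len)) != sizeof(v_len)) break;
        char* key   = malloc(k_len);
        char* value = malloc(v_len);
        if (!key || !value ||
            read(fd, key, k_len)        != (ssize_t)k_len ||
            read(fd, value, v_len)      != (ssize_t)v_len ||
            read(fd, &crc, sizeof(crc)) != sizeof(crc)    ||
            crc != crc32(key, k_len, value, v_len)) {
            free(key); free(value);
            break;                 /* torn tail: ignore the rest */
        }
        memtable_put(key, k_len, value, v_len);
        free(key); free(value);
    }
    return 0;
}
```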
Hint 3: Use a Skip List for Memtable
Skip lists are simpler than red-black trees and perform similarly:
#define MAX_LEVEL 16
typedef struct skip_node {
    char* key;
    char* value;
    struct skip_node* next[MAX_LEVEL];
} skip_node_t;
// Insert is probabilistic (coin flip determines level)
void skiplist_insert(skiplist_t* list, const char* key, const char* value) {
int level = random_level(); // 1-16
skip_node_t* node = malloc(sizeof(skip_node_t));
node->key = strdup(key);
node->value = strdup(value);
// Link into skip list at chosen level
// ... (standard skip list insertion)
}
Reference: “Algorithms” by Sedgewick & Wayne has skip list pseudocode.
Hint 4: SSTable Format Should Be Simple
Don’t over-engineer the on-disk format initially:
SSTable file structure:
┌──────────────────────────────────────┐
│ Header: │
│ magic: "KVDB" (4 bytes) │
│ version: 1 (4 bytes) │
│ num_entries: N (8 bytes) │
├──────────────────────────────────────┤
│ Data Block: │
│ [key_len][value_len][key][value] │
│ [key_len][value_len][key][value] │
│ ... (repeated N times) │
├──────────────────────────────────────┤
│ Index Block: (optional v2 feature) │
│ [first_key -> offset] │
│ [first_key -> offset] │
├──────────────────────────────────────┤
│ Footer: │
│ index_offset: (8 bytes) │
│ checksum: CRC32 (4 bytes) │
└──────────────────────────────────────┘
Start without index/bloom filter, add them when reads are slow.
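When you do add the bloom filter, it can be this small. A minimal sketch using `BLOOM_K` seeded FNV-1a probes into a fixed bit array; the sizes and hash choice are illustrative, not a requirement:

```c
#include <stddef.h>
#include <stdint.h>

#define BLOOM_BITS (1u << 16)   /* 64 Kbit = 8 KB per SSTable */
#define BLOOM_K    4            /* probes per key */

typedef struct { uint8_t bits[BLOOM_BITS / 8]; } bloom_t;

/* Seeded 64-bit FNV-1a, one variant per probe. */
static uint64_t fnv1a(const void* data, size_t len, uint64_t seed) {
    const uint8_t* p = data;
    uint64_t h = 14695981039346656037ULL ^ seed;
    for (size_t i = 0; i < len; i++) { h ^= p[i]; h *= 1099511628211ULL; }
    return h;
}

void bloom_add(bloom_t* b, const void* key, size_t len) {
    for (uint64_t i = 0; i < BLOOM_K; i++) {
        uint64_t bit = fnv1a(key, len, i) % BLOOM_BITS;
        b->bits[bit / 8] |= (uint8_t)(1u << (bit % 8));
    }
}

/* Returns 0 only when the key is definitely not in the SSTable. */
int bloom_maybe_contains(const bloom_t* b, const void* key, size_t len) {
    for (uint64_t i = 0; i < BLOOM_K; i++) {
        uint64_t bit = fnv1a(key, len, i) % BLOOM_BITS;
        if (!(b->bits[bit / 8] & (1u << (bit % 8)))) return 0;
    }
    return 1;
}
```

Serialize the bit array into the footer region of the SSTable format sketched above, and consult it before touching the index block.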
Hint 5: Test Crash Recovery Early
Write a script that kills your process randomly:
#!/bin/bash
# crash_test.sh
for i in {1..100}; do
# Start database in background
./kvdb_cli test.db < test_commands.txt &
PID=$!
# Sleep random time (0-2 seconds)
sleep $((RANDOM % 2)).$((RANDOM % 1000))
# Kill it
kill -9 $PID
# Try to recover
./kvdb_cli test.db < verify_commands.txt
if [ $? -ne 0 ]; then
echo "Recovery failed on iteration $i"
exit 1
fi
done
echo "All 100 crash tests passed!"
This surfaces bugs in your WAL replay logic.
Hint 6: Use mmap() Carefully
Memory-mapped I/O is convenient but has durability pitfalls:
// Mapping an SSTable for reads is great:
void* data = mmap(NULL, file_size, PROT_READ, MAP_PRIVATE, fd, 0);
// Now you can read data as a byte array
// But for writes, be careful:
void* data = mmap(NULL, file_size, PROT_WRITE, MAP_SHARED, fd, 0);
// Writes might not hit disk until you call:
msync(data, file_size, MS_SYNC); // Like fsync for mmap
Start without mmap, add it as an optimization later.
Hint 7: Study Existing Implementations
Read the source code of simple key-value stores:
# LevelDB (C++, but readable)
git clone https://github.com/google/leveldb.git
# Look at:
# db/write_batch.cc (WAL writing)
# table/table_builder.cc (SSTable writing)
# db/db_impl.cc (main database logic)
# LMDB (C, very clean)
git clone https://github.com/LMDB/lmdb.git
# Look at:
# libraries/liblmdb/mdb.c (entire database in one file!)
Don’t copy code, but understand their design decisions.
Hint 8: Benchmark Against SQLite
SQLite is your baseline—can you match it?
#include <sqlite3.h>
#include <stdio.h>
#include <time.h>
// Benchmark SQLite
clock_t start = clock();
sqlite3* db;
sqlite3_open("test.db", &db);
sqlite3_exec(db, "CREATE TABLE IF NOT EXISTS kv (k TEXT, v TEXT)",
             NULL, NULL, NULL);
// Tip: wrap the loop in BEGIN/COMMIT, or SQLite syncs every statement
for (int i = 0; i < 100000; i++) {
    char sql[256];
    sprintf(sql, "INSERT INTO kv VALUES ('%d', 'value%d')", i, i);
    sqlite3_exec(db, sql, NULL, NULL, NULL);
}
clock_t end = clock();
printf("SQLite: %.2f seconds\n", (double)(end - start) / CLOCKS_PER_SEC);
// Now benchmark yours the same way
If you’re within 2-3x of SQLite, you’re doing well!
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| LSM trees and database design | “Designing Data-Intensive Applications” by Martin Kleppmann | Ch. 3: Storage and Retrieval |
| Write-ahead logging | “Operating Systems: Three Easy Pieces” by Arpaci-Dusseau | Ch. 42: Crash Consistency |
| Memory-mapped I/O | “The Linux Programming Interface” by Michael Kerrisk | Ch. 49: Memory Mappings |
| File system semantics | “The Linux Programming Interface” by Michael Kerrisk | Ch. 13: File I/O Buffering |
| Concurrent data structures | “The Art of Multiprocessor Programming” by Herlihy & Shavit | Ch. 9: Linked Lists |
| Database internals | “Database Internals” by Alex Petrov | Ch. 3-6: File Formats, Indexing, Transaction Processing |
| B-trees and LSM trees comparison | “Database Internals” by Alex Petrov | Ch. 2: B-Tree Basics; Ch. 7: Log-Structured Storage |
| Skip lists | “Algorithms” by Sedgewick & Wayne | Section 3.5: Applications (skip lists) |
| Bloom filters | “Algorithms” by Sedgewick & Wayne | Appendix: Probabilistic Data Structures |
| Binary file formats | “Computer Systems: A Programmer’s Perspective” by Bryant & O’Hallaron | Ch. 2: Representing Information |
| fsync and durability | “Database Reliability Engineering” by Campbell & Majors | Ch. 4: Durability |
| Crash consistency | “Operating Systems: Three Easy Pieces” by Arpaci-Dusseau | Ch. 41-42: File System Implementation, Crash Consistency |
Essential Papers and Resources:
| Topic | Resource |
|---|---|
| LSM tree original paper | “The Log-Structured Merge-Tree” by O’Neil et al. (1996) |
| RocksDB documentation | RocksDB Wiki (github.com/facebook/rocksdb/wiki) |
| LMDB design | “LMDB: The Symas Lightning Memory-Mapped Database” whitepaper |
| Write-ahead logging | “Write-Ahead Logging” - PostgreSQL documentation |
| File system guarantees | “All File Systems Are Not Created Equal” by Pillai et al. |
| SSTables in Bigtable | “Bigtable: A Distributed Storage System” by Chang et al. (Google paper) |
| Compaction strategies | “Dostoevsky: Better Space-Time Trade-Offs for LSM-Tree Based Key-Value Stores” |
Summary
This track will transform you from someone who uses systems libraries to someone who builds them. The projects above give you real artifacts to show employers and the deep understanding to discuss implementation tradeoffs in interviews.
Projects in this track align with real-world tools:
- Memory Allocator → jemalloc, tcmalloc, mimalloc
- Thread Pool → Rayon, Go runtime, Java ForkJoinPool
- Async Runtime → Tokio, libuv, io_uring
- Syscall Abstraction → libuv, Rust std, Go runtime
- String Search → ripgrep, hyperscan, stringzilla
- Key-Value Store → RocksDB, LMDB, LevelDB