PHASE 2 TRACK B SYSTEMS LIBRARIES PROJECTS
Systems Libraries & Runtimes — Project-Based Learning Path
Phase 2 — Advanced Systems Track B
This track is about building the infrastructure that other software depends on. You’ll learn to write code that’s fast, correct across platforms, and doesn’t invoke undefined behavior.
Core Concept Analysis
The topics in this track break down into these fundamental building blocks:
| Concept | What You Must Understand |
|---|---|
| Memory Allocators | Free lists, fragmentation, arena vs general-purpose, metadata overhead |
| Threading Primitives | Mutexes, atomics, memory ordering, lock-free algorithms |
| Async Runtimes | Event loops, futures/promises, IO multiplexing (epoll/kqueue), schedulers |
| ABI Details | Calling conventions, struct layout, symbol visibility, name mangling |
| Platform Differences | POSIX vs Windows syscalls, endianness, feature detection |
| Performance Tuning | Cache lines, branch prediction, SIMD, profiling |
| Undefined Behavior | Strict aliasing, alignment, integer overflow, pointer provenance |
| API Design | Ergonomics vs zero-cost, error handling, versioning |
Project 1: Custom Memory Allocator
- File: PHASE_2_TRACK_B_SYSTEMS_LIBRARIES_PROJECTS.md
- Main Programming Language: C
- Alternative Programming Languages: C++, Rust, Zig
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 1. The “Resume Gold” (Educational/Personal Brand)
- Difficulty: Level 3: Advanced (The Engineer)
- Knowledge Area: Memory Management, Systems Programming
- Software or Tool: glibc, jemalloc, valgrind
- Main Book: “C Interfaces and Implementations” - David Hanson
The HFT angle: a specialized allocator (arena/pool) eliminates malloc from your hot path entirely - malloc is a latency killer, and HFT systems never call it in the critical path. Building the specialized variant first teaches you why general-purpose allocators are slow and how to design allocation strategies for specific use cases. Its core challenges:
- Pool allocator design: Fixed-size blocks with O(1) alloc/free (teaches memory management fundamentals)
- Arena allocator: Bump allocation with bulk free (teaches allocation patterns)
- Thread-local pools: Avoiding false sharing in multi-threaded allocators (teaches TLS, cache lines)
What you’ll build: A general-purpose memory allocator (malloc/free/realloc) that can replace the system allocator in real programs.
Why it teaches memory allocators: You cannot fake your way through this. You’ll face fragmentation, understand why jemalloc uses size classes, learn why metadata placement matters, and discover that “fast” and “memory-efficient” are often at odds.
Core challenges you’ll face:
- Designing free list structures (maps to fragmentation strategies)
- Handling alignment requirements for different types (maps to ABI/alignment)
- Implementing coalescing without destroying performance (maps to performance tuning)
- Making it thread-safe without killing scalability (maps to threading primitives)
- Beating glibc malloc in at least one benchmark (maps to real-world validation)
Key Concepts:
- Free list management: “C Interfaces and Implementations” by David Hanson - Chapter 5 (Arena) and Chapter 6 (Mem)
- Size classes and binning: jemalloc design doc - Jason Evans’ “A Scalable Concurrent malloc Implementation”
- Thread-local caching: “Hoard: A Scalable Memory Allocator” - Emery Berger paper
- Fragmentation analysis: “Computer Systems: A Programmer’s Perspective” by Bryant & O’Hallaron - Chapter 9
Difficulty: Advanced Time estimate: 3-4 weeks Prerequisites: Solid C, understanding of virtual memory basics
Real world outcome:
- Your allocator will be LD_PRELOAD-able into real programs
- You'll have benchmarks showing throughput (allocations/sec) and memory efficiency vs glibc/jemalloc
- You can literally run LD_PRELOAD=./myalloc.so ls and see it work
Learning milestones:
- Basic allocator working - You understand why malloc needs metadata and how free lists work
- Thread-safe version - You grasp why naive locking destroys performance and why per-thread caches exist
- Competitive benchmarks - You’ve internalized the tradeoffs between speed, fragmentation, and memory overhead
Real World Outcome
When you complete this project, you won’t just have “an allocator”—you’ll have a production-ready shared library that can literally replace the system allocator in any program. Here’s exactly what you’ll see:
The Moment of Truth: LD_PRELOAD
# First, compile your allocator as a shared library
$ gcc -shared -fPIC -o libmyalloc.so myalloc.c -lpthread
# Test it with a simple program
$ LD_PRELOAD=./libmyalloc.so ls
Desktop Documents Downloads Music Pictures Videos
# It works! Your allocator just handled all of ls's memory allocations
What just happened? The dynamic linker loaded your malloc, free, calloc, and realloc implementations before the system’s libc. Every allocation in ls went through YOUR code.
Seeing Your Allocator in Action
Add instrumentation to see what’s happening:
$ LD_PRELOAD=./libmyalloc.so ls -la
[MyAlloc] malloc(1024) -> 0x7f8a2c001000
[MyAlloc] malloc(64) -> 0x7f8a2c001400
[MyAlloc] calloc(256, 4) -> 0x7f8a2c001440
[MyAlloc] realloc(0x7f8a2c001000, 2048) -> 0x7f8a2c001800
[MyAlloc] free(0x7f8a2c001400)
[MyAlloc] Total allocated: 3,328 bytes
[MyAlloc] Peak usage: 3,328 bytes
[MyAlloc] Total frees: 1
total 48
drwxr-xr-x 2 user user 4096 Dec 27 10:30 .
drwxr-xr-x 18 user user 4096 Dec 27 10:15 ..
-rw-r--r-- 1 user user 220 Dec 27 10:15 file.txt
You’re literally watching every memory operation in a real program!
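A practical wrinkle before you add instrumentation of your own: inside your malloc you cannot call printf, because printf itself allocates and would recurse straight back into you. A minimal sketch of re-entrancy-safe logging, formatting into a stack buffer and emitting it with the raw write syscall (snprintf normally does not allocate, though that is an assumption worth verifying on your libc):
#include <stdio.h>   // snprintf: formats into our stack buffer, no heap use
#include <unistd.h>  // write
// Log without touching the heap; printf could call malloc and recurse into us.
static void alloc_log(const char* op, size_t size, void* ptr) {
    char buf[128];
    int n = snprintf(buf, sizeof(buf), "[MyAlloc] %s(%zu) -> %p\n", op, size, ptr);
    if (n > 0) write(2, buf, (size_t)n);  // fd 2 = stderr
}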
Benchmark Output: The Numbers That Matter
Your allocator will come with benchmarks comparing against glibc malloc and jemalloc:
$ ./bench_allocator
=== Memory Allocator Benchmark ===
Test: Sequential allocation (1000 allocations of 64 bytes)
glibc malloc: 2.3 ms (434,782 allocs/sec)
jemalloc: 1.1 ms (909,090 allocs/sec)
Your allocator: 0.8 ms (1,250,000 allocs/sec) [WINNER]
Test: Random sizes (10,000 allocations, 16-4096 bytes)
glibc malloc: 145 ms (68,965 allocs/sec)
jemalloc: 89 ms (112,359 allocs/sec) [WINNER]
Your allocator: 102 ms (98,039 allocs/sec)
Test: Multithreaded (4 threads, 10,000 allocs each)
glibc malloc: 8,234 ms (4,859 allocs/sec)
jemalloc: 1,456 ms (27,472 allocs/sec) [WINNER]
Your allocator: 1,892 ms (21,141 allocs/sec)
=== Fragmentation Analysis ===
After 100,000 random alloc/free operations:
Bytes Allocated Bytes Requested Overhead
glibc malloc: 2,048,576 1,897,234 7.9%
jemalloc: 1,986,432 1,897,234 4.7% [WINNER]
Your allocator: 2,012,160 1,897,234 6.1%
=== Memory Usage Stats ===
Your allocator used 16 arenas (threads)
Peak memory from OS: 8,388,608 bytes (8 MB)
Returned to OS: 6,291,456 bytes
Currently held: 2,097,152 bytes
What this tells you:
- You beat glibc in single-threaded small allocations (your size-class bins are working!)
- jemalloc still wins in multithreaded workloads (their arena design is battle-tested)
- Your fragmentation is competitive (your coalescing strategy works)
Running Real Programs with Your Allocator
# Run Python with your allocator
$ LD_PRELOAD=./libmyalloc.so python3 -c "print('Hello from my allocator!')"
Hello from my allocator!
[MyAlloc] Stats: 1,247 allocations, 892 frees, 84 KB peak usage
# Run a web server
$ LD_PRELOAD=./libmyalloc.so python3 -m http.server 8000
[MyAlloc] Serving HTTP on 0.0.0.0:8000
[MyAlloc] Request handled: 47 allocations, 512 bytes avg
# Run valgrind to verify no leaks
# (caveat: memcheck normally intercepts malloc itself; add
# --soname-synonyms=somalloc=libmyalloc.so so it checks YOUR allocator)
$ valgrind --leak-check=full env LD_PRELOAD=./libmyalloc.so ls
==12345== HEAP SUMMARY:
==12345== in use at exit: 0 bytes in 0 blocks
==12345== total heap usage: 1,247 allocs, 1,247 frees, 84,192 bytes allocated
==12345==
==12345== All heap blocks were freed -- no leaks are possible
Debugging Output: Seeing Fragmentation
$ LD_PRELOAD=./libmyalloc.so ./fragmentation_test
=== Heap State After 1000 Random Operations ===
Size Class: 16 bytes
Free list: 23 blocks
Allocated: 145 blocks
Fragmentation: 12.3%
Size Class: 32 bytes
Free list: 47 blocks
Allocated: 289 blocks
Fragmentation: 8.7%
Size Class: 64 bytes
Free list: 12 blocks
Allocated: 512 blocks
Fragmentation: 4.2%
Large allocations (>4096 bytes):
Arena 0: 4 blocks, 24,576 bytes total
Arena 1: 2 blocks, 16,384 bytes total
Total memory from OS: 1,048,576 bytes (1 MB)
Total in use: 897,234 bytes
Wasted (fragmentation): 47,892 bytes (5.3%)
Visual representation your code can print:
Arena 0 Memory Layout:
[Used:64][Free:64][Used:128][Free:32][Used:256][Free:128]...
^
Coalescing opportunity!
The Interview Demo
When you show this in an interview or on GitHub:
# Clone your repo
$ git clone https://github.com/yourname/myalloc.git
$ cd myalloc
# Build
$ make
# Run the demo
$ ./demo
=== Custom Memory Allocator Demo ===
1. Testing basic allocation...
malloc(100): ✓ returned 0x7f8a2c001000
free(): ✓ freed successfully
2. Testing thread-local caching...
Thread 1: allocated 1000 blocks (0.234 ms)
Thread 2: allocated 1000 blocks (0.189 ms)
Thread 3: allocated 1000 blocks (0.201 ms)
Thread 4: allocated 1000 blocks (0.198 ms)
✓ No cache contention detected!
3. Testing coalescing...
Before: 10 free blocks in size class 64
After coalesce: 2 free blocks in size class 256
✓ Fragmentation reduced by 80%!
4. Stress test with real program...
Running: LD_PRELOAD=./libmyalloc.so find /usr -name "*.so"
✓ Completed successfully
Stats: 47,892 allocs, 47,892 frees, 0 leaks
ALL TESTS PASSED! Your allocator is production-ready.
This is what you’ll actually build—not a toy, but a real allocator you can use, benchmark, and understand at the deepest level.
The Core Question You’re Answering
“Why is malloc slow, and what does it actually take to make a fast, correct memory allocator?”
This is the question that separates developers who use memory from those who understand memory. Most programmers treat malloc as magic—a function that just “gives you memory.” But ask them:
- Why does malloc sometimes take 10ns and sometimes take 10µs?
- Why does allocating 16 bytes waste 8 more bytes on metadata?
- Why does freeing memory not always return it to the OS?
- Why can’t you just have a global lock and call it a day in multi-threaded programs?
- Why do jemalloc and tcmalloc exist when glibc already has malloc?
If you can’t answer these, you don’t understand memory allocation. This project forces you to answer all of them by building the thing.
The deeper question: Memory allocators sit at the intersection of three hard problems:
- Speed (making allocation O(1) without searching)
- Memory efficiency (minimizing fragmentation and overhead)
- Thread safety (scaling to many cores without lock contention)
You cannot optimize all three simultaneously. This project teaches you the tradeoffs by making you choose.
Why this matters:
- HFT firms ask: “How would you design a malloc for sub-microsecond latency?”
- Game engine teams ask: “Why does our frame rate tank after 10 minutes?” (memory fragmentation)
- Systems programmers ask: “Why does our server use 10GB of RAM when only 2GB is in use?” (allocator not releasing memory to OS)
After this project, you’ll answer these from experience, not theory.
Concepts You Must Understand First
Before you write a single line of code, ensure you deeply understand these concepts. If any of these is fuzzy, stop and research it—your allocator will fail in confusing ways otherwise.
1. Virtual Memory and Address Spaces
What you need to know:
- What is a virtual address vs a physical address?
- How does sbrk() or mmap() get memory from the OS?
- What happens when you request 100 bytes but the OS gives you 4096 bytes (a page)?
- Why can two processes have the same virtual address but different physical addresses?
Questions to answer:
- If you call malloc(1), how much memory does the OS actually give you?
- What's the difference between the heap (via sbrk) and memory-mapped regions (via mmap)?
- Why does Linux give you more address space than you have RAM?
Book references:
- “Computer Systems: A Programmer’s Perspective” by Bryant & O’Hallaron — Chapter 9: Virtual Memory
- “The Linux Programming Interface” by Michael Kerrisk — Chapter 7: Memory Allocation
- “Operating Systems: Three Easy Pieces” by Arpaci-Dusseau — Chapters 13-15: Address Spaces, Memory API, Address Translation
2. Metadata: The Hidden Cost of malloc
What you need to know:
- Every malloc call needs bookkeeping: How big is this block? Is it in use?
- Where do you store this metadata? (Before the returned pointer? In a separate table?)
- How do you find the metadata when free(ptr) is called with just a pointer?
Questions to answer:
- If a user requests 16 bytes, how many bytes does your allocator actually need?
- What happens if the user writes past the end of their allocation and corrupts your metadata?
- Why is alignment important? (Hint: some CPUs crash on unaligned access)
Book references:
- “C Interfaces and Implementations” by David Hanson — Chapter 5: Arena and Chapter 6: Mem
- “The C Programming Language” by Kernighan & Ritchie — Chapter 8.7: Example—A Storage Allocator
- “Understanding and Using C Pointers” by Richard Reese — Chapter 2: Dynamic Memory Management
3. Free Lists: The Core Data Structure
What you need to know:
- A free list is a linked list of available memory blocks
- When you malloc, you search the free list for a big-enough block
- When you free, you add the block back to the free list
Questions to answer:
- Should the free list be sorted by address or insertion order? (Affects coalescing and fragmentation)
- How do you store the “next” pointer in a free block? (The block is free, so you can reuse its memory!)
- What’s the difference between first-fit, best-fit, and worst-fit strategies?
Visual model to internalize:
Free List (sorted by address):
NULL <- [Block@0x1000, size=64] <- [Block@0x1400, size=128] <- [Block@0x2000, size=256] <- Head
When malloc(100) is called:
1. Search list for block >= 100 bytes
2. Find block at 0x1400 (size=128)
3. Split it: Use 100 bytes, return 28 bytes to free list
4. Return pointer 0x1400 to user
Book references:
- “Computer Systems: A Programmer’s Perspective” by Bryant & O’Hallaron — Chapter 9.9: Dynamic Memory Allocation
- “Algorithms in C” by Robert Sedgewick — Chapter on Linked Lists
- CS 341 Malloc Tutorial
4. Fragmentation: The Silent Killer
What you need to know:
- Internal fragmentation: Wasted space inside an allocated block (e.g., user asks for 17 bytes, you give them 32 due to alignment)
- External fragmentation: Free memory exists but is scattered in small chunks, so large allocations fail
Questions to answer:
- You have 1000 bytes free split into 10 blocks of 100 bytes each. Can you allocate 500 bytes? (No! External fragmentation)
- How do you measure fragmentation? (Peak memory used / actual bytes requested)
- What is coalescing, and why is it essential?
Visual example:
Before fragmentation:
[Free: 1000 bytes]
After 10 allocations of 50 bytes each, then freeing blocks 2, 4, 5, 7, and 9:
[Used:50][Free:50][Used:50][Free:50][Free:50][Used:50][Free:50][Used:50][Free:50][Used:50]
You have 250 bytes free, but can't allocate 100 bytes!
After coalescing adjacent free blocks (4 and 5 merge):
[Used:50][Free:50][Used:50][Free:100][Used:50][Free:50][Used:50][Free:50][Used:50]
Now a 100-byte allocation fits.
Book references:
- CS360 Fragmentation Lecture
- “Computer Systems: A Programmer’s Perspective” by Bryant & O’Hallaron — Chapter 9.9.12: Fragmentation
- Embedded Artistry: Implementing Malloc
5. Thread Safety and Lock-Free Data Structures
What you need to know:
- A global lock on malloc kills performance in multi-threaded programs
- Per-thread caches (thread-local storage) avoid contention
- Atomic operations (__sync_fetch_and_add, C11 _Atomic) enable lock-free algorithms
Questions to answer:
- What is false sharing, and why does it destroy performance?
- How does jemalloc use per-thread arenas to avoid locking?
- What’s the difference between a mutex and an atomic compare-and-swap (CAS)?
Book references:
- “Rust Atomics and Locks” by Mara Bos — Chapters 1-3 (best explanation of memory ordering, applies to C too)
- “Computer Systems: A Programmer’s Perspective” by Bryant & O’Hallaron — Chapter 6: The Memory Hierarchy (cache lines)
- “The Art of Multiprocessor Programming” by Herlihy & Shavit — Lock-free data structures
- “Hoard: A Scalable Memory Allocator” (paper by Emery Berger)
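To see what the fix for false sharing looks like, here is a minimal sketch of cache-line padding for per-thread allocator state (C11; the 64-byte line size and the 64-slot bound are assumptions for illustration):
#include <stddef.h>
// Per-thread allocator state, one cache line each. _Alignas(64) forces the
// struct's alignment (and hence its sizeof) up to 64 bytes, so adjacent array
// slots can never share a cache line. 64 bytes is the common line size on
// x86/ARM cores; query the real value for production code.
typedef struct {
    _Alignas(64) void* free_list; // this thread's private free list
    size_t alloc_count;           // struct pads out to 64 bytes automatically
} padded_slot_t;
static padded_slot_t slots[64];   // hypothetical max-thread bound for the sketch
_Static_assert(sizeof(padded_slot_t) == 64, "slot must fill exactly one line");
Without the _Alignas, two neighboring threads' counters would sit in one line and every increment by one thread would invalidate the other's cache.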
6. Size Classes: Why jemalloc is Fast
What you need to know:
- Instead of a single free list, have multiple lists for common sizes (16, 32, 64, 128 bytes, etc.)
- Small allocations go to the appropriate size class (fast, no searching)
- Large allocations use mmap directly (avoids fragmentation of small allocations)
Questions to answer:
- Why are size classes powers of 2? (alignment and cache line efficiency)
- What’s the tradeoff? (More internal fragmentation: 17-byte request gets 32-byte block)
- How do you decide the threshold between “small” and “large”?
Book references:
- jemalloc Background
- “A Scalable Concurrent malloc Implementation” — Jason Evans (jemalloc author’s paper)
- Exploring Different Memory Allocators
7. Alignment Requirements
What you need to know:
- Most systems require pointers to be aligned (e.g., 8-byte boundary on x64)
- malloc must return addresses divisible by the platform's alignment (usually 8 or 16)
- Misaligned access can crash on ARM, or just be slow on x86
Questions to answer:
- If a user requests 5 bytes, can you return a pointer at address 0x1003? (No! Not aligned)
- How do you round up sizes to the next alignment boundary?
- Why does malloc(1) on Linux actually return a 16-byte-aligned block?
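The round-up question has a classic branch-free answer. A minimal sketch, valid whenever the alignment is a power of two:
// Round size up to the next multiple of align (align must be a power of two).
// Example: round_up(5, 8) == 8, round_up(8, 8) == 8, round_up(17, 16) == 32.
static inline size_t round_up(size_t size, size_t align) {
    return (size + align - 1) & ~(align - 1);
}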
Book references:
- “Computer Systems: A Programmer’s Perspective” by Bryant & O’Hallaron — Chapter 3: Machine-Level Representation (alignment)
- “Effective C” by Robert Seacord — Chapter 2: Objects, Functions, and Types
Questions to Guide Your Design
Before implementing, think through these design decisions. Your answers will shape your allocator’s performance characteristics.
Design Question 1: How Will You Get Memory from the OS?
Options:
- sbrk()/brk() — Traditional Unix way, extends the heap (deprecated on some systems)
- mmap()/munmap() — Modern approach, memory-mapped anonymous pages
- Hybrid: Use sbrk for small allocations, mmap for large
Think through:
- sbrk is contiguous but can't return memory easily (must free from the top)
- mmap is flexible (can free individual chunks) but has higher overhead
- What's your threshold for "large"? (jemalloc uses 4KB)
Guiding questions:
- What if the user allocates 1GB, frees it, then allocates 1GB again? Should you return memory to the OS?
- How do you handle sbrk failure when the system is out of memory?
Design Question 2: How Will You Store Metadata?
Options:
- Boundary tags (header + footer): Store size before and after the block (enables coalescing in both directions)
- Header only: Store size and in-use flag before the returned pointer (simpler, less overhead)
- Separate metadata table: Store all metadata in a different region (isolates corruption but slower lookup)
Think through:
Boundary tag layout:
[Header: size=128, used=1] [User data: 120 bytes] [Footer: size=128]
^
Pointer returned to user
What if the user overflows and corrupts the footer?
Guiding questions:
- How do you find the previous block when freeing? (You need footer or separate list)
- How much overhead is acceptable? (8 bytes per block? 16 bytes?)
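Here is a minimal sketch of the boundary-tag option, assuming sizes are multiples of 8 so the low bit of the size word can double as the in-use flag (all names illustrative):
#include <stddef.h>
// One 8-byte tag before AND after each block. The low bit of the size word
// is free for the in-use flag because block sizes are multiples of 8.
typedef struct {
    size_t size_and_used;   // block size in bytes, low bit = 1 if allocated
} tag_t;
static inline size_t tag_size(const tag_t* t) { return t->size_and_used & ~(size_t)1; }
static inline int    tag_used(const tag_t* t) { return (int)(t->size_and_used & 1); }
// Footer of this block (size counts header + payload + footer):
static inline tag_t* footer_of(tag_t* header) {
    return (tag_t*)((char*)header + tag_size(header) - sizeof(tag_t));
}
// The PREVIOUS block's footer sits directly before this header,
// which is exactly what makes backward coalescing O(1):
static inline tag_t* prev_footer(tag_t* header) {
    return (tag_t*)((char*)header - sizeof(tag_t));
}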
Design Question 3: What Free List Strategy Will You Use?
Options:
- First-fit: Return first block large enough (fast, fragments over time)
- Best-fit: Return smallest block that fits (minimizes waste, slower search)
- Segregated fit (size classes): Separate free lists for common sizes (jemalloc’s approach)
Think through:
Free list: [64] -> [128] -> [256] -> [512]
malloc(100):
- First-fit: Use 128 (fast, wastes 28 bytes)
- Best-fit: Use 128 (same in this case, but usually searches whole list)
- Segregated: Look in "128-byte class" (fastest)
Guiding questions:
- Does order matter? (Address-ordered lists reduce fragmentation but slow insertion)
- Should you split blocks or waste space? (64-byte block for 32-byte request: split or waste?)
Design Question 4: How Will You Handle Thread Safety?
Options:
- Global lock: Single mutex around all malloc/free (simple, kills performance)
- Per-size-class locks: Reduces contention (better, still some contention)
- Thread-local caches: Each thread has its own pool (jemalloc/tcmalloc approach)
Think through:
Thread 1: malloc(64) → Lock → Get from free list → Unlock
Thread 2: malloc(64) → WAIT for lock...
With per-thread cache:
Thread 1: malloc(64) → Get from local cache (no lock!)
Thread 2: malloc(64) → Get from local cache (no lock!)
Guiding questions:
- What if a thread allocates 1000 blocks then exits? (Memory stuck in its cache)
- How do you refill a thread’s cache when it’s empty? (Need global lock then)
Design Question 5: Will You Implement Coalescing?
Coalescing: Merging adjacent free blocks into larger blocks
Options:
- Immediate coalescing: When freeing, check neighbors and merge (prevents fragmentation)
- Deferred coalescing: Wait until allocation fails, then coalesce (faster free, slower malloc)
- No coalescing: Accept fragmentation (only viable for short-lived programs)
Think through:
Free list: [Block A: free] [Block B: allocated] [Block C: free]
User frees Block B:
- With coalescing: [Single free block: A+B+C]
- Without: [A: free] [B: free] [C: free] ← Can't satisfy large allocations!
Guiding questions:
- How do you find adjacent blocks? (Need footer or address-ordered list)
- Is coalescing worth the CPU cost on every free?
Thinking Exercise: Trace an Allocation by Hand
Before writing code, trace this scenario on paper. This builds the mental model you need.
Scenario: Simple Allocator State
Initial state (you've gotten 1024 bytes from OS):
Free list (address-ordered):
[Block@0x1000, size=1024, free]
Step-by-step trace:
1. void* p1 = malloc(100);
- Search free list for block >= 100 bytes
- Find block at 0x1000 (size=1024)
- Split it:
- Use first 100 bytes (+ 8 for header) = 108 bytes
- Return remaining 916 bytes to free list
- Update free list: [Block@0x106C, size=916, free]
- Return pointer 0x1008 (0x1000 + 8 bytes for header)
Memory layout:
[Header@0x1000: size=108, used] [User data@0x1008: 100 bytes] [Header@0x106C: size=916, free]
2. void* p2 = malloc(200);
- Search free list
- Find block at 0x106C (size=916)
- Split it: Use 208 bytes, return 708 bytes
- Free list: [Block@0x113C, size=708, free]
- Return pointer 0x1074
Memory layout:
[Used:108][User:100][Used:208][User:200][Free:708]
^0x1000 ^0x106C ^0x113C
3. free(p1);
- Receive pointer 0x1008
- Back up 8 bytes to find header at 0x1000
- Mark block as free (size=108)
- Add to free list
Free list (address-ordered):
[Block@0x1000, size=108] -> [Block@0x113C, size=708]
Memory layout:
[Free:108][Used:208][User:200][Free:708]
^0x1000 ^0x106C ^0x113C
4. free(p2);
- Free block at 0x106C (size=208)
- Check neighbors:
- Previous block (0x1000) is free!
- Next block (0x113C) is free!
- COALESCE: Merge all three blocks
Free list:
[Block@0x1000, size=1024]
Memory layout:
[Free:1024]
^0x1000
We're back to the initial state! Perfect coalescing.
Questions to answer about your trace:
- At step 2, why does malloc(200) actually use 208 bytes? (Header overhead)
- If you didn’t coalesce at step 4, could you allocate 500 bytes? (No! Largest free block is 708)
- How would you detect that blocks 0x1000, 0x106C, and 0x113C are adjacent? (Check if 0x1000 + size == 0x106C)
- What if the user writes to p1[101] (out of bounds)? (Corrupts the header at 0x106C: chaos on the next malloc)
Now trace a multithreaded scenario:
Thread 1: malloc(64) }
Thread 2: malloc(64) } Both hit the same free list
Thread 3: malloc(64) }
With a global lock:
T1: Lock → Allocate → Unlock (3µs)
T2: WAIT → Lock → Allocate → Unlock (3µs + wait time)
T3: WAIT → WAIT → Lock → Allocate → Unlock (3µs + wait time)
With per-thread caches:
T1: Get from local cache (0.5µs)
T2: Get from local cache (0.5µs)
T3: Get from local cache (0.5µs)
All happen in parallel!
The Interview Questions They’ll Ask
Prepare to answer these. If you can answer them fluently, you’ve internalized the concepts.
Conceptual Questions
- “What is the difference between sbrk and mmap for getting memory from the OS?”
- Answer should cover: sbrk extends the heap contiguously and can't free individual chunks; mmap gives arbitrary pages, can free them individually, with higher overhead per call.
- “Why does malloc need to store metadata, and where is it stored?”
- Answer: Need to track size (for free/realloc) and in-use status. Usually stored as a header immediately before the returned pointer.
- “Explain external fragmentation vs internal fragmentation.”
- Answer: External = free memory exists but is scattered. Internal = allocated blocks waste space due to alignment/rounding.
- “How do you implement free(ptr) when the user only gives you a pointer?”
- Answer: Back up from ptr by the header size to find the metadata containing the block size.
- “What is coalescing and why is it necessary?”
- Answer: Merging adjacent free blocks. Necessary to prevent external fragmentation—without it, heap becomes unusable over time.
Design Questions
- “Why do allocators like jemalloc use size classes?”
- Answer: Avoids searching free lists. Small allocations go to fixed-size pools (O(1) allocation). Reduces fragmentation by grouping similar sizes.
- “What happens if two threads call malloc at the same time?”
- Answer: Without synchronization, corruption (race condition on the free list). Solutions: global lock (slow), per-thread caches (fast), lock-free algorithms (complex).
- “How would you design a malloc for a single-threaded embedded system with 64KB of RAM?”
- Answer: Simple first-fit with a single free list. No need for thread safety. Maybe skip coalescing to save code space. Use a fixed pool from __heap_start.
- “Why don't allocators immediately return memory to the OS on free()?”
- Answer: System calls (munmap, brk) are expensive. Better to cache memory for future allocations. Only return memory under pressure.
Performance Questions
- “Why is malloc sometimes slow?”
- Answer: Searching free lists (O(n) worst case), lock contention in multithreaded programs, system calls to get more memory, coalescing overhead.
- “How would you reduce fragmentation in a long-running server?”
- Answer: Use size classes, periodic coalescing, segregated heaps for different object lifetimes, memory pools for fixed-size objects.
- “What is false sharing and how does it affect allocators?”
- Answer: Two threads accessing different variables in the same cache line cause cache invalidation. In allocators, per-thread metadata too close together kills performance. Solution: Pad metadata to cache-line boundaries (64 bytes).
Implementation Questions
- “How do you ensure malloc returns an aligned pointer?”
- Answer: Round the requested size up to the alignment boundary. Ensure the metadata size is also aligned. Start the heap at an aligned address.
- “What’s the minimum allocation size, and why?”
- Answer: Must fit a “next” pointer in free blocks (usually 8 bytes on 64-bit). Also alignment requirement. So minimum is typically 16 bytes.
- “How would you debug a heap corruption bug?”
- Answer: Use AddressSanitizer, Valgrind. Add canary values in metadata. Walk entire heap on each operation in debug mode. Check for overflows by verifying canaries.
Advanced Questions
- “Explain how realloc works. Why is it more complex than malloc + free?”
- Answer: If the new size fits in the current block, just update metadata. Otherwise, allocate a new block, memcpy, free the old one. Optimization: If the next block is free, expand in place. (A minimal sketch follows this list.)
- “How does jemalloc achieve thread scalability?”
- Answer: Per-thread arenas (caches). Each thread allocates from its own arena, avoiding locks. Global arena only accessed when thread cache is empty.
- “What is a memory pool and when would you use it over malloc?”
- Answer: Pre-allocated region for fixed-size objects. No fragmentation, O(1) alloc/free, better cache locality. Use for objects with known size/lifetime (e.g., game entities, network packets).
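A minimal sketch of the realloc answer in code, reusing the block_t header from Hint 1 below (in-place expansion omitted; assumes <string.h>):
#include <string.h>
void* realloc(void* ptr, size_t size) {
    if (!ptr) return malloc(size);            // realloc(NULL, n) acts like malloc(n)
    if (size == 0) { free(ptr); return NULL; }
    block_t* block = (block_t*)ptr - 1;       // back up to the header
    size_t usable = block->size - sizeof(block_t);
    if (usable >= size) return ptr;           // current block already big enough
    void* new_ptr = malloc(size);             // otherwise: allocate, copy, free
    if (!new_ptr) return NULL;                // on failure the old block stays valid
    memcpy(new_ptr, ptr, usable);             // copy only the old block's bytes
    free(ptr);
    return new_ptr;
}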
Hints in Layers
Only read these if you’re stuck. Try to implement as much as possible before looking.
Hint 1: Start with the Simplest Possible Allocator
Don’t try to build jemalloc on day one. Start here:
// Global free list (single linked list)
typedef struct block {
size_t size; // Size of this block (including header)
struct block* next; // Next free block (only valid if free)
int free; // 1 if free, 0 if allocated
} block_t;
static block_t* free_list_head = NULL;
void* malloc(size_t size) {
// 1. Round size up to alignment (8 bytes)
size = (size + 7) & ~7;
// 2. Search free list for first-fit
block_t* current = free_list_head;
while (current) {
if (current->free && current->size >= size) {
current->free = 0;
return (void*)(current + 1); // Return pointer after header
}
current = current->next;
}
// 3. No suitable block found, get more memory from OS
// (hint: use sbrk or mmap here, carve the new region into a block, and return it)
return NULL; // placeholder so the function compiles until step 3 is implemented
}
Get THIS working first. Verify with simple tests.
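A minimal smoke test for this stage, as a sketch; link it directly against your allocator so these calls resolve to your implementation:
#include <assert.h>
#include <stdio.h>
#include <string.h>
int main(void) {
    char* a = malloc(100);
    assert(a != NULL);
    memset(a, 'x', 100);            // touch every byte: crashes if the block is bogus
    char* b = malloc(200);
    assert(b != NULL && b != a);    // distinct live blocks must not overlap
    free(a);
    char* c = malloc(50);           // first-fit should be able to reuse a's block
    assert(c != NULL);
    free(b);
    free(c);
    printf("smoke test passed\n");
    return 0;
}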
Hint 2: Getting Memory from the OS
Modern approach using mmap:
#include <sys/mman.h>
void* get_memory_from_os(size_t size) {
// Round up to page size (4096 bytes)
size_t page_size = 4096;
size = (size + page_size - 1) / page_size * page_size;
void* ptr = mmap(NULL, // Let OS choose address
size, // Size in bytes
PROT_READ | PROT_WRITE,// Permissions
MAP_PRIVATE | MAP_ANONYMOUS, // Private, not backed by file
-1, // No file descriptor
0); // No offset
if (ptr == MAP_FAILED) {
return NULL; // Out of memory
}
return ptr;
}
Initialize it as a free block and add to your free list.
Hint 3: Implementing free
void free(void* ptr) {
if (!ptr) return; // free(NULL) is a no-op
// Back up to find the header
block_t* block = (block_t*)ptr - 1;
// Mark as free
block->free = 1;
// TODO: Coalesce with adjacent free blocks (Hint 5)
}
Hint 4: Splitting Blocks
When you find a free block that’s bigger than needed, split it:
void split_block(block_t* block, size_t size) {
// Only split if remainder is large enough for a new block
// (needs space for header + some data)
if (block->size >= size + sizeof(block_t) + 8) {
block_t* new_block = (block_t*)((char*)block + sizeof(block_t) + size);
new_block->size = block->size - size - sizeof(block_t);
new_block->free = 1;
new_block->next = block->next;
block->size = size + sizeof(block_t);
block->next = new_block;
}
}
Hint 5: Coalescing Adjacent Free Blocks
To prevent fragmentation, merge adjacent free blocks:
void coalesce(block_t* block) {
// Merge with next block if it's free
if (block->next && block->next->free) {
// Check if blocks are actually adjacent in memory
if ((char*)block + block->size == (char*)block->next) {
block->size += block->next->size;
block->next = block->next->next;
}
}
// To merge with previous block, you need to search the list
// (This is why address-ordered lists or footers are useful!)
}
Better approach: Keep the free list sorted by address, making coalescing easier.
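As a sketch of that better approach, here is a free that inserts in address order; it assumes a variant where the list links only free blocks (unlike Hint 1's all-blocks list), so adapt it to your design:
void free(void* ptr) {
    if (!ptr) return;
    block_t* block = (block_t*)ptr - 1;   // back up to the header
    block->free = 1;
    // Walk to the insertion point so the list stays sorted by address.
    block_t** link = &free_list_head;
    while (*link && *link < block) {
        link = &(*link)->next;
    }
    block->next = *link;
    *link = block;
    coalesce(block);  // the forward merge from Hint 5 is now a single comparison
}
The pointer-to-pointer walk (block_t** link) avoids special-casing insertion at the head.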
Hint 6: Making it Thread-Safe (First Attempt)
Simplest approach—global mutex:
#include <pthread.h>
static pthread_mutex_t malloc_lock = PTHREAD_MUTEX_INITIALIZER;
void* malloc(size_t size) {
pthread_mutex_lock(&malloc_lock);
void* ptr = malloc_impl(size); // your allocation logic, factored into a helper (hypothetical name)
pthread_mutex_unlock(&malloc_lock);
return ptr;
}
This works but kills performance. Measure the overhead!
Hint 7: Size Classes (Advanced)
Instead of one free list, have multiple:
#define NUM_SIZE_CLASSES 8
// Size classes: 16, 32, 64, 128, 256, 512, 1024, 2048 bytes
static block_t* size_class_lists[NUM_SIZE_CLASSES];
int size_to_class(size_t size) {
if (size <= 16) return 0;
if (size <= 32) return 1;
if (size <= 64) return 2;
// ... etc
return 7;
}
void* malloc(size_t size) {
int class = size_to_class(size);
// Try to allocate from this size class
if (size_class_lists[class]) {
block_t* block = size_class_lists[class];
size_class_lists[class] = block->next;
return (void*)(block + 1);
}
// No free blocks in this class, refill it from the OS
// ... (carve a fresh page into blocks of this class's size)
return NULL; // placeholder until the refill path is implemented
}
Hint 8: Benchmarking Your Allocator
#include <time.h>
void benchmark() {
struct timespec start, end;
clock_gettime(CLOCK_MONOTONIC, &start);
// Allocate 10,000 blocks
void* ptrs[10000];
for (int i = 0; i < 10000; i++) {
ptrs[i] = malloc(64);
}
clock_gettime(CLOCK_MONOTONIC, &end);
long ns = (end.tv_sec - start.tv_sec) * 1000000000L +
(end.tv_nsec - start.tv_nsec);
printf("10,000 allocations took %ld ns (%.2f ns each)\n",
ns, ns / 10000.0);
// Don't forget to free!
for (int i = 0; i < 10000; i++) {
free(ptrs[i]);
}
}
Compare against the system malloc by running the same benchmark without your LD_PRELOAD.
Hint 9: Debugging with Address Sanitizer
Compile your test programs with ASan to shake out memory bugs (note: ASan ships its own malloc interceptors, so rename your entry points, e.g. my_malloc/my_free, while testing under it; you can't interpose malloc and run ASan at the same time):
$ gcc -fsanitize=address -g myalloc.c test.c -o test
$ ./test
# If you have a bug, you'll see:
# ==12345==ERROR: AddressSanitizer: heap-buffer-overflow
# READ of size 4 at 0x602000000014 thread T0
# #0 0x7f8a2c001234 in test test.c:42
This catches use-after-free, buffer overflows, etc.
Hint 10: Testing with Real Programs
Create a simple wrapper script:
#!/bin/bash
# test_allocator.sh
export LD_PRELOAD=./libmyalloc.so
echo "Testing with ls..."
ls /usr/bin | head -n 5
echo "Testing with Python..."
python3 -c "print('Hello from custom allocator!')"
echo "Testing with grep..."
grep -r "malloc" . | head -n 5
echo "All tests passed!"
If any of these crash, you have a bug!
Books That Will Help
Here’s exactly where to look for deep understanding of each concept:
| Topic | Book | Specific Chapter/Section |
|---|---|---|
| Virtual Memory & Address Spaces | Computer Systems: A Programmer's Perspective by Bryant & O'Hallaron | Chapter 9: Virtual Memory (especially 9.9 for dynamic allocation) |
| | Operating Systems: Three Easy Pieces by Arpaci-Dusseau | Chapters 13-15: Address Spaces, Memory API, Address Translation |
| | The Linux Programming Interface by Michael Kerrisk | Chapter 7: Memory Allocation |
| Classic malloc Implementation | The C Programming Language by Kernighan & Ritchie | Chapter 8.7: Example—A Storage Allocator |
| | C Interfaces and Implementations by David Hanson | Chapter 5: Arena, Chapter 6: Mem |
| Free Lists & Data Structures | Computer Systems: A Programmer's Perspective by Bryant & O'Hallaron | Chapter 9.9.12-9.9.14: Placement policies, splitting/coalescing |
| | Algorithms in C by Robert Sedgewick | Chapter 3: Elementary Data Structures (linked lists) |
| Fragmentation Deep Dive | CS360 Fragmentation Lecture | Complete lecture notes |
| | CS 341 Malloc Tutorial | Internal vs external fragmentation |
| jemalloc Design | jemalloc Background Wiki | Complete design documentation |
| | Scalable memory allocation using jemalloc | Facebook Engineering blog post |
| | "A Scalable Concurrent malloc Implementation" by Jason Evans | Entire paper (search online) |
| Thread Safety & Atomics | Rust Atomics and Locks by Mara Bos | Chapters 1-3 (best explanation, applies to C) |
| | Computer Systems: A Programmer's Perspective by Bryant & O'Hallaron | Chapter 6: The Memory Hierarchy (cache lines, false sharing) |
| | "Hoard: A Scalable Memory Allocator" by Emery Berger | Paper on thread-local caching (search online) |
| Alignment & ABI | Computer Systems: A Programmer's Perspective by Bryant & O'Hallaron | Chapter 3.9: Heterogeneous Data Structures |
| | Effective C by Robert Seacord | Chapter 2: Objects, Functions, and Types |
| Arena & Pool Allocators | C Interfaces and Implementations by David Hanson | Chapter 5: Arena (complete implementation walkthrough) |
| | Arena and Memory Pool Allocators | Modern overview of techniques |
| | High Performance Memory Management: Arena Allocators | Practical guide |
| Practical Implementation | Dan Luu's Malloc Tutorial | Complete walkthrough of building a malloc |
| | Embedded Artistry: Implementing Malloc | First-fit free list implementation |
| Debugging Memory Issues | The Art of Debugging with GDB, DDD, and Eclipse by Matloff & Salzman | Chapters 1-3: Debugger basics |
| | Understanding and Using C Pointers by Richard Reese | Chapter 2: Dynamic Memory Management |
| LD_PRELOAD Technique | What Is the LD_PRELOAD Trick? | Complete explanation |
| | Replacing malloc (GNU C Library) | Official documentation |
| Comparing Allocators | Exploring Different Memory Allocators | jemalloc, tcmalloc, mimalloc comparison |
Recommended Reading Path
- Foundation (Week 1):
- Read K&R Chapter 8.7 (30 pages, classic malloc)
- Read Dan Luu’s Malloc Tutorial (online, practical)
- Read Bryant & O’Hallaron Ch. 9.9 (30 pages, fragmentation/policies)
- Advanced Techniques (Week 2):
- Read jemalloc Background (size classes, arenas)
- Read David Hanson Ch. 5-6 (50 pages, arena allocators)
- Read Embedded Artistry tutorial (implementation details)
- Thread Safety (Week 3):
- Read Mara Bos Ch. 1-3 (60 pages, atomics/memory ordering)
- Read “Hoard” paper (15 pages, thread-local caching)
- Read Bryant & O’Hallaron Ch. 6.4-6.5 (cache hierarchies)
- Debugging & Polish (Week 4):
- Read Matloff & Salzman Ch. 1-3 (debugging techniques)
- Read LD_PRELOAD tutorial (making it usable)
- Skim allocator comparison articles for optimization ideas
Total reading: ~200 pages + online resources over 4 weeks.
Sources:
- jemalloc Background
- Scalable memory allocation using jemalloc
- Exploring Different Memory Allocators
- CS 341 Malloc Tutorial
- CS360 Fragmentation Lecture
- Dan Luu’s Malloc Tutorial
- Embedded Artistry: Implementing Malloc
- What Is the LD_PRELOAD Trick?
- Replacing malloc (GNU C Library)
- Arena and Memory Pool Allocators
- High Performance Memory Management: Arena Allocators
Project 2: Work-Stealing Thread Pool
- File: LEARN_CPP_CONCURRENCY_AND_PARALLELISM.md
- Main Programming Language: C++
- Alternative Programming Languages: Rust, Go, Java
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 4. The “Open Core” Infrastructure
- Difficulty: Level 3: Advanced
- Knowledge Area: Thread Pool Design / Work Scheduling
- Software or Tool: High-Performance Thread Pool Library
What you’ll build: A thread pool with work-stealing scheduling, similar to Rayon’s or Go’s runtime scheduler.
Why it teaches threading primitives: You’ll implement mutexes, condition variables, and atomic operations from scratch (or use them correctly). Work-stealing forces you to understand cache coherency, false sharing, and memory ordering—you can’t just slap a lock on everything.
Core challenges you’ll face:
- Implementing lock-free deques for each worker (maps to atomics and memory ordering)
- Avoiding false sharing between worker threads (maps to cache-line awareness)
- Balancing work without excessive stealing overhead (maps to performance tuning)
- Handling thread parking/unparking efficiently (maps to threading primitives)
- Making the API ergonomic for parallel iterators (maps to API design)
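The lock-free deque challenge above is the heart of the design (it is the Chase-Lev structure cited in the Key Concepts below). A minimal bounded sketch in C11 atomics, using sequentially consistent operations throughout for clarity; the published algorithm uses weaker memory orderings and a growable ring buffer, so treat this as a starting point, not the final word:
#include <stdatomic.h>
#include <stddef.h>
#define DEQ_CAP 1024  // fixed capacity for the sketch; real deques grow
typedef struct { void (*fn)(void*); void* arg; } task_t;
typedef struct {
    atomic_long top;     // index thieves steal from (oldest task)
    atomic_long bottom;  // index the owner pushes/pops at (newest task)
    task_t buf[DEQ_CAP];
} deque_t;
// Owner only: push a task at the bottom.
static int deque_push(deque_t* q, task_t t) {
    long b = atomic_load(&q->bottom);
    long tp = atomic_load(&q->top);
    if (b - tp >= DEQ_CAP) return 0;              // full
    q->buf[b % DEQ_CAP] = t;
    atomic_store(&q->bottom, b + 1);              // publish the task
    return 1;
}
// Owner only: pop from the bottom (LIFO keeps hot data in cache).
static int deque_pop(deque_t* q, task_t* out) {
    long b = atomic_load(&q->bottom) - 1;
    atomic_store(&q->bottom, b);                  // reserve the slot first
    long tp = atomic_load(&q->top);
    if (tp > b) {                                 // deque was empty
        atomic_store(&q->bottom, b + 1);          // restore
        return 0;
    }
    *out = q->buf[b % DEQ_CAP];
    if (tp == b) {                                // last task: race a thief for it
        long expected = tp;
        int won = atomic_compare_exchange_strong(&q->top, &expected, tp + 1);
        atomic_store(&q->bottom, tp + 1);
        return won;
    }
    return 1;
}
// Any other thread: steal from the top (FIFO grabs the oldest, coldest task).
static int deque_steal(deque_t* q, task_t* out) {
    long tp = atomic_load(&q->top);
    long b = atomic_load(&q->bottom);
    if (tp >= b) return 0;                        // empty
    *out = q->buf[tp % DEQ_CAP];
    return atomic_compare_exchange_strong(&q->top, &tp, tp + 1); // claim it
}
Each worker pops from its own deque and only calls deque_steal on others when its own runs dry, so in balanced workloads the atomics almost never contend.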
Key Concepts:
- Memory ordering and atomics: “Rust Atomics and Locks” by Mara Bos - Chapters 1-3 (even if writing in C, this is the clearest explanation)
- Work-stealing algorithm: “Scheduling Multithreaded Computations by Work Stealing” - Blumofe & Leiserson paper
- Lock-free deque: Chase-Lev deque paper - “Dynamic Circular Work-Stealing Deque”
- False sharing: “Computer Systems: A Programmer’s Perspective” by Bryant & O’Hallaron - Chapter 6 (Cache)
Difficulty: Advanced Time estimate: 2-3 weeks Prerequisites: Basic threading, atomics concepts
Real world outcome:
- A library that parallelizes embarrassingly parallel workloads
- Run ./threadpool_demo --threads 8 and see a Mandelbrot set render 7.5x faster than single-threaded
- Benchmark output showing near-linear scaling with core count
Learning milestones:
- Basic thread pool working - You understand condition variables and worker loops
- Work-stealing implemented - You grasp why memory ordering matters and have debugged a race condition
- Near-linear scaling achieved - You’ve eliminated false sharing and understand cache-aware programming
Project 3: Mini Async Runtime
- File: PHASE_2_TRACK_B_SYSTEMS_LIBRARIES_PROJECTS.md
- Programming Language: Rust or C
- Coolness Level: Level 5: Pure Magic (Super Cool)
- Business Potential: 5. The “Industry Disruptor”
- Difficulty: Level 5: Master
- Knowledge Area: Asynchronous I/O / Runtime Design
- Software or Tool: Epoll / Kqueue / Futures
- Main Book: “Asynchronous Programming in Rust” (or libuv documentation)
What you’ll build: A single-threaded async runtime that can serve HTTP requests, similar to a simplified Tokio or libuv.
Why it teaches async runtimes: Async is magic until you build the event loop yourself. You’ll implement futures, understand why poll vs push matters, and see exactly how epoll/kqueue enables thousands of concurrent connections without threads.
Core challenges you’ll face:
- Implementing an event loop with epoll/kqueue (maps to platform differences)
- Building a future/task abstraction (maps to API design)
- Managing wakers and the reactor pattern (maps to async internals)
- Non-blocking IO without burning CPU (maps to performance tuning)
- Supporting timers and cancellation (maps to real-world completeness)
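Before any future/waker machinery, get the raw event loop under your fingers. A minimal sketch of a single-threaded epoll echo server in C (Linux-only; error handling and EAGAIN loops trimmed for brevity):
#include <sys/epoll.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <fcntl.h>
#include <unistd.h>
int main(void) {
    int lfd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr = {0};
    addr.sin_family = AF_INET;
    addr.sin_port = htons(8080);
    bind(lfd, (struct sockaddr*)&addr, sizeof(addr));
    listen(lfd, SOMAXCONN);
    int ep = epoll_create1(0);
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = lfd };
    epoll_ctl(ep, EPOLL_CTL_ADD, lfd, &ev);
    struct epoll_event events[64];
    for (;;) {
        int n = epoll_wait(ep, events, 64, -1);    // block until something is ready
        for (int i = 0; i < n; i++) {
            int fd = events[i].data.fd;
            if (fd == lfd) {                        // new connection
                int cfd = accept(lfd, NULL, NULL);
                fcntl(cfd, F_SETFL, O_NONBLOCK);    // never block the loop
                struct epoll_event cev = { .events = EPOLLIN, .data.fd = cfd };
                epoll_ctl(ep, EPOLL_CTL_ADD, cfd, &cev);
            } else {                                // data ready: echo it back
                char buf[4096];
                ssize_t r = read(fd, buf, sizeof(buf));
                if (r <= 0) { close(fd); continue; } // EOF or error: drop
                write(fd, buf, (size_t)r);
            }
        }
    }
}
Every connection is just an fd registered with the same epoll instance, which is why memory stays flat as connections grow; the runtime's job is to replace the inline echo with a stored continuation (waker) per fd.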
Key Concepts:
- Event loops and IO multiplexing: “The Linux Programming Interface” by Michael Kerrisk - Chapter 63 (Alternative I/O Models)
- Future/Promise patterns: Tokio tutorial’s “Async in Depth” section
- Reactor pattern: “libuv Design Overview” documentation
- Platform IO differences: “Advanced Programming in the UNIX Environment” by Stevens & Rago - Chapter 14
Resources for epoll/kqueue abstraction challenge:
- “Epoll is fundamentally broken” by Marek Majkowski (Cloudflare) - Understanding edge vs level triggering pitfalls
Difficulty: Advanced Time estimate: 2-3 weeks Prerequisites: Sockets basics, understanding of file descriptors
Real world outcome:
- A working HTTP server handling 10,000 concurrent connections on a single thread
- wrk -c 1000 http://localhost:8080/ showing impressive throughput
- Visually see connection count vs memory usage staying flat (unlike thread-per-connection)
Learning milestones:
- Echo server with epoll - You understand the event loop and non-blocking IO
- Task/future abstraction working - You grasp how wakers notify the runtime
- HTTP server benchmarked - You’ve seen why async enables C10K and understand the tradeoffs
Project 4: Cross-Platform Syscall Abstraction Library
- File: PHASE_2_TRACK_B_SYSTEMS_LIBRARIES_PROJECTS.md
- Programming Language: C
- Coolness Level: Level 2: Practical but Forgettable
- Business Potential: 4. The “Open Core” Infrastructure
- Difficulty: Level 3: Advanced
- Knowledge Area: Systems Programming / Portability
- Software or Tool: POSIX / Win32 API
- Main Book: “Advanced Programming in the UNIX Environment” by Stevens & Rago
What you’ll build: A library that wraps platform-specific syscalls (file operations, networking, process control) into a unified API, similar to libuv’s uv_fs_* or Rust’s std.
Why it teaches ABI and platform differences: You’ll discover that “POSIX” doesn’t mean “identical.” struct layouts differ, error codes differ, syscall numbers differ. You’ll fight with calling conventions and learn why #ifdef hell exists.
Core challenges you’ll face:
- Abstracting file operations across Linux/macOS/Windows (maps to platform differences)
- Handling struct layout differences (maps to ABI details)
- Defining stable API versioning (maps to API design)
- Avoiding undefined behavior in type punning (maps to UB avoidance)
- Writing comprehensive test suites (maps to real-world robustness)
Key Concepts:
- POSIX variations: “Advanced Programming in the UNIX Environment” by Stevens & Rago - Throughout (notes platform differences)
- ABI and calling conventions: “Computer Systems: A Programmer’s Perspective” by Bryant & O’Hallaron - Chapter 3 (Machine-Level Representation)
- Undefined behavior in C: “Effective C” by Robert Seacord - Chapters on UB
- API design principles: “C Interfaces and Implementations” by David Hanson - Introduction and design philosophy
Difficulty: Intermediate Time estimate: 2 weeks Prerequisites: C, basic familiarity with at least 2 OSes
Real world outcome:
- A header library that compiles the same code on Linux, macOS, and (optionally) Windows
- A demo program that lists directory contents, reads files, and spawns processes—same code, all platforms
- CI pipeline showing green builds on all target platforms
Learning milestones:
- File ops abstracted - You understand why stat differs between platforms
- Process spawning unified - You grasp fork/exec vs CreateProcess differences
- Library is usable - Someone else can #include your header and write portable code
Real World Outcome
When you complete this project, you’ll have built a production-ready abstraction library similar to what powers Node.js (libuv), Rust’s std, and countless cross-platform applications.
What you’ll actually see:
A working library with a clean API that compiles and runs identically across Linux, macOS, and Windows:
# Linux
$ gcc -o demo demo.c -I./include -L./lib -lxplat
$ ./demo
[xplat] Opening file: test.txt
[xplat] Platform: Linux (POSIX)
[xplat] File descriptor: 3
[xplat] Read 1024 bytes
[xplat] Process spawned: PID 12345 (using fork/exec)
[xplat] Child exited with status: 0
# macOS
$ clang -o demo demo.c -I./include -L./lib -lxplat
$ ./demo
[xplat] Opening file: test.txt
[xplat] Platform: Darwin (POSIX)
[xplat] File descriptor: 3
[xplat] Read 1024 bytes
[xplat] Process spawned: PID 54321 (using fork/exec)
[xplat] Child exited with status: 0
# Windows (using MSVC or MinGW)
C:\> cl demo.c /I.\include /link xplat.lib
C:\> demo.exe
[xplat] Opening file: test.txt
[xplat] Platform: Windows (Win32)
[xplat] File handle: 0x000000000000010C
[xplat] Read 1024 bytes
[xplat] Process spawned: PID 8472 (using CreateProcess)
[xplat] Child exited with status: 0
Your library’s header file (xplat.h) will look like:
// Cross-platform file operations
typedef struct xplat_file xplat_file_t;
xplat_file_t* xplat_open(const char* path, int flags);
ssize_t xplat_read(xplat_file_t* file, void* buf, size_t count);
int xplat_close(xplat_file_t* file);
// Cross-platform process spawning
typedef struct xplat_process xplat_process_t;
xplat_process_t* xplat_spawn(const char* cmd, char** args);
int xplat_wait(xplat_process_t* proc);
// Cross-platform threading
typedef struct xplat_thread xplat_thread_t;
xplat_thread_t* xplat_thread_create(void (*func)(void*), void* arg);
void xplat_thread_join(xplat_thread_t* thread);
Your CI/CD pipeline output showing it works everywhere:
# .github/workflows/ci.yml results
✓ Build on Ubuntu 22.04 (gcc 11.4) - PASSED
✓ Build on Ubuntu 22.04 (clang 14) - PASSED
✓ Build on macOS 13 (AppleClang 15.0) - PASSED
✓ Build on Windows Server 2022 (MSVC) - PASSED
✓ Build on Windows Server 2022 (MinGW) - PASSED
✓ Test Suite (Linux) - 47/47 tests passed
✓ Test Suite (macOS) - 47/47 tests passed
✓ Test Suite (Windows) - 47/47 tests passed
A real application using your library:
// app.c - Same code works on all platforms!
#include "xplat.h"
#include <stdio.h>
int main() {
// File I/O - works on Linux, macOS, Windows
xplat_file_t* f = xplat_open("data.txt", XPLAT_O_RDONLY);
char buf[256];
ssize_t n = xplat_read(f, buf, sizeof(buf));
xplat_close(f);
// Process spawning - works everywhere
char* args[] = {"echo", "Hello from child", NULL};
xplat_process_t* proc = xplat_spawn("echo", args);
int status = xplat_wait(proc);
printf("Child process exited: %d\n", status);
return 0;
}
Behind the scenes, your implementation handles:
On Linux/macOS (POSIX):
// xplat_posix.c
xplat_file_t* xplat_open(const char* path, int flags) {
int fd = open(path, convert_flags(flags));
// Wrap fd in xplat_file_t structure
}
On Windows (Win32):
// xplat_win32.c
xplat_file_t* xplat_open(const char* path, int flags) {
HANDLE h = CreateFileA(path, ...);
// Wrap HANDLE in xplat_file_t structure
}
The concrete artifacts you’ll deliver:
- Library source: src/xplat_posix.c, src/xplat_win32.c, include/xplat.h
- Test suite: 40+ tests covering file I/O, process spawning, threading, networking
- Documentation: API reference showing what works on which platforms
- Build system: CMake or Makefile that builds on all platforms
- CI/CD: GitHub Actions proving it works on every commit
You’ll have tangible proof that you understand the deep differences between operating systems—not just theoretical knowledge, but a working library that someone can actually use.
The Core Question You’re Answering
“Why can’t I just write code that works everywhere? What makes POSIX and Windows so fundamentally different?”
Most developers treat cross-platform compatibility as an annoying afterthought—a maze of #ifdef statements and mysterious build failures. But the differences between POSIX and Windows reveal deep truths about operating system design:
- What is a file descriptor vs a HANDLE? They both represent “open files,” but their internals are completely different
- Why does fork() not exist on Windows? It’s not a missing feature—it’s incompatible with Windows’ process model
- What makes a good abstraction? How do you hide platform differences without sacrificing performance?
After this project, you’ll understand that “portable code” isn’t about avoiding platform-specific APIs—it’s about building the right abstraction layer that respects each platform’s native design while presenting a unified interface.
Concepts You Must Understand First
Stop and research these before coding:
- POSIX File Descriptors vs Windows HANDLEs
- What is a file descriptor? Just an integer index into the process's file descriptor table
- What is a Windows HANDLE? An opaque pointer to a kernel object
- Why does Linux let you select() on file descriptors but Windows doesn't work that way?
- How do you convert between them when needed?
- Book Reference: “Advanced Programming in the UNIX Environment” Ch. 3 - Stevens & Rago
- Book Reference: “Windows System Programming” Ch. 2 - Johnson Hart
- Process Creation Models
- POSIX: fork() duplicates the entire process, then exec() replaces it
- Windows: CreateProcess() spawns a new process in one call
- Why doesn't Windows have fork()? (Hint: Address space layout and DLLs)
- How do you pass environment variables and file handles to children?
- Book Reference: “Advanced Programming in the UNIX Environment” Ch. 8 - Stevens & Rago
- Book Reference: “Windows Via C/C++” Ch. 4 - Jeffrey Richter
- Error Handling Conventions
- POSIX: Return -1 on error, set errno (a thread-local global)
- Windows: Return NULL/FALSE on error, call GetLastError()
- How do you design a unified error API?
- What's the difference between EINVAL and ERROR_INVALID_PARAMETER?
- Book Reference: “The Linux Programming Interface” Ch. 3.4 - Kerrisk
- Struct Layout and ABI
- Why does struct stat differ between Linux and macOS?
- What is struct padding and alignment?
- How do 32-bit vs 64-bit systems affect structure layout?
- What is _FILE_OFFSET_BITS and why does it matter?
- Book Reference: “Computer Systems: A Programmer's Perspective” Ch. 3.9 - Bryant & O'Hallaron
- Calling Conventions
- What is cdecl vs stdcall vs fastcall?
- Why does Windows use different calling conventions for different APIs?
- How do you make function pointers portable?
- Book Reference: “Computer Systems: A Programmer's Perspective” Ch. 3.7 - Bryant & O'Hallaron
- Thread-Local Storage (TLS)
- POSIX uses pthread_key_t, Windows uses TlsAlloc()
- How do you implement thread-safe error reporting?
- What is __thread vs __declspec(thread)?
- Book Reference: “Programming with POSIX Threads” Ch. 5.4 - Butenhof
- Symbol Visibility and Linking
- What is the difference between static and dynamic linking?
- How do you export symbols from a shared library?
- What is __declspec(dllexport) vs __attribute__((visibility("default")))?
- Why do Windows DLLs need .lib files?
- Book Reference: “Advanced C and C++ Compiling” Ch. 7 - Stevanovic
- Memory-Mapped Files
- POSIX: mmap() with MAP_SHARED or MAP_PRIVATE
- Windows: CreateFileMapping() + MapViewOfFile()
- What guarantees do you get about consistency?
- How do you handle page faults?
- Book Reference: “The Linux Programming Interface” Ch. 49 - Kerrisk
Questions to Guide Your Design
Before implementing, think through these:
- API Design Philosophy
- Should your API look like POSIX (file descriptors as ints)?
- Or like Windows (opaque handles as pointers)?
- Or something completely new?
- How do you handle functionality that exists on one platform but not others?
- Error Handling Strategy
- Do you use return codes like POSIX (-1 + errno)?
- Or return NULL and have a separate error getter?
- How do you map platform-specific errors to generic codes? (One possible shape is sketched after this list.)
- Should you provide both generic and platform-specific error details?
- Resource Management
- Who is responsible for freeing memory: caller or library?
- How do you prevent resource leaks when code is platform-specific?
- Should you use opaque types or expose internals?
- Feature Detection
- How do you detect if functionality is available at compile time?
- What about runtime feature detection?
- Should unsupported operations fail at compile time or runtime?
- Performance vs Portability
- When should you use platform-specific optimizations?
- How do you avoid the abstraction becoming a performance bottleneck?
- Should you allow “escape hatches” to access native handles?
- Testing Strategy
- How do you test Windows code when developing on Linux (and vice versa)?
- Should you use virtual machines, containers, or CI/CD?
- How do you test the same logical operation across platforms?
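To make the error-handling question concrete, here is one possible shape as a minimal sketch: a library-defined error enum, a thread-local last-error slot, and per-platform translation. Every name here (xplat_err_t, xplat_last_error, set_error_from_os) is hypothetical, not a real library's API:
// Hypothetical unified error codes for the library.
typedef enum { XPLAT_OK = 0, XPLAT_ENOENT, XPLAT_EACCES, XPLAT_EINVAL, XPLAT_EUNKNOWN } xplat_err_t;
// One last-error slot per thread (C11 _Thread_local maps onto the native TLS
// mechanisms discussed above).
static _Thread_local xplat_err_t xplat_errno = XPLAT_OK;
xplat_err_t xplat_last_error(void) { return xplat_errno; }
#ifdef _WIN32
#include <windows.h>
static void set_error_from_os(void) {
    switch (GetLastError()) {
        case ERROR_FILE_NOT_FOUND:    xplat_errno = XPLAT_ENOENT; break;
        case ERROR_ACCESS_DENIED:     xplat_errno = XPLAT_EACCES; break;
        case ERROR_INVALID_PARAMETER: xplat_errno = XPLAT_EINVAL; break;
        default:                      xplat_errno = XPLAT_EUNKNOWN; break;
    }
}
#else
#include <errno.h>
static void set_error_from_os(void) {
    switch (errno) {
        case ENOENT: xplat_errno = XPLAT_ENOENT; break;
        case EACCES: xplat_errno = XPLAT_EACCES; break;
        case EINVAL: xplat_errno = XPLAT_EINVAL; break;
        default:     xplat_errno = XPLAT_EUNKNOWN; break;
    }
}
#endif
Each xplat_* function calls set_error_from_os() on failure, so callers get one portable error query regardless of platform.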
Thinking Exercise
Design the File API on Paper
Before writing any code, design how you’d abstract file operations:
// On POSIX:
int fd = open("/path/to/file", O_RDONLY);
read(fd, buffer, size);
close(fd);
// On Windows:
HANDLE h = CreateFileA("C:\\path\\to\\file", GENERIC_READ, ...);
ReadFile(h, buffer, size, &bytes_read, NULL);
CloseHandle(h);
// Your abstraction: ???
xplat_file_t* f = xplat_open(???, ???);
xplat_read(f, ???, ???);
xplat_close(f);
Questions to answer:
- What is xplat_file_t? Should it be:
- An opaque pointer? (typedef struct xplat_file xplat_file_t;)
- A tagged union containing either fd or HANDLE?
- Just an int with special encoding?
- How do you handle flags?
- O_RDONLY is 0 on POSIX, but GENERIC_READ is 0x80000000 on Windows
- Do you define your own flag constants?
- Do you translate flags at runtime?
- What about errors?
- If open() fails, it returns -1 and sets errno
- If CreateFileA() fails, it returns INVALID_HANDLE_VALUE
- What does xplat_open() return on error?
- Sync vs async I/O?
- POSIX has blocking vs non-blocking (O_NONBLOCK)
- Windows has synchronous vs overlapped I/O
- How do you expose this in your API?
Draw a diagram showing how a call to xplat_read() flows through your library:
User calls: xplat_read(file, buf, 1024)
↓
[Platform detection at compile-time]
↓ ↓
POSIX path Windows path
↓ ↓
read(fd, buf, 1024) ReadFile(h, buf, 1024, ...)
↓ ↓
[Return value translation]
↓ ↓
Return to user (same behavior)
The Interview Questions They’ll Ask
Prepare to answer these:
- “What’s the fundamental difference between a POSIX file descriptor and a Windows HANDLE?”
- Answer: fd is an integer index, HANDLE is an opaque kernel object pointer
- Why it matters: Affects how you can use them (select/poll vs IOCP)
- “Why doesn’t Windows have fork()?”
- Answer: Windows processes don’t support copy-on-write address space duplication
- The real reason: DLL base addresses and loader state make it impractical
- “How would you implement thread-safe error reporting?”
- Answer: Thread-local storage for error codes
- Implementation differs:
pthread_key_tvsTlsAlloc()
- “What is structure padding and why does it matter for cross-platform code?”
- Answer: Compiler inserts padding bytes to satisfy alignment requirements
- Can differ between platforms/compilers, breaking binary compatibility
- “How do you handle features that exist on one platform but not others?”
- Answer: Compile-time detection with `#ifdef`, runtime capability checks, or graceful degradation
  - Example: `epoll` on Linux, `kqueue` on BSD, `IOCP` on Windows
- Answer: Compile-time detection with
- “What’s the difference between static and dynamic linking, and why does it matter for portability?”
- Answer: Static includes library code in binary, dynamic loads at runtime
- Windows requires .lib import library for DLLs, POSIX uses just .so
- “How would you test cross-platform code during development?”
- Answer: Combination of virtual machines, Docker containers, and CI/CD
- Best practice: GitHub Actions with matrix builds across OS
- “What is undefined behavior in type punning, and how does it affect portability?”
- Answer: Accessing same memory through incompatible pointer types
- Example: `*(int*)&float_var` invokes UB and may break on different compilers
Hints in Layers
Hint 1: Start with Opaque Types
Don’t expose platform details in your header:
// xplat.h (public header)
typedef struct xplat_file xplat_file_t; // Opaque!
xplat_file_t* xplat_open(const char* path, int flags);
// xplat_internal.h (private header)
struct xplat_file {
#ifdef _WIN32
HANDLE handle;
#else
int fd;
#endif
};
This lets you change internals without breaking API.
Hint 2: Use Feature Detection Macros
Create a consistent way to detect platforms:
// xplat_platform.h
#if defined(_WIN32) || defined(_WIN64)
#define XPLAT_WINDOWS 1
#define XPLAT_POSIX 0
#elif defined(__unix__) || defined(__APPLE__)
#define XPLAT_POSIX 1
#define XPLAT_WINDOWS 0
#else
#error "Unsupported platform"
#endif
Hint 3: Create a Translation Layer for Flags
Map your generic flags to platform-specific ones:
// Define your own flags
#define XPLAT_O_RDONLY 0x01
#define XPLAT_O_WRONLY 0x02
#define XPLAT_O_RDWR 0x04
// Translate at runtime (POSIX)
static int translate_flags_posix(int xplat_flags) {
    int flags = 0;
    // Note: O_RDONLY is 0 on POSIX, so read-only needs no bit at all
    if (xplat_flags & XPLAT_O_WRONLY) flags |= O_WRONLY;
    if (xplat_flags & XPLAT_O_RDWR)   flags |= O_RDWR;
    return flags;
}
// Translate at runtime (Windows)
static DWORD translate_flags_win32(int xplat_flags) {
    DWORD access = 0;
    if (xplat_flags & XPLAT_O_RDONLY) access |= GENERIC_READ;
    if (xplat_flags & XPLAT_O_WRONLY) access |= GENERIC_WRITE;
    if (xplat_flags & XPLAT_O_RDWR)   access |= GENERIC_READ | GENERIC_WRITE;
    return access;
}
Hint 4: Build Incrementally
Start with the simplest possible abstraction:
- Phase 1: Just abstract `open()`/`CreateFile()` and return success/failure
- Phase 2: Add `read()`/`ReadFile()` abstraction
- Phase 3: Add proper error handling and reporting
- Phase 4: Add `close()`/`CloseHandle()` abstraction
- Phase 5: Extend to directories, process spawning, etc.
Don’t try to abstract everything at once!
Hint 5: Use CI/CD from Day One
Set up GitHub Actions early:
# .github/workflows/ci.yml
name: CI
on: [push, pull_request]
jobs:
build:
strategy:
matrix:
os: [ubuntu-latest, macos-latest, windows-latest]
runs-on: ${{ matrix.os }}
steps:
- uses: actions/checkout@v3
- name: Build
run: make
- name: Test
run: make test
This catches portability bugs immediately.
Hint 6: Study libuv’s Source Code
Read how libuv handles platform abstraction:
git clone https://github.com/libuv/libuv.git
cd libuv
# Look at how they structure platform-specific code:
ls src/unix/ # POSIX implementations
ls src/win/ # Windows implementations
cat include/uv.h # Public API (platform-agnostic)
Pay attention to:
- How they use opaque handles (`uv_loop_t`, `uv_handle_t`)
- How they map platform errors into one namespace (`uv_errno_t`, the `UV_E*` codes)
- How they pick backends per platform (`src/unix/kqueue.c` vs the epoll code on Linux)
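Error-code mapping deserves the same treatment in your own library. A minimal sketch, assuming hypothetical `XPLAT_E*` codes (the same idea as libuv's `UV_E*` constants):

```c
#include <errno.h>

enum xplat_err {
    XPLAT_OK = 0,
    XPLAT_ENOENT,    /* file not found */
    XPLAT_EACCES,    /* permission denied */
    XPLAT_EUNKNOWN   /* anything not mapped yet */
};

#ifdef _WIN32
#include <windows.h>
static enum xplat_err xplat_last_error(void) {
    switch (GetLastError()) {           /* raw Windows error */
    case ERROR_FILE_NOT_FOUND:
    case ERROR_PATH_NOT_FOUND: return XPLAT_ENOENT;
    case ERROR_ACCESS_DENIED:  return XPLAT_EACCES;
    default:                   return XPLAT_EUNKNOWN;
    }
}
#else
static enum xplat_err xplat_last_error(void) {
    switch (errno) {                    /* raw POSIX errno */
    case ENOENT: return XPLAT_ENOENT;
    case EACCES: return XPLAT_EACCES;
    default:     return XPLAT_EUNKNOWN;
    }
}
#endif
```

Keep the raw platform error available too (per the design questions above) so callers can log the exact failure.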
Hint 7: Handle Struct Differences with Getters
Don’t expose platform-specific structs directly:
// Bad: Exposes struct layout
struct xplat_stat {
size_t size;
time_t mtime;
// Fields differ across platforms!
};
// Good: Use accessor functions
typedef struct xplat_stat xplat_stat_t; // Opaque
size_t xplat_stat_size(xplat_stat_t* st);
time_t xplat_stat_mtime(xplat_stat_t* st);
Internally, convert from struct stat (POSIX) or WIN32_FILE_ATTRIBUTE_DATA (Windows).
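A minimal sketch of the POSIX half of that conversion; the `xplat_stat_fill_posix` helper is illustrative, and the struct definition would live in your private header:

```c
#include <stddef.h>
#include <sys/stat.h>
#include <time.h>

struct xplat_stat {      /* private definition behind the opaque type */
    size_t size;
    time_t mtime;
};

/* Fill the portable struct from POSIX stat(2); the Windows build would
 * fill the same struct from GetFileAttributesExA() instead. */
int xplat_stat_fill_posix(const char* path, struct xplat_stat* out) {
    struct stat st;
    if (stat(path, &st) != 0) return -1;
    out->size  = (size_t)st.st_size;
    out->mtime = st.st_mtime;
    return 0;
}

size_t xplat_stat_size(const struct xplat_stat* st)  { return st->size; }
time_t xplat_stat_mtime(const struct xplat_stat* st) { return st->mtime; }
```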
Hint 8: Test on Real Hardware, Not Just VMs
Platform quirks only show up on real systems:
- Test on actual Windows (not just WSL)
- Test on real macOS (not just Hackintosh)
- Test on different Linux distributions (Ubuntu, Fedora, Alpine)
Use free CI/CD to test on real GitHub-hosted runners.
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| POSIX file I/O fundamentals | “Advanced Programming in the UNIX Environment” by Stevens & Rago | Ch. 3: File I/O |
| POSIX process control | “Advanced Programming in the UNIX Environment” by Stevens & Rago | Ch. 8: Process Control |
| Windows file and process APIs | “Windows System Programming” by Johnson Hart | Ch. 2-3: File I/O, Processes |
| Windows vs POSIX differences | “Windows Via C/C++” by Jeffrey Richter | Ch. 4: Processes; Ch. 10: Thread Synchronization |
| Platform detection and portability | “21st Century C” by Ben Klemens | Ch. 2: Debugging, Testing, Documenting |
| Error handling across platforms | “The Linux Programming Interface” by Michael Kerrisk | Ch. 3: System Programming Concepts |
| ABI and struct layout | “Computer Systems: A Programmer’s Perspective” by Bryant & O’Hallaron | Ch. 3.9: Heterogeneous Data Structures |
| Calling conventions | “Computer Systems: A Programmer’s Perspective” by Bryant & O’Hallaron | Ch. 3.7: Procedures |
| Symbol visibility and linking | “Advanced C and C++ Compiling” by Milan Stevanovic | Ch. 7: Designing Dynamic Libraries |
| Thread-local storage | “Programming with POSIX Threads” by David Butenhof | Ch. 5.4: Thread-Specific Data |
| Memory-mapped files (POSIX) | “The Linux Programming Interface” by Michael Kerrisk | Ch. 49: Memory Mappings |
| Cross-platform design patterns | “C Interfaces and Implementations” by David Hanson | Throughout (design philosophy) |
| libuv design and internals | “An Introduction to libuv” by Nikhil Marathe | Entire book (free online) |
| Practical portability techniques | “Practical C Programming” by Steve Oualline | Ch. 19: Portability Problems |
Project 5: High-Performance String Search Library
- File: PHASE_2_TRACK_B_SYSTEMS_LIBRARIES_PROJECTS.md
- Programming Language: C (with SIMD Intrinsics)
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 4. The “Open Core” Infrastructure
- Difficulty: Level 4: Expert
- Knowledge Area: Algorithms / Low-Level Optimization
- Software or Tool: SIMD / Vectorization
- Main Book: “Modern X86 Assembly Language Programming” by Daniel Kusswurm
What you’ll build: A fast substring search library with SIMD acceleration, similar to what powers ripgrep’s core.
Why it teaches performance tuning: You’ll learn that the algorithm textbooks teach (KMP, Boyer-Moore) isn’t what fast tools use. You’ll discover SIMD, understand why memory access patterns matter more than Big-O for real data, and learn to profile before optimizing.
Core challenges you’ll face:
- Implementing SIMD-accelerated search (maps to performance tuning)
- Handling different string encodings (UTF-8 awareness) (maps to real-world correctness)
- Designing an API that’s both fast and safe (maps to API design)
- Avoiding undefined behavior with pointer arithmetic (maps to UB avoidance)
- Beating `strstr` and `memmem` in benchmarks (maps to measurable outcome)
Key Concepts:
- SIMD fundamentals: “Modern X86 Assembly Language Programming” by Daniel Kusswurm - SIMD chapters
- Fast string search algorithms: Andrew Gallant’s (BurntSushi) blog posts on ripgrep’s internals
- Cache-aware programming: “What Every Programmer Should Know About Memory” by Ulrich Drepper
- UTF-8 handling: “UTF-8 Everywhere” manifesto and ripgrep’s encoding handling
Resources for SIMD string search challenge:
- “Hyperscan” Intel paper - How regex engines use SIMD for literal matching
- Difficulty: Advanced
- Time estimate: 2-3 weeks
- Prerequisites: C, basic understanding of CPU architecture
Real world outcome:
- A command-line tool: `./fastsearch "pattern" largefile.txt`
- Benchmarks showing 3-10x speedup over naive search on large files
- Flame graphs showing where time is actually spent
Learning milestones:
- Basic SIMD search working - You understand vector instructions and intrinsics
- UTF-8 handled correctly - You grasp why byte-level search needs encoding awareness
- Consistently faster than stdlib - You’ve profiled, found bottlenecks, and optimized the right things
Real World Outcome
When you complete this project, you’ll have built a high-performance string search library that rivals the core of tools like ripgrep, ag (the silver searcher), and grep. This is the kind of code that powers developer tools used by millions.
What you’ll actually see:
A command-line tool that searches files blazingly fast, with benchmarks proving it:
# Build your library with SIMD support
$ gcc -O3 -march=native -mavx2 -o fastsearch fastsearch.c search_simd.c
$ ./fastsearch --version
fastsearch 1.0 - SIMD-accelerated string search
Detected CPU features: SSE4.2, AVX2, BMI2
Using: AVX2 vectorized search (256-bit vectors)
# Search a large file (100MB log file)
$ time ./fastsearch "ERROR" /var/log/huge.log
[Line 1247] 2025-12-27 ERROR: Connection timeout
[Line 5893] 2025-12-27 ERROR: Database unreachable
[Line 8312] 2025-12-27 ERROR: Memory allocation failed
... (147 matches found)
real 0m0.143s
user 0m0.135s
sys 0m0.008s
# Compare with standard grep
$ time grep "ERROR" /var/log/huge.log
... (same 147 matches)
real 0m1.521s
user 0m1.489s
sys 0m0.031s
# Your SIMD version is 10.6x faster!
Benchmark output showing performance gains:
$ ./bench_search
Running benchmarks on 100MB test corpus...
Algorithm | Throughput | Speedup
------------------------|-----------------|----------
Naive search | 142 MB/s | 1.0x
libc strstr() | 385 MB/s | 2.7x
Two-way algorithm | 521 MB/s | 3.7x
Boyer-Moore | 447 MB/s | 3.1x
SIMD SSE4.2 (your impl) | 1,234 MB/s | 8.7x
SIMD AVX2 (your impl) | 2,847 MB/s | 20.0x
SIMD AVX-512 (your impl)| 4,921 MB/s | 34.6x
Pattern length tests (searching for "the"):
- 3-byte pattern: 4,921 MB/s (AVX-512)
- 10-byte pattern: 3,215 MB/s (AVX-512)
- 50-byte pattern: 1,892 MB/s (falls back to Boyer-Moore)
UTF-8 correctness test: ✓ PASSED (all 1,000 multibyte patterns)
Your library’s API:
// search.h - Public API
#include <stddef.h>
// Initialize search engine (detects CPU features)
void search_init(void);
// Search for needle in haystack
// Returns: pointer to first occurrence, or NULL
const char* search_find(const char* haystack, size_t haystack_len,
const char* needle, size_t needle_len);
// Search with options
typedef struct {
int case_insensitive;
int utf8_mode;
int use_simd; // Auto-detect if not specified
} search_opts_t;
const char* search_find_opts(const char* haystack, size_t haystack_len,
const char* needle, size_t needle_len,
const search_opts_t* opts);
// Get detected CPU features
const char* search_get_features(void);
What the SIMD code looks like under the hood:
// search_avx2.c - AVX2 implementation
#include <immintrin.h>
const char* search_avx2(const char* haystack, size_t len,
const char* needle, size_t needle_len) {
// Load first character of needle into all 32 bytes of a 256-bit register
__m256i first_char = _mm256_set1_epi8(needle[0]);
size_t i = 0;
// Process 32 bytes at a time
while (i + 32 <= len) {
// Load 32 bytes from haystack
__m256i block = _mm256_loadu_si256((__m256i*)(haystack + i));
// Compare all 32 bytes in parallel
__m256i cmp = _mm256_cmpeq_epi8(block, first_char);
// Convert comparison result to bitmask
int mask = _mm256_movemask_epi8(cmp);
// Check each set bit (potential match)
while (mask != 0) {
int pos = __builtin_ctz(mask); // Count trailing zeros
// Verify full match at this position (bounds check first so
// memcmp never reads past the end of the haystack)
if (i + pos + needle_len <= len &&
    memcmp(haystack + i + pos, needle, needle_len) == 0) {
return haystack + i + pos;
}
mask &= mask - 1; // Clear lowest set bit
}
i += 32;
}
// Handle remaining bytes with scalar code
// ...
}
Profiling output showing where time is spent:
$ perf record -g ./fastsearch "pattern" largefile.txt
$ perf report
# Samples: 10K of event 'cycles'
# Overhead Command Shared Object Symbol
# ........ ........... .................. .........................
47.23% fastsearch fastsearch [.] search_avx2
12.45% fastsearch [kernel.kallsyms] [k] page_fault
8.92% fastsearch libc-2.31.so [.] __memcmp_avx2
7.31% fastsearch fastsearch [.] utf8_validate
3.24% fastsearch fastsearch [.] search_init
...
# Notice: Most time in your SIMD code, not I/O or system calls!
Flame graph visualization:
fastsearch (100%)
├─ search_avx2 (47.2%) ← Your hot path
│ ├─ _mm256_loadu_si256 (21.3%)
│ ├─ _mm256_cmpeq_epi8 (15.8%)
│ └─ __builtin_ctz (10.1%)
├─ page_fault (12.4%) ← Kernel memory management
├─ __memcmp_avx2 (8.9%) ← Verifying matches
└─ utf8_validate (7.3%) ← Character encoding
The concrete artifacts you’ll deliver:
- Core library: `search.c`, `search_avx2.c`, `search_sse42.c`, `search_scalar.c`
- CPU feature detection: Runtime dispatch to fastest available implementation
- Benchmark suite: Comparing against glibc, ripgrep internals, naive algorithms
- CLI tool: `fastsearch` command-line utility
- Test suite: 500+ tests including edge cases, UTF-8, performance regressions
- Documentation: Explaining when to use SIMD vs traditional algorithms
Real-world integration example:
Someone could use your library in their text editor:
// text_editor.c
#include "search.h"
void highlight_search_results(const char* buffer, size_t len,
const char* query) {
const char* pos = buffer;
size_t remaining = len;
while (remaining > 0) {
const char* match = search_find(pos, remaining,
query, strlen(query));
if (!match) break;
// Highlight this match in the editor
highlight_text(match - buffer, strlen(query));
size_t offset = (match - pos) + strlen(query);
pos += offset;
remaining -= offset;
}
}
You’ll have tangible proof of understanding:
- How modern CPUs process data in parallel
- Why memory access patterns matter more than algorithmic complexity for real data
- How to write portable SIMD code that degrades gracefully
- How to profile and optimize at the assembly level
The Core Question You’re Answering
“Why are tools like ripgrep so much faster than grep? What makes string search fast at the hardware level?”
Everyone learns Boyer-Moore and KMP in algorithms class, but those aren’t what make modern search tools fast. The real question is:
- What is SIMD and why does it matter? How can you search 32 bytes in the same time it takes to search 1 byte?
- Why does memory matter more than Big-O? A cache-friendly O(n) beats a cache-hostile O(log n)
- How do you write code that works on different CPUs? SSE, AVX2, AVX-512, NEON—how do you support them all?
After this project, you’ll understand that performance isn’t about clever algorithms—it’s about understanding your hardware and feeding it data the way it wants to consume it.
Concepts You Must Understand First
Stop and research these before coding:
- What is SIMD? (Single Instruction, Multiple Data)
- How does a CPU execute the same operation on multiple data elements in parallel?
- What is a vector register? (128-bit XMM, 256-bit YMM, 512-bit ZMM)
- Why can you compare 32 bytes in one instruction instead of looping 32 times?
- What’s the difference between SSE, AVX, and AVX-512?
- Book Reference: “Modern X86 Assembly Language Programming” Ch. 5-6 - Kusswurm
- Book Reference: “Computer Systems: A Programmer’s Perspective” Ch. 5.8 - Bryant & O’Hallaron
- CPU Intrinsics vs Assembly
- What are intrinsics? (C functions that map directly to assembly instructions)
- Why use intrinsics instead of inline assembly?
- How do you use `_mm256_loadu_si256()` and `_mm256_cmpeq_epi8()`?
  - What does “unaligned load” mean and why does it matter?
- Book Reference: “Modern X86 Assembly Language Programming” Ch. 7 - Kusswurm
- String Search Algorithms
- Naive search: O(nm) but simple and cache-friendly for short patterns
- Two-way algorithm: What glibc uses in `strstr()` (linear time, constant space)
  - Boyer-Moore: Skip ahead based on mismatches (good for long patterns)
- SIMD-accelerated search: Why it dominates for typical workloads
- Book Reference: “Algorithms” by Sedgewick & Wayne - Ch. 5.3: Substring Search
- Resource: Andrew Gallant’s blog posts on ripgrep’s SIMD usage
- Cache-Aware Programming
- What is a cache line? (Typically 64 bytes)
- Why does sequential access beat random access by 100x?
- What is prefetching and how does the CPU do it automatically?
- How does SIMD help with cache efficiency?
- Book Reference: “Computer Systems: A Programmer’s Perspective” Ch. 6 - Bryant & O’Hallaron
- Resource: “What Every Programmer Should Know About Memory” - Ulrich Drepper
- UTF-8 Encoding
- Why can’t you treat UTF-8 as just bytes? (Multi-byte characters!)
- How do you detect if you’re in the middle of a multi-byte sequence?
- What happens if you report a match that splits a character?
- How does ripgrep handle this?
- Resource: “UTF-8 Everywhere” manifesto
- Book Reference: “The Linux Programming Interface” Ch. 10.2 - Kerrisk
- CPU Feature Detection
- How do you detect if the CPU supports AVX2 at runtime?
- What is the `CPUID` instruction?
  - How do you write code that falls back to SSE or scalar if AVX2 is unavailable?
- What are function pointers used for in feature dispatch?
- Book Reference: “Modern X86 Assembly Language Programming” Ch. 3 - Kusswurm
- Memory Alignment
- What does “16-byte aligned” mean?
- Why do aligned loads (`_mm256_load_si256`) require alignment?
  - When should you use unaligned loads (`_mm256_loadu_si256`)?
- Book Reference: “Computer Systems: A Programmer’s Perspective” Ch. 3.9.3 - Bryant & O’Hallaron
- Profiling and Benchmarking
- How do you use `perf` to see where CPU time is spent?
  - What is cache miss rate and how do you measure it?
  - How do you generate flame graphs?
  - What’s the difference between wall-clock time and CPU time?
  - Book Reference: “Systems Performance” Ch. 6 - Brendan Gregg
  - Resource: `perf` documentation and Brendan Gregg’s website
Questions to Guide Your Design
Before implementing, think through these:
- Algorithm Selection
- When should you use SIMD vs Boyer-Moore vs naive search?
- How does pattern length affect the best algorithm choice?
- Should you always use the fastest SIMD available?
- Handling Different CPUs
- How do you support CPUs without AVX2? (Laptops, older servers)
- Should you compile multiple versions and choose at runtime?
- What’s your fallback strategy: SSE4.2 → SSE2 → scalar?
- API Design
- Should your API look like `strstr()` (return pointer)?
  - Or like `memmem()` (take explicit lengths)?
  - How do you handle case-insensitive search with SIMD?
- UTF-8 Correctness
- Should you validate UTF-8 before searching?
- What if the haystack has invalid UTF-8?
- How do you ensure you don’t split multi-byte characters?
- Performance Measurement
- How do you benchmark fairly? (Avoid measuring I/O, focus on search)
- What size files should you test on? (Fit in L3 cache? Larger?)
- How do you avoid compiler optimizing away your benchmark?
- Memory Access Patterns
- Should you process the entire file linearly?
- Would memory-mapped I/O help or hurt?
- How do you minimize cache misses?
Thinking Exercise
Trace a SIMD Search By Hand
Before writing SIMD code, trace this algorithm on paper:
Searching for “cat” in “the cat sat”:
Pattern: "cat" (needle_len = 3)
Haystack: "the cat sat" (len = 11)
1. Load first char of pattern ('c') into 256-bit register:
['c','c','c','c','c','c','c','c', ...] (32 copies)
2. Load 32 bytes from haystack (pad with zeros if needed):
['t','h','e',' ','c','a','t',' ','s','a','t','\0',0,0,0,...]
3. Compare all 32 bytes in parallel:
[0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,...] (1 = match)
4. Convert to bitmask:
0b00000000000000000000000000010000 = 0x10 (bit 4 is set)
5. Find position of set bit:
pos = 4
6. Verify full match at haystack[4:7]:
haystack[4:7] = "cat" ✓ MATCH!
Questions while tracing:
- What if there are multiple ‘c’s in the 32-byte block?
- How do you handle the case where the pattern spans two blocks?
- What’s the worst-case scenario for this algorithm?
- Why is this faster than checking each byte individually?
Now consider edge cases:
Pattern: "café" (UTF-8: 'c','a','f',0xC3,0xA9)
Haystack: "the café sat"
- The 'é' is 2 bytes in UTF-8
- What happens if your 32-byte block ends between those bytes?
- How do you ensure correctness?
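A minimal sketch of the boundary check this exercise is pushing you toward: a byte-level hit only counts if it neither starts nor ends inside a multi-byte character (UTF-8 continuation bytes look like `10xxxxxx`):

```c
#include <stddef.h>

static int is_utf8_continuation(unsigned char b) {
    return (b & 0xC0) == 0x80;   /* 10xxxxxx */
}

/* Returns 1 if the match at offset pos starts and ends on UTF-8
 * character boundaries within hay. */
static int match_on_char_boundaries(const unsigned char* hay, size_t h_len,
                                    size_t pos, size_t n_len) {
    if (is_utf8_continuation(hay[pos]))
        return 0;                           /* starts mid-character */
    size_t end = pos + n_len;
    if (end < h_len && is_utf8_continuation(hay[end]))
        return 0;                           /* ends mid-character */
    return 1;
}
```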
The Interview Questions They’ll Ask
Prepare to answer these:
- “What is SIMD and how does it make string search faster?”
- Answer: Process multiple bytes in parallel using vector instructions
- Example: AVX2 can compare 32 bytes in one instruction
- “Why doesn’t everyone just use SIMD for everything?”
- Answer: Not all algorithms vectorize well; SIMD has overhead for setup
- Useful when: processing large amounts of data with same operation
- “How do you handle CPUs that don’t support AVX2?”
- Answer: Runtime CPU feature detection with fallback implementations
- Use function pointers or `#ifdef` to compile multiple versions
- “What’s the difference between aligned and unaligned loads?”
- Answer: Aligned requires address divisible by vector size, faster
- Unaligned works anywhere but may cross cache lines (slower)
- “How does UTF-8 complicate string search?”
- Answer: Can’t split multi-byte characters; byte-level match != character match
- Solution: Validate match boundaries or track character boundaries
- “Why is cache locality important for search performance?”
- Answer: RAM is 100x slower than L1 cache; sequential access enables prefetching
- SIMD helps by processing more data per cache line fetched
- “How would you profile a performance regression?”
- Answer: Use `perf` to compare before/after; look for cache misses and branch mispredicts
  - Generate flame graphs to see where time is spent
- “When would Boyer-Moore beat SIMD search?”
- Answer: Very long patterns where skip distance is large
- SIMD excels at short patterns (1-16 bytes) in modern workloads
Hints in Layers
Hint 1: Start Without SIMD
Get the API and testing framework right first:
// Naive implementation
const char* search_naive(const char* haystack, size_t h_len,
                         const char* needle, size_t n_len) {
    if (n_len == 0) return haystack;  // match strstr() semantics
    if (n_len > h_len) return NULL;   // h_len - n_len would underflow!
    for (size_t i = 0; i <= h_len - n_len; i++) {
        if (memcmp(haystack + i, needle, n_len) == 0) {
            return haystack + i;
        }
    }
    return NULL;
}
Make sure this works correctly before optimizing!
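A quick harness for the edge cases that bite later, assuming the `search_naive()` above; run it before touching any intrinsics:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

static void test_search(void) {
    const char* hay = "the cat sat";
    assert(search_naive(hay, strlen(hay), "cat", 3) == hay + 4);
    assert(search_naive(hay, strlen(hay), "dog", 3) == NULL);
    assert(search_naive(hay, strlen(hay), "sat", 3) == hay + 8); /* at the end */
    assert(search_naive(hay, 3, "the cat", 7) == NULL); /* needle > haystack */
}
```

Keep the same harness as you add SIMD backends; every implementation must pass identical tests.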
Hint 2: Detect CPU Features
Use `__builtin_cpu_supports()` on GCC/Clang:
// Function pointer selected once at startup
static const char* (*search_impl)(const char*, size_t,
                                  const char*, size_t);
void search_init(void) {
#if defined(__x86_64__) || defined(_M_X64)
if (__builtin_cpu_supports("avx512f")) {
search_impl = search_avx512;
} else if (__builtin_cpu_supports("avx2")) {
search_impl = search_avx2;
} else if (__builtin_cpu_supports("sse4.2")) {
search_impl = search_sse42;
} else {
search_impl = search_naive;
}
#else
search_impl = search_naive;
#endif
}
Hint 3: Start with SSE 4.2 (Simpler than AVX2)
SSE4.2 has `_mm_cmpestri`, whose `_SIDD_CMP_EQUAL_ORDERED` mode does ordered substring matching directly:
#include <nmmintrin.h> // SSE4.2
#include <string.h>
const char* search_sse42(const char* haystack, size_t h_len,
                         const char* needle, size_t n_len) {
    // cmpestri works on needles up to 16 bytes; note the loadu below
    // reads a full 16 bytes from the needle buffer
    if (n_len == 0 || n_len > 16 || n_len > h_len) return NULL;
    __m128i needle_vec = _mm_loadu_si128((const __m128i*)needle);
    size_t i = 0;
    while (i + 16 <= h_len) {
        __m128i hay_vec = _mm_loadu_si128((const __m128i*)(haystack + i));
        // EQUAL_ORDERED reports the first offset where the needle begins
        // to match, including partial matches at the end of the block
        int idx = _mm_cmpestri(needle_vec, (int)n_len, hay_vec, 16,
                               _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_ORDERED |
                               _SIDD_LEAST_SIGNIFICANT);
        if (idx == 16) { i += 16; continue; }  // no candidate in this block
        // Verify the candidate (it may have been cut off by the block)
        if (i + idx + n_len <= h_len &&
            memcmp(haystack + i + idx, needle, n_len) == 0) {
            return haystack + i + idx;
        }
        i += idx + 1;  // resume just past the failed candidate
    }
    return NULL;  // (a real version also scans the final partial block)
}
Hint 4: Benchmark Against glibc
See how you compare to the system implementation:
#include <stdio.h>
#include <string.h>
#include <time.h>
void benchmark(const char* haystack, size_t h_len,
               const char* needle, size_t n_len) {
    clock_t start, end;
    // Store results in a volatile sink so the compiler can't
    // optimize the search calls away
    volatile const char* sink;
    // Test glibc (strstr needs NUL-terminated strings)
    start = clock();
    for (int i = 0; i < 10000; i++) {
        sink = strstr(haystack, needle);
    }
    end = clock();
    printf("glibc strstr: %.3f ms\n",
           (double)(end - start) / CLOCKS_PER_SEC * 1000);
    // Test your implementation
    start = clock();
    for (int i = 0; i < 10000; i++) {
        sink = search_find(haystack, h_len, needle, n_len);
    }
    end = clock();
    printf("Your implementation: %.3f ms\n",
           (double)(end - start) / CLOCKS_PER_SEC * 1000);
    (void)sink;
}
Hint 5: Use Compiler Explorer
See what assembly your SIMD code generates:
# Paste your code at https://godbolt.org/
# Use compiler: gcc -O3 -mavx2
# You'll see the actual vector instructions generated
This helps you verify the compiler is doing what you expect.
Hint 6: Handle the “Tail” Carefully
What happens when the haystack length isn’t divisible by 32?
// Process main body with SIMD
size_t simd_len = (h_len / 32) * 32;
const char* result = search_avx2_body(haystack, simd_len, needle, n_len);
if (result) return result;
// Handle the tail with scalar code, overlapping by n_len - 1 bytes
// so a match straddling the SIMD/scalar boundary isn't missed
size_t tail = (simd_len >= n_len - 1) ? simd_len - (n_len - 1) : 0;
return search_naive(haystack + tail, h_len - tail, needle, n_len);
Hint 7: Study ripgrep’s Source
See how the pros do it:
git clone https://github.com/BurntSushi/ripgrep.git
cd ripgrep
# Look at the memchr crate (ripgrep's SIMD substring primitives)
# and the aho-corasick crate for multi-pattern matching
Pay attention to:
- How they detect CPU features
- How they fall back gracefully
- How they handle UTF-8
Hint 8: Profile with perf
See where your CPU time is really going:
# Record performance data
perf record -g ./fastsearch "pattern" file.txt
# View report
perf report
# Look for cache misses
perf stat -e cache-misses,cache-references ./fastsearch "pattern" file.txt
This shows you what to optimize next.
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| SIMD fundamentals and intrinsics | “Modern X86 Assembly Language Programming” by Daniel Kusswurm | Ch. 5-7: AVX Programming |
| Cache-aware programming | “Computer Systems: A Programmer’s Perspective” by Bryant & O’Hallaron | Ch. 6: The Memory Hierarchy |
| String search algorithms | “Algorithms” by Sedgewick & Wayne | Ch. 5.3: Substring Search |
| Understanding CPU performance | “Computer Systems: A Programmer’s Perspective” by Bryant & O’Hallaron | Ch. 5: Optimizing Program Performance |
| Profiling and optimization | “Systems Performance” by Brendan Gregg | Ch. 6: CPUs |
| UTF-8 and character encoding | “The Linux Programming Interface” by Michael Kerrisk | Ch. 10: Internationalization |
| Advanced SIMD techniques | “Modern Parallel Programming with C++ and Assembly” by Daniel Kusswurm | Ch. 8-10: SIMD String Processing |
| Practical optimization | “Write Great Code, Volume 2” by Randall Hyde | Ch. 7: Thinking Low-Level |
| Memory alignment | “Computer Systems: A Programmer’s Perspective” by Bryant & O’Hallaron | Ch. 3.9: Heterogeneous Data Structures |
| Benchmarking methodology | “Systems Performance” by Brendan Gregg | Ch. 2: Methodology |
Essential Online Resources:
| Topic | Resource |
|---|---|
| Intel intrinsics reference | Intel Intrinsics Guide (software.intel.com/intrinsics) |
| SIMD string search | Andrew Gallant’s blog on ripgrep SIMD (blog.burntsushi.net) |
| SIMD algorithms | “SIMD-friendly algorithms for substring searching” by Wojciech Muła |
| CPU feature detection | GCC documentation on __builtin_cpu_supports |
| UTF-8 specification | “UTF-8 Everywhere” manifesto (utf8everywhere.org) |
| Performance analysis | Brendan Gregg’s performance website (brendangregg.com) |
Project Comparison Table
| Project | Difficulty | Time | Depth of Understanding | Fun Factor | Most Teaches |
|---|---|---|---|---|---|
| Memory Allocator | Advanced | 3-4 weeks | ★★★★★ | ★★★ | Memory, UB, Performance |
| Work-Stealing Thread Pool | Advanced | 2-3 weeks | ★★★★★ | ★★★★ | Threading, Atomics, Cache |
| Mini Async Runtime | Advanced | 2-3 weeks | ★★★★ | ★★★★★ | Async, Platform IO, API |
| Syscall Abstraction | Intermediate | 2 weeks | ★★★ | ★★★ | ABI, Platform, API |
| String Search Library | Advanced | 2-3 weeks | ★★★★ | ★★★★ | SIMD, Performance, Profiling |
Recommended Learning Order
Given this track’s emphasis on job-market value and OSS-heavy hiring signals, here’s the recommended order:
Start with: Mini Async Runtime
- High demand skill (Tokio, libuv, Node.js internals)
- You’ll build something visibly impressive (C10K server)
- Teaches platform differences naturally (epoll vs kqueue)
- Foundation for understanding modern networking stacks
Then: Work-Stealing Thread Pool
- Complements async knowledge (CPU-bound vs IO-bound)
- Threading + atomics is interview gold for systems roles
- Shows you understand parallelism at a deep level
Finally: Memory Allocator
- The “boss fight” of systems programming
- Forces you to synthesize everything: performance, correctness, threading
- Having “wrote a malloc” on your resume/portfolio is a strong signal
Final Capstone Project: Embedded Key-Value Database
What you’ll build: A persistent, thread-safe, embedded key-value store like a simplified RocksDB or LMDB—combining everything from above.
Why it teaches everything: This project is the “final boss” because it requires:
- Memory allocators: Custom allocators for the buffer pool
- Threading: Concurrent readers/writers with proper synchronization
- Async: Optional async API for non-blocking operations
- ABI: Stable on-disk format across versions
- Platform: Memory-mapped files work differently on each OS
- Performance: You’ll profile compaction, cache hit rates, write amplification
- UB avoidance: Memory-mapped IO is a UB minefield
- API design: Making it usable for application developers
Core challenges you’ll face:
- Implementing a log-structured merge tree or B-tree (maps to data structure design)
- Memory-mapping files safely across platforms (maps to platform differences)
- Write-ahead logging for durability (maps to crash consistency)
- Concurrent access without global locks (maps to threading primitives)
- Compaction without blocking readers (maps to async/background work)
- Benchmarking against SQLite/RocksDB (maps to performance tuning)
Key Concepts:
- LSM trees: “Designing Data-Intensive Applications” by Martin Kleppmann - Chapter 3
- Memory-mapped IO: “The Linux Programming Interface” by Kerrisk - Chapter 49
- Crash consistency: “Operating Systems: Three Easy Pieces” - Crash Consistency chapter
- B-tree implementation: “Algorithms” by Sedgewick & Wayne - Chapter 6
- Concurrent data structures: “The Art of Multiprocessor Programming” by Herlihy & Shavit
- Difficulty: Expert
- Time estimate: 1-2 months
- Prerequisites: All previous projects, or equivalent experience
Real world outcome:
- A library: `db_open()`, `db_get()`, `db_put()`, `db_close()`
- A benchmark suite comparing ops/sec against SQLite and RocksDB
- A crash test that kills the process mid-write and verifies data integrity on restart
- Potentially: others using it in their projects (OSS signal)
Learning milestones:
- Basic persistence working - You understand write-ahead logging and fsync semantics
- Concurrent access safe - You’ve implemented MVCC or reader-writer locks correctly
- Performance competitive - You’ve profiled, optimized, and understand why RocksDB makes certain tradeoffs
- Crash-safe verified - You can kill -9 your DB at any point and recover correctly
Real World Outcome
When you complete this project, you’ll have built a production-quality embedded database similar to SQLite, LMDB, or RocksDB. This is the kind of infrastructure that powers applications from mobile apps to distributed systems.
What you’ll actually see:
A working database library with a simple API that real applications can embed:
# Build your database library
$ make
gcc -O3 -Wall -Wextra -o libkvdb.so kvdb.c lsm.c wal.c memtable.c -shared -fPIC
gcc -o kvdb_cli cli.c -L. -lkvdb
# Use the command-line interface
$ ./kvdb_cli mydata.db
kvdb> open mydata.db
Database opened: mydata.db
LSM tree levels: 0
Active memtable: 0 entries
Write-ahead log: 0 bytes
kvdb> put user:1 '{"name":"Alice","age":30}'
OK (wrote 34 bytes)
kvdb> put user:2 '{"name":"Bob","age":25}'
OK (wrote 32 bytes)
kvdb> get user:1
{"name":"Alice","age":30}
kvdb> scan user:
user:1 -> {"name":"Alice","age":30}
user:2 -> {"name":"Bob","age":25}
(2 keys scanned)
kvdb> stats
Database statistics:
Total keys: 2
Memtable entries: 2
L0 SSTables: 0
Total disk usage: 0 bytes
WAL size: 78 bytes
Cache hit rate: 100.0%
kvdb> compact
Starting compaction...
Flushed memtable to L0 SSTable (2 entries, 1.2 KB)
Compaction complete (12.3 ms)
kvdb> quit
Closing database...
Flushing memtable...
Syncing WAL...
Database closed successfully.
Crash recovery in action:
# Write some data
$ ./kvdb_cli test.db
kvdb> put key1 value1
OK
kvdb> put key2 value2
OK
# Simulate crash (kill -9 in another terminal)
$ kill -9 $(pgrep kvdb_cli)
# Restart and verify data is intact
$ ./kvdb_cli test.db
Recovering from WAL... (2 entries replayed)
Database opened: test.db
kvdb> get key1
value1
kvdb> get key2
value2
# Data survived the crash!
Your library’s API:
// kvdb.h - Public API
#include <stddef.h>
#include <stdint.h>
// Database handle
typedef struct kvdb kvdb_t;
// Open or create a database
kvdb_t* kvdb_open(const char* path);
// Write operations
int kvdb_put(kvdb_t* db, const void* key, size_t key_len,
const void* value, size_t value_len);
int kvdb_delete(kvdb_t* db, const void* key, size_t key_len);
// Read operations
int kvdb_get(kvdb_t* db, const void* key, size_t key_len,
void** value, size_t* value_len);
// Iteration
typedef struct kvdb_iterator kvdb_iter_t;
kvdb_iter_t* kvdb_scan(kvdb_t* db, const void* start_key, size_t key_len);
int kvdb_iter_next(kvdb_iter_t* iter);
const char* kvdb_iter_key(kvdb_iter_t* iter); // used by the example below
void kvdb_iter_free(kvdb_iter_t* iter);
// Maintenance
int kvdb_compact(kvdb_t* db);
int kvdb_flush(kvdb_t* db);
// Close database
void kvdb_close(kvdb_t* db);
A real application using your database:
// user_service.c
#include "kvdb.h"
#include <stdio.h>
#include <stdlib.h> // free()
#include <string.h>
int main() {
// Open database
kvdb_t* db = kvdb_open("users.db");
if (!db) {
fprintf(stderr, "Failed to open database\n");
return 1;
}
// Store user data
const char* user_id = "user:alice";
const char* user_data = "{\"email\":\"alice@example.com\",\"verified\":true}";
if (kvdb_put(db, user_id, strlen(user_id),
user_data, strlen(user_data)) != 0) {
fprintf(stderr, "Failed to write user\n");
return 1;
}
// Retrieve user data
void* retrieved_data;
size_t retrieved_len;
if (kvdb_get(db, user_id, strlen(user_id),
&retrieved_data, &retrieved_len) == 0) {
printf("User: %.*s\n", (int)retrieved_len, (char*)retrieved_data);
free(retrieved_data);
}
// Scan all users
kvdb_iter_t* iter = kvdb_scan(db, "user:", 5);
while (kvdb_iter_next(iter) == 0) {
printf("Key: %s\n", kvdb_iter_key(iter));
}
kvdb_iter_free(iter);
kvdb_close(db);
return 0;
}
Benchmark output comparing to established databases:
$ ./bench_kvdb
Running benchmarks (1,000,000 operations each)...
=== Sequential Writes ===
kvdb (your implementation): 147,239 ops/sec
SQLite (default settings): 85,123 ops/sec
RocksDB: 412,847 ops/sec
LMDB: 198,234 ops/sec
=== Random Reads (100% cache hit) ===
kvdb (your implementation): 892,341 ops/sec
SQLite: 324,128 ops/sec
RocksDB: 687,492 ops/sec
LMDB: 1,234,567 ops/sec
=== Random Reads (0% cache hit) ===
kvdb (your implementation): 45,123 ops/sec
SQLite: 38,492 ops/sec
RocksDB: 52,847 ops/sec
LMDB: 89,234 ops/sec
=== Mixed Workload (50% read, 50% write) ===
kvdb (your implementation): 78,492 ops/sec
SQLite: 42,387 ops/sec
RocksDB: 123,847 ops/sec
LMDB: 95,234 ops/sec
Your database is competitive!
What the LSM tree structure looks like:
$ ./kvdb_inspect mydata.db
Database structure:
Write-Ahead Log (WAL):
Size: 2.4 MB
Entries: 12,847
Oldest sequence: 1
Newest sequence: 12847
Memtable (in-memory):
Entries: 1,247
Size: 156 KB
Threshold for flush: 4 MB
LSM Tree on Disk:
Level 0: 3 SSTables (8.2 MB, 41,234 keys)
- sstable_00001.sst: 2.7 MB (13,847 keys)
- sstable_00002.sst: 2.8 MB (14,123 keys)
- sstable_00003.sst: 2.7 MB (13,264 keys)
Level 1: 1 SSTable (23.4 MB, 120,487 keys)
- sstable_00004.sst: 23.4 MB (120,487 keys)
Level 2: 0 SSTables
Total keys: 162,968
Total disk usage: 31.6 MB
Compression ratio: 2.3x
Visualizing how the LSM tree works:
Time 0: Empty database
┌─────────────┐
│ Memtable │ (empty)
└─────────────┘
Time 1: Write 100 keys
┌─────────────┐
│ Memtable │ ← 100 keys in memory
└─────────────┘
Time 2: Memtable full, flush to L0
┌─────────────┐
│ Memtable │ (empty)
└─────────────┘
↓
┌─────────────┐
│ L0 SST #1 │ ← 100 keys on disk
└─────────────┘
Time 3: Multiple flushes
┌─────────────┐
│ Memtable │ ← New writes
└─────────────┘
↓
┌─────────────┬─────────────┬─────────────┐
│ L0 SST #1 │ L0 SST #2 │ L0 SST #3 │ ← Overlapping ranges
└─────────────┴─────────────┴─────────────┘
Time 4: Compaction (merge L0 → L1)
┌─────────────┐
│ Memtable │
└─────────────┘
↓
┌─────────────┐
│ L0 (empty) │
└─────────────┘
↓
┌────────────────────────────┐
│ L1 SST (merged, sorted) │ ← Single sorted file
└────────────────────────────┘
Crash recovery visualization:
Normal operation:
Write → Memtable ──────────→ Eventually flushed to disk
│
└─→ WAL (immediate) ← Persisted with fsync()
After crash:
1. Read WAL from disk
2. Replay all entries into new memtable
3. Continue normal operation
Example:
WAL contains:
[PUT key1 val1]
[PUT key2 val2]
[DELETE key1]
After replay:
Memtable = { key2: val2 }
The concrete artifacts you’ll deliver:
- Core library: `kvdb.c`, `lsm.c`, `memtable.c`, `sstable.c`, `wal.c`, `compaction.c`
- On-disk format: Binary SSTable format with bloom filters and index blocks
- CLI tool: Interactive database shell for testing
- Benchmark suite: Comparing write throughput, read latency, crash recovery time
- Test suite: 200+ tests including crash simulation, concurrent access, corruption detection
- Documentation: Architecture overview, API reference, performance tuning guide
Demonstrating crash safety:
$ ./crash_test.sh
Starting crash safety test...
Test 1: Kill during write
- Writing 1000 keys...
- Killing process at random time... (killed)
- Restarting and verifying... ✓ All committed keys recovered
Test 2: Kill during compaction
- Writing 10000 keys...
- Starting compaction...
- Killing process mid-compaction... (killed)
- Restarting and verifying... ✓ Database consistent
Test 3: Power failure simulation (no fsync)
- Writing 1000 keys with sync disabled...
- Simulating power loss... (killed)
- Restarting... ⚠ Lost 847 keys (expected without fsync)
Test 4: Concurrent writers
- Starting 10 writer threads...
- Writing 100 keys per thread...
- Killing process randomly... (killed)
- Restarting and verifying... ✓ No corruption detected
Crash safety: PASSED (4/4 tests)
You’ll have tangible proof of understanding:
- How databases persist data reliably
- Why write-ahead logging is fundamental to durability
- How LSM trees trade write amplification for write throughput
- How to design concurrent data structures
- How to profile and optimize I/O-bound workloads
The Core Question You’re Answering
“How do databases like SQLite and RocksDB guarantee your data won’t be lost, even if the power goes out mid-write?”
Everyone uses databases, but few understand the deep engineering that makes them reliable. The real questions are:
- What is durability? How do you guarantee that once you say “OK, data saved,” it’s actually on disk?
- Why are LSM trees fast for writes? What’s the tradeoff between B-trees and LSM trees?
- How do you make it thread-safe? Multiple readers, one writer—how do you avoid locks killing performance?
- What happens when the computer crashes? How do you recover without losing data or corrupting the database?
After this project, you’ll understand that databases aren’t magic—they’re careful engineering of data structures, file formats, concurrency primitives, and crash recovery protocols.
Concepts You Must Understand First
Stop and research these before coding:
- Log-Structured Merge (LSM) Trees
- Why append-only data structures are fast for writes
- What is a memtable? (In-memory sorted structure, typically a skip list or red-black tree)
- What is an SSTable? (Sorted String Table—immutable, on-disk sorted key-value file)
- What is compaction and why is it necessary?
- Book Reference: “Designing Data-Intensive Applications” Ch. 3 - Kleppmann
- Resource: “The Log-Structured Merge-Tree (LSM-Tree)” - Patrick O’Neil et al. (original paper)
- Write-Ahead Logging (WAL)
- Why do you write to a log before modifying the main data?
- What is `fsync()` and why does it matter for durability?
  - How do you replay the log after a crash?
- What’s the difference between crash consistency and data consistency?
- Book Reference: “Operating Systems: Three Easy Pieces” Ch. 42: Crash Consistency
- Book Reference: “Database Internals” Ch. 6 - Alex Petrov
- Memory-Mapped I/O
- What is `mmap()` and why is it useful for databases?
  - How does the OS handle page faults for memory-mapped files?
  - What are the dangers of `mmap()` for durability? (delayed writes!)
  - When should you use `mmap()` vs `read()`/`write()`?
  - Book Reference: “The Linux Programming Interface” Ch. 49 - Kerrisk
  - Book Reference: “Database Internals” Ch. 4 - Petrov
- Concurrent Data Structures
- How do you implement a lock-free read path?
- What is MVCC (Multi-Version Concurrency Control)?
- What are reader-writer locks and when should you use them?
- How do you avoid deadlocks in concurrent code?
- Book Reference: “The Art of Multiprocessor Programming” Ch. 9 - Herlihy & Shavit
- Book Reference: “Database Internals” Ch. 5 - Petrov
- B-Trees vs LSM Trees
- Why does SQLite use B-trees? (Read-optimized)
- Why does RocksDB use LSM trees? (Write-optimized)
- What is write amplification?
- What is read amplification?
- Book Reference: “Designing Data-Intensive Applications” Ch. 3 - Kleppmann
- Resource: RocksDB documentation on LSM tree tuning
- Bloom Filters
- How can you quickly test if a key might be in an SSTable?
- What is a false positive rate and how do you tune it?
- How much memory does a bloom filter use?
- Why are bloom filters critical for LSM tree read performance?
- Book Reference: “Algorithms” by Sedgewick & Wayne - Appendix on probabilistic data structures
- Resource: “Bloom Filters by Example” - various online tutorials
- File System Semantics
- What guarantees does `fsync()` provide?
  - What’s the difference between `fsync()` and `fdatasync()`?
  - What is `O_DIRECT` and when would you use it?
  - How do you ensure directory metadata is persisted?
  - Book Reference: “The Linux Programming Interface” Ch. 13 - Kerrisk
  - Book Reference: “Operating Systems: Three Easy Pieces” Ch. 40: File System Implementation
- Binary File Formats
- How do you serialize data structures to disk?
- What is endianness and why does it matter?
- How do you version your file format for future compatibility?
- What’s the tradeoff between compression and read speed?
- Book Reference: “Computer Systems: A Programmer’s Perspective” Ch. 2 - Bryant & O’Hallaron
- Resource: RocksDB SSTable format documentation
Questions to Guide Your Design
Before implementing, think through these:
- Data Structure Choice
- Should your memtable be a skip list, red-black tree, or AVL tree?
- How do you make it thread-safe without locks?
- What’s your threshold for flushing memtable to disk?
- On-Disk Format
- How do you lay out an SSTable on disk?
- Should you store keys and values separately?
- Do you need an index block for fast lookups?
- Should you compress data? Which algorithm?
- Write-Ahead Log
- Should each write be a separate WAL entry or batch them?
- How often do you call `fsync()`? (Every write? Every N writes? Every N milliseconds?)
- When is it safe to truncate the WAL?
- Compaction Strategy
- How many levels in your LSM tree?
- When do you trigger compaction?
- Do you compact in a background thread?
- How do you ensure readers don’t see half-compacted state?
- Crash Recovery
- What’s your recovery procedure on startup?
- How do you detect corrupted WAL entries?
- Should you use checksums on every write?
- How do you test that recovery actually works?
- Concurrency
- How many concurrent readers do you support?
- Can writers run concurrently? (Probably not in v1)
- How do you make iterators safe while background compaction runs?
- Performance vs Durability
- Should you offer async writes (fast but less durable)?
- Do you batch writes to amortize `fsync()` cost?
  - What’s your target: throughput or latency?
Thinking Exercise
Design the Write Path on Paper
Trace what happens when a user calls kvdb_put(db, "user:123", "Alice"):
1. Acquire write lock
2. Append to WAL:
[Sequence #1234] PUT user:123 Alice [Checksum: 0xABCD]
3. Call fsync() on WAL file
4. Insert into memtable (in-memory skip list)
5. Release write lock
6. Return success to user
If memtable is now full (> 4MB):
7. Create new empty memtable
8. Start background thread to flush old memtable:
a. Sort all entries (already sorted in skip list)
b. Write SSTable to disk:
- Data block: [user:123 -> Alice]
- Index block: [user -> offset 0]
- Bloom filter: set bits for "user:123"
- Footer: [index offset, bloom offset, checksum]
c. Call fsync() on SSTable
d. Add SSTable to L0
e. Truncate WAL (safe now that data is in SSTable)
Questions to consider:
- What if the process crashes at step 3? (WAL has entry, memtable doesn’t → replay from WAL)
- What if the process crashes at step 6? (Both WAL and memtable have entry → idempotent replay)
- What if the process crashes during step 8b? (Partial SSTable → ignored on startup, WAL still has data)
- How do concurrent readers see the write? (Memtable is visible immediately after step 4)
Now trace a read: kvdb_get(db, "user:123")
1. Look in memtable (skip list search) → FOUND! Return "Alice"
(If not found, continue to step 2)
2. Check L0 SSTables (newest first):
a. Check bloom filter for "user:123"
b. If bloom says "maybe", binary search index block
c. If found in index, read data block
d. If found in data block, return value
3. Check L1 SSTables (only one, sorted range)
a. Binary search index to find relevant SSTable
b. Check bloom filter
c. Read data block if bloom filter says maybe
4. Return NOT_FOUND
Why this order?
- Memtable first: Most recent writes
- L0 next: Recent flushes (may have updates)
- L1 last: Older, compacted data
The Interview Questions They’ll Ask
Prepare to answer these:
- “How does write-ahead logging guarantee durability?”
- Answer: WAL is synced to disk (`fsync`) before acknowledging the write
  - On crash, replay WAL to reconstruct in-memory state
- “What’s the difference between an LSM tree and a B-tree?”
- Answer: LSM tree is write-optimized (append-only), B-tree is read-optimized (in-place updates)
- Tradeoff: LSM has write amplification from compaction, B-tree has write amplification from rebalancing
- “Why do LSM trees need compaction?”
- Answer: Without compaction, you’d have too many overlapping SSTables
- Reads would require checking many files, killing performance
- “How would you make this database crash-safe?”
- Answer: WAL + `fsync()` before ACK, checksums on all entries, replay on recovery
  - Must ensure directory metadata is persisted too (`fsync()` on the directory; see the sketch after this list)
- “What’s write amplification and why does it matter?”
- Answer: Writing 1 byte might cause many more bytes to be written (compaction, rewriting SSTables)
- Matters for SSD lifespan and write throughput
- “How do you handle concurrent reads and writes?”
- Answer: MVCC (copy-on-write for memtable), immutable SSTables (safe for concurrent reads)
- Or reader-writer locks (simpler but less scalable)
- “What’s a bloom filter and why is it important for LSM trees?”
- Answer: Probabilistic data structure to test set membership
- Avoids expensive disk reads when key definitely not in SSTable
- “How would you debug a crash that corrupts the database?”
- Answer: Check WAL checksums, verify SSTable integrity, replay WAL in test environment
- Use tools like `valgrind` to catch memory bugs and `fsync` tracing to verify durability
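For the directory-metadata point in question 4, a minimal sketch of the pattern: sync the file, then sync its parent directory so the new entry itself survives a crash.

```c
#include <fcntl.h>
#include <unistd.h>

/* Make both a new file's contents and its directory entry durable. */
int sync_new_file(const char* dir_path, int file_fd) {
    if (fsync(file_fd) != 0) return -1;   /* file contents on disk */
    int dir_fd = open(dir_path, O_RDONLY);
    if (dir_fd < 0) return -1;
    int rc = fsync(dir_fd);               /* directory entry on disk */
    close(dir_fd);
    return rc;
}
```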
Hints in Layers
Hint 1: Start with an In-Memory-Only Version
Get the API and basic data structures working without persistence:
// Simple in-memory version first: an unsorted array of entries
typedef struct {
    char* key;
    char* value;
    size_t key_len;
    size_t value_len;
} kv_entry_t;
struct kvdb {            // definition backing the opaque kvdb_t
    kv_entry_t* entries;
    size_t count;
    size_t capacity;
};
kvdb_t* kvdb_open(const char* path) {
    (void)path;          // no persistence yet
    kvdb_t* db = malloc(sizeof(kvdb_t));
    db->capacity = 1024;
    db->entries = malloc(sizeof(kv_entry_t) * db->capacity);
    db->count = 0;
    return db;
}
int kvdb_put(kvdb_t* db, const void* key, size_t key_len,
             const void* value, size_t value_len) {
    // Just store in the array for now (growth and overwrite come later)
    kv_entry_t* entry = &db->entries[db->count++];
    entry->key = malloc(key_len);
    memcpy(entry->key, key, key_len);
    entry->key_len = key_len;
    entry->value = malloc(value_len);
    memcpy(entry->value, value, value_len);
    entry->value_len = value_len;
    return 0;
}
Once this works, add persistence layer by layer.
Hint 2: Implement WAL First
The simplest persistent structure is an append-only log:
// WAL entry format:
// [4 bytes: key_len][4 bytes: value_len][key][value][4 bytes: CRC32]
int wal_append(int fd, const void* key, size_t key_len,
               const void* value, size_t value_len) {
    uint32_t k_len = (uint32_t)key_len;
    uint32_t v_len = (uint32_t)value_len;
    // (Check write() return values in real code)
    write(fd, &k_len, sizeof(k_len));
    write(fd, &v_len, sizeof(v_len));
    write(fd, key, key_len);
    write(fd, value, value_len);
    // crc32() is your own helper covering both buffers
    uint32_t crc = crc32(key, key_len, value, value_len);
    write(fd, &crc, sizeof(crc));
    // CRITICAL: fsync before returning!
    fsync(fd);
    return 0;
}
Test that you can replay the WAL after a crash.
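A minimal replay sketch for the entry format above. It treats the first short read or bad checksum as a torn tail and stops there; `crc32()` is the same helper as in `wal_append()`, and `memtable_put()` stands in for your own insertion routine:

```c
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

extern uint32_t crc32(const void* k, size_t klen, const void* v, size_t vlen);
extern void memtable_put(const void* k, size_t klen,
                         const void* v, size_t vlen);

int wal_replay(int fd) {
    for (;;) {
        uint32_t k_len, v_len, crc;
        if (read(fd, &k_len, sizeof(k_len)) != sizeof(k_len)) break;
        if (read(fd, &v_len, sizeof(v_len)) != sizeof(v_len)) break;
        char* key   = malloc(k_len);
        char* value = malloc(v_len);
        if (!key || !value ||
            read(fd, key, k_len)        != (ssize_t)k_len ||
            read(fd, value, v_len)      != (ssize_t)v_len ||
            read(fd, &crc, sizeof(crc)) != sizeof(crc)    ||
            crc != crc32(key, k_len, value, v_len)) {
            free(key); free(value);
            break;                 /* torn tail: ignore the rest */
        }
        memtable_put(key, k_len, value, v_len);
        free(key); free(value);
    }
    return 0;
}
```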
Hint 3: Use a Skip List for Memtable
Skip lists are simpler than red-black trees and perform similarly:
#define MAX_LEVEL 16
typedef struct skip_node {
    char* key;
    char* value;
    struct skip_node* next[MAX_LEVEL];
} skip_node_t;
// Insert is probabilistic (coin flip determines level)
void skiplist_insert(skiplist_t* list, const char* key, const char* value) {
int level = random_level(); // 1-16
skip_node_t* node = malloc(sizeof(skip_node_t));
node->key = strdup(key);
node->value = strdup(value);
// Link into skip list at chosen level
// ... (standard skip list insertion)
}
Reference: “Algorithms” by Sedgewick & Wayne has skip list pseudocode.
Hint 4: SSTable Format Should Be Simple
Don’t over-engineer the on-disk format initially:
SSTable file structure:
┌──────────────────────────────────────┐
│ Header: │
│ magic: "KVDB" (4 bytes) │
│ version: 1 (4 bytes) │
│ num_entries: N (8 bytes) │
├──────────────────────────────────────┤
│ Data Block: │
│ [key_len][value_len][key][value] │
│ [key_len][value_len][key][value] │
│ ... (repeated N times) │
├──────────────────────────────────────┤
│ Index Block: (optional v2 feature) │
│ [first_key -> offset] │
│ [first_key -> offset] │
├──────────────────────────────────────┤
│ Footer: │
│ index_offset: (8 bytes) │
│ checksum: CRC32 (4 bytes) │
└──────────────────────────────────────┘
Start without index/bloom filter, add them when reads are slow.
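When you do add the bloom filter, it can be this small. A minimal sketch using `BLOOM_K` seeded FNV-1a probes into a fixed bit array; the sizes and hash choice are illustrative, not a requirement:

```c
#include <stddef.h>
#include <stdint.h>

#define BLOOM_BITS (1u << 16)   /* 64 Kbit = 8 KB per SSTable */
#define BLOOM_K    4            /* probes per key */

typedef struct { uint8_t bits[BLOOM_BITS / 8]; } bloom_t;

/* Seeded 64-bit FNV-1a, one variant per probe. */
static uint64_t fnv1a(const void* data, size_t len, uint64_t seed) {
    const uint8_t* p = data;
    uint64_t h = 14695981039346656037ULL ^ seed;
    for (size_t i = 0; i < len; i++) { h ^= p[i]; h *= 1099511628211ULL; }
    return h;
}

void bloom_add(bloom_t* b, const void* key, size_t len) {
    for (uint64_t i = 0; i < BLOOM_K; i++) {
        uint64_t bit = fnv1a(key, len, i) % BLOOM_BITS;
        b->bits[bit / 8] |= (uint8_t)(1u << (bit % 8));
    }
}

/* Returns 0 only when the key is definitely not in the SSTable. */
int bloom_maybe_contains(const bloom_t* b, const void* key, size_t len) {
    for (uint64_t i = 0; i < BLOOM_K; i++) {
        uint64_t bit = fnv1a(key, len, i) % BLOOM_BITS;
        if (!(b->bits[bit / 8] & (1u << (bit % 8)))) return 0;
    }
    return 1;
}
```

Serialize the bit array into the footer region of the SSTable format sketched above, and consult it before touching the index block.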
Hint 5: Test Crash Recovery Early
Write a script that kills your process randomly:
#!/bin/bash
# crash_test.sh
for i in {1..100}; do
# Start database in background
./kvdb_cli test.db < test_commands.txt &
PID=$!
# Sleep random time (0-2 seconds)
sleep $((RANDOM % 2)).$((RANDOM % 1000))
# Kill it
kill -9 $PID
# Try to recover
./kvdb_cli test.db < verify_commands.txt
if [ $? -ne 0 ]; then
echo "Recovery failed on iteration $i"
exit 1
fi
done
echo "All 100 crash tests passed!"
This surfaces bugs in your WAL replay logic.
Hint 6: Use mmap() Carefully
Memory-mapped I/O is convenient but has durability pitfalls:
// Mapping an SSTable for reads is great:
void* data = mmap(NULL, file_size, PROT_READ, MAP_PRIVATE, fd, 0);
// Now you can read data as a byte array
// But for writes, be careful:
void* data = mmap(NULL, file_size, PROT_WRITE, MAP_SHARED, fd, 0);
// Writes might not hit disk until you call:
msync(data, file_size, MS_SYNC); // Like fsync for mmap
Start without mmap, add it as an optimization later.
Hint 7: Study Existing Implementations
Read the source code of simple key-value stores:
# LevelDB (C++, but readable)
git clone https://github.com/google/leveldb.git
# Look at:
# db/write_batch.cc (WAL writing)
# table/table_builder.cc (SSTable writing)
# db/db_impl.cc (main database logic)
# LMDB (C, very clean)
git clone https://github.com/LMDB/lmdb.git
# Look at:
# libraries/liblmdb/mdb.c (entire database in one file!)
Don’t copy code, but understand their design decisions.
Hint 8: Benchmark Against SQLite
SQLite is your baseline—can you match it?
#include <sqlite3.h>
#include <stdio.h>
#include <time.h>
// Benchmark SQLite
clock_t start = clock();
sqlite3* db;
sqlite3_open("test.db", &db);
sqlite3_exec(db, "CREATE TABLE IF NOT EXISTS kv (k TEXT, v TEXT)",
             NULL, NULL, NULL);
// Tip: wrap the loop in BEGIN/COMMIT, or SQLite syncs every statement
for (int i = 0; i < 100000; i++) {
    char sql[256];
    sprintf(sql, "INSERT INTO kv VALUES ('%d', 'value%d')", i, i);
    sqlite3_exec(db, sql, NULL, NULL, NULL);
}
clock_t end = clock();
printf("SQLite: %.2f seconds\n", (double)(end - start) / CLOCKS_PER_SEC);
// Now benchmark yours the same way
If you’re within 2-3x of SQLite, you’re doing well!
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| LSM trees and database design | “Designing Data-Intensive Applications” by Martin Kleppmann | Ch. 3: Storage and Retrieval |
| Write-ahead logging | “Operating Systems: Three Easy Pieces” by Arpaci-Dusseau | Ch. 42: Crash Consistency |
| Memory-mapped I/O | “The Linux Programming Interface” by Michael Kerrisk | Ch. 49: Memory Mappings |
| File system semantics | “The Linux Programming Interface” by Michael Kerrisk | Ch. 13: File I/O Buffering |
| Concurrent data structures | “The Art of Multiprocessor Programming” by Herlihy & Shavit | Ch. 9: Linked Lists |
| Database internals | “Database Internals” by Alex Petrov | Ch. 3-6: File Formats, Indexing, Transaction Processing |
| B-trees and LSM trees comparison | “Database Internals” by Alex Petrov | Ch. 2: B-Tree Basics; Ch. 7: Log-Structured Storage |
| Skip lists | “Algorithms” by Sedgewick & Wayne | Section 3.5: Applications (skip lists) |
| Bloom filters | “Algorithms” by Sedgewick & Wayne | Appendix: Probabilistic Data Structures |
| Binary file formats | “Computer Systems: A Programmer’s Perspective” by Bryant & O’Hallaron | Ch. 2: Representing Information |
| fsync and durability | “Database Reliability Engineering” by Campbell & Majors | Ch. 4: Durability |
| Crash consistency | “Operating Systems: Three Easy Pieces” by Arpaci-Dusseau | Ch. 41-42: File System Implementation, Crash Consistency |
Essential Papers and Resources:
| Topic | Resource |
|---|---|
| LSM tree original paper | “The Log-Structured Merge-Tree” by O’Neil et al. (1996) |
| RocksDB documentation | RocksDB Wiki (github.com/facebook/rocksdb/wiki) |
| LMDB design | “LMDB: The Symas Lightning Memory-Mapped Database” whitepaper |
| Write-ahead logging | “Write-Ahead Logging” - PostgreSQL documentation |
| File system guarantees | “All File Systems Are Not Created Equal” by Pillai et al. |
| SSTables in Bigtable | “Bigtable: A Distributed Storage System” by Chang et al. (Google paper) |
| Compaction strategies | “Dostoevsky: Better Space-Time Trade-Offs for LSM-Tree Based Key-Value Stores” |
Summary
This track will transform you from someone who uses systems libraries to someone who builds them. The projects above give you real artifacts to show employers and the deep understanding to discuss implementation tradeoffs in interviews.
Projects in this track align with real-world tools:
- Memory Allocator → jemalloc, tcmalloc, mimalloc
- Thread Pool → Rayon, Go runtime, Java ForkJoinPool
- Async Runtime → Tokio, libuv, io_uring
- Syscall Abstraction → libuv, Rust std, Go runtime
- String Search → ripgrep, hyperscan, stringzilla
- Key-Value Store → RocksDB, LMDB, LevelDB