Project 1: Custom Memory Allocator

Build a drop-in malloc replacement with size classes, arenas, and repeatable benchmarks.

Quick Reference

Attribute                          Value
Difficulty                         Level 3: Advanced
Time Estimate                      3 to 6 weeks
Main Programming Language          C (C17)
Alternative Programming Languages  C++, Rust, Zig
Coolness Level                     Level 4: Hardcore Tech Flex
Business Potential                 3 - Infrastructure Core
Prerequisites                      C pointers, basic threading, virtual memory basics
Key Topics                         malloc/free, size classes, fragmentation, arenas, alignment, profiling, LD_PRELOAD

1. Learning Objectives

By completing this project, you will:

  1. Implement a real allocator pipeline that is fast on the hot path and correct under stress.
  2. Design size classes and free lists that balance fragmentation vs throughput.
  3. Integrate OS page allocation with allocator metadata and reclamation policies.
  4. Make allocator behavior deterministic and measurable with repeatable benchmarks.
  5. Protect allocator metadata from common corruption and undefined behavior hazards.
  6. Ship a usable drop-in replacement for malloc and validate it with real programs.

2. All Theory Needed (Per-Concept Breakdown)

This section is a mini-book for the allocator. Every concept listed here is required to implement a robust, production-like allocator.

Concept 1: Virtual Memory, Pages, and OS Allocation Interfaces

Fundamentals

Dynamic allocation is built on virtual memory. Each process owns a virtual address space that the OS maps to physical pages. The allocator does not get bytes directly from RAM; it asks the OS for pages and then subdivides them. That means all allocator design is downstream of page size, page alignment, and OS allocation APIs. On Linux and BSD, the allocator can grow the traditional heap with brk and can also map anonymous pages with mmap. On Windows, the rough equivalent is VirtualAlloc. Every allocation request is therefore a policy decision about when to ask the OS for more pages, how to keep them in reserve, and when to return them with munmap or madvise. If you misunderstand the VM model you will accidentally create excessive page faults, thrash the TLB, or keep too much memory resident. The allocator must treat the OS as a page provider with its own costs, not as a byte-granular API. Understanding how page tables, protection bits, and virtual address ranges work is essential for correctness and performance.

Deep Dive into the concept

A process sees a flat virtual address space, but the CPU translates each virtual address through page tables. The OS decides which pages are backed by physical memory. Allocation at the OS level is thus a mapping decision: allocate a range of virtual addresses and associate them with physical pages as needed. On Linux, mmap with MAP_ANONYMOUS creates a virtual memory area (VMA) that is initially not backed by physical memory; the first write triggers a page fault and the OS allocates a physical page. This is why large allocations can appear to be cheap at first but get expensive when touched. The allocator must decide whether to pre-fault pages (by touching them) or to leave them lazy. Pre-faulting reduces latency variance but increases upfront cost. The decision depends on workload.

The classic brk interface grows the program break and creates a single contiguous heap region. This is simple and works well for small allocations because contiguous memory makes coalescing easier. But it also has severe drawbacks: you cannot easily return memory that is not at the top of the heap, and a large allocation can force the break to jump, creating a sparse address space and a big RSS footprint. Modern allocators therefore use mmap for large allocations to avoid contaminating the main heap. Many allocators set a threshold, often around 128KB or larger, above which they allocate by mmap and track the region separately. This avoids external fragmentation in the main heap and allows returning large blocks to the OS immediately.

Understanding overcommit is also critical. Linux may allow more virtual memory to be mapped than there is physical RAM available, on the assumption that not all pages will be touched. An allocator that aggressively pre-reserves memory can trigger the OOM killer or have allocations succeed but later fail at first touch. The allocator must therefore balance virtual address reservation against actual committed usage. This is a key reason why per-thread arenas can be expensive: each arena may reserve or map more memory than it immediately uses.

Page size and alignment are another core part of allocator design. Standard pages are often 4KB, but huge pages (2MB) can reduce TLB misses. Some allocators support huge page arenas for large allocations or for internal metadata. However, huge pages can be scarce and require alignment constraints. If you do not align arena base addresses to page boundaries, madvise and munmap calls can fail or behave unexpectedly. Similarly, returning memory to the OS by munmap should only be done for whole page ranges. Partial page freeing is impossible, which means internal fragmentation at page granularity is unavoidable. This is why size classes often align to at least 8 or 16 bytes but still pack multiple allocations into a single page.

The OS allocation APIs themselves have costs and semantics. mmap is relatively heavy: it acquires kernel locks, updates VMAs, and may flush TLB entries. brk can also be expensive, especially with multi-threaded processes that share the heap. That cost is why allocators avoid calling into the kernel on the fast path. Instead, they maintain free lists of blocks carved from pages and only go to the kernel when empty. This idea extends to memory reclamation: returning memory to the OS may reduce RSS but costs syscalls and can create performance cliffs. Many allocators implement a decay or purge policy that uses madvise(MADV_DONTNEED) or MADV_FREE on Linux to give pages back under memory pressure while keeping the virtual mapping intact. This balance between address space and physical memory is subtle but crucial for latency predictability.

Finally, virtual memory protection can be used for debugging. An allocator can add guard pages using mprotect around large allocations or around metadata regions to catch out-of-bounds writes. These features are useful in debug modes but are too expensive for production. Knowing when to enable them is part of allocator engineering. In short, the allocator is an interpreter between the OS page allocator and the application. If you understand the OS-level VM model, you can design policies that trade memory for speed in an intentional way rather than accidentally.

How This Fits Into Projects

This concept defines the way you acquire and release memory in the allocator core. It controls the layout of arenas, the page size assumptions used in Section 4.4 Data Structures, and the policies in Section 5.10 Phase 1 and Phase 3. It also informs the OS abstraction in P04-cross-platform-syscall-abstraction.md and the SSTable mapping strategy in P06-embedded-key-value-database.md.

Definitions & Key Terms

  • Virtual address: An address in the process address space that the CPU translates to a physical address.
  • Page: The fixed-size block of memory managed by the OS (commonly 4KB).
  • VMA (Virtual Memory Area): A contiguous range of virtual addresses with the same permissions.
  • brk/sbrk: APIs to grow the traditional heap segment.
  • mmap/munmap: APIs to map or unmap pages in the virtual address space.
  • RSS (Resident Set Size): Amount of physical memory currently backed by pages for the process.
  • Overcommit: OS policy that allows mapping more virtual memory than physical RAM.
  • TLB (Translation Lookaside Buffer): CPU cache of page table translations.

Mental Model Diagram (ASCII)

Application malloc/free
        |
        v
Allocator arenas + size classes
        |
        v
Page allocator (mmap/brk)
        |
        v
Virtual memory mappings -> physical pages

How It Works (Step-by-Step)

  1. The allocator determines that a size class has no free blocks.
  2. It requests a new page range from the OS using mmap (or brk for the main heap).
  3. The OS creates a VMA and returns a base address aligned to page size.
  4. The allocator subdivides the range into blocks and pushes them into a free list.
  5. When memory is freed, whole pages that become empty are candidates for madvise or munmap.

Invariants:

  • Arena base addresses are page-aligned.
  • The allocator never unmaps a page that still contains live allocations.

Failure modes:

  • Mapping failures due to address space exhaustion.
  • Overcommit leading to OOM on first touch.

Minimal Concrete Example

#include <sys/mman.h>
#include <unistd.h>
#include <stddef.h>

void* os_pages_alloc(size_t bytes) {
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    if (bytes == 0) bytes = 1;            /* mmap rejects zero-length maps */
    size_t size = (bytes + page - 1) & ~(page - 1);  /* round up to pages */
    void* p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return (p == MAP_FAILED) ? NULL : p;
}

Common Misconceptions

  • “free always returns memory to the OS” -> Most allocators keep pages for reuse.
  • “mmap is cheap” -> It is a heavy syscall and should not be on the hot path.
  • “Virtual memory equals physical memory” -> Mapping a page does not allocate physical RAM until touched.

Check-Your-Understanding Questions

  1. Why do allocators often use mmap for large allocations but not for small ones?
  2. What is the difference between reserving virtual memory and committing physical memory?
  3. Why can returning memory to the OS increase latency for later allocations?

Check-Your-Understanding Answers

  1. Large allocations would fragment the heap; mmap isolates them and can be returned cleanly.
  2. Reservation creates address space mappings; commitment happens when pages are touched and backed by RAM.
  3. Unmapped pages must be re-mapped and re-faulted, which adds syscalls and page faults.

Real-World Applications

  • Browsers use arena-based allocators to isolate subsystems and reduce fragmentation.
  • Databases use mmap for large buffers and file-backed caches.
  • High-frequency trading systems pre-fault pages to avoid latency spikes.

References

  • “Computer Systems: A Programmer’s Perspective” - Chapters on virtual memory.
  • “The Linux Programming Interface” - Memory mapping and mmap semantics.
  • “Operating Systems: Three Easy Pieces” - Virtual memory chapters.

Key Insights

The allocator is a policy layer on top of the OS page allocator, not a magical source of bytes.

Summary

Virtual memory is the bedrock of dynamic allocation. Pages are the real currency, and mmap or brk are the only ways to acquire them. An allocator that understands VM can trade latency, RSS, and fragmentation intentionally instead of accidentally.

Homework/Exercises to Practice the Concept

  1. Write a small program that mmaps 64MB, touches each page, and observes RSS growth.
  2. Compare the cost of mmap vs malloc for many small allocations.
  3. Experiment with madvise(MADV_DONTNEED) on a large mapped region and observe RSS changes.

Solutions to the Homework/Exercises

  1. RSS grows only after pages are touched; the initial mapping is mostly virtual.
  2. mmap is orders of magnitude slower for tiny allocations due to syscall overhead.
  3. madvise reduces RSS but the virtual address range remains mapped.

Concept 2: Allocator Metadata, Size Classes, and Fragmentation

Fundamentals

Allocators must remember the size and state of every allocation so they can free it safely. This metadata can be stored inline (a header before the user pointer) or in side tables. The design of metadata interacts with size classes: allocators round requests up to fixed bucket sizes so the free list operations are O(1). Size classes make allocations fast but introduce internal fragmentation because a 17 byte allocation might consume a 32 byte block. External fragmentation occurs when free blocks are scattered and cannot satisfy large requests even though total free memory is sufficient. Coalescing adjacent blocks reduces external fragmentation but adds overhead at free time. The allocator must decide when to coalesce, how to track free blocks, and how to represent metadata without corrupting user data. A good allocator is therefore a data structure problem as much as a systems problem.

Deep Dive into the concept

The allocator core is a set of data structures for tracking blocks. In a simple design, each block has a header containing its size and a bit for in-use/free. When freed, the block becomes a node in a free list. Allocation then walks the free list to find a block large enough. This “implicit” or “explicit” free list design is easy to implement but can be slow because it requires searching and because it fragments over time. To make the fast path O(1), modern allocators use segregated free lists with size classes. Requests are rounded up to the nearest class; each class has its own free list; allocation is a quick pop. The trade-off is internal fragmentation and the need to choose a good size class table.

Metadata placement is a design constraint. Inline headers are simple but vulnerable: a buffer overflow can overwrite the size field and corrupt the allocator. To mitigate this, allocators use cookies, canaries, or checksums on headers. Another strategy is to separate metadata into side tables indexed by address; this is safer but requires more complex bookkeeping and can reduce cache locality. Some allocators store the size in the low bits of the pointer by using aligned addresses, but this can run into undefined behavior if not done carefully.

Fragmentation has two faces. Internal fragmentation is the space wasted inside blocks because of rounding and alignment. External fragmentation is the space wasted between blocks because free blocks are too small or too scattered. Size classes reduce external fragmentation for small objects but can increase internal fragmentation if classes are too coarse. Coalescing addresses external fragmentation: when two adjacent blocks are free, you can merge them to form a larger block. However, eager coalescing makes free expensive and can become a bottleneck. Many allocators use deferred or lazy coalescing, merging blocks only when a large request fails. This creates a feedback loop where large allocations drive cleanup.

Free list structure also matters. A singly linked list is compact but makes removal from the middle expensive. A doubly linked list allows O(1) removal but increases metadata overhead and cache misses. Some allocators use bitmaps to represent free blocks within a page, which improves locality and reduces pointer chasing. Others use tree-based structures for large allocations to quickly find a best-fit block. The design you choose should match your target workloads.

Size class selection is not arbitrary. The common approach is to define fine-grained classes for small sizes (e.g., 8, 16, 24, 32, 40) and then exponentially grow for larger sizes. This keeps internal fragmentation low for small allocations while avoiding too many classes. You also need to consider alignment requirements; blocks should align to at least alignof(max_align_t) and often to cache-line sizes for performance-sensitive data. That requirement can push classes to larger sizes.
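As a concrete illustration of such a table (the specific classes below are invented for this sketch, not a recommendation; real allocators tune the table against measured workloads and replace the linear scan with arithmetic or a lookup table):

```c
#include <stddef.h>

/* Illustrative size-class table: 8-byte steps up to 128 bytes,
   then powers of two up to 4096. */
static const size_t k_classes[] = {
    8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128,
    256, 512, 1024, 2048, 4096
};
enum { NUM_CLASSES = sizeof(k_classes) / sizeof(k_classes[0]) };

/* Map a request to the index of the smallest class that fits,
   or -1 to signal the separate large-allocation path. */
int size_class_index(size_t n) {
    for (int i = 0; i < NUM_CLASSES; i++)
        if (n <= k_classes[i]) return i;
    return -1;   /* falls through to the mmap-backed large path */
}
```

With this table, a 17-byte request lands in the 24-byte class (7 bytes of internal fragmentation), while a 200-byte request jumps to 256 bytes, showing how coarser classes waste more.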

Finally, the allocator must track large allocations separately. Many designs use a separate tree or list for mmap allocations. These blocks have different lifetimes and alignment constraints, and they can be returned directly to the OS when freed. Mixing them with small-object bins creates fragmentation and degrades performance. In your allocator, you will likely implement a fast path for small allocations and a slow path for large allocations. The boundary between them is an important tuning knob and should be benchmarked.

How This Fits Into Projects

This concept defines the block layout in Section 4.4 Data Structures, the allocation algorithm in Section 4.4 Algorithm Overview, and the fragmentation policies in Section 5.10 Phase 2. It also appears in P05-high-performance-string-search.md where alignment and metadata boundaries affect SIMD safety, and in P06-embedded-key-value-database.md for memtable and cache allocation.

Definitions & Key Terms

  • Header: Metadata stored before the user pointer.
  • Footer: Metadata stored at the end of a block for coalescing.
  • Size class: A bucket of fixed allocation sizes.
  • Internal fragmentation: Wasted space inside an allocated block.
  • External fragmentation: Wasted space between allocated blocks.
  • Coalescing: Merging adjacent free blocks into a larger block.
  • Free list: A data structure that tracks available blocks.

Mental Model Diagram (ASCII)

[Page]
+-----------------------------+
| hdr | user | hdr | user |.. |
+-----------------------------+
| free list for size class 64 |
+-----------------------------+

How It Works (Step-by-Step)

  1. Round request size up to a size class and alignment.
  2. Pop a block from the class free list if available.
  3. If empty, split a page into blocks for that class.
  4. On free, mark the block free and push to free list.
  5. Optionally coalesce adjacent free blocks if a large request fails.
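The steps above can be sketched for a single size class. This toy version uses malloc as a stand-in page source and deliberately skips coalescing, thread safety, and page bookkeeping:

```c
#include <stddef.h>
#include <stdlib.h>

#define BLOCK_SIZE 64     /* one illustrative size class */
#define PAGE_SIZE  4096   /* stand-in "page" carved into blocks */

typedef struct free_node { struct free_node* next; } free_node;
static free_node* free_list = NULL;

/* Step 3: split a fresh page into blocks and push them. */
static int refill_from_page(void) {
    char* page = malloc(PAGE_SIZE);   /* a real allocator would mmap */
    if (!page) return -1;
    for (size_t off = 0; off + BLOCK_SIZE <= PAGE_SIZE; off += BLOCK_SIZE) {
        free_node* n = (free_node*)(page + off);
        n->next = free_list;
        free_list = n;
    }
    return 0;
}

/* Steps 1-2: pop from the class free list, refilling if empty. */
void* class_alloc(void) {
    if (!free_list && refill_from_page() != 0) return NULL;
    free_node* n = free_list;
    free_list = n->next;
    return n;
}

/* Step 4: push the block back; no coalescing in this sketch. */
void class_free(void* p) {
    free_node* n = p;
    n->next = free_list;
    free_list = n;
}
```

Note the LIFO reuse: a just-freed block is the next one handed out, which is also good for cache locality.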

Invariants:

  • Every block has a consistent size stored in metadata.
  • Free blocks are only in one free list at a time.

Failure modes:

  • Corrupted headers lead to invalid sizes and heap corruption.
  • Double free can insert a block into a free list twice.

Minimal Concrete Example

typedef struct block_header {
    size_t size;                  /* usable bytes in this block */
    int in_use;                   /* 1 = allocated, 0 = free */
    struct block_header* next;    /* free-list linkage when free */
} block_header;

/* The user pointer begins right after the header. This is only
   properly aligned if sizeof(block_header) is a multiple of the
   required alignment. */
void* user_ptr(block_header* h) {
    return (void*)(h + 1);
}

Common Misconceptions

  • “Best-fit is always better” -> Best-fit can be slow and still fragment.
  • “Coalesce on every free” -> It can create latency spikes and lock contention.
  • “Headers are harmless” -> Headers can be corrupted by simple buffer overruns.

Check-Your-Understanding Questions

  1. Why do size classes make allocation faster?
  2. What is the difference between internal and external fragmentation?
  3. When is lazy coalescing preferable to eager coalescing?

Check-Your-Understanding Answers

  1. Size classes allow O(1) allocation by popping from a fixed free list.
  2. Internal fragmentation is wasted space inside blocks, external is wasted space between blocks.
  3. Lazy coalescing reduces free-time overhead and only pays cost when needed.

Real-World Applications

  • jemalloc uses size classes and per-arena bins for fast allocations.
  • tcmalloc uses thread caches to avoid global contention.
  • mimalloc uses delayed coalescing and per-thread segments.

References

  • “C Interfaces and Implementations” - Memory allocation chapters.
  • “Computer Systems: A Programmer’s Perspective” - Dynamic memory sections.
  • “The C Programming Language” - Storage allocator example.

Key Insights

Allocator performance is mostly about fast metadata decisions and fragmentation policy, not just raw speed.

Summary

Allocator metadata and size classes control everything: speed, fragmentation, and correctness. The best allocators use a simple fast path for common sizes and reserve complex logic for rare cases.

Homework/Exercises to Practice the Concept

  1. Implement a free list with block headers and test allocation/free order.
  2. Simulate internal fragmentation with random allocation sizes and measure waste.
  3. Implement a simple coalescing strategy and compare memory usage.

Solutions to the Homework/Exercises

  1. A header-based free list shows how metadata and user pointers interact.
  2. Random sizes show that coarser size classes waste more memory.
  3. Coalescing reduces external fragmentation but adds overhead on free.

Concept 3: Concurrency, Alignment, and Allocator Correctness

Fundamentals

Allocators must work under concurrency. A single global lock around the heap creates contention and can destroy scalability on multi-core systems. The typical solution is to use per-thread arenas or caches so that most allocations are lock-free or require only thread-local locks. However, these arenas increase memory footprint and can fragment memory across threads. Alignment is another correctness requirement: the allocator must return pointers aligned to at least alignof(max_align_t), and many workloads expect 16 or 32 byte alignment for SIMD. Misalignment causes crashes or subtle undefined behavior on some architectures. Finally, allocator correctness is about invariants: blocks must not overlap, metadata must stay consistent, and double frees should be detected rather than silently corrupting state. Concurrency makes these invariants harder because two threads can attempt to manipulate the same free list. Your allocator must enforce a clear ownership model and use synchronization to prevent corruption.

Deep Dive into the concept

Concurrency changes the allocator from a single data structure into a distributed system. With multiple threads, global locks create a serial bottleneck. Per-thread arenas solve this by partitioning the heap: each thread allocates from its own arena, usually backed by a set of pages reserved for that thread. This dramatically reduces contention but increases memory usage because free space in one arena cannot be easily used by another. To mitigate this, allocators often implement an arena decay or scavenging mechanism where inactive arenas are purged or migrated. This is a policy decision: how long should an arena keep idle pages before returning them to a global pool or the OS?

Thread caches are another technique. A thread cache is a small, per-thread stash of recently freed blocks for each size class. The fast path is entirely lock-free: allocations pop from the local cache and frees push into it. When a cache runs out or grows too large, it exchanges blocks with a central arena using a lock. This amortizes lock overhead and keeps common operations cheap. The trade-off is that a block freed by one thread might not be immediately visible to another, which can increase fragmentation and RSS but improves throughput.
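The thread-cache fast path can be sketched as follows. The arena_refill and arena_spill bodies here are trivial stand-ins; in a real allocator they would take the arena lock and move batches of blocks in or out:

```c
#include <stddef.h>
#include <stdlib.h>

#define CACHE_CAP 32

typedef struct tcache {
    void*  slots[CACHE_CAP];   /* recently freed blocks, one size class */
    size_t count;
} tcache;

static _Thread_local tcache local_cache;   /* no lock: per-thread state */

/* Stand-ins for the locked slow path against a central arena. */
static void* arena_refill(void)     { return malloc(64); }
static void  arena_spill(void* blk) { free(blk); }

void* cached_alloc(void) {
    tcache* c = &local_cache;
    if (c->count > 0)
        return c->slots[--c->count];   /* fast path: thread-local pop */
    return arena_refill();             /* slow path: hits the arena */
}

void cached_free(void* p) {
    tcache* c = &local_cache;
    if (c->count < CACHE_CAP) {
        c->slots[c->count++] = p;      /* fast path: thread-local push */
        return;
    }
    arena_spill(p);                    /* cache full: spill to arena */
}
```

The bounded capacity is the tuning knob mentioned above: a larger cache means fewer lock acquisitions but more memory stranded per thread.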

Alignment is both a performance and a correctness constraint. The C standard requires malloc to return pointers aligned for any type, which is commonly the alignment of max_align_t. If you return less than that, code that assumes alignment can crash or produce undefined behavior. Modern CPUs also have specific alignment requirements for SIMD instructions; for example, some AVX instructions require 32 byte alignment. Even though unaligned loads exist, they can be slower or require careful tail handling. This is why allocators often align blocks to 16 or 32 bytes even if the user requested less. The allocator must implement rounding that preserves this alignment while not corrupting metadata. If your metadata is stored in the block header, you must ensure the user pointer is properly aligned after the header. This is a common bug in hand-written allocators.

Allocator correctness is about invariants under concurrency. Free list operations must be atomic with respect to other threads. If a block is popped by one thread while another thread is pushing to the same list, the list can be corrupted unless protected by a lock or a lock-free algorithm. Many allocators avoid lock-free data structures because of complexity; instead they use fine-grained locks per size class. Another approach is to use lock-free stacks with atomic compare-and-swap, but those can suffer from the ABA problem. If you use lock-free lists, you must also consider memory reclamation (hazard pointers, epoch-based reclamation) because a node removed by one thread might still be read by another.

Detecting allocator misuse is part of correctness. Debug builds can add canaries, double-free checks, and freed-block poisoning. These checks increase safety but cost memory and CPU. A production allocator might include a light-weight checksum in headers to detect corruption without full debug mode. You should design your allocator so that these checks can be toggled by a compile-time flag or environment variable. This is also where you can integrate with sanitizers: AddressSanitizer and Valgrind can catch heap bugs, but your allocator must avoid patterns that confuse them (such as reusing freed blocks too quickly in debug mode).
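A minimal sketch of poisoning with a heuristic double-free check follows. A real allocator would use explicit state bits or header checksums; an all-poison pattern in live user data can false-positive here, which is why this is a debug-mode heuristic:

```c
#include <stdint.h>
#include <string.h>
#include <stddef.h>

#define POISON_BYTE 0xDD

typedef enum { FREE_OK, FREE_DOUBLE } free_status;

/* Heuristic: a block whose payload is entirely the poison pattern
   was most likely already freed. */
int looks_poisoned(const void* p, size_t size) {
    const uint8_t* b = p;
    for (size_t i = 0; i < size; i++)
        if (b[i] != POISON_BYTE) return 0;
    return 1;
}

free_status debug_free(void* p, size_t size) {
    if (looks_poisoned(p, size))
        return FREE_DOUBLE;            /* likely freed already */
    memset(p, POISON_BYTE, size);      /* poison to expose use-after-free */
    return FREE_OK;
}
```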

In summary, concurrency, alignment, and correctness are intertwined. The core choices you make for thread-local caching, lock granularity, and alignment rounding will determine whether your allocator scales and whether it fails safely when misuse happens.

How This Fits Into Projects

This concept drives the thread-safe design in Section 3.2 Functional Requirements, the arena structure in Section 4.4 Data Structures, and the concurrency decisions in Section 5.10 Phase 3. It also connects to P02-work-stealing-thread-pool.md for memory ordering and P05-high-performance-string-search.md for alignment-sensitive SIMD.

Definitions & Key Terms

  • Arena: A per-thread or per-CPU heap region.
  • Thread cache: A small per-thread store of recently freed blocks.
  • Alignment: The requirement that pointers are multiples of a specific power of two.
  • ABA problem: A lock-free hazard where a value changes from A to B back to A.
  • Poisoning: Writing known patterns to freed memory to catch use-after-free.

Mental Model Diagram (ASCII)

Thread A: cache -> arena -> OS pages
Thread B: cache -> arena -> OS pages
               ^
               |
         global fallback

How It Works (Step-by-Step)

  1. Each thread has a local cache for common size classes.
  2. Allocation first checks the local cache; if empty, it refills from the arena.
  3. Free pushes back to the local cache; if the cache is full, it spills to the arena.
  4. Arena operations are protected by a lock or a lock-free structure.
  5. Alignment is enforced when carving blocks from a page.

Invariants:

  • Blocks returned to the user are aligned to alignof(max_align_t) or better.
  • A block is never in two free lists at the same time.

Failure modes:

  • Race conditions that corrupt free lists.
  • Misaligned allocations causing crashes in SIMD code.

Minimal Concrete Example

/* Round n up to a multiple of a; a must be a power of two. */
static inline size_t align_up(size_t n, size_t a) {
    return (n + a - 1) & ~(a - 1);
}
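Building on align_up, here is a sketch of the header-then-aligned-user-pointer computation described earlier (place_user_ptr is an illustrative name, not part of the project API):

```c
#include <stddef.h>
#include <stdint.h>

static inline uintptr_t align_up_ptr(uintptr_t n, uintptr_t a) {
    return (n + a - 1) & ~(a - 1);   /* a must be a power of two */
}

/* Place the user pointer at the first address after the header that
   satisfies the requested alignment. Skipping this rounding makes the
   user pointer inherit the header's misalignment -- a classic bug in
   hand-written allocators. */
void* place_user_ptr(void* block, size_t header_size, size_t align) {
    uintptr_t base = (uintptr_t)block + header_size;
    return (void*)align_up_ptr(base, (uintptr_t)align);
}
```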

Common Misconceptions

  • “Thread-local means no memory overhead” -> It increases fragmentation and RSS.
  • “Alignment only affects speed” -> It can be required for correctness.
  • “Lock-free is always better” -> It can be harder to reason about and slower under contention.

Check-Your-Understanding Questions

  1. Why do thread caches reduce contention but increase memory footprint?
  2. What happens if you return a pointer that is not properly aligned?
  3. Why is the ABA problem relevant for lock-free free lists?

Check-Your-Understanding Answers

  1. Caches keep blocks local to each thread, reducing locking, but block reuse across threads is delayed.
  2. The program can crash or exhibit undefined behavior when accessing aligned types.
  3. A lock-free stack can mistakenly accept a stale pointer if the memory is freed and reallocated.

Real-World Applications

  • jemalloc uses per-arena caching to reduce contention on multi-core servers.
  • High-performance allocators align blocks for SIMD data structures and cache lines.
  • Debug allocators poison freed blocks to catch use-after-free errors.

References

  • “C Interfaces and Implementations” - Memory allocation and alignment.
  • “Rust Atomics and Locks” - Memory ordering basics applicable to allocator locks.
  • “The Linux Programming Interface” - Threading and memory APIs.

Key Insights

Concurrency turns allocation into a scaling problem, and alignment turns it into a correctness problem.

Summary

A modern allocator must scale across cores, maintain strict alignment guarantees, and defend itself from metadata corruption. Thread-local arenas and caches improve throughput, but they require careful tuning to avoid memory bloat.

Homework/Exercises to Practice the Concept

  1. Add alignment rounding to a simple allocator and test with SIMD types.
  2. Implement a per-thread cache for a single size class.
  3. Add a debug mode that poisons freed blocks and detects use-after-free.

Solutions to the Homework/Exercises

  1. Alignment rounding ensures pointers are safe for any type and avoids UB.
  2. A thread cache removes the lock from the hot path at the cost of extra memory.
  3. Poisoning detects invalid reuse and helps debug heap corruption.

3. Project Specification

3.1 What You Will Build

You will build a production-style allocator library in C that can replace malloc, free, calloc, and realloc for real programs. The allocator must implement size classes, per-thread caches, and a page allocator that uses mmap for large allocations and page slabs for small allocations. It must provide a stats and benchmarking CLI that prints allocation throughput, fragmentation metrics, and latency percentiles. The project intentionally excludes a full debug UI and does not aim to be a drop-in replacement for every glibc feature, but it must be stable enough to run under LD_PRELOAD with common CLI programs.

3.2 Functional Requirements

  1. malloc/free compatibility: Provide malloc, free, calloc, and realloc with standard semantics.
  2. Size classes: Implement fixed size classes for small allocations and a separate large-allocation path.
  3. Per-thread caches: Implement thread-local caches with spillover to global arenas.
  4. OS integration: Use mmap for large allocations and to extend arenas.
  5. Statistics API: Provide counters for allocations, frees, bytes in use, and fragmentation estimates.
  6. Benchmark CLI: A CLI tool that runs repeatable allocation benchmarks and prints JSON results.
  7. Debug mode: A build flag that enables optional canaries or poisoning checks.

3.3 Non-Functional Requirements

  • Performance: Allocation and free should be O(1) in the common path; benchmark should show competitive throughput vs system malloc on at least two workloads.
  • Reliability: Must not crash on valid inputs; debug mode should detect and report double frees.
  • Usability: Clear build instructions and a stable C API with documented behavior.

3.4 Example Usage / Output

$ ./allocbench --seed 1337 --ops 200000 --profile mixed
allocbench 1.0
profile=mixed seed=1337 ops=200000
alloc/s: 18,420,000
free/s: 18,410,000
p50 latency: 62 ns
p99 latency: 410 ns
fragmentation: 9.8%

3.5 Data Formats / Schemas / Protocols

The benchmark tool outputs a JSON report:

{
  "profile": "mixed",
  "seed": 1337,
  "ops": 200000,
  "alloc_per_sec": 18420000,
  "free_per_sec": 18410000,
  "latency_ns": {"p50": 62, "p99": 410},
  "fragmentation_pct": 9.8
}

3.6 Edge Cases

  • malloc(0) returns either NULL or a unique pointer that can be freed.
  • realloc(NULL, n) behaves like malloc(n).
  • realloc(p, 0) frees p and returns NULL (one common convention; the C standard leaves zero-size realloc implementation-defined).
  • Double free is detected and reported in debug mode.
  • Very large allocations trigger mmap path and can fail gracefully.
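These edge cases can be illustrated with a toy wrapper over the system malloc. The demo_* names are invented for this sketch; a real allocator would consult its own metadata for the old size instead of stashing a header:

```c
#include <stdlib.h>
#include <string.h>
#include <stddef.h>

/* Union with max_align_t so the user pointer (h + 1) stays aligned
   for any type. */
typedef union { size_t size; max_align_t align; } hdr_t;

void* demo_malloc(size_t n) {
    if (n == 0) n = 1;                 /* malloc(0): return unique pointer */
    hdr_t* h = malloc(sizeof(hdr_t) + n);
    if (!h) return NULL;
    h->size = n;
    return h + 1;
}

void demo_free(void* p) {
    if (p) free((hdr_t*)p - 1);        /* free(NULL) is a no-op */
}

void* demo_realloc(void* p, size_t n) {
    if (!p) return demo_malloc(n);     /* realloc(NULL, n) == malloc(n) */
    if (n == 0) { demo_free(p); return NULL; }   /* one common convention */
    size_t old = ((hdr_t*)p - 1)->size;
    void* q = demo_malloc(n);
    if (!q) return NULL;               /* on failure the old block survives */
    memcpy(q, p, old < n ? old : n);   /* preserve the common prefix */
    demo_free(p);
    return q;
}
```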

3.7 Real World Outcome

A working allocator that can be preloaded into a real program and benchmarks that can be repeated deterministically.

3.7.1 How to Run (Copy/Paste)

# build
make

# run a deterministic benchmark
./allocbench --seed 1337 --ops 200000 --profile mixed

# preload into a program
LD_PRELOAD=./liballoc.so ls >/dev/null

3.7.2 Golden Path Demo (Deterministic)

  • Use a fixed seed of 1337 and a fixed workload profile.
  • Expect stable throughput within a small variance across runs.

3.7.3 If CLI: Exact Terminal Transcript

$ ./allocbench --seed 1337 --ops 100000 --profile small
allocbench 1.0
profile=small seed=1337 ops=100000
alloc/s: 21,010,000
free/s: 21,000,000
p50 latency: 52 ns
p99 latency: 280 ns
fragmentation: 6.1%
exit_code=0

$ ./allocbench --profile unknown
allocbench: invalid profile "unknown"
usage: allocbench --profile {small,mixed,large}
exit_code=2

3.7.4 If Web App

Not applicable.

3.7.5 If API

Not applicable.

3.7.6 If Library

Install/import

// link against liballoc.so or liballoc.a
#include "alloc.h"

Minimal usage

void* p = alloc_malloc(64);
if (p) {
    memset(p, 0, 64);   /* requires <string.h> */
    alloc_free(p);
}

Error handling

void* p = alloc_malloc(1ULL<<40);
if (!p) {
    fprintf(stderr, "alloc: out of memory\n");
}

3.7.7 If GUI / Desktop / Mobile

Not applicable.

3.7.8 If TUI

Not applicable.


4. Solution Architecture

4.1 High-Level Design

+--------------------+      +----------------------+
| Public API (malloc)| ---> | Size Class Allocator |
+--------------------+      +----------------------+
           |                         |
           v                         v
   +---------------+         +-----------------+
   | Thread Cache  | <-----> | Arena (per CPU) |
   +---------------+         +-----------------+
                                       |
                                       v
                               +---------------+
                               | OS Page Layer |
                               +---------------+

4.2 Key Components

| Component | Responsibility | Key Decisions |
|---|---|---|
| API layer | Provide malloc/free/realloc interface | Match standard semantics and alignment |
| Size classes | Fast allocation for small sizes | Class table and rounding strategy |
| Thread cache | Reduce contention | Cache size and spill policy |
| Arena | Owns page slabs and global lists | Lock granularity and coalescing policy |
| OS layer | mmap/munmap and page management | Large allocation threshold |

4.3 Data Structures (No Full Code)

typedef struct alloc_block {
    size_t size;                /* usable size of this block */
    uint32_t flags;             /* e.g. in-use, mmap-backed, debug bits */
    struct alloc_block* next;   /* intrusive free-list link */
} alloc_block_t;

typedef struct alloc_arena {
    pthread_mutex_t lock;                     /* guards the lists below */
    alloc_block_t* free_lists[NUM_CLASSES];   /* one list per size class */
    void* page_slabs;                         /* pages owned by this arena */
} alloc_arena_t;

4.4 Algorithm Overview

Key Algorithm: Allocation

  1. Round size to class and check thread cache.
  2. If cache empty, lock arena and pop from free list.
  3. If free list empty, request pages and split into blocks.
  4. Return aligned user pointer.
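The four steps above can be sketched as a single-class toy: a free-list pop on the fast path and block carving on the slow path. This sketch carves from a static buffer purely for illustration; real code would request pages from the OS at the point marked in the comment:

```c
#include <stddef.h>
#include <string.h>

/* Minimal one-size-class sketch of the allocation path: pop a cached
   block if one exists, otherwise carve a fresh block from a "page". */
enum { BLOCK = 32, PAGE = 4096 };

static unsigned char page[PAGE];
static size_t carved;       /* bytes already carved from the page */
static void *free_list;     /* singly linked list of freed blocks */

void *toy_alloc(void) {
    if (free_list) {                    /* fast path: pop the free list */
        void *p = free_list;
        memcpy(&free_list, p, sizeof free_list);
        return p;
    }
    if (carved + BLOCK <= PAGE) {       /* slow path: carve a new block */
        void *p = page + carved;
        carved += BLOCK;
        return p;
    }
    return NULL;                        /* real code would mmap here */
}

void toy_free(void *p) {
    memcpy(p, &free_list, sizeof free_list);  /* link block into list */
    free_list = p;
}
```

Note how free stores the list link inside the freed block itself, which is why blocks must be at least pointer-sized.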

Complexity Analysis:

  • Time: O(1) on the fast path; the slow path requests and splits a whole page at once, which amortizes to O(1) per allocation.
  • Space: O(n) for allocations plus metadata overhead.

5. Implementation Guide

5.1 Development Environment Setup

sudo apt-get install build-essential
make

5.2 Project Structure

allocator/
|-- src/
|   |-- alloc.c
|   |-- arena.c
|   |-- os_pages.c
|   `-- stats.c
|-- include/
|   `-- alloc.h
|-- tools/
|   `-- allocbench.c
|-- tests/
|   `-- test_alloc.c
`-- Makefile

5.3 The Core Question You’re Answering

“How can you allocate memory in O(1) time while staying correct under fragmentation and concurrency?”

5.4 Concepts You Must Understand First

Stop and research these before coding:

  1. Virtual memory and pages
    • How mmap differs from brk
    • How page alignment affects metadata
  2. Size classes
    • Why allocators round up sizes
    • How size class design impacts fragmentation
  3. Thread synchronization
    • When locks are required for allocator state
    • Why per-thread caches reduce contention
  4. Alignment and UB
    • alignof(max_align_t) requirements
    • Why misalignment can crash SIMD code
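The alignof(max_align_t) requirement boils down to one rounding operation every allocator performs. A small sketch of the standard power-of-two round-up trick:

```c
#include <stdalign.h>
#include <stddef.h>

/* Round a request up to the strictest fundamental alignment, as the
   allocator must do before handing out a block. The bit trick is valid
   because alignof(max_align_t) is a power of two. */
size_t round_up_align(size_t n) {
    size_t a = alignof(max_align_t);   /* typically 16 on x86-64 */
    return (n + a - 1) & ~(a - 1);
}
```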

5.5 Questions to Guide Your Design

  1. How many size classes do you need for small allocations?
  2. Where will you store metadata so it is not corrupted?
  3. What threshold decides between arena and mmap allocation?
  4. How will you detect and report double free in debug mode?

5.6 Thinking Exercise

Trace this by hand: when do you call into the OS?

for (int i = 0; i < 1000; i++) {
    void* p = malloc(24);
    free(p);
}
  • How many pages do you need if the size class is 32 bytes?
  • How often should mmap be called?

5.7 The Interview Questions They’ll Ask

  1. Why do allocators use size classes?
  2. What is internal vs external fragmentation?
  3. How do thread-local caches improve performance?
  4. Why does free usually not return memory to the OS immediately?

5.8 Hints in Layers

Hint 1: Build a single-threaded allocator first

Start with one arena and one free list. Make it correct before making it fast.

Hint 2: Add size classes

Implement a table and a rounding function. Then add a free list per class.
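One possible shape for the table and rounding function (the class values here are illustrative, not a recommended design):

```c
#include <stddef.h>

/* Example size-class table. size_to_class maps a request to the index
   of the smallest class that fits, or -1 for the large-allocation path. */
static const size_t class_sizes[] = {16, 32, 48, 64, 96, 128, 192, 256};
enum { NUM_CLASSES = sizeof class_sizes / sizeof class_sizes[0] };

int size_to_class(size_t n) {
    for (int i = 0; i < NUM_CLASSES; i++)
        if (n <= class_sizes[i])
            return i;                /* smallest class that fits */
    return -1;                       /* exceeds largest small class */
}
```

A linear scan is fine for a first version; a real allocator would replace it with a lookup table or bit arithmetic.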

Hint 3: Add thread caches

Make a small per-thread stash and refill it from the arena when empty.
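A hedged sketch of that stash for a single size class, using C11 _Thread_local storage; arena_pop here is a trivial stand-in for the real locked arena path:

```c
#include <stdlib.h>

enum { CACHE_CAP = 8 };
static _Thread_local void *cache[CACHE_CAP];
static _Thread_local int cache_top;

/* Stand-in for the locked arena allocation path (hypothetical). */
static void *arena_pop(void) { return malloc(32); }

void *cached_alloc(void) {
    if (cache_top > 0)
        return cache[--cache_top];   /* fast path: no lock taken */
    return arena_pop();              /* real code refills a batch here */
}

void cached_free(void *p) {
    if (cache_top < CACHE_CAP)
        cache[cache_top++] = p;      /* stash for reuse by this thread */
    else
        free(p);                     /* spill back to the arena */
}
```

The key property: the common alloc/free pair touches only thread-local state, so no lock is taken and no cache line is shared.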

Hint 4: Add debug checks

Add a canary and detect double free in debug builds.
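One way to structure the debug header, assuming a canary word placed just ahead of the user data; the constants and layout are illustrative:

```c
#include <stdint.h>

#define CANARY 0xC0FFEE00C0FFEE00ULL
#define POISON 0xDEADDEADDEADDEADULL

/* Hypothetical debug header: one canary word in front of user data. */
typedef struct {
    uint64_t canary;          /* CANARY while live, POISON once freed */
    unsigned char data[32];
} dbg_block;

void dbg_init(dbg_block *b) { b->canary = CANARY; }

/* Returns 0 on a valid first free, -1 on double free, -2 on a
   corrupted header (buffer underflow clobbered the canary). */
int dbg_check_free(dbg_block *b) {
    if (b->canary == POISON) return -1;
    if (b->canary != CANARY) return -2;
    b->canary = POISON;       /* mark freed so a second free is caught */
    return 0;
}
```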

5.9 Books That Will Help

| Topic | Book | Chapter |
|---|---|---|
| Memory allocation | "C Interfaces and Implementations" | Ch. 5-6 |
| Virtual memory | "Computer Systems: A Programmer's Perspective" | Ch. 9 |
| Debugging | "The Linux Programming Interface" | Memory chapters |

5.10 Implementation Phases

Phase 1: Foundation (1 week)

Goals:

  • Implement OS page allocator and a single free list
  • Provide malloc/free semantics

Tasks:

  1. Build os_pages_alloc and os_pages_free.
  2. Implement a basic free list and header metadata.

Checkpoint: Simple allocation/free test passes.

Phase 2: Size Classes and Coalescing (1 to 2 weeks)

Goals:

  • Add size classes for small allocations
  • Implement coalescing for large blocks

Tasks:

  1. Create size class table and rounding logic.
  2. Implement free lists per class and split pages into blocks.

Checkpoint: Fragmentation tests show stable memory usage.

Phase 3: Concurrency and Polishing (1 to 3 weeks)

Goals:

  • Add thread caches and arena locking
  • Add stats and benchmark CLI

Tasks:

  1. Implement per-thread caches and spill policies.
  2. Add benchmarking tool with deterministic seed.

Checkpoint: Benchmarks show improvement over baseline.

5.11 Key Implementation Decisions

| Decision | Options | Recommendation | Rationale |
|---|---|---|---|
| Large allocation path | mmap or heap | mmap | Avoids heap fragmentation |
| Free list type | singly vs doubly linked | singly | Lower overhead, good enough |
| Coalescing | eager vs lazy | lazy | Reduces free-time overhead |

6. Testing Strategy

6.1 Test Categories

| Category | Purpose | Examples |
|---|---|---|
| Unit Tests | Validate metadata and free lists | size rounding, header integrity |
| Integration Tests | Real workload simulations | mixed allocation patterns |
| Edge Case Tests | Special semantics | realloc(NULL, n), malloc(0) |

6.2 Critical Test Cases

  1. Allocation alignment: Every returned pointer is aligned to alignof(max_align_t).
  2. Double free detection: Debug mode reports error and aborts.
  3. Large allocation path: mmap allocations are returned to OS on free.

6.3 Test Data

workload: mixed
seed: 1337
sizes: 8, 16, 24, 32, 64, 128, 256

7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

| Pitfall | Symptom | Solution |
|---|---|---|
| Misaligned blocks | Crashes in SIMD code | Align to max_align_t or 16 bytes |
| Double free | Random crashes | Add debug canaries and checks |
| Over-aggressive coalescing | High latency | Use lazy coalescing |

7.2 Debugging Strategies

  • Heap poisoning: Fill freed blocks with a pattern and detect reuse.
  • Valgrind/ASAN: Run tests under sanitizers to catch invalid accesses.
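The poisoning strategy is two small routines: fill on free, verify on reuse. A minimal sketch (the pattern byte is an arbitrary illustrative choice):

```c
#include <string.h>

#define POISON_BYTE 0xA5

/* Fill a freed block with a recognizable pattern. */
void poison(void *p, size_t n) {
    memset(p, POISON_BYTE, n);
}

/* On reuse, verify the pattern survived; a mismatch means something
   wrote to the block after it was freed (use-after-free). */
int poison_intact(const void *p, size_t n) {
    const unsigned char *b = p;
    for (size_t i = 0; i < n; i++)
        if (b[i] != POISON_BYTE)
            return 0;
    return 1;
}
```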

7.3 Performance Traps

  • Too many mmap calls on the hot path.
  • Large thread caches causing excessive RSS.

8. Extensions & Challenges

8.1 Beginner Extensions

  • Add a simple alloc_stats() API that returns counts.
  • Add a debug build that logs every allocation.

8.2 Intermediate Extensions

  • Implement per-CPU arenas instead of per-thread arenas.
  • Add optional guard pages for large allocations.

8.3 Advanced Extensions

  • Implement a lock-free free list for a single size class.
  • Integrate huge page support for large slabs.

9. Real-World Connections

9.1 Industry Applications

  • Web servers: High throughput allocators reduce tail latency.
  • Databases: Custom allocators control memory growth and fragmentation.

9.2 Notable Production Allocators

  • jemalloc: Scalable allocator with arenas and extensive stats.
  • tcmalloc: Thread caching allocator from Google.
  • mimalloc: Microsoft allocator focused on fragmentation and security.

9.3 Interview Relevance

  • Explain allocator trade-offs between speed and fragmentation.
  • Describe how thread caches reduce contention.

10. Resources

10.1 Essential Reading

  • “C Interfaces and Implementations” by David Hanson - allocator design.
  • “Computer Systems: A Programmer’s Perspective” - virtual memory.

10.2 Video Resources

  • CMU CS:APP lectures on memory.

10.3 Tools & Documentation

  • man mmap, man brk, man malloc for OS semantics.
  • valgrind and asan for debugging.

11. Self-Assessment Checklist

11.1 Understanding

  • I can explain why size classes reduce allocation time.
  • I can explain internal vs external fragmentation.
  • I can explain why mmap is used for large allocations.

11.2 Implementation

  • All functional requirements are met.
  • The allocator passes all tests under ASAN.
  • The benchmark output is deterministic with a fixed seed.

11.3 Growth

  • I can describe a trade-off I would change for a different workload.
  • I can explain allocator design choices in an interview.

12. Submission / Completion Criteria

Minimum Viable Completion:

  • Basic malloc/free works with size classes.
  • Benchmarks run with deterministic seed.
  • No crashes on standard tests.

Full Completion:

  • Thread caches implemented and tested.
  • Debug mode detects double free.

Excellence (Going Above & Beyond):

  • Competes with system malloc on a real workload.
  • Includes optional huge page support.