P01: Custom Memory Allocator

Build a production-ready memory allocator that can replace malloc in real programs

Attribute              Value
Main Language          C
Alternative Languages  C++, Rust, Zig
Difficulty             Level 3: Advanced
Coolness Level         Level 4: Hardcore Tech Flex
Knowledge Area         Memory Management, Systems Programming
Key Tools              glibc, jemalloc, valgrind, LD_PRELOAD
Main Book              "C Interfaces and Implementations" - David Hanson

Learning Objectives

By completing this project, you will:

  1. Understand memory allocation internals - Know exactly what happens when you call malloc() and free()
  2. Master free list data structures - Implement and optimize the core data structure behind allocators
  3. Handle fragmentation - Understand internal vs external fragmentation and implement coalescing
  4. Design thread-safe systems - Make your allocator work correctly in multi-threaded programs
  5. Profile and benchmark - Measure your allocator against glibc and jemalloc
  6. Deploy real software - Use LD_PRELOAD to run actual programs with your allocator

The Core Question

"Why is malloc slow, and what does it actually take to make a fast, correct memory allocator?"

This question separates developers who use memory from those who understand memory. Most programmers treat malloc as magic. After this project, you'll answer:

  • Why does malloc sometimes take 10ns and sometimes 10µs?
  • Why does allocating 16 bytes waste 8 more bytes on metadata?
  • Why doesn't free always return memory to the OS?
  • Why do jemalloc and tcmalloc exist when glibc already has malloc?

Memory allocators sit at the intersection of three hard problems:

  1. Speed - O(1) allocation without searching
  2. Memory efficiency - Minimizing fragmentation and overhead
  3. Thread safety - Scaling to many cores without lock contention

You cannot optimize all three. This project teaches you the tradeoffs by making you choose.


Deep Theoretical Foundation

1. Virtual Memory and Address Spaces

Before writing an allocator, you must understand where memory comes from.

Virtual vs Physical Addresses

When your program accesses address 0x7f8a2c001000, thatโ€™s a virtual address. The CPUโ€™s Memory Management Unit (MMU) translates it to a physical RAM address. This translation happens via page tables maintained by the OS kernel.

Virtual Address Space (per process):
┌─────────────────────────────────────┐ 0xFFFFFFFFFFFFFFFF
│         Kernel Space                │
├─────────────────────────────────────┤
│              Stack                  │ ← grows downward
│                ↓                    │
│                                     │
│            (unused)                 │
│                                     │
│                ↑                    │
│              Heap                   │ ← grows upward (where malloc lives)
├─────────────────────────────────────┤
│         BSS (uninitialized)         │
├─────────────────────────────────────┤
│         Data (initialized)          │
├─────────────────────────────────────┤
│         Text (code)                 │
└─────────────────────────────────────┘ 0x0000000000000000

Getting Memory from the OS

Your allocator needs to request memory from the kernel. Two primary mechanisms:

  1. sbrk()/brk() - Extends the heap segment
    • Contiguous growth (easy to manage)
    • Cannot release memory in the middle (only from the top)
    • Deprecated on some systems
  2. mmap() - Maps anonymous pages anywhere in the address space
    • Can release individual regions via munmap()
    • Higher per-call overhead
    • Modern allocators prefer this for large allocations

// sbrk example: extend heap by 4KB
void* ptr = sbrk(4096);  // Returns old break address

// mmap example: get 4KB of anonymous memory
void* ptr = mmap(NULL, 4096,
                 PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS,
                 -1, 0);

Page Granularity

The OS allocates memory in pages (typically 4KB). When you request 100 bytes:

  • The allocator asks the OS for at least one page (4096 bytes)
  • Your 100 bytes come from that page
  • The remaining 3996 bytes stay in the allocator's pool for future requests
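The bookkeeping above boils down to one rounding step. A minimal sketch (`round_to_pages` is an invented helper name; real code would query the page size at runtime with `sysconf(_SC_PAGESIZE)` rather than hard-coding it):

```c
#include <stddef.h>

/* Round a request up to a whole number of pages.
   The page size is passed in so callers can obtain it once
   (e.g. via sysconf(_SC_PAGESIZE)) and reuse it. */
static size_t round_to_pages(size_t request, size_t page_size) {
    return (request + page_size - 1) / page_size * page_size;
}
```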

2. Metadata: The Hidden Cost of malloc

Every allocation needs bookkeeping. When you call malloc(100), the allocator must remember:

  • How big is this block? (Needed for realloc and coalescing)
  • Is it currently in use? (Needed to know if it's available)

Where to Store Metadata?

Option 1: Header before the block (most common)

Memory Layout:
[Header: size=108, used=1][User Data: 100 bytes]
                          ^
                          Pointer returned to user

When free(ptr) is called, the allocator backs up from ptr to find the header:

void free(void* ptr) {
    block_header_t* header = (block_header_t*)ptr - 1;
    // Now we know the size and can mark it free
}

Option 2: Boundary tags (header + footer)

[Header: size=108][User Data: 100 bytes][Footer: size=108]

The footer enables finding the previous block during coalescing.

Option 3: Separate metadata table

User blocks: [Block 0][Block 1][Block 2]...
Metadata:    [Entry 0][Entry 1][Entry 2]...

Isolates user data from metadata (prevents corruption) but requires lookup.

Minimum Block Size

Even if a user requests 1 byte, your block needs space for:

  • Header (8-16 bytes typically)
  • User data (rounded up for alignment)
  • When free: a "next" pointer to link in free list (8 bytes on 64-bit)

Typical minimum: 16-32 bytes per allocation.

3. Free Lists: The Core Data Structure

A free list is a linked list of available memory blocks. The key insight: use the free block's memory itself to store the link.

Free List Structure:
                                    ┌─────────────────┐
head ─────────────────────────────→ │ size: 64        │
                                    │ next: ──────────┼──┐
                                    │ [free memory]   │  │
                                    └─────────────────┘  │
                                                         ↓
                                    ┌─────────────────┐
                                    │ size: 128       │
                                    │ next: ──────────┼──┐
                                    │ [free memory]   │  │
                                    └─────────────────┘  │
                                                         ↓
                                    ┌─────────────────┐
                                    │ size: 256       │
                                    │ next: NULL      │
                                    │ [free memory]   │
                                    └─────────────────┘

Allocation Strategies

When searching for a block to satisfy malloc(100):

Strategy   Description                            Pros                   Cons
First-fit  Return first block ≥ size              Fast, O(1) average     Fragments beginning of list
Best-fit   Return smallest block ≥ size           Minimizes waste        O(n) search; creates small unusable fragments
Worst-fit  Return largest block                   Avoids tiny fragments  O(n) search; poor utilization
Next-fit   First-fit starting from last position  Spreads allocations    Still O(n) worst case
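For a concrete reference point, first-fit can be sketched against a simplified free-list node (the standalone `node_t` type here is an illustration; in a real allocator the link lives inside the block header itself):

```c
#include <stddef.h>

typedef struct node {
    size_t size;          /* usable bytes in this free block */
    struct node *next;    /* next free block, or NULL */
} node_t;

/* First-fit: walk the list and detach the first block that is
   large enough. Returns NULL if nothing fits (caller must then
   request more memory from the OS). */
static node_t *first_fit(node_t **head, size_t size) {
    node_t **link = head;
    for (node_t *cur = *head; cur; link = &cur->next, cur = cur->next) {
        if (cur->size >= size) {
            *link = cur->next;   /* unlink from the free list */
            return cur;
        }
    }
    return NULL;
}
```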

Splitting Blocks

If you find a 256-byte block for a 64-byte request:

Before: [Header: 256][──────── 256 bytes ────────]

After:  [Header: 64][Used: 64 bytes][Header: 192][Free: 192 bytes]
                    ^                            ^
                    Returned to user             Added to free list

Only split if the remainder is large enough for a useful block (header + minimum data).
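The split rule above can be sketched as follows (`header_t` and `MIN_BLOCK` are simplified stand-ins for your real header type and minimum block size; sizes here include the header):

```c
#include <stddef.h>

typedef struct {
    size_t size;   /* block size in bytes, including this header */
} header_t;

#define MIN_BLOCK 32  /* header + minimum payload (an assumed value) */

/* Carve the tail off a block that is larger than needed.
   Shrinks 'blk' to 'needed' bytes and returns the remainder's
   header, or NULL if the remainder would be too small to be a
   useful free block. */
static header_t *split_block(header_t *blk, size_t needed) {
    if (blk->size < needed + MIN_BLOCK)
        return NULL;                               /* don't split */
    header_t *rest = (header_t *)((unsigned char *)blk + needed);
    rest->size = blk->size - needed;
    blk->size = needed;
    return rest;
}
```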

4. Fragmentation: The Silent Killer

Internal Fragmentation

Wasted space inside allocated blocks:

User requests 17 bytes
Allocator gives 32 bytes (rounded for alignment)
Wasted: 15 bytes (47% overhead!)

External Fragmentation

Free memory exists but is scattered:

After many alloc/free cycles:
[Used:50][Free:50][Used:50][Free:50][Used:50][Free:50]

Total free: 150 bytes
Can allocate 100 bytes? NO! Largest contiguous free: 50 bytes

Coalescing: The Solution

When freeing a block, merge it with adjacent free blocks:

Before free(B):
[A: Free][B: Used][C: Free]

After free(B) with coalescing:
[───── Single Free Block (A+B+C) ─────]

To find adjacent blocks:

  • Next block: Current address + current size
  • Previous block: Need boundary tags (footer) or address-ordered list
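The forward half (merging with the next block) can be sketched using the low-bit in-use encoding shown later in this guide; merging with the previous block additionally needs boundary tags, which this sketch leaves out:

```c
#include <stddef.h>

typedef struct {
    size_t size;   /* total block size; low bit = in-use flag */
} hdr_t;

#define GET_SIZE(h) ((h)->size & ~(size_t)1)
#define IS_USED(h)  ((h)->size & 1)

/* Merge a free block with its successor if that block is also free.
   'heap_end' is one past the last byte this allocator manages, so we
   never read a header outside the heap. */
static void coalesce_next(hdr_t *blk, unsigned char *heap_end) {
    hdr_t *next = (hdr_t *)((unsigned char *)blk + GET_SIZE(blk));
    if ((unsigned char *)next < heap_end && !IS_USED(next))
        blk->size = GET_SIZE(blk) + GET_SIZE(next);
}
```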

5. Size Classes: Why jemalloc is Fast

Instead of one free list, maintain separate lists for common sizes:

Size Class 0 (16 bytes): [Free] -> [Free] -> [Free] -> NULL
Size Class 1 (32 bytes): [Free] -> [Free] -> NULL
Size Class 2 (64 bytes): [Free] -> [Free] -> [Free] -> [Free] -> NULL
Size Class 3 (128 bytes): [Free] -> NULL
...

Benefits:

  • O(1) allocation for small sizes (no searching)
  • Reduced fragmentation (similar sizes grouped)
  • Better cache locality

Common size class schemes:

  • Powers of 2: 16, 32, 64, 128, 256, ...
  • jemalloc-style: 8, 16, 32, 48, 64, 80, 96, 112, 128, ... (more granular)

Large allocations (>4KB typically) bypass size classes and use mmap directly.
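For the power-of-two scheme, mapping a request to its class is a short loop (the function and parameter names here are invented for illustration; requests above the largest class would take the mmap path in the allocator itself):

```c
#include <stddef.h>

/* Map a request to a power-of-two size class: 16, 32, 64, ..., 2048.
   Returns the class index (0 = 16 bytes, 1 = 32 bytes, ...) and,
   optionally, the rounded-up class size. */
static int size_class_index(size_t size, size_t *class_size) {
    size_t cls = 16;
    int idx = 0;
    while (cls < size) {   /* double until the class fits the request */
        cls <<= 1;
        idx++;
    }
    if (class_size) *class_size = cls;
    return idx;
}
```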

6. Thread Safety

The Problem

Without synchronization, two threads calling malloc simultaneously corrupt the free list:

Thread 1: malloc(64)     Thread 2: malloc(64)
         ↓                        ↓
    Read head = A            Read head = A  (same value!)
         ↓                        ↓
    head = A->next           head = A->next
         ↓                        ↓
    Return A                 Return A (SAME BLOCK!)

Solution 1: Global Lock (Simple but Slow)

pthread_mutex_t malloc_lock = PTHREAD_MUTEX_INITIALIZER;

void* malloc(size_t size) {
    pthread_mutex_lock(&malloc_lock);
    // ... allocation logic ...
    pthread_mutex_unlock(&malloc_lock);
    return ptr;
}

All threads serialize through one lock, which kills scalability.

Solution 2: Per-Thread Caches (What jemalloc Does)

Each thread has its own cache of blocks:

Thread 1: Local cache [Block][Block][Block]
Thread 2: Local cache [Block][Block]
Thread 3: Local cache [Block][Block][Block][Block]

Global arena (locked): [Blocks for refilling caches]

Allocations from local cache need no locking. Only refilling from global arena requires synchronization.

False Sharing

Even with per-thread data, placing thread-local variables on the same cache line causes performance collapse:

Cache line (64 bytes):
[Thread1 counter][Thread2 counter][Thread3 counter]...

When Thread1 increments its counter, Thread2's cache line is invalidated!

Solution: Align each thread's structure to its own cache line (trailing padding alone only works if it makes the struct size a multiple of the line size; an alignment attribute is the idiomatic fix):

struct thread_cache {
    // ... actual data ...
} __attribute__((aligned(64)));  // each instance starts on its own cache line

7. Alignment Requirements

Modern CPUs require (or strongly prefer) aligned memory access:

Type           Alignment Requirement
char           1 byte
short          2 bytes
int            4 bytes
long, pointer  8 bytes
SSE vectors    16 bytes
AVX vectors    32 bytes

The C standard requires malloc to return memory suitable for any type. On most systems: 16-byte alignment.

// Rounding up to alignment
size_t aligned_size = (size + 15) & ~15;  // Round up to 16

// Checking alignment
assert((uintptr_t)ptr % 16 == 0);

Project Specification

What You'll Build

A general-purpose memory allocator implementing:

  • void* malloc(size_t size) - Allocate memory
  • void free(void* ptr) - Release memory
  • void* calloc(size_t nmemb, size_t size) - Allocate and zero
  • void* realloc(void* ptr, size_t size) - Resize allocation

Your allocator must:

  1. Be loadable via LD_PRELOAD to replace system malloc
  2. Handle multi-threaded programs correctly
  3. Have competitive performance (within 2x of glibc for common workloads)
  4. Pass memory leak detection (valgrind)

Deliverables

  1. Source code: myalloc.c, myalloc.h
  2. Shared library: libmyalloc.so (built with -shared -fPIC)
  3. Benchmark suite: Comparing against glibc and jemalloc
  4. Test suite: Unit tests and stress tests
  5. Documentation: Design decisions and tradeoffs made

Success Criteria

# Basic functionality
$ gcc -shared -fPIC -o libmyalloc.so myalloc.c -lpthread
$ LD_PRELOAD=./libmyalloc.so ls
# Should complete without crashes

# Runs Python
$ LD_PRELOAD=./libmyalloc.so python3 -c "print('Hello!')"
Hello!

# Passes valgrind
$ valgrind --leak-check=full ./test_allocator
All heap blocks were freed -- no leaks possible

# Benchmark output
$ ./bench_allocator
Your allocator: 1,234,567 allocs/sec
glibc malloc:   2,345,678 allocs/sec
Ratio: 0.53x  # Within acceptable range

Solution Architecture

Component Overview

┌──────────────────────────────────────────────────────────────────┐
│                        Public API                                │
│  malloc() / free() / calloc() / realloc()                        │
└────────────────────────────┬─────────────────────────────────────┘
                             │
┌────────────────────────────▼─────────────────────────────────────┐
│                    Thread-Local Cache                            │
│  ┌──────────┐ ┌──────────┐ ┌──────────┐                          │
│  │ Thread 1 │ │ Thread 2 │ │ Thread 3 │  ... (one per thread)    │
│  │  Cache   │ │  Cache   │ │  Cache   │                          │
│  └────┬─────┘ └────┬─────┘ └────┬─────┘                          │
└───────┼────────────┼────────────┼────────────────────────────────┘
        │            │            │
        └────────────┼────────────┘
                     │ (when cache empty/full)
┌────────────────────▼─────────────────────────────────────────────┐
│                    Central Arena (Locked)                        │
│  ┌──────────────────────────────────────────────────────────┐    │
│  │              Size Class Free Lists                       │    │
│  │  Class 0: [16B] → [16B] → [16B]                          │    │
│  │  Class 1: [32B] → [32B]                                  │    │
│  │  Class 2: [64B] → [64B] → [64B] → [64B]                  │    │
│  │  ...                                                     │    │
│  └──────────────────────────────────────────────────────────┘    │
└────────────────────────────┬─────────────────────────────────────┘
                             │ (when arena needs more memory)
┌────────────────────────────▼─────────────────────────────────────┐
│                    OS Memory (mmap/sbrk)                         │
│  [Page][Page][Page][Page][Page][Page][Page][Page]...             │
└──────────────────────────────────────────────────────────────────┘

Data Structures

Block Header

typedef struct block_header {
    size_t size;           // Size including header (low bit = in-use flag)
    struct block_header* next;  // Next in free list (only when free)
} block_header_t;

// Size encoding trick: use low bit for in-use flag
#define GET_SIZE(header)   ((header)->size & ~1)
#define IS_USED(header)    ((header)->size & 1)
#define SET_USED(header)   ((header)->size |= 1)
#define SET_FREE(header)   ((header)->size &= ~1)

Size Class Array

#define NUM_SIZE_CLASSES 8
#define SIZE_CLASS_THRESHOLD 2048

// Size classes: 16, 32, 64, 128, 256, 512, 1024, 2048
static block_header_t* size_classes[NUM_SIZE_CLASSES];
static pthread_mutex_t class_locks[NUM_SIZE_CLASSES];

Thread-Local Cache

#define CACHE_SIZE 32

typedef struct {
    void* blocks[CACHE_SIZE];
    int count;
} thread_cache_t;

static __thread thread_cache_t thread_caches[NUM_SIZE_CLASSES];

Key Algorithms

Allocation Flow

malloc(size):
1. Round size up to next size class
2. Check thread-local cache for that class
   - If hit: return cached block (no lock!)
3. Lock size class
4. Search size class free list
   - If found: remove from list, unlock, return
5. Request blocks from OS (mmap)
6. Split into size-class blocks
7. Return one, add rest to cache/free list

Deallocation Flow

free(ptr):
1. Get header (ptr - sizeof(header))
2. Determine size class from header
3. If thread-local cache not full:
   - Add to cache (no lock!)
4. Else:
   - Lock size class
   - Add to free list
   - Optionally coalesce with neighbors
   - Unlock

Phased Implementation Guide

Phase 1: Minimal Allocator (Days 1-2)

Goal: A working malloc/free with a single global free list.

Steps:

  1. Define block header structure
  2. Implement get_memory_from_os() using mmap
  3. Implement first-fit malloc():
    • Search free list
    • If no fit, get more from OS
    • Split block if too large
  4. Implement free():
    • Find header, mark as free
    • Add to free list

Test: Can allocate and free without crashing.

Phase 2: Coalescing (Days 2-3)

Goal: Prevent fragmentation by merging adjacent free blocks.

Steps:

  1. Add boundary tags (footer with size)
  2. On free(), check if next block is free → merge
  3. On free(), check if previous block is free → merge
  4. Keep free list address-sorted (simplifies coalescing)

Test: After many alloc/free cycles, large allocations still succeed.

Phase 3: Size Classes (Days 3-4)

Goal: O(1) allocation for common sizes.

Steps:

  1. Define size classes (e.g., powers of 2 up to 2048)
  2. Create separate free list for each class
  3. Round allocation requests up to appropriate class
  4. Large allocations (>2048) go directly to mmap

Test: Benchmark shows significant speedup for small allocations.

Phase 4: Thread Safety (Days 4-5)

Goal: Correct operation in multi-threaded programs.

Steps:

  1. Add mutex per size class (not one global lock)
  2. Lock only when accessing shared free list
  3. Test with multiple threads allocating concurrently

Test: No crashes or corruption with 8+ threads.

Phase 5: Thread-Local Caches (Days 5-7)

Goal: Avoid locking in the common case.

Steps:

  1. Add __thread cache per size class
  2. Allocate from cache first (no lock)
  3. Free to cache first (no lock)
  4. When cache full/empty, batch-transfer to/from global list
  5. Pad structures to avoid false sharing

Test: Near-linear scaling with thread count.
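The lock-free fast paths of steps 2-3 reduce to an array-backed push/pop per size class; a sketch (names follow the data-structures section earlier; the batch transfer to and from the global list is left out):

```c
#include <stddef.h>

#define CACHE_SIZE 32

typedef struct {
    void *blocks[CACHE_SIZE];
    int count;
} thread_cache_t;

/* Fast-path free: stash the block in this thread's cache.
   Returns 0 when the cache is full and the caller must take the
   locked path to the central free list. */
static int cache_push(thread_cache_t *c, void *blk) {
    if (c->count == CACHE_SIZE) return 0;
    c->blocks[c->count++] = blk;
    return 1;
}

/* Fast-path malloc: pop a cached block without locking.
   NULL means the cache is empty and needs a refill. */
static void *cache_pop(thread_cache_t *c) {
    return c->count ? c->blocks[--c->count] : NULL;
}
```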

Phase 6: Benchmarking and Optimization (Days 7-10)

Goal: Competitive performance with glibc.

Steps:

  1. Build comprehensive benchmark suite
  2. Profile with perf to find hotspots
  3. Optimize critical paths
  4. Compare against glibc and jemalloc

Testing Strategy

Unit Tests

// Test basic allocation
void test_basic_malloc() {
    void* p = malloc(100);
    assert(p != NULL);
    memset(p, 0x42, 100);  // Should not crash
    free(p);
}

// Test alignment
void test_alignment() {
    for (int i = 0; i < 1000; i++) {
        void* p = malloc(rand() % 1024 + 1);
        assert((uintptr_t)p % 16 == 0);
        free(p);
    }
}

// Test realloc
void test_realloc() {
    char* p = malloc(10);
    strcpy(p, "hello");
    p = realloc(p, 100);
    assert(strcmp(p, "hello") == 0);
    free(p);
}

Stress Tests

// Fragmentation stress test
void test_fragmentation() {
    void* ptrs[1000];
    for (int i = 0; i < 1000; i++) {
        ptrs[i] = malloc(64);
    }
    // Free every other block
    for (int i = 0; i < 1000; i += 2) {
        free(ptrs[i]);
    }
    // This should still work (coalescing)
    void* big = malloc(32000);
    assert(big != NULL);
    free(big);
}

// Thread stress test
void* thread_func(void* arg) {
    for (int i = 0; i < 10000; i++) {
        void* p = malloc(rand() % 1024);
        free(p);
    }
    return NULL;
}

void test_multithreaded() {
    pthread_t threads[8];
    for (int i = 0; i < 8; i++) {
        pthread_create(&threads[i], NULL, thread_func, NULL);
    }
    for (int i = 0; i < 8; i++) {
        pthread_join(threads[i], NULL);
    }
}

Real-World Testing

# Test with real programs
LD_PRELOAD=./libmyalloc.so ls -la /usr
LD_PRELOAD=./libmyalloc.so find /usr -name "*.so"
LD_PRELOAD=./libmyalloc.so python3 -c "import numpy; print(numpy.zeros(1000))"

# Memory leak detection (valgrind intercepts malloc itself, so an
# LD_PRELOAD'ed allocator does not combine cleanly with it; link the
# allocator into the test binary instead)
gcc -o test_suite test_suite.c myalloc.c -lpthread
valgrind --leak-check=full ./test_suite

Common Pitfalls and Debugging

Pitfall 1: Corrupted Metadata

Symptom: Crashes on free(), or allocating same block twice.

Cause: User code writes past allocation boundary, corrupting header.

Debug:

// Add canary values
struct block_header {
    uint32_t canary_start;  // 0xDEADBEEF
    size_t size;
    struct block_header* next;
    uint32_t canary_end;    // 0xCAFEBABE
};

void check_canaries(block_header_t* h) {
    assert(h->canary_start == 0xDEADBEEF);
    assert(h->canary_end == 0xCAFEBABE);
}

Pitfall 2: Alignment Violations

Symptom: Bus error (SIGBUS) on some architectures, or slow performance.

Cause: Returning unaligned pointer.

Fix:

// Always round header size to alignment
#define HEADER_SIZE ((sizeof(block_header_t) + 15) & ~15)

void* malloc(size_t size) {
    size_t total = HEADER_SIZE + ((size + 15) & ~15);
    // ...
}

Pitfall 3: Double Free

Symptom: Corruption, same block appearing in free list twice.

Debug:

void free(void* ptr) {
    block_header_t* h = get_header(ptr);
    if (!IS_USED(h)) {
        fprintf(stderr, "DOUBLE FREE at %p!\n", ptr);
        abort();
    }
    SET_FREE(h);
    // ...
}

Pitfall 4: Race Conditions

Symptom: Crashes or corruption only under heavy threading.

Debug:

  • Use ThreadSanitizer: gcc -fsanitize=thread
  • Add assertions that check invariants
  • Log all lock/unlock operations

Pitfall 5: Memory Leaks in Allocator

Symptom: Memory usage grows without bound.

Debug:

  • Track total bytes from OS vs bytes in free lists vs bytes allocated
  • Use valgrind on your test programs
  • Add stats printing: malloc_stats() function

Extensions and Challenges

Extension 1: Arena/Pool Allocator

Build a specialized allocator for fixed-size objects:

pool_t* pool = pool_create(sizeof(MyStruct), 1000);
MyStruct* s = pool_alloc(pool);  // O(1), no fragmentation
pool_free(pool, s);              // O(1)
pool_destroy(pool);              // Frees everything
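One plausible way to implement this API (the function names come from the snippet above, but the single-slab layout and intrusive free list are assumptions of this sketch, not a reference design):

```c
#include <stdlib.h>
#include <stddef.h>

/* Fixed-size object pool: one malloc'd slab, with the free list
   threaded through the unused slots themselves. */
typedef struct pool {
    void *slab;        /* backing storage for all slots */
    void *free_head;   /* intrusive free list of unused slots */
} pool_t;

static pool_t *pool_create(size_t obj_size, size_t count) {
    if (obj_size < sizeof(void *)) obj_size = sizeof(void *);
    pool_t *p = malloc(sizeof *p);
    if (!p) return NULL;
    p->slab = malloc(obj_size * count);
    if (!p->slab) { free(p); return NULL; }
    p->free_head = NULL;
    for (size_t i = 0; i < count; i++) {   /* link every slot */
        void *slot = (char *)p->slab + i * obj_size;
        *(void **)slot = p->free_head;
        p->free_head = slot;
    }
    return p;
}

static void *pool_alloc(pool_t *p) {
    void *slot = p->free_head;
    if (slot) p->free_head = *(void **)slot;   /* pop in O(1) */
    return slot;                               /* NULL if exhausted */
}

static void pool_free(pool_t *p, void *slot) {
    *(void **)slot = p->free_head;             /* push in O(1) */
    p->free_head = slot;
}

static void pool_destroy(pool_t *p) {
    free(p->slab);                             /* frees everything at once */
    free(p);
}
```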

Extension 2: Memory-Mapped File Allocator

Persist allocations to a file:

heap_t* h = mmap_heap_open("data.heap");
void* p = mmap_malloc(h, 100);
// Survives program restart!

Extension 3: Debug Allocator

Add rich debugging features:

  • Track allocation call stacks
  • Detect use-after-free (poison freed memory)
  • Detect buffer overflows (guard pages)
  • Generate allocation heatmaps

Extension 4: NUMA-Aware Allocator

On multi-socket systems, allocate from the NUMA node closest to the requesting CPU:

void* numa_malloc(size_t size) {
    int node = get_current_numa_node();
    return allocate_from_node(node, size);
}

Real-World Connections

Where Memory Allocators Matter

  1. High-Frequency Trading: Latency-sensitive systems pre-allocate all memory. Understanding allocators reveals why.

  2. Game Engines: Frame-rate consistency requires predictable allocation. Many games use custom pools.

  3. Databases: B-trees and page buffers use arena allocators for bulk operations.

  4. Web Browsers: Tab isolation often uses separate heaps per tab (security + cleanup).

  5. Embedded Systems: No OS = must implement your own allocator.

Industry Allocators to Study

Allocator  Key Innovation
jemalloc   Thread-local arenas, size classes, extensive metadata
tcmalloc   Per-thread caches, page-level management
mimalloc   Free list sharding, eager page-reset
Hoard      Provably scalable, superblock-based

Resources

Essential Reading

  1. "C Interfaces and Implementations" by David Hanson - Chapter 5 (Arena) and Chapter 6 (Mem)
  2. "Computer Systems: A Programmer's Perspective" by Bryant & O'Hallaron - Chapter 9: Virtual Memory
  3. "The Linux Programming Interface" by Michael Kerrisk - Chapter 7: Memory Allocation

Papers

  • "A Scalable Concurrent malloc Implementation" - Jason Evans (jemalloc)
  • "Hoard: A Scalable Memory Allocator" - Emery Berger
  • "Reconsidering Custom Memory Allocation" - Berger et al.


Self-Assessment Checklist

Before considering this project complete, verify:

Understanding

  • I can explain why malloc needs metadata and where it's stored
  • I can describe internal vs external fragmentation
  • I can explain why coalescing is necessary
  • I can describe three different free list strategies and their tradeoffs
  • I can explain why per-thread caches improve performance
  • I can explain what false sharing is and how to avoid it

Implementation

  • My allocator compiles to a shared library
  • LD_PRELOAD=./libmyalloc.so ls works correctly
  • My allocator handles multi-threaded programs
  • Valgrind reports no leaks in my test programs
  • My benchmarks show reasonable performance (within 2x of glibc, per the success criteria)

Verification

  • I've tested with programs of varying sizes (ls, Python, large apps)
  • I've stress-tested with thousands of alloc/free cycles
  • I've tested with 8+ concurrent threads
  • I've profiled and can explain where time is spent

When you complete this project, you'll have built something that can literally replace a core system library. That's not a toy; it's systems programming at its finest.