P01: Custom Memory Allocator

Build a production-ready memory allocator that can replace malloc in real programs

Attribute              Value
Main Language          C
Alternative Languages  C++, Rust, Zig
Difficulty             Level 3: Advanced
Coolness Level         Level 4: Hardcore Tech Flex
Knowledge Area         Memory Management, Systems Programming
Key Tools              glibc, jemalloc, valgrind, LD_PRELOAD
Main Book              "C Interfaces and Implementations" - David Hanson

Learning Objectives

By completing this project, you will:

  1. Understand memory allocation internals - Know exactly what happens when you call malloc() and free()
  2. Master free list data structures - Implement and optimize the core data structure behind allocators
  3. Handle fragmentation - Understand internal vs external fragmentation and implement coalescing
  4. Design thread-safe systems - Make your allocator work correctly in multi-threaded programs
  5. Profile and benchmark - Measure your allocator against glibc and jemalloc
  6. Deploy real software - Use LD_PRELOAD to run actual programs with your allocator

The Core Question

"Why is malloc slow, and what does it actually take to make a fast, correct memory allocator?"

This question separates developers who use memory from those who understand memory. Most programmers treat malloc as magic. After this project, you'll answer:

  • Why does malloc sometimes take 10ns and sometimes 10µs?
  • Why does allocating 16 bytes waste 8 more bytes on metadata?
  • Why doesn't free always return memory to the OS?
  • Why do jemalloc and tcmalloc exist when glibc already has malloc?

Memory allocators sit at the intersection of three hard problems:

  1. Speed - O(1) allocation without searching
  2. Memory efficiency - Minimizing fragmentation and overhead
  3. Thread safety - Scaling to many cores without lock contention

You cannot optimize all three. This project teaches you the tradeoffs by making you choose.


Deep Theoretical Foundation

1. Virtual Memory and Address Spaces

Before writing an allocator, you must understand where memory comes from.

Virtual vs Physical Addresses

When your program accesses address 0x7f8a2c001000, thatโ€™s a virtual address. The CPUโ€™s Memory Management Unit (MMU) translates it to a physical RAM address. This translation happens via page tables maintained by the OS kernel.

Virtual Address Space (per process):
┌─────────────────────────────────────┐ 0xFFFFFFFFFFFFFFFF
│         Kernel Space                │
├─────────────────────────────────────┤
│              Stack                  │ ← grows downward
│                ↓                    │
│                                     │
│            (unused)                 │
│                                     │
│                ↑                    │
│              Heap                   │ ← grows upward (where malloc lives)
├─────────────────────────────────────┤
│         BSS (uninitialized)         │
├─────────────────────────────────────┤
│         Data (initialized)          │
├─────────────────────────────────────┤
│         Text (code)                 │
└─────────────────────────────────────┘ 0x0000000000000000

Getting Memory from the OS

Your allocator needs to request memory from the kernel. Two primary mechanisms:

  1. sbrk()/brk() - Extends the heap segment
    • Contiguous growth (easy to manage)
    • Cannot release memory in the middle (only from the top)
    • Deprecated on some systems
  2. mmap() - Maps anonymous pages anywhere in the address space
    • Can release individual regions via munmap()
    • Higher per-call overhead
    • Modern allocators prefer this for large allocations

// sbrk example: extend heap by 4KB
void* ptr = sbrk(4096);  // Returns old break address

// mmap example: get 4KB of anonymous memory
void* ptr = mmap(NULL, 4096,
                 PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS,
                 -1, 0);

Page Granularity

The OS allocates memory in pages (typically 4KB). When you request 100 bytes:

  • The allocator asks the OS for at least one page (4096 bytes)
  • Your 100 bytes come from that page
  • The remaining 3996 bytes stay in the allocator's pool for future requests
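The bookkeeping above boils down to one rounding step. A minimal sketch (`round_to_pages` is an invented helper name; real code would query the page size at runtime with `sysconf(_SC_PAGESIZE)` rather than hard-coding it):

```c
#include <stddef.h>

/* Round a request up to a whole number of pages.
   The page size is passed in so callers can obtain it once
   (e.g. via sysconf(_SC_PAGESIZE)) and reuse it. */
static size_t round_to_pages(size_t request, size_t page_size) {
    return (request + page_size - 1) / page_size * page_size;
}
```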

2. Metadata: The Hidden Cost of malloc

Every allocation needs bookkeeping. When you call malloc(100), the allocator must remember:

  • How big is this block? (Needed for realloc and coalescing)
  • Is it currently in use? (Needed to know if it's available)

Where to Store Metadata?

Option 1: Header before the block (most common)

Memory Layout:
[Header: size=108, used=1][User Data: 100 bytes]
                          ^
                          Pointer returned to user

When free(ptr) is called, the allocator backs up from ptr to find the header:

void free(void* ptr) {
    block_header_t* header = (block_header_t*)ptr - 1;
    // Now we know the size and can mark it free
}

Option 2: Boundary tags (header + footer)

[Header: size=108][User Data: 100 bytes][Footer: size=108]

The footer enables finding the previous block during coalescing.

Option 3: Separate metadata table

User blocks: [Block 0][Block 1][Block 2]...
Metadata:    [Entry 0][Entry 1][Entry 2]...

Isolates user data from metadata (prevents corruption) but requires lookup.

Minimum Block Size

Even if a user requests 1 byte, your block needs space for:

  • Header (8-16 bytes typically)
  • User data (rounded up for alignment)
  • When free: a "next" pointer to link in free list (8 bytes on 64-bit)

Typical minimum: 16-32 bytes per allocation.

3. Free Lists: The Core Data Structure

A free list is a linked list of available memory blocks. The key insight: use the free block's memory itself to store the link.

Free List Structure:
                                    ┌─────────────────┐
head ─────────────────────────────→ │ size: 64        │
                                    │ next: ──────────┼──┐
                                    │ [free memory]   │  │
                                    └─────────────────┘  │
                                                         ↓
                                    ┌─────────────────┐
                                    │ size: 128       │
                                    │ next: ──────────┼──┐
                                    │ [free memory]   │  │
                                    └─────────────────┘  │
                                                         ↓
                                    ┌─────────────────┐
                                    │ size: 256       │
                                    │ next: NULL      │
                                    │ [free memory]   │
                                    └─────────────────┘

Allocation Strategies

When searching for a block to satisfy malloc(100):

Strategy   Description                            Pros                   Cons
First-fit  Return first block ≥ size              Fast, O(1) average     Fragments beginning of list
Best-fit   Return smallest block ≥ size           Minimizes waste        O(n) search; creates small unusable fragments
Worst-fit  Return largest block                   Avoids tiny fragments  O(n) search; poor utilization
Next-fit   First-fit starting from last position  Spreads allocations    Still O(n) worst case
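For a concrete reference point, first-fit can be sketched against a simplified free-list node (the standalone `node_t` type here is an illustration; in a real allocator the link lives inside the block header itself):

```c
#include <stddef.h>

typedef struct node {
    size_t size;          /* usable bytes in this free block */
    struct node *next;    /* next free block, or NULL */
} node_t;

/* First-fit: walk the list and detach the first block that is
   large enough. Returns NULL if nothing fits (caller must then
   request more memory from the OS). */
static node_t *first_fit(node_t **head, size_t size) {
    node_t **link = head;
    for (node_t *cur = *head; cur; link = &cur->next, cur = cur->next) {
        if (cur->size >= size) {
            *link = cur->next;   /* unlink from the free list */
            return cur;
        }
    }
    return NULL;
}
```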

Splitting Blocks

If you find a 256-byte block for a 64-byte request:

Before: [Header: 256][──────── 256 bytes ────────]

After:  [Header: 64][Used: 64 bytes][Header: 192][Free: 192 bytes]
                    ^                            ^
                    Returned to user             Added to free list

Only split if the remainder is large enough for a useful block (header + minimum data).
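The split rule above can be sketched as follows (`header_t` and `MIN_BLOCK` are simplified stand-ins for your real header type and minimum block size; sizes here include the header):

```c
#include <stddef.h>

typedef struct {
    size_t size;   /* block size in bytes, including this header */
} header_t;

#define MIN_BLOCK 32  /* header + minimum payload (an assumed value) */

/* Carve the tail off a block that is larger than needed.
   Shrinks 'blk' to 'needed' bytes and returns the remainder's
   header, or NULL if the remainder would be too small to be a
   useful free block. */
static header_t *split_block(header_t *blk, size_t needed) {
    if (blk->size < needed + MIN_BLOCK)
        return NULL;                               /* don't split */
    header_t *rest = (header_t *)((unsigned char *)blk + needed);
    rest->size = blk->size - needed;
    blk->size = needed;
    return rest;
}
```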

4. Fragmentation: The Silent Killer

Internal Fragmentation

Wasted space inside allocated blocks:

User requests 17 bytes
Allocator gives 32 bytes (rounded for alignment)
Wasted: 15 bytes (47% overhead!)

External Fragmentation

Free memory exists but is scattered:

After many alloc/free cycles:
[Used:50][Free:50][Used:50][Free:50][Used:50][Free:50]

Total free: 150 bytes
Can allocate 100 bytes? NO! Largest contiguous free: 50 bytes

Coalescing: The Solution

When freeing a block, merge it with adjacent free blocks:

Before free(B):
[A: Free][B: Used][C: Free]

After free(B) with coalescing:
[───── Single Free Block (A+B+C) ─────]

To find adjacent blocks:

  • Next block: Current address + current size
  • Previous block: Need boundary tags (footer) or address-ordered list
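The forward half (merging with the next block) can be sketched using the low-bit in-use encoding shown later in this guide; merging with the previous block additionally needs boundary tags, which this sketch leaves out:

```c
#include <stddef.h>

typedef struct {
    size_t size;   /* total block size; low bit = in-use flag */
} hdr_t;

#define GET_SIZE(h) ((h)->size & ~(size_t)1)
#define IS_USED(h)  ((h)->size & 1)

/* Merge a free block with its successor if that block is also free.
   'heap_end' is one past the last byte this allocator manages, so we
   never read a header outside the heap. */
static void coalesce_next(hdr_t *blk, unsigned char *heap_end) {
    hdr_t *next = (hdr_t *)((unsigned char *)blk + GET_SIZE(blk));
    if ((unsigned char *)next < heap_end && !IS_USED(next))
        blk->size = GET_SIZE(blk) + GET_SIZE(next);
}
```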

5. Size Classes: Why jemalloc is Fast

Instead of one free list, maintain separate lists for common sizes:

Size Class 0 (16 bytes): [Free] -> [Free] -> [Free] -> NULL
Size Class 1 (32 bytes): [Free] -> [Free] -> NULL
Size Class 2 (64 bytes): [Free] -> [Free] -> [Free] -> [Free] -> NULL
Size Class 3 (128 bytes): [Free] -> NULL
...

Benefits:

  • O(1) allocation for small sizes (no searching)
  • Reduced fragmentation (similar sizes grouped)
  • Better cache locality

Common size class schemes:

  • Powers of 2: 16, 32, 64, 128, 256, ...
  • jemalloc-style: 8, 16, 32, 48, 64, 80, 96, 112, 128, ... (more granular)

Large allocations (>4KB typically) bypass size classes and use mmap directly.
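For the power-of-two scheme, mapping a request to its class is a short loop (the function and parameter names here are invented for illustration; requests above the largest class would take the mmap path in the allocator itself):

```c
#include <stddef.h>

/* Map a request to a power-of-two size class: 16, 32, 64, ..., 2048.
   Returns the class index (0 = 16 bytes, 1 = 32 bytes, ...) and,
   optionally, the rounded-up class size. */
static int size_class_index(size_t size, size_t *class_size) {
    size_t cls = 16;
    int idx = 0;
    while (cls < size) {   /* double until the class fits the request */
        cls <<= 1;
        idx++;
    }
    if (class_size) *class_size = cls;
    return idx;
}
```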

6. Thread Safety

The Problem

Without synchronization, two threads calling malloc simultaneously corrupt the free list:

Thread 1: malloc(64)     Thread 2: malloc(64)
         ↓                        ↓
    Read head = A            Read head = A  (same value!)
         ↓                        ↓
    head = A->next           head = A->next
         ↓                        ↓
    Return A                 Return A (SAME BLOCK!)

Solution 1: Global Lock (Simple but Slow)

pthread_mutex_t malloc_lock = PTHREAD_MUTEX_INITIALIZER;

void* malloc(size_t size) {
    pthread_mutex_lock(&malloc_lock);
    // ... allocation logic ...
    pthread_mutex_unlock(&malloc_lock);
    return ptr;
}

All threads serialize through one lock, which kills scalability.

Solution 2: Per-Thread Caches (What jemalloc Does)

Each thread has its own cache of blocks:

Thread 1: Local cache [Block][Block][Block]
Thread 2: Local cache [Block][Block]
Thread 3: Local cache [Block][Block][Block][Block]

Global arena (locked): [Blocks for refilling caches]

Allocations from local cache need no locking. Only refilling from global arena requires synchronization.

False Sharing

Even with per-thread data, placing thread-local variables on the same cache line causes performance collapse:

Cache line (64 bytes):
[Thread1 counter][Thread2 counter][Thread3 counter]...

When Thread1 increments its counter, Thread2's cache line is invalidated!

Solution: Align each thread's structure to its own cache line (trailing padding alone only works if it makes the struct size a multiple of the line size; an alignment attribute is the idiomatic fix):

struct thread_cache {
    // ... actual data ...
} __attribute__((aligned(64)));  // each instance starts on its own cache line

7. Alignment Requirements

Modern CPUs require (or strongly prefer) aligned memory access:

Type           Alignment Requirement
char           1 byte
short          2 bytes
int            4 bytes
long, pointer  8 bytes
SSE vectors    16 bytes
AVX vectors    32 bytes

The C standard requires malloc to return memory suitable for any type. On most systems: 16-byte alignment.

// Rounding up to alignment
size_t aligned_size = (size + 15) & ~15;  // Round up to 16

// Checking alignment
assert((uintptr_t)ptr % 16 == 0);

Project Specification

What You'll Build

A general-purpose memory allocator implementing:

  • void* malloc(size_t size) - Allocate memory
  • void free(void* ptr) - Release memory
  • void* calloc(size_t nmemb, size_t size) - Allocate and zero
  • void* realloc(void* ptr, size_t size) - Resize allocation

Your allocator must:

  1. Be loadable via LD_PRELOAD to replace system malloc
  2. Handle multi-threaded programs correctly
  3. Have competitive performance (within 2x of glibc for common workloads)
  4. Pass memory leak detection (valgrind)

Deliverables

  1. Source code: myalloc.c, myalloc.h
  2. Shared library: libmyalloc.so (built with -shared -fPIC)
  3. Benchmark suite: Comparing against glibc and jemalloc
  4. Test suite: Unit tests and stress tests
  5. Documentation: Design decisions and tradeoffs made

Success Criteria

# Basic functionality
$ gcc -shared -fPIC -o libmyalloc.so myalloc.c -lpthread
$ LD_PRELOAD=./libmyalloc.so ls
# Should complete without crashes

# Runs Python
$ LD_PRELOAD=./libmyalloc.so python3 -c "print('Hello!')"
Hello!

# Passes valgrind
$ valgrind --leak-check=full ./test_allocator
All heap blocks were freed -- no leaks possible

# Benchmark output
$ ./bench_allocator
Your allocator: 1,234,567 allocs/sec
glibc malloc:   2,345,678 allocs/sec
Ratio: 0.53x  # Within acceptable range

Solution Architecture

Component Overview

┌──────────────────────────────────────────────────────────────────┐
│                        Public API                                │
│  malloc() / free() / calloc() / realloc()                        │
└────────────────────────────┬─────────────────────────────────────┘
                             │
┌────────────────────────────▼─────────────────────────────────────┐
│                    Thread-Local Cache                            │
│  ┌──────────┐ ┌──────────┐ ┌──────────┐                          │
│  │ Thread 1 │ │ Thread 2 │ │ Thread 3 │  ... (one per thread)    │
│  │  Cache   │ │  Cache   │ │  Cache   │                          │
│  └────┬─────┘ └────┬─────┘ └────┬─────┘                          │
└───────┼────────────┼────────────┼────────────────────────────────┘
        │            │            │
        └────────────┼────────────┘
                     │ (when cache empty/full)
┌────────────────────▼─────────────────────────────────────────────┐
│                    Central Arena (Locked)                        │
│  ┌──────────────────────────────────────────────────────────┐    │
│  │              Size Class Free Lists                       │    │
│  │  Class 0: [16B] → [16B] → [16B]                          │    │
│  │  Class 1: [32B] → [32B]                                  │    │
│  │  Class 2: [64B] → [64B] → [64B] → [64B]                  │    │
│  │  ...                                                     │    │
│  └──────────────────────────────────────────────────────────┘    │
└────────────────────────────┬─────────────────────────────────────┘
                             │ (when arena needs more memory)
┌────────────────────────────▼─────────────────────────────────────┐
│                    OS Memory (mmap/sbrk)                         │
│  [Page][Page][Page][Page][Page][Page][Page][Page]...             │
└──────────────────────────────────────────────────────────────────┘

Data Structures

Block Header

typedef struct block_header {
    size_t size;           // Size including header (low bit = in-use flag)
    struct block_header* next;  // Next in free list (only when free)
} block_header_t;

// Size encoding trick: use low bit for in-use flag
#define GET_SIZE(header)   ((header)->size & ~1)
#define IS_USED(header)    ((header)->size & 1)
#define SET_USED(header)   ((header)->size |= 1)
#define SET_FREE(header)   ((header)->size &= ~1)

Size Class Array

#define NUM_SIZE_CLASSES 8
#define SIZE_CLASS_THRESHOLD 2048

// Size classes: 16, 32, 64, 128, 256, 512, 1024, 2048
static block_header_t* size_classes[NUM_SIZE_CLASSES];
static pthread_mutex_t class_locks[NUM_SIZE_CLASSES];

Thread-Local Cache

#define CACHE_SIZE 32

typedef struct {
    void* blocks[CACHE_SIZE];
    int count;
} thread_cache_t;

static __thread thread_cache_t thread_caches[NUM_SIZE_CLASSES];

Key Algorithms

Allocation Flow

malloc(size):
1. Round size up to next size class
2. Check thread-local cache for that class
   - If hit: return cached block (no lock!)
3. Lock size class
4. Search size class free list
   - If found: remove from list, unlock, return
5. Request blocks from OS (mmap)
6. Split into size-class blocks
7. Return one, add rest to cache/free list

Deallocation Flow

free(ptr):
1. Get header (ptr - sizeof(header))
2. Determine size class from header
3. If thread-local cache not full:
   - Add to cache (no lock!)
4. Else:
   - Lock size class
   - Add to free list
   - Optionally coalesce with neighbors
   - Unlock

Phased Implementation Guide

Phase 1: Minimal Allocator (Days 1-2)

Goal: A working malloc/free with a single global free list.

Steps:

  1. Define block header structure
  2. Implement get_memory_from_os() using mmap
  3. Implement first-fit malloc():
    • Search free list
    • If no fit, get more from OS
    • Split block if too large
  4. Implement free():
    • Find header, mark as free
    • Add to free list

Test: Can allocate and free without crashing.

Phase 2: Coalescing (Days 2-3)

Goal: Prevent fragmentation by merging adjacent free blocks.

Steps:

  1. Add boundary tags (footer with size)
  2. On free(), check if next block is free → merge
  3. On free(), check if previous block is free → merge
  4. Keep free list address-sorted (simplifies coalescing)

Test: After many alloc/free cycles, large allocations still succeed.

Phase 3: Size Classes (Days 3-4)

Goal: O(1) allocation for common sizes.

Steps:

  1. Define size classes (e.g., powers of 2 up to 2048)
  2. Create separate free list for each class
  3. Round allocation requests up to appropriate class
  4. Large allocations (>2048) go directly to mmap

Test: Benchmark shows significant speedup for small allocations.

Phase 4: Thread Safety (Days 4-5)

Goal: Correct operation in multi-threaded programs.

Steps:

  1. Add mutex per size class (not one global lock)
  2. Lock only when accessing shared free list
  3. Test with multiple threads allocating concurrently

Test: No crashes or corruption with 8+ threads.

Phase 5: Thread-Local Caches (Days 5-7)

Goal: Avoid locking in the common case.

Steps:

  1. Add __thread cache per size class
  2. Allocate from cache first (no lock)
  3. Free to cache first (no lock)
  4. When cache full/empty, batch-transfer to/from global list
  5. Pad structures to avoid false sharing

Test: Near-linear scaling with thread count.
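The lock-free fast paths of steps 2-3 reduce to an array-backed push/pop per size class; a sketch (names follow the data-structures section earlier; the batch transfer to and from the global list is left out):

```c
#include <stddef.h>

#define CACHE_SIZE 32

typedef struct {
    void *blocks[CACHE_SIZE];
    int count;
} thread_cache_t;

/* Fast-path free: stash the block in this thread's cache.
   Returns 0 when the cache is full and the caller must take the
   locked path to the central free list. */
static int cache_push(thread_cache_t *c, void *blk) {
    if (c->count == CACHE_SIZE) return 0;
    c->blocks[c->count++] = blk;
    return 1;
}

/* Fast-path malloc: pop a cached block without locking.
   NULL means the cache is empty and needs a refill. */
static void *cache_pop(thread_cache_t *c) {
    return c->count ? c->blocks[--c->count] : NULL;
}
```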

Phase 6: Benchmarking and Optimization (Days 7-10)

Goal: Competitive performance with glibc.

Steps:

  1. Build comprehensive benchmark suite
  2. Profile with perf to find hotspots
  3. Optimize critical paths
  4. Compare against glibc and jemalloc

Testing Strategy

Unit Tests

// Test basic allocation
void test_basic_malloc() {
    void* p = malloc(100);
    assert(p != NULL);
    memset(p, 0x42, 100);  // Should not crash
    free(p);
}

// Test alignment
void test_alignment() {
    for (int i = 0; i < 1000; i++) {
        void* p = malloc(rand() % 1024 + 1);
        assert((uintptr_t)p % 16 == 0);
        free(p);
    }
}

// Test realloc
void test_realloc() {
    char* p = malloc(10);
    strcpy(p, "hello");
    p = realloc(p, 100);
    assert(strcmp(p, "hello") == 0);
    free(p);
}

Stress Tests

// Fragmentation stress test
void test_fragmentation() {
    void* ptrs[1000];
    for (int i = 0; i < 1000; i++) {
        ptrs[i] = malloc(64);
    }
    // Free every other block
    for (int i = 0; i < 1000; i += 2) {
        free(ptrs[i]);
    }
    // This should still work (coalescing)
    void* big = malloc(32000);
    assert(big != NULL);
    free(big);
}

// Thread stress test
void* thread_func(void* arg) {
    for (int i = 0; i < 10000; i++) {
        void* p = malloc(rand() % 1024);
        free(p);
    }
    return NULL;
}

void test_multithreaded() {
    pthread_t threads[8];
    for (int i = 0; i < 8; i++) {
        pthread_create(&threads[i], NULL, thread_func, NULL);
    }
    for (int i = 0; i < 8; i++) {
        pthread_join(threads[i], NULL);
    }
}

Real-World Testing

# Test with real programs
LD_PRELOAD=./libmyalloc.so ls -la /usr
LD_PRELOAD=./libmyalloc.so find /usr -name "*.so"
LD_PRELOAD=./libmyalloc.so python3 -c "import numpy; print(numpy.zeros(1000))"

# Memory leak detection (valgrind intercepts malloc itself, so an
# LD_PRELOAD'ed allocator does not combine cleanly with it; link the
# allocator into the test binary instead)
gcc -o test_suite test_suite.c myalloc.c -lpthread
valgrind --leak-check=full ./test_suite

Common Pitfalls and Debugging

Pitfall 1: Corrupted Metadata

Symptom: Crashes on free(), or allocating same block twice.

Cause: User code writes past allocation boundary, corrupting header.

Debug:

// Add canary values
struct block_header {
    uint32_t canary_start;  // 0xDEADBEEF
    size_t size;
    struct block_header* next;
    uint32_t canary_end;    // 0xCAFEBABE
};

void check_canaries(block_header_t* h) {
    assert(h->canary_start == 0xDEADBEEF);
    assert(h->canary_end == 0xCAFEBABE);
}

Pitfall 2: Alignment Violations

Symptom: Bus error (SIGBUS) on some architectures, or slow performance.

Cause: Returning unaligned pointer.

Fix:

// Always round header size to alignment
#define HEADER_SIZE ((sizeof(block_header_t) + 15) & ~15)

void* malloc(size_t size) {
    size_t total = HEADER_SIZE + ((size + 15) & ~15);
    // ...
}

Pitfall 3: Double Free

Symptom: Corruption, same block appearing in free list twice.

Debug:

void free(void* ptr) {
    block_header_t* h = get_header(ptr);
    if (!IS_USED(h)) {
        fprintf(stderr, "DOUBLE FREE at %p!\n", ptr);
        abort();
    }
    SET_FREE(h);
    // ...
}

Pitfall 4: Race Conditions

Symptom: Crashes or corruption only under heavy threading.

Debug:

  • Use ThreadSanitizer: gcc -fsanitize=thread
  • Add assertions that check invariants
  • Log all lock/unlock operations

Pitfall 5: Memory Leaks in Allocator

Symptom: Memory usage grows without bound.

Debug:

  • Track total bytes from OS vs bytes in free lists vs bytes allocated
  • Use valgrind on your test programs
  • Add stats printing: malloc_stats() function

Extensions and Challenges

Extension 1: Arena/Pool Allocator

Build a specialized allocator for fixed-size objects:

pool_t* pool = pool_create(sizeof(MyStruct), 1000);
MyStruct* s = pool_alloc(pool);  // O(1), no fragmentation
pool_free(pool, s);              // O(1)
pool_destroy(pool);              // Frees everything
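One plausible way to implement this API (the function names come from the snippet above, but the single-slab layout and intrusive free list are assumptions of this sketch, not a reference design):

```c
#include <stdlib.h>
#include <stddef.h>

/* Fixed-size object pool: one malloc'd slab, with the free list
   threaded through the unused slots themselves. */
typedef struct pool {
    void *slab;        /* backing storage for all slots */
    void *free_head;   /* intrusive free list of unused slots */
} pool_t;

static pool_t *pool_create(size_t obj_size, size_t count) {
    if (obj_size < sizeof(void *)) obj_size = sizeof(void *);
    pool_t *p = malloc(sizeof *p);
    if (!p) return NULL;
    p->slab = malloc(obj_size * count);
    if (!p->slab) { free(p); return NULL; }
    p->free_head = NULL;
    for (size_t i = 0; i < count; i++) {   /* link every slot */
        void *slot = (char *)p->slab + i * obj_size;
        *(void **)slot = p->free_head;
        p->free_head = slot;
    }
    return p;
}

static void *pool_alloc(pool_t *p) {
    void *slot = p->free_head;
    if (slot) p->free_head = *(void **)slot;   /* pop in O(1) */
    return slot;                               /* NULL if exhausted */
}

static void pool_free(pool_t *p, void *slot) {
    *(void **)slot = p->free_head;             /* push in O(1) */
    p->free_head = slot;
}

static void pool_destroy(pool_t *p) {
    free(p->slab);                             /* frees everything at once */
    free(p);
}
```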

Extension 2: Memory-Mapped File Allocator

Persist allocations to a file:

heap_t* h = mmap_heap_open("data.heap");
void* p = mmap_malloc(h, 100);
// Survives program restart!

Extension 3: Debug Allocator

Add rich debugging features:

  • Track allocation call stacks
  • Detect use-after-free (poison freed memory)
  • Detect buffer overflows (guard pages)
  • Generate allocation heatmaps

Extension 4: NUMA-Aware Allocator

On multi-socket systems, allocate from the NUMA node closest to the requesting CPU:

void* numa_malloc(size_t size) {
    int node = get_current_numa_node();
    return allocate_from_node(node, size);
}

Real-World Connections

Where Memory Allocators Matter

  1. High-Frequency Trading: Latency-sensitive systems pre-allocate all memory. Understanding allocators reveals why.

  2. Game Engines: Frame-rate consistency requires predictable allocation. Many games use custom pools.

  3. Databases: B-trees and page buffers use arena allocators for bulk operations.

  4. Web Browsers: Tab isolation often uses separate heaps per tab (security + cleanup).

  5. Embedded Systems: No OS = must implement your own allocator.

Industry Allocators to Study

Allocator  Key Innovation
jemalloc   Thread-local arenas, size classes, extensive metadata
tcmalloc   Per-thread caches, page-level management
mimalloc   Free list sharding, eager page-reset
Hoard      Provably scalable, superblock-based

Resources

Essential Reading

  1. "C Interfaces and Implementations" by David Hanson - Chapter 5 (Arena) and Chapter 6 (Mem)
  2. "Computer Systems: A Programmer's Perspective" by Bryant & O'Hallaron - Chapter 9: Virtual Memory
  3. "The Linux Programming Interface" by Michael Kerrisk - Chapter 7: Memory Allocation

Papers

  • "A Scalable Concurrent malloc Implementation" - Jason Evans (jemalloc)
  • "Hoard: A Scalable Memory Allocator" - Emery Berger
  • "Reconsidering Custom Memory Allocation" - Berger et al.


Self-Assessment Checklist

Before considering this project complete, verify:

Understanding

  • I can explain why malloc needs metadata and where it's stored
  • I can describe internal vs external fragmentation
  • I can explain why coalescing is necessary
  • I can describe three different free list strategies and their tradeoffs
  • I can explain why per-thread caches improve performance
  • I can explain what false sharing is and how to avoid it

Implementation

  • My allocator compiles to a shared library
  • LD_PRELOAD=./libmyalloc.so ls works correctly
  • My allocator handles multi-threaded programs
  • Valgrind reports no leaks in my test programs
  • My benchmarks show reasonable performance (within 2x of glibc, per the success criteria)

Verification

  • I've tested with programs of varying sizes (ls, Python, large apps)
  • I've stress-tested with thousands of alloc/free cycles
  • I've tested with 8+ concurrent threads
  • I've profiled and can explain where time is spent

When you complete this project, you'll have built something that can literally replace a core system library. That's not a toy; it's systems programming at its finest.