Project 5: Custom Memory Allocator
Build a specialized memory allocator (arena/pool allocator) that eliminates malloc from your critical path, because malloc is a latency killer in HFT systems.
Quick Reference
| Attribute | Details |
|---|---|
| Difficulty | Intermediate |
| Time Estimate | 1 week |
| Primary Language | C |
| Alternative Languages | Rust, C++, Zig |
| Knowledge Area | Memory Management, Systems Programming |
| Tools Required | perf, gdb/lldb, valgrind (optional) |
| Primary Reference | “Computer Systems: A Programmer’s Perspective” by Bryant & O’Hallaron, Ch. 9 |
Learning Objectives
By completing this project, you will be able to:
- Explain heap allocation internals including how malloc works, fragmentation types, and why general-purpose allocators are slow
- Design pool allocators with O(1) allocation and deallocation for fixed-size objects
- Implement arena allocators using bump allocation with bulk free semantics
- Measure allocation performance and demonstrate 30x+ latency improvements over malloc
- Integrate custom allocators with language APIs (GlobalAlloc in Rust, operator new in C++)
- Analyze fragmentation and design allocation strategies for specific use cases
- Build thread-safe allocators using thread-local storage to avoid false sharing
Theoretical Foundation
3.1 Core Concepts
What is Heap Allocation?
When you call malloc(n), the allocator must:
- Find a free block of at least n bytes
- Split the block if it’s larger than needed
- Update metadata (free lists, size headers)
- Return a properly aligned pointer
This process involves:
- Searching through data structures (O(log n) or worse)
- Potential system calls for more memory (sbrk, mmap)
- Lock acquisition in multi-threaded environments
malloc(32) - What Actually Happens:
┌─────────────────────────────────────────────────────────────────┐
│ User Request │
│ malloc(32) │
└─────────────────────────────────┬───────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ 1. Check Thread Cache │
│ Thread has cached blocks? ──► Yes ──► Return cached block │
│ │ │
│ ▼ No │
└─────────────────────────────────┬───────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ 2. Search Free Lists │
│ Find block >= 32 bytes in appropriate size class │
│ May need to search multiple bins │
│ Best-fit? First-fit? Next-fit? │
└─────────────────────────────────┬───────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ 3. Split or Coalesce │
│ Block too large? Split and return remainder to free list │
│ Adjacent free blocks? Coalesce them │
│ Update headers and footers │
└─────────────────────────────────┬───────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ 4. Maybe Get More Memory │
│ No suitable block? ──► sbrk() or mmap() │
│ System call: 1000+ cycles, potential page faults │
└─────────────────────────────────┬───────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ 5. Acquire Lock (Multi-threaded) │
│ Global arena lock or per-arena lock │
│ Contention = waiting = latency │
└─────────────────────────────────┬───────────────────────────────┘
│
▼
Return pointer
Total time: 50-500+ nanoseconds (unpredictable!)
Types of Fragmentation
External Fragmentation: Free memory exists but in pieces too small to use.
External Fragmentation Example:
Memory: [USED 64][FREE 16][USED 32][FREE 24][USED 48][FREE 8]
Total free: 16 + 24 + 8 = 48 bytes
Request: malloc(40)
Result: FAILS! (No contiguous 40-byte block)
The memory is there, but scattered.
Internal Fragmentation: Allocated blocks are larger than requested.
Internal Fragmentation Example:
Request: malloc(17)
Allocator returns: 32-byte block (next power of 2)
Wasted: 15 bytes per allocation
With 1 million allocations: 15 MB wasted!
Pool Allocator Concept
A pool allocator pre-allocates many fixed-size blocks and chains them in a free list:
Pool Allocator Structure:
Pool for 64-byte objects:
free_list ──────┐
▼
┌──────────────────────────────────────────────────────────┐
│ MEMORY POOL (pre-allocated from OS) │
│ │
│ ┌───────┐ ┌───────┐ ┌───────┐ ┌───────┐ │
│ │Block 0│──▶│Block 1│──▶│Block 2│──▶│Block 3│──▶ NULL │
│ │64 byte│ │64 byte│ │64 byte│ │64 byte│ │
│ │ FREE │ │ FREE │ │ FREE │ │ FREE │ │
│ └───────┘ └───────┘ └───────┘ └───────┘ │
│ ▲ │
│ │ │
│ next pointer stored IN the free block itself! │
│ │
└──────────────────────────────────────────────────────────┘
Allocation (O(1)):
1. block = free_list
2. free_list = block->next
3. return block
Deallocation (O(1)):
1. block->next = free_list
2. free_list = block
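The two O(1) operations above are small enough to try directly. Here is a minimal, illustrative C sketch (no thread safety, no error checking beyond exhaustion; the `mini_pool_*` names are placeholders, not the project API):

```c
#include <stdlib.h>
#include <stddef.h>

typedef struct free_node { struct free_node *next; } free_node_t;

typedef struct {
    void *blocks;            /* contiguous backing memory */
    free_node_t *free_list;  /* head of intrusive free list */
} mini_pool_t;

/* Carve `capacity` fixed-size blocks and chain them into the free list. */
static int mini_pool_init(mini_pool_t *p, size_t object_size, size_t capacity) {
    if (object_size < sizeof(free_node_t)) object_size = sizeof(free_node_t);
    p->blocks = malloc(object_size * capacity);
    if (!p->blocks) return -1;
    p->free_list = NULL;
    for (size_t i = 0; i < capacity; i++) {
        free_node_t *n = (free_node_t *)((char *)p->blocks + i * object_size);
        n->next = p->free_list;   /* next pointer lives inside the free block */
        p->free_list = n;
    }
    return 0;
}

static void *mini_pool_alloc(mini_pool_t *p) {
    free_node_t *b = p->free_list;
    if (!b) return NULL;          /* pool exhausted */
    p->free_list = b->next;
    return b;
}

static void mini_pool_free(mini_pool_t *p, void *ptr) {
    free_node_t *n = ptr;
    n->next = p->free_list;       /* push back onto the free list (LIFO) */
    p->free_list = n;
}
```

Note the LIFO behavior: the most recently freed block is the next one handed out, which tends to keep the hot block in cache.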
Arena Allocator Concept
An arena allocator uses bump allocation - just increment a pointer:
Arena Allocator with Bump Allocation:
┌──────────────────────────────────────────────────────────┐
│ ARENA (one large pre-allocated buffer) │
│ │
│ base bump_ptr end │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌────────────────────┬─────────────────────────────┐ │
│ │ USED MEMORY │ AVAILABLE SPACE │ │
│ │ obj1 obj2 obj3 ... │ │ │
│ └────────────────────┴─────────────────────────────┘ │
│ │
└──────────────────────────────────────────────────────────┘
Allocation (O(1)):
1. Check: bump_ptr + size <= end
2. ptr = bump_ptr
3. bump_ptr += ALIGN(size)
4. return ptr
Deallocation:
Individual free: NOT SUPPORTED!
Reset entire arena: bump_ptr = base (O(1), frees everything)
Perfect for: Batch processing, request handling, game frames
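The bump-allocation scheme can likewise be sketched in a few lines of C (a minimal illustration with no error handling beyond exhaustion; the `arena_*` names are placeholders, and the 16-byte alignment is an assumption):

```c
#include <stdint.h>
#include <stdlib.h>

#define ARENA_ALIGN 16
#define ALIGN_UP(x, a) (((x) + (a) - 1) & ~((size_t)(a) - 1))

typedef struct {
    char *base, *bump_ptr, *end;
} arena_t;

arena_t arena_create(size_t capacity) {
    arena_t a;
    a.base = a.bump_ptr = malloc(capacity);
    a.end = a.base ? a.base + capacity : NULL;
    return a;
}

void *arena_alloc(arena_t *a, size_t size) {
    size_t n = ALIGN_UP(size, ARENA_ALIGN);
    if (a->bump_ptr + n > a->end) return NULL;  /* arena exhausted */
    void *p = a->bump_ptr;
    a->bump_ptr += n;                           /* just bump the pointer */
    return p;
}

/* Frees everything at once; individual free is not supported by design. */
void arena_reset(arena_t *a) { a->bump_ptr = a->base; }
```

After a reset, previously returned pointers are invalid even though the memory is still mapped, which is exactly the bulk-free semantics described above.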
3.2 Why This Matters for HFT
malloc is Unpredictable
The worst-case latency of malloc can be 1000x the average:
malloc Latency Distribution (real measurements):
Latency (ns) | Frequency
─────────────────────────────
50 | ████████████████████ (typical)
100 | ████████
200 | ███
500 | ██
1000 | █
5000 | ▌ (rare but deadly in HFT)
50000 | ▏ (mmap/sbrk triggered)
p50: 80 ns
p99: 350 ns
p999: 5200 ns ← This is the problem!
max: 50000+ ns ← This is catastrophic
In HFT, a 50-microsecond delay means someone else got the trade.
Pool Allocator Latency Distribution
pool_alloc Latency Distribution:
Latency (ns) | Frequency
─────────────────────────────
5 | ██████████████████████████████
8 | ██████████████
10 | ████
12 | █
15 | ▌ (L1 cache miss)
p50: 6 ns
p99: 10 ns
p999: 14 ns ← Predictable!
max: 20 ns ← Bounded!
The 30x Improvement
| Allocator | p50 | p99 | p999 | Predictability |
|---|---|---|---|---|
| glibc malloc | 80 ns | 350 ns | 5200 ns | Poor |
| jemalloc | 50 ns | 150 ns | 800 ns | Better |
| Pool allocator | 6 ns | 10 ns | 14 ns | Excellent |
3.3 Historical Context
The Evolution of Memory Allocators
Timeline of Memory Allocator Development:
1960s: First-fit allocation
Simple but fragmentation-prone
1970s: Buddy system (Knuth)
Powers of 2, fast coalescing, internal fragmentation
Still used in Linux kernel page allocator
1980s: Doug Lea's malloc (dlmalloc)
Segregated free lists, binning
Became glibc default
1990s: Hoard (Berger et al.)
First scalable multi-threaded allocator
Thread-local caching concept
2000s: tcmalloc (Google)
Thread-caching malloc
Per-thread free lists
2006: jemalloc (Jason Evans)
Arena-based, size classes
Facebook, FreeBSD default
2010s: mimalloc (Microsoft)
Free list sharding, security focus
2019: Scudo (Android)
Security-hardened allocator
HFT practice: Skip all of this, use pools!
Why General-Purpose Allocators Are Complex
General-purpose allocators must handle:
- Any size from 1 byte to gigabytes
- Any allocation/deallocation pattern
- Multi-threaded access
- Memory fragmentation over long runtimes
- Security concerns (heap exploits)
This complexity means unpredictable performance. HFT systems know their allocation patterns and can specialize.
3.4 Common Misconceptions
Misconception 1: “malloc is slow because of system calls”
Reality: Most malloc calls don’t make system calls. The slowness comes from:
- Free list searching
- Lock contention
- Cache misses on metadata
Misconception 2: “Fragmentation only matters for long-running processes”
Reality: Fragmentation affects performance immediately:
- Searching larger free lists
- Worse cache locality
- Potential for failed allocations
Misconception 3: “Pool allocators waste memory”
Reality: They trade internal fragmentation for:
- Deterministic latency
- Zero external fragmentation
- Better cache locality
- Simpler implementation
Misconception 4: “Thread-local allocators solve all contention”
Reality: Thread-local helps but:
- Memory must eventually be reclaimed globally
- Large allocations may still be shared
- False sharing in metadata can still occur
Project Specification
4.1 What You Will Build
A memory allocation library called hft_alloc with three components:
- Fixed-size Pool Allocator: O(1) alloc/free for single-size objects
- Size-class Pool Allocator: Multiple pools for different size ranges
- Arena Allocator: Bump allocation with bulk free
Your HFT Allocator Architecture:
┌─────────────────────────────────────────────────────────────────┐
│ hft_alloc Library │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Thread-Local Pools (per thread) │ │
│ │ │ │
│ │ ┌──────────┐ ┌──────────┐ ┌──────────┐ │ │
│ │ │ 16-byte │ │ 32-byte │ │ 64-byte │ ... │ │
│ │ │ Pool │ │ Pool │ │ Pool │ │ │
│ │ │ free_lst │ │ free_lst │ │ free_lst │ │ │
│ │ └──────────┘ └──────────┘ └──────────┘ │ │
│ │ │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Central Slab Allocator │ │
│ │ (refills thread-local pools) │ │
│ │ │ │
│ │ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │ │
│ │ │ Slab 16 │ │ Slab 32 │ │ Slab 64 │ │Slab 128 │ ... │ │
│ │ └─────────┘ └─────────┘ └─────────┘ └─────────┘ │ │
│ │ │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ System Memory │ │
│ │ (mmap, large pages) │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
4.2 Functional Requirements
Core API:
// Pool allocator for fixed-size objects
typedef struct hft_pool hft_pool_t;
hft_pool_t* hft_pool_create(size_t object_size, size_t initial_capacity);
void* hft_pool_alloc(hft_pool_t* pool);
void hft_pool_free(hft_pool_t* pool, void* ptr);
void hft_pool_destroy(hft_pool_t* pool);
void hft_pool_stats(hft_pool_t* pool, pool_stats_t* stats);
// Arena allocator for bump allocation
typedef struct hft_arena hft_arena_t;
hft_arena_t* hft_arena_create(size_t capacity);
void* hft_arena_alloc(hft_arena_t* arena, size_t size);
void hft_arena_reset(hft_arena_t* arena); // Free everything
void hft_arena_destroy(hft_arena_t* arena);
// Size-class allocator (combines multiple pools)
void hft_init(void); // Call once at startup
void* hft_alloc(size_t size);
void hft_free(void* ptr);
void hft_thread_init(void); // Call per thread
Statistics Reporting:
typedef struct {
size_t object_size;
size_t total_objects;
size_t allocated_objects;
size_t free_objects;
size_t allocations_count;
size_t frees_count;
size_t bytes_allocated;
size_t bytes_overhead;
} pool_stats_t;
4.3 Non-Functional Requirements
| Requirement | Target | Measurement |
|---|---|---|
| Allocation latency | < 10 ns p99 | Benchmark |
| Deallocation latency | < 10 ns p99 | Benchmark |
| Memory overhead | < 5% | Stats reporting |
| Fragmentation | 0% external | By design |
| Thread safety | Lock-free fast path | Code inspection |
4.4 Example Usage and Output
Benchmark Program:
#include "hft_alloc.h"
#include <stdio.h>
#include <time.h>
#define NUM_ALLOCS 1000000
#define OBJECT_SIZE 64
int main() {
// Benchmark malloc
struct timespec start, end;
static void* ptrs[NUM_ALLOCS]; // static: 8 MB of pointers would overflow the stack
clock_gettime(CLOCK_MONOTONIC, &start);
for (int i = 0; i < NUM_ALLOCS; i++) {
ptrs[i] = malloc(OBJECT_SIZE);
}
clock_gettime(CLOCK_MONOTONIC, &end);
double malloc_ns = (end.tv_sec - start.tv_sec) * 1e9 +
(end.tv_nsec - start.tv_nsec);
malloc_ns /= NUM_ALLOCS;
for (int i = 0; i < NUM_ALLOCS; i++) {
free(ptrs[i]);
}
// Benchmark pool allocator
hft_pool_t* pool = hft_pool_create(OBJECT_SIZE, NUM_ALLOCS);
clock_gettime(CLOCK_MONOTONIC, &start);
for (int i = 0; i < NUM_ALLOCS; i++) {
ptrs[i] = hft_pool_alloc(pool);
}
clock_gettime(CLOCK_MONOTONIC, &end);
double pool_ns = (end.tv_sec - start.tv_sec) * 1e9 +
(end.tv_nsec - start.tv_nsec);
pool_ns /= NUM_ALLOCS;
// Statistics
pool_stats_t stats;
hft_pool_stats(pool, &stats);
printf("Memory Allocator Benchmark\n");
printf("==========================\n");
printf("Object size: %zu bytes\n", (size_t)OBJECT_SIZE);
printf("Allocations: %d\n\n", NUM_ALLOCS);
printf("malloc: %.1f ns avg\n", malloc_ns);
printf("pool_alloc: %.1f ns avg\n", pool_ns);
printf("Speedup: %.1fx\n\n", malloc_ns / pool_ns);
printf("Pool Statistics:\n");
printf(" Total objects: %zu\n", stats.total_objects);
printf(" Allocated: %zu\n", stats.allocated_objects);
printf(" Free: %zu\n", stats.free_objects);
printf(" Memory used: %zu bytes\n", stats.bytes_allocated);
printf(" Overhead: %zu bytes (%.1f%%)\n",
stats.bytes_overhead,
100.0 * stats.bytes_overhead / stats.bytes_allocated);
hft_pool_destroy(pool);
return 0;
}
Expected Output:
Memory Allocator Benchmark
==========================
Object size: 64 bytes
Allocations: 1000000
malloc: 85.3 ns avg
pool_alloc: 5.7 ns avg
Speedup: 15.0x
Pool Statistics:
Total objects: 1000000
Allocated: 1000000
Free: 0
Memory used: 64000000 bytes
Overhead: 48 bytes (0.0%)
4.5 Real World Outcome
After completing this project, you will have:
- Latency Benchmark: Demonstrable 15-30x improvement over malloc
- Integration Example: Pool allocator integrated into an order book
- Memory Stats: Zero fragmentation, minimal overhead
- Stress Test: Millions of alloc/free cycles without degradation
Solution Architecture
5.1 High-Level Design
Pool Allocator Internal Structure:
hft_pool_t
┌──────────────────────────────────────────────────────────────────┐
│ │
│ ┌─────────────────┐ ┌─────────────────────────────────────┐ │
│ │ Pool Header │ │ Memory Blocks │ │
│ │ │ │ │ │
│ │ object_size: 64 │ │ ┌──────┐ ┌──────┐ ┌──────┐ │ │
│ │ capacity: 1000 │ │ │Blk 0 │ │Blk 1 │ │Blk 2 │ ... │ │
│ │ allocated: 347 │ │ │ │ │ │ │ │ │ │
│ │ │ │ └──┬───┘ └──┬───┘ └──┬───┘ │ │
│ │ free_list: ─────┼────┼────┘ │ │ │ │
│ │ │ │ ▼ ▼ │ │
│ │ blocks: ────────┼────┼───────────────────────────────────▶│ │
│ │ │ │ │ │
│ │ stats: {...} │ │ Cache-aligned, contiguous memory │ │
│ │ │ │ │ │
│ └─────────────────┘ └─────────────────────────────────────┘ │
│ │
└──────────────────────────────────────────────────────────────────┘
Free List Structure (Intrusive):
free_list
│
▼
┌──────────────────┐ ┌──────────────────┐ ┌──────────────────┐
│ Block N │ │ Block M │ │ Block K │
│ │ │ │ │ │
│ ┌──────────────┐ │ │ ┌──────────────┐ │ │ ┌──────────────┐ │
│ │ next ────────┼─┼────▶│ │ next ────────┼─┼────▶│ │ next ────────┼─┼──▶ NULL
│ └──────────────┘ │ │ └──────────────┘ │ │ └──────────────┘ │
│ │ │ │ │ │
│ (unused space) │ │ (unused space) │ │ (unused space) │
│ │ │ │ │ │
└──────────────────┘ └──────────────────┘ └──────────────────┘
The 'next' pointer is stored IN the free block itself!
No separate metadata needed for free blocks.
Arena Allocator with Bump Pointer:
hft_arena_t
┌──────────────────────────────────────────────────────────────────┐
│ │
│ ┌─────────────────┐ │
│ │ Arena Header │ │
│ │ │ │
│ │ capacity: 1MB │ │
│ │ base: ──────────┼────────────────────────┐ │
│ │ bump_ptr: ──────┼────────────────────────┼────────┐ │
│ │ end: ───────────┼────────────────────────┼────────┼──────┐ │
│ │ │ │ │ │ │
│ └─────────────────┘ │ │ │ │
│ │ │ │ │
│ ┌──────────────────────────────────────────┴────────┴──────┴─┐ │
│ │ Memory Region │ │
│ │ │ │
│ │ base bump_ptr end │ │
│ │ │ │ │ │ │
│ │ ▼ ▼ ▼ │ │
│ │ ┌────────┬────────┬────────┬──────────┬─────────────────┐ │ │
│ │ │ Obj 1 │ Obj 2 │ Obj 3 │ padding │ FREE SPACE │ │ │
│ │ │ 48 B │ 128 B │ 64 B │ (align) │ │ │ │
│ │ └────────┴────────┴────────┴──────────┴─────────────────┘ │ │
│ │ │ │
│ └────────────────────────────────────────────────────────────┘ │
│ │
└──────────────────────────────────────────────────────────────────┘
Allocation: bump_ptr += aligned_size (O(1), no searching)
Reset: bump_ptr = base (O(1), frees everything)
Individual free: Not supported (by design)
5.2 Key Components
| Component | Purpose | Complexity |
|---|---|---|
| Free List | Track available blocks | O(1) operations |
| Block Storage | Contiguous memory region | Pre-allocated |
| Size Classes | Route requests to appropriate pool | O(1) lookup |
| Thread-Local Cache | Avoid lock contention | Per-thread state |
| Central Slab | Refill thread-local pools | Occasional lock |
5.3 Data Structures
// Free list node (stored IN the free block)
typedef struct free_node {
struct free_node* next;
} free_node_t;
// Pool allocator
typedef struct hft_pool {
size_t object_size; // Size of each object
size_t capacity; // Maximum objects
size_t allocated; // Currently allocated count
free_node_t* free_list; // Head of free list
void* blocks; // Contiguous memory region
void* blocks_end; // End of memory region
// Statistics
uint64_t alloc_count;
uint64_t free_count;
} hft_pool_t;
// Arena allocator
typedef struct hft_arena {
size_t capacity;
char* base;
char* bump_ptr;
char* end;
uint64_t alloc_count;
size_t total_allocated;
} hft_arena_t;
// Size class for multi-size allocator
typedef struct {
size_t min_size;
size_t max_size;
size_t actual_size; // Rounded up for alignment
hft_pool_t* pool;
} size_class_t;
5.4 Algorithm Overview
Pool Allocation (O(1)):
POOL_ALLOC(pool):
1. IF free_list == NULL:
Return NULL (pool exhausted)
2. block = free_list
3. free_list = block->next
4. pool->allocated++
5. Return block
Pool Deallocation (O(1)):
POOL_FREE(pool, ptr):
1. Validate ptr is within pool range (optional, debug)
2. node = (free_node_t*)ptr
3. node->next = free_list
4. free_list = node
5. pool->allocated--
Arena Allocation (O(1)):
ARENA_ALLOC(arena, size):
1. aligned_size = ALIGN_UP(size, 16)
2. IF bump_ptr + aligned_size > end:
Return NULL (arena exhausted)
3. ptr = bump_ptr
4. bump_ptr += aligned_size
5. Return ptr
Arena Reset (O(1)):
ARENA_RESET(arena):
1. bump_ptr = base
2. (All allocations now invalid)
Implementation Guide
6.1 Development Environment Setup
Required Tools:
# C compiler with optimization
gcc -v # Should be 9+ for best optimization
# or
clang --version
# Performance measurement
perf --version # Linux only
# or Instruments on macOS
# Debugging
gdb --version
# or
lldb --version
# Memory checking (optional)
valgrind --version
Compiler Flags:
# Development build
gcc -Wall -Wextra -g -O0 -fsanitize=address
# Performance build
gcc -Wall -Wextra -O3 -march=native -DNDEBUG
# Benchmarking build (no sanitizers, full optimization)
gcc -O3 -march=native -DNDEBUG -flto
6.2 Project Structure
hft_alloc/
├── include/
│ └── hft_alloc.h # Public API
├── src/
│ ├── pool.c # Pool allocator implementation
│ ├── arena.c # Arena allocator implementation
│ ├── size_class.c # Size-class routing
│ └── thread_local.c # Thread-local pools
├── tests/
│ ├── test_pool.c # Pool unit tests
│ ├── test_arena.c # Arena unit tests
│ └── test_thread.c # Multi-threaded tests
├── bench/
│ ├── bench_malloc.c # malloc comparison
│ ├── bench_latency.c # Latency histogram
│ └── bench_orderbook.c # Integration benchmark
├── Makefile
└── README.md
6.3 The Core Question You’re Answering
“Why is malloc slow in HFT systems, and how can we achieve 30x better allocation performance with a custom allocator?”
malloc is slow because:
- It must handle any allocation size
- It must search for free blocks
- It must handle multi-threaded access with locks
- It has unpredictable latency (system calls, fragmentation)
You achieve 30x better performance by:
- Restricting to known, fixed sizes
- Using pre-built free lists (no searching)
- Using thread-local pools (no contention)
- Pre-allocating all memory (no system calls in hot path)
6.4 Concepts You Must Understand First
Memory Alignment
All allocations must be properly aligned for the CPU:
Alignment Requirements (x86-64):
Type Required Alignment
─────────────────────────────────
char 1 byte
short 2 bytes
int 4 bytes
long/ptr 8 bytes
double 8 bytes
__m128 16 bytes (SSE)
__m256 32 bytes (AVX)
cache line 64 bytes (optimal)
Unaligned access:
- May work but slower (extra cache lines)
- May crash (SIGBUS on strict architectures)
// Alignment macro (align must be a power of two)
#define ALIGN_UP(x, align) (((x) + (align) - 1) & ~((align) - 1))
// Example:
ALIGN_UP(17, 8) = 24 // Next multiple of 8
ALIGN_UP(17, 16) = 32 // Next multiple of 16
ALIGN_UP(64, 64) = 64 // Already aligned
Cache Line Awareness
Cache Line Layout (64 bytes):
┌─────────────────────────────────────────────────────────────────┐
│ One Cache Line (64 bytes) │
│ │
│ ┌────────┬────────┬────────┬────────┬────────┬────────┬─────┐ │
│ │ 8 byte │ 8 byte │ 8 byte │ 8 byte │ 8 byte │ 8 byte │ ... │ │
│ │ slot │ slot │ slot │ slot │ slot │ slot │ │ │
│ └────────┴────────┴────────┴────────┴────────┴────────┴─────┘ │
│ │
│ Accessing ANY byte loads the ENTIRE 64-byte line │
│ │
└─────────────────────────────────────────────────────────────────┘
For pool allocator:
- Objects smaller than 64 bytes: multiple per cache line (good)
- Objects 64 bytes: exactly one per line (optimal)
- Objects larger: spans multiple lines
Consider aligning pool objects to cache line boundaries
for performance-critical applications.
Pointer Arithmetic
// Pool memory layout calculation
void* blocks = mmap(..., capacity * object_size, ...);
void* blocks_end = (char*)blocks + capacity * object_size;
// Initialize free list
for (size_t i = 0; i < capacity; i++) {
void* block = (char*)blocks + i * object_size;
free_node_t* node = (free_node_t*)block;
node->next = free_list;
free_list = node;
}
// Validate pointer during free (debug mode)
bool is_valid_pool_ptr(hft_pool_t* pool, void* ptr) {
if (ptr < pool->blocks || ptr >= pool->blocks_end) {
return false;
}
// Check alignment
uintptr_t offset = (char*)ptr - (char*)pool->blocks;
return (offset % pool->object_size) == 0;
}
6.5 Questions to Guide Your Design
Before implementing, consider:
Pool Allocator Design:
- What happens when the pool is exhausted? (Fail, grow, or fallback?)
- Should objects be cache-line aligned for optimal access?
- How do you handle incorrect free() calls (double-free, wild pointer)?
Arena Allocator Design:
- Can arenas grow, or are they fixed size?
- What’s the reset strategy? (Reset per request? Per frame?)
- How do you handle alignment for variable-size allocations?
Size Class Design:
- What size classes do you need? (Powers of 2? Other spacing?)
- How do you route an allocation to the right size class?
- What about allocations larger than the largest class?
Thread Safety:
- Thread-local pools for the fast path?
- How do pools get refilled from a central allocator?
- How do you handle memory pressure across threads?
6.6 Thinking Exercise
Before coding, trace through this scenario:
EXERCISE: Design a pool for order book entries
Order Entry (used in matching engine):
struct Order {
uint64_t order_id; // 8 bytes
uint64_t price; // 8 bytes
uint64_t quantity; // 8 bytes
uint64_t timestamp; // 8 bytes
uint32_t symbol_id; // 4 bytes
uint8_t side; // 1 byte (BUY/SELL)
uint8_t type; // 1 byte (LIMIT/MARKET)
uint8_t status; // 1 byte
uint8_t padding; // 1 byte (alignment)
struct Order* next; // 8 bytes (for order book queue)
};
Total: 48 bytes
QUESTIONS:
1. What object size should the pool use? (48? 64 for alignment?)
2. If we expect 1 million concurrent orders, how much memory?
3. How many objects fit in L3 cache (assume 32 MB)?
4. Draw the free list after allocating 3 orders and freeing 1.
5. Calculate theoretical alloc/free latency (cache hit vs miss).
Work through the calculation:
- 48 bytes rounds to 64 for cache-line alignment
- 1M orders * 64 bytes = 64 MB pool size
- 32 MB L3 / 64 bytes = 512K orders fit in L3
- After freeing, that order goes to head of free list
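The layout math in the exercise can be checked at compile time. This sketch mirrors the struct above and assumes a typical 64-bit ABI (8-byte pointers, natural alignment):

```c
#include <stdint.h>

struct Order {
    uint64_t order_id;      /* 8 bytes */
    uint64_t price;         /* 8 bytes */
    uint64_t quantity;      /* 8 bytes */
    uint64_t timestamp;     /* 8 bytes */
    uint32_t symbol_id;     /* 4 bytes */
    uint8_t  side;          /* 1 byte */
    uint8_t  type;          /* 1 byte */
    uint8_t  status;        /* 1 byte */
    uint8_t  padding;       /* 1 byte, keeps `next` 8-aligned */
    struct Order *next;     /* 8 bytes */
};

/* 4*8 + 4 + 4*1 + 8 = 48, as the exercise states; fails to compile otherwise. */
_Static_assert(sizeof(struct Order) == 48, "Order should be 48 bytes");
```

Rounding 48 up to a 64-byte cache line, 1M concurrent orders need a 64 MB pool, of which 512K entries fit in a 32 MB L3.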
6.7 Hints in Layers
Hint Level 1 (Starting Point): Start with the simplest possible pool: fixed size, single allocation, no thread safety. Get the free list manipulation correct first.
Hint Level 2 (More Direction): Store the ‘next’ pointer inside the free block itself. When a block is in use, the next pointer is overwritten with user data. When free, it points to the next free block.
Hint Level 3 (Technical Details):
typedef struct free_node {
struct free_node* next;
} free_node_t;
void* pool_alloc(pool_t* p) {
if (!p->free_list) return NULL;
void* block = p->free_list;
p->free_list = p->free_list->next;
return block;
}
void pool_free(pool_t* p, void* block) {
free_node_t* node = (free_node_t*)block;
node->next = p->free_list;
p->free_list = node;
}
Hint Level 4 (Tools and Verification):
Use perf record -e cache-misses to verify your allocator has good cache behavior. Each pool_alloc should show 0-1 cache misses (for the free list head). Compare with malloc which often shows many more.
6.8 The Interview Questions They’ll Ask
If you build a custom allocator, expect these in systems interviews:
- “Walk me through what happens when I call malloc(64)”
- Expected: Describe size class lookup, free list search, potential splitting, header/footer metadata
- “Why would you use a pool allocator instead of malloc?”
- Expected: O(1) vs O(log n), no fragmentation, predictable latency, cache locality
- “How does your pool allocator handle multi-threaded access?”
- Expected: Thread-local pools, lock-free techniques, or explicit locking strategy
- “What’s the tradeoff between pool and arena allocators?”
- Pool: Individual free, fixed size, more overhead
- Arena: No individual free, variable size, minimal overhead
- “How would you detect memory leaks with your allocator?”
- Expected: Track allocations in debug mode, compare alloc/free counts, use allocator-aware tooling
- “Your pool is exhausted. What strategies can you use?”
- Fail (return NULL)
- Grow (allocate another slab)
- Fallback to general allocator
- Steal from other threads’ pools
- “Explain false sharing and how it affects your thread-local pools”
- Different threads writing to same cache line
- Pad pool headers to cache line boundaries
- Keep hot data (free_list head) on separate lines
6.9 Books That Will Help
| Book | Chapters | What You’ll Learn |
|---|---|---|
| Computer Systems: A Programmer’s Perspective (CS:APP) | Ch. 9: Virtual Memory | Heap management, malloc implementation, fragmentation |
| The Linux Programming Interface | Ch. 7: Memory Allocation | brk/sbrk, mmap, allocator internals |
| Fluent C | Ch. 6: Memory Management | C-specific allocation patterns, custom allocators |
| Game Engine Architecture | Ch. 5.2: Memory Management | Pool/arena patterns used in games |
| C++ Concurrency in Action | Ch. 7.2: Memory allocation | Thread-safe allocator design |
Papers:
- “Hoard: A Scalable Memory Allocator” (Berger et al., 2000)
- “A Scalable Concurrent malloc(3)” (Evans, 2006) - jemalloc design
- “Mimalloc: Free List Sharding in Action” (Leijen et al., 2019)
6.10 Implementation Phases
Phase 1: Basic Fixed-Size Pool (Days 1-2)
Goal: Create a working pool allocator for a single object size.
// Minimal implementation
typedef struct {
size_t object_size;
size_t capacity;
void* blocks;
free_node_t* free_list;
} hft_pool_t;
hft_pool_t* hft_pool_create(size_t object_size, size_t capacity);
void* hft_pool_alloc(hft_pool_t* pool);
void hft_pool_free(hft_pool_t* pool, void* ptr);
void hft_pool_destroy(hft_pool_t* pool);
Validation:
- Allocate capacity objects (should all succeed)
- Next allocation should fail (return NULL)
- Free one, allocate one (should succeed)
- All operations < 20 ns
Phase 2: Arena Allocator (Day 3)
Goal: Implement bump allocation with bulk reset.
typedef struct {
char* base;
char* bump_ptr;
char* end;
} hft_arena_t;
hft_arena_t* hft_arena_create(size_t capacity);
void* hft_arena_alloc(hft_arena_t* arena, size_t size);
void hft_arena_reset(hft_arena_t* arena);
void hft_arena_destroy(hft_arena_t* arena);
Validation:
- Allocate various sizes (should all succeed while space remains)
- Reset and allocate again (should reuse space)
- Measure allocation: should be < 5 ns
Phase 3: Size Classes (Days 4-5)
Goal: Route allocations to appropriate pools.
// Size classes: 16, 32, 64, 128, 256, 512, 1024, 2048 bytes
#define NUM_SIZE_CLASSES 8
static size_t size_classes[] = {16, 32, 64, 128, 256, 512, 1024, 2048};
static hft_pool_t* pools[NUM_SIZE_CLASSES];
void hft_init(void);
void* hft_alloc(size_t size);
void hft_free(void* ptr);
Validation:
- hft_alloc(17) uses 32-byte class
- hft_alloc(65) uses 128-byte class
- Free routes to correct pool
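One possible routing scheme for the validation cases above is a linear scan over the class table (a sketch; the `size_class_index` helper is hypothetical, and a production version might use a bit trick over a loop):

```c
#include <stddef.h>

#define NUM_SIZE_CLASSES 8
static const size_t size_classes[NUM_SIZE_CLASSES] =
    {16, 32, 64, 128, 256, 512, 1024, 2048};

/* Return the index of the smallest class that fits `size`,
 * or -1 if the request exceeds the largest class. */
static int size_class_index(size_t size) {
    for (int i = 0; i < NUM_SIZE_CLASSES; i++)
        if (size <= size_classes[i]) return i;
    return -1;  /* caller must fall back to a large-object path */
}
```

With only eight classes the scan is branch-predictable and effectively O(1); the -1 case forces an explicit decision about oversized allocations.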
Phase 4: Thread-Local Pools (Days 6-7)
Goal: Eliminate contention with per-thread allocation.
// Thread-local storage
static __thread hft_pool_t* tls_pools[NUM_SIZE_CLASSES];
static __thread bool tls_initialized = false;
void hft_thread_init(void);
void* hft_alloc(size_t size); // Uses tls_pools
Validation:
- Multi-threaded stress test shows no contention
- Each thread’s allocations don’t affect others
- Performance scales linearly with threads
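The `__thread` mechanism itself can be demonstrated in isolation before wiring it into the allocator (a minimal sketch using a plain counter instead of a pool; `run_two_workers` is a hypothetical helper):

```c
#include <pthread.h>

/* Each thread gets its own copy: increments never contend or false-share. */
static __thread unsigned long tls_alloc_count = 0;

static void *worker(void *out) {
    for (int i = 0; i < 1000000; i++)
        tls_alloc_count++;            /* no lock needed: thread-private */
    *(unsigned long *)out = tls_alloc_count;
    return NULL;
}

/* Run two workers concurrently; each sees only its own counter. */
static void run_two_workers(unsigned long *c1, unsigned long *c2) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, c1);
    pthread_create(&t2, NULL, worker, c2);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
}
```

Both workers report exactly one million even though they ran concurrently, which is the property the thread-local pools rely on: no cross-thread interference on the fast path.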
6.11 Key Implementation Decisions
Decision 1: Minimum Object Size
Objects must be at least as large as a pointer (8 bytes on 64-bit) to store the free list link:
size_t actual_size = object_size < sizeof(void*)
? sizeof(void*)
: object_size;
Decision 2: Alignment Strategy
Choose your alignment based on use case:
// Option 1: Natural alignment (minimum correctness)
#define POOL_ALIGNMENT 8
// Option 2: Cache-line alignment (best performance)
#define POOL_ALIGNMENT 64
// Option 3: User-specified
hft_pool_t* hft_pool_create_aligned(size_t size, size_t cap, size_t align);
Decision 3: Pool Exhaustion Handling
// Option 1: Fail (simplest, fastest)
if (!pool->free_list) return NULL;
// Option 2: Grow (allocate new slab)
if (!pool->free_list) {
pool_grow(pool); // Adds another chunk of blocks
if (!pool->free_list) return NULL;
}
// Option 3: Fallback (use malloc as backup)
if (!pool->free_list) return malloc(pool->object_size);
Decision 4: Debug Features
#ifdef DEBUG_ALLOCATOR
// Add guard bytes to detect overflow
#define GUARD_SIZE 8
#define GUARD_PATTERN 0xDEADBEEF
// Track all allocations
typedef struct {
void* ptr;
size_t size;
const char* file;
int line;
} alloc_record_t;
#endif
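A sketch of how guard bytes might be armed and checked on free (hypothetical helpers; for brevity it uses a single repeated guard byte rather than the 32-bit GUARD_PATTERN above):

```c
#include <stddef.h>
#include <string.h>
#include <stdbool.h>

#define GUARD_SIZE 8
#define GUARD_BYTE 0xDE

/* Fill the guard region that follows the user-visible object. */
static void guard_arm(void *obj, size_t object_size) {
    memset((char *)obj + object_size, GUARD_BYTE, GUARD_SIZE);
}

/* Return true if the guard is intact (nothing wrote past object_size). */
static bool guard_check(const void *obj, size_t object_size) {
    const unsigned char *g = (const unsigned char *)obj + object_size;
    for (size_t i = 0; i < GUARD_SIZE; i++)
        if (g[i] != GUARD_BYTE) return false;
    return true;
}
```

The pool would arm the guard in `hft_pool_alloc` and check it in `hft_pool_free`, turning a silent one-byte overflow into an immediate assertion failure in debug builds.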
Testing Strategy
Unit Tests
// test_pool.c
void test_pool_alloc_free(void) {
hft_pool_t* pool = hft_pool_create(64, 1000);
assert(pool != NULL);
// Allocate one
void* p1 = hft_pool_alloc(pool);
assert(p1 != NULL);
// Free it
hft_pool_free(pool, p1);
// Allocate again - should get same block (LIFO)
void* p2 = hft_pool_alloc(pool);
assert(p2 == p1);
hft_pool_destroy(pool);
}
void test_pool_exhaust(void) {
hft_pool_t* pool = hft_pool_create(64, 10);
void* ptrs[10];
// Allocate all
for (int i = 0; i < 10; i++) {
ptrs[i] = hft_pool_alloc(pool);
assert(ptrs[i] != NULL);
}
// Next should fail
assert(hft_pool_alloc(pool) == NULL);
// Free one
hft_pool_free(pool, ptrs[0]);
// Now should succeed
assert(hft_pool_alloc(pool) != NULL);
hft_pool_destroy(pool);
}
void test_pool_alignment(void) {
hft_pool_t* pool = hft_pool_create(64, 100);
for (int i = 0; i < 100; i++) {
void* p = hft_pool_alloc(pool);
// Check 16-byte alignment (minimum for SSE)
assert(((uintptr_t)p & 0xF) == 0);
}
hft_pool_destroy(pool);
}
Stress Tests
// test_stress.c
void test_pool_stress(void) {
hft_pool_t* pool = hft_pool_create(64, 10000);
void* ptrs[10000];
// Allocate all
for (int i = 0; i < 10000; i++) {
ptrs[i] = hft_pool_alloc(pool);
}
// Randomly free half
for (int i = 0; i < 5000; i++) {
int idx = rand() % 10000;
if (ptrs[idx]) {
hft_pool_free(pool, ptrs[idx]);
ptrs[idx] = NULL;
}
}
// Reallocate
for (int i = 0; i < 10000; i++) {
if (!ptrs[i]) {
ptrs[i] = hft_pool_alloc(pool);
}
}
// Verify pool stats
pool_stats_t stats;
hft_pool_stats(pool, &stats);
assert(stats.allocated_objects <= 10000);
hft_pool_destroy(pool);
}
Performance Tests
// bench_latency.c
void benchmark_latency_distribution(void) {
hft_pool_t* pool = hft_pool_create(64, 100000);
// 800 KB each; static so the arrays don't overflow the stack
static uint64_t latencies[100000];
static void* ptrs[100000];
// Warm-up: allocate every block, then return them all, so the memory
// is paged in and the free list is rebuilt before the measured run
for (int i = 0; i < 100000; i++) {
ptrs[i] = hft_pool_alloc(pool);
}
for (int i = 0; i < 100000; i++) {
hft_pool_free(pool, ptrs[i]);
}
// Measure allocation latency
for (int i = 0; i < 100000; i++) {
uint64_t start = rdtsc();
void* p = hft_pool_alloc(pool);
uint64_t end = rdtsc();
latencies[i] = end - start;
ptrs[i] = p;
}
// Sort and compute percentiles
qsort(latencies, 100000, sizeof(uint64_t), compare_uint64);
printf("Allocation latency (cycles):\n");
printf(" p50: %" PRIu64 "\n", latencies[50000]); // PRIu64 from <inttypes.h>
printf(" p99: %" PRIu64 "\n", latencies[99000]);
printf(" p999: %" PRIu64 "\n", latencies[99900]);
printf(" max: %" PRIu64 "\n", latencies[99999]);
hft_pool_destroy(pool);
}
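The benchmark above leans on two helpers that are easy to forget. A sketch of both, using the x86 timestamp counter with a `clock_gettime` fallback for other architectures:

```c
#include <stdint.h>
#include <stdlib.h>

#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>
// Raw cycle counter; add a fence (e.g. _mm_lfence) if you need serialization.
static inline uint64_t rdtsc(void) {
    return __rdtsc();
}
#else
#include <time.h>
// Fallback: monotonic nanoseconds rather than cycles.
static inline uint64_t rdtsc(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}
#endif

// qsort comparator for uint64_t; avoids the overflow a subtraction would risk.
static int compare_uint64(const void* a, const void* b) {
    uint64_t x = *(const uint64_t*)a;
    uint64_t y = *(const uint64_t*)b;
    return (x > y) - (x < y);
}
```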
Common Pitfalls and Debugging
Pitfall 1: Forgetting Minimum Object Size
Problem:
// BUG: 4-byte objects can't store 8-byte pointer!
hft_pool_t* pool = hft_pool_create(4, 1000);
Symptom: Memory corruption, crashes on 64-bit systems.
Solution:
size_t actual_size = size < sizeof(free_node_t)
? sizeof(free_node_t)
: size;
Pitfall 2: Alignment Violations
Problem:
// BUG: Objects may not be aligned
void* blocks = malloc(capacity * object_size);
Symptom: SIGBUS on strict architectures, poor performance elsewhere.
Solution:
// Use aligned allocation (C11; total size must be a multiple of ALIGNMENT)
void* blocks = aligned_alloc(ALIGNMENT, capacity * object_size);
// Or with posix_memalign (returns 0 on success and fills blocks)
posix_memalign(&blocks, ALIGNMENT, capacity * object_size);
Pitfall 3: Free List Corruption
Problem:
// BUG: Writing to freed block corrupts free list!
void* p = hft_pool_alloc(pool);
hft_pool_free(pool, p);
memset(p, 0, 64); // Oops! Overwrites next pointer
Symptom: Pool returns garbage pointers, crashes.
Solution:
// Debug mode: Check free list integrity
void validate_free_list(hft_pool_t* pool) {
free_node_t* node = pool->free_list;
while (node) {
assert(is_valid_pool_ptr(pool, node));
node = node->next;
}
}
Pitfall 4: Double Free
Problem:
void* p = hft_pool_alloc(pool);
hft_pool_free(pool, p);
hft_pool_free(pool, p); // BUG: Double free!
Symptom: Free list becomes circular, infinite loop or corruption.
Solution:
#ifdef DEBUG
// Track allocated pointers
void hft_pool_free(hft_pool_t* pool, void* ptr) {
if (!is_in_allocated_set(pool, ptr)) {
fprintf(stderr, "Double free detected: %p\n", ptr);
abort();
}
remove_from_allocated_set(pool, ptr);
// ... normal free
}
#endif
Pitfall 5: Thread-Local Initialization
Problem:
// BUG: Forgot to initialize thread-local pools
void* worker_thread(void* arg) {
// hft_thread_init(); // Missing!
void* p = hft_alloc(64); // Crash: tls_pools not initialized
}
Solution:
void* hft_alloc(size_t size) {
if (!tls_initialized) {
hft_thread_init();
}
// ... normal alloc
}
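The lazy-init guard can be expressed with C11 thread-local storage. A minimal sketch matching the snippet above; `tls_pools` here is a placeholder (a real allocator would hold per-thread pool structures), and `malloc` stands in for the real pool paths:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

// One flag (and, in the real allocator, one set of pools) per thread.
static _Thread_local bool tls_initialized = false;
static _Thread_local void* tls_pools = NULL;  // placeholder for per-thread pools

static void hft_thread_init(void) {
    tls_pools = malloc(64);  // stand-in for real per-thread pool setup
    tls_initialized = true;
}

void* hft_alloc(size_t size) {
    if (!tls_initialized) {   // checked on every call; the branch predicts well
        hft_thread_init();
    }
    return malloc(size);      // stand-in for the real size-class pool lookup
}
```

The per-call check costs one well-predicted branch, which is usually preferable to requiring every thread entry point to remember an explicit init call.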
Debugging Checklist
When allocator misbehaves:
- Check alignment: `printf("ptr: %p, aligned: %d\n", p, (int)(((uintptr_t)p & 0xF) == 0));`
- Verify pool bounds: ensure freed pointers fall within the pool's block range
- Count allocations: `allocated_count` should match expectations
- Validate free list: walk the list and verify every pointer is valid
- Use sanitizers: `gcc -fsanitize=address` catches many issues
Extensions and Challenges
Extension 1: NUMA-Aware Allocation
On multi-socket systems, allocate from local memory:
#include <numa.h>
hft_pool_t* hft_pool_create_numa(size_t size, size_t cap, int node) {
void* blocks = numa_alloc_onnode(cap * size, node);
if (!blocks) return NULL;
// ... initialize pool with blocks
// (release with numa_free(blocks, cap * size) in the destroy path)
}
Extension 2: Growing Pools (Slab Allocator)
When a pool runs out, add another slab:
Growing Pool with Slabs:
pool
┌─────────────────────────────────────────────────────────────┐
│ │
│ free_list ───────────────────────────────────────────┐ │
│ ▼ │
│ slab_list ─────────────────────────────────────┐ ┌───┐ │
│ │ │ │ F │ │
│ ▼ ▼ └───┘ │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │ Slab 0 │ ───▶ │ Slab 1 │ ───▶ │ Slab 2 │ ───▶ NULL │
│ │ (full) │ │(partial)│ │ (empty) │ │
│ └─────────┘ └─────────┘ └─────────┘ │
│ │
└─────────────────────────────────────────────────────────────┘
When pool exhausted:
1. Allocate new slab
2. Add all blocks to free list
3. Link slab to slab_list for cleanup
Extension 3: Debugging Features
Add runtime checks for catching bugs:
#define MAGIC_ALLOC 0xA110CA7E // live allocation
#define MAGIC_FREED 0xF4EEDB1D // returned to the pool
typedef struct {
uint32_t magic; // MAGIC_ALLOC or MAGIC_FREED
uint32_t size;
const char* alloc_file;
int alloc_line;
} debug_header_t;
void* hft_pool_alloc_debug(hft_pool_t* pool, const char* file, int line) {
void* block = hft_pool_alloc(pool);
if (block) {
debug_header_t* hdr = (debug_header_t*)block;
hdr->magic = MAGIC_ALLOC;
hdr->alloc_file = file;
hdr->alloc_line = line;
}
return block;
}
#define pool_alloc(p) hft_pool_alloc_debug(p, __FILE__, __LINE__)
Extension 4: Integration with Rust
Implement GlobalAlloc for Rust:
use std::alloc::{GlobalAlloc, Layout};
extern "C" {
fn hft_alloc(size: usize) -> *mut std::ffi::c_void;
fn hft_free(ptr: *mut std::ffi::c_void);
}
struct HftAllocator;
unsafe impl GlobalAlloc for HftAllocator {
unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
// Assumes hft_alloc returns memory aligned for layout.align();
// a robust version must honor the requested alignment explicitly.
hft_alloc(layout.size()) as *mut u8
}
unsafe fn dealloc(&self, ptr: *mut u8, _layout: Layout) {
hft_free(ptr as *mut std::ffi::c_void)
}
}
#[global_allocator]
static ALLOCATOR: HftAllocator = HftAllocator;
Challenge: Lock-Free Pool Allocator
Implement allocation without locks using compare-and-swap:
void* pool_alloc_lockfree(pool_t* pool) {
free_node_t* old_head;
free_node_t* new_head;
do {
old_head = atomic_load(&pool->free_list); // free_list must be _Atomic
if (old_head == NULL) return NULL;
new_head = old_head->next;
} while (!atomic_compare_exchange_weak(&pool->free_list,
&old_head, new_head));
return old_head;
}
// Caution: this naive pop has the classic ABA problem. Production versions
// pair the head pointer with a version counter (tagged pointer) or use
// hazard pointers to make the CAS safe.
Real-World Connections
Industry Allocators
jemalloc (FreeBSD, Facebook):
- Multiple arenas to reduce contention
- Size-class segregation
- Thread-local caching
- Used by Firefox, Redis, Cassandra
tcmalloc (Google):
- Thread-caching malloc
- Per-thread free lists for small objects
- Central heap for larger allocations
- Used internally at Google
mimalloc (Microsoft):
- Free list sharding
- Security features (guard pages, randomization)
- Excellent performance across workloads
HFT Practices
Real HFT systems use techniques like:
- Pre-allocation at startup: All memory allocated before market opens
- Object pools for each message type: Order pool, Quote pool, Trade pool
- Arena per trading session: Reset at session end
- Huge pages: 2MB pages reduce TLB misses
HFT Memory Architecture Example:
┌─────────────────────────────────────────────────────────────────┐
│ Trading Application │
│ │
│ ┌───────────────┐ ┌───────────────┐ ┌───────────────┐ │
│ │ Order Pool │ │ Quote Pool │ │ Trade Pool │ │
│ │ 1M orders │ │ 500K quotes │ │ 100K trades │ │
│ │ 64 bytes each │ │ 32 bytes each │ │ 48 bytes each │ │
│ └───────────────┘ └───────────────┘ └───────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Per-Request Arena (reset after each order) │ │
│ │ │ │
│ │ Temp buffers, string parsing, validation structures │ │
│ │ │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
│ All memory pre-allocated on huge pages before market open │
│ Zero malloc() calls in trading hot path │
│ │
└─────────────────────────────────────────────────────────────────┘
Resources
Essential Reading
- CS:APP Chapter 9: Virtual Memory and Dynamic Memory Allocation
- “What Every Programmer Should Know About Memory” by Ulrich Drepper
- jemalloc design paper: “A Scalable Concurrent malloc(3) Implementation”
- “Fluent C” Chapter 6: Practical C allocator patterns
Reference Implementations
| Implementation | Language | Features |
|---|---|---|
| jemalloc | C | Production-grade, extensive |
| mimalloc | C | Modern, security-focused |
| Hoard | C++ | Academic, influential |
| bumpalo | Rust | Arena allocator |
| typed-arena | Rust | Typed arena |
Tools
- perf:
perf stat -e cache-misses ./bench - Valgrind:
valgrind --tool=massif ./program - heaptrack: Visual heap profiling
- AddressSanitizer:
gcc -fsanitize=address
Self-Assessment Checklist
Before considering this project complete, verify:
Core Understanding
- You can explain why malloc has unpredictable latency
- You can describe the difference between pool and arena allocators
- You understand internal vs external fragmentation
- You can explain the free list data structure
Implementation
- Pool allocator achieves O(1) alloc and free
- Arena allocator uses bump allocation
- All allocations are properly aligned
- Thread-local pools avoid contention
Testing
- Unit tests cover allocation, deallocation, exhaustion
- Stress tests complete without memory corruption
- Latency benchmark shows < 10 ns p99
Performance
- Demonstrated 15-30x speedup over malloc
- Integrated into order book or similar structure
- Zero external fragmentation by design
Extensions (Optional)
- Implemented size classes for variable sizes
- Added debug features (leak detection, double-free detection)
- Thread-safe with lock-free or thread-local approach
Submission/Completion Criteria
Your project is complete when you have:
- Working Code:
  - `hft_pool.c/h`: Fixed-size pool allocator
  - `hft_arena.c/h`: Bump-pointer arena allocator
  - `hft_alloc.c/h`: Size-class router (optional)
  - Test suite with > 90% coverage
- Benchmark Results:

      Memory Allocator Benchmark
      ==========================
      malloc:      ~80 ns avg
      pool_alloc:   ~6 ns avg
      Speedup:     13x+

      Latency Distribution (pool_alloc):
        p50:  ~5 ns
        p99:  ~10 ns
        p999: ~15 ns

- Integration Demo:
- Pool allocator integrated into order book (from Project 1)
- Before/after latency comparison
- Memory stats showing zero fragmentation
- Documentation:
- README with usage examples
- API documentation
- Design decisions explained
Learning Milestones
Milestone 1: Basic Fixed-Size Pool
- Understand free list manipulation
- Achieve O(1) operations
- Pass basic unit tests
- Measure: 10-20 ns per operation
Milestone 2: Multiple Size Classes
- Route allocations to appropriate pool
- Handle various object sizes
- Manage memory efficiently
- Measure: Size-class lookup < 5 ns overhead
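The routing in Milestone 2 can be a tiny lookup. A sketch with assumed classes of 16/32/64/128/256 bytes (your class sizes may differ):

```c
#include <stddef.h>

#define NUM_CLASSES 5
static const size_t class_sizes[NUM_CLASSES] = { 16, 32, 64, 128, 256 };

// Map a request size to the smallest class that fits; -1 means the request
// is too large and should be routed to a fallback allocator.
static int size_class(size_t n) {
    for (int i = 0; i < NUM_CLASSES; i++) {
        if (n <= class_sizes[i]) return i;
    }
    return -1;
}
```

With power-of-two classes, a production router typically replaces the loop with a small table or a `__builtin_clz`-based computation to keep the lookup within a few nanoseconds.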
Milestone 3: Thread-Safe Version
- Thread-local pools for fast path
- Central pool for refills
- No contention in hot path
- Measure: Linear scaling with threads
After completing this project, you will understand why HFT systems never call malloc in the critical path. You'll have built an allocator that is an order of magnitude faster than general-purpose alternatives and gained deep knowledge of memory-management internals. This knowledge transfers to any performance-critical application, from databases to game engines to embedded systems.