Project 5: Custom Memory Allocator
Build a specialized memory allocator (arena/pool allocator) that eliminates malloc from your critical path, because malloc is a latency killer in HFT systems.
Quick Reference
| Attribute | Details |
|---|---|
| Difficulty | Intermediate |
| Time Estimate | 1 week |
| Primary Language | C |
| Alternative Languages | Rust, C++, Zig |
| Knowledge Area | Memory Management, Systems Programming |
| Tools Required | perf, gdb/lldb, valgrind (optional) |
| Primary Reference | “Computer Systems: A Programmer’s Perspective” by Bryant & O’Hallaron, Ch. 9 |
Learning Objectives
By completing this project, you will be able to:
- Explain heap allocation internals including how malloc works, fragmentation types, and why general-purpose allocators are slow
- Design pool allocators with O(1) allocation and deallocation for fixed-size objects
- Implement arena allocators using bump allocation with bulk free semantics
- Measure allocation performance and demonstrate 30x+ latency improvements over malloc
- Integrate custom allocators with language APIs (GlobalAlloc in Rust, operator new in C++)
- Analyze fragmentation and design allocation strategies for specific use cases
- Build thread-safe allocators using thread-local storage to avoid false sharing
Theoretical Foundation
3.1 Core Concepts
What is Heap Allocation?
When you call malloc(n), the allocator must:
- Find a free block of at least n bytes
- Split the block if it’s larger than needed
- Update metadata (free lists, size headers)
- Return a properly aligned pointer
This process involves:
- Searching through data structures (O(log n) or worse)
- Potential system calls for more memory (sbrk, mmap)
- Lock acquisition in multi-threaded environments
malloc(32) - What Actually Happens:
┌─────────────────────────────────────────────────────────────────┐
│ User Request │
│ malloc(32) │
└─────────────────────────────────┬───────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ 1. Check Thread Cache │
│ Thread has cached blocks? ──► Yes ──► Return cached block │
│ │ │
│ ▼ No │
└─────────────────────────────────┬───────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ 2. Search Free Lists │
│ Find block >= 32 bytes in appropriate size class │
│ May need to search multiple bins │
│ Best-fit? First-fit? Next-fit? │
└─────────────────────────────────┬───────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ 3. Split or Coalesce │
│ Block too large? Split and return remainder to free list │
│ Adjacent free blocks? Coalesce them │
│ Update headers and footers │
└─────────────────────────────────┬───────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ 4. Maybe Get More Memory │
│ No suitable block? ──► sbrk() or mmap() │
│ System call: 1000+ cycles, potential page faults │
└─────────────────────────────────┬───────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ 5. Acquire Lock (Multi-threaded) │
│ Global arena lock or per-arena lock │
│ Contention = waiting = latency │
└─────────────────────────────────┬───────────────────────────────┘
│
▼
Return pointer
Total time: 50-500+ nanoseconds (unpredictable!)
Types of Fragmentation
External Fragmentation: Free memory exists but in pieces too small to use.
External Fragmentation Example:
Memory: [USED 64][FREE 16][USED 32][FREE 24][USED 48][FREE 8]
Total free: 16 + 24 + 8 = 48 bytes
Request: malloc(40)
Result: FAILS! (No contiguous 40-byte block)
The memory is there, but scattered.
Internal Fragmentation: Allocated blocks are larger than requested.
Internal Fragmentation Example:
Request: malloc(17)
Allocator returns: 32-byte block (next power of 2)
Wasted: 15 bytes per allocation
With 1 million allocations: 15 MB wasted!
Pool Allocator Concept
A pool allocator pre-allocates many fixed-size blocks and chains them in a free list:
Pool Allocator Structure:
Pool for 64-byte objects:
free_list ──────┐
▼
┌──────────────────────────────────────────────────────────┐
│ MEMORY POOL (pre-allocated from OS) │
│ │
│ ┌───────┐ ┌───────┐ ┌───────┐ ┌───────┐ │
│ │Block 0│──▶│Block 1│──▶│Block 2│──▶│Block 3│──▶ NULL │
│ │64 byte│ │64 byte│ │64 byte│ │64 byte│ │
│ │ FREE │ │ FREE │ │ FREE │ │ FREE │ │
│ └───────┘ └───────┘ └───────┘ └───────┘ │
│ ▲ │
│ │ │
│ next pointer stored IN the free block itself! │
│ │
└──────────────────────────────────────────────────────────┘
Allocation (O(1)):
1. block = free_list
2. free_list = block->next
3. return block
Deallocation (O(1)):
1. block->next = free_list
2. free_list = block
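The two O(1) operations above are small enough to try directly. Here is a minimal, illustrative C sketch (no thread safety, no error checking beyond exhaustion; the `mini_pool_*` names are placeholders, not the project API):

```c
#include <stdlib.h>
#include <stddef.h>

typedef struct free_node { struct free_node *next; } free_node_t;

typedef struct {
    void *blocks;            /* contiguous backing memory */
    free_node_t *free_list;  /* head of intrusive free list */
} mini_pool_t;

/* Carve `capacity` fixed-size blocks and chain them into the free list. */
static int mini_pool_init(mini_pool_t *p, size_t object_size, size_t capacity) {
    if (object_size < sizeof(free_node_t)) object_size = sizeof(free_node_t);
    p->blocks = malloc(object_size * capacity);
    if (!p->blocks) return -1;
    p->free_list = NULL;
    for (size_t i = 0; i < capacity; i++) {
        free_node_t *n = (free_node_t *)((char *)p->blocks + i * object_size);
        n->next = p->free_list;   /* next pointer lives inside the free block */
        p->free_list = n;
    }
    return 0;
}

static void *mini_pool_alloc(mini_pool_t *p) {
    free_node_t *b = p->free_list;
    if (!b) return NULL;          /* pool exhausted */
    p->free_list = b->next;
    return b;
}

static void mini_pool_free(mini_pool_t *p, void *ptr) {
    free_node_t *n = ptr;
    n->next = p->free_list;       /* push back onto the free list (LIFO) */
    p->free_list = n;
}
```

Note the LIFO behavior: the most recently freed block is the next one handed out, which tends to keep the hot block in cache.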
Arena Allocator Concept
An arena allocator uses bump allocation - just increment a pointer:
Arena Allocator with Bump Allocation:
┌──────────────────────────────────────────────────────────┐
│ ARENA (one large pre-allocated buffer) │
│ │
│ base bump_ptr end │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌────────────────────┬─────────────────────────────┐ │
│ │ USED MEMORY │ AVAILABLE SPACE │ │
│ │ obj1 obj2 obj3 ... │ │ │
│ └────────────────────┴─────────────────────────────┘ │
│ │
└──────────────────────────────────────────────────────────┘
Allocation (O(1)):
1. Check: bump_ptr + size <= end
2. ptr = bump_ptr
3. bump_ptr += ALIGN(size)
4. return ptr
Deallocation:
Individual free: NOT SUPPORTED!
Reset entire arena: bump_ptr = base (O(1), frees everything)
Perfect for: Batch processing, request handling, game frames
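The bump-allocation scheme can likewise be sketched in a few lines of C (a minimal illustration with no error handling beyond exhaustion; the `arena_*` names are placeholders, and the 16-byte alignment is an assumption):

```c
#include <stdint.h>
#include <stdlib.h>

#define ARENA_ALIGN 16
#define ALIGN_UP(x, a) (((x) + (a) - 1) & ~((size_t)(a) - 1))

typedef struct {
    char *base, *bump_ptr, *end;
} arena_t;

arena_t arena_create(size_t capacity) {
    arena_t a;
    a.base = a.bump_ptr = malloc(capacity);
    a.end = a.base ? a.base + capacity : NULL;
    return a;
}

void *arena_alloc(arena_t *a, size_t size) {
    size_t n = ALIGN_UP(size, ARENA_ALIGN);
    if (a->bump_ptr + n > a->end) return NULL;  /* arena exhausted */
    void *p = a->bump_ptr;
    a->bump_ptr += n;                           /* just bump the pointer */
    return p;
}

/* Frees everything at once; individual free is not supported by design. */
void arena_reset(arena_t *a) { a->bump_ptr = a->base; }
```

After a reset, previously returned pointers are invalid even though the memory is still mapped, which is exactly the bulk-free semantics described above.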
3.2 Why This Matters for HFT
malloc is Unpredictable
The worst-case latency of malloc can be 1000x the average:
malloc Latency Distribution (real measurements):
Latency (ns) | Frequency
─────────────────────────────
50 | ████████████████████ (typical)
100 | ████████
200 | ███
500 | ██
1000 | █
5000 | ▌ (rare but deadly in HFT)
50000 | ▏ (mmap/sbrk triggered)
p50: 80 ns
p99: 350 ns
p999: 5200 ns ← This is the problem!
max: 50000+ ns ← This is catastrophic
In HFT, a 50-microsecond delay means someone else got the trade.
Pool Allocator Latency Distribution
pool_alloc Latency Distribution:
Latency (ns) | Frequency
─────────────────────────────
5 | ██████████████████████████████
8 | ██████████████
10 | ████
12 | █
15 | ▌ (L1 cache miss)
p50: 6 ns
p99: 10 ns
p999: 14 ns ← Predictable!
max: 20 ns ← Bounded!
The 30x Improvement
| Allocator | p50 | p99 | p999 | Predictability |
|---|---|---|---|---|
| glibc malloc | 80 ns | 350 ns | 5200 ns | Poor |
| jemalloc | 50 ns | 150 ns | 800 ns | Better |
| Pool allocator | 6 ns | 10 ns | 14 ns | Excellent |
3.3 Historical Context
The Evolution of Memory Allocators
Timeline of Memory Allocator Development:
1960s: First-fit allocation
Simple but fragmentation-prone
1970s: Buddy system (Knuth)
Powers of 2, fast coalescing, internal fragmentation
Still used in Linux kernel page allocator
1980s: Doug Lea's malloc (dlmalloc)
Segregated free lists, binning
Became glibc default
1990s: Hoard (Berger et al.)
First scalable multi-threaded allocator
Thread-local caching concept
2000s: tcmalloc (Google)
Thread-caching malloc
Per-thread free lists
2006: jemalloc (Jason Evans)
Arena-based, size classes
Facebook, FreeBSD default
2010s: mimalloc (Microsoft)
Free list sharding, security focus
2019: Scudo (Android)
Security-hardened allocator
HFT practice: Skip all of this, use pools!
Why General-Purpose Allocators Are Complex
General-purpose allocators must handle:
- Any size from 1 byte to gigabytes
- Any allocation/deallocation pattern
- Multi-threaded access
- Memory fragmentation over long runtimes
- Security concerns (heap exploits)
This complexity means unpredictable performance. HFT systems know their allocation patterns and can specialize.
3.4 Common Misconceptions
Misconception 1: “malloc is slow because of system calls”
Reality: Most malloc calls don’t make system calls. The slowness comes from:
- Free list searching
- Lock contention
- Cache misses on metadata
Misconception 2: “Fragmentation only matters for long-running processes”
Reality: Fragmentation affects performance immediately:
- Searching larger free lists
- Worse cache locality
- Potential for failed allocations
Misconception 3: “Pool allocators waste memory”
Reality: They trade internal fragmentation for:
- Deterministic latency
- Zero external fragmentation
- Better cache locality
- Simpler implementation
Misconception 4: “Thread-local allocators solve all contention”
Reality: Thread-local helps but:
- Memory must eventually be reclaimed globally
- Large allocations may still be shared
- False sharing in metadata can still occur
Project Specification
4.1 What You Will Build
A memory allocation library called hft_alloc with three components:
- Fixed-size Pool Allocator: O(1) alloc/free for single-size objects
- Size-class Pool Allocator: Multiple pools for different size ranges
- Arena Allocator: Bump allocation with bulk free
Your HFT Allocator Architecture:
┌─────────────────────────────────────────────────────────────────┐
│ hft_alloc Library │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Thread-Local Pools (per thread) │ │
│ │ │ │
│ │ ┌──────────┐ ┌──────────┐ ┌──────────┐ │ │
│ │ │ 16-byte │ │ 32-byte │ │ 64-byte │ ... │ │
│ │ │ Pool │ │ Pool │ │ Pool │ │ │
│ │ │ free_lst │ │ free_lst │ │ free_lst │ │ │
│ │ └──────────┘ └──────────┘ └──────────┘ │ │
│ │ │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Central Slab Allocator │ │
│ │ (refills thread-local pools) │ │
│ │ │ │
│ │ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │ │
│ │ │ Slab 16 │ │ Slab 32 │ │ Slab 64 │ │Slab 128 │ ... │ │
│ │ └─────────┘ └─────────┘ └─────────┘ └─────────┘ │ │
│ │ │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ System Memory │ │
│ │ (mmap, large pages) │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
4.2 Functional Requirements
Core API:
// Pool allocator for fixed-size objects
typedef struct hft_pool hft_pool_t;
hft_pool_t* hft_pool_create(size_t object_size, size_t initial_capacity);
void* hft_pool_alloc(hft_pool_t* pool);
void hft_pool_free(hft_pool_t* pool, void* ptr);
void hft_pool_destroy(hft_pool_t* pool);
void hft_pool_stats(hft_pool_t* pool, pool_stats_t* stats);
// Arena allocator for bump allocation
typedef struct hft_arena hft_arena_t;
hft_arena_t* hft_arena_create(size_t capacity);
void* hft_arena_alloc(hft_arena_t* arena, size_t size);
void hft_arena_reset(hft_arena_t* arena); // Free everything
void hft_arena_destroy(hft_arena_t* arena);
// Size-class allocator (combines multiple pools)
void hft_init(void); // Call once at startup
void* hft_alloc(size_t size);
void hft_free(void* ptr);
void hft_thread_init(void); // Call per thread
Statistics Reporting:
typedef struct {
size_t object_size;
size_t total_objects;
size_t allocated_objects;
size_t free_objects;
size_t allocations_count;
size_t frees_count;
size_t bytes_allocated;
size_t bytes_overhead;
} pool_stats_t;
4.3 Non-Functional Requirements
| Requirement | Target | Measurement |
|---|---|---|
| Allocation latency | < 10 ns p99 | Benchmark |
| Deallocation latency | < 10 ns p99 | Benchmark |
| Memory overhead | < 5% | Stats reporting |
| Fragmentation | 0% external | By design |
| Thread safety | Lock-free fast path | Code inspection |
4.4 Example Usage and Output
Benchmark Program:
#include "hft_alloc.h"
#include <stdio.h>
#include <time.h>
#define NUM_ALLOCS 1000000
#define OBJECT_SIZE 64
int main() {
// Benchmark malloc
struct timespec start, end;
static void* ptrs[NUM_ALLOCS]; // static: 8 MB of pointers would overflow the stack
clock_gettime(CLOCK_MONOTONIC, &start);
for (int i = 0; i < NUM_ALLOCS; i++) {
ptrs[i] = malloc(OBJECT_SIZE);
}
clock_gettime(CLOCK_MONOTONIC, &end);
double malloc_ns = (end.tv_sec - start.tv_sec) * 1e9 +
(end.tv_nsec - start.tv_nsec);
malloc_ns /= NUM_ALLOCS;
for (int i = 0; i < NUM_ALLOCS; i++) {
free(ptrs[i]);
}
// Benchmark pool allocator
hft_pool_t* pool = hft_pool_create(OBJECT_SIZE, NUM_ALLOCS);
clock_gettime(CLOCK_MONOTONIC, &start);
for (int i = 0; i < NUM_ALLOCS; i++) {
ptrs[i] = hft_pool_alloc(pool);
}
clock_gettime(CLOCK_MONOTONIC, &end);
double pool_ns = (end.tv_sec - start.tv_sec) * 1e9 +
(end.tv_nsec - start.tv_nsec);
pool_ns /= NUM_ALLOCS;
// Statistics
pool_stats_t stats;
hft_pool_stats(pool, &stats);
printf("Memory Allocator Benchmark\n");
printf("==========================\n");
printf("Object size: %zu bytes\n", (size_t)OBJECT_SIZE);
printf("Allocations: %d\n\n", NUM_ALLOCS);
printf("malloc: %.1f ns avg\n", malloc_ns);
printf("pool_alloc: %.1f ns avg\n", pool_ns);
printf("Speedup: %.1fx\n\n", malloc_ns / pool_ns);
printf("Pool Statistics:\n");
printf(" Total objects: %zu\n", stats.total_objects);
printf(" Allocated: %zu\n", stats.allocated_objects);
printf(" Free: %zu\n", stats.free_objects);
printf(" Memory used: %zu bytes\n", stats.bytes_allocated);
printf(" Overhead: %zu bytes (%.1f%%)\n",
stats.bytes_overhead,
100.0 * stats.bytes_overhead / stats.bytes_allocated);
hft_pool_destroy(pool);
return 0;
}
Expected Output:
Memory Allocator Benchmark
==========================
Object size: 64 bytes
Allocations: 1000000
malloc: 85.3 ns avg
pool_alloc: 5.7 ns avg
Speedup: 15.0x
Pool Statistics:
Total objects: 1000000
Allocated: 1000000
Free: 0
Memory used: 64000000 bytes
Overhead: 48 bytes (0.0%)
4.5 Real World Outcome
After completing this project, you will have:
- Latency Benchmark: Demonstrable 15-30x improvement over malloc
- Integration Example: Pool allocator integrated into an order book
- Memory Stats: Zero fragmentation, minimal overhead
- Stress Test: Millions of alloc/free cycles without degradation
Solution Architecture
5.1 High-Level Design
Pool Allocator Internal Structure:
hft_pool_t
┌──────────────────────────────────────────────────────────────────┐
│ │
│ ┌─────────────────┐ ┌─────────────────────────────────────┐ │
│ │ Pool Header │ │ Memory Blocks │ │
│ │ │ │ │ │
│ │ object_size: 64 │ │ ┌──────┐ ┌──────┐ ┌──────┐ │ │
│ │ capacity: 1000 │ │ │Blk 0 │ │Blk 1 │ │Blk 2 │ ... │ │
│ │ allocated: 347 │ │ │ │ │ │ │ │ │ │
│ │ │ │ └──┬───┘ └──┬───┘ └──┬───┘ │ │
│ │ free_list: ─────┼────┼────┘ │ │ │ │
│ │ │ │ ▼ ▼ │ │
│ │ blocks: ────────┼────┼───────────────────────────────────▶│ │
│ │ │ │ │ │
│ │ stats: {...} │ │ Cache-aligned, contiguous memory │ │
│ │ │ │ │ │
│ └─────────────────┘ └─────────────────────────────────────┘ │
│ │
└──────────────────────────────────────────────────────────────────┘
Free List Structure (Intrusive):
free_list
│
▼
┌──────────────────┐ ┌──────────────────┐ ┌──────────────────┐
│ Block N │ │ Block M │ │ Block K │
│ │ │ │ │ │
│ ┌──────────────┐ │ │ ┌──────────────┐ │ │ ┌──────────────┐ │
│ │ next ────────┼─┼────▶│ │ next ────────┼─┼────▶│ │ next ────────┼─┼──▶ NULL
│ └──────────────┘ │ │ └──────────────┘ │ │ └──────────────┘ │
│ │ │ │ │ │
│ (unused space) │ │ (unused space) │ │ (unused space) │
│ │ │ │ │ │
└──────────────────┘ └──────────────────┘ └──────────────────┘
The 'next' pointer is stored IN the free block itself!
No separate metadata needed for free blocks.
Arena Allocator with Bump Pointer:
hft_arena_t
┌──────────────────────────────────────────────────────────────────┐
│ │
│ ┌─────────────────┐ │
│ │ Arena Header │ │
│ │ │ │
│ │ capacity: 1MB │ │
│ │ base: ──────────┼────────────────────────┐ │
│ │ bump_ptr: ──────┼────────────────────────┼────────┐ │
│ │ end: ───────────┼────────────────────────┼────────┼──────┐ │
│ │ │ │ │ │ │
│ └─────────────────┘ │ │ │ │
│ │ │ │ │
│ ┌──────────────────────────────────────────┴────────┴──────┴─┐ │
│ │ Memory Region │ │
│ │ │ │
│ │ base bump_ptr end │ │
│ │ │ │ │ │ │
│ │ ▼ ▼ ▼ │ │
│ │ ┌────────┬────────┬────────┬──────────┬─────────────────┐ │ │
│ │ │ Obj 1 │ Obj 2 │ Obj 3 │ padding │ FREE SPACE │ │ │
│ │ │ 48 B │ 128 B │ 64 B │ (align) │ │ │ │
│ │ └────────┴────────┴────────┴──────────┴─────────────────┘ │ │
│ │ │ │
│ └────────────────────────────────────────────────────────────┘ │
│ │
└──────────────────────────────────────────────────────────────────┘
Allocation: bump_ptr += aligned_size (O(1), no searching)
Reset: bump_ptr = base (O(1), frees everything)
Individual free: Not supported (by design)
5.2 Key Components
| Component | Purpose | Complexity |
|---|---|---|
| Free List | Track available blocks | O(1) operations |
| Block Storage | Contiguous memory region | Pre-allocated |
| Size Classes | Route requests to appropriate pool | O(1) lookup |
| Thread-Local Cache | Avoid lock contention | Per-thread state |
| Central Slab | Refill thread-local pools | Occasional lock |
5.3 Data Structures
// Free list node (stored IN the free block)
typedef struct free_node {
struct free_node* next;
} free_node_t;
// Pool allocator
typedef struct hft_pool {
size_t object_size; // Size of each object
size_t capacity; // Maximum objects
size_t allocated; // Currently allocated count
free_node_t* free_list; // Head of free list
void* blocks; // Contiguous memory region
void* blocks_end; // End of memory region
// Statistics
uint64_t alloc_count;
uint64_t free_count;
} hft_pool_t;
// Arena allocator
typedef struct hft_arena {
size_t capacity;
char* base;
char* bump_ptr;
char* end;
uint64_t alloc_count;
size_t total_allocated;
} hft_arena_t;
// Size class for multi-size allocator
typedef struct {
size_t min_size;
size_t max_size;
size_t actual_size; // Rounded up for alignment
hft_pool_t* pool;
} size_class_t;
5.4 Algorithm Overview
Pool Allocation (O(1)):
POOL_ALLOC(pool):
1. IF free_list == NULL:
Return NULL (pool exhausted)
2. block = free_list
3. free_list = block->next
4. pool->allocated++
5. Return block
Pool Deallocation (O(1)):
POOL_FREE(pool, ptr):
1. Validate ptr is within pool range (optional, debug)
2. node = (free_node_t*)ptr
3. node->next = free_list
4. free_list = node
5. pool->allocated--
Arena Allocation (O(1)):
ARENA_ALLOC(arena, size):
1. aligned_size = ALIGN_UP(size, 16)
2. IF bump_ptr + aligned_size > end:
Return NULL (arena exhausted)
3. ptr = bump_ptr
4. bump_ptr += aligned_size
5. Return ptr
Arena Reset (O(1)):
ARENA_RESET(arena):
1. bump_ptr = base
2. (All allocations now invalid)
Implementation Guide
6.1 Development Environment Setup
Required Tools:
# C compiler with optimization
gcc -v # Should be 9+ for best optimization
# or
clang --version
# Performance measurement
perf --version # Linux only
# or Instruments on macOS
# Debugging
gdb --version
# or
lldb --version
# Memory checking (optional)
valgrind --version
Compiler Flags:
# Development build
gcc -Wall -Wextra -g -O0 -fsanitize=address
# Performance build
gcc -Wall -Wextra -O3 -march=native -DNDEBUG
# Benchmarking build (no sanitizers, full optimization)
gcc -O3 -march=native -DNDEBUG -flto
6.2 Project Structure
hft_alloc/
├── include/
│ └── hft_alloc.h # Public API
├── src/
│ ├── pool.c # Pool allocator implementation
│ ├── arena.c # Arena allocator implementation
│ ├── size_class.c # Size-class routing
│ └── thread_local.c # Thread-local pools
├── tests/
│ ├── test_pool.c # Pool unit tests
│ ├── test_arena.c # Arena unit tests
│ └── test_thread.c # Multi-threaded tests
├── bench/
│ ├── bench_malloc.c # malloc comparison
│ ├── bench_latency.c # Latency histogram
│ └── bench_orderbook.c # Integration benchmark
├── Makefile
└── README.md
6.3 The Core Question You’re Answering
“Why is malloc slow in HFT systems, and how can we achieve 30x better allocation performance with a custom allocator?”
malloc is slow because:
- It must handle any allocation size
- It must search for free blocks
- It must handle multi-threaded access with locks
- It has unpredictable latency (system calls, fragmentation)
You achieve 30x better performance by:
- Restricting to known, fixed sizes
- Using pre-built free lists (no searching)
- Using thread-local pools (no contention)
- Pre-allocating all memory (no system calls in hot path)
6.4 Concepts You Must Understand First
Memory Alignment
All allocations must be properly aligned for the CPU:
Alignment Requirements (x86-64):
Type Required Alignment
─────────────────────────────────
char 1 byte
short 2 bytes
int 4 bytes
long/ptr 8 bytes
double 8 bytes
__m128 16 bytes (SSE)
__m256 32 bytes (AVX)
cache line 64 bytes (optimal)
Unaligned access:
- May work but slower (extra cache lines)
- May crash (SIGBUS on strict architectures)
// Alignment macro (align must be a power of two)
#define ALIGN_UP(x, align) (((x) + (align) - 1) & ~((align) - 1))
// Example:
ALIGN_UP(17, 8) = 24 // Next multiple of 8
ALIGN_UP(17, 16) = 32 // Next multiple of 16
ALIGN_UP(64, 64) = 64 // Already aligned
Cache Line Awareness
Cache Line Layout (64 bytes):
┌─────────────────────────────────────────────────────────────────┐
│ One Cache Line (64 bytes) │
│ │
│ ┌────────┬────────┬────────┬────────┬────────┬────────┬─────┐ │
│ │ 8 byte │ 8 byte │ 8 byte │ 8 byte │ 8 byte │ 8 byte │ ... │ │
│ │ slot │ slot │ slot │ slot │ slot │ slot │ │ │
│ └────────┴────────┴────────┴────────┴────────┴────────┴─────┘ │
│ │
│ Accessing ANY byte loads the ENTIRE 64-byte line │
│ │
└─────────────────────────────────────────────────────────────────┘
For pool allocator:
- Objects smaller than 64 bytes: multiple per cache line (good)
- Objects 64 bytes: exactly one per line (optimal)
- Objects larger: spans multiple lines
Consider aligning pool objects to cache line boundaries
for performance-critical applications.
Pointer Arithmetic
// Pool memory layout calculation
void* blocks = mmap(..., capacity * object_size, ...);
void* blocks_end = (char*)blocks + capacity * object_size;
// Initialize free list
for (size_t i = 0; i < capacity; i++) {
void* block = (char*)blocks + i * object_size;
free_node_t* node = (free_node_t*)block;
node->next = free_list;
free_list = node;
}
// Validate pointer during free (debug mode)
bool is_valid_pool_ptr(hft_pool_t* pool, void* ptr) {
if (ptr < pool->blocks || ptr >= pool->blocks_end) {
return false;
}
// Check alignment
uintptr_t offset = (char*)ptr - (char*)pool->blocks;
return (offset % pool->object_size) == 0;
}
6.5 Questions to Guide Your Design
Before implementing, consider:
Pool Allocator Design:
- What happens when the pool is exhausted? (Fail, grow, or fallback?)
- Should objects be cache-line aligned for optimal access?
- How do you handle incorrect free() calls (double-free, wild pointer)?
Arena Allocator Design:
- Can arenas grow, or are they fixed size?
- What’s the reset strategy? (Reset per request? Per frame?)
- How do you handle alignment for variable-size allocations?
Size Class Design:
- What size classes do you need? (Powers of 2? Other spacing?)
- How do you route an allocation to the right size class?
- What about allocations larger than the largest class?
Thread Safety:
- Thread-local pools for the fast path?
- How do pools get refilled from a central allocator?
- How do you handle memory pressure across threads?
6.6 Thinking Exercise
Before coding, trace through this scenario:
EXERCISE: Design a pool for order book entries
Order Entry (used in matching engine):
struct Order {
uint64_t order_id; // 8 bytes
uint64_t price; // 8 bytes
uint64_t quantity; // 8 bytes
uint64_t timestamp; // 8 bytes
uint32_t symbol_id; // 4 bytes
uint8_t side; // 1 byte (BUY/SELL)
uint8_t type; // 1 byte (LIMIT/MARKET)
uint8_t status; // 1 byte
uint8_t padding; // 1 byte (alignment)
struct Order* next; // 8 bytes (for order book queue)
};
Total: 48 bytes
QUESTIONS:
1. What object size should the pool use? (48? 64 for alignment?)
2. If we expect 1 million concurrent orders, how much memory?
3. How many objects fit in L3 cache (assume 32 MB)?
4. Draw the free list after allocating 3 orders and freeing 1.
5. Calculate theoretical alloc/free latency (cache hit vs miss).
Work through the calculation:
- 48 bytes rounds to 64 for cache-line alignment
- 1M orders * 64 bytes = 64 MB pool size
- 32 MB L3 / 64 bytes = 512K orders fit in L3
- After freeing, that order goes to head of free list
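The layout math in the exercise can be checked at compile time. This sketch mirrors the struct above and assumes a typical 64-bit ABI (8-byte pointers, natural alignment):

```c
#include <stdint.h>

struct Order {
    uint64_t order_id;      /* 8 bytes */
    uint64_t price;         /* 8 bytes */
    uint64_t quantity;      /* 8 bytes */
    uint64_t timestamp;     /* 8 bytes */
    uint32_t symbol_id;     /* 4 bytes */
    uint8_t  side;          /* 1 byte */
    uint8_t  type;          /* 1 byte */
    uint8_t  status;        /* 1 byte */
    uint8_t  padding;       /* 1 byte, keeps `next` 8-aligned */
    struct Order *next;     /* 8 bytes */
};

/* 4*8 + 4 + 4*1 + 8 = 48, as the exercise states; fails to compile otherwise. */
_Static_assert(sizeof(struct Order) == 48, "Order should be 48 bytes");
```

Rounding 48 up to a 64-byte cache line, 1M concurrent orders need a 64 MB pool, of which 512K entries fit in a 32 MB L3.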
6.7 Hints in Layers
Hint Level 1 (Starting Point): Start with the simplest possible pool: fixed size, single allocation, no thread safety. Get the free list manipulation correct first.
Hint Level 2 (More Direction): Store the ‘next’ pointer inside the free block itself. When a block is in use, the next pointer is overwritten with user data. When free, it points to the next free block.
Hint Level 3 (Technical Details):
typedef struct free_node {
struct free_node* next;
} free_node_t;
void* pool_alloc(pool_t* p) {
if (!p->free_list) return NULL;
void* block = p->free_list;
p->free_list = p->free_list->next;
return block;
}
void pool_free(pool_t* p, void* block) {
free_node_t* node = (free_node_t*)block;
node->next = p->free_list;
p->free_list = node;
}
Hint Level 4 (Tools and Verification):
Use perf record -e cache-misses to verify your allocator has good cache behavior. Each pool_alloc should show 0-1 cache misses (for the free list head). Compare with malloc which often shows many more.
6.8 The Interview Questions They’ll Ask
If you build a custom allocator, expect these in systems interviews:
- “Walk me through what happens when I call malloc(64)”
- Expected: Describe size class lookup, free list search, potential splitting, header/footer metadata
- “Why would you use a pool allocator instead of malloc?”
- Expected: O(1) vs O(log n), no fragmentation, predictable latency, cache locality
- “How does your pool allocator handle multi-threaded access?”
- Expected: Thread-local pools, lock-free techniques, or explicit locking strategy
- “What’s the tradeoff between pool and arena allocators?”
- Pool: Individual free, fixed size, more overhead
- Arena: No individual free, variable size, minimal overhead
- “How would you detect memory leaks with your allocator?”
- Expected: Track allocations in debug mode, compare alloc/free counts, use allocator-aware tooling
- “Your pool is exhausted. What strategies can you use?”
- Fail (return NULL)
- Grow (allocate another slab)
- Fallback to general allocator
- Steal from other threads’ pools
- “Explain false sharing and how it affects your thread-local pools”
- Different threads writing to same cache line
- Pad pool headers to cache line boundaries
- Keep hot data (free_list head) on separate lines
6.9 Books That Will Help
| Book | Chapters | What You’ll Learn |
|---|---|---|
| Computer Systems: A Programmer’s Perspective (CS:APP) | Ch. 9: Virtual Memory | Heap management, malloc implementation, fragmentation |
| The Linux Programming Interface | Ch. 7: Memory Allocation | brk/sbrk, mmap, allocator internals |
| Fluent C | Ch. 6: Memory Management | C-specific allocation patterns, custom allocators |
| Game Engine Architecture | Ch. 5.2: Memory Management | Pool/arena patterns used in games |
| C++ Concurrency in Action | Ch. 7.2: Memory allocation | Thread-safe allocator design |
Papers:
- “Hoard: A Scalable Memory Allocator” (Berger et al., 2000)
- “A Scalable Concurrent malloc(3)” (Evans, 2006) - jemalloc design
- “Mimalloc: Free List Sharding in Action” (Leijen et al., 2019)
6.10 Implementation Phases
Phase 1: Basic Fixed-Size Pool (Days 1-2)
Goal: Create a working pool allocator for a single object size.
// Minimal implementation
typedef struct {
size_t object_size;
size_t capacity;
void* blocks;
free_node_t* free_list;
} hft_pool_t;
hft_pool_t* hft_pool_create(size_t object_size, size_t capacity);
void* hft_pool_alloc(hft_pool_t* pool);
void hft_pool_free(hft_pool_t* pool, void* ptr);
void hft_pool_destroy(hft_pool_t* pool);
Validation:
- Allocate capacity objects (should all succeed)
- Next allocation should fail (return NULL)
- Free one, allocate one (should succeed)
- All operations < 20 ns
Phase 2: Arena Allocator (Day 3)
Goal: Implement bump allocation with bulk reset.
typedef struct {
char* base;
char* bump_ptr;
char* end;
} hft_arena_t;
hft_arena_t* hft_arena_create(size_t capacity);
void* hft_arena_alloc(hft_arena_t* arena, size_t size);
void hft_arena_reset(hft_arena_t* arena);
void hft_arena_destroy(hft_arena_t* arena);
Validation:
- Allocate various sizes (should all succeed while space remains)
- Reset and allocate again (should reuse space)
- Measure allocation: should be < 5 ns
Phase 3: Size Classes (Days 4-5)
Goal: Route allocations to appropriate pools.
// Size classes: 16, 32, 64, 128, 256, 512, 1024, 2048 bytes
#define NUM_SIZE_CLASSES 8
static size_t size_classes[] = {16, 32, 64, 128, 256, 512, 1024, 2048};
static hft_pool_t* pools[NUM_SIZE_CLASSES];
void hft_init(void);
void* hft_alloc(size_t size);
void hft_free(void* ptr);
Validation:
- hft_alloc(17) uses 32-byte class
- hft_alloc(65) uses 128-byte class
- Free routes to correct pool
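One possible routing scheme for the validation cases above is a linear scan over the class table (a sketch; the `size_class_index` helper is hypothetical, and a production version might use a bit trick over a loop):

```c
#include <stddef.h>

#define NUM_SIZE_CLASSES 8
static const size_t size_classes[NUM_SIZE_CLASSES] =
    {16, 32, 64, 128, 256, 512, 1024, 2048};

/* Return the index of the smallest class that fits `size`,
 * or -1 if the request exceeds the largest class. */
static int size_class_index(size_t size) {
    for (int i = 0; i < NUM_SIZE_CLASSES; i++)
        if (size <= size_classes[i]) return i;
    return -1;  /* caller must fall back to a large-object path */
}
```

With only eight classes the scan is branch-predictable and effectively O(1); the -1 case forces an explicit decision about oversized allocations.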
Phase 4: Thread-Local Pools (Days 6-7)
Goal: Eliminate contention with per-thread allocation.
// Thread-local storage
static __thread hft_pool_t* tls_pools[NUM_SIZE_CLASSES];
static __thread bool tls_initialized = false;
void hft_thread_init(void);
void* hft_alloc(size_t size); // Uses tls_pools
Validation:
- Multi-threaded stress test shows no contention
- Each thread’s allocations don’t affect others
- Performance scales linearly with threads
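The `__thread` mechanism itself can be demonstrated in isolation before wiring it into the allocator (a minimal sketch using a plain counter instead of a pool; `run_two_workers` is a hypothetical helper):

```c
#include <pthread.h>

/* Each thread gets its own copy: increments never contend or false-share. */
static __thread unsigned long tls_alloc_count = 0;

static void *worker(void *out) {
    for (int i = 0; i < 1000000; i++)
        tls_alloc_count++;            /* no lock needed: thread-private */
    *(unsigned long *)out = tls_alloc_count;
    return NULL;
}

/* Run two workers concurrently; each sees only its own counter. */
static void run_two_workers(unsigned long *c1, unsigned long *c2) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, c1);
    pthread_create(&t2, NULL, worker, c2);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
}
```

Both workers report exactly one million even though they ran concurrently, which is the property the thread-local pools rely on: no cross-thread interference on the fast path.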
6.11 Key Implementation Decisions
Decision 1: Minimum Object Size
Objects must be at least as large as a pointer (8 bytes on 64-bit) to store the free list link:
size_t actual_size = object_size < sizeof(void*)
? sizeof(void*)
: object_size;
Decision 2: Alignment Strategy
Choose your alignment based on use case:
// Option 1: Natural alignment (minimum correctness)
#define POOL_ALIGNMENT 8
// Option 2: Cache-line alignment (best performance)
#define POOL_ALIGNMENT 64
// Option 3: User-specified
hft_pool_t* hft_pool_create_aligned(size_t size, size_t cap, size_t align);
Decision 3: Pool Exhaustion Handling
// Option 1: Fail (simplest, fastest)
if (!pool->free_list) return NULL;
// Option 2: Grow (allocate new slab)
if (!pool->free_list) {
pool_grow(pool); // Adds another chunk of blocks
if (!pool->free_list) return NULL;
}
// Option 3: Fallback (use malloc as backup)
if (!pool->free_list) return malloc(pool->object_size);
Decision 4: Debug Features
#ifdef DEBUG_ALLOCATOR
// Add guard bytes to detect overflow
#define GUARD_SIZE 8
#define GUARD_PATTERN 0xDEADBEEF
// Track all allocations
typedef struct {
void* ptr;
size_t size;
const char* file;
int line;
} alloc_record_t;
#endif
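A sketch of how guard bytes might be armed and checked on free (hypothetical helpers; for brevity it uses a single repeated guard byte rather than the 32-bit GUARD_PATTERN above):

```c
#include <stddef.h>
#include <string.h>
#include <stdbool.h>

#define GUARD_SIZE 8
#define GUARD_BYTE 0xDE

/* Fill the guard region that follows the user-visible object. */
static void guard_arm(void *obj, size_t object_size) {
    memset((char *)obj + object_size, GUARD_BYTE, GUARD_SIZE);
}

/* Return true if the guard is intact (nothing wrote past object_size). */
static bool guard_check(const void *obj, size_t object_size) {
    const unsigned char *g = (const unsigned char *)obj + object_size;
    for (size_t i = 0; i < GUARD_SIZE; i++)
        if (g[i] != GUARD_BYTE) return false;
    return true;
}
```

The pool would arm the guard in `hft_pool_alloc` and check it in `hft_pool_free`, turning a silent one-byte overflow into an immediate assertion failure in debug builds.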
Testing Strategy
Unit Tests
// test_pool.c
void test_pool_alloc_free(void) {
hft_pool_t* pool = hft_pool_create(64, 1000);
assert(pool != NULL);
// Allocate one
void* p1 = hft_pool_alloc(pool);
assert(p1 != NULL);
// Free it
hft_pool_free(pool, p1);
// Allocate again - should get same block (LIFO)
void* p2 = hft_pool_alloc(pool);
assert(p2 == p1);
hft_pool_destroy(pool);
}
void test_pool_exhaust(void) {
hft_pool_t* pool = hft_pool_create(64, 10);
void* ptrs[10];
// Allocate all
for (int i = 0; i < 10; i++) {
ptrs[i] = hft_pool_alloc(pool);
assert(ptrs[i] != NULL);
}
// Next should fail
assert(hft_pool_alloc(pool) == NULL);
// Free one
hft_pool_free(pool, ptrs[0]);
// Now should succeed
assert(hft_pool_alloc(pool) != NULL);
hft_pool_destroy(pool);
}
void test_pool_alignment(void) {
hft_pool_t* pool = hft_pool_create(64, 100);
for (int i = 0; i < 100; i++) {
void* p = hft_pool_alloc(pool);
// Check 16-byte alignment (minimum for SSE)
assert(((uintptr_t)p & 0xF) == 0);
}
hft_pool_destroy(pool);
}
Stress Tests
// test_stress.c
void test_pool_stress(void) {
hft_pool_t* pool = hft_pool_create(64, 10000);
void* ptrs[10000];
// Allocate all
for (int i = 0; i < 10000; i++) {
ptrs[i] = hft_pool_alloc(pool);
}
// Randomly free half
for (int i = 0; i < 5000; i++) {
int idx = rand() % 10000;
if (ptrs[idx]) {
hft_pool_free(pool, ptrs[idx]);
ptrs[idx] = NULL;
}
}
// Reallocate
for (int i = 0; i < 10000; i++) {
if (!ptrs[i]) {
ptrs[i] = hft_pool_alloc(pool);
}
}
// Verify pool stats
pool_stats_t stats;
hft_pool_stats(pool, &stats);
assert(stats.allocated_objects <= 10000);
hft_pool_destroy(pool);
}
Performance Tests
// bench_latency.c
void benchmark_latency_distribution(void) {
hft_pool_t* pool = hft_pool_create(64, 100000);
// 800 KB each; static so the arrays don't overflow the stack
static uint64_t latencies[100000];
static void* ptrs[100000];
// Warm-up: allocate every block, then return them all, so the memory
// is paged in and the free list is rebuilt before the measured run
for (int i = 0; i < 100000; i++) {
ptrs[i] = hft_pool_alloc(pool);
}
for (int i = 0; i < 100000; i++) {
hft_pool_free(pool, ptrs[i]);
}
// Measure allocation latency
for (int i = 0; i < 100000; i++) {
uint64_t start = rdtsc();
void* p = hft_pool_alloc(pool);
uint64_t end = rdtsc();
latencies[i] = end - start;
ptrs[i] = p;
}
// Sort and compute percentiles
qsort(latencies, 100000, sizeof(uint64_t), compare_uint64);
printf("Allocation latency (cycles):\n");
printf(" p50: %" PRIu64 "\n", latencies[50000]); // PRIu64 from <inttypes.h>
printf(" p99: %" PRIu64 "\n", latencies[99000]);
printf(" p999: %" PRIu64 "\n", latencies[99900]);
printf(" max: %" PRIu64 "\n", latencies[99999]);
hft_pool_destroy(pool);
}
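The benchmark above leans on two helpers that are easy to forget. A sketch of both, using the x86 timestamp counter with a `clock_gettime` fallback for other architectures:

```c
#include <stdint.h>
#include <stdlib.h>

#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>
// Raw cycle counter; add a fence (e.g. _mm_lfence) if you need serialization.
static inline uint64_t rdtsc(void) {
    return __rdtsc();
}
#else
#include <time.h>
// Fallback: monotonic nanoseconds rather than cycles.
static inline uint64_t rdtsc(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}
#endif

// qsort comparator for uint64_t; avoids the overflow a subtraction would risk.
static int compare_uint64(const void* a, const void* b) {
    uint64_t x = *(const uint64_t*)a;
    uint64_t y = *(const uint64_t*)b;
    return (x > y) - (x < y);
}
```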
Common Pitfalls and Debugging
Pitfall 1: Forgetting Minimum Object Size
Problem:
// BUG: 4-byte objects can't store 8-byte pointer!
hft_pool_t* pool = hft_pool_create(4, 1000);
Symptom: Memory corruption, crashes on 64-bit systems.
Solution:
size_t actual_size = size < sizeof(free_node_t)
? sizeof(free_node_t)
: size;
Pitfall 2: Alignment Violations
Problem:
// BUG: Objects may not be aligned
void* blocks = malloc(capacity * object_size);
Symptom: SIGBUS on strict architectures, poor performance elsewhere.
Solution:
// Use aligned allocation (C11; total size must be a multiple of ALIGNMENT)
void* blocks = aligned_alloc(ALIGNMENT, capacity * object_size);
// Or with posix_memalign (returns 0 on success and fills blocks)
posix_memalign(&blocks, ALIGNMENT, capacity * object_size);
Pitfall 3: Free List Corruption
Problem:
// BUG: Writing to freed block corrupts free list!
void* p = hft_pool_alloc(pool);
hft_pool_free(pool, p);
memset(p, 0, 64); // Oops! Overwrites next pointer
Symptom: Pool returns garbage pointers, crashes.
Solution:
// Debug mode: Check free list integrity
void validate_free_list(hft_pool_t* pool) {
free_node_t* node = pool->free_list;
while (node) {
assert(is_valid_pool_ptr(pool, node));
node = node->next;
}
}
Pitfall 4: Double Free
Problem:
void* p = hft_pool_alloc(pool);
hft_pool_free(pool, p);
hft_pool_free(pool, p); // BUG: Double free!
Symptom: Free list becomes circular, infinite loop or corruption.
Solution:
#ifdef DEBUG
// Track allocated pointers
void hft_pool_free(hft_pool_t* pool, void* ptr) {
if (!is_in_allocated_set(pool, ptr)) {
fprintf(stderr, "Double free detected: %p\n", ptr);
abort();
}
remove_from_allocated_set(pool, ptr);
// ... normal free
}
#endif
Pitfall 5: Thread-Local Initialization
Problem:
// BUG: Forgot to initialize thread-local pools
void* worker_thread(void* arg) {
// hft_thread_init(); // Missing!
void* p = hft_alloc(64); // Crash: tls_pools not initialized
}
Solution:
void* hft_alloc(size_t size) {
if (!tls_initialized) {
hft_thread_init();
}
// ... normal alloc
}
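The lazy-init guard can be expressed with C11 thread-local storage. A minimal sketch matching the snippet above; `tls_pools` here is a placeholder (a real allocator would hold per-thread pool structures), and `malloc` stands in for the real pool paths:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

// One flag (and, in the real allocator, one set of pools) per thread.
static _Thread_local bool tls_initialized = false;
static _Thread_local void* tls_pools = NULL;  // placeholder for per-thread pools

static void hft_thread_init(void) {
    tls_pools = malloc(64);  // stand-in for real per-thread pool setup
    tls_initialized = true;
}

void* hft_alloc(size_t size) {
    if (!tls_initialized) {   // checked on every call; the branch predicts well
        hft_thread_init();
    }
    return malloc(size);      // stand-in for the real size-class pool lookup
}
```

The per-call check costs one well-predicted branch, which is usually preferable to requiring every thread entry point to remember an explicit init call.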
Debugging Checklist
When allocator misbehaves:
- Check alignment: `printf("ptr: %p, aligned: %d\n", p, (int)(((uintptr_t)p & 0xF) == 0));`
- Verify pool bounds: ensure freed pointers fall within the pool's block range
- Count allocations: `allocated_count` should match expectations
- Validate free list: walk the list and verify every pointer is valid
- Use sanitizers: `gcc -fsanitize=address` catches many issues
Extensions and Challenges
Extension 1: NUMA-Aware Allocation
On multi-socket systems, allocate from local memory:
#include <numa.h>
hft_pool_t* hft_pool_create_numa(size_t size, size_t cap, int node) {
void* blocks = numa_alloc_onnode(cap * size, node);
if (!blocks) return NULL;
// ... initialize pool with blocks
// (release with numa_free(blocks, cap * size) in the destroy path)
}
Extension 2: Growing Pools (Slab Allocator)
When a pool runs out, add another slab:
Growing Pool with Slabs:
pool
┌─────────────────────────────────────────────────────────────┐
│ │
│ free_list ───────────────────────────────────────────┐ │
│ ▼ │
│ slab_list ─────────────────────────────────────┐ ┌───┐ │
│ │ │ │ F │ │
│ ▼ ▼ └───┘ │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │ Slab 0 │ ───▶ │ Slab 1 │ ───▶ │ Slab 2 │ ───▶ NULL │
│ │ (full) │ │(partial)│ │ (empty) │ │
│ └─────────┘ └─────────┘ └─────────┘ │
│ │
└─────────────────────────────────────────────────────────────┘
When pool exhausted:
1. Allocate new slab
2. Add all blocks to free list
3. Link slab to slab_list for cleanup
Extension 3: Debugging Features
Add runtime checks for catching bugs:
#define MAGIC_ALLOC 0xA110CA7E // live allocation
#define MAGIC_FREED 0xF4EEDB1D // returned to the pool
typedef struct {
uint32_t magic; // MAGIC_ALLOC or MAGIC_FREED
uint32_t size;
const char* alloc_file;
int alloc_line;
} debug_header_t;
void* hft_pool_alloc_debug(hft_pool_t* pool, const char* file, int line) {
void* block = hft_pool_alloc(pool);
if (block) {
debug_header_t* hdr = (debug_header_t*)block;
hdr->magic = MAGIC_ALLOC;
hdr->alloc_file = file;
hdr->alloc_line = line;
}
return block;
}
#define pool_alloc(p) hft_pool_alloc_debug(p, __FILE__, __LINE__)
Extension 4: Integration with Rust
Implement GlobalAlloc for Rust:
use std::alloc::{GlobalAlloc, Layout};
extern "C" {
fn hft_alloc(size: usize) -> *mut std::ffi::c_void;
fn hft_free(ptr: *mut std::ffi::c_void);
}
struct HftAllocator;
unsafe impl GlobalAlloc for HftAllocator {
unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
// Assumes hft_alloc returns memory aligned for layout.align();
// a robust version must honor the requested alignment explicitly.
hft_alloc(layout.size()) as *mut u8
}
unsafe fn dealloc(&self, ptr: *mut u8, _layout: Layout) {
hft_free(ptr as *mut std::ffi::c_void)
}
}
#[global_allocator]
static ALLOCATOR: HftAllocator = HftAllocator;
Challenge: Lock-Free Pool Allocator
Implement allocation without locks using compare-and-swap:
void* pool_alloc_lockfree(pool_t* pool) {
free_node_t* old_head;
free_node_t* new_head;
do {
old_head = atomic_load(&pool->free_list); // free_list must be _Atomic
if (old_head == NULL) return NULL;
new_head = old_head->next;
} while (!atomic_compare_exchange_weak(&pool->free_list,
&old_head, new_head));
return old_head;
}
// Caution: this naive pop has the classic ABA problem. Production versions
// pair the head pointer with a version counter (tagged pointer) or use
// hazard pointers to make the CAS safe.
Real-World Connections
Industry Allocators
jemalloc (FreeBSD, Facebook):
- Multiple arenas to reduce contention
- Size-class segregation
- Thread-local caching
- Used by Firefox, Redis, Cassandra
tcmalloc (Google):
- Thread-caching malloc
- Per-thread free lists for small objects
- Central heap for larger allocations
- Used internally at Google
mimalloc (Microsoft):
- Free list sharding
- Security features (guard pages, randomization)
- Excellent performance across workloads
HFT Practices
Real HFT systems use techniques like:
- Pre-allocation at startup: All memory allocated before market opens
- Object pools for each message type: Order pool, Quote pool, Trade pool
- Arena per trading session: Reset at session end
- Huge pages: 2MB pages reduce TLB misses
HFT Memory Architecture Example:
┌─────────────────────────────────────────────────────────────────┐
│ Trading Application │
│ │
│ ┌───────────────┐ ┌───────────────┐ ┌───────────────┐ │
│ │ Order Pool │ │ Quote Pool │ │ Trade Pool │ │
│ │ 1M orders │ │ 500K quotes │ │ 100K trades │ │
│ │ 64 bytes each │ │ 32 bytes each │ │ 48 bytes each │ │
│ └───────────────┘ └───────────────┘ └───────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Per-Request Arena (reset after each order) │ │
│ │ │ │
│ │ Temp buffers, string parsing, validation structures │ │
│ │ │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
│ All memory pre-allocated on huge pages before market open │
│ Zero malloc() calls in trading hot path │
│ │
└─────────────────────────────────────────────────────────────────┘
Resources
Essential Reading
- CS:APP Chapter 9: Virtual Memory and Dynamic Memory Allocation
- “What Every Programmer Should Know About Memory” by Ulrich Drepper
- jemalloc design paper: “A Scalable Concurrent malloc(3) Implementation”
- “Fluent C” Chapter 6: Practical C allocator patterns
Reference Implementations
| Implementation | Language | Features |
|---|---|---|
| jemalloc | C | Production-grade, extensive |
| mimalloc | C | Modern, security-focused |
| Hoard | C++ | Academic, influential |
| bumpalo | Rust | Arena allocator |
| typed-arena | Rust | Typed arena |
Tools
- perf:
perf stat -e cache-misses ./bench - Valgrind:
valgrind --tool=massif ./program - heaptrack: Visual heap profiling
- AddressSanitizer:
gcc -fsanitize=address
Self-Assessment Checklist
Before considering this project complete, verify:
Core Understanding
- You can explain why malloc has unpredictable latency
- You can describe the difference between pool and arena allocators
- You understand internal vs external fragmentation
- You can explain the free list data structure
Implementation
- Pool allocator achieves O(1) alloc and free
- Arena allocator uses bump allocation
- All allocations are properly aligned
- Thread-local pools avoid contention
Testing
- Unit tests cover allocation, deallocation, exhaustion
- Stress tests complete without memory corruption
- Latency benchmark shows < 10 ns p99
Performance
- Demonstrated 15-30x speedup over malloc
- Integrated into order book or similar structure
- Zero external fragmentation by design
Extensions (Optional)
- Implemented size classes for variable sizes
- Added debug features (leak detection, double-free detection)
- Thread-safe with lock-free or thread-local approach
Submission/Completion Criteria
Your project is complete when you have:
- Working Code:
  - `hft_pool.c/h`: Fixed-size pool allocator
  - `hft_arena.c/h`: Bump-pointer arena allocator
  - `hft_alloc.c/h`: Size-class router (optional)
  - Test suite with > 90% coverage
- Benchmark Results:

      Memory Allocator Benchmark
      ==========================
      malloc:      ~80 ns avg
      pool_alloc:   ~6 ns avg
      Speedup:     13x+

      Latency Distribution (pool_alloc):
        p50:  ~5 ns
        p99:  ~10 ns
        p999: ~15 ns

- Integration Demo:
- Pool allocator integrated into order book (from Project 1)
- Before/after latency comparison
- Memory stats showing zero fragmentation
- Documentation:
- README with usage examples
- API documentation
- Design decisions explained
Learning Milestones
Milestone 1: Basic Fixed-Size Pool
- Understand free list manipulation
- Achieve O(1) operations
- Pass basic unit tests
- Measure: 10-20 ns per operation
Milestone 2: Multiple Size Classes
- Route allocations to appropriate pool
- Handle various object sizes
- Manage memory efficiently
- Measure: Size-class lookup < 5 ns overhead
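The routing in Milestone 2 can be a tiny lookup. A sketch with assumed classes of 16/32/64/128/256 bytes (your class sizes may differ):

```c
#include <stddef.h>

#define NUM_CLASSES 5
static const size_t class_sizes[NUM_CLASSES] = { 16, 32, 64, 128, 256 };

// Map a request size to the smallest class that fits; -1 means the request
// is too large and should be routed to a fallback allocator.
static int size_class(size_t n) {
    for (int i = 0; i < NUM_CLASSES; i++) {
        if (n <= class_sizes[i]) return i;
    }
    return -1;
}
```

With power-of-two classes, a production router typically replaces the loop with a small table or a `__builtin_clz`-based computation to keep the lookup within a few nanoseconds.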
Milestone 3: Thread-Safe Version
- Thread-local pools for fast path
- Central pool for refills
- No contention in hot path
- Measure: Linear scaling with threads
After completing this project, you will understand why HFT systems never call malloc in the critical path. You'll have built an allocator that is an order of magnitude faster than general-purpose alternatives and gained deep knowledge of memory-management internals. This knowledge transfers to any performance-critical application, from databases to game engines to embedded systems.