Project 5: Custom Memory Allocator for HFT

Build a specialized memory allocator (arena/pool allocator) that eliminates malloc from your hot path - because malloc is a latency killer in HFT systems.

Quick Reference

| Attribute | Value |
| --- | --- |
| Difficulty | Intermediate |
| Time Estimate | 1 week |
| Language | C or C++ (Rust for GlobalAlloc) |
| Prerequisites | Understanding of pointers, basic memory concepts |
| Key Topics | Pool allocators, arena allocators, thread-local storage, fragmentation |

1. Learning Objectives

By completing this project, you will be able to:

  1. Explain why malloc is unacceptable for HFT hot paths and quantify the latency difference between general-purpose and specialized allocators (250ns vs 8ns).

  2. Implement a fixed-size pool allocator with O(1) allocation and deallocation using a free list data structure.

  3. Design an arena allocator (bump allocator) for ultra-fast allocation when bulk deallocation is acceptable.

  4. Create multiple size-class pools that can handle varying object sizes without external fragmentation.

  5. Implement thread-local pools to eliminate contention in multi-threaded HFT systems.

  6. Integrate your allocator with C++ std::allocator or Rust’s GlobalAlloc trait for seamless use with standard containers.

  7. Benchmark and verify determinism by measuring allocation latency distributions and comparing p50/p99/p999 with malloc.

  8. Identify and avoid common pitfalls including memory leaks, use-after-free, and false sharing in concurrent allocators.


2. Theoretical Foundation

2.1 Core Concepts

Why malloc is Slow for HFT

General-purpose allocators (glibc’s ptmalloc, jemalloc, tcmalloc) are optimized for a wide range of use cases, but this flexibility comes at a cost:

Why malloc Has Unpredictable Latency
┌────────────────────────────────────────────────────────────────────┐
│                                                                    │
│  1. SYSTEM CALLS (worst case)                                     │
│     ┌──────────┐     ┌──────────┐     ┌──────────┐                │
│     │  malloc  │────►│  sbrk/   │────►│  Kernel  │                │
│     │          │     │  mmap    │     │  Context │                │
│     └──────────┘     └──────────┘     │  Switch  │                │
│           │                            └──────────┘                │
│           │              +10-50μs (microseconds!)                  │
│           │                                                        │
│  2. FRAGMENTATION OVERHEAD                                        │
│     ┌──────────────────────────────────────────┐                  │
│     │ Free: [16B] [32B] [16B] [64B] [8B] ...   │                  │
│     │ Request 24B → Search free list → Split   │                  │
│     │ Coalescing on free → More overhead       │                  │
│     └──────────────────────────────────────────┘                  │
│                                                                    │
│  3. LOCKING (multi-threaded)                                      │
│     ┌─────────┐  Lock contention  ┌─────────┐                     │
│     │ Thread1 │◄─────────────────►│ Thread2 │                     │
│     └─────────┘                   └─────────┘                     │
│           │                             │                          │
│           └──────── Mutex wait ─────────┘                          │
│                    +1-10μs per contention                          │
│                                                                    │
│  4. NON-DETERMINISTIC LATENCY                                     │
│     Best case:  ~50ns  (fast path, no contention)                 │
│     Typical:    ~250ns (normal workload)                          │
│     Worst case: ~50μs  (mmap, high contention)                    │
│                                                                    │
│     HFT cannot tolerate this variance!                            │
│                                                                    │
└────────────────────────────────────────────────────────────────────┘

The Four Killers of malloc Performance:

  1. System Calls: When malloc’s internal pools are exhausted, it must call sbrk() or mmap() to get more memory from the OS. A single system call can take 10-50 microseconds - an eternity in HFT.

  2. Fragmentation Management: General-purpose allocators must handle any size request. This requires maintaining complex data structures (free lists, size classes, bins) and performing operations like coalescing adjacent free blocks.

  3. Thread Synchronization: Multi-threaded programs require locks on shared allocator state. Even with optimizations like per-thread caches (tcmalloc), contention still occurs during cache replenishment.

  4. Non-Determinism: The combination above means malloc has highly variable latency. HFT systems need predictable, worst-case latency guarantees.

Pool Allocators: Fixed-Size Block Freedom

A pool allocator pre-allocates a large chunk of memory and divides it into fixed-size blocks. The key insight: if all blocks are the same size, no external fragmentation can occur.

Pool Allocator Structure
┌─────────────────────────────────────────────────────────────────────┐
│                                                                     │
│  MEMORY POOL (single contiguous allocation)                        │
│  ┌─────────────────────────────────────────────────────────────┐   │
│  │                                                             │   │
│  │  Block 0    Block 1    Block 2    Block 3    Block 4  ...   │   │
│  │  ┌──────┐   ┌──────┐   ┌──────┐   ┌──────┐   ┌──────┐       │   │
│  │  │ next │   │ next │   │ next │   │ next │   │ next │       │   │
│  │  │  ──────► │  ──────► │  ──────► │  ──────► │ NULL │       │   │
│  │  │      │   │      │   │      │   │      │   │      │       │   │
│  │  └──────┘   └──────┘   └──────┘   └──────┘   └──────┘       │   │
│  │                                                             │   │
│  └─────────────────────────────────────────────────────────────┘   │
│        ▲                                                           │
│        │                                                           │
│  FREE LIST HEAD                                                    │
│  (points to first free block)                                      │
│                                                                     │
│  ═══════════════════════════════════════════════════════════════   │
│                                                                     │
│  ALLOCATION: O(1)                                                  │
│  ┌─────────────────────────────────────────────────────────────┐   │
│  │  1. block = free_list_head                                  │   │
│  │  2. free_list_head = block->next                            │   │
│  │  3. return block                                            │   │
│  └─────────────────────────────────────────────────────────────┘   │
│                                                                     │
│  DEALLOCATION: O(1)                                                │
│  ┌─────────────────────────────────────────────────────────────┐   │
│  │  1. block->next = free_list_head                            │   │
│  │  2. free_list_head = block                                  │   │
│  └─────────────────────────────────────────────────────────────┘   │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

Key Properties of Pool Allocators:

  • O(1) Allocation: Just pop from the free list head
  • O(1) Deallocation: Just push to the free list head
  • No External Fragmentation: All blocks are identical size
  • No System Calls: Memory is pre-allocated at startup
  • Cache Friendly: Blocks are contiguous in memory

Trade-off: You must know the block size at compile time or pool creation time.
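
The sketch below shows the smallest version of this structure - compile-time block size and count, single-threaded, no statistics. The FixedPool name and interface are illustrative, not a prescribed API:

#include <cstddef>

// Minimal fixed-size pool: an intrusive free list threaded through the
// blocks themselves. Single-threaded, no debug checks.
template <std::size_t BlockSize, std::size_t NumBlocks>
class FixedPool {
    static_assert(BlockSize >= sizeof(void*), "block must hold a next pointer");

    alignas(std::max_align_t) char memory_[BlockSize * NumBlocks];
    void* free_head_;

public:
    FixedPool() : free_head_(memory_) {
        // Link block i to block i+1; the last block terminates the list.
        for (std::size_t i = 0; i + 1 < NumBlocks; ++i)
            *reinterpret_cast<void**>(memory_ + i * BlockSize) =
                memory_ + (i + 1) * BlockSize;
        *reinterpret_cast<void**>(memory_ + (NumBlocks - 1) * BlockSize) = nullptr;
    }

    void* allocate() {                      // O(1): pop the list head
        if (free_head_ == nullptr) return nullptr;  // pool exhausted
        void* block = free_head_;
        free_head_ = *static_cast<void**>(block);
        return block;
    }

    void deallocate(void* block) {          // O(1): push onto the list head
        *static_cast<void**>(block) = free_head_;
        free_head_ = block;
    }
};

Note that BlockSize should be a multiple of the strictest alignment the stored type needs; see the alignment pitfall in section 7.2.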

Arena Allocators: The Simplest Fast Allocator

An arena allocator (also called a bump allocator or linear allocator) is even simpler than a pool allocator:

Arena Allocator (Bump Pointer)
┌─────────────────────────────────────────────────────────────────────┐
│                                                                     │
│  ARENA MEMORY                                                       │
│  ┌─────────────────────────────────────────────────────────────┐   │
│  │ ██████████████████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ │   │
│  │ ◄── allocated ──────►◄──────────── free ─────────────────► │   │
│  └─────────────────────────────────────────────────────────────┘   │
│                          ▲                                    ▲     │
│                          │                                    │     │
│                     bump_ptr                               end      │
│                                                                     │
│  ═══════════════════════════════════════════════════════════════   │
│                                                                     │
│  ALLOCATION: O(1) - just bump the pointer!                         │
│  ┌─────────────────────────────────────────────────────────────┐   │
│  │  1. result = bump_ptr                                       │   │
│  │  2. bump_ptr += size                                        │   │
│  │  3. return result                                           │   │
│  └─────────────────────────────────────────────────────────────┘   │
│                                                                     │
│  DEALLOCATION: Not supported individually!                         │
│  ┌─────────────────────────────────────────────────────────────┐   │
│  │  reset(): bump_ptr = start  // Free everything at once     │   │
│  └─────────────────────────────────────────────────────────────┘   │
│                                                                     │
│  USE CASES:                                                        │
│  • Per-request allocations that all get freed together             │
│  • Game frame allocations (allocate during frame, reset at end)    │
│  • Order processing (allocate for order, reset after complete)     │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

Arena Allocator Trade-offs:

| Advantage | Disadvantage |
| --- | --- |
| Fastest possible allocation (single pointer bump) | Cannot free individual objects |
| Zero overhead per allocation | Must reset entire arena at once |
| Perfect cache locality | May waste memory if objects vary in size |
| No fragmentation | Fixed maximum capacity |

HFT Use Case: Processing a batch of market data updates. Allocate temporary structures for each message, then reset the arena when the batch is complete.
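
A minimal bump-allocator sketch over a caller-provided buffer; the 8-byte alignment choice and the names are illustrative assumptions:

#include <cstddef>

// Minimal arena: allocation is an aligned pointer bump; reset() frees
// everything at once by rewinding to the start of the buffer.
class Arena {
    char* base_;
    char* current_;
    char* end_;

    static std::size_t align_up(std::size_t n, std::size_t a) {
        return (n + a - 1) & ~(a - 1);      // a must be a power of two
    }

public:
    Arena(void* buffer, std::size_t size)
        : base_(static_cast<char*>(buffer)), current_(base_), end_(base_ + size) {}

    void* allocate(std::size_t size) {
        std::size_t n = align_up(size, 8);  // keep every result 8-byte aligned
        if (n > static_cast<std::size_t>(end_ - current_))
            return nullptr;                 // arena exhausted
        void* result = current_;
        current_ += n;                      // the entire hot path: one add
        return result;
    }

    void reset() { current_ = base_; }      // invalidates ALL prior pointers
};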

Thread-Local Pools: Eliminating Contention

In multi-threaded HFT systems, each thread typically handles different tasks:

  • Thread 1: Market data processing
  • Thread 2: Strategy execution
  • Thread 3: Order management

Sharing a single allocator between threads creates contention. The solution: thread-local pools.

Thread-Local Storage (TLS) Pattern
┌─────────────────────────────────────────────────────────────────────┐
│                                                                     │
│  TRADITIONAL SHARED ALLOCATOR (BAD for HFT)                        │
│                                                                     │
│  ┌────────┐     ┌────────────┐     ┌────────┐                      │
│  │Thread 1│────►│   Shared   │◄────│Thread 2│                      │
│  └────────┘     │  Allocator │     └────────┘                      │
│       ▲         │   (LOCK!)  │          ▲                          │
│       │         └────────────┘          │                          │
│       └──────── CONTENTION ─────────────┘                          │
│                                                                     │
│  ═══════════════════════════════════════════════════════════════   │
│                                                                     │
│  THREAD-LOCAL POOLS (GOOD for HFT)                                 │
│                                                                     │
│  ┌────────────────────────────────────────────────────────────┐    │
│  │ Thread 1                                                   │    │
│  │ ┌─────────────────────────────────────────────────────┐   │    │
│  │ │ TLS Pool: [████░░░░░░░░░░░░░░░░░░░]                  │   │    │
│  │ └─────────────────────────────────────────────────────┘   │    │
│  └────────────────────────────────────────────────────────────┘    │
│                                                                     │
│  ┌────────────────────────────────────────────────────────────┐    │
│  │ Thread 2                                                   │    │
│  │ ┌─────────────────────────────────────────────────────┐   │    │
│  │ │ TLS Pool: [██████████░░░░░░░░░░░░░]                  │   │    │
│  │ └─────────────────────────────────────────────────────┘   │    │
│  └────────────────────────────────────────────────────────────┘    │
│                                                                     │
│  ┌────────────────────────────────────────────────────────────┐    │
│  │ Thread 3                                                   │    │
│  │ ┌─────────────────────────────────────────────────────┐   │    │
│  │ │ TLS Pool: [████████░░░░░░░░░░░░░░░]                  │   │    │
│  │ └─────────────────────────────────────────────────────┘   │    │
│  └────────────────────────────────────────────────────────────┘    │
│                                                                     │
│  NO CONTENTION! Each thread has its own pool.                      │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

Avoiding False Sharing:

Even with thread-local pools, you must avoid false sharing - when two threads modify data on the same cache line:

False Sharing Problem
┌─────────────────────────────────────────────────────────────────────┐
│                                                                     │
│  CACHE LINE (64 bytes)                                             │
│  ┌──────────────────────────────────────────────────────────────┐  │
│  │ Thread1_head │ Thread2_head │         unused                 │  │
│  │   8 bytes    │   8 bytes    │        48 bytes                │  │
│  └──────────────────────────────────────────────────────────────┘  │
│         ▲                ▲                                          │
│         │                │                                          │
│     Thread 1         Thread 2                                       │
│     writes           writes                                         │
│                                                                     │
│  PROBLEM: Both threads invalidate each other's cache line!         │
│                                                                     │
│  ═══════════════════════════════════════════════════════════════   │
│                                                                     │
│  SOLUTION: Pad to cache line boundary                              │
│                                                                     │
│  CACHE LINE 1                        CACHE LINE 2                  │
│  ┌────────────────────────────┐     ┌────────────────────────────┐ │
│  │ Thread1_head │   padding   │     │ Thread2_head │   padding   │ │
│  │   8 bytes    │  56 bytes   │     │   8 bytes    │  56 bytes   │ │
│  └────────────────────────────┘     └────────────────────────────┘ │
│                                                                     │
│  NO FALSE SHARING! Each thread owns its cache line.                │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
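
The padding fix, as a short C++ sketch (assuming the 64-byte cache line shown above; PerThreadHead is an illustrative name):

// alignas(64) puts each head on its own cache line; because sizeof is
// always a multiple of alignof, the compiler inserts the 56 padding
// bytes automatically.
struct alignas(64) PerThreadHead {
    void* free_list_head;        // the only field each thread writes
};

PerThreadHead heads[2];          // heads[0] and heads[1] can never share a line
static_assert(sizeof(PerThreadHead) == 64, "one full cache line per thread");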

2.2 Why This Matters in HFT

HFT systems operate under extreme constraints:

| Constraint | Requirement | Why Allocators Matter |
| --- | --- | --- |
| Latency | <10μs tick-to-trade | malloc’s 250ns+ adds up across many allocations |
| Determinism | Predictable worst case | malloc’s p99 can be 100x its p50 |
| Throughput | 1M+ messages/second | Lock contention destroys throughput |
| Memory | Minimize cache misses | Scattered allocations harm locality |

Real-world impact: A trading strategy that allocates during the hot path will have inconsistent performance. On the day it matters most (high volatility), it will perform worst (more allocations, more contention).

The HFT rule: Never call malloc in the hot path.

Instead:

  1. Pre-allocate all memory at startup
  2. Use pool allocators for fixed-size objects (orders, quotes, trade records)
  3. Use arena allocators for temporary per-message processing
  4. Use thread-local pools to eliminate contention

2.3 Historical Context

Memory allocation has evolved significantly:

| Era | Approach | Characteristics |
| --- | --- | --- |
| 1970s | Simple first-fit | O(n) search, severe fragmentation |
| 1980s | Buddy allocators | Power-of-2 sizes, reduced fragmentation |
| 1990s | Doug Lea’s malloc (dlmalloc) | Size classes, boundary tags |
| 2000s | tcmalloc, jemalloc | Thread caching, per-CPU caches |
| 2010s+ | Custom HFT allocators | Application-specific, zero overhead |

Key insight: General-purpose allocators keep adding features to handle more use cases. HFT goes in the opposite direction: strip away everything not needed for the specific use case.

2.4 Common Misconceptions

Misconception 1: “jemalloc/tcmalloc is fast enough”

Reality: They are faster than glibc malloc but still have unpredictable latency. Thread cache replenishment can stall, and their complexity means more cache pollution.

Misconception 2: “I can just use stack allocation”

Reality: Stack allocation is fast but limited in size (typically 1-8MB) and lifetime. You cannot return stack-allocated data from a function, and deep call stacks compete for stack space.

Misconception 3: “Object pools are only for memory savings”

Reality: The primary benefit in HFT is latency determinism, not memory efficiency. Pools guarantee O(1) allocation with no system calls.

Misconception 4: “Rust’s ownership system eliminates the need for custom allocators”

Reality: Rust’s safety guarantees help prevent bugs, but Box::new() still calls the global allocator (malloc). For HFT Rust code, you still need custom allocators.


3. Project Specification

3.1 What You Will Build

A custom memory allocation library that provides:

  1. Fixed-size pool allocator: Pre-allocated blocks of a single size with O(1) alloc/free
  2. Multi-pool allocator: Multiple pools for different size classes
  3. Arena allocator: Bump-pointer allocation with bulk reset
  4. Thread-local wrapper: Per-thread pools without locks
  5. C++ std::allocator integration: Use with std::vector, std::list, etc.
  6. Comprehensive benchmarks: Compare against malloc, jemalloc

3.2 Functional Requirements

| Requirement | Description |
| --- | --- |
| FR1 | Pool allocator with configurable block size and count |
| FR2 | O(1) allocation and deallocation for pool allocator |
| FR3 | Arena allocator with configurable size and reset capability |
| FR4 | Multi-pool supporting at least 4 size classes (e.g., 32B, 64B, 128B, 256B) |
| FR5 | Thread-local pool storage without locks for same-thread alloc/free |
| FR6 | Statistics tracking: allocations, frees, high watermark, current usage |
| FR7 | Debug mode: detect double-free, use-after-free (at cost of performance) |
| FR8 | Integration with C++ STL containers via std::allocator interface |

3.3 Non-Functional Requirements

| Requirement | Target |
| --- | --- |
| NFR1 | Pool allocation: <15ns average latency |
| NFR2 | Pool deallocation: <10ns average latency |
| NFR3 | Deterministic: p99 < 2x p50 for pool operations |
| NFR4 | Zero system calls after initialization |
| NFR5 | Memory overhead: <5% for block metadata |
| NFR6 | Thread-local pools: zero contention between threads |

3.4 Example Usage / Output

Benchmark Output:

=== ALLOCATOR BENCHMARK ===
Running 1,000,000 allocation/deallocation cycles...

malloc (glibc 2.31):
  alloc avg:  247ns   p50: 180ns   p99: 1.2μs   p999: 8.5μs
  free avg:   125ns   p50: 95ns    p99: 820ns   p999: 5.2μs

pool_alloc (fixed 64B blocks):
  alloc avg:  8ns     p50: 7ns     p99: 12ns    p999: 15ns
  free avg:   6ns     p50: 5ns     p99: 9ns     p999: 11ns

arena_alloc (bump pointer):
  alloc avg:  3ns     p50: 3ns     p99: 4ns     p999: 5ns
  reset avg:  2ns     (bulk free)

Improvement over malloc:
  Pool allocation:  30x faster average, 100x better p99
  Arena allocation: 80x faster average, 300x better p99

=== INTEGRATION TEST: ORDER BOOK ===
Order book with std::allocator (malloc):
  Insert order avg: 850ns   p99: 12μs
  Remove order avg: 420ns   p99: 8μs

Order book with pool_allocator:
  Insert order avg: 320ns   p99: 450ns
  Remove order avg: 180ns   p99: 280ns

Improvement: 62% lower latency, 96% lower tail latency

=== MEMORY STATS ===
Pool: 64B blocks
  Total capacity: 65536 blocks (4MB)
  Current used:   12847 blocks
  Peak usage:     24531 blocks
  Allocations:    1,847,293
  Frees:          1,834,446

Arena: 1MB
  Current offset: 524288 bytes
  Resets:         1,423
  Allocations:    14,847,291

Code Usage Example (C++):

// Example of how your allocator will be used
#include "pool_allocator.hpp"

// Create a pool for Order objects
using OrderPool = PoolAllocator<Order, 10000>;
OrderPool order_pool;

// Allocate without malloc
Order* order = order_pool.allocate();
order->price = 100.50;
order->quantity = 100;

// Use in hot path with zero malloc calls
process_order(order);

// Free back to pool
order_pool.deallocate(order);

// Use with STL containers
std::vector<Order, OrderPool> orders(order_pool);
orders.push_back(Order{...});  // Uses pool, not malloc

3.5 Real World Outcome

After completing this project, you will have:

  1. A working allocator library you can integrate into your order book from Project 1
  2. Measurable latency improvements with benchmark data showing 10-100x improvement
  3. Understanding of memory systems that applies to any performance-critical code
  4. Interview-ready knowledge about memory allocation internals

This allocator will directly integrate with your other HFT projects:

  • Project 1 (Order Book): Use pool allocator for Order and PriceLevel objects
  • Project 2 (Lock-Free Queue): Use arena allocator for message parsing
  • Project 3 (Matching Engine): Use thread-local pools for per-connection state

4. Solution Architecture

4.1 High-Level Design

Custom Allocator System Architecture
┌─────────────────────────────────────────────────────────────────────┐
│                                                                     │
│  APPLICATION LAYER                                                  │
│  ┌─────────────────────────────────────────────────────────────┐   │
│  │  std::vector<T, PoolAllocator<T>>                           │   │
│  │  Order* order = pool.allocate()                             │   │
│  │  Message* msg = arena.allocate(sizeof(Message))             │   │
│  └─────────────────────────────────────────────────────────────┘   │
│                              │                                      │
│                              ▼                                      │
│  ┌─────────────────────────────────────────────────────────────┐   │
│  │  ALLOCATOR INTERFACE                                        │   │
│  │  ┌─────────────┐ ┌─────────────┐ ┌─────────────────────┐   │   │
│  │  │ allocate()  │ │deallocate() │ │ reset() (arena)     │   │   │
│  │  └─────────────┘ └─────────────┘ └─────────────────────┘   │   │
│  └─────────────────────────────────────────────────────────────┘   │
│                              │                                      │
│          ┌───────────────────┼───────────────────┐                 │
│          ▼                   ▼                   ▼                 │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐          │
│  │  POOL        │    │  ARENA       │    │  MULTI-POOL  │          │
│  │  ALLOCATOR   │    │  ALLOCATOR   │    │  ALLOCATOR   │          │
│  │              │    │              │    │              │          │
│  │ Fixed-size   │    │ Bump pointer │    │ Size-class   │          │
│  │ Free list    │    │ Bulk reset   │    │ routing      │          │
│  └──────────────┘    └──────────────┘    └──────────────┘          │
│          │                   │                   │                  │
│          └───────────────────┼───────────────────┘                 │
│                              ▼                                      │
│  ┌─────────────────────────────────────────────────────────────┐   │
│  │  THREAD-LOCAL WRAPPER (optional)                            │   │
│  │                                                             │   │
│  │  ┌───────────┐  ┌───────────┐  ┌───────────┐               │   │
│  │  │ Thread 1  │  │ Thread 2  │  │ Thread 3  │               │   │
│  │  │   Pool    │  │   Pool    │  │   Pool    │               │   │
│  │  └───────────┘  └───────────┘  └───────────┘               │   │
│  └─────────────────────────────────────────────────────────────┘   │
│                              │                                      │
│                              ▼                                      │
│  ┌─────────────────────────────────────────────────────────────┐   │
│  │  BACKING MEMORY                                             │   │
│  │                                                             │   │
│  │  Single large allocation at startup (mmap or new char[])   │   │
│  │  ┌─────────────────────────────────────────────────────┐   │   │
│  │  │░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░│   │   │
│  │  └─────────────────────────────────────────────────────┘   │   │
│  │                                                             │   │
│  └─────────────────────────────────────────────────────────────┘   │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

4.2 Key Components

| Component | Responsibility | Key Data Structures |
| --- | --- | --- |
| PoolAllocator | Fixed-size block allocation | Free list, block array |
| ArenaAllocator | Bump-pointer allocation | Bump pointer, base pointer |
| MultiPoolAllocator | Route to size-appropriate pool | Array of pools, size-to-pool mapping |
| ThreadLocalPool | Per-thread pool access | Thread-local storage (TLS) |
| Statistics | Track usage metrics | Atomic counters |
| DebugAllocator | Detect memory errors | Guard bytes, allocation tracking |

4.3 Data Structures

Pool Allocator Internal Structure

Pool Allocator Memory Layout
┌─────────────────────────────────────────────────────────────────────┐
│                                                                     │
│  METADATA (stored separately from blocks for cache efficiency)     │
│  ┌────────────────────────────────────────────────────────────┐    │
│  │ block_size: 64                                              │    │
│  │ num_blocks: 10000                                           │    │
│  │ free_count: 7342                                            │    │
│  │ free_list_head: ──────────────────────────────┐             │    │
│  │ memory_base: ──────────────────────┐          │             │    │
│  └────────────────────────────────────│──────────│─────────────┘    │
│                                       │          │                  │
│                                       ▼          │                  │
│  BLOCK ARRAY                                     │                  │
│  ┌───────────────────────────────────────────────│─────────────┐   │
│  │                                               │             │   │
│  │  Block 0      Block 1      Block 2           │             │   │
│  │  ┌────────┐   ┌────────┐   ┌────────┐   ◄────┘             │   │
│  │  │ IN USE │   │ next ──────► next ──────► NULL             │   │
│  │  │ (data) │   │ (free) │   │ (free) │                      │   │
│  │  │        │   │        │   │        │                      │   │
│  │  └────────┘   └────────┘   └────────┘   ...                │   │
│  │     64B         64B          64B                            │   │
│  │                                                             │   │
│  └─────────────────────────────────────────────────────────────┘   │
│                                                                     │
│  INTRUSIVE FREE LIST (no separate allocation for list nodes!)     │
│                                                                     │
│  When a block is FREE, its first 8 bytes store the 'next' pointer │
│  When a block is IN USE, user data overwrites the 'next' pointer  │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

Arena Allocator Internal Structure

Arena Allocator Memory Layout
┌─────────────────────────────────────────────────────────────────────┐
│                                                                     │
│  METADATA                                                          │
│  ┌──────────────────────────────────────────────────────────────┐  │
│  │ base:     0x7f8b4c000000                                     │  │
│  │ end:      0x7f8b4c100000  (base + 1MB)                       │  │
│  │ current:  0x7f8b4c002800  (next allocation position)         │  │
│  │ allocs:   142                                                │  │
│  │ resets:   3                                                  │  │
│  └──────────────────────────────────────────────────────────────┘  │
│                                                                     │
│  MEMORY REGION                                                     │
│  ┌──────────────────────────────────────────────────────────────┐  │
│  │████████████████████████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░│  │
│  │                           ▲                                 ▲│  │
│  │                           │                                 ││  │
│  │                        current                            end│  │
│  │                                                              │  │
│  │  ◄────── allocated ──────►◄────────── available ────────────►│  │
│  │                                                              │  │
│  └──────────────────────────────────────────────────────────────┘  │
│                                                                     │
│  ALLOCATION (returns current, bumps pointer)                       │
│  ┌──────────────────────────────────────────────────────────────┐  │
│  │ Before: current = 0x2800                                     │  │
│  │ Request: 128 bytes (aligned to 8)                            │  │
│  │ After:  current = 0x2880                                     │  │
│  │ Return: 0x2800                                               │  │
│  └──────────────────────────────────────────────────────────────┘  │
│                                                                     │
│  RESET (restore to beginning)                                      │
│  ┌──────────────────────────────────────────────────────────────┐  │
│  │ current = base                                               │  │
│  │ // All previous allocations now invalid                      │  │
│  └──────────────────────────────────────────────────────────────┘  │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

Multi-Pool Size Class Routing

Multi-Pool Allocator Size Class Design
┌─────────────────────────────────────────────────────────────────────┐
│                                                                     │
│  SIZE CLASSES (power of 2 for simplicity)                          │
│                                                                     │
│  Request Size    Rounded To    Pool Index    Wasted Bytes (max)    │
│  ┌────────────┬─────────────┬────────────┬──────────────────────┐  │
│  │   1-32     │     32      │     0      │      31 (97%)        │  │
│  │  33-64     │     64      │     1      │      31 (48%)        │  │
│  │  65-128    │    128      │     2      │      63 (49%)        │  │
│  │ 129-256    │    256      │     3      │     127 (50%)        │  │
│  │ 257-512    │    512      │     4      │     255 (50%)        │  │
│  │ 513-1024   │   1024      │     5      │     511 (50%)        │  │
│  │  >1024     │   fallback to malloc     │                      │  │
│  └────────────┴─────────────┴────────────┴──────────────────────┘  │
│                                                                     │
│  ROUTING FUNCTION (must be O(1)!)                                  │
│  ┌──────────────────────────────────────────────────────────────┐  │
│  │ // Efficient power-of-2 routing using bit manipulation       │  │
│  │ pool_index = (size <= 32) ? 0 :                              │  │
│  │              (63 - count_leading_zeros(size - 1)) - 4;       │  │
│  │                                                              │  │
│  │ // Or use lookup table for small sizes (cache-friendly)      │  │
│  │ pool_index = size_to_pool_table[size];                       │  │
│  └──────────────────────────────────────────────────────────────┘  │
│                                                                     │
│  POOL ARRAY                                                        │
│  ┌──────────────────────────────────────────────────────────────┐  │
│  │ pools[0]: PoolAllocator<32>   ────► [32B blocks...]          │  │
│  │ pools[1]: PoolAllocator<64>   ────► [64B blocks...]          │  │
│  │ pools[2]: PoolAllocator<128>  ────► [128B blocks...]         │  │
│  │ pools[3]: PoolAllocator<256>  ────► [256B blocks...]         │  │
│  │ pools[4]: PoolAllocator<512>  ────► [512B blocks...]         │  │
│  │ pools[5]: PoolAllocator<1024> ────► [1024B blocks...]        │  │
│  └──────────────────────────────────────────────────────────────┘  │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
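
A C++20 sketch of the clz-based routing function from the diagram, using std::countl_zero as the portable count-leading-zeros. Callers are assumed to divert requests above 1024B to the fallback path first:

#include <bit>
#include <cstddef>
#include <cstdint>

// Maps a request size to the size-class index in the table above:
// 1-32 -> 0, 33-64 -> 1, ..., 513-1024 -> 5.
inline constexpr unsigned size_class(std::size_t size) {
    if (size <= 32) return 0;
    // 63 - countl_zero(size - 1) is floor(log2(size - 1)), so this
    // expression computes ceil(log2(size)) - 5 for any size > 32.
    return 59u - static_cast<unsigned>(
        std::countl_zero(static_cast<std::uint64_t>(size - 1)));
}

static_assert(size_class(1) == 0 && size_class(33) == 1 && size_class(1024) == 5);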

4.4 Algorithm Overview

Pool Allocator Algorithms

Initialization:

  1. Allocate contiguous memory for all blocks: memory = mmap(block_size * num_blocks)
  2. Initialize free list by linking all blocks: block[i].next = &block[i+1]
  3. Set free_list_head = &block[0]

Allocation (O(1)):

  1. Check free_list_head != nullptr (pool exhausted check)
  2. block = free_list_head
  3. free_list_head = block->next
  4. Return block

Deallocation (O(1)):

  1. block->next = free_list_head
  2. free_list_head = block

Arena Allocator Algorithms

Initialization:

  1. Allocate contiguous memory: memory = mmap(arena_size)
  2. Set current = memory, end = memory + arena_size

Allocation (O(1)):

  1. aligned_size = align_up(size, 8) // 8-byte alignment
  2. Check current + aligned_size <= end (capacity check)
  3. result = current
  4. current += aligned_size
  5. Return result

Reset (O(1)):

  1. current = base
  2. Note: All previous pointers are now invalid!

Thread-Local Pool Access

Get Thread-Local Pool:

  1. Access TLS variable: thread_local PoolAllocator* tls_pool
  2. If tls_pool == nullptr, initialize for this thread
  3. Return tls_pool

Key Insight: No locks needed because each thread has its own pool. Cross-thread free (freeing memory allocated by another thread) requires special handling - typically a lock-free return queue.
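
A sketch of the lazy TLS pattern, reusing the FixedPool sketch from section 2.1 (any equivalent pool type works). C++ constructs a thread_local object on each thread's first use, so there are no locks anywhere on the allocation path:

#include <cstddef>

// Each thread gets its own pool instance, constructed on first access.
template <std::size_t BlockSize, std::size_t NumBlocks>
class ThreadLocalPool {
public:
    static void* allocate() { return local_pool().allocate(); }

    // Must be called from the thread that allocated p (Option A for
    // cross-thread free; see Phase 4 in section 5.10).
    static void deallocate(void* p) { local_pool().deallocate(p); }

private:
    static FixedPool<BlockSize, NumBlocks>& local_pool() {
        thread_local FixedPool<BlockSize, NumBlocks> pool;  // lazy per-thread init
        return pool;
    }
};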


5. Implementation Guide

5.1 Development Environment Setup

Required Tools:

| Tool | Purpose | Installation |
| --- | --- | --- |
| GCC 12+ or Clang 15+ | C++20 support | apt install g++-12 or apt install clang-15 |
| CMake 3.20+ | Build system | apt install cmake |
| perf | Benchmarking | apt install linux-tools-generic |
| Valgrind | Memory debugging | apt install valgrind |
| Google Benchmark | Microbenchmarks | apt install libbenchmark-dev |

Recommended Compiler Flags:

set(CMAKE_CXX_STANDARD 20)
set(CMAKE_CXX_FLAGS "-O3 -march=native -fno-omit-frame-pointer")
set(CMAKE_CXX_FLAGS_DEBUG "-O0 -g -fsanitize=address,undefined")

5.2 Project Structure

hft-allocator/
├── CMakeLists.txt
├── include/
│   ├── hft/
│   │   ├── pool_allocator.hpp      # Fixed-size pool
│   │   ├── arena_allocator.hpp     # Bump allocator
│   │   ├── multi_pool.hpp          # Size-class routing
│   │   ├── thread_local_pool.hpp   # TLS wrapper
│   │   ├── stl_adapter.hpp         # std::allocator interface
│   │   └── allocator_stats.hpp     # Statistics tracking
├── src/
│   └── allocator_stats.cpp         # Statistics implementation
├── tests/
│   ├── pool_test.cpp               # Unit tests
│   ├── arena_test.cpp
│   ├── multi_pool_test.cpp
│   └── thread_local_test.cpp
├── benchmarks/
│   ├── alloc_benchmark.cpp         # Performance comparison
│   └── integration_benchmark.cpp   # With order book
└── examples/
    ├── basic_usage.cpp
    └── orderbook_integration.cpp

5.3 The Core Question You’re Answering

“How do you provide memory allocation with O(1) performance and zero fragmentation for a known set of object sizes?”

This question forces you to understand:

  • Why general-purpose allocators are slow (flexibility has cost)
  • How to trade flexibility for speed (fixed sizes, pre-allocation)
  • The difference between internal and external fragmentation
  • How to achieve deterministic performance (no system calls, no locks)

5.4 Concepts You Must Understand First

Before implementing, verify you understand these concepts:

| Concept | Self-Assessment Question | Resource if Stuck |
| --- | --- | --- |
| Pointers and Memory | What is the difference between &x and *p? What does p + 1 mean for a pointer to a 64-byte struct? | CS:APP Ch. 3 |
| Memory Layout | Where are stack, heap, and static variables stored? Which grows up vs down? | CS:APP Ch. 9 |
| Alignment | Why must a double be aligned to 8 bytes on x86-64? What happens if it’s not? | CS:APP Ch. 3.9 |
| Cache Lines | How big is a cache line? Why does accessing arr[0] also load arr[1-7]? | CS:APP Ch. 6 |
| System Calls | What happens when you call malloc? When does it call sbrk? | TLPI Ch. 7 |
| Thread-Local Storage | What is thread_local in C++? How is it different from global variables? | C++ reference |

5.5 Questions to Guide Your Design

Pool Allocator Design:

  1. How will you store the free list? Separate allocation or intrusive in the blocks?
  2. What happens when the pool is exhausted? Fail fast, grow, or fallback?
  3. How will you handle alignment requirements for different types?
  4. Should you zero memory on allocation? On deallocation?

Arena Allocator Design:

  1. What alignment will you use for allocations? Why 8 bytes?
  2. How will you detect arena overflow? Check on every allocation?
  3. When should an arena be reset? Who is responsible?
  4. Can you have multiple arenas for different purposes?

Thread-Local Design:

  1. When will thread-local pools be initialized? Lazy or at thread start?
  2. What happens if Thread A allocates and Thread B tries to free?
  3. How will you handle thread termination? Leak the pool or reclaim?

Integration Design:

  1. How will your allocator satisfy the C++ Allocator concept requirements?
  2. How will you enable/disable statistics tracking? Compile-time or runtime?
  3. What debug checks should be enabled in development builds?

5.6 Thinking Exercise

Before writing code, trace through this scenario by hand:

Scenario: Pool with 4 blocks of 32 bytes each.

Initial State:
  memory: [Block0][Block1][Block2][Block3]
  free_list: Block0 -> Block1 -> Block2 -> Block3 -> null

Operations:
1. p1 = allocate()   -- Draw the state after this
2. p2 = allocate()   -- Draw the state after this
3. deallocate(p1)    -- Draw the state after this
4. p3 = allocate()   -- Where does p3 point? Why?
5. p4 = allocate()   -- Draw the state after this
6. p5 = allocate()   -- What happens? Why?

Work through this on paper before implementing. Understanding the free list mechanics is essential.

5.7 Hints in Layers

Hint 1: Starting Point (Conceptual Direction) Start with the simplest possible pool allocator - a fixed block size known at compile time, single-threaded, no statistics. Get this working and benchmarked before adding complexity.

Hint 2: Next Level (More Specific Guidance) Use an intrusive free list - store the next pointer inside the free block itself. This eliminates separate allocation for list nodes. The key insight: when a block is free, you can use its memory for bookkeeping.

Hint 3: Technical Details (Approach/Pseudocode)

Pool allocator structure:
- void* memory_base     // mmap'd or new char[]
- void* free_list_head  // points to first free block
- size_t block_size     // constant
- size_t num_blocks     // constant

allocate():
  if (free_list_head == nullptr) return nullptr;
  void* result = free_list_head;
  free_list_head = *((void**)free_list_head);  // Follow the intrusive pointer
  return result;

deallocate(void* ptr):
  *((void**)ptr) = free_list_head;  // Store old head in freed block
  free_list_head = ptr;             // New head is the freed block

Hint 4: Tools and Debugging (Verification Methods)

  • Use AddressSanitizer (-fsanitize=address) to catch use-after-free and buffer overflows
  • Use perf stat to verify zero system calls after init: perf stat -e syscalls:sys_enter_* ./benchmark
  • Add guard bytes (0xDEADBEEF) before/after blocks in debug mode to detect overflow (see the sketch after this list)
  • Print allocation traces with __builtin_return_address(0) to debug double-frees
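
A sketch of that guard-byte idea, assuming debug builds reserve 8 extra bytes on each side of the payload (the [guard][payload][guard] layout and the function names are illustrative):

#include <cstddef>
#include <cstdint>
#include <cstring>

constexpr std::uint64_t GUARD = 0xDEADBEEFDEADBEEFull;

void write_guards(void* block, std::size_t payload_size) {
    char* p = static_cast<char*>(block);
    std::memcpy(p, &GUARD, sizeof(GUARD));                                 // front guard
    std::memcpy(p + sizeof(GUARD) + payload_size, &GUARD, sizeof(GUARD));  // back guard
}

bool guards_intact(const void* block, std::size_t payload_size) {
    std::uint64_t front, back;
    const char* p = static_cast<const char*>(block);
    std::memcpy(&front, p, sizeof(front));
    std::memcpy(&back, p + sizeof(GUARD) + payload_size, sizeof(back));
    return front == GUARD && back == GUARD;  // a mismatch means user code overflowed its block
}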

5.8 The Interview Questions They’ll Ask

After completing this project, be prepared to answer:

  1. “Why is malloc slow for low-latency applications?”
    • Expected: Discuss system calls, fragmentation, locking, non-determinism
    • Bonus: Mention specific latency numbers (250ns typical, 50us worst case)
  2. “How does a pool allocator achieve O(1) allocation?”
    • Expected: Explain free list, intrusive pointers, pre-allocation
    • Bonus: Discuss cache implications of contiguous blocks
  3. “What is the trade-off between pool and arena allocators?”
    • Expected: Pool allows individual free, arena requires bulk free
    • Bonus: Discuss use cases for each in a trading system
  4. “How do you avoid lock contention in a multi-threaded allocator?”
    • Expected: Thread-local storage, per-thread pools
    • Bonus: Discuss cross-thread free problem and solutions
  5. “What is false sharing and how do you prevent it?”
    • Expected: Cache line sharing, padding to 64 bytes
    • Bonus: Explain alignas(64) and __attribute__((aligned(64)))
  6. “How would you integrate a custom allocator with STL containers?”
    • Expected: Implement Allocator concept (allocate, deallocate, rebind)
    • Bonus: Discuss stateful vs stateless allocators
  7. “What debugging features would you add to a custom allocator?”
    • Expected: Guard bytes, double-free detection, allocation tracking
    • Bonus: Discuss compile-time vs runtime debug mode toggle

5.9 Books That Will Help

| Topic | Book | Specific Chapters |
| --- | --- | --- |
| Memory allocation internals | “Computer Systems: A Programmer’s Perspective” (CS:APP) | Ch. 9 (Virtual Memory) |
| Cache hierarchy | CS:APP | Ch. 6 (Memory Hierarchy) |
| Data structure alignment | CS:APP | Ch. 3.9 (Heterogeneous Data Structures) |
| System calls (mmap) | “The Linux Programming Interface” (TLPI) | Ch. 49 (Memory Mappings) |
| Pool allocator pattern | “Fluent C” by Preschern | Ch. 6 (Memory Pool) |
| C++ allocator interface | “Effective STL” by Meyers | Items 10-11 (allocator conventions, legitimate uses of custom allocators) |
| Lock-free programming | “C++ Concurrency in Action” by Williams | Ch. 7 (Lock-free data structures) |
| Thread-local storage | “Rust Atomics and Locks” by Bos | Ch. 8 (Operating System Primitives) |

5.10 Implementation Phases

Phase 1: Basic Fixed-Size Pool (Days 1-2)

Milestone: Single-threaded pool allocator that passes basic tests.

Tasks:

  1. Define PoolAllocator class with template parameter for block size
  2. Implement constructor that allocates backing memory with mmap or new char[]
  3. Initialize intrusive free list linking all blocks
  4. Implement allocate() - pop from free list
  5. Implement deallocate() - push to free list
  6. Write unit tests:
    • Allocate all blocks, verify all unique pointers
    • Free all blocks, allocate again, verify reuse
    • Allocate one more than capacity, verify nullptr return

Verification:

# All tests pass
./pool_test

# Check no memory leaks
valgrind ./pool_test

Phase 2: Arena Allocator (Day 3)

Milestone: Bump-pointer allocator with reset capability.

Tasks:

  1. Define ArenaAllocator class with size parameter
  2. Implement constructor with backing memory allocation
  3. Implement allocate(size_t n) - bump pointer with alignment
  4. Implement reset() - restore pointer to beginning
  5. Add overflow checking
  6. Write unit tests:
    • Allocate various sizes, verify alignment
    • Fill arena, verify nullptr on overflow
    • Reset, verify can allocate again

Verification:

./arena_test

Phase 3: Multi-Pool Size Classes (Day 4)

Milestone: Allocator that routes to appropriate pool based on size.

Tasks:

  1. Define size classes (32, 64, 128, 256 bytes minimum)
  2. Create array of pools, one per size class
  3. Implement size-to-pool routing (bit manipulation or lookup table)
  4. Implement allocate(size_t n) with routing
  5. Implement deallocate(void* p, size_t n) - note: size must be provided!
  6. Handle oversized allocations (fallback to malloc or reject)
  7. Write unit tests for various sizes

Design Decision: How to know which pool a pointer came from when freeing?

  • Option A: Require size on deallocate (fastest, used by jemalloc)
  • Option B: Store header before each block (overhead per allocation)
  • Option C: Check pointer ranges (one comparison per pool; see the sketch below)
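
A sketch of Option C: because each pool's backing memory is one contiguous range, ownership is two comparisons (PoolRange and owns() are illustrative names):

#include <cstdint>

struct PoolRange {
    std::uintptr_t base;   // start of backing memory
    std::uintptr_t end;    // base + block_size * num_blocks

    bool owns(const void* p) const {
        auto addr = reinterpret_cast<std::uintptr_t>(p);
        return addr >= base && addr < end;
    }
};

// deallocate(p) then probes each pool in turn - with 6 size classes
// that is at most 6 comparisons, still constant time.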

Phase 4: Thread-Local Wrapper (Day 5)

Milestone: Per-thread pools without locks.

Tasks:

  1. Create ThreadLocalPool wrapper using thread_local
  2. Implement lazy initialization on first access
  3. Test with multiple threads allocating concurrently
  4. Verify zero lock contention with benchmarks

Complication - Cross-Thread Free: What if Thread A allocates and Thread B frees? Options:

  • Option A: Disallow (assert same thread)
  • Option B: Lock-free return queue per pool
  • Option C: Lock only for cross-thread free (rare path)

For HFT, Option A is usually correct - memory should be freed by the same thread that allocated it.

Phase 5: C++ Allocator Integration (Day 6)

Milestone: Use pool allocator with std::vector<T, PoolAllocator<T>>.

Tasks:

  1. Implement C++17 Allocator concept requirements:
    • value_type typedef
    • allocate(n) returning T*
    • deallocate(T* p, n)
  2. Implement rebind for container node types
  3. Handle containers that allocate different sizes (e.g., std::list nodes)
  4. Test with std::vector, std::list, std::map

Challenge: Containers may allocate auxiliary data (e.g., map nodes) of different sizes than T. Your allocator must handle this via rebinding or multi-pool.
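
A minimal C++17 Allocator sketch over a multi-pool backend; std::allocator_traits fills in everything else (rebind, construct, destroy). The MultiPool stub and its two method names are assumptions standing in for your multi_pool.hpp:

#include <cstddef>
#include <new>

// Stand-in for the multi-pool from multi_pool.hpp; only the two raw
// methods the adapter needs.
struct MultiPool {
    void* allocate(std::size_t bytes);
    void  deallocate(void* p, std::size_t bytes);
};

template <typename T>
struct PoolStlAllocator {
    using value_type = T;

    MultiPool* pool;   // non-owning; this is a stateful allocator

    explicit PoolStlAllocator(MultiPool* p) noexcept : pool(p) {}

    // Rebind constructor: lets std::list<T> build node allocators that
    // draw from the same backing pools.
    template <typename U>
    PoolStlAllocator(const PoolStlAllocator<U>& other) noexcept
        : pool(other.pool) {}

    T* allocate(std::size_t n) {
        if (void* p = pool->allocate(n * sizeof(T)))
            return static_cast<T*>(p);
        throw std::bad_alloc{};
    }
    void deallocate(T* p, std::size_t n) noexcept {
        pool->deallocate(p, n * sizeof(T));
    }
};

// Equal iff they share a backing pool (required by the Allocator concept).
template <typename T, typename U>
bool operator==(const PoolStlAllocator<T>& a, const PoolStlAllocator<U>& b) {
    return a.pool == b.pool;
}
template <typename T, typename U>
bool operator!=(const PoolStlAllocator<T>& a, const PoolStlAllocator<U>& b) {
    return !(a == b);
}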

Phase 6: Benchmarking and Statistics (Day 7)

Milestone: Demonstrate 10-100x improvement over malloc.

Tasks:

  1. Add statistics tracking: allocs, frees, current usage, high watermark
  2. Create benchmark comparing:
    • Pool allocator vs malloc
    • Arena allocator vs malloc
    • Thread-local pool vs shared pool
  3. Measure p50, p99, p999 latencies
  4. Integrate with order book from Project 1
  5. Document results

Benchmark Framework:

// Use Google Benchmark for consistent measurements
#include <benchmark/benchmark.h>
#include <cstdlib>

static void BM_MallocAlloc(benchmark::State& state) {
    for (auto _ : state) {
        void* p = malloc(64);
        benchmark::DoNotOptimize(p);
        free(p);
    }
}
BENCHMARK(BM_MallocAlloc);

static void BM_PoolAlloc(benchmark::State& state) {
    PoolAllocator<64, 10000> pool;  // constructed once, outside the timed loop
    for (auto _ : state) {
        void* p = pool.allocate();
        benchmark::DoNotOptimize(p);
        pool.deallocate(p);
    }
}
BENCHMARK(BM_PoolAlloc);
BENCHMARK_MAIN();

5.11 Key Implementation Decisions

| Decision | Recommended Choice | Rationale |
| --- | --- | --- |
| Backing memory | mmap with MAP_PRIVATE \| MAP_ANONYMOUS | Avoid heap fragmentation, easy to return to OS (sketch below) |
| Block size | Template parameter | Compile-time optimization, no runtime overhead |
| Alignment | 8 bytes minimum, configurable | Match x86-64 requirements, allow AVX alignment |
| Overflow behavior | Return nullptr | Fail fast, let caller decide |
| Debug checks | Compile-time flag | Zero overhead in release builds |
| Statistics | Optional, runtime flag | Allow measurement without recompile |
| Cross-thread free | Assert same thread | Simplest correct behavior for HFT |
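
A sketch of the recommended backing-memory strategy: one anonymous private mapping acquired at startup (Linux-specific; error handling reduced to a null return; the function name is illustrative):

#include <sys/mman.h>
#include <cstddef>

void* acquire_backing_memory(std::size_t bytes) {
    void* mem = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return mem == MAP_FAILED ? nullptr : mem;
}
// Release with munmap(mem, bytes) at shutdown - never during trading hours.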

6. Testing Strategy

6.1 Unit Tests

Pool Allocator Tests:

#include <gtest/gtest.h>
#include <cstdint>
#include <set>

TEST(PoolAllocator, AllocatesUniquePointers) {
    PoolAllocator<64, 100> pool;
    std::set<void*> ptrs;
    for (int i = 0; i < 100; ++i) {
        void* p = pool.allocate();
        ASSERT_NE(p, nullptr);
        ASSERT_EQ(ptrs.count(p), 0);  // Must be unique
        ptrs.insert(p);
    }
}

TEST(PoolAllocator, ReturnsNullWhenExhausted) {
    PoolAllocator<64, 10> pool;
    for (int i = 0; i < 10; ++i) pool.allocate();
    ASSERT_EQ(pool.allocate(), nullptr);
}

TEST(PoolAllocator, ReusesFreedBlocks) {
    PoolAllocator<64, 10> pool;
    void* p1 = pool.allocate();
    pool.deallocate(p1);
    void* p2 = pool.allocate();
    ASSERT_EQ(p1, p2);  // LIFO reuse
}

TEST(PoolAllocator, AlignmentCorrect) {
    PoolAllocator<64, 100> pool;
    for (int i = 0; i < 100; ++i) {
        void* p = pool.allocate();
        ASSERT_EQ(reinterpret_cast<uintptr_t>(p) % 8, 0);
    }
}

Arena Allocator Tests:

TEST(ArenaAllocator, BumpPointerAdvances) {
    ArenaAllocator<1024> arena;
    void* p1 = arena.allocate(32);
    void* p2 = arena.allocate(32);
    ASSERT_EQ(static_cast<char*>(p2) - static_cast<char*>(p1), 32);
}

TEST(ArenaAllocator, ResetRestoresCapacity) {
    ArenaAllocator<64> arena;
    for (int i = 0; i < 8; ++i) arena.allocate(8);
    ASSERT_EQ(arena.allocate(8), nullptr);  // Full
    arena.reset();
    ASSERT_NE(arena.allocate(8), nullptr);  // Available again
}

TEST(ArenaAllocator, AlignmentRoundsUp) {
    ArenaAllocator<1024> arena;
    void* p1 = arena.allocate(3);  // Rounds to 8
    void* p2 = arena.allocate(1);
    ASSERT_EQ(static_cast<char*>(p2) - static_cast<char*>(p1), 8);
}

6.2 Stress Tests

#include <gtest/gtest.h>
#include <atomic>
#include <cstdlib>
#include <thread>
#include <vector>

TEST(PoolAllocator, StressTestMillionCycles) {
    PoolAllocator<64, 1000> pool;
    std::vector<void*> ptrs;

    for (int cycle = 0; cycle < 1000000; ++cycle) {
        // Randomly allocate or free
        if (ptrs.empty() || (rand() % 2 == 0 && ptrs.size() < 1000)) {
            void* p = pool.allocate();
            if (p) ptrs.push_back(p);
        } else {
            size_t idx = rand() % ptrs.size();
            pool.deallocate(ptrs[idx]);
            ptrs.erase(ptrs.begin() + idx);
        }
    }

    // Free remaining
    for (void* p : ptrs) pool.deallocate(p);
}

TEST(ThreadLocalPool, ConcurrentAccess) {
    ThreadLocalPool<64, 1000> pool;
    std::vector<std::thread> threads;
    std::atomic<int> total_allocs{0};

    for (int t = 0; t < 8; ++t) {
        threads.emplace_back([&]() {
            for (int i = 0; i < 100000; ++i) {
                void* p = pool.allocate();
                ASSERT_NE(p, nullptr);
                pool.deallocate(p);
                total_allocs++;
            }
        });
    }

    for (auto& t : threads) t.join();
    ASSERT_EQ(total_allocs.load(), 800000);
}

6.3 Property-Based Tests

Test invariants that should always hold:

// After any sequence of alloc/free, free_count + allocated_count == total_blocks
// All returned pointers are within the pool's memory range
// All returned pointers are aligned correctly
// Double-free should be detected (in debug mode)
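
One way to make these invariants executable - a sketch assuming the pool exposes the hypothetical introspection hooks used below (free_count(), allocated_count(), capacity(), free_list_head(), owns()):

#include <cassert>
#include <cstddef>
#include <cstdint>

template <typename Pool>
void check_invariants(const Pool& pool) {
    // Conservation: every block is either on the free list or handed out.
    assert(pool.free_count() + pool.allocated_count() == pool.capacity());

    // Every free-list node lies inside the pool and is properly aligned.
    std::size_t walked = 0;
    for (void* p = pool.free_list_head(); p != nullptr;
         p = *static_cast<void* const*>(p)) {
        assert(pool.owns(p));
        assert(reinterpret_cast<std::uintptr_t>(p) % alignof(std::max_align_t) == 0);
        ++walked;
    }
    assert(walked == pool.free_count());   // list length matches the counter
}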

6.4 Integration Tests

TEST(Integration, VectorWithPoolAllocator) {
    using OrderVector = std::vector<Order, PoolAllocator<Order, 1000>>;

    OrderVector orders;
    for (int i = 0; i < 100; ++i) {
        orders.push_back(Order{.id = i, .price = 100.0 + i});
    }

    ASSERT_EQ(orders.size(), 100);
    ASSERT_EQ(orders[50].id, 50);
}

7. Common Pitfalls & Debugging

7.1 Pitfall: Incorrect Free List Initialization

Symptom: Crash on first allocation, or same pointer returned multiple times.

Cause: Free list not properly initialized; blocks not linked correctly.

Fix: After allocating backing memory, iterate through all blocks and set each block’s next pointer:

for (size_t i = 0; i < num_blocks - 1; ++i) {
    *reinterpret_cast<void**>(block_at(i)) = block_at(i + 1);
}
*reinterpret_cast<void**>(block_at(num_blocks - 1)) = nullptr;
free_list_head = block_at(0);

Verification: Print free list after initialization, verify correct linking.

7.2 Pitfall: Alignment Violations

Symptom: SIGBUS on ARM, performance degradation on x86, incorrect data.

Cause: Blocks not aligned to required boundary (8 bytes for most types, 16/32 for SIMD).

Fix: Ensure block_size is a multiple of required alignment. Use alignas or align backing memory:

void* memory = aligned_alloc(64, total_size);  // 64-byte alignment for cache

Verification: Assert (uintptr_t)ptr % alignment == 0 on every allocation.

7.3 Pitfall: Use-After-Free Not Detected

Symptom: Mysterious data corruption, non-deterministic crashes.

Cause: Freed block reused while old pointer still in use.

Fix (debug mode): Fill freed blocks with poison pattern:

void deallocate(void* ptr) {
    #ifdef DEBUG
    memset(ptr, 0xDE, block_size);  // Poison pattern
    #endif
    // ... normal free logic
}

Verification: Enable AddressSanitizer during development.

7.4 Pitfall: Double Free Corruption

Symptom: Free list corruption, infinite loops, crashes.

Cause: Same pointer passed to deallocate twice.

Fix (debug mode): Maintain allocated set or use sentinel values:

void deallocate(void* ptr) {
    #ifdef DEBUG
    // Check first 8 bytes aren't already our free-list marker
    if (*reinterpret_cast<uint64_t*>(ptr) == FREE_MARKER) {
        abort();  // Double free!
    }
    *reinterpret_cast<uint64_t*>(ptr) = FREE_MARKER;
    #endif
    // ... normal free logic
}

7.5 Pitfall: False Sharing in Thread-Local Pools

Symptom: Thread-local pools slower than expected, high cache miss rates.

Cause: Pool metadata from different threads on same cache line.

Fix: Pad thread-local pool structures to cache line boundaries:

struct alignas(64) ThreadLocalPool {
    PoolAllocator<64, 1000> pool;
    // alignas(64) rounds sizeof up to a multiple of 64, so the compiler
    // inserts any padding itself. (A manual array sized
    // 64 - sizeof(...) % 64 is redundant, and wrong when the size is
    // already a multiple of 64: it would add a full extra 64 bytes.)
};

Verification: Use perf stat -e cache-misses to compare before/after.

7.6 Pitfall: Returning Dangling Pointer After Reset

Symptom: Arena reset invalidates pointers, but caller still uses them.

Cause: Arena allocator semantics not understood by caller.

Fix: Document clearly that reset() invalidates ALL previous allocations. Consider:

// Return a scope guard that resets on destruction
auto scope = arena.scope_guard();  // Resets arena when scope ends
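
A minimal RAII sketch of that scope guard (ArenaScope is an illustrative name; reset() is the arena's own bulk free):

class ArenaScope {
    Arena& arena_;
public:
    explicit ArenaScope(Arena& a) : arena_(a) {}
    ~ArenaScope() { arena_.reset(); }      // every arena pointer dies with the scope
    ArenaScope(const ArenaScope&) = delete;
    ArenaScope& operator=(const ArenaScope&) = delete;
};

Anything allocated while the guard is alive is bulk-freed at the closing brace, which makes the "reset invalidates everything" contract visible in the code.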

7.7 Debug Checklist

When something goes wrong:

  1. Enable AddressSanitizer: -fsanitize=address
  2. Add debug logging to allocate/deallocate
  3. Verify free list integrity after each operation
  4. Check pointer is within pool range before free
  5. Use Valgrind for detailed memory analysis
  6. Add guard bytes before/after blocks
  7. Print allocation/free traces with stack traces

8. Extensions & Challenges

8.1 Extension: Lock-Free Multi-Producer Pool

Implement a pool that allows multiple threads to allocate concurrently using lock-free techniques:

  • Use atomic compare-and-swap for free list head
  • Handle ABA problem with tagged pointers or hazard pointers
  • Benchmark against mutex-based version

Challenge: The ABA problem occurs when Thread A sees value A, Thread B changes A to B and back to A, then Thread A’s CAS succeeds incorrectly.
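
A naive lock-free pop as a starting point - exactly as written it is vulnerable to the ABA scenario above; production versions add a generation counter packed next to the pointer, or hazard pointers:

#include <atomic>

struct FreeNode { FreeNode* next; };

std::atomic<FreeNode*> free_head{nullptr};

void* lock_free_allocate() {
    FreeNode* head = free_head.load(std::memory_order_acquire);
    while (head != nullptr) {
        FreeNode* next = head->next;  // may read from a recycled node (the ABA hazard)
        if (free_head.compare_exchange_weak(head, next,
                                            std::memory_order_acq_rel,
                                            std::memory_order_acquire)) {
            return head;              // successfully popped
        }
        // compare_exchange_weak reloaded head on failure; just retry
    }
    return nullptr;                   // pool exhausted
}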

8.2 Extension: Growing Pool

Implement a pool that can grow when exhausted:

  • Allocate additional chunks on demand
  • Maintain linked list of chunks
  • Handle deallocation across chunks
  • Consider fragmentation implications

8.3 Extension: Memory-Mapped File Pool

Implement a pool backed by a memory-mapped file:

  • Persist allocations across process restarts
  • Handle crash recovery
  • Useful for maintaining order book state

8.4 Extension: NUMA-Aware Allocation

For multi-socket servers:

  • Detect NUMA topology
  • Allocate pools on local NUMA node
  • Bind threads to cores near their memory

8.5 Extension: Defragmentation

For long-running systems:

  • Track allocation patterns
  • Periodically compact pools
  • Handle live object relocation (requires cooperation from caller)

8.6 Challenge: Rust GlobalAlloc Implementation

Implement your allocator as a Rust GlobalAlloc:

#[global_allocator]
static ALLOC: PoolGlobalAlloc = PoolGlobalAlloc::new();

unsafe impl GlobalAlloc for PoolGlobalAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 { ... }
    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) { ... }
}

Challenge: GlobalAlloc must be thread-safe and handle any size/alignment.


9. Real-World Connections

9.1 Industry Usage

| Company/Project | Allocator Approach |
| --- | --- |
| Jane Street | Custom OCaml allocators, arena patterns |
| HRT | Custom C++ allocators, huge pages |
| Tower Research | Per-thread pools, zero-copy parsing |
| Game Engines | Frame allocators (arena per frame) |
| Databases | Buffer pool managers, slab allocators |

9.2 Production Allocators to Study

| Allocator | Key Innovation |
| --- | --- |
| jemalloc | Thread caching, size classes, arenas |
| tcmalloc | Per-thread caches, central free list |
| mimalloc | Free list sharding, segment-based |
| Hoard | Superblock-based, scalable |

9.3 Your Allocator vs Production

Your allocator is simpler but faster for your use case:

  • Production allocators handle ANY size; yours handles fixed sizes
  • Production allocators are general-purpose; yours is application-specific
  • Production allocators trade some speed for flexibility

This is the HFT philosophy: specialize for speed.

9.4 Integration with Your Other Projects

| Project | How Allocator Helps |
| --- | --- |
| P01: Order Book | Pool allocator for Order and PriceLevel objects |
| P02: Lock-Free Queue | Arena for parsing buffers, pool for queue nodes |
| P03: Matching Engine | Thread-local pools for per-connection state |
| P06: Full System | All of the above, plus statistics for monitoring |

10. Resources

10.1 Primary References

| Resource | Use For | Specific Sections |
| --- | --- | --- |
| “Computer Systems: A Programmer’s Perspective” | Memory fundamentals | Ch. 9 (Virtual Memory), Ch. 6 (Cache) |
| “The Linux Programming Interface” | mmap, system calls | Ch. 7, Ch. 49 |
| “Fluent C” by Preschern | Pool allocator pattern | Ch. 6 (Memory Pool) |
| “Building Low Latency Applications with C++” | HFT context | Ch. on memory management |

10.2 Articles and Talks

| Resource | Description |
| --- | --- |
| Andrei Alexandrescu: “std::allocator Is to Allocation What std::vector Is to Vexation” | Why std::allocator is limited |
| CppCon: Memory Allocation Strategies | Various CppCon talks |
| “What Every Programmer Should Know About Memory” | Ulrich Drepper’s classic paper |

10.3 Code References

| Repository | What to Learn |
| --- | --- |
| jemalloc source | Size classes, arenas, thread caching |
| tcmalloc source | Per-thread caches, central free list |
| mimalloc source | Segment-based, free list sharding |

10.4 Tools

| Tool | Purpose |
| --- | --- |
| perf | Performance profiling, cache analysis |
| valgrind | Memory error detection |
| heaptrack | Heap profiling |
| massif | Heap visualization |
| AddressSanitizer | Runtime memory error detection |

11. Self-Assessment Checklist

You have mastered this project when you can:

Conceptual Understanding:

  • Explain why malloc is unsuitable for HFT hot paths (latency, determinism)
  • Describe the trade-offs between pool, arena, and multi-pool allocators
  • Explain false sharing and how to prevent it
  • Describe the ABA problem in lock-free free lists

Implementation Skills:

  • Implement a pool allocator from scratch without reference
  • Implement an arena allocator with proper alignment
  • Create thread-local pools without lock contention
  • Integrate your allocator with std::vector using the Allocator concept

Performance Verification:

  • Benchmark your allocator vs malloc, demonstrating 10x+ improvement
  • Show deterministic p99 latency (< 2x p50)
  • Verify zero system calls after initialization using perf

Debugging Ability:

  • Add debug checks that detect double-free and use-after-free
  • Use AddressSanitizer to catch memory errors
  • Diagnose false sharing using cache miss counters

Real-World Application:

  • Integrate your allocator into the order book from Project 1
  • Measure end-to-end latency improvement in a realistic scenario

12. Submission / Completion Criteria

Your project is complete when you have:

12.1 Code Deliverables

  • pool_allocator.hpp: Fixed-size pool with O(1) alloc/free
  • arena_allocator.hpp: Bump allocator with reset
  • multi_pool.hpp: Size-class routing to pools
  • thread_local_pool.hpp: Per-thread pool wrapper
  • stl_adapter.hpp: C++ Allocator concept implementation
  • Comprehensive unit tests (>90% coverage)
  • Benchmark suite comparing to malloc

12.2 Documentation

  • README with build instructions
  • API documentation for each allocator type
  • Benchmark results with analysis

12.3 Performance Criteria

  • Pool allocator: <15ns average allocation
  • Pool allocator: p99 < 2x p50 (deterministic)
  • Zero system calls after initialization (verified with perf)
  • Multi-threaded benchmark shows zero lock contention

12.4 Integration

  • Demonstrated integration with order book or matching engine
  • Measured latency improvement in realistic scenario

Summary

This project teaches you one of the most important skills in HFT development: eliminating malloc from your hot path. The difference between a 250ns malloc and an 8ns pool allocation may seem small, but multiplied across a million messages per second, it becomes the difference between profit and loss.

Key takeaways:

  1. General-purpose allocators sacrifice speed for flexibility - you can do the opposite
  2. Pool allocators achieve O(1) through pre-allocation and free lists - no searching, no fragmentation
  3. Arena allocators are even faster - just bump a pointer, reset all at once
  4. Thread-local pools eliminate contention - essential for multi-threaded HFT
  5. Determinism matters as much as speed - p99 latency kills trading strategies

When you complete this project, you will have built infrastructure that is foundational to every serious HFT system. More importantly, you will understand why these techniques work and when to apply them.


“In HFT, memory allocation is not a solved problem - it’s a problem you solve for your specific use case. The best allocator is the one that does exactly what you need and nothing more.”