Project 5: Custom Memory Allocator for HFT

Build a specialized memory allocator (arena/pool allocator) that eliminates malloc from your hot path - because malloc is a latency killer in HFT systems.

Quick Reference

| Attribute | Value |
| --- | --- |
| Difficulty | Intermediate |
| Time Estimate | 1 week |
| Language | C or C++ (Rust for GlobalAlloc) |
| Prerequisites | Understanding of pointers, basic memory concepts |
| Key Topics | Pool allocators, arena allocators, thread-local storage, fragmentation |

1. Learning Objectives

By completing this project, you will be able to:

  1. Explain why malloc is unacceptable for HFT hot paths and quantify the latency difference between general-purpose and specialized allocators (250ns vs 8ns).

  2. Implement a fixed-size pool allocator with O(1) allocation and deallocation using a free list data structure.

  3. Design an arena allocator (bump allocator) for ultra-fast allocation when bulk deallocation is acceptable.

  4. Create multiple size-class pools that can handle varying object sizes without external fragmentation.

  5. Implement thread-local pools to eliminate contention in multi-threaded HFT systems.

  6. Integrate your allocator with C++ std::allocator or Rust’s GlobalAlloc trait for seamless use with standard containers.

  7. Benchmark and verify determinism by measuring allocation latency distributions and comparing p50/p99/p999 with malloc.

  8. Identify and avoid common pitfalls including memory leaks, use-after-free, and false sharing in concurrent allocators.


2. Theoretical Foundation

2.1 Core Concepts

Why malloc is Slow for HFT

General-purpose allocators (glibc’s ptmalloc, jemalloc, tcmalloc) are optimized for a wide range of use cases, but this flexibility comes at a cost:

Why malloc Has Unpredictable Latency
┌────────────────────────────────────────────────────────────────────┐
│                                                                    │
│  1. SYSTEM CALLS (worst case)                                     │
│     ┌──────────┐     ┌──────────┐     ┌──────────┐                │
│     │  malloc  │────►│  sbrk/   │────►│  Kernel  │                │
│     │          │     │  mmap    │     │  Context │                │
│     └──────────┘     └──────────┘     │  Switch  │                │
│           │                            └──────────┘                │
│           │              +10-50μs (microseconds!)                  │
│           │                                                        │
│  2. FRAGMENTATION OVERHEAD                                        │
│     ┌──────────────────────────────────────────┐                  │
│     │ Free: [16B] [32B] [16B] [64B] [8B] ...   │                  │
│     │ Request 24B → Search free list → Split   │                  │
│     │ Coalescing on free → More overhead       │                  │
│     └──────────────────────────────────────────┘                  │
│                                                                    │
│  3. LOCKING (multi-threaded)                                      │
│     ┌─────────┐  Lock contention  ┌─────────┐                     │
│     │ Thread1 │◄─────────────────►│ Thread2 │                     │
│     └─────────┘                   └─────────┘                     │
│           │                             │                          │
│           └──────── Mutex wait ─────────┘                          │
│                    +1-10μs per contention                          │
│                                                                    │
│  4. NON-DETERMINISTIC LATENCY                                     │
│     Best case:  ~50ns  (fast path, no contention)                 │
│     Typical:    ~250ns (normal workload)                          │
│     Worst case: ~50μs  (mmap, high contention)                    │
│                                                                    │
│     HFT cannot tolerate this variance!                            │
│                                                                    │
└────────────────────────────────────────────────────────────────────┘

The Four Killers of malloc Performance:

  1. System Calls: When malloc’s internal pools are exhausted, it must call sbrk() or mmap() to get more memory from the OS. A single system call can take 10-50 microseconds - an eternity in HFT.

  2. Fragmentation Management: General-purpose allocators must handle any size request. This requires maintaining complex data structures (free lists, size classes, bins) and performing operations like coalescing adjacent free blocks.

  3. Thread Synchronization: Multi-threaded programs require locks on shared allocator state. Even with optimizations like per-thread caches (tcmalloc), contention still occurs during cache replenishment.

  4. Non-Determinism: The combination above means malloc has highly variable latency. HFT systems need predictable, worst-case latency guarantees.

Pool Allocators: Fixed-Size Block Freedom

A pool allocator pre-allocates a large chunk of memory and divides it into fixed-size blocks. The key insight: if all blocks are the same size, no external fragmentation can occur.

Pool Allocator Structure
┌─────────────────────────────────────────────────────────────────────┐
│                                                                     │
│  MEMORY POOL (single contiguous allocation)                        │
│  ┌─────────────────────────────────────────────────────────────┐   │
│  │                                                             │   │
│  │  Block 0    Block 1    Block 2    Block 3    Block 4  ...   │   │
│  │  ┌──────┐   ┌──────┐   ┌──────┐   ┌──────┐   ┌──────┐       │   │
│  │  │ next │   │ next │   │ next │   │ next │   │ next │       │   │
│  │  │  ──────► │  ──────► │  ──────► │  ──────► │ NULL │       │   │
│  │  │      │   │      │   │      │   │      │   │      │       │   │
│  │  └──────┘   └──────┘   └──────┘   └──────┘   └──────┘       │   │
│  │                                                             │   │
│  └─────────────────────────────────────────────────────────────┘   │
│        ▲                                                           │
│        │                                                           │
│  FREE LIST HEAD                                                    │
│  (points to first free block)                                      │
│                                                                     │
│  ═══════════════════════════════════════════════════════════════   │
│                                                                     │
│  ALLOCATION: O(1)                                                  │
│  ┌─────────────────────────────────────────────────────────────┐   │
│  │  1. block = free_list_head                                  │   │
│  │  2. free_list_head = block->next                            │   │
│  │  3. return block                                            │   │
│  └─────────────────────────────────────────────────────────────┘   │
│                                                                     │
│  DEALLOCATION: O(1)                                                │
│  ┌─────────────────────────────────────────────────────────────┐   │
│  │  1. block->next = free_list_head                            │   │
│  │  2. free_list_head = block                                  │   │
│  └─────────────────────────────────────────────────────────────┘   │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

Key Properties of Pool Allocators:

  • O(1) Allocation: Just pop from the free list head
  • O(1) Deallocation: Just push to the free list head
  • No External Fragmentation: All blocks are identical size
  • No System Calls: Memory is pre-allocated at startup
  • Cache Friendly: Blocks are contiguous in memory

Trade-off: You must know the block size at compile time or pool creation time.
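
The sketch below shows the smallest version of this structure - compile-time block size and count, single-threaded, no statistics. The FixedPool name and interface are illustrative, not a prescribed API:

#include <cstddef>

// Minimal fixed-size pool: an intrusive free list threaded through the
// blocks themselves. Single-threaded, no debug checks.
template <std::size_t BlockSize, std::size_t NumBlocks>
class FixedPool {
    static_assert(BlockSize >= sizeof(void*), "block must hold a next pointer");

    alignas(std::max_align_t) char memory_[BlockSize * NumBlocks];
    void* free_head_;

public:
    FixedPool() : free_head_(memory_) {
        // Link block i to block i+1; the last block terminates the list.
        for (std::size_t i = 0; i + 1 < NumBlocks; ++i)
            *reinterpret_cast<void**>(memory_ + i * BlockSize) =
                memory_ + (i + 1) * BlockSize;
        *reinterpret_cast<void**>(memory_ + (NumBlocks - 1) * BlockSize) = nullptr;
    }

    void* allocate() {                      // O(1): pop the list head
        if (free_head_ == nullptr) return nullptr;  // pool exhausted
        void* block = free_head_;
        free_head_ = *static_cast<void**>(block);
        return block;
    }

    void deallocate(void* block) {          // O(1): push onto the list head
        *static_cast<void**>(block) = free_head_;
        free_head_ = block;
    }
};

Note that BlockSize should be a multiple of the strictest alignment the stored type needs; see the alignment pitfall in section 7.2.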

Arena Allocators: The Simplest Fast Allocator

An arena allocator (also called a bump allocator or linear allocator) is even simpler than a pool allocator:

Arena Allocator (Bump Pointer)
┌─────────────────────────────────────────────────────────────────────┐
│                                                                     │
│  ARENA MEMORY                                                       │
│  ┌─────────────────────────────────────────────────────────────┐   │
│  │ ██████████████████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ │   │
│  │ ◄── allocated ──────►◄──────────── free ─────────────────► │   │
│  └─────────────────────────────────────────────────────────────┘   │
│                          ▲                                    ▲     │
│                          │                                    │     │
│                     bump_ptr                               end      │
│                                                                     │
│  ═══════════════════════════════════════════════════════════════   │
│                                                                     │
│  ALLOCATION: O(1) - just bump the pointer!                         │
│  ┌─────────────────────────────────────────────────────────────┐   │
│  │  1. result = bump_ptr                                       │   │
│  │  2. bump_ptr += size                                        │   │
│  │  3. return result                                           │   │
│  └─────────────────────────────────────────────────────────────┘   │
│                                                                     │
│  DEALLOCATION: Not supported individually!                         │
│  ┌─────────────────────────────────────────────────────────────┐   │
│  │  reset(): bump_ptr = start  // Free everything at once     │   │
│  └─────────────────────────────────────────────────────────────┘   │
│                                                                     │
│  USE CASES:                                                        │
│  • Per-request allocations that all get freed together             │
│  • Game frame allocations (allocate during frame, reset at end)    │
│  • Order processing (allocate for order, reset after complete)     │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

Arena Allocator Trade-offs:

| Advantage | Disadvantage |
| --- | --- |
| Fastest possible allocation (single pointer bump) | Cannot free individual objects |
| Zero overhead per allocation | Must reset entire arena at once |
| Perfect cache locality | May waste memory if objects vary in size |
| No fragmentation | Fixed maximum capacity |

HFT Use Case: Processing a batch of market data updates. Allocate temporary structures for each message, then reset the arena when the batch is complete.
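
A minimal bump-allocator sketch over a caller-provided buffer; the 8-byte alignment choice and the names are illustrative assumptions:

#include <cstddef>

// Minimal arena: allocation is an aligned pointer bump; reset() frees
// everything at once by rewinding to the start of the buffer.
class Arena {
    char* base_;
    char* current_;
    char* end_;

    static std::size_t align_up(std::size_t n, std::size_t a) {
        return (n + a - 1) & ~(a - 1);      // a must be a power of two
    }

public:
    Arena(void* buffer, std::size_t size)
        : base_(static_cast<char*>(buffer)), current_(base_), end_(base_ + size) {}

    void* allocate(std::size_t size) {
        std::size_t n = align_up(size, 8);  // keep every result 8-byte aligned
        if (n > static_cast<std::size_t>(end_ - current_))
            return nullptr;                 // arena exhausted
        void* result = current_;
        current_ += n;                      // the entire hot path: one add
        return result;
    }

    void reset() { current_ = base_; }      // invalidates ALL prior pointers
};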

Thread-Local Pools: Eliminating Contention

In multi-threaded HFT systems, each thread typically handles different tasks:

  • Thread 1: Market data processing
  • Thread 2: Strategy execution
  • Thread 3: Order management

Sharing a single allocator between threads creates contention. The solution: thread-local pools.

Thread-Local Storage (TLS) Pattern
┌─────────────────────────────────────────────────────────────────────┐
│                                                                     │
│  TRADITIONAL SHARED ALLOCATOR (BAD for HFT)                        │
│                                                                     │
│  ┌────────┐     ┌────────────┐     ┌────────┐                      │
│  │Thread 1│────►│   Shared   │◄────│Thread 2│                      │
│  └────────┘     │  Allocator │     └────────┘                      │
│       ▲         │   (LOCK!)  │          ▲                          │
│       │         └────────────┘          │                          │
│       └──────── CONTENTION ─────────────┘                          │
│                                                                     │
│  ═══════════════════════════════════════════════════════════════   │
│                                                                     │
│  THREAD-LOCAL POOLS (GOOD for HFT)                                 │
│                                                                     │
│  ┌────────────────────────────────────────────────────────────┐    │
│  │ Thread 1                                                   │    │
│  │ ┌─────────────────────────────────────────────────────┐   │    │
│  │ │ TLS Pool: [████░░░░░░░░░░░░░░░░░░░]                  │   │    │
│  │ └─────────────────────────────────────────────────────┘   │    │
│  └────────────────────────────────────────────────────────────┘    │
│                                                                     │
│  ┌────────────────────────────────────────────────────────────┐    │
│  │ Thread 2                                                   │    │
│  │ ┌─────────────────────────────────────────────────────┐   │    │
│  │ │ TLS Pool: [██████████░░░░░░░░░░░░░]                  │   │    │
│  │ └─────────────────────────────────────────────────────┘   │    │
│  └────────────────────────────────────────────────────────────┘    │
│                                                                     │
│  ┌────────────────────────────────────────────────────────────┐    │
│  │ Thread 3                                                   │    │
│  │ ┌─────────────────────────────────────────────────────┐   │    │
│  │ │ TLS Pool: [████████░░░░░░░░░░░░░░░]                  │   │    │
│  │ └─────────────────────────────────────────────────────┘   │    │
│  └────────────────────────────────────────────────────────────┘    │
│                                                                     │
│  NO CONTENTION! Each thread has its own pool.                      │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

Avoiding False Sharing:

Even with thread-local pools, you must avoid false sharing - when two threads modify data on the same cache line:

False Sharing Problem
┌─────────────────────────────────────────────────────────────────────┐
│                                                                     │
│  CACHE LINE (64 bytes)                                             │
│  ┌──────────────────────────────────────────────────────────────┐  │
│  │ Thread1_head │ Thread2_head │         unused                 │  │
│  │   8 bytes    │   8 bytes    │        48 bytes                │  │
│  └──────────────────────────────────────────────────────────────┘  │
│         ▲                ▲                                          │
│         │                │                                          │
│     Thread 1         Thread 2                                       │
│     writes           writes                                         │
│                                                                     │
│  PROBLEM: Both threads invalidate each other's cache line!         │
│                                                                     │
│  ═══════════════════════════════════════════════════════════════   │
│                                                                     │
│  SOLUTION: Pad to cache line boundary                              │
│                                                                     │
│  CACHE LINE 1                        CACHE LINE 2                  │
│  ┌────────────────────────────┐     ┌────────────────────────────┐ │
│  │ Thread1_head │   padding   │     │ Thread2_head │   padding   │ │
│  │   8 bytes    │  56 bytes   │     │   8 bytes    │  56 bytes   │ │
│  └────────────────────────────┘     └────────────────────────────┘ │
│                                                                     │
│  NO FALSE SHARING! Each thread owns its cache line.                │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
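
The padding fix, as a short C++ sketch (assuming the 64-byte cache line shown above; PerThreadHead is an illustrative name):

// alignas(64) puts each head on its own cache line; because sizeof is
// always a multiple of alignof, the compiler inserts the 56 padding
// bytes automatically.
struct alignas(64) PerThreadHead {
    void* free_list_head;        // the only field each thread writes
};

PerThreadHead heads[2];          // heads[0] and heads[1] can never share a line
static_assert(sizeof(PerThreadHead) == 64, "one full cache line per thread");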

2.2 Why This Matters in HFT

HFT systems operate under extreme constraints:

| Constraint | Requirement | Why Allocators Matter |
| --- | --- | --- |
| Latency | <10μs tick-to-trade | malloc’s 250ns+ adds up across many allocations |
| Determinism | Predictable worst case | malloc’s p99 can be 100x its p50 |
| Throughput | 1M+ messages/second | Lock contention destroys throughput |
| Memory | Minimize cache misses | Scattered allocations harm locality |

Real-world impact: A trading strategy that allocates during the hot path will have inconsistent performance. On the day it matters most (high volatility), it will perform worst (more allocations, more contention).

The HFT rule: Never call malloc in the hot path.

Instead:

  1. Pre-allocate all memory at startup
  2. Use pool allocators for fixed-size objects (orders, quotes, trade records)
  3. Use arena allocators for temporary per-message processing
  4. Use thread-local pools to eliminate contention

2.3 Historical Context

Memory allocation has evolved significantly:

| Era | Approach | Characteristics |
| --- | --- | --- |
| 1970s | Simple first-fit | O(n) search, severe fragmentation |
| 1980s | Buddy allocators | Power-of-2 sizes, reduced fragmentation |
| 1990s | Doug Lea’s malloc (dlmalloc) | Size classes, boundary tags |
| 2000s | tcmalloc, jemalloc | Thread caching, per-CPU caches |
| 2010s+ | Custom HFT allocators | Application-specific, zero overhead |

Key insight: General-purpose allocators keep adding features to handle more use cases. HFT goes in the opposite direction: strip away everything not needed for the specific use case.

2.4 Common Misconceptions

Misconception 1: “jemalloc/tcmalloc is fast enough”

Reality: They are faster than glibc malloc but still have unpredictable latency. Thread cache replenishment can stall, and their complexity means more cache pollution.

Misconception 2: “I can just use stack allocation”

Reality: Stack allocation is fast but limited in size (typically 1-8MB) and lifetime. You cannot return stack-allocated data from a function, and deep call stacks compete for stack space.

Misconception 3: “Object pools are only for memory savings”

Reality: The primary benefit in HFT is latency determinism, not memory efficiency. Pools guarantee O(1) allocation with no system calls.

Misconception 4: “Rust’s ownership system eliminates the need for custom allocators”

Reality: Rust’s safety guarantees help prevent bugs, but Box::new() still calls the global allocator (malloc). For HFT Rust code, you still need custom allocators.


3. Project Specification

3.1 What You Will Build

A custom memory allocation library that provides:

  1. Fixed-size pool allocator: Pre-allocated blocks of a single size with O(1) alloc/free
  2. Multi-pool allocator: Multiple pools for different size classes
  3. Arena allocator: Bump-pointer allocation with bulk reset
  4. Thread-local wrapper: Per-thread pools without locks
  5. C++ std::allocator integration: Use with std::vector, std::list, etc.
  6. Comprehensive benchmarks: Compare against malloc, jemalloc

3.2 Functional Requirements

| Requirement | Description |
| --- | --- |
| FR1 | Pool allocator with configurable block size and count |
| FR2 | O(1) allocation and deallocation for pool allocator |
| FR3 | Arena allocator with configurable size and reset capability |
| FR4 | Multi-pool supporting at least 4 size classes (e.g., 32B, 64B, 128B, 256B) |
| FR5 | Thread-local pool storage without locks for same-thread alloc/free |
| FR6 | Statistics tracking: allocations, frees, high watermark, current usage |
| FR7 | Debug mode: detect double-free, use-after-free (at cost of performance) |
| FR8 | Integration with C++ STL containers via std::allocator interface |

3.3 Non-Functional Requirements

| Requirement | Target |
| --- | --- |
| NFR1 | Pool allocation: <15ns average latency |
| NFR2 | Pool deallocation: <10ns average latency |
| NFR3 | Deterministic: p99 < 2x p50 for pool operations |
| NFR4 | Zero system calls after initialization |
| NFR5 | Memory overhead: <5% for block metadata |
| NFR6 | Thread-local pools: zero contention between threads |

3.4 Example Usage / Output

Benchmark Output:

=== ALLOCATOR BENCHMARK ===
Running 1,000,000 allocation/deallocation cycles...

malloc (glibc 2.31):
  alloc avg:  247ns   p50: 180ns   p99: 1.2μs   p999: 8.5μs
  free avg:   125ns   p50: 95ns    p99: 820ns   p999: 5.2μs

pool_alloc (fixed 64B blocks):
  alloc avg:  8ns     p50: 7ns     p99: 12ns    p999: 15ns
  free avg:   6ns     p50: 5ns     p99: 9ns     p999: 11ns

arena_alloc (bump pointer):
  alloc avg:  3ns     p50: 3ns     p99: 4ns     p999: 5ns
  reset avg:  2ns     (bulk free)

Improvement over malloc:
  Pool allocation:  30x faster average, 100x better p99
  Arena allocation: 80x faster average, 300x better p99

=== INTEGRATION TEST: ORDER BOOK ===
Order book with std::allocator (malloc):
  Insert order avg: 850ns   p99: 12μs
  Remove order avg: 420ns   p99: 8μs

Order book with pool_allocator:
  Insert order avg: 320ns   p99: 450ns
  Remove order avg: 180ns   p99: 280ns

Improvement: 62% lower latency, 96% lower tail latency

=== MEMORY STATS ===
Pool: 64B blocks
  Total capacity: 65536 blocks (4MB)
  Current used:   12847 blocks
  Peak usage:     24531 blocks
  Allocations:    1,847,293
  Frees:          1,834,446

Arena: 1MB
  Current offset: 524288 bytes
  Resets:         1,423
  Allocations:    14,847,291

Code Usage Example (C++):

// Example of how your allocator will be used
#include "pool_allocator.hpp"

// Create a pool for Order objects
using OrderPool = PoolAllocator<Order, 10000>;
OrderPool order_pool;

// Allocate without malloc
Order* order = order_pool.allocate();
order->price = 100.50;
order->quantity = 100;

// Use in hot path with zero malloc calls
process_order(order);

// Free back to pool
order_pool.deallocate(order);

// Use with STL containers
std::vector<Order, OrderPool> orders(order_pool);
orders.push_back(Order{...});  // Uses pool, not malloc

3.5 Real World Outcome

After completing this project, you will have:

  1. A working allocator library you can integrate into your order book from Project 1
  2. Measurable latency improvements with benchmark data showing 10-100x improvement
  3. Understanding of memory systems that applies to any performance-critical code
  4. Interview-ready knowledge about memory allocation internals

This allocator will directly integrate with your other HFT projects:

  • Project 1 (Order Book): Use pool allocator for Order and PriceLevel objects
  • Project 2 (Lock-Free Queue): Use arena allocator for message parsing
  • Project 3 (Matching Engine): Use thread-local pools for per-connection state

4. Solution Architecture

4.1 High-Level Design

Custom Allocator System Architecture
┌─────────────────────────────────────────────────────────────────────┐
│                                                                     │
│  APPLICATION LAYER                                                  │
│  ┌─────────────────────────────────────────────────────────────┐   │
│  │  std::vector<T, PoolAllocator<T>>                           │   │
│  │  Order* order = pool.allocate()                             │   │
│  │  Message* msg = arena.allocate(sizeof(Message))             │   │
│  └─────────────────────────────────────────────────────────────┘   │
│                              │                                      │
│                              ▼                                      │
│  ┌─────────────────────────────────────────────────────────────┐   │
│  │  ALLOCATOR INTERFACE                                        │   │
│  │  ┌─────────────┐ ┌─────────────┐ ┌─────────────────────┐   │   │
│  │  │ allocate()  │ │deallocate() │ │ reset() (arena)     │   │   │
│  │  └─────────────┘ └─────────────┘ └─────────────────────┘   │   │
│  └─────────────────────────────────────────────────────────────┘   │
│                              │                                      │
│          ┌───────────────────┼───────────────────┐                 │
│          ▼                   ▼                   ▼                 │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐          │
│  │  POOL        │    │  ARENA       │    │  MULTI-POOL  │          │
│  │  ALLOCATOR   │    │  ALLOCATOR   │    │  ALLOCATOR   │          │
│  │              │    │              │    │              │          │
│  │ Fixed-size   │    │ Bump pointer │    │ Size-class   │          │
│  │ Free list    │    │ Bulk reset   │    │ routing      │          │
│  └──────────────┘    └──────────────┘    └──────────────┘          │
│          │                   │                   │                  │
│          └───────────────────┼───────────────────┘                 │
│                              ▼                                      │
│  ┌─────────────────────────────────────────────────────────────┐   │
│  │  THREAD-LOCAL WRAPPER (optional)                            │   │
│  │                                                             │   │
│  │  ┌───────────┐  ┌───────────┐  ┌───────────┐               │   │
│  │  │ Thread 1  │  │ Thread 2  │  │ Thread 3  │               │   │
│  │  │   Pool    │  │   Pool    │  │   Pool    │               │   │
│  │  └───────────┘  └───────────┘  └───────────┘               │   │
│  └─────────────────────────────────────────────────────────────┘   │
│                              │                                      │
│                              ▼                                      │
│  ┌─────────────────────────────────────────────────────────────┐   │
│  │  BACKING MEMORY                                             │   │
│  │                                                             │   │
│  │  Single large allocation at startup (mmap or new char[])   │   │
│  │  ┌─────────────────────────────────────────────────────┐   │   │
│  │  │░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░│   │   │
│  │  └─────────────────────────────────────────────────────┘   │   │
│  │                                                             │   │
│  └─────────────────────────────────────────────────────────────┘   │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

4.2 Key Components

| Component | Responsibility | Key Data Structures |
| --- | --- | --- |
| PoolAllocator | Fixed-size block allocation | Free list, block array |
| ArenaAllocator | Bump-pointer allocation | Bump pointer, base pointer |
| MultiPoolAllocator | Route to size-appropriate pool | Array of pools, size-to-pool mapping |
| ThreadLocalPool | Per-thread pool access | Thread-local storage (TLS) |
| Statistics | Track usage metrics | Atomic counters |
| DebugAllocator | Detect memory errors | Guard bytes, allocation tracking |

4.3 Data Structures

Pool Allocator Internal Structure

Pool Allocator Memory Layout
┌─────────────────────────────────────────────────────────────────────┐
│                                                                     │
│  METADATA (stored separately from blocks for cache efficiency)     │
│  ┌────────────────────────────────────────────────────────────┐    │
│  │ block_size: 64                                              │    │
│  │ num_blocks: 10000                                           │    │
│  │ free_count: 7342                                            │    │
│  │ free_list_head: ──────────────────────────────┐             │    │
│  │ memory_base: ──────────────────────┐          │             │    │
│  └────────────────────────────────────│──────────│─────────────┘    │
│                                       │          │                  │
│                                       ▼          │                  │
│  BLOCK ARRAY                                     │                  │
│  ┌───────────────────────────────────────────────│─────────────┐   │
│  │                                               │             │   │
│  │  Block 0      Block 1      Block 2           │             │   │
│  │  ┌────────┐   ┌────────┐   ┌────────┐   ◄────┘             │   │
│  │  │ IN USE │   │ next ──────► next ──────► NULL             │   │
│  │  │ (data) │   │ (free) │   │ (free) │                      │   │
│  │  │        │   │        │   │        │                      │   │
│  │  └────────┘   └────────┘   └────────┘   ...                │   │
│  │     64B         64B          64B                            │   │
│  │                                                             │   │
│  └─────────────────────────────────────────────────────────────┘   │
│                                                                     │
│  INTRUSIVE FREE LIST (no separate allocation for list nodes!)     │
│                                                                     │
│  When a block is FREE, its first 8 bytes store the 'next' pointer │
│  When a block is IN USE, user data overwrites the 'next' pointer  │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

Arena Allocator Internal Structure

Arena Allocator Memory Layout
┌─────────────────────────────────────────────────────────────────────┐
│                                                                     │
│  METADATA                                                          │
│  ┌──────────────────────────────────────────────────────────────┐  │
│  │ base:     0x7f8b4c000000                                     │  │
│  │ end:      0x7f8b4c100000  (base + 1MB)                       │  │
│  │ current:  0x7f8b4c002800  (next allocation position)         │  │
│  │ allocs:   142                                                │  │
│  │ resets:   3                                                  │  │
│  └──────────────────────────────────────────────────────────────┘  │
│                                                                     │
│  MEMORY REGION                                                     │
│  ┌──────────────────────────────────────────────────────────────┐  │
│  │████████████████████████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░│  │
│  │                           ▲                                 ▲│  │
│  │                           │                                 ││  │
│  │                        current                            end│  │
│  │                                                              │  │
│  │  ◄────── allocated ──────►◄────────── available ────────────►│  │
│  │                                                              │  │
│  └──────────────────────────────────────────────────────────────┘  │
│                                                                     │
│  ALLOCATION (returns current, bumps pointer)                       │
│  ┌──────────────────────────────────────────────────────────────┐  │
│  │ Before: current = 0x2800                                     │  │
│  │ Request: 128 bytes (aligned to 8)                            │  │
│  │ After:  current = 0x2880                                     │  │
│  │ Return: 0x2800                                               │  │
│  └──────────────────────────────────────────────────────────────┘  │
│                                                                     │
│  RESET (restore to beginning)                                      │
│  ┌──────────────────────────────────────────────────────────────┐  │
│  │ current = base                                               │  │
│  │ // All previous allocations now invalid                      │  │
│  └──────────────────────────────────────────────────────────────┘  │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

Multi-Pool Size Class Routing

Multi-Pool Allocator Size Class Design
┌─────────────────────────────────────────────────────────────────────┐
│                                                                     │
│  SIZE CLASSES (power of 2 for simplicity)                          │
│                                                                     │
│  Request Size    Rounded To    Pool Index    Wasted Bytes (max)    │
│  ┌────────────┬─────────────┬────────────┬──────────────────────┐  │
│  │   1-32     │     32      │     0      │      31 (97%)        │  │
│  │  33-64     │     64      │     1      │      31 (48%)        │  │
│  │  65-128    │    128      │     2      │      63 (49%)        │  │
│  │ 129-256    │    256      │     3      │     127 (50%)        │  │
│  │ 257-512    │    512      │     4      │     255 (50%)        │  │
│  │ 513-1024   │   1024      │     5      │     511 (50%)        │  │
│  │  >1024     │   fallback to malloc     │                      │  │
│  └────────────┴─────────────┴────────────┴──────────────────────┘  │
│                                                                     │
│  ROUTING FUNCTION (must be O(1)!)                                  │
│  ┌──────────────────────────────────────────────────────────────┐  │
│  │ // Efficient power-of-2 routing using bit manipulation       │  │
│  │ pool_index = (size <= 32) ? 0 :                              │  │
│  │              (63 - count_leading_zeros(size - 1)) - 4;       │  │
│  │                                                              │  │
│  │ // Or use lookup table for small sizes (cache-friendly)      │  │
│  │ pool_index = size_to_pool_table[size];                       │  │
│  └──────────────────────────────────────────────────────────────┘  │
│                                                                     │
│  POOL ARRAY                                                        │
│  ┌──────────────────────────────────────────────────────────────┐  │
│  │ pools[0]: PoolAllocator<32>   ────► [32B blocks...]          │  │
│  │ pools[1]: PoolAllocator<64>   ────► [64B blocks...]          │  │
│  │ pools[2]: PoolAllocator<128>  ────► [128B blocks...]         │  │
│  │ pools[3]: PoolAllocator<256>  ────► [256B blocks...]         │  │
│  │ pools[4]: PoolAllocator<512>  ────► [512B blocks...]         │  │
│  │ pools[5]: PoolAllocator<1024> ────► [1024B blocks...]        │  │
│  └──────────────────────────────────────────────────────────────┘  │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
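
A C++20 sketch of the clz-based routing function from the diagram, using std::countl_zero as the portable count-leading-zeros. Callers are assumed to divert requests above 1024B to the fallback path first:

#include <bit>
#include <cstddef>
#include <cstdint>

// Maps a request size to the size-class index in the table above:
// 1-32 -> 0, 33-64 -> 1, ..., 513-1024 -> 5.
inline constexpr unsigned size_class(std::size_t size) {
    if (size <= 32) return 0;
    // 63 - countl_zero(size - 1) is floor(log2(size - 1)), so this
    // expression computes ceil(log2(size)) - 5 for any size > 32.
    return 59u - static_cast<unsigned>(
        std::countl_zero(static_cast<std::uint64_t>(size - 1)));
}

static_assert(size_class(1) == 0 && size_class(33) == 1 && size_class(1024) == 5);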

4.4 Algorithm Overview

Pool Allocator Algorithms

Initialization:

  1. Allocate contiguous memory for all blocks: memory = mmap(block_size * num_blocks)
  2. Initialize free list by linking all blocks: block[i].next = &block[i+1]
  3. Set free_list_head = &block[0]

Allocation (O(1)):

  1. Check free_list_head != nullptr (pool exhausted check)
  2. block = free_list_head
  3. free_list_head = block->next
  4. Return block

Deallocation (O(1)):

  1. block->next = free_list_head
  2. free_list_head = block

Arena Allocator Algorithms

Initialization:

  1. Allocate contiguous memory: memory = mmap(arena_size)
  2. Set current = memory, end = memory + arena_size

Allocation (O(1)):

  1. aligned_size = align_up(size, 8) // 8-byte alignment
  2. Check current + aligned_size <= end (capacity check)
  3. result = current
  4. current += aligned_size
  5. Return result

Reset (O(1)):

  1. current = base
  2. Note: All previous pointers are now invalid!

Thread-Local Pool Access

Get Thread-Local Pool:

  1. Access TLS variable: thread_local PoolAllocator* tls_pool
  2. If tls_pool == nullptr, initialize for this thread
  3. Return tls_pool

Key Insight: No locks needed because each thread has its own pool. Cross-thread free (freeing memory allocated by another thread) requires special handling - typically a lock-free return queue.
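
A sketch of the lazy TLS pattern, reusing the FixedPool sketch from section 2.1 (any equivalent pool type works). C++ constructs a thread_local object on each thread's first use, so there are no locks anywhere on the allocation path:

#include <cstddef>

// Each thread gets its own pool instance, constructed on first access.
template <std::size_t BlockSize, std::size_t NumBlocks>
class ThreadLocalPool {
public:
    static void* allocate() { return local_pool().allocate(); }

    // Must be called from the thread that allocated p (Option A for
    // cross-thread free; see Phase 4 in section 5.10).
    static void deallocate(void* p) { local_pool().deallocate(p); }

private:
    static FixedPool<BlockSize, NumBlocks>& local_pool() {
        thread_local FixedPool<BlockSize, NumBlocks> pool;  // lazy per-thread init
        return pool;
    }
};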


5. Implementation Guide

5.1 Development Environment Setup

Required Tools:

| Tool | Purpose | Installation |
| --- | --- | --- |
| GCC 12+ or Clang 15+ | C++20 support | apt install g++-12 or apt install clang-15 |
| CMake 3.20+ | Build system | apt install cmake |
| perf | Benchmarking | apt install linux-tools-generic |
| Valgrind | Memory debugging | apt install valgrind |
| Google Benchmark | Microbenchmarks | apt install libbenchmark-dev |

Recommended Compiler Flags:

set(CMAKE_CXX_STANDARD 20)
set(CMAKE_CXX_FLAGS "-O3 -march=native -fno-omit-frame-pointer")
set(CMAKE_CXX_FLAGS_DEBUG "-O0 -g -fsanitize=address,undefined")

5.2 Project Structure

hft-allocator/
├── CMakeLists.txt
├── include/
│   ├── hft/
│   │   ├── pool_allocator.hpp      # Fixed-size pool
│   │   ├── arena_allocator.hpp     # Bump allocator
│   │   ├── multi_pool.hpp          # Size-class routing
│   │   ├── thread_local_pool.hpp   # TLS wrapper
│   │   ├── stl_adapter.hpp         # std::allocator interface
│   │   └── allocator_stats.hpp     # Statistics tracking
├── src/
│   └── allocator_stats.cpp         # Statistics implementation
├── tests/
│   ├── pool_test.cpp               # Unit tests
│   ├── arena_test.cpp
│   ├── multi_pool_test.cpp
│   └── thread_local_test.cpp
├── benchmarks/
│   ├── alloc_benchmark.cpp         # Performance comparison
│   └── integration_benchmark.cpp   # With order book
└── examples/
    ├── basic_usage.cpp
    └── orderbook_integration.cpp

5.3 The Core Question You’re Answering

“How do you provide memory allocation with O(1) performance and zero fragmentation for a known set of object sizes?”

This question forces you to understand:

  • Why general-purpose allocators are slow (flexibility has cost)
  • How to trade flexibility for speed (fixed sizes, pre-allocation)
  • The difference between internal and external fragmentation
  • How to achieve deterministic performance (no system calls, no locks)

5.4 Concepts You Must Understand First

Before implementing, verify you understand these concepts:

| Concept | Self-Assessment Question | Resource if Stuck |
| --- | --- | --- |
| Pointers and Memory | What is the difference between &x and *p? What does p + 1 mean for a pointer to a 64-byte struct? | CS:APP Ch. 3 |
| Memory Layout | Where are stack, heap, and static variables stored? Which grows up vs down? | CS:APP Ch. 9 |
| Alignment | Why must a double be aligned to 8 bytes on x86-64? What happens if it’s not? | CS:APP Ch. 3.9 |
| Cache Lines | How big is a cache line? Why does accessing arr[0] also load arr[1-7]? | CS:APP Ch. 6 |
| System Calls | What happens when you call malloc? When does it call sbrk? | TLPI Ch. 7 |
| Thread-Local Storage | What is thread_local in C++? How is it different from global variables? | C++ reference |

5.5 Questions to Guide Your Design

Pool Allocator Design:

  1. How will you store the free list? Separate allocation or intrusive in the blocks?
  2. What happens when the pool is exhausted? Fail fast, grow, or fallback?
  3. How will you handle alignment requirements for different types?
  4. Should you zero memory on allocation? On deallocation?

Arena Allocator Design:

  1. What alignment will you use for allocations? Why 8 bytes?
  2. How will you detect arena overflow? Check on every allocation?
  3. When should an arena be reset? Who is responsible?
  4. Can you have multiple arenas for different purposes?

Thread-Local Design:

  1. When will thread-local pools be initialized? Lazy or at thread start?
  2. What happens if Thread A allocates and Thread B tries to free?
  3. How will you handle thread termination? Leak the pool or reclaim?

Integration Design:

  1. How will your allocator satisfy the C++ Allocator concept requirements?
  2. How will you enable/disable statistics tracking? Compile-time or runtime?
  3. What debug checks should be enabled in development builds?

5.6 Thinking Exercise

Before writing code, trace through this scenario by hand:

Scenario: Pool with 4 blocks of 32 bytes each.

Initial State:
  memory: [Block0][Block1][Block2][Block3]
  free_list: Block0 -> Block1 -> Block2 -> Block3 -> null

Operations:
1. p1 = allocate()   -- Draw the state after this
2. p2 = allocate()   -- Draw the state after this
3. deallocate(p1)    -- Draw the state after this
4. p3 = allocate()   -- Where does p3 point? Why?
5. p4 = allocate()   -- Draw the state after this
6. p5 = allocate()   -- What happens? Why?

Work through this on paper before implementing. Understanding the free list mechanics is essential.

5.7 Hints in Layers

Hint 1: Starting Point (Conceptual Direction) Start with the simplest possible pool allocator - a fixed block size known at compile time, single-threaded, no statistics. Get this working and benchmarked before adding complexity.

Hint 2: Next Level (More Specific Guidance) Use an intrusive free list - store the next pointer inside the free block itself. This eliminates separate allocation for list nodes. The key insight: when a block is free, you can use its memory for bookkeeping.

Hint 3: Technical Details (Approach/Pseudocode)

Pool allocator structure:
- void* memory_base     // mmap'd or new char[]
- void* free_list_head  // points to first free block
- size_t block_size     // constant
- size_t num_blocks     // constant

allocate():
  if (free_list_head == nullptr) return nullptr;
  void* result = free_list_head;
  free_list_head = *((void**)free_list_head);  // Follow the intrusive pointer
  return result;

deallocate(void* ptr):
  *((void**)ptr) = free_list_head;  // Store old head in freed block
  free_list_head = ptr;             // New head is the freed block

Hint 4: Tools and Debugging (Verification Methods)

  • Use AddressSanitizer (-fsanitize=address) to catch use-after-free and buffer overflows
  • Use perf stat to verify zero system calls after init: perf stat -e syscalls:sys_enter_* ./benchmark
  • Add guard bytes (0xDEADBEEF) before/after blocks in debug mode to detect overflow (see the sketch after this list)
  • Print allocation traces with __builtin_return_address(0) to debug double-frees
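
A sketch of that guard-byte idea, assuming debug builds reserve 8 extra bytes on each side of the payload (the [guard][payload][guard] layout and the function names are illustrative):

#include <cstddef>
#include <cstdint>
#include <cstring>

constexpr std::uint64_t GUARD = 0xDEADBEEFDEADBEEFull;

void write_guards(void* block, std::size_t payload_size) {
    char* p = static_cast<char*>(block);
    std::memcpy(p, &GUARD, sizeof(GUARD));                                 // front guard
    std::memcpy(p + sizeof(GUARD) + payload_size, &GUARD, sizeof(GUARD));  // back guard
}

bool guards_intact(const void* block, std::size_t payload_size) {
    std::uint64_t front, back;
    const char* p = static_cast<const char*>(block);
    std::memcpy(&front, p, sizeof(front));
    std::memcpy(&back, p + sizeof(GUARD) + payload_size, sizeof(back));
    return front == GUARD && back == GUARD;  // a mismatch means user code overflowed its block
}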

5.8 The Interview Questions They’ll Ask

After completing this project, be prepared to answer:

  1. “Why is malloc slow for low-latency applications?”
    • Expected: Discuss system calls, fragmentation, locking, non-determinism
    • Bonus: Mention specific latency numbers (250ns typical, 50us worst case)
  2. “How does a pool allocator achieve O(1) allocation?”
    • Expected: Explain free list, intrusive pointers, pre-allocation
    • Bonus: Discuss cache implications of contiguous blocks
  3. “What is the trade-off between pool and arena allocators?”
    • Expected: Pool allows individual free, arena requires bulk free
    • Bonus: Discuss use cases for each in a trading system
  4. “How do you avoid lock contention in a multi-threaded allocator?”
    • Expected: Thread-local storage, per-thread pools
    • Bonus: Discuss cross-thread free problem and solutions
  5. “What is false sharing and how do you prevent it?”
    • Expected: Cache line sharing, padding to 64 bytes
    • Bonus: Explain alignas(64) and __attribute__((aligned(64)))
  6. “How would you integrate a custom allocator with STL containers?”
    • Expected: Implement Allocator concept (allocate, deallocate, rebind)
    • Bonus: Discuss stateful vs stateless allocators
  7. “What debugging features would you add to a custom allocator?”
    • Expected: Guard bytes, double-free detection, allocation tracking
    • Bonus: Discuss compile-time vs runtime debug mode toggle

5.9 Books That Will Help

| Topic | Book | Specific Chapters |
| --- | --- | --- |
| Memory allocation internals | “Computer Systems: A Programmer’s Perspective” (CS:APP) | Ch. 9 (Virtual Memory) |
| Cache hierarchy | CS:APP | Ch. 6 (Memory Hierarchy) |
| Data structure alignment | CS:APP | Ch. 3.9 (Heterogeneous Data Structures) |
| System calls (mmap) | “The Linux Programming Interface” (TLPI) | Ch. 49 (Memory Mappings) |
| Pool allocator pattern | “Fluent C” by Preschern | Ch. 6 (Memory Pool) |
| C++ allocator interface | “Effective STL” by Meyers | Items 10-11 (allocator conventions, legitimate uses of custom allocators) |
| Lock-free programming | “C++ Concurrency in Action” by Williams | Ch. 7 (Lock-free data structures) |
| Thread-local storage | “Rust Atomics and Locks” by Bos | Ch. 8 (Operating System Primitives) |

5.10 Implementation Phases

Phase 1: Basic Fixed-Size Pool (Days 1-2)

Milestone: Single-threaded pool allocator that passes basic tests.

Tasks:

  1. Define PoolAllocator class with template parameter for block size
  2. Implement constructor that allocates backing memory with mmap or new char[]
  3. Initialize intrusive free list linking all blocks
  4. Implement allocate() - pop from free list
  5. Implement deallocate() - push to free list
  6. Write unit tests:
    • Allocate all blocks, verify all unique pointers
    • Free all blocks, allocate again, verify reuse
    • Allocate one more than capacity, verify nullptr return

Verification:

# All tests pass
./pool_test

# Check no memory leaks
valgrind ./pool_test

Phase 2: Arena Allocator (Day 3)

Milestone: Bump-pointer allocator with reset capability.

Tasks:

  1. Define ArenaAllocator class with size parameter
  2. Implement constructor with backing memory allocation
  3. Implement allocate(size_t n) - bump pointer with alignment
  4. Implement reset() - restore pointer to beginning
  5. Add overflow checking
  6. Write unit tests:
    • Allocate various sizes, verify alignment
    • Fill arena, verify nullptr on overflow
    • Reset, verify can allocate again

Verification:

./arena_test

Phase 3: Multi-Pool Size Classes (Day 4)

Milestone: Allocator that routes to appropriate pool based on size.

Tasks:

  1. Define size classes (32, 64, 128, 256 bytes minimum)
  2. Create array of pools, one per size class
  3. Implement size-to-pool routing (bit manipulation or lookup table)
  4. Implement allocate(size_t n) with routing
  5. Implement deallocate(void* p, size_t n) - note: size must be provided!
  6. Handle oversized allocations (fallback to malloc or reject)
  7. Write unit tests for various sizes

Design Decision: How to know which pool a pointer came from when freeing?

  • Option A: Require size on deallocate (fastest, used by jemalloc)
  • Option B: Store header before each block (overhead per allocation)
  • Option C: Check pointer ranges (one comparison per pool; see the sketch below)
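
A sketch of Option C: because each pool's backing memory is one contiguous range, ownership is two comparisons (PoolRange and owns() are illustrative names):

#include <cstdint>

struct PoolRange {
    std::uintptr_t base;   // start of backing memory
    std::uintptr_t end;    // base + block_size * num_blocks

    bool owns(const void* p) const {
        auto addr = reinterpret_cast<std::uintptr_t>(p);
        return addr >= base && addr < end;
    }
};

// deallocate(p) then probes each pool in turn - with 6 size classes
// that is at most 6 comparisons, still constant time.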

Phase 4: Thread-Local Wrapper (Day 5)

Milestone: Per-thread pools without locks.

Tasks:

  1. Create ThreadLocalPool wrapper using thread_local
  2. Implement lazy initialization on first access
  3. Test with multiple threads allocating concurrently
  4. Verify zero lock contention with benchmarks

Complication - Cross-Thread Free: What if Thread A allocates and Thread B frees? Options:

  • Option A: Disallow (assert same thread)
  • Option B: Lock-free return queue per pool
  • Option C: Lock only for cross-thread free (rare path)

For HFT, Option A is usually correct - memory should be freed by the same thread that allocated it.

Phase 5: C++ Allocator Integration (Day 6)

Milestone: Use pool allocator with std::vector<T, PoolAllocator<T>>.

Tasks:

  1. Implement C++17 Allocator concept requirements:
    • value_type typedef
    • allocate(n) returning T*
    • deallocate(T* p, n)
  2. Implement rebind for container node types
  3. Handle containers that allocate different sizes (e.g., std::list nodes)
  4. Test with std::vector, std::list, std::map

Challenge: Containers may allocate auxiliary data (e.g., map nodes) of different sizes than T. Your allocator must handle this via rebinding or multi-pool.
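
A minimal C++17 Allocator sketch over a multi-pool backend; std::allocator_traits fills in everything else (rebind, construct, destroy). The MultiPool stub and its two method names are assumptions standing in for your multi_pool.hpp:

#include <cstddef>
#include <new>

// Stand-in for the multi-pool from multi_pool.hpp; only the two raw
// methods the adapter needs.
struct MultiPool {
    void* allocate(std::size_t bytes);
    void  deallocate(void* p, std::size_t bytes);
};

template <typename T>
struct PoolStlAllocator {
    using value_type = T;

    MultiPool* pool;   // non-owning; this is a stateful allocator

    explicit PoolStlAllocator(MultiPool* p) noexcept : pool(p) {}

    // Rebind constructor: lets std::list<T> build node allocators that
    // draw from the same backing pools.
    template <typename U>
    PoolStlAllocator(const PoolStlAllocator<U>& other) noexcept
        : pool(other.pool) {}

    T* allocate(std::size_t n) {
        if (void* p = pool->allocate(n * sizeof(T)))
            return static_cast<T*>(p);
        throw std::bad_alloc{};
    }
    void deallocate(T* p, std::size_t n) noexcept {
        pool->deallocate(p, n * sizeof(T));
    }
};

// Equal iff they share a backing pool (required by the Allocator concept).
template <typename T, typename U>
bool operator==(const PoolStlAllocator<T>& a, const PoolStlAllocator<U>& b) {
    return a.pool == b.pool;
}
template <typename T, typename U>
bool operator!=(const PoolStlAllocator<T>& a, const PoolStlAllocator<U>& b) {
    return !(a == b);
}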

Phase 6: Benchmarking and Statistics (Day 7)

Milestone: Demonstrate 10-100x improvement over malloc.

Tasks:

  1. Add statistics tracking: allocs, frees, current usage, high watermark
  2. Create benchmark comparing:
    • Pool allocator vs malloc
    • Arena allocator vs malloc
    • Thread-local pool vs shared pool
  3. Measure p50, p99, p999 latencies
  4. Integrate with order book from Project 1
  5. Document results

Benchmark Framework:

// Use Google Benchmark for consistent measurements
#include <benchmark/benchmark.h>
#include <cstdlib>

static void BM_MallocAlloc(benchmark::State& state) {
    for (auto _ : state) {
        void* p = malloc(64);
        benchmark::DoNotOptimize(p);
        free(p);
    }
}
BENCHMARK(BM_MallocAlloc);

static void BM_PoolAlloc(benchmark::State& state) {
    PoolAllocator<64, 10000> pool;  // constructed once, outside the timed loop
    for (auto _ : state) {
        void* p = pool.allocate();
        benchmark::DoNotOptimize(p);
        pool.deallocate(p);
    }
}
BENCHMARK(BM_PoolAlloc);
BENCHMARK_MAIN();

5.11 Key Implementation Decisions

| Decision | Recommended Choice | Rationale |
| --- | --- | --- |
| Backing memory | mmap with MAP_PRIVATE \| MAP_ANONYMOUS | Avoid heap fragmentation, easy to return to OS (sketch below) |
| Block size | Template parameter | Compile-time optimization, no runtime overhead |
| Alignment | 8 bytes minimum, configurable | Match x86-64 requirements, allow AVX alignment |
| Overflow behavior | Return nullptr | Fail fast, let caller decide |
| Debug checks | Compile-time flag | Zero overhead in release builds |
| Statistics | Optional, runtime flag | Allow measurement without recompile |
| Cross-thread free | Assert same thread | Simplest correct behavior for HFT |
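
A sketch of the recommended backing-memory strategy: one anonymous private mapping acquired at startup (Linux-specific; error handling reduced to a null return; the function name is illustrative):

#include <sys/mman.h>
#include <cstddef>

void* acquire_backing_memory(std::size_t bytes) {
    void* mem = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return mem == MAP_FAILED ? nullptr : mem;
}
// Release with munmap(mem, bytes) at shutdown - never during trading hours.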

6. Testing Strategy

6.1 Unit Tests

Pool Allocator Tests:

#include <gtest/gtest.h>
#include <cstdint>
#include <set>

TEST(PoolAllocator, AllocatesUniquePointers) {
    PoolAllocator<64, 100> pool;
    std::set<void*> ptrs;
    for (int i = 0; i < 100; ++i) {
        void* p = pool.allocate();
        ASSERT_NE(p, nullptr);
        ASSERT_EQ(ptrs.count(p), 0);  // Must be unique
        ptrs.insert(p);
    }
}

TEST(PoolAllocator, ReturnsNullWhenExhausted) {
    PoolAllocator<64, 10> pool;
    for (int i = 0; i < 10; ++i) pool.allocate();
    ASSERT_EQ(pool.allocate(), nullptr);
}

TEST(PoolAllocator, ReusesFreedBlocks) {
    PoolAllocator<64, 10> pool;
    void* p1 = pool.allocate();
    pool.deallocate(p1);
    void* p2 = pool.allocate();
    ASSERT_EQ(p1, p2);  // LIFO reuse
}

TEST(PoolAllocator, AlignmentCorrect) {
    PoolAllocator<64, 100> pool;
    for (int i = 0; i < 100; ++i) {
        void* p = pool.allocate();
        ASSERT_EQ(reinterpret_cast<uintptr_t>(p) % 8, 0);
    }
}

Arena Allocator Tests:

TEST(ArenaAllocator, BumpPointerAdvances) {
    ArenaAllocator<1024> arena;
    void* p1 = arena.allocate(32);
    void* p2 = arena.allocate(32);
    ASSERT_EQ(static_cast<char*>(p2) - static_cast<char*>(p1), 32);
}

TEST(ArenaAllocator, ResetRestoresCapacity) {
    ArenaAllocator<64> arena;
    for (int i = 0; i < 8; ++i) arena.allocate(8);
    ASSERT_EQ(arena.allocate(8), nullptr);  // Full
    arena.reset();
    ASSERT_NE(arena.allocate(8), nullptr);  // Available again
}

TEST(ArenaAllocator, AlignmentRoundsUp) {
    ArenaAllocator<1024> arena;
    void* p1 = arena.allocate(3);  // Rounds to 8
    void* p2 = arena.allocate(1);
    ASSERT_EQ(static_cast<char*>(p2) - static_cast<char*>(p1), 8);
}

6.2 Stress Tests

#include <gtest/gtest.h>
#include <atomic>
#include <cstdlib>
#include <thread>
#include <vector>

TEST(PoolAllocator, StressTestMillionCycles) {
    PoolAllocator<64, 1000> pool;
    std::vector<void*> ptrs;

    for (int cycle = 0; cycle < 1000000; ++cycle) {
        // Randomly allocate or free
        if (ptrs.empty() || (rand() % 2 == 0 && ptrs.size() < 1000)) {
            void* p = pool.allocate();
            if (p) ptrs.push_back(p);
        } else {
            size_t idx = rand() % ptrs.size();
            pool.deallocate(ptrs[idx]);
            ptrs.erase(ptrs.begin() + idx);
        }
    }

    // Free remaining
    for (void* p : ptrs) pool.deallocate(p);
}

TEST(ThreadLocalPool, ConcurrentAccess) {
    ThreadLocalPool<64, 1000> pool;
    std::vector<std::thread> threads;
    std::atomic<int> total_allocs{0};

    for (int t = 0; t < 8; ++t) {
        threads.emplace_back([&]() {
            for (int i = 0; i < 100000; ++i) {
                void* p = pool.allocate();
                ASSERT_NE(p, nullptr);
                pool.deallocate(p);
                total_allocs++;
            }
        });
    }

    for (auto& t : threads) t.join();
    ASSERT_EQ(total_allocs.load(), 800000);
}

6.3 Property-Based Tests

Test invariants that should always hold:

// After any sequence of alloc/free, free_count + allocated_count == total_blocks
// All returned pointers are within the pool's memory range
// All returned pointers are aligned correctly
// Double-free should be detected (in debug mode)
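
One way to make these invariants executable - a sketch assuming the pool exposes the hypothetical introspection hooks used below (free_count(), allocated_count(), capacity(), free_list_head(), owns()):

#include <cassert>
#include <cstddef>
#include <cstdint>

template <typename Pool>
void check_invariants(const Pool& pool) {
    // Conservation: every block is either on the free list or handed out.
    assert(pool.free_count() + pool.allocated_count() == pool.capacity());

    // Every free-list node lies inside the pool and is properly aligned.
    std::size_t walked = 0;
    for (void* p = pool.free_list_head(); p != nullptr;
         p = *static_cast<void* const*>(p)) {
        assert(pool.owns(p));
        assert(reinterpret_cast<std::uintptr_t>(p) % alignof(std::max_align_t) == 0);
        ++walked;
    }
    assert(walked == pool.free_count());   // list length matches the counter
}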

6.4 Integration Tests

TEST(Integration, VectorWithPoolAllocator) {
    using OrderVector = std::vector<Order, PoolAllocator<Order, 1000>>;

    OrderVector orders;
    for (int i = 0; i < 100; ++i) {
        orders.push_back(Order{.id = i, .price = 100.0 + i});
    }

    ASSERT_EQ(orders.size(), 100);
    ASSERT_EQ(orders[50].id, 50);
}

7. Common Pitfalls & Debugging

7.1 Pitfall: Incorrect Free List Initialization

Symptom: Crash on first allocation, or same pointer returned multiple times.

Cause: Free list not properly initialized; blocks not linked correctly.

Fix: After allocating backing memory, iterate through all blocks and set each block’s next pointer:

for (size_t i = 0; i < num_blocks - 1; ++i) {
    *reinterpret_cast<void**>(block_at(i)) = block_at(i + 1);
}
*reinterpret_cast<void**>(block_at(num_blocks - 1)) = nullptr;
free_list_head = block_at(0);

Verification: Print free list after initialization, verify correct linking.

7.2 Pitfall: Alignment Violations

Symptom: SIGBUS on ARM, performance degradation on x86, incorrect data.

Cause: Blocks not aligned to required boundary (8 bytes for most types, 16/32 for SIMD).

Fix: Ensure block_size is a multiple of required alignment. Use alignas or align backing memory:

void* memory = aligned_alloc(64, total_size);  // 64-byte alignment for cache

Verification: Assert (uintptr_t)ptr % alignment == 0 on every allocation.

7.3 Pitfall: Use-After-Free Not Detected

Symptom: Mysterious data corruption, non-deterministic crashes.

Cause: Freed block reused while old pointer still in use.

Fix (debug mode): Fill freed blocks with poison pattern:

void deallocate(void* ptr) {
    #ifdef DEBUG
    memset(ptr, 0xDE, block_size);  // Poison pattern
    #endif
    // ... normal free logic
}

Verification: Enable AddressSanitizer during development.

7.4 Pitfall: Double Free Corruption

Symptom: Free list corruption, infinite loops, crashes.

Cause: Same pointer passed to deallocate twice.

Fix (debug mode): Maintain allocated set or use sentinel values:

void deallocate(void* ptr) {
    #ifdef DEBUG
    // Check first 8 bytes aren't already our free-list marker
    if (*reinterpret_cast<uint64_t*>(ptr) == FREE_MARKER) {
        abort();  // Double free!
    }
    *reinterpret_cast<uint64_t*>(ptr) = FREE_MARKER;
    #endif
    // ... normal free logic
}

7.5 Pitfall: False Sharing in Thread-Local Pools

Symptom: Thread-local pools slower than expected, high cache miss rates.

Cause: Pool metadata from different threads on same cache line.

Fix: Pad thread-local pool structures to cache line boundaries:

struct alignas(64) ThreadLocalPool {
    PoolAllocator<64, 1000> pool;
    // alignas(64) rounds sizeof up to a multiple of 64, so the compiler
    // inserts any padding itself. (A manual array sized
    // 64 - sizeof(...) % 64 is redundant, and wrong when the size is
    // already a multiple of 64: it would add a full extra 64 bytes.)
};

Verification: Use perf stat -e cache-misses to compare before/after.

7.6 Pitfall: Returning Dangling Pointer After Reset

Symptom: Arena reset invalidates pointers, but caller still uses them.

Cause: Arena allocator semantics not understood by caller.

Fix: Document clearly that reset() invalidates ALL previous allocations. Consider:

// Return a scope guard that resets on destruction
auto scope = arena.scope_guard();  // Resets arena when scope ends
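
A minimal RAII sketch of that scope guard (ArenaScope is an illustrative name; reset() is the arena's own bulk free):

class ArenaScope {
    Arena& arena_;
public:
    explicit ArenaScope(Arena& a) : arena_(a) {}
    ~ArenaScope() { arena_.reset(); }      // every arena pointer dies with the scope
    ArenaScope(const ArenaScope&) = delete;
    ArenaScope& operator=(const ArenaScope&) = delete;
};

Anything allocated while the guard is alive is bulk-freed at the closing brace, which makes the "reset invalidates everything" contract visible in the code.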

7.7 Debug Checklist

When something goes wrong:

  1. Enable AddressSanitizer: -fsanitize=address
  2. Add debug logging to allocate/deallocate
  3. Verify free list integrity after each operation
  4. Check pointer is within pool range before free
  5. Use Valgrind for detailed memory analysis
  6. Add guard bytes before/after blocks
  7. Print allocation/free traces with stack traces

8. Extensions & Challenges

8.1 Extension: Lock-Free Multi-Producer Pool

Implement a pool that allows multiple threads to allocate concurrently using lock-free techniques:

  • Use atomic compare-and-swap for free list head
  • Handle ABA problem with tagged pointers or hazard pointers
  • Benchmark against mutex-based version

Challenge: The ABA problem occurs when Thread A sees value A, Thread B changes A to B and back to A, then Thread A’s CAS succeeds incorrectly.
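
A naive lock-free pop as a starting point - exactly as written it is vulnerable to the ABA scenario above; production versions add a generation counter packed next to the pointer, or hazard pointers:

#include <atomic>

struct FreeNode { FreeNode* next; };

std::atomic<FreeNode*> free_head{nullptr};

void* lock_free_allocate() {
    FreeNode* head = free_head.load(std::memory_order_acquire);
    while (head != nullptr) {
        FreeNode* next = head->next;  // may read from a recycled node (the ABA hazard)
        if (free_head.compare_exchange_weak(head, next,
                                            std::memory_order_acq_rel,
                                            std::memory_order_acquire)) {
            return head;              // successfully popped
        }
        // compare_exchange_weak reloaded head on failure; just retry
    }
    return nullptr;                   // pool exhausted
}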

8.2 Extension: Growing Pool

Implement a pool that can grow when exhausted:

  • Allocate additional chunks on demand
  • Maintain linked list of chunks
  • Handle deallocation across chunks
  • Consider fragmentation implications

8.3 Extension: Memory-Mapped File Pool

Implement a pool backed by a memory-mapped file:

  • Persist allocations across process restarts
  • Handle crash recovery
  • Useful for maintaining order book state

8.4 Extension: NUMA-Aware Allocation

For multi-socket servers:

  • Detect NUMA topology
  • Allocate pools on local NUMA node
  • Bind threads to cores near their memory

8.5 Extension: Defragmentation

For long-running systems:

  • Track allocation patterns
  • Periodically compact pools
  • Handle live object relocation (requires cooperation from caller)

8.6 Challenge: Rust GlobalAlloc Implementation

Implement your allocator as a Rust GlobalAlloc:

#[global_allocator]
static ALLOC: PoolGlobalAlloc = PoolGlobalAlloc::new();

unsafe impl GlobalAlloc for PoolGlobalAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 { ... }
    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) { ... }
}

Challenge: GlobalAlloc must be thread-safe and handle any size/alignment.


9. Real-World Connections

9.1 Industry Usage

| Company/Project | Allocator Approach |
| --- | --- |
| Jane Street | Custom OCaml allocators, arena patterns |
| HRT | Custom C++ allocators, huge pages |
| Tower Research | Per-thread pools, zero-copy parsing |
| Game Engines | Frame allocators (arena per frame) |
| Databases | Buffer pool managers, slab allocators |

9.2 Production Allocators to Study

| Allocator | Key Innovation |
| --- | --- |
| jemalloc | Thread caching, size classes, arenas |
| tcmalloc | Per-thread caches, central free list |
| mimalloc | Free list sharding, segment-based |
| Hoard | Superblock-based, scalable |

9.3 Your Allocator vs Production

Your allocator is simpler but faster for your use case:

  • Production allocators handle ANY size; yours handles fixed sizes
  • Production allocators are general-purpose; yours is application-specific
  • Production allocators trade some speed for flexibility

This is the HFT philosophy: specialize for speed.

9.4 Integration with Your Other Projects

| Project | How Allocator Helps |
| --- | --- |
| P01: Order Book | Pool allocator for Order and PriceLevel objects |
| P02: Lock-Free Queue | Arena for parsing buffers, pool for queue nodes |
| P03: Matching Engine | Thread-local pools for per-connection state |
| P06: Full System | All of the above, plus statistics for monitoring |

10. Resources

10.1 Primary References

| Resource | Use For | Specific Sections |
| --- | --- | --- |
| “Computer Systems: A Programmer’s Perspective” | Memory fundamentals | Ch. 9 (Virtual Memory), Ch. 6 (Cache) |
| “The Linux Programming Interface” | mmap, system calls | Ch. 7, Ch. 49 |
| “Fluent C” by Preschern | Pool allocator pattern | Ch. 6 (Memory Pool) |
| “Building Low Latency Applications with C++” | HFT context | Ch. on memory management |

10.2 Articles and Talks

| Resource | Description |
| --- | --- |
| Andrei Alexandrescu: “std::allocator Is to Allocation What std::vector Is to Vexation” | Why std::allocator is limited |
| CppCon: Memory Allocation Strategies | Various CppCon talks |
| “What Every Programmer Should Know About Memory” | Ulrich Drepper’s classic paper |

10.3 Code References

| Repository | What to Learn |
| --- | --- |
| jemalloc source | Size classes, arenas, thread caching |
| tcmalloc source | Per-thread caches, central free list |
| mimalloc source | Segment-based, free list sharding |

10.4 Tools

| Tool | Purpose |
| --- | --- |
| perf | Performance profiling, cache analysis |
| valgrind | Memory error detection |
| heaptrack | Heap profiling |
| massif | Heap visualization |
| AddressSanitizer | Runtime memory error detection |

11. Self-Assessment Checklist

You have mastered this project when you can:

Conceptual Understanding:

  • Explain why malloc is unsuitable for HFT hot paths (latency, determinism)
  • Describe the trade-offs between pool, arena, and multi-pool allocators
  • Explain false sharing and how to prevent it
  • Describe the ABA problem in lock-free free lists

Implementation Skills:

  • Implement a pool allocator from scratch without reference
  • Implement an arena allocator with proper alignment
  • Create thread-local pools without lock contention
  • Integrate your allocator with std::vector using the Allocator concept

Performance Verification:

  • Benchmark your allocator vs malloc, demonstrating 10x+ improvement
  • Show deterministic p99 latency (< 2x p50)
  • Verify zero system calls after initialization using perf

Debugging Ability:

  • Add debug checks that detect double-free and use-after-free
  • Use AddressSanitizer to catch memory errors
  • Diagnose false sharing using cache miss counters

Real-World Application:

  • Integrate your allocator into the order book from Project 1
  • Measure end-to-end latency improvement in a realistic scenario

12. Submission / Completion Criteria

Your project is complete when you have:

12.1 Code Deliverables

  • pool_allocator.hpp: Fixed-size pool with O(1) alloc/free
  • arena_allocator.hpp: Bump allocator with reset
  • multi_pool.hpp: Size-class routing to pools
  • thread_local_pool.hpp: Per-thread pool wrapper
  • stl_adapter.hpp: C++ Allocator concept implementation
  • Comprehensive unit tests (>90% coverage)
  • Benchmark suite comparing to malloc

12.2 Documentation

  • README with build instructions
  • API documentation for each allocator type
  • Benchmark results with analysis

12.3 Performance Criteria

  • Pool allocator: <15ns average allocation
  • Pool allocator: p99 < 2x p50 (deterministic)
  • Zero system calls after initialization (verified with perf)
  • Multi-threaded benchmark shows zero lock contention

12.4 Integration

  • Demonstrated integration with order book or matching engine
  • Measured latency improvement in realistic scenario

Summary

This project teaches you one of the most important skills in HFT development: eliminating malloc from your hot path. The difference between a 250ns malloc and an 8ns pool allocation may seem small, but multiplied across a million messages per second, it becomes the difference between profit and loss.

Key takeaways:

  1. General-purpose allocators sacrifice speed for flexibility - you can do the opposite
  2. Pool allocators achieve O(1) through pre-allocation and free lists - no searching, no fragmentation
  3. Arena allocators are even faster - just bump a pointer, reset all at once
  4. Thread-local pools eliminate contention - essential for multi-threaded HFT
  5. Determinism matters as much as speed - p99 latency kills trading strategies

When you complete this project, you will have built infrastructure that is foundational to every serious HFT system. More importantly, you will understand why these techniques work and when to apply them.


“In HFT, memory allocation is not a solved problem - it’s a problem you solve for your specific use case. The best allocator is the one that does exactly what you need and nothing more.”