Project 5: Custom Memory Allocator for HFT
Build a specialized memory allocator (arena/pool allocator) that eliminates malloc from your hot path - because malloc is a latency killer in HFT systems.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Intermediate |
| Time Estimate | 1 week |
| Language | C or C++ (Rust for GlobalAlloc) |
| Prerequisites | Understanding of pointers, basic memory concepts |
| Key Topics | Pool allocators, arena allocators, thread-local storage, fragmentation |
1. Learning Objectives
By completing this project, you will be able to:
- Explain why malloc is unacceptable for HFT hot paths and quantify the latency difference between general-purpose and specialized allocators (250ns vs 8ns).
- Implement a fixed-size pool allocator with O(1) allocation and deallocation using a free list data structure.
- Design an arena allocator (bump allocator) for ultra-fast allocation when bulk deallocation is acceptable.
- Create multiple size-class pools that can handle varying object sizes without external fragmentation.
- Implement thread-local pools to eliminate contention in multi-threaded HFT systems.
- Integrate your allocator with C++ std::allocator or Rust's GlobalAlloc trait for seamless use with standard containers.
- Benchmark and verify determinism by measuring allocation latency distributions and comparing p50/p99/p999 with malloc.
- Identify and avoid common pitfalls including memory leaks, use-after-free, and false sharing in concurrent allocators.
2. Theoretical Foundation
2.1 Core Concepts
Why malloc is Slow for HFT
The general-purpose malloc implementation (glibc, jemalloc, tcmalloc) is optimized for a wide range of use cases, but this flexibility comes at a cost:
Why malloc Has Unpredictable Latency
┌────────────────────────────────────────────────────────────────────┐
│ │
│ 1. SYSTEM CALLS (worst case) │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ malloc │────►│ sbrk/ │────►│ Kernel │ │
│ │ │ │ mmap │ │ Context │ │
│ └──────────┘ └──────────┘ │ Switch │ │
│ │ └──────────┘ │
│ │ +10-50μs (microseconds!) │
│ │ │
│ 2. FRAGMENTATION OVERHEAD │
│ ┌──────────────────────────────────────────┐ │
│ │ Free: [16B] [32B] [16B] [64B] [8B] ... │ │
│ │ Request 24B → Search free list → Split │ │
│ │ Coalescing on free → More overhead │ │
│ └──────────────────────────────────────────┘ │
│ │
│ 3. LOCKING (multi-threaded) │
│ ┌─────────┐ Lock contention ┌─────────┐ │
│ │ Thread1 │◄─────────────────►│ Thread2 │ │
│ └─────────┘ └─────────┘ │
│ │ │ │
│ └──────── Mutex wait ─────────┘ │
│ +1-10μs per contention │
│ │
│ 4. NON-DETERMINISTIC LATENCY │
│ Best case: ~50ns (fast path, no contention) │
│ Typical: ~250ns (normal workload) │
│ Worst case: ~50μs (mmap, high contention) │
│ │
│ HFT cannot tolerate this variance! │
│ │
└────────────────────────────────────────────────────────────────────┘
The Four Killers of malloc Performance:
- System Calls: When malloc's internal pools are exhausted, it must call sbrk() or mmap() to get more memory from the OS. A single system call can take 10-50 microseconds - an eternity in HFT.
- Fragmentation Management: General-purpose allocators must handle any size request. This requires maintaining complex data structures (free lists, size classes, bins) and performing operations like coalescing adjacent free blocks.
- Thread Synchronization: Multi-threaded programs require locks on shared allocator state. Even with optimizations like per-thread caches (tcmalloc), contention still occurs during cache replenishment.
- Non-Determinism: The combination above means malloc has highly variable latency. HFT systems need predictable, worst-case latency guarantees.
Pool Allocators: Fixed-Size Block Freedom
A pool allocator pre-allocates a large chunk of memory and divides it into fixed-size blocks. The key insight: if all blocks are the same size, no fragmentation can occur.
Pool Allocator Structure
┌─────────────────────────────────────────────────────────────────────┐
│ │
│ MEMORY POOL (single contiguous allocation) │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ │ │
│ │ Block 0 Block 1 Block 2 Block 3 Block 4 ... │ │
│ │ ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐ │ │
│ │ │ next │ │ next │ │ next │ │ next │ │ next │ │ │
│ │ │ ──────► │ ──────► │ ──────► │ ──────► │ NULL │ │ │
│ │ │ │ │ │ │ │ │ │ │ │ │ │
│ │ └──────┘ └──────┘ └──────┘ └──────┘ └──────┘ │ │
│ │ │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ ▲ │
│ │ │
│ FREE LIST HEAD │
│ (points to first free block) │
│ │
│ ═══════════════════════════════════════════════════════════════ │
│ │
│ ALLOCATION: O(1) │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ 1. block = free_list_head │ │
│ │ 2. free_list_head = block->next │ │
│ │ 3. return block │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ │
│ DEALLOCATION: O(1) │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ 1. block->next = free_list_head │ │
│ │ 2. free_list_head = block │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────┘
Key Properties of Pool Allocators:
- O(1) Allocation: Just pop from the free list head
- O(1) Deallocation: Just push to the free list head
- Zero Fragmentation: All blocks are identical size
- No System Calls: Memory is pre-allocated at startup
- Cache Friendly: Blocks are contiguous in memory
Trade-off: You must know the block size at compile time or pool creation time.
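The O(1) push/pop operations above fit in a few lines of C++. The following is a minimal, single-threaded sketch (the class name is illustrative, and the backing memory comes from malloc here, where a production allocator would typically mmap it once at startup):

```cpp
#include <cstddef>
#include <cstdlib>

// Minimal fixed-size pool with an intrusive free list threaded through the
// free blocks themselves (sketch; block_size must be >= sizeof(void*)).
class FixedPool {
public:
    FixedPool(std::size_t block_size, std::size_t num_blocks)
        : memory_(static_cast<char*>(std::malloc(block_size * num_blocks))) {
        // Link block i to block i+1; the last block terminates the list.
        for (std::size_t i = 0; i + 1 < num_blocks; ++i)
            *reinterpret_cast<void**>(memory_ + i * block_size) =
                memory_ + (i + 1) * block_size;
        *reinterpret_cast<void**>(memory_ + (num_blocks - 1) * block_size) = nullptr;
        free_list_ = memory_;
    }
    ~FixedPool() { std::free(memory_); }

    void* allocate() {
        if (free_list_ == nullptr) return nullptr;   // pool exhausted
        void* block = free_list_;
        free_list_ = *static_cast<void**>(block);    // pop the head
        return block;
    }
    void deallocate(void* block) {
        *static_cast<void**>(block) = free_list_;    // push onto the head
        free_list_ = block;
    }

private:
    char* memory_;
    void* free_list_ = nullptr;
};
```

Note the intrusive trick: while a block is free, its first 8 bytes hold the next pointer; once the block is handed out, user data simply overwrites it.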
Arena Allocators: The Simplest Fast Allocator
An arena allocator (also called a bump allocator or linear allocator) is even simpler than a pool allocator:
Arena Allocator (Bump Pointer)
┌─────────────────────────────────────────────────────────────────────┐
│ │
│ ARENA MEMORY │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ ██████████████████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ │ │
│ │ ◄── allocated ──────►◄──────────── free ─────────────────► │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ ▲ ▲ │
│ │ │ │
│ bump_ptr end │
│ │
│ ═══════════════════════════════════════════════════════════════ │
│ │
│ ALLOCATION: O(1) - just bump the pointer! │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ 1. result = bump_ptr │ │
│ │ 2. bump_ptr += size │ │
│ │ 3. return result │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ │
│ DEALLOCATION: Not supported individually! │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ reset(): bump_ptr = start // Free everything at once │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ │
│ USE CASES: │
│ • Per-request allocations that all get freed together │
│ • Game frame allocations (allocate during frame, reset at end) │
│ • Order processing (allocate for order, reset after complete) │
│ │
└─────────────────────────────────────────────────────────────────────┘
Arena Allocator Trade-offs:
| Advantage | Disadvantage |
|---|---|
| Fastest possible allocation (single pointer bump) | Cannot free individual objects |
| Zero overhead per allocation | Must reset entire arena at once |
| Perfect cache locality | May waste memory if objects vary in size |
| No fragmentation | Fixed maximum capacity |
HFT Use Case: Processing a batch of market data updates. Allocate temporary structures for each message, then reset the arena when the batch is complete.
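That batch-then-reset pattern is easy to express in code. Here is a minimal bump allocator sketch (the class name and the 8-byte alignment choice are illustrative, and malloc stands in for the startup mmap):

```cpp
#include <cstddef>
#include <cstdlib>

// Minimal bump-pointer arena: allocation bumps a pointer, reset rewinds it.
class Arena {
public:
    explicit Arena(std::size_t capacity)
        : base_(static_cast<char*>(std::malloc(capacity))),
          current_(base_), end_(base_ + capacity) {}
    ~Arena() { std::free(base_); }

    void* allocate(std::size_t size) {
        std::size_t aligned = (size + 7) & ~std::size_t{7}; // round up to 8 bytes
        if (current_ + aligned > end_) return nullptr;      // arena exhausted
        void* result = current_;
        current_ += aligned;                                // the "bump"
        return result;
    }
    void reset() { current_ = base_; } // bulk free: all old pointers now invalid
    std::size_t used() const { return static_cast<std::size_t>(current_ - base_); }

private:
    char* base_;
    char* current_;
    char* end_;
};
```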
Thread-Local Pools: Eliminating Contention
In multi-threaded HFT systems, each thread typically handles different tasks:
- Thread 1: Market data processing
- Thread 2: Strategy execution
- Thread 3: Order management
Sharing a single allocator between threads creates contention. The solution: thread-local pools.
Thread-Local Storage (TLS) Pattern
┌─────────────────────────────────────────────────────────────────────┐
│ │
│ TRADITIONAL SHARED ALLOCATOR (BAD for HFT) │
│ │
│ ┌────────┐ ┌────────────┐ ┌────────┐ │
│ │Thread 1│────►│ Shared │◄────│Thread 2│ │
│ └────────┘ │ Allocator │ └────────┘ │
│ ▲ │ (LOCK!) │ ▲ │
│ │ └────────────┘ │ │
│ └──────── CONTENTION ─────────────┘ │
│ │
│ ═══════════════════════════════════════════════════════════════ │
│ │
│ THREAD-LOCAL POOLS (GOOD for HFT) │
│ │
│ ┌────────────────────────────────────────────────────────────┐ │
│ │ Thread 1 │ │
│ │ ┌─────────────────────────────────────────────────────┐ │ │
│ │ │ TLS Pool: [████░░░░░░░░░░░░░░░░░░░] │ │ │
│ │ └─────────────────────────────────────────────────────┘ │ │
│ └────────────────────────────────────────────────────────────┘ │
│ │
│ ┌────────────────────────────────────────────────────────────┐ │
│ │ Thread 2 │ │
│ │ ┌─────────────────────────────────────────────────────┐ │ │
│ │ │ TLS Pool: [██████████░░░░░░░░░░░░░] │ │ │
│ │ └─────────────────────────────────────────────────────┘ │ │
│ └────────────────────────────────────────────────────────────┘ │
│ │
│ ┌────────────────────────────────────────────────────────────┐ │
│ │ Thread 3 │ │
│ │ ┌─────────────────────────────────────────────────────┐ │ │
│ │ │ TLS Pool: [████████░░░░░░░░░░░░░░░] │ │ │
│ │ └─────────────────────────────────────────────────────┘ │ │
│ └────────────────────────────────────────────────────────────┘ │
│ │
│ NO CONTENTION! Each thread has its own pool. │
│ │
└─────────────────────────────────────────────────────────────────────┘
Avoiding False Sharing:
Even with thread-local pools, you must avoid false sharing - when two threads modify data on the same cache line:
False Sharing Problem
┌─────────────────────────────────────────────────────────────────────┐
│ │
│ CACHE LINE (64 bytes) │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ Thread1_head │ Thread2_head │ unused │ │
│ │ 8 bytes │ 8 bytes │ 48 bytes │ │
│ └──────────────────────────────────────────────────────────────┘ │
│ ▲ ▲ │
│ │ │ │
│ Thread 1 Thread 2 │
│ writes writes │
│ │
│ PROBLEM: Both threads invalidate each other's cache line! │
│ │
│ ═══════════════════════════════════════════════════════════════ │
│ │
│ SOLUTION: Pad to cache line boundary │
│ │
│ CACHE LINE 1 CACHE LINE 2 │
│ ┌────────────────────────────┐ ┌────────────────────────────┐ │
│ │ Thread1_head │ padding │ │ Thread2_head │ padding │ │
│ │ 8 bytes │ 56 bytes │ │ 8 bytes │ 56 bytes │ │
│ └────────────────────────────┘ └────────────────────────────┘ │
│ │
│ NO FALSE SHARING! Each thread owns its cache line. │
│ │
└─────────────────────────────────────────────────────────────────────┘
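In C++ the padding fix is essentially a one-liner: over-align each thread's hot field so no two heads can share a line. A sketch, assuming the typical 64-byte x86-64 cache line (the struct name is illustrative):

```cpp
#include <cstddef>

// One free-list head per thread, each forced onto its own cache line so the
// threads' writes never invalidate each other's lines.
struct alignas(64) PaddedHead {
    void* head = nullptr;  // the 8 bytes a thread actually writes
    // alignas(64) pads sizeof(PaddedHead) up to 64, so adjacent array
    // elements land on distinct cache lines.
};

static_assert(sizeof(PaddedHead) == 64, "one head per cache line");
static_assert(alignof(PaddedHead) == 64, "aligned to a line boundary");
```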
2.2 Why This Matters in HFT
HFT systems operate under extreme constraints:
| Constraint | Requirement | Why Allocators Matter |
|---|---|---|
| Latency | <10μs tick-to-trade | malloc’s 250ns+ adds up across many allocations |
| Determinism | Predictable worst case | malloc’s p99 can be 100x its p50 |
| Throughput | 1M+ messages/second | Lock contention destroys throughput |
| Memory | Minimize cache misses | Scattered allocations harm locality |
Real-world impact: A trading strategy that allocates during the hot path will have inconsistent performance. On the day it matters most (high volatility), it will perform worst (more allocations, more contention).
The HFT rule: Never call malloc in the hot path.
Instead:
- Pre-allocate all memory at startup
- Use pool allocators for fixed-size objects (orders, quotes, trade records)
- Use arena allocators for temporary per-message processing
- Use thread-local pools to eliminate contention
2.3 Historical Context
Memory allocation has evolved significantly:
| Era | Approach | Characteristics |
|---|---|---|
| 1970s | Simple first-fit | O(n) search, severe fragmentation |
| 1980s | Buddy allocators | Power-of-2 sizes, reduced fragmentation |
| 1990s | Doug Lea’s malloc (dlmalloc) | Size classes, boundary tags |
| 2000s | tcmalloc, jemalloc | Thread caching, per-CPU caches |
| 2010s+ | Custom HFT allocators | Application-specific, zero overhead |
Key insight: General-purpose allocators keep adding features to handle more use cases. HFT goes the opposite direction - strip away everything not needed for the specific use case.
2.4 Common Misconceptions
Misconception 1: “jemalloc/tcmalloc is fast enough”
Reality: They are faster than glibc malloc but still have unpredictable latency. Thread cache replenishment can stall, and their complexity means more cache pollution.
Misconception 2: “I can just use stack allocation”
Reality: Stack allocation is fast but limited in size (typically 1-8MB) and lifetime. You cannot return stack-allocated data from a function, and deep call stacks compete for stack space.
Misconception 3: “Object pools are only for memory savings”
Reality: The primary benefit in HFT is latency determinism, not memory efficiency. Pools guarantee O(1) allocation with no system calls.
Misconception 4: “Rust’s ownership system eliminates the need for custom allocators”
Reality: Rust’s safety guarantees help prevent bugs, but Box::new() still calls the global allocator (malloc). For HFT Rust code, you still need custom allocators.
3. Project Specification
3.1 What You Will Build
A custom memory allocation library that provides:
- Fixed-size pool allocator: Pre-allocated blocks of a single size with O(1) alloc/free
- Multi-pool allocator: Multiple pools for different size classes
- Arena allocator: Bump-pointer allocation with bulk reset
- Thread-local wrapper: Per-thread pools without locks
- C++ std::allocator integration: Use with std::vector, std::list, etc.
- Comprehensive benchmarks: Compare against malloc, jemalloc
3.2 Functional Requirements
| Requirement | Description |
|---|---|
| FR1 | Pool allocator with configurable block size and count |
| FR2 | O(1) allocation and deallocation for pool allocator |
| FR3 | Arena allocator with configurable size and reset capability |
| FR4 | Multi-pool supporting at least 4 size classes (e.g., 32B, 64B, 128B, 256B) |
| FR5 | Thread-local pool storage without locks for same-thread alloc/free |
| FR6 | Statistics tracking: allocations, frees, high watermark, current usage |
| FR7 | Debug mode: detect double-free, use-after-free (at cost of performance) |
| FR8 | Integration with C++ STL containers via std::allocator interface |
3.3 Non-Functional Requirements
| Requirement | Target |
|---|---|
| NFR1 | Pool allocation: <15ns average latency |
| NFR2 | Pool deallocation: <10ns average latency |
| NFR3 | Deterministic: p99 < 2x p50 for pool operations |
| NFR4 | Zero system calls after initialization |
| NFR5 | Memory overhead: <5% for block metadata |
| NFR6 | Thread-local pools: zero contention between threads |
3.4 Example Usage / Output
Benchmark Output:
=== ALLOCATOR BENCHMARK ===
Running 1,000,000 allocation/deallocation cycles...
malloc (glibc 2.31):
alloc avg: 247ns p50: 180ns p99: 1.2μs p999: 8.5μs
free avg: 125ns p50: 95ns p99: 820ns p999: 5.2μs
pool_alloc (fixed 64B blocks):
alloc avg: 8ns p50: 7ns p99: 12ns p999: 15ns
free avg: 6ns p50: 5ns p99: 9ns p999: 11ns
arena_alloc (bump pointer):
alloc avg: 3ns p50: 3ns p99: 4ns p999: 5ns
reset avg: 2ns (bulk free)
Improvement over malloc:
Pool allocation: 30x faster average, 100x better p99
Arena allocation: 80x faster average, 300x better p99
=== INTEGRATION TEST: ORDER BOOK ===
Order book with std::allocator (malloc):
Insert order avg: 850ns p99: 12μs
Remove order avg: 420ns p99: 8μs
Order book with pool_allocator:
Insert order avg: 320ns p99: 450ns
Remove order avg: 180ns p99: 280ns
Improvement: 62% lower latency, 96% lower tail latency
=== MEMORY STATS ===
Pool: 64B blocks
Total capacity: 65536 blocks (4MB)
Current used: 12847 blocks
Peak usage: 24531 blocks
Allocations: 1,847,293
Frees: 1,834,446
Arena: 1MB
Current offset: 524288 bytes
Resets: 1,423
Allocations: 14,847,291
Code Usage Example (C++):
// Example of how your allocator will be used
#include "pool_allocator.hpp"
// Create a pool for Order objects
using OrderPool = PoolAllocator<Order, 10000>;
OrderPool order_pool;
// Allocate without malloc
Order* order = order_pool.allocate();
order->price = 100.50;
order->quantity = 100;
// Use in hot path with zero malloc calls
process_order(order);
// Free back to pool
order_pool.deallocate(order);
// Use with STL containers
std::vector<Order, OrderPool> orders(order_pool);
orders.push_back(Order{...}); // Uses pool, not malloc
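For the STL integration, the container-facing adapter only has to satisfy the minimal C++11 Allocator requirements: a value_type, allocate/deallocate, a converting constructor (for rebind), and equality comparison. A self-contained sketch - it forwards to malloc/free so it compiles on its own, where your real version would forward to the pool instead:

```cpp
#include <cstddef>
#include <cstdlib>
#include <vector>

// Minimal std::allocator-compatible adapter (sketch; the name is illustrative).
template <class T>
struct PoolAdapter {
    using value_type = T;

    PoolAdapter() noexcept = default;
    // Converting constructor: lets containers rebind, e.g. to node types.
    template <class U>
    PoolAdapter(const PoolAdapter<U>&) noexcept {}

    T* allocate(std::size_t n) {
        // Real version: route to the size-class pool for n * sizeof(T).
        return static_cast<T*>(std::malloc(n * sizeof(T)));
    }
    void deallocate(T* p, std::size_t) noexcept { std::free(p); }
};

// Stateless allocators compare equal: memory from one can be freed by another.
template <class T, class U>
bool operator==(const PoolAdapter<T>&, const PoolAdapter<U>&) noexcept { return true; }
template <class T, class U>
bool operator!=(const PoolAdapter<T>&, const PoolAdapter<U>&) noexcept { return false; }

// Usage: std::vector<int, PoolAdapter<int>> v; v.push_back(42);
```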
3.5 Real World Outcome
After completing this project, you will have:
- A working allocator library you can integrate into your order book from Project 1
- Measurable latency improvements with benchmark data showing 10-100x improvement
- Understanding of memory systems that applies to any performance-critical code
- Interview-ready knowledge about memory allocation internals
This allocator will directly integrate with your other HFT projects:
- Project 1 (Order Book): Use pool allocator for Order and PriceLevel objects
- Project 2 (Lock-Free Queue): Use arena allocator for message parsing
- Project 3 (Matching Engine): Use thread-local pools for per-connection state
4. Solution Architecture
4.1 High-Level Design
Custom Allocator System Architecture
┌─────────────────────────────────────────────────────────────────────┐
│ │
│ APPLICATION LAYER │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ std::vector<T, PoolAllocator<T>> │ │
│ │ Order* order = pool.allocate() │ │
│ │ Message* msg = arena.allocate(sizeof(Message)) │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ ALLOCATOR INTERFACE │ │
│ │ ┌─────────────┐ ┌─────────────┐ ┌─────────────────────┐ │ │
│ │ │ allocate() │ │deallocate() │ │ reset() (arena) │ │ │
│ │ └─────────────┘ └─────────────┘ └─────────────────────┘ │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ │ │
│ ┌───────────────────┼───────────────────┐ │
│ ▼ ▼ ▼ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ POOL │ │ ARENA │ │ MULTI-POOL │ │
│ │ ALLOCATOR │ │ ALLOCATOR │ │ ALLOCATOR │ │
│ │ │ │ │ │ │ │
│ │ Fixed-size │ │ Bump pointer │ │ Size-class │ │
│ │ Free list │ │ Bulk reset │ │ routing │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
│ │ │ │ │
│ └───────────────────┼───────────────────┘ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ THREAD-LOCAL WRAPPER (optional) │ │
│ │ │ │
│ │ ┌───────────┐ ┌───────────┐ ┌───────────┐ │ │
│ │ │ Thread 1 │ │ Thread 2 │ │ Thread 3 │ │ │
│ │ │ Pool │ │ Pool │ │ Pool │ │ │
│ │ └───────────┘ └───────────┘ └───────────┘ │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ BACKING MEMORY │ │
│ │ │ │
│ │ Single large allocation at startup (mmap or new char[]) │ │
│ │ ┌─────────────────────────────────────────────────────┐ │ │
│ │ │░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░│ │ │
│ │ └─────────────────────────────────────────────────────┘ │ │
│ │ │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────┘
4.2 Key Components
| Component | Responsibility | Key Data Structures |
|---|---|---|
| PoolAllocator | Fixed-size block allocation | Free list, block array |
| ArenaAllocator | Bump-pointer allocation | Bump pointer, base pointer |
| MultiPoolAllocator | Route to size-appropriate pool | Array of pools, size-to-pool mapping |
| ThreadLocalPool | Per-thread pool access | Thread-local storage (TLS) |
| Statistics | Track usage metrics | Atomic counters |
| DebugAllocator | Detect memory errors | Guard bytes, allocation tracking |
4.3 Data Structures
Pool Allocator Internal Structure
Pool Allocator Memory Layout
┌─────────────────────────────────────────────────────────────────────┐
│ │
│ METADATA (stored separately from blocks for cache efficiency) │
│ ┌────────────────────────────────────────────────────────────┐ │
│ │ block_size: 64 │ │
│ │ num_blocks: 10000 │ │
│ │ free_count: 7342 │ │
│ │ free_list_head: ──────────────────────────────┐ │ │
│ │ memory_base: ──────────────────────┐ │ │ │
│ └────────────────────────────────────│──────────│─────────────┘ │
│ │ │ │
│ ▼ │ │
│ BLOCK ARRAY │ │
│ ┌───────────────────────────────────────────────│─────────────┐ │
│ │ │ │ │
│ │ Block 0 Block 1 Block 2 │ │ │
│ │ ┌────────┐ ┌────────┐ ┌────────┐ ◄────┘ │ │
│ │ │ IN USE │ │ next ──────► next ──────► NULL │ │
│ │ │ (data) │ │ (free) │ │ (free) │ │ │
│ │ │ │ │ │ │ │ │ │
│ │ └────────┘ └────────┘ └────────┘ ... │ │
│ │ 64B 64B 64B │ │
│ │ │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ │
│ INTRUSIVE FREE LIST (no separate allocation for list nodes!) │
│ │
│ When a block is FREE, its first 8 bytes store the 'next' pointer │
│ When a block is IN USE, user data overwrites the 'next' pointer │
│ │
└─────────────────────────────────────────────────────────────────────┘
Arena Allocator Internal Structure
Arena Allocator Memory Layout
┌─────────────────────────────────────────────────────────────────────┐
│ │
│ METADATA │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ base: 0x7f8b4c000000 │ │
│ │ end: 0x7f8b4c100000 (base + 1MB) │ │
│ │ current: 0x7f8b4c002800 (next allocation position) │ │
│ │ allocs: 142 │ │
│ │ resets: 3 │ │
│ └──────────────────────────────────────────────────────────────┘ │
│ │
│ MEMORY REGION │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │████████████████████████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░│ │
│ │ ▲ ▲│ │
│ │ │ ││ │
│ │ current end│ │
│ │ │ │
│ │ ◄────── allocated ──────►◄────────── available ────────────►│ │
│ │ │ │
│ └──────────────────────────────────────────────────────────────┘ │
│ │
│ ALLOCATION (returns current, bumps pointer) │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ Before: current = 0x2800 │ │
│ │ Request: 128 bytes (aligned to 8) │ │
│ │ After: current = 0x2880 │ │
│ │ Return: 0x2800 │ │
│ └──────────────────────────────────────────────────────────────┘ │
│ │
│ RESET (restore to beginning) │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ current = base │ │
│ │ // All previous allocations now invalid │ │
│ └──────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────┘
Multi-Pool Size Class Routing
Multi-Pool Allocator Size Class Design
┌─────────────────────────────────────────────────────────────────────┐
│ │
│ SIZE CLASSES (power of 2 for simplicity) │
│ │
│ Request Size Rounded To Pool Index Wasted Bytes (max) │
│ ┌────────────┬─────────────┬────────────┬──────────────────────┐ │
│ │ 1-32 │ 32 │ 0 │ 31 (97%) │ │
│ │ 33-64 │ 64 │ 1 │ 31 (48%) │ │
│ │ 65-128 │ 128 │ 2 │ 63 (49%) │ │
│ │ 129-256 │ 256 │ 3 │ 127 (50%) │ │
│ │ 257-512 │ 512 │ 4 │ 255 (50%) │ │
│ │ 513-1024 │ 1024 │ 5 │ 511 (50%) │ │
│ │ >1024 │ fallback to malloc │ │ │
│ └────────────┴─────────────┴────────────┴──────────────────────┘ │
│ │
│ ROUTING FUNCTION (must be O(1)!) │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ // Efficient power-of-2 routing using bit manipulation │ │
│ │ pool_index = (size <= 32) ? 0 : │ │
│ │ (63 - count_leading_zeros(size - 1)) - 4; │ │
│ │ │ │
│ │ // Or use lookup table for small sizes (cache-friendly) │ │
│ │ pool_index = size_to_pool_table[size]; │ │
│ └──────────────────────────────────────────────────────────────┘ │
│ │
│ POOL ARRAY │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ pools[0]: PoolAllocator<32> ────► [32B blocks...] │ │
│ │ pools[1]: PoolAllocator<64> ────► [64B blocks...] │ │
│ │ pools[2]: PoolAllocator<128> ────► [128B blocks...] │ │
│ │ pools[3]: PoolAllocator<256> ────► [256B blocks...] │ │
│ │ pools[4]: PoolAllocator<512> ────► [512B blocks...] │ │
│ │ pools[5]: PoolAllocator<1024> ────► [1024B blocks...] │ │
│ └──────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────┘
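The routing function in the diagram can be checked against the table directly. A sketch using the GCC/Clang __builtin_clzll builtin (C++20's std::countl_zero works the same way):

```cpp
#include <cstddef>

// O(1) size-class routing matching the table above:
// index 0 -> 32B, 1 -> 64B, 2 -> 128B, 3 -> 256B, 4 -> 512B, 5 -> 1024B.
// Returns -1 for oversized requests (caller falls back to malloc).
int pool_index(std::size_t size) {
    if (size == 0 || size > 1024) return -1;
    if (size <= 32) return 0;
    // Highest set bit of (size - 1): sizes 33..64 -> bit 5, 65..128 -> bit 6, ...
    int high_bit = 63 - __builtin_clzll(static_cast<unsigned long long>(size - 1));
    return high_bit - 4;
}
```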
4.4 Algorithm Overview
Pool Allocator Algorithms
Initialization:
1. Allocate contiguous memory for all blocks: memory = mmap(block_size * num_blocks)
2. Initialize the free list by linking all blocks: block[i].next = &block[i+1]
3. Set free_list_head = &block[0]
Allocation (O(1)):
1. Check free_list_head != nullptr (pool exhausted check)
2. block = free_list_head
3. free_list_head = block->next
4. Return block
Deallocation (O(1)):
1. block->next = free_list_head
2. free_list_head = block
Arena Allocator Algorithms
Initialization:
1. Allocate contiguous memory: base = mmap(arena_size)
2. Set current = base, end = base + arena_size
Allocation (O(1)):
1. aligned_size = align_up(size, 8) // 8-byte alignment
2. Check current + aligned_size <= end (capacity check)
3. result = current
4. current += aligned_size
5. Return result
Reset (O(1)):
1. current = base
2. Note: all previously returned pointers are now invalid!
Thread-Local Pool Access
Get Thread-Local Pool:
1. Access the TLS variable: thread_local PoolAllocator* tls_pool
2. If tls_pool == nullptr, initialize a pool for this thread
3. Return tls_pool
Key Insight: No locks needed because each thread has its own pool. Cross-thread free (freeing memory allocated by another thread) requires special handling - typically a lock-free return queue.
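The lazy-initialization pattern above is what C++'s thread_local gives you almost for free. A sketch (ThreadPool is a stand-in for the real pool type):

```cpp
#include <cstddef>
#include <thread>

// Stand-in for a real pool; each thread gets its own instance.
struct ThreadPool {
    char storage[4096];   // pretend backing memory
    std::size_t used = 0;
};

ThreadPool& local_pool() {
    // Constructed on first use in each thread, destroyed at thread exit.
    // No locks: two threads can never touch the same instance.
    thread_local ThreadPool pool;
    return pool;
}
```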
5. Implementation Guide
5.1 Development Environment Setup
Required Tools:
| Tool | Purpose | Installation |
|---|---|---|
| GCC 12+ or Clang 15+ | C++20 support | apt install g++-12 or apt install clang-15 |
| CMake 3.20+ | Build system | apt install cmake |
| perf | Benchmarking | apt install linux-tools-generic |
| Valgrind | Memory debugging | apt install valgrind |
| Google Benchmark | Microbenchmarks | apt install libbenchmark-dev |
Recommended Compiler Flags:
set(CMAKE_CXX_STANDARD 20)
set(CMAKE_CXX_FLAGS "-O3 -march=native -fno-omit-frame-pointer")
set(CMAKE_CXX_FLAGS_DEBUG "-O0 -g -fsanitize=address,undefined")
5.2 Project Structure
hft-allocator/
├── CMakeLists.txt
├── include/
│ ├── hft/
│ │ ├── pool_allocator.hpp # Fixed-size pool
│ │ ├── arena_allocator.hpp # Bump allocator
│ │ ├── multi_pool.hpp # Size-class routing
│ │ ├── thread_local_pool.hpp # TLS wrapper
│ │ ├── stl_adapter.hpp # std::allocator interface
│ │ └── allocator_stats.hpp # Statistics tracking
├── src/
│ └── allocator_stats.cpp # Statistics implementation
├── tests/
│ ├── pool_test.cpp # Unit tests
│ ├── arena_test.cpp
│ ├── multi_pool_test.cpp
│ └── thread_local_test.cpp
├── benchmarks/
│ ├── alloc_benchmark.cpp # Performance comparison
│ └── integration_benchmark.cpp # With order book
└── examples/
├── basic_usage.cpp
└── orderbook_integration.cpp
5.3 The Core Question You’re Answering
“How do you provide memory allocation with O(1) performance and zero fragmentation for a known set of object sizes?”
This question forces you to understand:
- Why general-purpose allocators are slow (flexibility has cost)
- How to trade flexibility for speed (fixed sizes, pre-allocation)
- The difference between internal and external fragmentation
- How to achieve deterministic performance (no system calls, no locks)
5.4 Concepts You Must Understand First
Before implementing, verify you understand these concepts:
| Concept | Self-Assessment Question | Resource if Stuck |
|---|---|---|
| Pointers and Memory | What is the difference between &x and *p? What does p + 1 mean for a pointer to a 64-byte struct? | CS:APP Ch. 3 |
| Memory Layout | Where are stack, heap, and static variables stored? Which grows up vs down? | CS:APP Ch. 9 |
| Alignment | Why must a double be aligned to 8 bytes on x86-64? What happens if it's not? | CS:APP Ch. 3.9 |
| Cache Lines | How big is a cache line? Why does accessing arr[0] also load arr[1-7]? | CS:APP Ch. 6 |
| System Calls | What happens when you call malloc? When does it call sbrk? | TLPI Ch. 7 |
| Thread-Local Storage | What is thread_local in C++? How is it different from global variables? | C++ reference |
5.5 Questions to Guide Your Design
Pool Allocator Design:
- How will you store the free list? Separate allocation or intrusive in the blocks?
- What happens when the pool is exhausted? Fail fast, grow, or fallback?
- How will you handle alignment requirements for different types?
- Should you zero memory on allocation? On deallocation?
Arena Allocator Design:
- What alignment will you use for allocations? Why 8 bytes?
- How will you detect arena overflow? Check on every allocation?
- When should an arena be reset? Who is responsible?
- Can you have multiple arenas for different purposes?
Thread-Local Design:
- When will thread-local pools be initialized? Lazy or at thread start?
- What happens if Thread A allocates and Thread B tries to free?
- How will you handle thread termination? Leak the pool or reclaim?
Integration Design:
- How will your allocator satisfy the C++ Allocator concept requirements?
- How will you enable/disable statistics tracking? Compile-time or runtime?
- What debug checks should be enabled in development builds?
5.6 Thinking Exercise
Before writing code, trace through this scenario by hand:
Scenario: Pool with 4 blocks of 32 bytes each.
Initial State:
memory: [Block0][Block1][Block2][Block3]
free_list: Block0 -> Block1 -> Block2 -> Block3 -> null
Operations:
1. p1 = allocate() -- Draw the state after this
2. p2 = allocate() -- Draw the state after this
3. deallocate(p1) -- Draw the state after this
4. p3 = allocate() -- Where does p3 point? Why?
5. p4 = allocate() -- Draw the state after this
6. p5 = allocate() -- What happens? Why?
Work through this on paper before implementing. Understanding the free list mechanics is essential.
5.7 Hints in Layers
Hint 1: Starting Point (Conceptual Direction)
Start with the simplest possible pool allocator - a fixed block size known at compile time, single-threaded, no statistics. Get this working and benchmarked before adding complexity.
Hint 2: Next Level (More Specific Guidance)
Use an intrusive free list - store the next pointer inside the free block itself. This eliminates separate allocation for list nodes. The key insight: when a block is free, you can use its memory for bookkeeping.
Hint 3: Technical Details (Approach/Pseudocode)
Pool allocator structure:
- void* memory_base // mmap'd or new char[]
- void* free_list_head // points to first free block
- size_t block_size // constant
- size_t num_blocks // constant
allocate():
if (free_list_head == nullptr) return nullptr;
void* result = free_list_head;
free_list_head = *((void**)free_list_head); // Follow the intrusive pointer
return result;
deallocate(void* ptr):
*((void**)ptr) = free_list_head; // Store old head in freed block
free_list_head = ptr; // New head is the freed block
Hint 4: Tools and Debugging (Verification Methods)
- Use AddressSanitizer (-fsanitize=address) to catch use-after-free and buffer overflows
- Use perf stat to verify zero system calls after init: perf stat -e 'syscalls:sys_enter_*' ./benchmark
- Add guard bytes (0xDEADBEEF) before/after blocks in debug mode to detect overflow
- Print allocation traces with __builtin_return_address(0) to debug double-frees
5.8 The Interview Questions They’ll Ask
After completing this project, be prepared to answer:
- “Why is malloc slow for low-latency applications?”
- Expected: Discuss system calls, fragmentation, locking, non-determinism
- Bonus: Mention specific latency numbers (250ns typical, 50us worst case)
- “How does a pool allocator achieve O(1) allocation?”
- Expected: Explain free list, intrusive pointers, pre-allocation
- Bonus: Discuss cache implications of contiguous blocks
- “What is the trade-off between pool and arena allocators?”
- Expected: Pool allows individual free, arena requires bulk free
- Bonus: Discuss use cases for each in a trading system
- “How do you avoid lock contention in a multi-threaded allocator?”
- Expected: Thread-local storage, per-thread pools
- Bonus: Discuss cross-thread free problem and solutions
- “What is false sharing and how do you prevent it?”
- Expected: Cache line sharing, padding to 64 bytes
- Bonus: Explain `alignas(64)` and `__attribute__((aligned(64)))`
- “How would you integrate a custom allocator with STL containers?”
- Expected: Implement Allocator concept (allocate, deallocate, rebind)
- Bonus: Discuss stateful vs stateless allocators
- “What debugging features would you add to a custom allocator?”
- Expected: Guard bytes, double-free detection, allocation tracking
- Bonus: Discuss compile-time vs runtime debug mode toggle
5.9 Books That Will Help
| Topic | Book | Specific Chapters |
|---|---|---|
| Memory allocation internals | “Computer Systems: A Programmer’s Perspective” (CS:APP) | Ch. 9 (Virtual Memory) |
| Cache hierarchy | CS:APP | Ch. 6 (Memory Hierarchy) |
| Data structure alignment | CS:APP | Ch. 3.9 (Heterogeneous Data Structures) |
| System calls (mmap) | “The Linux Programming Interface” (TLPI) | Ch. 49 (Memory Mappings) |
| Pool allocator pattern | “Fluent C” by Preschern | Ch. 6 (Memory Pool) |
| C++ allocator interface | “Effective Modern C++” by Meyers | Item 21 (Custom allocators) |
| Lock-free programming | “C++ Concurrency in Action” by Williams | Ch. 7 (Lock-free data structures) |
| Thread-local storage | “Rust Atomics and Locks” by Bos | Ch. 8 (Thread-local storage) |
5.10 Implementation Phases
Phase 1: Basic Fixed-Size Pool (Days 1-2)
Milestone: Single-threaded pool allocator that passes basic tests.
Tasks:
- Define `PoolAllocator` class with template parameter for block size
- Implement constructor that allocates backing memory with `mmap` or `new char[]`
- Initialize intrusive free list linking all blocks
- Implement `allocate()` - pop from free list
- Implement `deallocate()` - push to free list
- Write unit tests:
- Allocate all blocks, verify all unique pointers
- Free all blocks, allocate again, verify reuse
- Allocate one more than capacity, verify nullptr return
Verification:
# All tests pass
./pool_test
# Check no memory leaks
valgrind ./pool_test
Phase 2: Arena Allocator (Day 3)
Milestone: Bump-pointer allocator with reset capability.
Tasks:
- Define `ArenaAllocator` class with size parameter
- Implement constructor with backing memory allocation
- Implement `allocate(size_t n)` - bump pointer with alignment
- Implement `reset()` - restore pointer to beginning
- Add overflow checking
- Write unit tests:
- Allocate various sizes, verify alignment
- Fill arena, verify nullptr on overflow
- Reset, verify can allocate again
Verification:
./arena_test
Phase 3: Multi-Pool Size Classes (Day 4)
Milestone: Allocator that routes to appropriate pool based on size.
Tasks:
- Define size classes (32, 64, 128, 256 bytes minimum)
- Create array of pools, one per size class
- Implement size-to-pool routing (bit manipulation or lookup table)
- Implement `allocate(size_t n)` with routing
- Implement `deallocate(void* p, size_t n)` - note: size must be provided!
- Handle oversized allocations (fallback to malloc or reject)
- Write unit tests for various sizes
Design Decision: How to know which pool a pointer came from when freeing?
- Option A: Require size on deallocate (fastest, used by jemalloc)
- Option B: Store header before each block (overhead per allocation)
- Option C: Check pointer ranges (one comparison per pool)
Phase 4: Thread-Local Wrapper (Day 5)
Milestone: Per-thread pools without locks.
Tasks:
- Create `ThreadLocalPool` wrapper using `thread_local`
- Implement lazy initialization on first access
- Test with multiple threads allocating concurrently
- Verify zero lock contention with benchmarks
Complication - Cross-Thread Free: What if Thread A allocates and Thread B frees? Options:
- Option A: Disallow (assert same thread)
- Option B: Lock-free return queue per pool
- Option C: Lock only for cross-thread free (rare path)
For HFT, Option A is usually correct - memory should be freed by the same thread that allocated it.
Phase 5: C++ Allocator Integration (Day 6)
Milestone: Use pool allocator with std::vector<T, PoolAllocator<T>>.
Tasks:
- Implement C++17 Allocator concept requirements:
  - `value_type` typedef
  - `allocate(n)` returning `T*`
  - `deallocate(T* p, n)`
- Implement `rebind` for container node types
- Handle containers that allocate different sizes (e.g., `std::list` nodes)
- Test with `std::vector`, `std::list`, `std::map`
Challenge: Containers may allocate auxiliary data (e.g., map nodes) of different sizes than T. Your allocator must handle this via rebinding or multi-pool.
Phase 6: Benchmarking and Statistics (Day 7)
Milestone: Demonstrate 10-100x improvement over malloc.
Tasks:
- Add statistics tracking: allocs, frees, current usage, high watermark
- Create benchmark comparing:
- Pool allocator vs malloc
- Arena allocator vs malloc
- Thread-local pool vs shared pool
- Measure p50, p99, p999 latencies
- Integrate with order book from Project 1
- Document results
Benchmark Framework:
// Use Google Benchmark for consistent measurements
static void BM_MallocAlloc(benchmark::State& state) {
for (auto _ : state) {
void* p = malloc(64);
benchmark::DoNotOptimize(p);
free(p);
}
}
BENCHMARK(BM_MallocAlloc);
static void BM_PoolAlloc(benchmark::State& state) {
PoolAllocator<64, 10000> pool;
for (auto _ : state) {
void* p = pool.allocate();
benchmark::DoNotOptimize(p);
pool.deallocate(p);
}
}
BENCHMARK(BM_PoolAlloc);
5.11 Key Implementation Decisions
| Decision | Recommended Choice | Rationale |
|---|---|---|
| Backing memory | `mmap` with `MAP_PRIVATE \| MAP_ANONYMOUS` | Avoid heap fragmentation, easy to return to OS |
| Block size | Template parameter | Compile-time optimization, no runtime overhead |
| Alignment | 8 bytes minimum, configurable | Match x86-64 requirements, allow AVX alignment |
| Overflow behavior | Return nullptr | Fail fast, let caller decide |
| Debug checks | Compile-time flag | Zero overhead in release builds |
| Statistics | Optional, runtime flag | Allow measurement without recompile |
| Cross-thread free | Assert same thread | Simplest correct behavior for HFT |
6. Testing Strategy
6.1 Unit Tests
Pool Allocator Tests:
TEST(PoolAllocator, AllocatesUniquePointers) {
PoolAllocator<64, 100> pool;
std::set<void*> ptrs;
for (int i = 0; i < 100; ++i) {
void* p = pool.allocate();
ASSERT_NE(p, nullptr);
ASSERT_EQ(ptrs.count(p), 0); // Must be unique
ptrs.insert(p);
}
}
TEST(PoolAllocator, ReturnsNullWhenExhausted) {
PoolAllocator<64, 10> pool;
for (int i = 0; i < 10; ++i) pool.allocate();
ASSERT_EQ(pool.allocate(), nullptr);
}
TEST(PoolAllocator, ReusesFreedBlocks) {
PoolAllocator<64, 10> pool;
void* p1 = pool.allocate();
pool.deallocate(p1);
void* p2 = pool.allocate();
ASSERT_EQ(p1, p2); // LIFO reuse
}
TEST(PoolAllocator, AlignmentCorrect) {
PoolAllocator<64, 100> pool;
for (int i = 0; i < 100; ++i) {
void* p = pool.allocate();
ASSERT_EQ(reinterpret_cast<uintptr_t>(p) % 8, 0);
}
}
Arena Allocator Tests:
TEST(ArenaAllocator, BumpPointerAdvances) {
ArenaAllocator<1024> arena;
void* p1 = arena.allocate(32);
void* p2 = arena.allocate(32);
ASSERT_EQ(static_cast<char*>(p2) - static_cast<char*>(p1), 32);
}
TEST(ArenaAllocator, ResetRestoresCapacity) {
ArenaAllocator<64> arena;
for (int i = 0; i < 8; ++i) arena.allocate(8);
ASSERT_EQ(arena.allocate(8), nullptr); // Full
arena.reset();
ASSERT_NE(arena.allocate(8), nullptr); // Available again
}
TEST(ArenaAllocator, AlignmentRoundsUp) {
ArenaAllocator<1024> arena;
void* p1 = arena.allocate(3); // Rounds to 8
void* p2 = arena.allocate(1);
ASSERT_EQ(static_cast<char*>(p2) - static_cast<char*>(p1), 8);
}
6.2 Stress Tests
TEST(PoolAllocator, StressTestMillionCycles) {
PoolAllocator<64, 1000> pool;
std::vector<void*> ptrs;
for (int cycle = 0; cycle < 1000000; ++cycle) {
// Randomly allocate or free
if (ptrs.empty() || (rand() % 2 == 0 && ptrs.size() < 1000)) {
void* p = pool.allocate();
if (p) ptrs.push_back(p);
} else {
size_t idx = rand() % ptrs.size();
pool.deallocate(ptrs[idx]);
ptrs.erase(ptrs.begin() + idx);
}
}
// Free remaining
for (void* p : ptrs) pool.deallocate(p);
}
TEST(ThreadLocalPool, ConcurrentAccess) {
ThreadLocalPool<64, 1000> pool;
std::vector<std::thread> threads;
std::atomic<int> total_allocs{0};
for (int t = 0; t < 8; ++t) {
threads.emplace_back([&]() {
for (int i = 0; i < 100000; ++i) {
void* p = pool.allocate();
ASSERT_NE(p, nullptr);
pool.deallocate(p);
total_allocs++;
}
});
}
for (auto& t : threads) t.join();
ASSERT_EQ(total_allocs, 800000);
}
6.3 Property-Based Tests
Test invariants that should always hold:
// After any sequence of alloc/free, free_count + allocated_count == total_blocks
// All returned pointers are within the pool's memory range
// All returned pointers are aligned correctly
// Double-free should be detected (in debug mode)
6.4 Integration Tests
TEST(Integration, VectorWithPoolAllocator) {
using OrderVector = std::vector<Order, PoolAllocator<Order, 1000>>;
OrderVector orders;
for (int i = 0; i < 100; ++i) {
orders.push_back(Order{.id = i, .price = 100.0 + i});
}
ASSERT_EQ(orders.size(), 100);
ASSERT_EQ(orders[50].id, 50);
}
7. Common Pitfalls & Debugging
7.1 Pitfall: Incorrect Free List Initialization
Symptom: Crash on first allocation, or same pointer returned multiple times.
Cause: Free list not properly initialized; blocks not linked correctly.
Fix: After allocating backing memory, iterate through all blocks and set each block’s next pointer:
for (size_t i = 0; i < num_blocks - 1; ++i) {
*reinterpret_cast<void**>(block_at(i)) = block_at(i + 1);
}
*reinterpret_cast<void**>(block_at(num_blocks - 1)) = nullptr;
free_list_head = block_at(0);
Verification: Print free list after initialization, verify correct linking.
7.2 Pitfall: Alignment Violations
Symptom: SIGBUS on ARM, performance degradation on x86, incorrect data.
Cause: Blocks not aligned to required boundary (8 bytes for most types, 16/32 for SIMD).
Fix: Ensure block_size is a multiple of required alignment. Use alignas or align backing memory:
void* memory = aligned_alloc(64, total_size); // 64-byte alignment for cache; C11 requires total_size to be a multiple of 64
Verification: Assert (uintptr_t)ptr % alignment == 0 on every allocation.
7.3 Pitfall: Use-After-Free Not Detected
Symptom: Mysterious data corruption, non-deterministic crashes.
Cause: Freed block reused while old pointer still in use.
Fix (debug mode): Fill freed blocks with poison pattern:
void deallocate(void* ptr) {
#ifdef DEBUG
memset(ptr, 0xDE, block_size); // Poison pattern
#endif
// ... normal free logic
}
Verification: Enable AddressSanitizer during development.
7.4 Pitfall: Double Free Corruption
Symptom: Free list corruption, infinite loops, crashes.
Cause: Same pointer passed to deallocate twice.
Fix (debug mode): Maintain allocated set or use sentinel values:
void deallocate(void* ptr) {
#ifdef DEBUG
// Check first 8 bytes aren't already our free-list marker
if (*reinterpret_cast<uint64_t*>(ptr) == FREE_MARKER) {
abort(); // Double free!
}
*reinterpret_cast<uint64_t*>(ptr) = FREE_MARKER;
#endif
// ... normal free logic
}
7.5 Pitfall: False Sharing in Thread-Local Pools
Symptom: Thread-local pools slower than expected, high cache miss rates.
Cause: Pool metadata from different threads on same cache line.
Fix: Align each thread's pool structure to a cache line boundary. Note that `alignas(64)` already rounds the struct's size up to a multiple of 64, so no manual padding array is needed (and the common `64 - sizeof(...) % 64` idiom wastes a full extra cache line when the size is already a multiple of 64):
struct alignas(64) ThreadLocalPool {
    PoolAllocator<64, 1000> pool;
};
Verification: Use perf stat -e cache-misses to compare before/after.
7.6 Pitfall: Returning Dangling Pointer After Reset
Symptom: Arena reset invalidates pointers, but caller still uses them.
Cause: Arena allocator semantics not understood by caller.
Fix: Document clearly that reset() invalidates ALL previous allocations. Consider:
// Return a scope guard that resets on destruction
auto scope = arena.scope_guard(); // Resets arena when scope ends
7.7 Debug Checklist
When something goes wrong:
- Enable AddressSanitizer: `-fsanitize=address`
- Add debug logging to allocate/deallocate
- Verify free list integrity after each operation
- Check pointer is within pool range before free
- Use Valgrind for detailed memory analysis
- Add guard bytes before/after blocks
- Print allocation/free traces with stack traces
8. Extensions & Challenges
8.1 Extension: Lock-Free Multi-Producer Pool
Implement a pool that allows multiple threads to allocate concurrently using lock-free techniques:
- Use atomic compare-and-swap for free list head
- Handle ABA problem with tagged pointers or hazard pointers
- Benchmark against mutex-based version
Challenge: The ABA problem occurs when Thread A sees value A, Thread B changes A to B and back to A, then Thread A’s CAS succeeds incorrectly.
8.2 Extension: Growing Pool
Implement a pool that can grow when exhausted:
- Allocate additional chunks on demand
- Maintain linked list of chunks
- Handle deallocation across chunks
- Consider fragmentation implications
8.3 Extension: Memory-Mapped File Pool
Implement a pool backed by a memory-mapped file:
- Persist allocations across process restarts
- Handle crash recovery
- Useful for maintaining order book state
8.4 Extension: NUMA-Aware Allocation
For multi-socket servers:
- Detect NUMA topology
- Allocate pools on local NUMA node
- Bind threads to cores near their memory
8.5 Extension: Defragmentation
For long-running systems:
- Track allocation patterns
- Periodically compact pools
- Handle live object relocation (requires cooperation from caller)
8.6 Challenge: Rust GlobalAlloc Implementation
Implement your allocator as a Rust GlobalAlloc:
#[global_allocator]
static ALLOC: PoolGlobalAlloc = PoolGlobalAlloc::new();
unsafe impl GlobalAlloc for PoolGlobalAlloc {
unsafe fn alloc(&self, layout: Layout) -> *mut u8 { ... }
unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) { ... }
}
Challenge: GlobalAlloc must be thread-safe and handle any size/alignment.
9. Real-World Connections
9.1 Industry Usage
| Company/Project | Allocator Approach |
|---|---|
| Jane Street | Custom OCaml allocators, arena patterns |
| HRT | Custom C++ allocators, huge pages |
| Tower Research | Per-thread pools, zero-copy parsing |
| Game Engines | Frame allocators (arena per frame) |
| Databases | Buffer pool managers, slab allocators |
9.2 Production Allocators to Study
| Allocator | Key Innovation |
|---|---|
| jemalloc | Thread caching, size classes, arenas |
| tcmalloc | Per-thread caches, central free list |
| mimalloc | Free list sharding, segment-based |
| Hoard | Superblock-based, scalable |
9.3 Your Allocator vs Production
Your allocator is simpler but faster for your use case:
- Production allocators handle ANY size; yours handles fixed sizes
- Production allocators are general-purpose; yours is application-specific
- Production allocators trade some speed for flexibility
This is the HFT philosophy: specialize for speed.
9.4 Integration with Your Other Projects
| Project | How Allocator Helps |
|---|---|
| P01: Order Book | Pool allocator for Order and PriceLevel objects |
| P02: Lock-Free Queue | Arena for parsing buffers, pool for queue nodes |
| P03: Matching Engine | Thread-local pools for per-connection state |
| P06: Full System | All of the above, plus statistics for monitoring |
10. Resources
10.1 Primary References
| Resource | Use For | Specific Sections |
|---|---|---|
| “Computer Systems: A Programmer’s Perspective” | Memory fundamentals | Ch. 9 (Virtual Memory), Ch. 6 (Cache) |
| “The Linux Programming Interface” | mmap, system calls | Ch. 7, Ch. 49 |
| “Fluent C” by Preschern | Pool allocator pattern | Ch. 6 (Memory Pool) |
| “Building Low Latency Applications with C++” | HFT context | Ch. on memory management |
10.2 Articles and Talks
| Resource | Description |
|---|---|
| Andrei Alexandrescu: std::allocator Is to Allocation What std::vector Is to Vexation | Why std::allocator is limited |
| CppCon: Memory Allocation Strategies | Various CppCon talks |
| What Every Programmer Should Know About Memory | Ulrich Drepper’s classic paper |
10.3 Code References
| Repository | What to Learn |
|---|---|
| jemalloc source | Size classes, arenas, thread caching |
| tcmalloc source | Per-thread caches, central free list |
| mimalloc source | Segment-based, free list sharding |
10.4 Tools
| Tool | Purpose |
|---|---|
| `perf` | Performance profiling, cache analysis |
| `valgrind` | Memory error detection |
| `heaptrack` | Heap profiling |
| `massif` | Heap visualization |
| AddressSanitizer | Runtime memory error detection |
11. Self-Assessment Checklist
You have mastered this project when you can:
Conceptual Understanding:
- Explain why malloc is unsuitable for HFT hot paths (latency, determinism)
- Describe the trade-offs between pool, arena, and multi-pool allocators
- Explain false sharing and how to prevent it
- Describe the ABA problem in lock-free free lists
Implementation Skills:
- Implement a pool allocator from scratch without reference
- Implement an arena allocator with proper alignment
- Create thread-local pools without lock contention
- Integrate your allocator with std::vector using the Allocator concept
Performance Verification:
- Benchmark your allocator vs malloc, demonstrating 10x+ improvement
- Show deterministic p99 latency (< 2x p50)
- Verify zero system calls after initialization using perf
Debugging Ability:
- Add debug checks that detect double-free and use-after-free
- Use AddressSanitizer to catch memory errors
- Diagnose false sharing using cache miss counters
Real-World Application:
- Integrate your allocator into the order book from Project 1
- Measure end-to-end latency improvement in a realistic scenario
12. Submission / Completion Criteria
Your project is complete when you have:
12.1 Code Deliverables
- `pool_allocator.hpp`: Fixed-size pool with O(1) alloc/free
- `arena_allocator.hpp`: Bump allocator with reset
- `multi_pool.hpp`: Size-class routing to pools
- `thread_local_pool.hpp`: Per-thread pool wrapper
- `stl_adapter.hpp`: C++ Allocator concept implementation
- Comprehensive unit tests (>90% coverage)
- Benchmark suite comparing to malloc
12.2 Documentation
- README with build instructions
- API documentation for each allocator type
- Benchmark results with analysis
12.3 Performance Criteria
- Pool allocator: <15ns average allocation
- Pool allocator: p99 < 2x p50 (deterministic)
- Zero system calls after initialization (verified with perf)
- Multi-threaded benchmark shows zero lock contention
12.4 Integration
- Demonstrated integration with order book or matching engine
- Measured latency improvement in realistic scenario
Summary
This project teaches you one of the most important skills in HFT development: eliminating malloc from your hot path. The difference between a 250ns malloc and an 8ns pool allocation may seem small, but multiplied across thousands of operations per second, it becomes the difference between profit and loss.
Key takeaways:
- General-purpose allocators sacrifice speed for flexibility - you can do the opposite
- Pool allocators achieve O(1) through pre-allocation and free lists - no searching, no fragmentation
- Arena allocators are even faster - just bump a pointer, reset all at once
- Thread-local pools eliminate contention - essential for multi-threaded HFT
- Determinism matters as much as speed - p99 latency kills trading strategies
When you complete this project, you will have built infrastructure that is foundational to every serious HFT system. More importantly, you will understand why these techniques work and when to apply them.
“In HFT, memory allocation is not a solved problem - it’s a problem you solve for your specific use case. The best allocator is the one that does exactly what you need and nothing more.”