Project 3: Custom Arena Allocator (Memory Locality)
From the Advanced Rust Ecosystem Deep Dive
- Main Programming Language: Rust
- Coolness Level: Level 3: Genuinely Clever
- Difficulty: Level 3: Advanced
- Knowledge Area: Memory Management / Unsafe
- Time Estimate: 1 week
What You Will Build
A high-performance "Bump Allocator" (Arena). You will pre-allocate a large chunk of memory and provide an alloc<T> method that returns a reference to a part of that memory, without calling the global allocator for every object.
Learning Objectives
By the end of this project, you will:
- Understand OS-level memory allocation - How `brk`, `sbrk`, and `mmap` work at the kernel level
- Master memory alignment - Why CPUs require aligned access and how to calculate proper alignment
- Implement unsafe Rust safely - Use raw pointers while maintaining Rust's safety guarantees through invariants
- Understand allocator tradeoffs - Why general-purpose allocators are slow and when to use specialized ones
- Implement the bump allocation pattern - The simplest and fastest allocation strategy
- Handle lifetimes with raw pointers - Ensure arena-allocated references don't outlive the arena
- Build production-quality unsafe code - Use `Layout`, `align_offset`, and proper drop semantics
- Benchmark allocation performance - Measure and compare against `Box::new()`
- Connect theory to practice - Understand how `bumpalo`, `rustc`, and game engines use arenas
- Debug memory issues - Identify and fix common arena allocator bugs
Deep Theoretical Foundation
Before writing any code, you must understand the layers of abstraction between your program and physical RAM. This section will take you from the kernel level up to Rust's allocation APIs.
Part 1: Memory Allocation at the Operating System Level
When your Rust program calls Box::new(42), what actually happens? Let's trace the path from your code down to the kernel.
YOUR CODE: Box::new(42)
|
v
RUST GLOBAL ALLOCATOR (jemalloc/System)
|
v
LIBC (malloc/free)
|
v
KERNEL SYSCALLS (brk/mmap)
|
v
PHYSICAL MEMORY (RAM)
The brk and sbrk System Calls
In traditional Unix systems, a process has a "data segment" that grows upward from the program's initial data. The "break" is the address where this segment ends.
PROCESS MEMORY LAYOUT (Traditional)

Higher Addresses
+----------------------------------------------------------+
|                          STACK                           |
|                       (grows down)                       |
+----------------------------------------------------------+
|                       (free space)                       |
+----------------------------------------------------------+
| - - - - - - - - - -  the "break"  - - - - - - - - - - -  |
|                          HEAP                            |
|                       (grows up)                         |
+----------------------------------------------------------+
|                      DATA SEGMENT                        |
|                (initialized global vars)                 |
+----------------------------------------------------------+
|                      TEXT SEGMENT                        |
|                  (your program code)                     |
+----------------------------------------------------------+
Address 0x0
Book Reference: "The Linux Programming Interface" by Michael Kerrisk, Chapter 7: Memory Allocation, provides an in-depth explanation of brk/sbrk and the process memory layout.
The sbrk(n) system call moves the "break" by n bytes, effectively allocating that memory:
// Simplified example of how sbrk works (C code)
#include <stddef.h>
#include <unistd.h>

void* my_malloc(size_t size) {
    void* ptr = sbrk(0);              // Get current break
    if (sbrk(size) == (void*)-1) {    // Move break up by 'size' bytes
        return NULL;                  // Allocation failed (sbrk returns (void*)-1)
    }
    return ptr;                       // Return pointer to the new memory
}
However, brk/sbrk have a critical limitation: you can only free memory at the "top" of the heap. If you allocate A, B, C, and then free B, you cannot reclaim that memory until A is also freed.
The mmap System Call
Modern allocators prefer mmap for large allocations. Unlike brk, mmap can allocate memory anywhere in the virtual address space and release it independently.
MODERN PROCESS MEMORY LAYOUT (with mmap)

+-----------------------------------------------------------+
|                          STACK                            |
+-----------------------------------------------------------+
|                                                           |
|        MMAP REGION (libraries, large allocations)         |
|   +------------+   +--------------+   +--------------+    |
|   |  libc.so   |   |  your alloc  |   |  another.so  |    |
|   +------------+   +--------------+   +--------------+    |
|                                                           |
+-----------------------------------------------------------+
|                     HEAP (brk area)                       |
|                   (small allocations)                     |
+-----------------------------------------------------------+
|                       DATA + BSS                          |
+-----------------------------------------------------------+
|                      TEXT SEGMENT                         |
+-----------------------------------------------------------+
Key Insight: mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0) asks the kernel for size bytes of fresh memory, not backed by any file. This is what Rust's std::alloc::alloc() ultimately calls for large allocations.
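To make this concrete, here is a minimal sketch of the same call made directly from Rust, assuming the `libc` crate is available as a dependency (Unix-only):

use std::ptr;

// Request `size` bytes of fresh anonymous memory straight from the kernel.
fn mmap_block(size: usize) -> *mut u8 {
    let ptr = unsafe {
        libc::mmap(
            ptr::null_mut(),                        // let the kernel pick the address
            size,
            libc::PROT_READ | libc::PROT_WRITE,
            libc::MAP_PRIVATE | libc::MAP_ANONYMOUS,
            -1,                                     // no backing file
            0,
        )
    };
    if ptr == libc::MAP_FAILED {
        panic!("mmap failed");
    }
    ptr as *mut u8
}

fn main() {
    let block = mmap_block(4096);
    unsafe {
        block.write(42);                            // the memory is immediately usable
        assert_eq!(block.read(), 42);
        libc::munmap(block as *mut libc::c_void, 4096);
    }
}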
Part 2: General-Purpose Allocators and Why Theyโre Slow
When you call Box::new() in Rust, you're using a general-purpose allocator. Rust has historically used jemalloc (a high-performance allocator from FreeBSD) and now defaults to the system allocator. Let's understand why these are "slow" for certain workloads.
The Three Costs of General-Purpose Allocation
1. Fragmentation Management
EXTERNAL FRAGMENTATION EXAMPLE

Initial State:
+-----------------------------------------------------------+
|                     FREE (1000 bytes)                      |
+-----------------------------------------------------------+

After allocating A(100), B(200), C(100):
+-------+-----------+-------+-------------------------------+
| A:100 |   B:200   | C:100 |       FREE (600 bytes)        |
+-------+-----------+-------+-------------------------------+

After freeing B:
+-------+-----------+-------+-------------------------------+
| A:100 | FREE:200  | C:100 |       FREE (600 bytes)        |
+-------+-----------+-------+-------------------------------+

Now you want to allocate D(500):
ERROR! You have 800 bytes free, but the largest contiguous block is only 600!
This is EXTERNAL FRAGMENTATION.
General-purpose allocators spend significant CPU cycles managing free lists, coalescing adjacent free blocks, and searching for the best-fit hole.
2. Metadata Overhead
Every allocation needs bookkeeping:
TYPICAL ALLOCATION BLOCK (simplified)
+-------------------+-----------------------------------------+
|  HEADER (8-16 B)  |               USER DATA                 |
+-------------------+-----------------------------------------+
| - size            |                                         |
| - prev/next ptr   |  This is what you get back from malloc  |
| - flags           |                                         |
+-------------------+-----------------------------------------+
        ^
        Hidden from the user but consumes memory

If you're allocating millions of small objects (say, 16 bytes each), this 8-16 byte overhead can DOUBLE your memory usage!
3. Thread-Safety Locking
In a multithreaded program, allocators must synchronize access:
THREAD 1                      THREAD 2
   |                             |
   v                             v
malloc(100)                  malloc(200)
   |                             |
   v                             v
+-------------------------------------------+
|         LOCK (global heap lock)           |  <-- CONTENTION!
|        Only one thread at a time          |
+-------------------------------------------+
Modern allocators like jemalloc use thread-local caches to reduce lock contention, but the overhead is still non-zero.
Book Reference: "Computer Systems: A Programmer's Perspective" by Bryant and O'Hallaron, Chapter 9.9: Dynamic Memory Allocation, covers these concepts with detailed examples.
Part 3: Arena (Bump) Allocators - The Simplest Pattern
An arena allocator is radically simple: allocate by "bumping" a pointer forward. Never free individual objects. Free everything at once when the arena is dropped.
ARENA (BUMP) ALLOCATOR - THE CORE IDEA

Initial State:
base                                                        end
 |                                                           |
 v                                                           v
 +-----------------------------------------------------------+
 |                        FREE SPACE                          |
 +-----------------------------------------------------------+
 ^
 ptr (bump pointer, starts at base)

After alloc(A, 32 bytes):
 +--------------+--------------------------------------------+
 |  A (32 B)    |                 FREE SPACE                  |
 +--------------+--------------------------------------------+
                ^
                ptr (moved forward 32 bytes)

After alloc(B, 16 bytes):
 +--------------+---------+----------------------------------+
 |  A (32 B)    | B (16 B)|            FREE SPACE             |
 +--------------+---------+----------------------------------+
                          ^
                          ptr (moved forward another 16 bytes)

After alloc(C, 8 bytes):
 +--------------+---------+--------+---------------------------+
 |  A (32 B)    | B (16 B)| C (8 B)|        FREE SPACE          |
 +--------------+---------+--------+---------------------------+
                                   ^
                                   ptr (moved forward another 8 bytes)

FREE OPERATION (individual):
 THERE IS NO FREE. You cannot deallocate individual objects.
 This is the trade-off for speed.

RESET OPERATION (everything at once):
 +-----------------------------------------------------------+
 |        ALL MEMORY NOW FREE (objects A, B, C gone)          |
 +-----------------------------------------------------------+
 ^
 ptr (reset back to base)
Why Arena Allocation is Fast
| Operation | General-Purpose | Arena |
|---|---|---|
| Allocation | Search free list, split block, update metadata | Bump pointer, done |
| Time Complexity | O(n) or O(log n) | O(1) - constant time |
| Locking (multi-threaded) | Global or per-arena lock | Per-arena lock (smaller contention) |
| Metadata per object | 8-16 bytes | 0 bytes |
| Deallocation | Update free list, possibly coalesce | No-op (or reset all) |
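To make the table concrete, here is a deliberately minimal sketch of the bump step itself - just a fixed byte buffer and one offset. Alignment handling and typed allocation come later in this guide; this only shows why allocation is a single addition:

use std::cell::Cell;

// Minimal bump allocator over a fixed byte buffer (alignment ignored).
struct Bump {
    buf: Box<[u8]>,
    used: Cell<usize>, // offset of the next free byte
}

impl Bump {
    fn new(capacity: usize) -> Self {
        Bump { buf: vec![0u8; capacity].into_boxed_slice(), used: Cell::new(0) }
    }

    // Reserve `size` bytes: the whole "allocation" is one addition.
    fn alloc_bytes(&self, size: usize) -> Option<*const u8> {
        let start = self.used.get();
        if start + size > self.buf.len() {
            return None; // out of space: no free list to search, just fail
        }
        self.used.set(start + size);
        Some(self.buf[start..].as_ptr())
    }
}

fn main() {
    let bump = Bump::new(64);
    let a = bump.alloc_bytes(32).unwrap();
    let b = bump.alloc_bytes(16).unwrap();
    // Consecutive allocations are adjacent in memory.
    assert_eq!(unsafe { a.add(32) }, b);
    assert!(bump.alloc_bytes(32).is_none()); // only 16 bytes left
}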
Part 4: Memory Alignment - Why CPUs Need It
Modern CPUs don't read memory one byte at a time. They read in "words" (typically 4 or 8 bytes on modern systems). Data is "aligned" when it starts at an address that is a multiple of its alignment - which, for primitive types, usually equals its size.
MEMORY ALIGNMENT VISUALIZATION

Physical memory addresses:
+----+----+----+----+----+----+----+----+----+----+----+----+----+
|  0 |  1 |  2 |  3 |  4 |  5 |  6 |  7 |  8 |  9 | 10 | 11 | 12 |
+----+----+----+----+----+----+----+----+----+----+----+----+----+

u32 (4 bytes) ALIGNED at address 4:
+----+----+----+----+===================+----+----+----+----+----+
|    |    |    |    |  u32 value here   |    |    |    |    |    |
+----+----+----+----+===================+----+----+----+----+----+
                     ^
                     Address 4 is divisible by 4 -> ALIGNED

u32 (4 bytes) MISALIGNED at address 3:
+----+----+----+===================+----+----+----+----+----+----+
|    |    |    | u32 spans two     |    |    |    |    |    |    |
|    |    |    | 4-byte words      |    |    |    |    |    |    |
+----+----+----+===================+----+----+----+----+----+----+
                ^
                Address 3 is NOT divisible by 4 -> MISALIGNED
The Cost of Misalignment
On x86/x64: Misaligned access is allowed but SLOW:
- Aligned access: 1 memory operation
- Misaligned access: 2 memory operations + CPU overhead to combine them
On ARM (and others): Misaligned access may cause:
- A hardware trap (exception)
- The kernel emulating the access (VERY slow)
- On some CPUs, a crash
ALIGNED u64 READ (address 8):

        boundary between lines (8-byte "lines" shown for illustration;
        real cache lines are typically 64 bytes)
                     |
 +-------------------+---------------------+
 |      Line 0       |       Line 1        |
 |      [0-7]        |       [8-15]        |
 +-------------------+---------------------+
                     +---------------------+
                     |    u64 @ addr 8     |
                     |      (8 bytes)      |
                     +---------------------+
 CPU reads Line 1 -> gets all 8 bytes in ONE operation

MISALIGNED u64 READ (address 5):

                     |
 +-------------------+---------------------+
 |      Line 0       |       Line 1        |
 |      [0-7]        |       [8-15]        |
 +-------------------+---------------------+
              +---------------------+
              |    u64 @ addr 5     |
              |      (8 bytes)      |
              +---------------------+
                |              |
                v              v
             3 bytes        5 bytes
             from           from
             Line 0         Line 1

 CPU reads Line 0 AND Line 1 -> TWO operations + combine
Book Reference: "Computer Systems: A Programmer's Perspective" Chapter 3.9: Heterogeneous Data Structures covers alignment in detail.
Rust provides the std::alloc::Layout type to handle alignment correctly:
Rust provides the std::alloc::Layout type to handle alignment correctly:
use std::alloc::Layout;
// Create a layout for a single i32
let layout = Layout::new::<i32>();
assert_eq!(layout.size(), 4); // 4 bytes
assert_eq!(layout.align(), 4); // 4-byte alignment
// Create a layout for a single u8
let layout = Layout::new::<u8>();
assert_eq!(layout.size(), 1); // 1 byte
assert_eq!(layout.align(), 1); // 1-byte alignment
// Create a layout for a struct
#[repr(C)]
struct Example {
a: u8, // 1 byte, 1-align
b: u32, // 4 bytes, 4-align
c: u16, // 2 bytes, 2-align
}
let layout = Layout::new::<Example>();
// Size is 12 (with padding), align is 4 (largest field alignment)
size_of vs align_of
size_of::<T>(): How many bytes the value OCCUPIES in memory
align_of::<T>(): What address boundary the value MUST start on
EXAMPLE: A struct with mixed types

#[repr(C)] // Use C layout (predictable)
struct Mixed {
    a: u8,   // size=1, align=1
    b: u32,  // size=4, align=4
    c: u8,   // size=1, align=1
}

Memory layout (repr(C)):
+-----+-----------------+-----------------------+-----+-----------------+
|  0  |   1    2    3   |   4    5    6    7    |  8  |   9   10   11   |
+-----+-----------------+-----------------------+-----+-----------------+
|  a  |     PADDING     |           b           |  c  |     PADDING     |
| u8  |    (3 bytes)    |          u32          | u8  |    (3 bytes)    |
+-----+-----------------+-----------------------+-----+-----------------+

size_of::<Mixed>()  = 12 bytes (includes padding)
align_of::<Mixed>() = 4 bytes (largest member alignment)

WITHOUT repr(C), Rust may reorder fields:
+-----------------------+-----+-----+-----------------+
|   0    1    2    3    |  4  |  5  |   6    7        |
+-----------------------+-----+-----+-----------------+
|           b           |  a  |  c  |     PADDING     |
|          u32          | u8  | u8  |    (2 bytes)    |
+-----------------------+-----+-----+-----------------+

size_of::<Mixed>()  = 8 bytes (less padding!)
align_of::<Mixed>() = 4 bytes (same)
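You can verify these numbers on your own machine with a few asserts. One caveat: the default (non-repr(C)) layout is unspecified, so the 8-byte figure is typical on current compilers but not guaranteed:

use std::mem::{align_of, size_of};

#[repr(C)]
#[allow(dead_code)]
struct MixedC {
    a: u8,
    b: u32,
    c: u8,
}

// Same fields, default (Rust) representation: the compiler may reorder.
#[allow(dead_code)]
struct MixedRust {
    a: u8,
    b: u32,
    c: u8,
}

fn main() {
    // repr(C) must keep declaration order, so padding is forced: 12 bytes.
    assert_eq!(size_of::<MixedC>(), 12);
    assert_eq!(align_of::<MixedC>(), 4);

    // The default layout is unspecified; current compilers typically
    // reorder the fields down to 8 bytes, but do not rely on it.
    println!("default repr size = {}", size_of::<MixedRust>());
    assert_eq!(align_of::<MixedRust>(), 4);
}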
Part 6: The Alignment Calculation for Bump Allocation
When bumping a pointer, you must ensure the result is properly aligned for the type you're allocating. Here's the formula:
ALIGNMENT FORMULA

Given:
- current_ptr: The current position in the arena
- align: The required alignment for the type

To find the next aligned address:

    aligned_ptr = (current_ptr + align - 1) & !(align - 1)

EXAMPLE: align a pointer to an 8-byte boundary

current_ptr = 13
align = 8

Step 1: current_ptr + align - 1 = 13 + 8 - 1 = 20
Step 2: align - 1 = 7 = 0b00000111 (binary)
Step 3: !(align - 1) = !7 = 0b...11111000 (inverts bits)
Step 4: 20 & 0b...11111000 = 16

Result: 16 is the next address divisible by 8

VISUAL:
  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16
  ^                               ^                   ^            ^
  8-aligned                       8-aligned     current=13    next aligned=16
                                                (skip 3 bytes of padding)
Rust provides pointer::align_offset() to do this calculation safely:
let mut buf = [0u8; 64];                       // any backing memory works for this demo
let ptr: *mut u8 = buf.as_mut_ptr();
let align = std::mem::align_of::<u64>();       // 8
let offset = ptr.align_offset(align);          // how many bytes to skip
if offset == usize::MAX {
    // Alignment is impossible (rare, usually for exotic alignments)
    panic!("Cannot align");
}
let aligned_ptr = unsafe { ptr.add(offset) };  // pointer arithmetic is unsafe
Part 7: Arena vs Pool vs Slab Allocators
Different allocation strategies are optimized for different use cases:
ALLOCATOR COMPARISON

ARENA (BUMP) ALLOCATOR
----------------------
Pattern: Objects allocated together, freed together

+---+---+---+---+---+---+-------------------------------+
| A | B | C | D | E | F |          FREE SPACE           |
+---+---+---+---+---+---+-------------------------------+
                        ^
                        bump pointer

+ Fastest allocation (O(1))
+ Zero metadata overhead
+ Cache-friendly (contiguous)
- Cannot free individual objects
- Wastes memory if objects have varied lifetimes

Use cases: Compilers (AST nodes), game frame allocations,
           request-scoped web servers

POOL ALLOCATOR
--------------
Pattern: Fixed-size objects, frequent alloc/free

+----+----+----+----+----+----+----+----+----+----+
|USED|FREE|USED|USED|FREE|FREE|USED|FREE|USED|FREE|
+----+----+----+----+----+----+----+----+----+----+
      |              |    |         |         |
      +--------------+----+---------+---------+
                Free list (linked)

+ Fast allocation (O(1))
+ CAN free individual objects
+ No fragmentation (fixed size)
- Only works for one size
- Slight overhead for free list pointers

Use cases: ECS entity storage, particle systems,
           network connection handles

SLAB ALLOCATOR
--------------
Pattern: Objects of the same TYPE, kernel-style

+--------------------------------------------------------+
|                       SLAB CACHE                        |
|                (for struct: task_struct)                |
|                                                         |
|  +---------------+------------------+----------------+  |
|  |  SLAB (full)  |  SLAB (partial)  |  SLAB (empty)  |  |
|  +---------------+------------------+----------------+  |
+--------------------------------------------------------+

+ Objects are pre-initialized (construction cost amortized)
+ Excellent cache locality
+ Color optimization to reduce cache conflicts
- Complex implementation
- Type-specific caches

Use cases: Linux kernel object caches, high-frequency
           object creation (network packets, inodes)

WHEN TO USE EACH:

           +--------------------------+
           |  Can objects be freed    |
           |  individually?           |
           +------------+-------------+
                        |
            +-----------+-----------+
            v                       v
           NO                      YES
            |                       |
            v                       v
      +-----------+   +---------------------------+
      |   ARENA   |   |  Are all objects the      |
      |  (Bump)   |   |  same size?               |
      +-----------+   +-------------+-------------+
                                    |
                        +-----------+-----------+
                        v                       v
                       YES                     NO
                        |                       |
                        v                       v
        +------------------------+   +---------------------+
        |  Do objects need       |   |  GENERAL PURPOSE    |
        |  pre-initialization?   |   |  (malloc/jemalloc)  |
        +-----------+------------+   +---------------------+
                    |
            +-------+-------+
            v               v
           YES             NO
            |               |
            v               v
    +---------------+  +---------------+
    |     SLAB      |  |     POOL      |
    |  (with ctor)  |  |  (free list)  |
    +---------------+  +---------------+
Part 8: The Allocator Trait (Unstable)
Rust's Allocator trait (currently unstable, available with #![feature(allocator_api)]) defines the interface for custom allocators:
#![feature(allocator_api)]
use std::alloc::{Allocator, AllocError, Layout};
use core::ptr::NonNull;
pub trait Allocator {
fn allocate(&self, layout: Layout) -> Result<NonNull<[u8]>, AllocError>;
unsafe fn deallocate(&self, ptr: NonNull<u8>, layout: Layout);
// Optional methods with default implementations:
fn allocate_zeroed(&self, layout: Layout) -> Result<NonNull<[u8]>, AllocError> { ... }
unsafe fn grow(...) -> Result<NonNull<[u8]>, AllocError> { ... }
unsafe fn shrink(...) -> Result<NonNull<[u8]>, AllocError> { ... }
}
This trait allows you to use your arena with standard collections:
#![feature(allocator_api)]
use std::vec::Vec;
let arena = MyArena::new(1024);
// Create a Vec that uses your arena!
let mut vec: Vec<i32, &MyArena> = Vec::new_in(&arena);
vec.push(1);
vec.push(2);
vec.push(3);
// All allocations come from the arena, not the global allocator
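To see how the pieces fit together, here is one possible sketch of a bump arena implementing the unstable trait (nightly only). BumpArena is a hypothetical minimal type written for this guide, not bumpalo:

#![feature(allocator_api)]

use std::alloc::{AllocError, Allocator, Layout};
use std::cell::Cell;
use std::ptr::NonNull;

struct BumpArena {
    base: *mut u8,
    capacity: usize,
    used: Cell<usize>, // byte offset of the next free position
}

impl BumpArena {
    fn new(capacity: usize) -> Self {
        let layout = Layout::from_size_align(capacity, 16).unwrap();
        let base = unsafe { std::alloc::alloc(layout) };
        assert!(!base.is_null(), "failed to allocate arena memory");
        BumpArena { base, capacity, used: Cell::new(0) }
    }
}

impl Drop for BumpArena {
    fn drop(&mut self) {
        let layout = Layout::from_size_align(self.capacity, 16).unwrap();
        unsafe { std::alloc::dealloc(self.base, layout) };
    }
}

unsafe impl Allocator for BumpArena {
    fn allocate(&self, layout: Layout) -> Result<NonNull<[u8]>, AllocError> {
        // Align the bump position for this layout, then reserve the bytes.
        let addr = self.base as usize + self.used.get();
        let aligned = (addr + layout.align() - 1) & !(layout.align() - 1);
        let new_used = aligned - self.base as usize + layout.size();
        if new_used > self.capacity {
            return Err(AllocError);
        }
        self.used.set(new_used);
        let ptr = NonNull::new(aligned as *mut u8).ok_or(AllocError)?;
        Ok(NonNull::slice_from_raw_parts(ptr, layout.size()))
    }

    unsafe fn deallocate(&self, _ptr: NonNull<u8>, _layout: Layout) {
        // No-op: a bump arena frees everything at once in Drop.
    }
}

fn main() {
    let arena = BumpArena::new(1024);
    let mut v: Vec<i32, &BumpArena> = Vec::new_in(&arena);
    v.extend([1, 2, 3]);
    assert_eq!(v.iter().sum::<i32>(), 6);
}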
Part 9: Deallocation Strategies in Arenas
Arenas typically use one of these deallocation approaches:
DEALLOCATION STRATEGIES

1. NO DEALLOCATION (Most Common)
--------------------------------
Objects are never freed individually.
The entire arena is freed when dropped.

alloc()     -> bump pointer -> return reference
drop(arena) -> dealloc the entire backing memory

Pros: Simplest, fastest
Cons: Can't reclaim memory until the arena dies

2. RESET (Reuse Arena)
----------------------
Reset the bump pointer to the start, reuse all memory.

Before reset:
+----+----+----+----+----------------------+
| A  | B  | C  | D  |        (free)        |
+----+----+----+----+----------------------+
                    ^
                    bump ptr

After reset():
+------------------------------------------+
|             (all free again)             |
+------------------------------------------+
^
bump ptr (back to start)

WARNING: All previous references are now INVALID!
Must use unsafe or ensure all refs are dropped.

3. LIFO DEALLOCATION (Stack-like)
---------------------------------
Can deallocate the LAST allocation only.

alloc(A) -> alloc(B) -> alloc(C)
+----+----+----+---------------------------+
| A  | B  | C  |          (free)           |
+----+----+----+---------------------------+
               ^
               bump ptr

dealloc(C):
+----+----+--------------------------------+
| A  | B  |             (free)             |
+----+----+--------------------------------+
          ^
          bump ptr (moved back)

dealloc(B):
+----+-------------------------------------+
| A  |               (free)                |
+----+-------------------------------------+
     ^
     bump ptr

dealloc(A): fine now, since A is last. But note the constraint:
you could NOT have deallocated A while B or C was still allocated.

4. GENERATION-BASED (Scope Markers)
-----------------------------------
Save the current position, allocate, then restore.

let marker = arena.save();   // marker = current bump ptr position
arena.alloc(A);
arena.alloc(B);
arena.alloc(C);
arena.restore(marker);       // bump ptr = marker, A/B/C are "freed"

Useful for: Scratch allocations, temporary work buffers
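Here is a sketch of strategy 4 on a byte-offset bump design; save and restore are hypothetical method names used by this guide:

use std::cell::Cell;

struct ScratchArena {
    buf: Box<[u8]>,
    used: Cell<usize>,
}

#[derive(Clone, Copy)]
struct Marker(usize); // an opaque snapshot of the bump offset

impl ScratchArena {
    fn new(capacity: usize) -> Self {
        ScratchArena { buf: vec![0u8; capacity].into_boxed_slice(), used: Cell::new(0) }
    }

    fn alloc_bytes(&self, size: usize) -> Option<usize> {
        let start = self.used.get();
        if start + size > self.buf.len() { return None; }
        self.used.set(start + size);
        Some(start)
    }

    fn save(&self) -> Marker {
        Marker(self.used.get())
    }

    // Everything allocated after `save` is logically freed at once.
    // CAUTION: in a real arena this is only sound if no references into
    // the restored region are still alive.
    fn restore(&self, m: Marker) {
        self.used.set(m.0);
    }
}

fn main() {
    let arena = ScratchArena::new(128);
    arena.alloc_bytes(16).unwrap();      // long-lived allocation
    let marker = arena.save();
    arena.alloc_bytes(64).unwrap();      // temporary scratch work
    arena.restore(marker);               // scratch space reclaimed
    assert_eq!(arena.used.get(), 16);
}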
Real World Outcome
You will have an Arena struct that can allocate objects 10-50x faster than Box::new(). You will use this to build a simple linked list that is cache-friendly because all nodes are contiguous in memory.
Example Benchmark:
$ cargo bench
Standard Box: 45.2 ns/alloc
Arena Alloc: 1.8 ns/alloc (25x SPEED UP!)
Full Example Session:
$ cargo new --lib arena-allocator
Created library `arena-allocator` package
$ cd arena-allocator
$ cargo add criterion --dev --features html_reports
Updating crates.io index
Adding criterion v0.5.1 to dev-dependencies
$ cargo build
Compiling arena-allocator v0.1.0
Finished dev [unoptimized + debuginfo] target(s) in 0.89s
$ cargo test
Compiling arena-allocator v0.1.0
Finished test [unoptimized + debuginfo] target(s) in 0.42s
Running unittests src/lib.rs
running 8 tests
test tests::test_basic_allocation ... ok
test tests::test_alignment_u8 ... ok
test tests::test_alignment_u32 ... ok
test tests::test_alignment_u64 ... ok
test tests::test_mixed_types ... ok
test tests::test_capacity_exhaustion ... ok
test tests::test_drop_behavior ... ok
test tests::test_lifetime_bounds ... ok
test result: ok. 8 passed; 0 failed; 0 ignored
$ cargo bench --bench allocation_comparison
Compiling arena-allocator v0.1.0
Finished bench [optimized] target(s) in 3.21s
Running benches/allocation_comparison.rs
+------------------------------------------------------------------+
|              Allocation Performance Comparison                   |
|------------------------------------------------------------------|
|         Testing: 10,000 allocations of various types             |
+------------------------------------------------------------------+

Benchmarking Box::new(i32)
Warming up for 3.0000 s
Collecting 100 samples in estimated 5.2340 s
box_i32/10k_allocs    time:  [450.12 us 452.34 us 454.89 us]
                      thrpt: [21.98M allocs/s 22.11M allocs/s 22.22M allocs/s]

Memory Analysis:
Total allocations: 10,000
Bytes allocated: 160,000 (16 bytes per Box: 4 bytes of data + 12 bytes
                 of allocator metadata and rounding)
Allocation rate: 352.7 MB/s

Benchmarking Arena::alloc::<i32>()
Warming up for 3.0000 s
Collecting 100 samples in estimated 5.0123 s
arena_i32/10k_allocs  time:  [17.82 us 17.95 us 18.12 us]
                      thrpt: [551.9M allocs/s 557.1M allocs/s 561.4M allocs/s]

Memory Analysis:
Total allocations: 1 (initial arena only)
Bytes allocated: 40,000 (4 bytes per i32, no metadata)
Allocation rate: 0 MB/s (no per-object system calls!)

PERFORMANCE SUMMARY:
+----------------------+--------------+---------------+----------+
| Metric               | Box::new     | Arena::alloc  | Speedup  |
|----------------------+--------------+---------------+----------|
| Time per 10k allocs  | 452 us       | 18 us         | 25.1x    |
| Throughput           | 22.1M ops/s  | 557.1M ops/s  | 25.2x    |
| System allocations   | 10,000       | 1             | 10,000x  |
| Memory overhead      | 75%          | 0%            | Better   |
| Cache behavior       | Poor         | Excellent     | Better   |
+----------------------+--------------+---------------+----------+
$ cargo run --example linked_list
Compiling arena-allocator v0.1.0
Finished dev [unoptimized + debuginfo] target(s) in 0.42s
Running `target/debug/examples/linked_list`
=== Arena-Allocated Linked List Demo ===
[1] Creating arena with 4KB capacity...
Arena @ 0x7f8b4c000000
Capacity: 4096 bytes
[2] Allocating 100 linked list nodes...
Node 0 @ 0x7f8b4c000000
Node 1 @ 0x7f8b4c000018
Node 2 @ 0x7f8b4c000030
Node 3 @ 0x7f8b4c000048
...
Node 99 @ 0x7f8b4c000948
[3] Memory locality analysis:
All 100 nodes are CONTIGUOUS in memory!
Span: 2400 bytes (100 nodes * 24 bytes)
Cache lines touched: 38 (vs ~100 for Box-allocated)
[4] Traversing list...
Sum of values: 4950
Traversal time: 0.8 us
[5] Comparison with Box-allocated list:
Box traversal time: 2.1 us (2.6x slower due to cache misses)
[Summary]
All nodes allocated with ZERO system calls
Memory layout is cache-optimal
Arena drops all memory at once when it goes out of scope
The Core Question Youโre Answering
"Can I allocate memory faster than the Operating System?"
Yes, you can. By asking the OS for one giant block and managing the "bumping" of a pointer yourself, you bypass the complex bookkeeping and locking of general-purpose allocators like jemalloc.
Concepts You Must Understand First
Stop and research these before coding:
- Memory Alignment
  - Why can't you just put a `u32` at any address? (Hint: CPU bus errors)
  - What happens on ARM vs x86 for misaligned access?
  - Book Reference: "Computer Systems: A Programmer's Perspective" Ch. 3.9
- Raw Pointers and `Layout`
  - How does `std::alloc::Layout` calculate size and alignment?
  - What's the difference between `*const T` and `*mut T`?
  - Book Reference: "The Rust Programming Language" Ch. 19
- PhantomData & Lifetimes
  - How do you ensure the references returned by the arena don't outlive the arena itself?
  - Why is `PhantomData<&'a T>` sometimes needed?
  - Book Reference: "Rust for Rustaceans" Ch. 3
- Unsafe Rust Invariants
  - What are the rules for `unsafe` blocks?
  - When can you dereference a raw pointer?
  - Book Reference: "The Rustonomicon" (online book)
Solution Architecture
Your arena allocator should have this basic structure:
ARENA STRUCT ARCHITECTURE

pub struct Arena {
    /// Raw pointer to the start of the allocated memory block
    base: *mut u8,
    /// Current position in the arena (where the next allocation starts)
    ptr: Cell<*mut u8>,
    /// Pointer to one past the end of the arena
    end: *mut u8,
    /// Total capacity in bytes
    capacity: usize,
}

MEMORY LAYOUT:
base --> +----------+----------+----------+------------------+
         |   USED   |   USED   |   USED   |       FREE       |
         |   (A)    |   (B)    |   (C)    |                  |
         +----------+----------+----------+------------------+
                                          ^                  ^
                                          ptr (current)      end

Invariants:
- base <= ptr <= end
- capacity  = end - base
- used      = ptr - base
- remaining = end - ptr
ALLOC METHOD SIGNATURE:

pub fn alloc<T>(&self, value: T) -> &mut T

- Takes ownership of `value`
- Returns `&mut T` with lifetime tied to `&self`
- May panic if the arena is out of space

ALIGNMENT CALCULATION (sketch; the raw-pointer operations need unsafe):

fn alloc_impl<T>(&self, value: T) -> &mut T {
    let layout = Layout::new::<T>();
    // 1. Get the current pointer position
    let ptr = self.ptr.get();
    // 2. Calculate the aligned position
    let aligned = align_up(ptr as usize, layout.align()) as *mut u8;
    // 3. Calculate the new pointer position after the allocation
    let new_ptr = unsafe { aligned.add(layout.size()) };
    // 4. Check that we have enough space
    if new_ptr > self.end {
        panic!("Arena out of memory");
    }
    // 5. Update the bump pointer
    self.ptr.set(new_ptr);
    // 6. Write the value and return a reference
    let slot = aligned as *mut T;
    unsafe {
        std::ptr::write(slot, value);
        &mut *slot
    }
}
DROP IMPLEMENTATION:
impl Drop for Arena {
fn drop(&mut self) {
// Deallocate the entire backing memory
unsafe {
let layout = Layout::from_size_align_unchecked(
self.capacity,
1 // u8 alignment
);
std::alloc::dealloc(self.base, layout);
}
}
}
WARNING: This simple arena does NOT call Drop on allocated objects!
The objects are just "forgotten" when the arena is dropped.
For types with Drop, you need TypedArena or manual drop tracking.
Phased Implementation Guide
Phase 1: Request Memory from System Allocator
Goal: Allocate a large chunk of memory to use as your arena.
use std::alloc::{alloc, Layout};
use std::cell::Cell;
pub struct Arena {
base: *mut u8,
ptr: Cell<*mut u8>,
end: *mut u8,
capacity: usize,
}
impl Arena {
pub fn new(capacity: usize) -> Self {
let layout = Layout::from_size_align(capacity, 1)
.expect("Invalid layout");
let base = unsafe { alloc(layout) };
if base.is_null() {
panic!("Failed to allocate arena memory");
}
let end = unsafe { base.add(capacity) };
Arena {
base,
ptr: Cell::new(base),
end,
capacity,
}
}
}
Checkpoint: Verify you can create an Arena and it doesn't crash.
Phase 2: Implement Basic Bump Allocation (No Alignment)
Goal: Allocate bytes without worrying about alignment.
impl Arena {
/// Allocate raw bytes (no alignment)
pub fn alloc_bytes(&self, size: usize) -> *mut u8 {
let ptr = self.ptr.get();
let new_ptr = unsafe { ptr.add(size) };
if new_ptr > self.end {
panic!("Arena out of memory");
}
self.ptr.set(new_ptr);
ptr
}
}
Checkpoint: Allocate some bytes, verify the pointer advances.
Phase 3: Add Proper Alignment Handling
Goal: Ensure allocations are properly aligned for their type.
/// Round `addr` up to the nearest multiple of `align`
fn align_up(addr: usize, align: usize) -> usize {
(addr + align - 1) & !(align - 1)
}
impl Arena {
pub fn alloc<T>(&self, value: T) -> &mut T {
let layout = Layout::new::<T>();
let ptr = self.ptr.get();
// Calculate aligned address
let aligned = align_up(ptr as usize, layout.align());
let aligned_ptr = aligned as *mut u8;
// Calculate end of allocation
let new_ptr = unsafe { aligned_ptr.add(layout.size()) };
if new_ptr > self.end {
panic!("Arena out of memory: needed {} bytes, have {}",
layout.size(),
self.end as usize - ptr as usize);
}
self.ptr.set(new_ptr);
// Write the value
let slot = aligned_ptr as *mut T;
unsafe {
std::ptr::write(slot, value);
&mut *slot
}
}
}
Checkpoint: Allocate different types (u8, u32, u64), verify alignment.
Phase 4: Add Lifetime Safety
Goal: Ensure references cannot outlive the arena.
use std::marker::PhantomData;
pub struct Arena<'a> {
base: *mut u8,
ptr: Cell<*mut u8>,
end: *mut u8,
capacity: usize,
_marker: PhantomData<&'a ()>,
}
impl<'a> Arena<'a> {
pub fn alloc<T>(&'a self, value: T) -> &'a mut T {
// ... same implementation ...
// The 'a lifetime ensures the reference can't outlive the arena
}
}
Checkpoint: Verify the compiler rejects code that would create dangling references.
Phase 5: Implement Drop and Add Benchmarks
Goal: Clean up memory properly and prove the performance benefit.
impl Drop for Arena<'_> {
fn drop(&mut self) {
unsafe {
let layout = Layout::from_size_align_unchecked(self.capacity, 1);
std::alloc::dealloc(self.base, layout);
}
}
}
// Create benchmark comparing Arena::alloc vs Box::new
#[cfg(test)]
mod benchmarks {
use super::*;
use std::time::Instant;
#[test]
fn benchmark_comparison() {
const N: usize = 100_000;
// Benchmark Box::new
let start = Instant::now();
let boxes: Vec<_> = (0..N).map(|i| Box::new(i as u64)).collect();
let box_time = start.elapsed();
drop(boxes);
// Benchmark Arena
let arena = Arena::new(N * 16); // 16 bytes per u64 with alignment
let start = Instant::now();
let refs: Vec<_> = (0..N).map(|i| arena.alloc(i as u64)).collect();
let arena_time = start.elapsed();
drop(refs);
println!("Box::new: {:?} ({:.2} ns/alloc)",
box_time, box_time.as_nanos() as f64 / N as f64);
println!("Arena::alloc: {:?} ({:.2} ns/alloc)",
arena_time, arena_time.as_nanos() as f64 / N as f64);
println!("Speedup: {:.1}x",
box_time.as_nanos() as f64 / arena_time.as_nanos() as f64);
}
}
Testing Strategy
1. Alignment Verification Tests
#[test]
fn test_alignment_u8() {
let arena = Arena::new(1024);
let val: &mut u8 = arena.alloc(42u8);
assert_eq!(val as *const _ as usize % 1, 0); // 1-byte aligned
}
#[test]
fn test_alignment_u32() {
let arena = Arena::new(1024);
let _: &mut u8 = arena.alloc(1u8); // Misalign by 1
let val: &mut u32 = arena.alloc(42u32);
assert_eq!(val as *const _ as usize % 4, 0); // Must be 4-byte aligned
}
#[test]
fn test_alignment_u64() {
let arena = Arena::new(1024);
let _: &mut u8 = arena.alloc(1u8);
let _: &mut u16 = arena.alloc(2u16);
let val: &mut u64 = arena.alloc(42u64);
assert_eq!(val as *const _ as usize % 8, 0); // Must be 8-byte aligned
}
2. Capacity Exhaustion Tests
#[test]
#[should_panic(expected = "out of memory")]
fn test_capacity_exhaustion() {
let arena = Arena::new(16); // Small arena
let _: &mut u64 = arena.alloc(1u64); // 8 bytes
let _: &mut u64 = arena.alloc(2u64); // 8 more bytes
let _: &mut u64 = arena.alloc(3u64); // PANIC! No room
}
#[test]
fn test_exact_fit() {
let arena = Arena::new(8);
let val: &mut u64 = arena.alloc(42u64);
assert_eq!(*val, 42);
// Should succeed - exactly 8 bytes for a u64
}
3. Drop Behavior Tests
use std::sync::atomic::{AtomicUsize, Ordering};
static DROP_COUNT: AtomicUsize = AtomicUsize::new(0);
struct DropCounter;
impl Drop for DropCounter {
fn drop(&mut self) {
DROP_COUNT.fetch_add(1, Ordering::SeqCst);
}
}
#[test]
fn test_objects_not_dropped() {
// Simple arena does NOT call Drop on allocated objects
DROP_COUNT.store(0, Ordering::SeqCst);
{
let arena = Arena::new(1024);
let _ = arena.alloc(DropCounter);
let _ = arena.alloc(DropCounter);
let _ = arena.alloc(DropCounter);
} // Arena dropped here
// Objects were NOT dropped (leaked)
assert_eq!(DROP_COUNT.load(Ordering::SeqCst), 0);
}
4. Benchmark Tests
#[test]
fn benchmark_arena_vs_box() {
// Run with: cargo test --release benchmark_arena_vs_box -- --nocapture
const N: usize = 1_000_000;
// Box benchmark
let start = std::time::Instant::now();
let boxes: Vec<_> = (0..N).map(|i| Box::new(i)).collect();
let box_duration = start.elapsed();
std::hint::black_box(&boxes);
drop(boxes);
// Arena benchmark
let arena = Arena::new(N * 16);
let start = std::time::Instant::now();
let arena_refs: Vec<_> = (0..N).map(|i| arena.alloc(i)).collect();
let arena_duration = start.elapsed();
std::hint::black_box(&arena_refs);
println!("\n=== Benchmark Results ===");
println!("Box::new: {:?}", box_duration);
println!("Arena::alloc: {:?}", arena_duration);
println!("Speedup: {:.1}x",
box_duration.as_secs_f64() / arena_duration.as_secs_f64());
assert!(arena_duration < box_duration, "Arena should be faster than Box");
}
Common Pitfalls and Debugging
Pitfall 1: Forgetting Alignment
// WRONG: This will cause misaligned access
fn bad_alloc<T>(&self, value: T) -> &mut T {
let ptr = self.ptr.get();
let new_ptr = unsafe { ptr.add(std::mem::size_of::<T>()) };
// ... forgot to align!
}
// On ARM, this might crash. On x86, it's just slow.
Fix: Always calculate aligned address before allocation.
Pitfall 2: Integer Overflow in Alignment
// RISKY: the intermediate `addr + align - 1` can overflow for addresses
// near usize::MAX, and the modulo is slower than it needs to be
fn align_up_bad(addr: usize, align: usize) -> usize {
    addr + align - 1 - (addr + align - 1) % align
}
// PREFERRED: bitwise AND; requires align to be a power of 2 (true for
// all Rust alignments). The addition can still overflow in theory, so
// hardened code may want checked_add.
fn align_up(addr: usize, align: usize) -> usize {
    debug_assert!(align.is_power_of_two());
    (addr + align - 1) & !(align - 1)
}
Pitfall 3: Returning Raw Pointers Instead of References
// WRONG: User can use this after arena is dropped!
fn bad_alloc<T>(&self, value: T) -> *mut T {
// ...
ptr as *mut T // No lifetime connection to arena!
}
// CORRECT: Lifetime ties reference to arena
fn alloc<'a, T>(&'a self, value: T) -> &'a mut T {
// ...
unsafe { &mut *ptr } // Lifetime from &'a self
}
Pitfall 4: Not Handling Zero-Sized Types
// WRONG: Zero-sized types need special handling
fn alloc<T>(&self, value: T) -> &mut T {
let size = std::mem::size_of::<T>();
// If size == 0, we'd allocate 0 bytes and return... what?
}
// CORRECT: Handle ZSTs specially
fn alloc<T>(&self, value: T) -> &mut T {
    if std::mem::size_of::<T>() == 0 {
        // For a ZST: consume the value without running its destructor,
        // then return a dangling-but-aligned pointer (the NonNull::dangling idiom)
        std::mem::forget(value);
        return unsafe { &mut *std::ptr::NonNull::<T>::dangling().as_ptr() };
    }
    // ... normal allocation
}
Pitfall 5: Calling Drop on Arena-Allocated Objects
// WRONG: Objects aren't dropped, might leak resources
struct FileHandle { fd: i32 }
impl Drop for FileHandle {
fn drop(&mut self) { unsafe { libc::close(self.fd); } }
}
let arena = Arena::new(1024);
let handle = arena.alloc(FileHandle { fd: open(...) });
drop(arena); // FileHandle::drop is NOT called! File descriptor leaked!
// SOLUTION 1: Manual cleanup
unsafe { std::ptr::drop_in_place(handle); }
drop(arena);
// SOLUTION 2: Use TypedArena that tracks Drop types
// (more complex implementation)
Pitfall 6: Thread Safety Issues
// WRONG: Cell is not thread-safe
pub struct Arena {
ptr: Cell<*mut u8>, // Not Send or Sync!
}
// This will fail to compile (correctly):
// std::thread::spawn(move || arena.alloc(42));
// SOLUTION: Use AtomicPtr for thread-safe arena
use std::sync::atomic::{AtomicPtr, Ordering};
pub struct ThreadSafeArena {
ptr: AtomicPtr<u8>,
// ... needs compare-and-swap logic in alloc
}
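One possible shape of the compare-and-swap logic that comment alludes to - a sketch in which the bump offset is an AtomicUsize and each allocation retries until its reservation wins (alignment omitted for brevity):

use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

struct AtomicBump {
    buf: Box<[u8]>,
    used: AtomicUsize, // byte offset of the next free position
}

impl AtomicBump {
    fn new(capacity: usize) -> Self {
        AtomicBump { buf: vec![0u8; capacity].into_boxed_slice(), used: AtomicUsize::new(0) }
    }

    fn alloc_bytes(&self, size: usize) -> Option<usize> {
        let mut start = self.used.load(Ordering::Relaxed);
        loop {
            let end = start.checked_add(size)?;
            if end > self.buf.len() {
                return None; // out of space
            }
            // Try to claim [start, end); on failure another thread got
            // there first, so retry from its updated offset.
            match self.used.compare_exchange_weak(
                start, end, Ordering::Relaxed, Ordering::Relaxed,
            ) {
                Ok(_) => return Some(start),
                Err(current) => start = current,
            }
        }
    }
}

fn main() {
    let bump = Arc::new(AtomicBump::new(1024));
    let handles: Vec<_> = (0..4).map(|_| {
        let bump = bump.clone();
        std::thread::spawn(move || bump.alloc_bytes(64).unwrap())
    }).collect();
    let mut offsets: Vec<_> = handles.into_iter().map(|h| h.join().unwrap()).collect();
    offsets.sort();
    assert_eq!(offsets, vec![0, 64, 128, 192]); // disjoint reservations
}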
Questions to Guide Your Design
- Alignment Calculation
  - How do you calculate the next aligned address? `(addr + align - 1) & !(align - 1)`?
  - Why must alignment always be a power of 2?
- Deallocation
  - Why is `Drop` usually a "no-op" for objects in an arena?
  - How do you free the entire arena at once?
  - What happens to objects that implement `Drop`?
- Memory Safety
  - How do you ensure references returned by the arena are valid?
  - What lifetime bounds are needed on the `alloc` method?
  - Why is `Cell<*mut u8>` used for the bump pointer?
Thinking Exercise
Trace the Bump
Imagine your arena has 100 bytes starting at address 0x1000.
- You allocate a `u32` (4 bytes, 4-align).
- You allocate a `u8` (1 byte, 1-align).
- You allocate a `u64` (8 bytes, 8-align).
Trace through the allocation:
Initial state:
base = 0x1000
ptr = 0x1000
end = 0x1064 (0x1000 + 100)
Step 1: alloc(u32)
align = 4, size = 4
aligned = align_up(0x1000, 4) = 0x1000 (already aligned)
new_ptr = 0x1000 + 4 = 0x1004
ptr = 0x1004
returned: &mut u32 at 0x1000
Step 2: alloc(u8)
align = 1, size = 1
aligned = align_up(0x1004, 1) = 0x1004 (already aligned)
new_ptr = 0x1004 + 1 = 0x1005
ptr = 0x1005
returned: &mut u8 at 0x1004
Step 3: alloc(u64)
align = 8, size = 8
aligned = align_up(0x1005, 8) = ?
0x1005 + 8 - 1 = 0x100C
0x100C & !(8-1) = 0x100C & !7 = 0x100C & 0xFFF8 = 0x1008
aligned = 0x1008
new_ptr = 0x1008 + 8 = 0x1010
ptr = 0x1010
returned: &mut u64 at 0x1008
PADDING: 3 bytes between u8 and u64 (0x1005, 0x1006, 0x1007)
Final memory layout:
0x1000   0x1004  0x1005   0x1008          0x1010              0x1064
+--------+-------+--------+---------------+---------------------+
|  u32   |  u8   |  PAD   |      u64      |        FREE         |
|  (4B)  | (1B)  |  (3B)  |     (8B)      |     (84 bytes)      |
+--------+-------+--------+---------------+---------------------+
                                          ^                     ^
                                          ptr                   end
Answer: 3 bytes of padding were inserted between the u8 and u64.
Total used: 16 bytes (4 + 1 + 3 + 8)
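You can replay this trace as pure offset arithmetic with the align_up helper from Phase 3:

// Quick check of the trace above: three allocations as offset math.
fn align_up(addr: usize, align: usize) -> usize {
    (addr + align - 1) & !(align - 1)
}

fn main() {
    let base = 0x1000usize;
    let mut ptr = base;

    // Step 1: u32 (size 4, align 4) lands at 0x1000
    let a = align_up(ptr, 4);
    assert_eq!(a, 0x1000);
    ptr = a + 4;

    // Step 2: u8 (size 1, align 1) lands at 0x1004
    let b = align_up(ptr, 1);
    assert_eq!(b, 0x1004);
    ptr = b + 1;

    // Step 3: u64 (size 8, align 8) skips 3 padding bytes to 0x1008
    let c = align_up(ptr, 8);
    assert_eq!(c, 0x1008);
    ptr = c + 8;

    assert_eq!(ptr - base, 16); // total used: 4 + 1 + 3 + 8
}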
The Interview Questions They'll Ask
- "What is an arena allocator and why is it fast?"
  Answer: An arena allocator pre-allocates a large block of memory and allocates from it by bumping a pointer. It's fast because: (1) allocation is O(1) - just increment a pointer, (2) no per-object metadata overhead, (3) no free list to search, (4) no fragmentation management, (5) excellent cache locality.
- "What happens if an arena runs out of memory?"
  Answer: Most implementations panic or return an error. More sophisticated arenas might allocate additional chunks and chain them together (like bumpalo does).
- "How do you handle the `Drop` trait for objects inside an arena?"
  Answer: Simple arenas don't call Drop on their contents - this is a "memory leak" by design. TypedArena implementations track Drop types and call destructors before freeing memory. For most arena use cases (like compiler AST nodes), the types don't implement Drop anyway.
- "Why is cache locality better with an arena?"
  Answer: Arena-allocated objects are contiguous in memory. When iterating through them, the CPU prefetcher can efficiently load cache lines. With Box-allocated objects scattered across the heap, each access might be a cache miss.
- "Explain the alignment calculation formula."
  Answer: `(addr + align - 1) & !(align - 1)` works because: adding `align - 1` ensures we "overflow" into the next aligned block, then the bitwise AND with `!(align - 1)` masks off the lower bits to round down to the alignment boundary. This only works when align is a power of 2.
Extensions and Challenges
Extension 1: Typed Arena (TypedArena<T>)
Create an arena that only allocates objects of one type:
use std::cell::{Cell, RefCell};
use std::mem::MaybeUninit;

pub struct TypedArena<T> {
    chunks: RefCell<Vec<Box<[MaybeUninit<T>]>>>,
    ptr: Cell<*mut T>,
    end: Cell<*mut T>,
}
impl<T> TypedArena<T> {
pub fn alloc(&self, value: T) -> &mut T {
if self.ptr == self.end {
self.grow();
}
// ... bump allocation
}
fn grow(&self) {
// Allocate a new chunk, update ptr and end
}
}
impl<T> Drop for TypedArena<T> {
    fn drop(&mut self) {
        // Call Drop on all allocated objects!
        // NOTE: simplified - a real implementation must track how many
        // slots in each chunk were actually initialized and drop only
        // those; dropping uninitialized memory is undefined behavior.
        for mut chunk in self.chunks.borrow_mut().drain(..) {
            for item in chunk.iter_mut() {
                unsafe { std::ptr::drop_in_place(item.as_mut_ptr()) };
            }
        }
    }
}
Extension 2: Growing Arena (Linked Chunks)
Instead of panicking when full, allocate a new chunk:
pub struct GrowingArena {
chunks: RefCell<Vec<Box<[u8]>>>,
current: Cell<*mut u8>,
end: Cell<*mut u8>,
chunk_size: usize,
}
impl GrowingArena {
fn grow(&self) {
let new_chunk = vec![0u8; self.chunk_size].into_boxed_slice();
let ptr = new_chunk.as_ptr() as *mut u8;
let end = unsafe { ptr.add(self.chunk_size) };
self.chunks.borrow_mut().push(new_chunk);
self.current.set(ptr);
self.end.set(end);
}
}
Extension 3: Thread-Local Arena
Create a thread-local arena for lock-free allocation:
use std::cell::RefCell;
thread_local! {
static ARENA: RefCell<Arena> = RefCell::new(Arena::new(1024 * 1024));
}
pub fn alloc<T>(value: T) -> &'static mut T {
ARENA.with(|arena| {
// SAFETY: This is actually unsound without careful lifetime management!
// Real implementation needs scoped allocation or guard types.
unsafe { std::mem::transmute(arena.borrow().alloc(value)) }
})
}
Real-World Connections
The bumpalo Crate
bumpalo is the most popular arena allocator in Rust:
use bumpalo::Bump;
let bump = Bump::new();
// Allocate values
let x = bump.alloc(1);
let y = bump.alloc([1, 2, 3]);
// Allocate strings
let s = bump.alloc_str("Hello");
// Use with collections
let mut vec = bumpalo::collections::Vec::new_in(&bump);
vec.push(1);
vec.push(2);
The Rust Compiler (rustc)
rustc uses arenas extensively for AST nodes, types, and other compiler data structures:
// From rustc_arena (simplified)
pub struct DroplessArena {
// Arena for types that don't need Drop
chunks: RefCell<Vec<Box<[MaybeUninit<u8>]>>>,
ptr: Cell<*mut u8>,
end: Cell<*mut u8>,
}
pub struct TypedArena<T> {
// Arena for types that DO need Drop
chunks: RefCell<Vec<Box<[MaybeUninit<T>]>>>,
ptr: Cell<*mut T>,
end: Cell<*mut T>,
}
ECS Game Engines (Bevy, Legion)
Game engines use arena allocation for components:
// Conceptual example from ECS design
struct ComponentArena<T> {
data: Arena<T>,
entity_map: HashMap<Entity, *mut T>,
}
impl<T> ComponentArena<T> {
fn insert(&mut self, entity: Entity, component: T) {
let ptr = self.data.alloc(component);
self.entity_map.insert(entity, ptr);
}
fn get(&self, entity: Entity) -> Option<&T> {
self.entity_map.get(&entity).map(|&ptr| unsafe { &*ptr })
}
}
Hints in Layers
Hint 1: The Base Pointer
Use std::alloc::alloc to get your initial chunk of memory. Store it as a *mut u8.
Hint 2: Alignment
Use pointer::align_offset to find how many bytes you need to skip to satisfy the alignment of T. Or implement the (addr + align - 1) & !(align - 1) formula.
Hint 3: Safety
Wrap everything in a safe method pub fn alloc<T>(&self, value: T) -> &mut T. Be very careful with the lifetimes - the returned reference must not outlive &self.
Hint 4: Cell for Interior Mutability
You need Cell<*mut u8> for the bump pointer because alloc takes &self, not &mut self. This allows multiple allocations without exclusive access.
Hint 5: Testing Alignment
Use std::ptr::addr_of!() or cast to usize to verify addresses are properly aligned: addr % align == 0.
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Custom Allocators | "The Linux Programming Interface" by Kerrisk | Ch. 7: Memory Allocation |
| Alignment & Structs | "Computer Systems: A Programmer's Perspective" | Ch. 3.9: Heterogeneous Data Structures |
| Unsafe Memory | "Programming Rust" by Blandy & Orendorff | Ch. 21: Unsafe Code |
| Arena Patterns | "Rust for Rustaceans" by Gjengset | Ch. 3: Designing Interfaces |
| Allocator Internals | "The Rustonomicon" (online) | Memory Allocation |
Success Criteria
You have completed this project when:
- Your arena can allocate objects of any type with correct alignment
- Your arena is 10x+ faster than `Box::new()` in benchmarks
- All tests pass, including alignment verification
- You can explain the alignment formula without looking at notes
- You understand why Drop is not called on arena objects
- You can draw the memory layout after a sequence of allocations
- You could implement a growing arena as an extension
Next Steps
After completing this project, you'll be ready for:
- Project 6: Atomic Lock-Free Queue - Use your arena knowledge to build cache-aligned concurrent data structures
- Project 7: Zero-Copy Protocol Parser - Combine arena allocation with zero-copy parsing
- Project 12: High-Performance KV Store - Use arenas for the index layer of a database engine
This project is part of the Advanced Rust Ecosystem Deep Dive learning path.