Project 3: Custom Arena Allocator (Memory Locality)
From the Advanced Rust Ecosystem Deep Dive
- Main Programming Language: Rust
- Coolness Level: Level 3: Genuinely Clever
- Difficulty: Level 3: Advanced
- Knowledge Area: Memory Management / Unsafe
- Time Estimate: 1 week
What You Will Build
A high-performance "Bump Allocator" (Arena). You will pre-allocate a large chunk of memory and provide an alloc<T> method that returns a reference to a part of that memory, without calling the global allocator for every object.
Learning Objectives
By the end of this project, you will:
- Understand OS-level memory allocation - How `brk`, `sbrk`, and `mmap` work at the kernel level
- Master memory alignment - Why CPUs require aligned access and how to calculate proper alignment
- Implement unsafe Rust safely - Use raw pointers while maintaining Rust's safety guarantees through invariants
- Understand allocator tradeoffs - Why general-purpose allocators are slow and when to use specialized ones
- Implement the bump allocation pattern - The simplest and fastest allocation strategy
- Handle lifetimes with raw pointers - Ensure arena-allocated references don't outlive the arena
- Build production-quality unsafe code - Use `Layout`, `align_offset`, and proper drop semantics
- Benchmark allocation performance - Measure and compare against `Box::new()`
- Connect theory to practice - Understand how `bumpalo`, `rustc`, and game engines use arenas
- Debug memory issues - Identify and fix common arena allocator bugs
Deep Theoretical Foundation
Before writing any code, you must understand the layers of abstraction between your program and physical RAM. This section will take you from the kernel level up to Rust's allocation APIs.
Part 1: Memory Allocation at the Operating System Level
When your Rust program calls Box::new(42), what actually happens? Let's trace the path from your code down to the kernel.
YOUR CODE: Box::new(42)
|
v
RUST GLOBAL ALLOCATOR (jemalloc/System)
|
v
LIBC (malloc/free)
|
v
KERNEL SYSCALLS (brk/mmap)
|
v
PHYSICAL MEMORY (RAM)
The brk and sbrk System Calls
In traditional Unix systems, a process has a "data segment" that grows upward from the program's initial data. The "break" is the address where this segment ends.
PROCESS MEMORY LAYOUT (Traditional)

Higher Addresses
+----------------------------------------------------------+
|                          STACK                           |
|                       (grows down)                       |
+----------------------------------------------------------+
|                       (free space)                       |
+----------------------------------------------------------+
| - - - - - - - - - -  the "break"  - - - - - - - - - - -  |
|                          HEAP                            |
|                       (grows up)                         |
+----------------------------------------------------------+
|                      DATA SEGMENT                        |
|                (initialized global vars)                 |
+----------------------------------------------------------+
|                      TEXT SEGMENT                        |
|                  (your program code)                     |
+----------------------------------------------------------+
Address 0x0
Book Reference: "The Linux Programming Interface" by Michael Kerrisk, Chapter 7: Memory Allocation, provides an in-depth explanation of brk/sbrk and the process memory layout.
The sbrk(n) system call moves the "break" by n bytes, effectively allocating that memory:
// Simplified example of how sbrk works (C code)
#include <stddef.h>
#include <unistd.h>

void* my_malloc(size_t size) {
    void* ptr = sbrk(0);              // Get current break
    if (sbrk(size) == (void*)-1) {    // Move break up by 'size' bytes
        return NULL;                  // Allocation failed (sbrk returns (void*)-1)
    }
    return ptr;                       // Return pointer to the new memory
}
However, brk/sbrk have a critical limitation: you can only free memory at the "top" of the heap. If you allocate A, B, C, and then free B, you cannot reclaim that memory until A is also freed.
The mmap System Call
Modern allocators prefer mmap for large allocations. Unlike brk, mmap can allocate memory anywhere in the virtual address space and release it independently.
MODERN PROCESS MEMORY LAYOUT (with mmap)

+-----------------------------------------------------------+
|                          STACK                            |
+-----------------------------------------------------------+
|                                                           |
|        MMAP REGION (libraries, large allocations)         |
|   +------------+   +--------------+   +--------------+    |
|   |  libc.so   |   |  your alloc  |   |  another.so  |    |
|   +------------+   +--------------+   +--------------+    |
|                                                           |
+-----------------------------------------------------------+
|                     HEAP (brk area)                       |
|                   (small allocations)                     |
+-----------------------------------------------------------+
|                       DATA + BSS                          |
+-----------------------------------------------------------+
|                      TEXT SEGMENT                         |
+-----------------------------------------------------------+
Key Insight: mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0) asks the kernel for size bytes of fresh memory, not backed by any file. This is what Rust's std::alloc::alloc() ultimately calls for large allocations.
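To make this concrete, here is a minimal sketch of the same call made directly from Rust, assuming the `libc` crate is available as a dependency (Unix-only):

use std::ptr;

// Request `size` bytes of fresh anonymous memory straight from the kernel.
fn mmap_block(size: usize) -> *mut u8 {
    let ptr = unsafe {
        libc::mmap(
            ptr::null_mut(),                        // let the kernel pick the address
            size,
            libc::PROT_READ | libc::PROT_WRITE,
            libc::MAP_PRIVATE | libc::MAP_ANONYMOUS,
            -1,                                     // no backing file
            0,
        )
    };
    if ptr == libc::MAP_FAILED {
        panic!("mmap failed");
    }
    ptr as *mut u8
}

fn main() {
    let block = mmap_block(4096);
    unsafe {
        block.write(42);                            // the memory is immediately usable
        assert_eq!(block.read(), 42);
        libc::munmap(block as *mut libc::c_void, 4096);
    }
}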
Part 2: General-Purpose Allocators and Why Theyโre Slow
When you call Box::new() in Rust, you're using a general-purpose allocator. Rust has historically used jemalloc (a high-performance allocator from FreeBSD) and now defaults to the system allocator. Let's understand why these are "slow" for certain workloads.
The Three Costs of General-Purpose Allocation
1. Fragmentation Management
EXTERNAL FRAGMENTATION EXAMPLE

Initial State:
+-----------------------------------------------------------+
|                     FREE (1000 bytes)                      |
+-----------------------------------------------------------+

After allocating A(100), B(200), C(100):
+-------+-----------+-------+-------------------------------+
| A:100 |   B:200   | C:100 |       FREE (600 bytes)        |
+-------+-----------+-------+-------------------------------+

After freeing B:
+-------+-----------+-------+-------------------------------+
| A:100 | FREE:200  | C:100 |       FREE (600 bytes)        |
+-------+-----------+-------+-------------------------------+

Now you want to allocate D(500):
ERROR! You have 800 bytes free, but the largest contiguous block is only 600!
This is EXTERNAL FRAGMENTATION.
General-purpose allocators spend significant CPU cycles managing free lists, coalescing adjacent free blocks, and searching for the best-fit hole.
2. Metadata Overhead
Every allocation needs bookkeeping:
TYPICAL ALLOCATION BLOCK (simplified)
+-------------------+-----------------------------------------+
|  HEADER (8-16 B)  |               USER DATA                 |
+-------------------+-----------------------------------------+
| - size            |                                         |
| - prev/next ptr   |  This is what you get back from malloc  |
| - flags           |                                         |
+-------------------+-----------------------------------------+
        ^
        Hidden from the user but consumes memory

If you're allocating millions of small objects (say, 16 bytes each), this 8-16 byte overhead can DOUBLE your memory usage!
3. Thread-Safety Locking
In a multithreaded program, allocators must synchronize access:
THREAD 1                      THREAD 2
   |                             |
   v                             v
malloc(100)                  malloc(200)
   |                             |
   v                             v
+-------------------------------------------+
|         LOCK (global heap lock)           |  <-- CONTENTION!
|        Only one thread at a time          |
+-------------------------------------------+
Modern allocators like jemalloc use thread-local caches to reduce lock contention, but the overhead is still non-zero.
Book Reference: "Computer Systems: A Programmer's Perspective" by Bryant and O'Hallaron, Chapter 9.9: Dynamic Memory Allocation, covers these concepts with detailed examples.
Part 3: Arena (Bump) Allocators - The Simplest Pattern
An arena allocator is radically simple: allocate by "bumping" a pointer forward. Never free individual objects. Free everything at once when the arena is dropped.
ARENA (BUMP) ALLOCATOR - THE CORE IDEA

Initial State:
base                                                        end
 |                                                           |
 v                                                           v
 +-----------------------------------------------------------+
 |                        FREE SPACE                          |
 +-----------------------------------------------------------+
 ^
 ptr (bump pointer, starts at base)

After alloc(A, 32 bytes):
 +--------------+--------------------------------------------+
 |  A (32 B)    |                 FREE SPACE                  |
 +--------------+--------------------------------------------+
                ^
                ptr (moved forward 32 bytes)

After alloc(B, 16 bytes):
 +--------------+---------+----------------------------------+
 |  A (32 B)    | B (16 B)|            FREE SPACE             |
 +--------------+---------+----------------------------------+
                          ^
                          ptr (moved forward another 16 bytes)

After alloc(C, 8 bytes):
 +--------------+---------+--------+---------------------------+
 |  A (32 B)    | B (16 B)| C (8 B)|        FREE SPACE          |
 +--------------+---------+--------+---------------------------+
                                   ^
                                   ptr (moved forward another 8 bytes)

FREE OPERATION (individual):
 THERE IS NO FREE. You cannot deallocate individual objects.
 This is the trade-off for speed.

RESET OPERATION (everything at once):
 +-----------------------------------------------------------+
 |        ALL MEMORY NOW FREE (objects A, B, C gone)          |
 +-----------------------------------------------------------+
 ^
 ptr (reset back to base)
Why Arena Allocation is Fast
| Operation | General-Purpose | Arena |
|---|---|---|
| Allocation | Search free list, split block, update metadata | Bump pointer, done |
| Time Complexity | O(n) or O(log n) | O(1) - constant time |
| Locking (multi-threaded) | Global or per-arena lock | Per-arena lock (smaller contention) |
| Metadata per object | 8-16 bytes | 0 bytes |
| Deallocation | Update free list, possibly coalesce | No-op (or reset all) |
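To make the table concrete, here is a deliberately minimal sketch of the bump step itself - just a fixed byte buffer and one offset. Alignment handling and typed allocation come later in this guide; this only shows why allocation is a single addition:

use std::cell::Cell;

// Minimal bump allocator over a fixed byte buffer (alignment ignored).
struct Bump {
    buf: Box<[u8]>,
    used: Cell<usize>, // offset of the next free byte
}

impl Bump {
    fn new(capacity: usize) -> Self {
        Bump { buf: vec![0u8; capacity].into_boxed_slice(), used: Cell::new(0) }
    }

    // Reserve `size` bytes: the whole "allocation" is one addition.
    fn alloc_bytes(&self, size: usize) -> Option<*const u8> {
        let start = self.used.get();
        if start + size > self.buf.len() {
            return None; // out of space: no free list to search, just fail
        }
        self.used.set(start + size);
        Some(self.buf[start..].as_ptr())
    }
}

fn main() {
    let bump = Bump::new(64);
    let a = bump.alloc_bytes(32).unwrap();
    let b = bump.alloc_bytes(16).unwrap();
    // Consecutive allocations are adjacent in memory.
    assert_eq!(unsafe { a.add(32) }, b);
    assert!(bump.alloc_bytes(32).is_none()); // only 16 bytes left
}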
Part 4: Memory Alignment - Why CPUs Need It
Modern CPUs don't read memory one byte at a time. They read in "words" (typically 4 or 8 bytes on modern systems). Data is "aligned" when it starts at an address that is a multiple of its alignment - which, for primitive types, usually equals its size.
MEMORY ALIGNMENT VISUALIZATION

Physical memory addresses:
+----+----+----+----+----+----+----+----+----+----+----+----+----+
|  0 |  1 |  2 |  3 |  4 |  5 |  6 |  7 |  8 |  9 | 10 | 11 | 12 |
+----+----+----+----+----+----+----+----+----+----+----+----+----+

u32 (4 bytes) ALIGNED at address 4:
+----+----+----+----+===================+----+----+----+----+----+
|    |    |    |    |  u32 value here   |    |    |    |    |    |
+----+----+----+----+===================+----+----+----+----+----+
                     ^
                     Address 4 is divisible by 4 -> ALIGNED

u32 (4 bytes) MISALIGNED at address 3:
+----+----+----+===================+----+----+----+----+----+----+
|    |    |    | u32 spans two     |    |    |    |    |    |    |
|    |    |    | 4-byte words      |    |    |    |    |    |    |
+----+----+----+===================+----+----+----+----+----+----+
                ^
                Address 3 is NOT divisible by 4 -> MISALIGNED
The Cost of Misalignment
On x86/x64: Misaligned access is allowed but SLOW:
- Aligned access: 1 memory operation
- Misaligned access: 2 memory operations + CPU overhead to combine them
On ARM (and others): Misaligned access may cause:
- A hardware trap (exception)
- The kernel emulating the access (VERY slow)
- On some CPUs, a crash
ALIGNED u64 READ (address 8):

        boundary between lines (8-byte "lines" shown for illustration;
        real cache lines are typically 64 bytes)
                     |
 +-------------------+---------------------+
 |      Line 0       |       Line 1        |
 |      [0-7]        |       [8-15]        |
 +-------------------+---------------------+
                     +---------------------+
                     |    u64 @ addr 8     |
                     |      (8 bytes)      |
                     +---------------------+
 CPU reads Line 1 -> gets all 8 bytes in ONE operation

MISALIGNED u64 READ (address 5):

                     |
 +-------------------+---------------------+
 |      Line 0       |       Line 1        |
 |      [0-7]        |       [8-15]        |
 +-------------------+---------------------+
              +---------------------+
              |    u64 @ addr 5     |
              |      (8 bytes)      |
              +---------------------+
                |              |
                v              v
             3 bytes        5 bytes
             from           from
             Line 0         Line 1

 CPU reads Line 0 AND Line 1 -> TWO operations + combine
Book Reference: "Computer Systems: A Programmer's Perspective" Chapter 3.9: Heterogeneous Data Structures covers alignment in detail.
Rust provides the std::alloc::Layout type to handle alignment correctly:
Rust provides the std::alloc::Layout type to handle alignment correctly:
use std::alloc::Layout;
// Create a layout for a single i32
let layout = Layout::new::<i32>();
assert_eq!(layout.size(), 4); // 4 bytes
assert_eq!(layout.align(), 4); // 4-byte alignment
// Create a layout for a single u8
let layout = Layout::new::<u8>();
assert_eq!(layout.size(), 1); // 1 byte
assert_eq!(layout.align(), 1); // 1-byte alignment
// Create a layout for a struct
#[repr(C)]
struct Example {
a: u8, // 1 byte, 1-align
b: u32, // 4 bytes, 4-align
c: u16, // 2 bytes, 2-align
}
let layout = Layout::new::<Example>();
// Size is 12 (with padding), align is 4 (largest field alignment)
size_of vs align_of
size_of::<T>(): How many bytes the value OCCUPIES in memory
align_of::<T>(): What address boundary the value MUST start on
EXAMPLE: A struct with mixed types

#[repr(C)] // Use C layout (predictable)
struct Mixed {
    a: u8,   // size=1, align=1
    b: u32,  // size=4, align=4
    c: u8,   // size=1, align=1
}

Memory layout (repr(C)):
+-----+-----------------+-----------------------+-----+-----------------+
|  0  |   1    2    3   |   4    5    6    7    |  8  |   9   10   11   |
+-----+-----------------+-----------------------+-----+-----------------+
|  a  |     PADDING     |           b           |  c  |     PADDING     |
| u8  |    (3 bytes)    |          u32          | u8  |    (3 bytes)    |
+-----+-----------------+-----------------------+-----+-----------------+

size_of::<Mixed>()  = 12 bytes (includes padding)
align_of::<Mixed>() = 4 bytes (largest member alignment)

WITHOUT repr(C), Rust may reorder fields:
+-----------------------+-----+-----+-----------------+
|   0    1    2    3    |  4  |  5  |   6    7        |
+-----------------------+-----+-----+-----------------+
|           b           |  a  |  c  |     PADDING     |
|          u32          | u8  | u8  |    (2 bytes)    |
+-----------------------+-----+-----+-----------------+

size_of::<Mixed>()  = 8 bytes (less padding!)
align_of::<Mixed>() = 4 bytes (same)
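You can verify these numbers on your own machine with a few asserts. One caveat: the default (non-repr(C)) layout is unspecified, so the 8-byte figure is typical on current compilers but not guaranteed:

use std::mem::{align_of, size_of};

#[repr(C)]
#[allow(dead_code)]
struct MixedC {
    a: u8,
    b: u32,
    c: u8,
}

// Same fields, default (Rust) representation: the compiler may reorder.
#[allow(dead_code)]
struct MixedRust {
    a: u8,
    b: u32,
    c: u8,
}

fn main() {
    // repr(C) must keep declaration order, so padding is forced: 12 bytes.
    assert_eq!(size_of::<MixedC>(), 12);
    assert_eq!(align_of::<MixedC>(), 4);

    // The default layout is unspecified; current compilers typically
    // reorder the fields down to 8 bytes, but do not rely on it.
    println!("default repr size = {}", size_of::<MixedRust>());
    assert_eq!(align_of::<MixedRust>(), 4);
}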
Part 6: The Alignment Calculation for Bump Allocation
When bumping a pointer, you must ensure the result is properly aligned for the type you're allocating. Here's the formula:
ALIGNMENT FORMULA

Given:
- current_ptr: The current position in the arena
- align: The required alignment for the type

To find the next aligned address:

    aligned_ptr = (current_ptr + align - 1) & !(align - 1)

EXAMPLE: align a pointer to an 8-byte boundary

current_ptr = 13
align = 8

Step 1: current_ptr + align - 1 = 13 + 8 - 1 = 20
Step 2: align - 1 = 7 = 0b00000111 (binary)
Step 3: !(align - 1) = !7 = 0b...11111000 (inverts bits)
Step 4: 20 & 0b...11111000 = 16

Result: 16 is the next address divisible by 8

VISUAL:
  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16
  ^                               ^                   ^            ^
  8-aligned                       8-aligned     current=13    next aligned=16
                                                (skip 3 bytes of padding)
Rust provides pointer::align_offset() to do this calculation safely:
let mut buf = [0u8; 64];                       // any backing memory works for this demo
let ptr: *mut u8 = buf.as_mut_ptr();
let align = std::mem::align_of::<u64>();       // 8
let offset = ptr.align_offset(align);          // how many bytes to skip
if offset == usize::MAX {
    // Alignment is impossible (rare, usually for exotic alignments)
    panic!("Cannot align");
}
let aligned_ptr = unsafe { ptr.add(offset) };  // pointer arithmetic is unsafe
Part 7: Arena vs Pool vs Slab Allocators
Different allocation strategies are optimized for different use cases:
ALLOCATOR COMPARISON

ARENA (BUMP) ALLOCATOR
----------------------
Pattern: Objects allocated together, freed together

+---+---+---+---+---+---+-------------------------------+
| A | B | C | D | E | F |          FREE SPACE           |
+---+---+---+---+---+---+-------------------------------+
                        ^
                        bump pointer

+ Fastest allocation (O(1))
+ Zero metadata overhead
+ Cache-friendly (contiguous)
- Cannot free individual objects
- Wastes memory if objects have varied lifetimes

Use cases: Compilers (AST nodes), game frame allocations,
           request-scoped web servers

POOL ALLOCATOR
--------------
Pattern: Fixed-size objects, frequent alloc/free

+----+----+----+----+----+----+----+----+----+----+
|USED|FREE|USED|USED|FREE|FREE|USED|FREE|USED|FREE|
+----+----+----+----+----+----+----+----+----+----+
      |              |    |         |         |
      +--------------+----+---------+---------+
                Free list (linked)

+ Fast allocation (O(1))
+ CAN free individual objects
+ No fragmentation (fixed size)
- Only works for one size
- Slight overhead for free list pointers

Use cases: ECS entity storage, particle systems,
           network connection handles

SLAB ALLOCATOR
--------------
Pattern: Objects of the same TYPE, kernel-style

+--------------------------------------------------------+
|                       SLAB CACHE                        |
|                (for struct: task_struct)                |
|                                                         |
|  +---------------+------------------+----------------+  |
|  |  SLAB (full)  |  SLAB (partial)  |  SLAB (empty)  |  |
|  +---------------+------------------+----------------+  |
+--------------------------------------------------------+

+ Objects are pre-initialized (construction cost amortized)
+ Excellent cache locality
+ Color optimization to reduce cache conflicts
- Complex implementation
- Type-specific caches

Use cases: Linux kernel object caches, high-frequency
           object creation (network packets, inodes)

WHEN TO USE EACH:

           +--------------------------+
           |  Can objects be freed    |
           |  individually?           |
           +------------+-------------+
                        |
            +-----------+-----------+
            v                       v
           NO                      YES
            |                       |
            v                       v
      +-----------+   +---------------------------+
      |   ARENA   |   |  Are all objects the      |
      |  (Bump)   |   |  same size?               |
      +-----------+   +-------------+-------------+
                                    |
                        +-----------+-----------+
                        v                       v
                       YES                     NO
                        |                       |
                        v                       v
        +------------------------+   +---------------------+
        |  Do objects need       |   |  GENERAL PURPOSE    |
        |  pre-initialization?   |   |  (malloc/jemalloc)  |
        +-----------+------------+   +---------------------+
                    |
            +-------+-------+
            v               v
           YES             NO
            |               |
            v               v
    +---------------+  +---------------+
    |     SLAB      |  |     POOL      |
    |  (with ctor)  |  |  (free list)  |
    +---------------+  +---------------+
Part 8: The Allocator Trait (Unstable)
Rust's Allocator trait (currently unstable, available with #![feature(allocator_api)]) defines the interface for custom allocators:
#![feature(allocator_api)]
use std::alloc::{Allocator, AllocError, Layout};
use core::ptr::NonNull;
pub trait Allocator {
fn allocate(&self, layout: Layout) -> Result<NonNull<[u8]>, AllocError>;
unsafe fn deallocate(&self, ptr: NonNull<u8>, layout: Layout);
// Optional methods with default implementations:
fn allocate_zeroed(&self, layout: Layout) -> Result<NonNull<[u8]>, AllocError> { ... }
unsafe fn grow(...) -> Result<NonNull<[u8]>, AllocError> { ... }
unsafe fn shrink(...) -> Result<NonNull<[u8]>, AllocError> { ... }
}
This trait allows you to use your arena with standard collections:
#![feature(allocator_api)]
use std::vec::Vec;
let arena = MyArena::new(1024);
// Create a Vec that uses your arena!
let mut vec: Vec<i32, &MyArena> = Vec::new_in(&arena);
vec.push(1);
vec.push(2);
vec.push(3);
// All allocations come from the arena, not the global allocator
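To see how the pieces fit together, here is one possible sketch of a bump arena implementing the unstable trait (nightly only). BumpArena is a hypothetical minimal type written for this guide, not bumpalo:

#![feature(allocator_api)]

use std::alloc::{AllocError, Allocator, Layout};
use std::cell::Cell;
use std::ptr::NonNull;

struct BumpArena {
    base: *mut u8,
    capacity: usize,
    used: Cell<usize>, // byte offset of the next free position
}

impl BumpArena {
    fn new(capacity: usize) -> Self {
        let layout = Layout::from_size_align(capacity, 16).unwrap();
        let base = unsafe { std::alloc::alloc(layout) };
        assert!(!base.is_null(), "failed to allocate arena memory");
        BumpArena { base, capacity, used: Cell::new(0) }
    }
}

impl Drop for BumpArena {
    fn drop(&mut self) {
        let layout = Layout::from_size_align(self.capacity, 16).unwrap();
        unsafe { std::alloc::dealloc(self.base, layout) };
    }
}

unsafe impl Allocator for BumpArena {
    fn allocate(&self, layout: Layout) -> Result<NonNull<[u8]>, AllocError> {
        // Align the bump position for this layout, then reserve the bytes.
        let addr = self.base as usize + self.used.get();
        let aligned = (addr + layout.align() - 1) & !(layout.align() - 1);
        let new_used = aligned - self.base as usize + layout.size();
        if new_used > self.capacity {
            return Err(AllocError);
        }
        self.used.set(new_used);
        let ptr = NonNull::new(aligned as *mut u8).ok_or(AllocError)?;
        Ok(NonNull::slice_from_raw_parts(ptr, layout.size()))
    }

    unsafe fn deallocate(&self, _ptr: NonNull<u8>, _layout: Layout) {
        // No-op: a bump arena frees everything at once in Drop.
    }
}

fn main() {
    let arena = BumpArena::new(1024);
    let mut v: Vec<i32, &BumpArena> = Vec::new_in(&arena);
    v.extend([1, 2, 3]);
    assert_eq!(v.iter().sum::<i32>(), 6);
}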
Part 9: Deallocation Strategies in Arenas
Arenas typically use one of these deallocation approaches:
DEALLOCATION STRATEGIES

1. NO DEALLOCATION (Most Common)
--------------------------------
Objects are never freed individually.
The entire arena is freed when dropped.

alloc()     -> bump pointer -> return reference
drop(arena) -> dealloc the entire backing memory

Pros: Simplest, fastest
Cons: Can't reclaim memory until the arena dies

2. RESET (Reuse Arena)
----------------------
Reset the bump pointer to the start, reuse all memory.

Before reset:
+----+----+----+----+----------------------+
| A  | B  | C  | D  |        (free)        |
+----+----+----+----+----------------------+
                    ^
                    bump ptr

After reset():
+------------------------------------------+
|             (all free again)             |
+------------------------------------------+
^
bump ptr (back to start)

WARNING: All previous references are now INVALID!
Must use unsafe or ensure all refs are dropped.

3. LIFO DEALLOCATION (Stack-like)
---------------------------------
Can deallocate the LAST allocation only.

alloc(A) -> alloc(B) -> alloc(C)
+----+----+----+---------------------------+
| A  | B  | C  |          (free)           |
+----+----+----+---------------------------+
               ^
               bump ptr

dealloc(C):
+----+----+--------------------------------+
| A  | B  |             (free)             |
+----+----+--------------------------------+
          ^
          bump ptr (moved back)

dealloc(B):
+----+-------------------------------------+
| A  |               (free)                |
+----+-------------------------------------+
     ^
     bump ptr

dealloc(A): fine now, since A is last. But note the constraint:
you could NOT have deallocated A while B or C was still allocated.

4. GENERATION-BASED (Scope Markers)
-----------------------------------
Save the current position, allocate, then restore.

let marker = arena.save();   // marker = current bump ptr position
arena.alloc(A);
arena.alloc(B);
arena.alloc(C);
arena.restore(marker);       // bump ptr = marker, A/B/C are "freed"

Useful for: Scratch allocations, temporary work buffers
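Here is a sketch of strategy 4 on a byte-offset bump design; save and restore are hypothetical method names used by this guide:

use std::cell::Cell;

struct ScratchArena {
    buf: Box<[u8]>,
    used: Cell<usize>,
}

#[derive(Clone, Copy)]
struct Marker(usize); // an opaque snapshot of the bump offset

impl ScratchArena {
    fn new(capacity: usize) -> Self {
        ScratchArena { buf: vec![0u8; capacity].into_boxed_slice(), used: Cell::new(0) }
    }

    fn alloc_bytes(&self, size: usize) -> Option<usize> {
        let start = self.used.get();
        if start + size > self.buf.len() { return None; }
        self.used.set(start + size);
        Some(start)
    }

    fn save(&self) -> Marker {
        Marker(self.used.get())
    }

    // Everything allocated after `save` is logically freed at once.
    // CAUTION: in a real arena this is only sound if no references into
    // the restored region are still alive.
    fn restore(&self, m: Marker) {
        self.used.set(m.0);
    }
}

fn main() {
    let arena = ScratchArena::new(128);
    arena.alloc_bytes(16).unwrap();      // long-lived allocation
    let marker = arena.save();
    arena.alloc_bytes(64).unwrap();      // temporary scratch work
    arena.restore(marker);               // scratch space reclaimed
    assert_eq!(arena.used.get(), 16);
}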
Real World Outcome
You will have an Arena struct that can allocate objects 10-50x faster than Box::new(). You will use this to build a simple linked list that is cache-friendly because all nodes are contiguous in memory.
Example Benchmark:
$ cargo bench
Standard Box: 45.2 ns/alloc
Arena Alloc: 1.8 ns/alloc (25x SPEED UP!)
Full Example Session:
$ cargo new --lib arena-allocator
Created library `arena-allocator` package
$ cd arena-allocator
$ cargo add criterion --dev --features html_reports
Updating crates.io index
Adding criterion v0.5.1 to dev-dependencies
$ cargo build
Compiling arena-allocator v0.1.0
Finished dev [unoptimized + debuginfo] target(s) in 0.89s
$ cargo test
Compiling arena-allocator v0.1.0
Finished test [unoptimized + debuginfo] target(s) in 0.42s
Running unittests src/lib.rs
running 8 tests
test tests::test_basic_allocation ... ok
test tests::test_alignment_u8 ... ok
test tests::test_alignment_u32 ... ok
test tests::test_alignment_u64 ... ok
test tests::test_mixed_types ... ok
test tests::test_capacity_exhaustion ... ok
test tests::test_drop_behavior ... ok
test tests::test_lifetime_bounds ... ok
test result: ok. 8 passed; 0 failed; 0 ignored
$ cargo bench --bench allocation_comparison
Compiling arena-allocator v0.1.0
Finished bench [optimized] target(s) in 3.21s
Running benches/allocation_comparison.rs
+------------------------------------------------------------------+
|              Allocation Performance Comparison                   |
|------------------------------------------------------------------|
|         Testing: 10,000 allocations of various types             |
+------------------------------------------------------------------+

Benchmarking Box::new(i32)
Warming up for 3.0000 s
Collecting 100 samples in estimated 5.2340 s
box_i32/10k_allocs    time:  [450.12 us 452.34 us 454.89 us]
                      thrpt: [21.98M allocs/s 22.11M allocs/s 22.22M allocs/s]

Memory Analysis:
Total allocations: 10,000
Bytes allocated: 160,000 (16 bytes per Box: 4 bytes of data + 12 bytes
                 of allocator metadata and rounding)
Allocation rate: 352.7 MB/s

Benchmarking Arena::alloc::<i32>()
Warming up for 3.0000 s
Collecting 100 samples in estimated 5.0123 s
arena_i32/10k_allocs  time:  [17.82 us 17.95 us 18.12 us]
                      thrpt: [551.9M allocs/s 557.1M allocs/s 561.4M allocs/s]

Memory Analysis:
Total allocations: 1 (initial arena only)
Bytes allocated: 40,000 (4 bytes per i32, no metadata)
Allocation rate: 0 MB/s (no per-object system calls!)

PERFORMANCE SUMMARY:
+----------------------+--------------+---------------+----------+
| Metric               | Box::new     | Arena::alloc  | Speedup  |
|----------------------+--------------+---------------+----------|
| Time per 10k allocs  | 452 us       | 18 us         | 25.1x    |
| Throughput           | 22.1M ops/s  | 557.1M ops/s  | 25.2x    |
| System allocations   | 10,000       | 1             | 10,000x  |
| Memory overhead      | 75%          | 0%            | Better   |
| Cache behavior       | Poor         | Excellent     | Better   |
+----------------------+--------------+---------------+----------+
$ cargo run --example linked_list
Compiling arena-allocator v0.1.0
Finished dev [unoptimized + debuginfo] target(s) in 0.42s
Running `target/debug/examples/linked_list`
=== Arena-Allocated Linked List Demo ===
[1] Creating arena with 4KB capacity...
Arena @ 0x7f8b4c000000
Capacity: 4096 bytes
[2] Allocating 100 linked list nodes...
Node 0 @ 0x7f8b4c000000
Node 1 @ 0x7f8b4c000018
Node 2 @ 0x7f8b4c000030
Node 3 @ 0x7f8b4c000048
...
Node 99 @ 0x7f8b4c000948
[3] Memory locality analysis:
All 100 nodes are CONTIGUOUS in memory!
Span: 2400 bytes (100 nodes * 24 bytes)
Cache lines touched: 38 (vs ~100 for Box-allocated)
[4] Traversing list...
Sum of values: 4950
Traversal time: 0.8 us
[5] Comparison with Box-allocated list:
Box traversal time: 2.1 us (2.6x slower due to cache misses)
[Summary]
All nodes allocated with ZERO system calls
Memory layout is cache-optimal
Arena drops all memory at once when it goes out of scope
The Core Question Youโre Answering
"Can I allocate memory faster than the Operating System?"
Yes, you can. By asking the OS for one giant block and managing the "bumping" of a pointer yourself, you bypass the complex bookkeeping and locking of general-purpose allocators like jemalloc.
Concepts You Must Understand First
Stop and research these before coding:
- Memory Alignment
  - Why can't you just put a `u32` at any address? (Hint: CPU bus errors)
  - What happens on ARM vs x86 for misaligned access?
  - Book Reference: "Computer Systems: A Programmer's Perspective" Ch. 3.9
- Raw Pointers and `Layout`
  - How does `std::alloc::Layout` calculate size and alignment?
  - What's the difference between `*const T` and `*mut T`?
  - Book Reference: "The Rust Programming Language" Ch. 19
- PhantomData & Lifetimes
  - How do you ensure the references returned by the arena don't outlive the arena itself?
  - Why is `PhantomData<&'a T>` sometimes needed?
  - Book Reference: "Rust for Rustaceans" Ch. 3
- Unsafe Rust Invariants
  - What are the rules for `unsafe` blocks?
  - When can you dereference a raw pointer?
  - Book Reference: "The Rustonomicon" (online book)
Solution Architecture
Your arena allocator should have this basic structure:
ARENA STRUCT ARCHITECTURE

pub struct Arena {
    /// Raw pointer to the start of the allocated memory block
    base: *mut u8,
    /// Current position in the arena (where the next allocation starts)
    ptr: Cell<*mut u8>,
    /// Pointer to one past the end of the arena
    end: *mut u8,
    /// Total capacity in bytes
    capacity: usize,
}

MEMORY LAYOUT:
base --> +----------+----------+----------+------------------+
         |   USED   |   USED   |   USED   |       FREE       |
         |   (A)    |   (B)    |   (C)    |                  |
         +----------+----------+----------+------------------+
                                          ^                  ^
                                          ptr (current)      end

Invariants:
- base <= ptr <= end
- capacity  = end - base
- used      = ptr - base
- remaining = end - ptr
ALLOC METHOD SIGNATURE:

pub fn alloc<T>(&self, value: T) -> &mut T

- Takes ownership of `value`
- Returns `&mut T` with lifetime tied to `&self`
- May panic if the arena is out of space

ALIGNMENT CALCULATION (sketch; the raw-pointer operations need unsafe):

fn alloc_impl<T>(&self, value: T) -> &mut T {
    let layout = Layout::new::<T>();
    // 1. Get the current pointer position
    let ptr = self.ptr.get();
    // 2. Calculate the aligned position
    let aligned = align_up(ptr as usize, layout.align()) as *mut u8;
    // 3. Calculate the new pointer position after the allocation
    let new_ptr = unsafe { aligned.add(layout.size()) };
    // 4. Check that we have enough space
    if new_ptr > self.end {
        panic!("Arena out of memory");
    }
    // 5. Update the bump pointer
    self.ptr.set(new_ptr);
    // 6. Write the value and return a reference
    let slot = aligned as *mut T;
    unsafe {
        std::ptr::write(slot, value);
        &mut *slot
    }
}
DROP IMPLEMENTATION:
impl Drop for Arena {
fn drop(&mut self) {
// Deallocate the entire backing memory
unsafe {
let layout = Layout::from_size_align_unchecked(
self.capacity,
1 // u8 alignment
);
std::alloc::dealloc(self.base, layout);
}
}
}
WARNING: This simple arena does NOT call Drop on allocated objects!
The objects are just "forgotten" when the arena is dropped.
For types with Drop, you need TypedArena or manual drop tracking.
Phased Implementation Guide
Phase 1: Request Memory from System Allocator
Goal: Allocate a large chunk of memory to use as your arena.
use std::alloc::{alloc, Layout};
use std::cell::Cell;
pub struct Arena {
base: *mut u8,
ptr: Cell<*mut u8>,
end: *mut u8,
capacity: usize,
}
impl Arena {
pub fn new(capacity: usize) -> Self {
let layout = Layout::from_size_align(capacity, 1)
.expect("Invalid layout");
let base = unsafe { alloc(layout) };
if base.is_null() {
panic!("Failed to allocate arena memory");
}
let end = unsafe { base.add(capacity) };
Arena {
base,
ptr: Cell::new(base),
end,
capacity,
}
}
}
Checkpoint: Verify you can create an Arena and it doesn't crash.
Phase 2: Implement Basic Bump Allocation (No Alignment)
Goal: Allocate bytes without worrying about alignment.
impl Arena {
/// Allocate raw bytes (no alignment)
pub fn alloc_bytes(&self, size: usize) -> *mut u8 {
let ptr = self.ptr.get();
let new_ptr = unsafe { ptr.add(size) };
if new_ptr > self.end {
panic!("Arena out of memory");
}
self.ptr.set(new_ptr);
ptr
}
}
Checkpoint: Allocate some bytes, verify the pointer advances.
Phase 3: Add Proper Alignment Handling
Goal: Ensure allocations are properly aligned for their type.
/// Round `addr` up to the nearest multiple of `align`
fn align_up(addr: usize, align: usize) -> usize {
(addr + align - 1) & !(align - 1)
}
impl Arena {
pub fn alloc<T>(&self, value: T) -> &mut T {
let layout = Layout::new::<T>();
let ptr = self.ptr.get();
// Calculate aligned address
let aligned = align_up(ptr as usize, layout.align());
let aligned_ptr = aligned as *mut u8;
// Calculate end of allocation
let new_ptr = unsafe { aligned_ptr.add(layout.size()) };
if new_ptr > self.end {
panic!("Arena out of memory: needed {} bytes, have {}",
layout.size(),
self.end as usize - ptr as usize);
}
self.ptr.set(new_ptr);
// Write the value
let slot = aligned_ptr as *mut T;
unsafe {
std::ptr::write(slot, value);
&mut *slot
}
}
}
Checkpoint: Allocate different types (u8, u32, u64), verify alignment.
Phase 4: Add Lifetime Safety
Goal: Ensure references cannot outlive the arena.
use std::marker::PhantomData;
pub struct Arena<'a> {
base: *mut u8,
ptr: Cell<*mut u8>,
end: *mut u8,
capacity: usize,
_marker: PhantomData<&'a ()>,
}
impl<'a> Arena<'a> {
pub fn alloc<T>(&'a self, value: T) -> &'a mut T {
// ... same implementation ...
// The 'a lifetime ensures the reference can't outlive the arena
}
}
Checkpoint: Verify the compiler rejects code that would create dangling references.
Phase 5: Implement Drop and Add Benchmarks
Goal: Clean up memory properly and prove the performance benefit.
impl Drop for Arena<'_> {
fn drop(&mut self) {
unsafe {
let layout = Layout::from_size_align_unchecked(self.capacity, 1);
std::alloc::dealloc(self.base, layout);
}
}
}
// Create benchmark comparing Arena::alloc vs Box::new
#[cfg(test)]
mod benchmarks {
use super::*;
use std::time::Instant;
#[test]
fn benchmark_comparison() {
const N: usize = 100_000;
// Benchmark Box::new
let start = Instant::now();
let boxes: Vec<_> = (0..N).map(|i| Box::new(i as u64)).collect();
let box_time = start.elapsed();
drop(boxes);
// Benchmark Arena
let arena = Arena::new(N * 16); // 16 bytes per u64 with alignment
let start = Instant::now();
let refs: Vec<_> = (0..N).map(|i| arena.alloc(i as u64)).collect();
let arena_time = start.elapsed();
drop(refs);
println!("Box::new: {:?} ({:.2} ns/alloc)",
box_time, box_time.as_nanos() as f64 / N as f64);
println!("Arena::alloc: {:?} ({:.2} ns/alloc)",
arena_time, arena_time.as_nanos() as f64 / N as f64);
println!("Speedup: {:.1}x",
box_time.as_nanos() as f64 / arena_time.as_nanos() as f64);
}
}
Testing Strategy
1. Alignment Verification Tests
#[test]
fn test_alignment_u8() {
let arena = Arena::new(1024);
let val: &mut u8 = arena.alloc(42u8);
assert_eq!(val as *const _ as usize % 1, 0); // 1-byte aligned
}
#[test]
fn test_alignment_u32() {
let arena = Arena::new(1024);
let _: &mut u8 = arena.alloc(1u8); // Misalign by 1
let val: &mut u32 = arena.alloc(42u32);
assert_eq!(val as *const _ as usize % 4, 0); // Must be 4-byte aligned
}
#[test]
fn test_alignment_u64() {
let arena = Arena::new(1024);
let _: &mut u8 = arena.alloc(1u8);
let _: &mut u16 = arena.alloc(2u16);
let val: &mut u64 = arena.alloc(42u64);
assert_eq!(val as *const _ as usize % 8, 0); // Must be 8-byte aligned
}
2. Capacity Exhaustion Tests
#[test]
#[should_panic(expected = "out of memory")]
fn test_capacity_exhaustion() {
let arena = Arena::new(16); // Small arena
let _: &mut u64 = arena.alloc(1u64); // 8 bytes
let _: &mut u64 = arena.alloc(2u64); // 8 more bytes
let _: &mut u64 = arena.alloc(3u64); // PANIC! No room
}
#[test]
fn test_exact_fit() {
let arena = Arena::new(8);
let val: &mut u64 = arena.alloc(42u64);
assert_eq!(*val, 42);
// Should succeed - exactly 8 bytes for a u64
}
3. Drop Behavior Tests
use std::sync::atomic::{AtomicUsize, Ordering};
static DROP_COUNT: AtomicUsize = AtomicUsize::new(0);
struct DropCounter;
impl Drop for DropCounter {
fn drop(&mut self) {
DROP_COUNT.fetch_add(1, Ordering::SeqCst);
}
}
#[test]
fn test_objects_not_dropped() {
// Simple arena does NOT call Drop on allocated objects
DROP_COUNT.store(0, Ordering::SeqCst);
{
let arena = Arena::new(1024);
let _ = arena.alloc(DropCounter);
let _ = arena.alloc(DropCounter);
let _ = arena.alloc(DropCounter);
} // Arena dropped here
// Objects were NOT dropped (leaked)
assert_eq!(DROP_COUNT.load(Ordering::SeqCst), 0);
}
4. Benchmark Tests
#[test]
fn benchmark_arena_vs_box() {
// Run with: cargo test --release benchmark_arena_vs_box -- --nocapture
const N: usize = 1_000_000;
// Box benchmark
let start = std::time::Instant::now();
let boxes: Vec<_> = (0..N).map(|i| Box::new(i)).collect();
let box_duration = start.elapsed();
std::hint::black_box(&boxes);
drop(boxes);
// Arena benchmark
let arena = Arena::new(N * 16);
let start = std::time::Instant::now();
let arena_refs: Vec<_> = (0..N).map(|i| arena.alloc(i)).collect();
let arena_duration = start.elapsed();
std::hint::black_box(&arena_refs);
println!("\n=== Benchmark Results ===");
println!("Box::new: {:?}", box_duration);
println!("Arena::alloc: {:?}", arena_duration);
println!("Speedup: {:.1}x",
box_duration.as_secs_f64() / arena_duration.as_secs_f64());
assert!(arena_duration < box_duration, "Arena should be faster than Box");
}
Common Pitfalls and Debugging
Pitfall 1: Forgetting Alignment
// WRONG: This will cause misaligned access
fn bad_alloc<T>(&self, value: T) -> &mut T {
let ptr = self.ptr.get();
let new_ptr = unsafe { ptr.add(std::mem::size_of::<T>()) };
// ... forgot to align!
}
// On ARM, this might crash. On x86, it's just slow.
Fix: Always calculate aligned address before allocation.
Pitfall 2: Integer Overflow in Alignment
// RISKY: the intermediate `addr + align - 1` can overflow for addresses
// near usize::MAX, and the modulo is slower than it needs to be
fn align_up_bad(addr: usize, align: usize) -> usize {
    addr + align - 1 - (addr + align - 1) % align
}
// PREFERRED: bitwise AND; requires align to be a power of 2 (true for
// all Rust alignments). The addition can still overflow in theory, so
// hardened code may want checked_add.
fn align_up(addr: usize, align: usize) -> usize {
    debug_assert!(align.is_power_of_two());
    (addr + align - 1) & !(align - 1)
}
Pitfall 3: Returning Raw Pointers Instead of References
// WRONG: User can use this after arena is dropped!
fn bad_alloc<T>(&self, value: T) -> *mut T {
// ...
ptr as *mut T // No lifetime connection to arena!
}
// CORRECT: Lifetime ties reference to arena
fn alloc<'a, T>(&'a self, value: T) -> &'a mut T {
// ...
unsafe { &mut *ptr } // Lifetime from &'a self
}
Pitfall 4: Not Handling Zero-Sized Types
// WRONG: Zero-sized types need special handling
fn alloc<T>(&self, value: T) -> &mut T {
let size = std::mem::size_of::<T>();
// If size == 0, we'd allocate 0 bytes and return... what?
}
// CORRECT: Handle ZSTs specially
fn alloc<T>(&self, value: T) -> &mut T {
    if std::mem::size_of::<T>() == 0 {
        // For a ZST: consume the value without running its destructor,
        // then return a dangling-but-aligned pointer (the NonNull::dangling idiom)
        std::mem::forget(value);
        return unsafe { &mut *std::ptr::NonNull::<T>::dangling().as_ptr() };
    }
    // ... normal allocation
}
Pitfall 5: Calling Drop on Arena-Allocated Objects
// WRONG: Objects aren't dropped, might leak resources
struct FileHandle { fd: i32 }
impl Drop for FileHandle {
fn drop(&mut self) { unsafe { libc::close(self.fd); } }
}
let arena = Arena::new(1024);
let handle = arena.alloc(FileHandle { fd: open(...) });
drop(arena); // FileHandle::drop is NOT called! File descriptor leaked!
// SOLUTION 1: Manual cleanup
unsafe { std::ptr::drop_in_place(handle); }
drop(arena);
// SOLUTION 2: Use TypedArena that tracks Drop types
// (more complex implementation)
Pitfall 6: Thread Safety Issues
// WRONG: Cell is not thread-safe
pub struct Arena {
ptr: Cell<*mut u8>, // Not Send or Sync!
}
// This will fail to compile (correctly):
// std::thread::spawn(move || arena.alloc(42));
// SOLUTION: Use AtomicPtr for thread-safe arena
use std::sync::atomic::{AtomicPtr, Ordering};
pub struct ThreadSafeArena {
ptr: AtomicPtr<u8>,
// ... needs compare-and-swap logic in alloc
}
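One possible shape of the compare-and-swap logic that comment alludes to - a sketch in which the bump offset is an AtomicUsize and each allocation retries until its reservation wins (alignment omitted for brevity):

use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

struct AtomicBump {
    buf: Box<[u8]>,
    used: AtomicUsize, // byte offset of the next free position
}

impl AtomicBump {
    fn new(capacity: usize) -> Self {
        AtomicBump { buf: vec![0u8; capacity].into_boxed_slice(), used: AtomicUsize::new(0) }
    }

    fn alloc_bytes(&self, size: usize) -> Option<usize> {
        let mut start = self.used.load(Ordering::Relaxed);
        loop {
            let end = start.checked_add(size)?;
            if end > self.buf.len() {
                return None; // out of space
            }
            // Try to claim [start, end); on failure another thread got
            // there first, so retry from its updated offset.
            match self.used.compare_exchange_weak(
                start, end, Ordering::Relaxed, Ordering::Relaxed,
            ) {
                Ok(_) => return Some(start),
                Err(current) => start = current,
            }
        }
    }
}

fn main() {
    let bump = Arc::new(AtomicBump::new(1024));
    let handles: Vec<_> = (0..4).map(|_| {
        let bump = bump.clone();
        std::thread::spawn(move || bump.alloc_bytes(64).unwrap())
    }).collect();
    let mut offsets: Vec<_> = handles.into_iter().map(|h| h.join().unwrap()).collect();
    offsets.sort();
    assert_eq!(offsets, vec![0, 64, 128, 192]); // disjoint reservations
}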
Questions to Guide Your Design
- Alignment Calculation
  - How do you calculate the next aligned address? `(addr + align - 1) & !(align - 1)`?
  - Why must alignment always be a power of 2?
- Deallocation
  - Why is `Drop` usually a "no-op" for objects in an arena?
  - How do you free the entire arena at once?
  - What happens to objects that implement `Drop`?
- Memory Safety
  - How do you ensure references returned by the arena are valid?
  - What lifetime bounds are needed on the `alloc` method?
  - Why is `Cell<*mut u8>` used for the bump pointer?
Thinking Exercise
Trace the Bump
Imagine your arena has 100 bytes starting at address 0x1000.
- You allocate a `u32` (4 bytes, 4-align).
- You allocate a `u8` (1 byte, 1-align).
- You allocate a `u64` (8 bytes, 8-align).
Trace through the allocation:
Initial state:
base = 0x1000
ptr = 0x1000
end = 0x1064 (0x1000 + 100)
Step 1: alloc(u32)
align = 4, size = 4
aligned = align_up(0x1000, 4) = 0x1000 (already aligned)
new_ptr = 0x1000 + 4 = 0x1004
ptr = 0x1004
returned: &mut u32 at 0x1000
Step 2: alloc(u8)
align = 1, size = 1
aligned = align_up(0x1004, 1) = 0x1004 (already aligned)
new_ptr = 0x1004 + 1 = 0x1005
ptr = 0x1005
returned: &mut u8 at 0x1004
Step 3: alloc(u64)
align = 8, size = 8
aligned = align_up(0x1005, 8) = ?
0x1005 + 8 - 1 = 0x100C
0x100C & !(8-1) = 0x100C & !7 = 0x100C & 0xFFF8 = 0x1008
aligned = 0x1008
new_ptr = 0x1008 + 8 = 0x1010
ptr = 0x1010
returned: &mut u64 at 0x1008
PADDING: 3 bytes between u8 and u64 (0x1005, 0x1006, 0x1007)
Final memory layout:
0x1000   0x1004  0x1005   0x1008          0x1010              0x1064
+--------+-------+--------+---------------+---------------------+
|  u32   |  u8   |  PAD   |      u64      |        FREE         |
|  (4B)  | (1B)  |  (3B)  |     (8B)      |     (84 bytes)      |
+--------+-------+--------+---------------+---------------------+
                                          ^                     ^
                                          ptr                   end
Answer: 3 bytes of padding were inserted between the u8 and u64.
Total used: 16 bytes (4 + 1 + 3 + 8)
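You can replay this trace as pure offset arithmetic with the align_up helper from Phase 3:

// Quick check of the trace above: three allocations as offset math.
fn align_up(addr: usize, align: usize) -> usize {
    (addr + align - 1) & !(align - 1)
}

fn main() {
    let base = 0x1000usize;
    let mut ptr = base;

    // Step 1: u32 (size 4, align 4) lands at 0x1000
    let a = align_up(ptr, 4);
    assert_eq!(a, 0x1000);
    ptr = a + 4;

    // Step 2: u8 (size 1, align 1) lands at 0x1004
    let b = align_up(ptr, 1);
    assert_eq!(b, 0x1004);
    ptr = b + 1;

    // Step 3: u64 (size 8, align 8) skips 3 padding bytes to 0x1008
    let c = align_up(ptr, 8);
    assert_eq!(c, 0x1008);
    ptr = c + 8;

    assert_eq!(ptr - base, 16); // total used: 4 + 1 + 3 + 8
}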
The Interview Questions They'll Ask
- "What is an arena allocator and why is it fast?"
  Answer: An arena allocator pre-allocates a large block of memory and allocates from it by bumping a pointer. It's fast because: (1) allocation is O(1) - just increment a pointer, (2) no per-object metadata overhead, (3) no free list to search, (4) no fragmentation management, (5) excellent cache locality.
- "What happens if an arena runs out of memory?"
  Answer: Most implementations panic or return an error. More sophisticated arenas might allocate additional chunks and chain them together (like bumpalo does).
- "How do you handle the `Drop` trait for objects inside an arena?"
  Answer: Simple arenas don't call Drop on their contents - this is a "memory leak" by design. TypedArena implementations track Drop types and call destructors before freeing memory. For most arena use cases (like compiler AST nodes), the types don't implement Drop anyway.
- "Why is cache locality better with an arena?"
  Answer: Arena-allocated objects are contiguous in memory. When iterating through them, the CPU prefetcher can efficiently load cache lines. With Box-allocated objects scattered across the heap, each access might be a cache miss.
- "Explain the alignment calculation formula."
  Answer: `(addr + align - 1) & !(align - 1)` works because: adding `align - 1` ensures we "overflow" into the next aligned block, then the bitwise AND with `!(align - 1)` masks off the lower bits to round down to the alignment boundary. This only works when align is a power of 2.
Extensions and Challenges
Extension 1: Typed Arena (TypedArena<T>)
Create an arena that only allocates objects of one type:
use std::cell::{Cell, RefCell};
use std::mem::MaybeUninit;

pub struct TypedArena<T> {
    chunks: RefCell<Vec<Box<[MaybeUninit<T>]>>>,
    ptr: Cell<*mut T>,
    end: Cell<*mut T>,
}
impl<T> TypedArena<T> {
pub fn alloc(&self, value: T) -> &mut T {
if self.ptr == self.end {
self.grow();
}
// ... bump allocation
}
fn grow(&self) {
// Allocate a new chunk, update ptr and end
}
}
impl<T> Drop for TypedArena<T> {
    fn drop(&mut self) {
        // Call Drop on all allocated objects!
        // NOTE: simplified - a real implementation must track how many
        // slots in each chunk were actually initialized and drop only
        // those; dropping uninitialized memory is undefined behavior.
        for mut chunk in self.chunks.borrow_mut().drain(..) {
            for item in chunk.iter_mut() {
                unsafe { std::ptr::drop_in_place(item.as_mut_ptr()) };
            }
        }
    }
}
Extension 2: Growing Arena (Linked Chunks)
Instead of panicking when full, allocate a new chunk:
pub struct GrowingArena {
chunks: RefCell<Vec<Box<[u8]>>>,
current: Cell<*mut u8>,
end: Cell<*mut u8>,
chunk_size: usize,
}
impl GrowingArena {
fn grow(&self) {
let new_chunk = vec![0u8; self.chunk_size].into_boxed_slice();
let ptr = new_chunk.as_ptr() as *mut u8;
let end = unsafe { ptr.add(self.chunk_size) };
self.chunks.borrow_mut().push(new_chunk);
self.current.set(ptr);
self.end.set(end);
}
}
Extension 3: Thread-Local Arena
Create a thread-local arena for lock-free allocation:
use std::cell::RefCell;
thread_local! {
static ARENA: RefCell<Arena> = RefCell::new(Arena::new(1024 * 1024));
}
pub fn alloc<T>(value: T) -> &'static mut T {
ARENA.with(|arena| {
// SAFETY: This is actually unsound without careful lifetime management!
// Real implementation needs scoped allocation or guard types.
unsafe { std::mem::transmute(arena.borrow().alloc(value)) }
})
}
Real-World Connections
The bumpalo Crate
bumpalo is the most popular arena allocator in Rust:
use bumpalo::Bump;
let bump = Bump::new();
// Allocate values
let x = bump.alloc(1);
let y = bump.alloc([1, 2, 3]);
// Allocate strings
let s = bump.alloc_str("Hello");
// Use with collections
let mut vec = bumpalo::collections::Vec::new_in(&bump);
vec.push(1);
vec.push(2);
The Rust Compiler (rustc)
rustc uses arenas extensively for AST nodes, types, and other compiler data structures:
// From rustc_arena (simplified)
pub struct DroplessArena {
// Arena for types that don't need Drop
chunks: RefCell<Vec<Box<[MaybeUninit<u8>]>>>,
ptr: Cell<*mut u8>,
end: Cell<*mut u8>,
}
pub struct TypedArena<T> {
// Arena for types that DO need Drop
chunks: RefCell<Vec<Box<[MaybeUninit<T>]>>>,
ptr: Cell<*mut T>,
end: Cell<*mut T>,
}
ECS Game Engines (Bevy, Legion)
Game engines use arena allocation for components:
// Conceptual example from ECS design
struct ComponentArena<T> {
data: Arena<T>,
entity_map: HashMap<Entity, *mut T>,
}
impl<T> ComponentArena<T> {
fn insert(&mut self, entity: Entity, component: T) {
let ptr = self.data.alloc(component);
self.entity_map.insert(entity, ptr);
}
fn get(&self, entity: Entity) -> Option<&T> {
self.entity_map.get(&entity).map(|&ptr| unsafe { &*ptr })
}
}
Hints in Layers
Hint 1: The Base Pointer
Use std::alloc::alloc to get your initial chunk of memory. Store it as a *mut u8.
Hint 2: Alignment
Use pointer::align_offset to find how many bytes you need to skip to satisfy the alignment of T. Or implement the (addr + align - 1) & !(align - 1) formula.
Hint 3: Safety
Wrap everything in a safe method pub fn alloc<T>(&self, value: T) -> &mut T. Be very careful with the lifetimes - the returned reference must not outlive &self.
Hint 4: Cell for Interior Mutability
You need Cell<*mut u8> for the bump pointer because alloc takes &self, not &mut self. This allows multiple allocations without exclusive access.
Hint 5: Testing Alignment
Use std::ptr::addr_of!() or cast to usize to verify addresses are properly aligned: addr % align == 0.
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Custom Allocators | "The Linux Programming Interface" by Kerrisk | Ch. 7: Memory Allocation |
| Alignment & Structs | "Computer Systems: A Programmer's Perspective" | Ch. 3.9: Heterogeneous Data Structures |
| Unsafe Memory | "Programming Rust" by Blandy & Orendorff | Ch. 21: Unsafe Code |
| Arena Patterns | "Rust for Rustaceans" by Gjengset | Ch. 3: Designing Interfaces |
| Allocator Internals | "The Rustonomicon" (online) | Memory Allocation |
Success Criteria
You have completed this project when:
- Your arena can allocate objects of any type with correct alignment
- Your arena is 10x+ faster than `Box::new()` in benchmarks
- All tests pass, including alignment verification
- You can explain the alignment formula without looking at notes
- You understand why Drop is not called on arena objects
- You can draw the memory layout after a sequence of allocations
- You could implement a growing arena as an extension
Next Steps
After completing this project, you'll be ready for:
- Project 6: Atomic Lock-Free Queue - Use your arena knowledge to build cache-aligned concurrent data structures
- Project 7: Zero-Copy Protocol Parser - Combine arena allocation with zero-copy parsing
- Project 12: High-Performance KV Store - Use arenas for the index layer of a database engine
This project is part of the Advanced Rust Ecosystem Deep Dive learning path.