Project 5: NUMA-Aware Memory Allocator

Build a per-node memory allocator that guarantees allocations land on a chosen NUMA node and tracks locality statistics.

Quick Reference

Attribute Value
Difficulty Level 3: Advanced
Time Estimate 2-3 weeks
Main Programming Language C
Alternative Programming Languages C++, Rust
Coolness Level Level 4: Hardcore Tech Flex
Business Potential 3. The “Service & Support” Model
Prerequisites C memory management, threading basics, NUMA policies
Key Topics libnuma APIs, mbind/set_mempolicy, arenas, fragmentation

1. Learning Objectives

By completing this project, you will:

  1. Explain Linux NUMA memory policies and how they affect allocation.
  2. Implement per-node arenas with explicit placement guarantees.
  3. Provide a numa_malloc(node, size) API with predictable behavior.
  4. Add thread-local caches to reduce contention.
  5. Measure local vs remote allocation ratios.
  6. Handle node exhaustion with fallback policies.
  7. Produce deterministic allocator tests using fixed workloads.

2. All Theory Needed (Per-Concept Breakdown)

2.1 NUMA Memory Policies and Page Placement APIs

Fundamentals

Linux provides NUMA policies that control where memory pages are allocated: bind (allocate only on specified nodes), preferred (try a node but allow fallback), and interleave (round-robin across nodes). APIs like mbind, set_mempolicy, and libnuma’s numa_alloc_onnode let applications control placement. Without explicit placement, first-touch policy determines node allocation based on the thread that first accesses a page. A NUMA-aware allocator must understand these policies and expose a reliable interface for choosing placement while handling errors such as node exhaustion.

Deep Dive into the Concept

NUMA policy controls are applied at the virtual memory (VM) level. When you allocate memory (via malloc, mmap, or numa_alloc_onnode), the OS reserves virtual address space but does not necessarily map physical pages immediately. Physical pages are mapped on first touch. That means the memory policy in effect at the time of the first access determines where the page is placed. This can lead to surprising behavior if you set a policy during allocation but touch the memory from a different thread or on a different node.

The mbind system call allows you to set a policy on a specific virtual memory range, such as a pool managed by your allocator. The set_mempolicy call sets a policy for the calling thread. libnuma provides convenience wrappers like numa_alloc_onnode and numa_alloc_interleaved, which reserve a range and attach the corresponding policy to it; pages are still placed on first touch, but the attached policy determines which node receives them. The allocator you build should use these APIs to ensure determinism: when the user requests memory on node X, the pages should be placed on node X and verified. If the node lacks free pages, the allocator must either fail or fall back to another node based on a configurable policy.

Another subtlety is huge pages. Large allocations can use huge pages to reduce TLB pressure, but huge page pools may be unevenly distributed across nodes. This can skew placement unless your allocator accounts for it. Similarly, transparent huge pages can create unexpected placement behavior because the kernel may merge pages. For this project, you can choose to ignore huge pages or explicitly disable them in tests for determinism.

Policy-aware allocation also interacts with cpusets. If the process is restricted to a subset of nodes, requests for other nodes may fail or be silently redirected. A robust allocator should detect the effective node set and surface warnings. In practice, high-performance allocators (jemalloc, tcmalloc) implement arenas and thread-local caches. Your allocator will use a per-node arena and may offer a per-thread cache that draws from the local arena, ensuring locality and reducing lock contention.

How This Fits Into the Project

This concept is the foundation of your allocator’s correctness. Every placement guarantee depends on understanding and applying NUMA policies and first-touch semantics.

Definitions & Key Terms

  • Bind policy -> Allocate only on specified nodes.
  • Preferred policy -> Prefer a node but allow fallback.
  • Interleave policy -> Round-robin allocation across nodes.
  • First-touch -> Pages allocated on the node of the first access.
  • mbind/set_mempolicy -> Syscalls that set memory policy on ranges or threads.

Mental Model Diagram (ASCII)

Virtual Address Range (allocator arena)
|--------------------------------------|
Policy: bind -> node 0 only
First touch by thread on node 0 => pages on node 0
First touch by thread on node 1 => remote pages unless policy enforced

How It Works (Step-by-Step)

  1. Reserve a virtual memory region for each node’s arena.
  2. Apply mbind to the region with a bind policy.
  3. Touch pages from a thread pinned to that node to force placement.
  4. Track allocation metadata per arena.
  5. If allocation fails, apply fallback policy or return error.

Invariants: Policy must be applied before first touch; threads must be pinned for deterministic placement.

Failure modes: Node exhaustion, cpuset restrictions, huge page interference.
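The pinning invariant above can be sketched with the GNU affinity API. This is a simplified helper, assuming Linux with glibc; a full allocator would pin to a CPU belonging to the target node (e.g. via libnuma's CPU-to-node mapping), and pin_self_to_cpu is an illustrative name.

```c
/* Sketch: pin the calling thread to one CPU so its first touches are
 * deterministic. pthread_setaffinity_np is a GNU extension; a real
 * allocator would choose a CPU that belongs to the target NUMA node.
 * pin_self_to_cpu is an illustrative name. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

int pin_self_to_cpu(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    /* Returns 0 on success, an errno value on failure. */
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}
```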

Minimal Concrete Example

void *p = numa_alloc_onnode(size, node);
if (!p) return NULL;
memset(p, 0, size); // touch pages to enforce placement
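To check that the touched pages actually landed where intended, one option is move_pages() in query mode: passing a NULL target-node array makes the kernel report each page's current node instead of migrating it. The sketch below uses the raw syscall so it builds without libnuma; node_of_page is an illustrative name.

```c
/* Sketch: query where a page currently resides. move_pages() with a
 * NULL `nodes` array writes each page's node into `status` instead of
 * migrating. node_of_page is an illustrative name. */
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

int node_of_page(void *addr) {
    long pagesz = sysconf(_SC_PAGESIZE);
    /* move_pages expects page addresses, so round down. */
    void *pages[1] = { (void *)((uintptr_t)addr & ~((uintptr_t)pagesz - 1)) };
    int status = -1;
    /* pid=0 => calling process; count=1; nodes=NULL => query only. */
    if (syscall(SYS_move_pages, 0, 1UL, pages, NULL, &status, 0) != 0)
        return -1; /* query failed (e.g. kernel without NUMA support) */
    return status; /* node id, or a negative errno if the page is unmapped */
}
```

A placement-verification test can then assert node_of_page(p) == node for every page of an allocation made with a bind policy.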

Common Misconceptions

  • “Allocation equals placement” -> placement happens on first touch.
  • “Policies are global” -> they can be per-thread or per-range.
  • “Fallback never happens” -> preferred policy can allocate elsewhere.

Check-Your-Understanding Questions

  1. Why might a page be placed on the wrong node even if you used numa_alloc_onnode?
  2. What is the difference between bind and preferred policies?
  3. How do cpusets affect memory policy?

Check-Your-Understanding Answers

  1. If the page is touched by a thread on a different node or if policy is not enforced.
  2. Bind forces allocation on specific nodes; preferred allows fallback.
  3. They restrict which nodes the process can use, overriding policies.

Real-World Applications

  • NUMA-aware allocators in databases and in-memory caches.
  • High-frequency trading systems where locality is critical.

Where You’ll Apply It

You will apply these policies whenever an arena is created or grown: every new region must be bound and touched before blocks from it are handed out (Sections 4-5).
References

  • “Operating Systems: Three Easy Pieces” – Ch. 13-21
  • “C Interfaces and Implementations” (Hanson) – Ch. 5-6
  • “The Linux Programming Interface” – Ch. 6

Key Insights

Placement is not a side effect; it is a policy that must be enforced and verified.

Summary

NUMA policies control where pages are placed, but first-touch semantics can override expectations. A NUMA-aware allocator must apply policies before touching memory, pin threads, and handle node exhaustion explicitly.

Homework/Exercises to Practice the Concept

  1. Allocate memory with bind policy and verify placement with numastat.
  2. Allocate with preferred policy and observe fallback behavior.
  3. Compare placement when touching memory from different nodes.

Solutions to the Homework/Exercises

  1. Pages should be mostly local to the bound node.
  2. Fallback will allocate on other nodes if the preferred node is full.
  3. First-touch determines which node actually receives the pages.

2.2 Allocator Architecture: Arenas, Fragmentation, and Thread Caches

Fundamentals

High-performance allocators use arenas: independent pools of memory that reduce contention. In a NUMA-aware allocator, each node has its own arena, keeping most allocations local. Within an arena, free lists track blocks of different sizes. Thread-local caches can serve small allocations without locks, further reducing overhead. Fragmentation is the cost of unused memory due to allocation patterns; a good allocator balances locality, fragmentation, and concurrency.

Deep Dive into the Concept

An allocator must satisfy requests of varying sizes quickly while minimizing fragmentation and contention. The typical design uses size classes: small allocations are rounded up to fixed sizes, and each size class has its own free list. This simplifies allocation and deallocation but can waste memory (internal fragmentation). Large allocations are handled separately, often by direct mmap or by maintaining large blocks in the arena.

Arenas provide concurrency and locality. With a global allocator, all threads contend on the same locks and data structures. With per-node arenas, threads pinned to a node allocate from the local arena, reducing both contention and remote memory traffic. However, this introduces new challenges: if one node’s arena is exhausted while another is underutilized, you need a fallback strategy. Options include allowing remote allocations, stealing from another arena, or returning an error. Each choice has a trade-off between locality and availability.
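The exhaustion trade-off above can be made concrete as a small decision function. This is pure illustrative logic, not a library API: avail_mask carries one bit per node with free capacity, and pick_node and the FALLBACK_* names are introduced here.

```c
/* Sketch: encode the fallback decision when the preferred node's arena
 * is exhausted. `avail_mask` has one bit per node with free capacity.
 * pick_node and FALLBACK_* are illustrative names. */
#include <stdint.h>

enum fallback_policy { FALLBACK_FAIL, FALLBACK_NEAREST };

/* Returns the node to allocate from, or -1 to fail the request. */
int pick_node(int preferred, uint64_t avail_mask, enum fallback_policy fp) {
    if (avail_mask & (1ULL << preferred))
        return preferred;            /* local arena still has room */
    if (fp == FALLBACK_FAIL)
        return -1;                   /* strict bind semantics */
    for (int n = 0; n < 64; n++)     /* simple scan; real code would
                                        order candidates by NUMA distance */
        if (avail_mask & (1ULL << n))
            return n;
    return -1;                       /* every node exhausted */
}
```

Keeping this choice in one function makes the locality/availability trade-off explicit and easy to test deterministically.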

Thread-local caches are a further optimization. Each thread keeps a small cache of recently freed blocks. On allocation, it checks the cache first; on free, it returns blocks to the cache. This reduces lock contention but increases memory footprint. In NUMA systems, caches should preferably store blocks from the local arena. If a thread migrates across nodes, it may hold blocks from the wrong node. The allocator can either flush caches on migration or treat migration as an error condition, which is why thread pinning is so important.
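A minimal version of such a cache fits in a few lines of thread-local storage. This sketch covers a single size class with a fixed capacity; tcache_push and tcache_pop are illustrative names, and a real cache would exist per size class and be flushed back to the owning arena when it overflows.

```c
/* Sketch: a tiny per-thread cache for one size class, using GNU/C
 * thread-local storage. tcache_push/tcache_pop are illustrative names;
 * a real design is per size class and bounded per node. */
#include <stddef.h>

#define TCACHE_CAP 8

static __thread void *tcache[TCACHE_CAP]; /* LIFO stack of freed blocks */
static __thread int tcache_top;

/* Returns 0 if the block was cached, -1 if the cache is full
 * (caller should return the block to its arena instead). */
int tcache_push(void *block) {
    if (tcache_top == TCACHE_CAP)
        return -1;
    tcache[tcache_top++] = block;
    return 0;
}

/* Returns a cached block, or NULL if the cache is empty
 * (caller falls through to the arena free list). */
void *tcache_pop(void) {
    return tcache_top ? tcache[--tcache_top] : NULL;
}
```

The LIFO order is deliberate: the most recently freed block is the most likely to still be warm in cache.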

Fragmentation is unavoidable, but you can manage it by using segregated free lists, coalescing adjacent free blocks, and periodically scavenging unused blocks. For this project, focus on deterministic behavior and clear locality metrics rather than maximal optimization. A simple per-node arena with size-class free lists and optional thread caches is sufficient to demonstrate the core concepts.
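Size-class mapping, the core of the segregated-list design above, can be sketched as follows. The class boundaries (powers of two from 16 to 256) and the names size_to_class/class_size are illustrative choices, not fixed by the project.

```c
/* Sketch: power-of-two size classes from 16 to 256 bytes. Rounding a
 * request up to its class costs internal fragmentation but makes the
 * free lists trivial. size_to_class/class_size are illustrative names. */
#include <stddef.h>

#define NUM_SIZE_CLASSES 5 /* 16, 32, 64, 128, 256 */

/* Bytes served by class index c. */
size_t class_size(int c) { return (size_t)16 << c; }

/* Smallest class that fits `size`, or -1 for large allocations
 * (those would go straight to mmap). */
int size_to_class(size_t size) {
    for (int c = 0; c < NUM_SIZE_CLASSES; c++)
        if (size <= class_size(c))
            return c;
    return -1;
}
```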

How This Fits Into the Project

This concept shapes the allocator’s internal design: per-node arenas, size classes, and optional thread caches. It directly affects performance and locality ratios.

Definitions & Key Terms

  • Arena -> Independent memory pool with its own free lists.
  • Size class -> Fixed-size allocation bucket.
  • Fragmentation -> Wasted memory due to allocation patterns.
  • Thread-local cache -> Per-thread pool of recently freed blocks.
  • Coalescing -> Merging adjacent free blocks.

Mental Model Diagram (ASCII)

Node 0 Arena                 Node 1 Arena
[64B][128B][256B] ...         [64B][128B][256B] ...
  ^   ^   ^                      ^   ^   ^
Thread0 cache                 Thread1 cache

How It Works (Step-by-Step)

  1. Initialize per-node arenas with size-class free lists.
  2. On allocation, choose arena based on target node.
  3. Check thread-local cache, then arena free list.
  4. If empty, grow arena using mmap + policy bind.
  5. On free, return to thread cache or arena list.

Invariants: Blocks in an arena are bound to that node.

Failure modes: Thread migration, arena imbalance, fragmentation.

Minimal Concrete Example

typedef struct {
    void *free_lists[NUM_SIZE_CLASSES];
    pthread_mutex_t lock;
} Arena;

Arena arenas[MAX_NODES];

void *numa_malloc(int node, size_t size) {
    if (node < 0 || node >= MAX_NODES)
        return NULL; // unknown node: fail fast
    Arena *a = &arenas[node];
    // size class lookup and allocation logic
    return allocate_from_arena(a, size);
}

Common Misconceptions

  • “Per-node arenas eliminate all contention” -> thread-local caches are still useful.
  • “Fragmentation is negligible” -> it can dominate memory usage over time.
  • “Migration is harmless” -> it can break locality guarantees.

Check-Your-Understanding Questions

  1. Why do arenas reduce contention?
  2. How can thread-local caches increase memory usage?
  3. What happens if a node’s arena runs out of memory?

Check-Your-Understanding Answers

  1. They reduce shared locks by isolating allocation paths.
  2. Each thread holds blocks that are not visible to others.
  3. The allocator must fall back to another node, steal from another arena, or fail the allocation.

Real-World Applications

  • High-performance allocators like jemalloc and tcmalloc.
  • NUMA-aware database and cache engines.

Where You’ll Apply It

These design choices shape the Arena Manager and Thread Cache components of the architecture (Section 4) and the fragmentation and contention tests in Section 6.
References

  • “C Interfaces and Implementations” – Ch. 5-6
  • “Computer Systems: A Programmer’s Perspective” – Ch. 9

Key Insights

Arenas and thread caches trade memory overhead for locality and scalability.

Summary

Allocator design balances concurrency, locality, and fragmentation. Per-node arenas keep memory local, thread caches reduce contention, and size classes manage fragmentation. These mechanisms are the core of a NUMA-aware allocator.

Homework/Exercises to Practice the Concept

  1. Implement size classes for 16, 32, 64, 128 bytes.
  2. Simulate allocations and measure fragmentation.
  3. Add a thread-local cache and compare lock contention.

Solutions to the Homework/Exercises

  1. Use an array of free lists indexed by size class.
  2. Fragmentation increases when allocations are irregular; measure wasted bytes.
  3. Lock contention decreases as more allocations are served from caches.

3. Project Specification

3.1 What You Will Build

A NUMA-aware allocator library providing APIs like numa_malloc(node, size) and numa_free(ptr). It uses per-node arenas, optional thread-local caches, and reports locality statistics.

Included: per-node arenas, placement via mbind/libnuma, stats reporting. Excluded: full drop-in replacement for malloc or advanced garbage collection.

3.2 Functional Requirements

  1. Allocate memory on a specified NUMA node.
  2. Maintain per-node arenas with size classes.
  3. Provide thread-local caches for small allocations.
  4. Track local vs remote allocation stats per thread and node.
  5. Support fallback policy when node memory is exhausted.
  6. Provide a deterministic test harness with fixed allocation patterns.

3.3 Non-Functional Requirements

  • Performance: allocation latency < 2x malloc for small sizes.
  • Reliability: correct placement verified with numastat.
  • Usability: clear API and statistics output.

3.4 Example Usage / Output

void *p = numa_malloc(0, 4096);
// use memory
numa_free(p);

3.5 Data Formats / Schemas / Protocols

Stats output (JSON):

{
  "node0": {"allocated_mb": 3200, "local_pct": 98.1},
  "node1": {"allocated_mb": 3100, "local_pct": 97.6},
  "remote_pct": 2.1
}
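One way to produce this output is a single snprintf over the collected counters. The sketch below matches the schema's field names; the struct layout and stats_to_json are illustrative, and a real reporter would loop over however many nodes exist.

```c
/* Sketch: render per-node stats into the JSON schema above for a
 * two-node machine. NodeStats and stats_to_json are illustrative names. */
#include <stdio.h>

typedef struct {
    double allocated_mb;
    double local_pct;
} NodeStats;

/* Writes JSON into buf; returns the number of bytes that snprintf
 * reports (so callers can detect truncation against cap). */
int stats_to_json(char *buf, size_t cap,
                  const NodeStats *n0, const NodeStats *n1,
                  double remote_pct) {
    return snprintf(buf, cap,
        "{\n"
        "  \"node0\": {\"allocated_mb\": %.0f, \"local_pct\": %.1f},\n"
        "  \"node1\": {\"allocated_mb\": %.0f, \"local_pct\": %.1f},\n"
        "  \"remote_pct\": %.1f\n"
        "}\n",
        n0->allocated_mb, n0->local_pct,
        n1->allocated_mb, n1->local_pct, remote_pct);
}
```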

3.6 Edge Cases

  • Allocation on node not visible to process.
  • Node out of memory.
  • Thread migration across nodes.
  • Fragmentation leads to large free memory but no suitable block.

3.7 Real World Outcome

You can allocate memory on specific nodes with predictable locality and report statistics proving placement success.

3.7.1 How to Run (Copy/Paste)

cc -O2 -pthread -o numa_alloc_demo src/demo.c -lnuma
./numa_alloc_demo --threads=8 --policy=bind

3.7.2 Golden Path Demo (Deterministic)

$ ./numa_alloc_demo --threads=8 --policy=bind --seed=42
Node 0: 3.2 GB allocated, 98.1% local
Node 1: 3.1 GB allocated, 97.6% local
Remote allocations: 2.1%

3.7.3 If CLI: Exact Terminal Transcript

$ ./numa_alloc_demo --threads=8 --policy=interleave
Node 0: 3.1 GB allocated, 50.2% local
Node 1: 3.2 GB allocated, 49.8% local

3.7.4 Failure Demo (Deterministic)

$ ./numa_alloc_demo --node=9
ERROR: node 9 not available
EXIT: 1

Exit Codes:

  • 0 success
  • 1 invalid node or arguments
  • 2 allocation failure
  • 3 placement verification failure

4. Solution Architecture

4.1 High-Level Design

+--------------+   +----------------+   +------------------+
|  API Layer   |-->| Per-Node Arena |-->| Stats & Reporter |
+--------------+   +----------------+   +------------------+
        ^                  ^
        |                  |
+--------------+    +---------------+
| Thread Cache |--->| Policy Engine |
+--------------+    +---------------+

4.2 Key Components

| Component | Responsibility | Key Decisions |
|---|---|---|
| API Layer | Expose numa_malloc/numa_free | Explicit node parameter |
| Policy Engine | Apply mbind/set_mempolicy | Bind vs preferred |
| Arena Manager | Maintain per-node free lists | Size classes |
| Thread Cache | Reduce contention | Cache size limits |
| Stats | Track local/remote ratio | Per-thread counters |

4.3 Data Structures (No Full Code)

typedef struct {
    void *free_lists[NUM_CLASSES];
    pthread_mutex_t lock;
} Arena;

typedef struct {
    uint64_t local_allocs;
    uint64_t remote_allocs;
} AllocStats;

4.4 Algorithm Overview

Key Algorithm: Allocation

  1. Determine size class.
  2. Choose arena based on node.
  3. Try thread-local cache; if empty, lock arena and pop block.
  4. If arena empty, grow using mmap + mbind.
  5. Update statistics and return pointer.

Complexity Analysis:

  • Time: O(1) average
  • Space: O(total allocations)

5. Implementation Guide

5.1 Development Environment Setup

sudo apt-get install -y build-essential libnuma-dev

5.2 Project Structure

numa-allocator/
|-- src/
|   |-- allocator.c
|   |-- arena.c
|   |-- policy.c
|   |-- stats.c
|   +-- demo.c
|-- include/
|   +-- numa_alloc.h
|-- tests/
|   +-- deterministic_alloc.c
|-- Makefile
+-- README.md

5.3 The Core Question You’re Answering

“How do I guarantee that memory allocations land on the node I want?”

5.4 Concepts You Must Understand First

  1. NUMA policies and first-touch.
  2. Arena-based allocator design.
  3. Thread pinning and locality metrics.

5.5 Questions to Guide Your Design

  1. How will you handle node exhaustion?
  2. Will you allow cross-node allocations or fail fast?
  3. How will you measure and report locality?

5.6 Thinking Exercise

Design a fallback policy that preserves locality for small allocations but allows large allocations to spill over. What trade-offs does this create?

5.7 The Interview Questions They’ll Ask

  1. “How does first-touch differ from explicit allocation on a node?”
  2. “Why do allocators use per-thread or per-node arenas?”
  3. “How would you measure allocator locality?”

5.8 Hints in Layers

Hint 1: Start with a wrapper around numa_alloc_onnode.

Hint 2: Add per-node free lists.

Hint 3: Add thread caches for small sizes.

Hint 4: Track local/remote allocations with counters.

5.9 Books That Will Help

| Topic | Book | Chapter |
|---|---|---|
| Allocator design | “C Interfaces and Implementations” | Ch. 5-6 |
| Memory policies | “Operating Systems: Three Easy Pieces” | Ch. 13-21 |
| NUMA basics | “Computer Architecture” | Ch. 2 |

5.10 Implementation Phases

Phase 1: Basic Allocation (4-5 days)

Goals: implement basic per-node allocation using libnuma.

Tasks:

  1. Add numa_malloc and numa_free wrappers.
  2. Verify placement with numastat.

Checkpoint: allocations are local to chosen nodes.

Phase 2: Arenas + Free Lists (5-6 days)

Goals: add arenas and size classes.

Tasks:

  1. Build size-class free lists.
  2. Add per-node arenas and locks.

Checkpoint: allocator handles many small allocations.

Phase 3: Thread Caches + Stats (4-5 days)

Goals: reduce contention and report locality stats.

Tasks:

  1. Add thread-local caches.
  2. Add local/remote allocation metrics.

Checkpoint: stats show >95% local allocations.

5.11 Key Implementation Decisions

| Decision | Options | Recommendation | Rationale |
|---|---|---|---|
| Fallback policy | Fail vs remote | Preferred + fallback | Avoid crashes, preserve locality |
| Thread caches | On vs off | On | Reduces lock contention |
| Size classes | Fixed vs dynamic | Fixed | Simpler and deterministic |


6. Testing Strategy

6.1 Test Categories

| Category | Purpose | Examples |
|---|---|---|
| Unit Tests | Validate size class selection | Allocation sizes map correctly |
| Integration Tests | Verify placement | numastat shows local pages |
| Edge Case Tests | Node exhaustion | Fallback behavior |

6.2 Critical Test Cases

  1. Local allocation ratio >95% under node-local workload.
  2. Node exhaustion triggers fallback or error as configured.
  3. Thread migration detected and reported.

6.3 Test Data

seed=42, allocations=1,000,000, size=64 bytes
expected_local_pct >= 95%

7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

| Pitfall | Symptom | Solution |
|---|---|---|
| First-touch ignored | Remote pages | Touch pages on target node |
| Fragmentation | High memory usage | Use size classes and coalescing |
| Thread migration | Locality drops | Pin threads |

7.2 Debugging Strategies

  • Use numastat -p <pid> to inspect node allocation.
  • Log allocations with node ID for verification.

7.3 Performance Traps

  • Overly large thread caches can waste memory.

8. Extensions & Challenges

8.1 Beginner Extensions

  • Add numa_realloc support.
  • Add a CLI demo that stresses the allocator.

8.2 Intermediate Extensions

  • Implement huge-page allocations.
  • Add per-thread locality reports.

8.3 Advanced Extensions

  • Integrate allocator with a real workload (e.g., hash map).
  • Add adaptive policies based on observed access patterns.

9. Real-World Connections

9.1 Industry Applications

  • Database buffer pools and in-memory caches.
  • High-frequency trading engines.

9.2 Open Source Inspiration

  • jemalloc – arenas and per-thread caches.
  • tcmalloc – scalable allocator design.

9.3 Interview Relevance

  • Explaining how allocators enforce NUMA locality.

10. Resources

10.1 Essential Reading

  • “C Interfaces and Implementations” – Ch. 5-6
  • “Operating Systems: Three Easy Pieces” – Ch. 13-21

10.2 Video Resources

  • Allocator design talks.

10.3 Tools & Documentation

  • libnuma – NUMA allocation APIs.
  • numastat – placement verification.

11. Self-Assessment Checklist

11.1 Understanding

  • I can explain bind vs preferred vs interleave policies.
  • I can explain why per-node arenas reduce contention.

11.2 Implementation

  • Allocations are local to chosen nodes.
  • Stats reporting is accurate.

11.3 Growth

  • I can explain allocator trade-offs in an interview.

12. Submission / Completion Criteria

Minimum Viable Completion:

  • numa_malloc and numa_free work with node binding.
  • Placement verified with numastat.

Full Completion:

  • Arenas, size classes, and thread caches implemented.
  • Local allocation ratio >95%.

Excellence (Going Above & Beyond):

  • Adaptive policies and real workload integration.