Project 5: NUMA-Aware Memory Allocator
Build a per-node memory allocator that guarantees allocations land on a chosen NUMA node and tracks locality statistics.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Level 3: Advanced |
| Time Estimate | 2-3 weeks |
| Main Programming Language | C |
| Alternative Programming Languages | C++, Rust |
| Coolness Level | Level 4: Hardcore Tech Flex |
| Business Potential | 3. The “Service & Support” Model |
| Prerequisites | C memory management, threading basics, NUMA policies |
| Key Topics | libnuma APIs, mbind/set_mempolicy, arenas, fragmentation |
1. Learning Objectives
By completing this project, you will:
- Explain Linux NUMA memory policies and how they affect allocation.
- Implement per-node arenas with explicit placement guarantees.
- Provide a `numa_malloc(node, size)` API with predictable behavior.
- Add thread-local caches to reduce contention.
- Measure local vs remote allocation ratios.
- Handle node exhaustion with fallback policies.
- Produce deterministic allocator tests using fixed workloads.
2. All Theory Needed (Per-Concept Breakdown)
2.1 NUMA Memory Policies and Page Placement APIs
Fundamentals
Linux provides NUMA policies that control where memory pages are allocated: bind (allocate only on specified nodes), preferred (try a node but allow fallback), and interleave (round-robin across nodes). APIs like mbind, set_mempolicy, and libnuma’s numa_alloc_onnode let applications control placement. Without explicit placement, first-touch policy determines node allocation based on the thread that first accesses a page. A NUMA-aware allocator must understand these policies and expose a reliable interface for choosing placement while handling errors such as node exhaustion.
Deep Dive into the Concept
NUMA policy controls are applied at the virtual memory (VM) level. When you allocate memory (via malloc, mmap, or numa_alloc_onnode), the OS reserves virtual address space but does not necessarily map physical pages immediately. Physical pages are mapped on first touch. That means the memory policy in effect at the time of the first access determines where the page is placed. This can lead to surprising behavior if you set a policy during allocation but touch the memory from a different thread or on a different node.
The mbind system call allows you to set a policy on a specific virtual memory range, such as a pool managed by your allocator. The set_mempolicy call sets a policy for the calling thread. libnuma provides convenience wrappers like numa_alloc_onnode and numa_alloc_interleaved, which allocate and touch memory to enforce placement. The allocator you build should use these APIs to ensure determinism: when the user requests memory on node X, the pages should be placed on node X and verified. If the node lacks free pages, the allocator must either fail or fall back to another node based on a configurable policy.
Another subtlety is huge pages. Large allocations can use huge pages to reduce TLB pressure, but huge page pools may be unevenly distributed across nodes. This can skew placement unless your allocator accounts for it. Similarly, transparent huge pages can create unexpected placement behavior because the kernel may merge pages. For this project, you can choose to ignore huge pages or explicitly disable them in tests for determinism.
Policy-aware allocation also interacts with cpusets. If the process is restricted to a subset of nodes, requests for other nodes may fail or be silently redirected. A robust allocator should detect the effective node set and surface warnings. In practice, high-performance allocators (jemalloc, tcmalloc) implement arenas and thread-local caches. Your allocator will use a per-node arena and may offer a per-thread cache that draws from the local arena, ensuring locality and reducing lock contention.
How This Fits the Project
This concept is the foundation of your allocator’s correctness. Every placement guarantee depends on understanding and applying NUMA policies and first-touch semantics.
Definitions & Key Terms
- Bind policy -> Allocate only on specified nodes.
- Preferred policy -> Prefer a node but allow fallback.
- Interleave policy -> Round-robin allocation across nodes.
- First-touch -> Pages allocated on the node of the first access.
- mbind/set_mempolicy -> Syscalls that set memory policy on ranges or threads.
Mental Model Diagram (ASCII)
```
Virtual Address Range (allocator arena)
|--------------------------------------|
Policy: bind -> node 0 only
First touch by thread on node 0 => pages on node 0
First touch by thread on node 1 => remote pages unless policy enforced
```
How It Works (Step-by-Step)
- Reserve a virtual memory region for each node’s arena.
- Apply `mbind` to the region with a bind policy.
- Touch pages from a thread pinned to that node to force placement.
- Track allocation metadata per arena.
- If allocation fails, apply fallback policy or return error.
Invariants: Policy must be applied before first touch; threads must be pinned for deterministic placement.
Failure modes: Node exhaustion, cpuset restrictions, huge page interference.
Minimal Concrete Example
```c
void *p = numa_alloc_onnode(size, node);
if (!p) return NULL;
memset(p, 0, size); // touch pages to enforce placement
```
Common Misconceptions
- “Allocation equals placement” -> placement happens on first touch.
- “Policies are global” -> they can be per-thread or per-range.
- “Fallback never happens” -> preferred policy can allocate elsewhere.
Check-Your-Understanding Questions
- Why might a page be placed on the wrong node even if you used `numa_alloc_onnode`?
- What is the difference between bind and preferred policies?
- How do cpusets affect memory policy?
Check-Your-Understanding Answers
- If the page is touched by a thread on a different node or if policy is not enforced.
- Bind forces allocation on specific nodes; preferred allows fallback.
- They restrict which nodes the process can use, overriding policies.
Real-World Applications
- NUMA-aware allocators in databases and in-memory caches.
- High-frequency trading systems where locality is critical.
Where You’ll Apply It
- In this project: Sec. 3.2, Sec. 4.2, Sec. 5.10.
- Also used in: P07-numa-memory-migration-tool, P10-numa-aware-database-buffer-pool.
References
- “Operating Systems: Three Easy Pieces” – Ch. 13-21
- “C Interfaces and Implementations” (Hanson) – Ch. 5-6
- “The Linux Programming Interface” – Ch. 6
Key Insights
Placement is not a side effect; it is a policy that must be enforced and verified.
Summary
NUMA policies control where pages are placed, but first-touch semantics can override expectations. A NUMA-aware allocator must apply policies before touching memory, pin threads, and handle node exhaustion explicitly.
Homework/Exercises to Practice the Concept
- Allocate memory with a bind policy and verify placement with `numastat`.
- Allocate with a preferred policy and observe fallback behavior.
- Compare placement when touching memory from different nodes.
Solutions to the Homework/Exercises
- Pages should be mostly local to the bound node.
- Fallback will allocate on other nodes if the preferred node is full.
- First-touch determines which node actually receives the pages.
2.2 Allocator Architecture: Arenas, Fragmentation, and Thread Caches
Fundamentals
High-performance allocators use arenas: independent pools of memory that reduce contention. In a NUMA-aware allocator, each node has its own arena, keeping most allocations local. Within an arena, free lists track blocks of different sizes. Thread-local caches can serve small allocations without locks, further reducing overhead. Fragmentation is the cost of unused memory due to allocation patterns; a good allocator balances locality, fragmentation, and concurrency.
Deep Dive into the Concept
An allocator must satisfy requests of varying sizes quickly while minimizing fragmentation and contention. The typical design uses size classes: small allocations are rounded up to fixed sizes, and each size class has its own free list. This simplifies allocation and deallocation but can waste memory (internal fragmentation). Large allocations are handled separately, often by direct mmap or by maintaining large blocks in the arena.
Arenas provide concurrency and locality. With a global allocator, all threads contend on the same locks and data structures. With per-node arenas, threads pinned to a node allocate from the local arena, reducing both contention and remote memory traffic. However, this introduces new challenges: if one node’s arena is exhausted while another is underutilized, you need a fallback strategy. Options include allowing remote allocations, stealing from another arena, or returning an error. Each choice has a trade-off between locality and availability.
Thread-local caches are a further optimization. Each thread keeps a small cache of recently freed blocks. On allocation, it checks the cache first; on free, it returns blocks to the cache. This reduces lock contention but increases memory footprint. In NUMA systems, caches should preferably store blocks from the local arena. If a thread migrates across nodes, it may hold blocks from the wrong node. The allocator can either flush caches on migration or treat migration as an error condition, which is why thread pinning is so important.
Fragmentation is unavoidable, but you can manage it by using segregated free lists, coalescing adjacent free blocks, and periodically scavenging unused blocks. For this project, focus on deterministic behavior and clear locality metrics rather than maximal optimization. A simple per-node arena with size-class free lists and optional thread caches is sufficient to demonstrate the core concepts.
How This Fits the Project
This concept shapes the allocator’s internal design: per-node arenas, size classes, and optional thread caches. It directly affects performance and locality ratios.
Definitions & Key Terms
- Arena -> Independent memory pool with its own free lists.
- Size class -> Fixed-size allocation bucket.
- Fragmentation -> Wasted memory due to allocation patterns.
- Thread-local cache -> Per-thread pool of recently freed blocks.
- Coalescing -> Merging adjacent free blocks.
Mental Model Diagram (ASCII)
```
   Node 0 Arena                Node 1 Arena
[64B][128B][256B] ...      [64B][128B][256B] ...
   ^    ^    ^                ^    ^    ^
   Thread0 cache              Thread1 cache
```
How It Works (Step-by-Step)
- Initialize per-node arenas with size-class free lists.
- On allocation, choose arena based on target node.
- Check thread-local cache, then arena free list.
- If empty, grow the arena using `mmap` plus a policy bind.
- On free, return the block to the thread cache or the arena list.
Invariants: Blocks in an arena are bound to that node.
Failure modes: Thread migration, arena imbalance, fragmentation.
Minimal Concrete Example
```c
typedef struct {
    void *free_lists[NUM_SIZE_CLASSES];
    pthread_mutex_t lock;
} Arena;

Arena arenas[MAX_NODES];

void *numa_malloc(int node, size_t size) {
    Arena *a = &arenas[node];
    // size class lookup and allocation logic
    return allocate_from_arena(a, size);
}
```
Common Misconceptions
- “Per-node arenas eliminate all contention” -> thread-local caches are still useful.
- “Fragmentation is negligible” -> it can dominate memory usage over time.
- “Migration is harmless” -> it can break locality guarantees.
Check-Your-Understanding Questions
- Why do arenas reduce contention?
- How can thread-local caches increase memory usage?
- What happens if a node’s arena runs out of memory?
Check-Your-Understanding Answers
- They reduce shared locks by isolating allocation paths.
- Each thread holds blocks that are not visible to others.
- The allocator must fallback, steal, or fail the allocation.
Real-World Applications
- High-performance allocators like jemalloc and tcmalloc.
- NUMA-aware database and cache engines.
Where You’ll Apply It
- In this project: Sec. 4.2, Sec. 4.3, Sec. 5.10.
- Also used in: P06-numa-aware-data-structure-library.
References
- “C Interfaces and Implementations” – Ch. 5-6
- “Computer Systems: A Programmer’s Perspective” – Ch. 9
Key Insights
Arenas and thread caches trade memory overhead for locality and scalability.
Summary
Allocator design balances concurrency, locality, and fragmentation. Per-node arenas keep memory local, thread caches reduce contention, and size classes manage fragmentation. These mechanisms are the core of a NUMA-aware allocator.
Homework/Exercises to Practice the Concept
- Implement size classes for 16, 32, 64, 128 bytes.
- Simulate allocations and measure fragmentation.
- Add a thread-local cache and compare lock contention.
Solutions to the Homework/Exercises
- Use an array of free lists indexed by size class.
- Fragmentation increases when allocations are irregular; measure wasted bytes.
- Lock contention decreases as more allocations are served from caches.
3. Project Specification
3.1 What You Will Build
A NUMA-aware allocator library providing APIs like numa_malloc(node, size) and numa_free(ptr). It uses per-node arenas, optional thread-local caches, and reports locality statistics.
Included: per-node arenas, placement via mbind/libnuma, stats reporting.
Excluded: full drop-in replacement for malloc or advanced garbage collection.
3.2 Functional Requirements
- Allocate memory on a specified NUMA node.
- Maintain per-node arenas with size classes.
- Provide thread-local caches for small allocations.
- Track local vs remote allocation stats per thread and node.
- Support fallback policy when node memory is exhausted.
- Provide a deterministic test harness with fixed allocation patterns.
3.3 Non-Functional Requirements
- Performance: allocation latency < 2x `malloc` for small sizes.
- Reliability: correct placement verified with `numastat`.
- Usability: clear API and statistics output.
3.4 Example Usage / Output
```c
void *p = numa_malloc(0, 4096);
// use memory
numa_free(p);
```
3.5 Data Formats / Schemas / Protocols
Stats output (JSON):
```json
{
  "node0": {"allocated_mb": 3200, "local_pct": 98.1},
  "node1": {"allocated_mb": 3100, "local_pct": 97.6},
  "remote_pct": 2.1
}
```
3.6 Edge Cases
- Allocation on node not visible to process.
- Node out of memory.
- Thread migration across nodes.
- Fragmentation leads to large free memory but no suitable block.
3.7 Real World Outcome
You can allocate memory on specific nodes with predictable locality and report statistics proving placement success.
3.7.1 How to Run (Copy/Paste)
```sh
cc -O2 -pthread -o numa_alloc_demo src/demo.c -lnuma
./numa_alloc_demo --threads=8 --policy=bind
```
3.7.2 Golden Path Demo (Deterministic)
```
$ ./numa_alloc_demo --threads=8 --policy=bind --seed=42
Node 0: 3.2 GB allocated, 98.1% local
Node 1: 3.1 GB allocated, 97.6% local
Remote allocations: 2.1%
```
3.7.3 If CLI: Exact Terminal Transcript
```
$ ./numa_alloc_demo --threads=8 --policy=interleave
Node 0: 3.1 GB allocated, 50.2% local
Node 1: 3.2 GB allocated, 49.8% local
```
3.7.4 Failure Demo (Deterministic)
```
$ ./numa_alloc_demo --node=9
ERROR: node 9 not available
EXIT: 1
```
Exit Codes:
- 0: success
- 1: invalid node or arguments
- 2: allocation failure
- 3: placement verification failure
4. Solution Architecture
4.1 High-Level Design
```
+--------------+     +----------------+     +------------------+
|  API Layer   |---->| Per-Node Arena |---->| Stats & Reporter |
+--------------+     +----------------+     +------------------+
       ^                     ^
       |                     |
+--------------+     +----------------+
| Thread Cache |---->| Policy Engine  |
+--------------+     +----------------+
```
4.2 Key Components
| Component | Responsibility | Key Decisions |
|---|---|---|
| API Layer | Expose numa_malloc/numa_free | explicit node parameter |
| Policy Engine | Apply mbind/set_mempolicy | bind vs preferred |
| Arena Manager | Maintain per-node free lists | size classes |
| Thread Cache | Reduce contention | cache size limits |
| Stats | Track local/remote ratio | per-thread counters |
4.3 Data Structures (No Full Code)
```c
typedef struct {
    void *free_lists[NUM_CLASSES];
    pthread_mutex_t lock;
} Arena;

typedef struct {
    uint64_t local_allocs;
    uint64_t remote_allocs;
} AllocStats;
```
4.4 Algorithm Overview
Key Algorithm: Allocation
- Determine size class.
- Choose arena based on node.
- Try thread-local cache; if empty, lock arena and pop block.
- If the arena is empty, grow it using `mmap` plus `mbind`.
- Update statistics and return the pointer.
Complexity Analysis:
- Time: O(1) average
- Space: O(total allocations)
5. Implementation Guide
5.1 Development Environment Setup
```sh
sudo apt-get install -y build-essential libnuma-dev
```
5.2 Project Structure
```
numa-allocator/
|-- src/
|   |-- allocator.c
|   |-- arena.c
|   |-- policy.c
|   |-- stats.c
|   +-- demo.c
|-- include/
|   +-- numa_alloc.h
|-- tests/
|   +-- deterministic_alloc.c
|-- Makefile
+-- README.md
```
5.3 The Core Question You’re Answering
“How do I guarantee that memory allocations land on the node I want?”
5.4 Concepts You Must Understand First
- NUMA policies and first-touch.
- Arena-based allocator design.
- Thread pinning and locality metrics.
5.5 Questions to Guide Your Design
- How will you handle node exhaustion?
- Will you allow cross-node allocations or fail fast?
- How will you measure and report locality?
5.6 Thinking Exercise
Design a fallback policy that preserves locality for small allocations but allows large allocations to spill over. What trade-offs does this create?
5.7 The Interview Questions They’ll Ask
- “How does first-touch differ from explicit allocation on a node?”
- “Why do allocators use per-thread or per-node arenas?”
- “How would you measure allocator locality?”
5.8 Hints in Layers
Hint 1: Start with a wrapper around numa_alloc_onnode.
Hint 2: Add per-node free lists.
Hint 3: Add thread caches for small sizes.
Hint 4: Track local/remote allocations with counters.
5.9 Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Allocator design | “C Interfaces and Implementations” | Ch. 5-6 |
| Memory policies | “Operating Systems: Three Easy Pieces” | Ch. 13-21 |
| NUMA basics | “Computer Architecture” | Ch. 2 |
5.10 Implementation Phases
Phase 1: Basic Allocation (4-5 days)
Goals: implement basic per-node allocation using libnuma.
Tasks:
- Add `numa_malloc` and `numa_free` wrappers.
- Verify placement with `numastat`.
Checkpoint: allocations are local to chosen nodes.
Phase 2: Arenas + Free Lists (5-6 days)
Goals: add arenas and size classes.
Tasks:
- Build size-class free lists.
- Add per-node arenas and locks.
Checkpoint: allocator handles many small allocations.
Phase 3: Thread Caches + Stats (4-5 days)
Goals: reduce contention and report locality stats.
Tasks:
- Add thread-local caches.
- Add local/remote allocation metrics.
Checkpoint: stats show >95% local allocations.
5.11 Key Implementation Decisions
| Decision | Options | Recommendation | Rationale |
|---|---|---|---|
| Fallback policy | fail vs remote | preferred + fallback | avoid crashes, preserve locality |
| Thread caches | on vs off | on | reduces lock contention |
| Size classes | fixed vs dynamic | fixed | simpler and deterministic |
6. Testing Strategy
6.1 Test Categories
| Category | Purpose | Examples |
|---|---|---|
| Unit Tests | Validate size class selection | allocation sizes map correctly |
| Integration Tests | Verify placement | numastat shows local pages |
| Edge Case Tests | Node exhaustion | fallback behavior |
6.2 Critical Test Cases
- Local allocation ratio >95% under node-local workload.
- Node exhaustion triggers fallback or error as configured.
- Thread migration detected and reported.
6.3 Test Data
```
seed=42, allocations=1,000,000, size=64 bytes
expected_local_pct >= 95%
```
7. Common Pitfalls & Debugging
7.1 Frequent Mistakes
| Pitfall | Symptom | Solution |
|---|---|---|
| First-touch ignored | Remote pages | Touch pages on target node |
| Fragmentation | High memory usage | Use size classes and coalescing |
| Thread migration | Locality drops | Pin threads |
7.2 Debugging Strategies
- Use `numastat -p <pid>` to inspect per-node allocation.
- Log allocations with the node ID for verification.
7.3 Performance Traps
- Overly large thread caches can waste memory.
8. Extensions & Challenges
8.1 Beginner Extensions
- Add `numa_realloc` support.
- Add a CLI demo that stresses the allocator.
8.2 Intermediate Extensions
- Implement huge-page allocations.
- Add per-thread locality reports.
8.3 Advanced Extensions
- Integrate allocator with a real workload (e.g., hash map).
- Add adaptive policies based on observed access patterns.
9. Real-World Connections
9.1 Industry Applications
- Database buffer pools and in-memory caches.
- High-frequency trading engines.
9.2 Related Open Source Projects
- jemalloc – arenas and per-thread caches.
- tcmalloc – scalable allocator design.
9.3 Interview Relevance
- Explaining how allocators enforce NUMA locality.
10. Resources
10.1 Essential Reading
- “C Interfaces and Implementations” – Ch. 5-6
- “Operating Systems: Three Easy Pieces” – Ch. 13-21
10.2 Video Resources
- Allocator design talks.
10.3 Tools & Documentation
- libnuma – NUMA allocation APIs.
- numastat – placement verification.
10.4 Related Projects in This Series
11. Self-Assessment Checklist
11.1 Understanding
- I can explain bind vs preferred vs interleave policies.
- I can explain why per-node arenas reduce contention.
11.2 Implementation
- Allocations are local to chosen nodes.
- Stats reporting is accurate.
11.3 Growth
- I can explain allocator trade-offs in an interview.
12. Submission / Completion Criteria
Minimum Viable Completion:
- `numa_malloc` and `numa_free` work with node binding.
- Placement verified with `numastat`.
Full Completion:
- Arenas, size classes, and thread caches implemented.
- Local allocation ratio >95%.
Excellence (Going Above & Beyond):
- Adaptive policies and real workload integration.