Project 5: Lock-Free Stack
Quick Reference
| Attribute | Details |
|---|---|
| Difficulty | Expert |
| Time Estimate | 2-4 weeks |
| Primary Language | C++ |
| Alternative Languages | C, Rust |
| Knowledge Area | Lock-Free Data Structures |
| Tools Required | C++17/20 compiler, ThreadSanitizer, AddressSanitizer |
| Primary Reference | “C++ Concurrency in Action, 2nd Edition” by Anthony Williams, Chapter 7 |
Learning Objectives
By completing this project, you will be able to:
- Implement a lock-free stack using atomic compare-and-swap (CAS) operations
- Explain the ABA problem and why naive CAS-based algorithms fail
- Implement hazard pointers for safe memory reclamation in lock-free structures
- Implement epoch-based reclamation as an alternative to hazard pointers
- Choose correct memory orderings (acquire/release/seq_cst) for each atomic operation
- Stress test concurrent code using sanitizers and multi-threaded workloads
- Reason about all possible interleavings of concurrent operations
Theoretical Foundation
The Core Question You’re Answering
“How do you build a concurrent data structure that multiple threads can modify simultaneously without ever blocking, and how do you safely free memory when any thread might be accessing it?”
This is the pinnacle of concurrent programming. After this project, you’ll understand:
- Why locks are sometimes the enemy of performance
- Why “just use CAS” doesn’t work without memory reclamation
- How to think about every possible interleaving of operations
- Why lock-free programming is considered “hard mode”
Most developers use lock-based containers their entire career. After this project, you’ll understand the frontier of concurrent programming that powers high-performance systems like databases, game engines, and trading platforms.
Core Concepts
Lock-Free Guarantees
Lock-free means system-wide progress is guaranteed: as long as some thread is running, at least one operation completes in a finite number of steps. Even if some threads are suspended indefinitely, other threads will continue making progress.
| Lock-Based (Blocking) | Lock-Free (Non-Blocking) |
|---|---|
| Thread 1 holds lock | Thread 1 reading head |
| Thread 1 suspended! | Thread 1 suspended |
| Threads 2, 3, 4: BLOCKED | Threads 2, 3, 4: PROGRESS! |
| System makes NO progress | System continues working |
Why this matters:
- No deadlocks possible (no locks to deadlock on)
- No priority inversion (no lock holder to wait for)
- Better latency under contention (no sleeping/waking)
- Threads can fail without corrupting structure
The trade-off: Lock-free code is dramatically harder to write correctly. Every line requires reasoning about all possible thread interleavings.
Compare-And-Swap (CAS)
CAS is the fundamental building block of lock-free algorithms:
// Atomically: if current value == expected, set to desired, return true
// else update expected to current value, return false
bool compare_exchange_weak(T& expected, T desired);
CAS is a “conditional write” - it succeeds only if the value still equals what we last read. (As we’ll see with ABA, “still equals” is not quite the same as “hasn’t changed”.)
Thread 1: reads head = A, prepares new_node
Thread 2: reads head = A, prepares new_node2
Thread 2: CAS(A, new_node2) - SUCCESS! head = new_node2
Thread 1: CAS(A, new_node) - FAILS! head now = new_node2; expected updated to new_node2
Thread 1: retries: CAS(new_node2, new_node) - SUCCESS!
Key insight: CAS lets threads “race” to modify shared data. Losers detect they lost and retry with updated information.
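To make the retry pattern concrete, here is a minimal sketch of a CAS loop that atomically doubles a shared integer (the function name is ours, purely illustrative):
#include <atomic>
// Minimal CAS retry loop. On failure, compare_exchange_weak writes the
// freshly observed value back into `expected`, so every retry recomputes
// the desired value from up-to-date information.
void atomic_double(std::atomic<int>& value) {
    int expected = value.load(std::memory_order_relaxed);
    while (!value.compare_exchange_weak(expected, expected * 2)) {
        // expected was updated to the current value; the loop recomputes
    }
}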
The ABA Problem
The ABA problem is the critical flaw in naive lock-free algorithms. It occurs when:
- Thread 1 reads value A
- Thread 1 is preempted
- Thread 2 changes A to B
- Thread 2 changes B back to A (or allocates new memory at A’s address)
- Thread 1 resumes, sees A, CAS succeeds
- But the structure has fundamentally changed!
Concrete stack example:
Initial State:
head -> [A, data1] -> [B, data2] -> [C, data3] -> null
Thread 1: Pop operation
1. reads head = A
2. reads A->next = B
3. will do CAS(head, A, B)
4. ** PREEMPTED **
Thread 2: (while Thread 1 sleeps)
5. pops A (frees A's memory)
6. pops B (frees B's memory)
7. pushes Y
8. pushes new node X (allocator reuses A's memory address!)
Stack now:
head -> [X at A's old address, dataX] -> [Y, dataY] -> [C, data3] -> null
Note that X is pushed last, so head once again holds A’s address - that is exactly what lets the stale CAS succeed.
Thread 1: Resumes
9. does CAS(head, A, B)
10. CAS SUCCEEDS! (head still "equals" A... the address)
11. head now points to B
DISASTER:
- B was freed! Use-after-free!
- Lost nodes X and Y entirely!
- Stack is corrupted
Timeline of the ABA problem:
| Thread 1 | Thread 2 | Stack State |
|---|---|---|
| read head=A | | head->[A]->[B]->[C] |
| read A->next=B | | |
| (preempted) | | |
| | pop A, free A | head->[B]->[C] |
| | pop B, free B | head->[C] |
| | push Y | head->[Y]->[C] |
| | push X (reuses A's addr) | head->[A']->[Y]->[C] |
| CAS(A, B) succeeds! | | head->[B] <- FREED! |
| CRASH or corruption | | |
Why CAS doesn’t catch this: CAS only checks if the pointer value (the address) is the same. It doesn’t know that the memory was freed and reused. The address A is the same, but it now points to completely different data!
Memory Reclamation Solutions
The core problem: When is it safe to free a node in a lock-free structure?
You can’t just delete it after a CAS succeeds - another thread might be holding a pointer to it!
Solution 1: Hazard Pointers
Concept: Each thread publishes which pointers it’s currently dereferencing. Before freeing a node, check if any thread has it as a hazard pointer.
Global Hazard Pointer Array (one slot per thread):
[Thread 0: nullptr ]
[Thread 1: 0x1234 ] <- Thread 1 is accessing node at 0x1234
[Thread 2: nullptr ]
[Thread 3: 0x5678 ] <- Thread 3 is accessing node at 0x5678
Before dereferencing:
1. Publish pointer to your hazard slot
2. Verify pointer is still valid (re-read from structure)
3. If not valid, clear slot and retry
Before freeing:
1. Add node to "retired list" instead of freeing
2. Periodically scan retired list
3. For each retired node, check all hazard slots
4. If no thread has it hazarded, safe to free
Hazard Pointer Protocol:
T* acquire_safe_pointer(std::atomic<T*>& source) {
    T* ptr;
    do {
        ptr = source.load();
        hazard_pointer = ptr; // Publish our intent
        // A full (seq_cst) fence is required here: the publishing store
        // must not be reordered after the validating re-read below
    } while (ptr != source.load()); // Verify it's still valid
    return ptr;
}
void retire(T* ptr) {
retired_list.push(ptr);
if (retired_list.size() > threshold) {
scan_and_free();
}
}
void scan_and_free() {
    // Collect all active hazard pointers
    std::unordered_set<T*> hazards;
    for (int i = 0; i < num_threads; i++) {
        T* hp = hazard_pointers[i].load();
        if (hp) hazards.insert(hp);
    }
    // Free nodes not in the hazard set; keep the rest for a later scan
    std::vector<T*> still_hazarded;
    for (T* node : retired_list) {
        if (hazards.find(node) == hazards.end()) {
            delete node;
        } else {
            still_hazarded.push_back(node); // Still hazarded, retry later
        }
    }
    retired_list = std::move(still_hazarded); // Freed nodes must leave the list
}
Solution 2: Epoch-Based Reclamation
Concept: Divide time into “epochs”. A node retired in epoch N can only be freed once all threads have moved past epoch N.
Global epoch: 2
Thread 0: epoch=2, active=true <- Currently accessing structures
Thread 1: epoch=2, active=false <- Quiescent, not accessing anything
Thread 2: epoch=1, active=true <- Still in old epoch!
Thread 3: epoch=2, active=true
Retired lists:
Epoch 0: [node1, node2] <- Can free! All threads past epoch 0
Epoch 1: [node3] <- Cannot free! Thread 2 still in epoch 1
Epoch 2: [node4, node5] <- Cannot free! Current epoch
Epoch Protocol:
void enter_critical() {
thread_epoch = global_epoch.load();
thread_active = true;
}
void exit_critical() {
thread_active = false;
}
void retire(T* ptr) {
    retired[global_epoch % 3].push(ptr);
    try_advance_epoch();
}
void try_advance_epoch() {
    // Check whether every active thread has observed the current epoch
    for (int i = 0; i < num_threads; i++) {
        if (thread_active[i] && thread_epoch[i] != global_epoch) {
            return; // Some thread is behind
        }
    }
    // All active threads caught up, advance epoch
    uint64_t old_epoch = global_epoch++;
    // Nodes retired two epochs ago can no longer be referenced; with
    // three buckets, (old_epoch - 2) and (old_epoch + 1) index the same slot
    for (T* node : retired[(old_epoch + 1) % 3]) {
        delete node;
    }
    retired[(old_epoch + 1) % 3].clear();
}
Solution 3: Reference Counting
Concept: Each node tracks how many threads are accessing it. Only free when count reaches zero.
struct Node {
T data;
std::atomic<Node*> next;
std::atomic<int> ref_count; // How many threads have a pointer to this
};
Problem: Incrementing ref_count and reading the pointer must be atomic together, which is hard without double-width CAS (not available on all platforms).
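A sketch of the packed counter-plus-pointer layout this refers to (the approach behind split reference counting); counted_node_ptr is our illustrative name, and whether std::atomic of this type is actually lock-free depends on the platform providing a double-width CAS:
#include <atomic>
struct node; // forward declaration, for illustration only
// Counter and pointer packed together so a single double-width CAS
// updates both at once.
struct counted_node_ptr {
    int external_count; // bumped before each dereference
    node* ptr;
};
std::atomic<counted_node_ptr> head;
// head.is_lock_free() reports whether the platform really provides a
// double-width CAS (e.g. cmpxchg16b on x86-64) or falls back to a lock.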
Why This Matters
Lock-free stacks are used in:
- Memory allocators (jemalloc, tcmalloc) - Per-thread free lists
- Work-stealing schedulers (Intel TBB, Tokio) - Task queues
- Game engines - Lock-free command buffers
- Databases - Concurrent index structures
- High-frequency trading - Ultra-low-latency queues
Understanding lock-free programming permanently changes how you think about concurrent code. You’ll never look at a mutex the same way again.
Historical Context
- 1986: Treiber’s stack, one of the first practical lock-free algorithms (IBM technical report)
- 1990: Herlihy and Wing formalize linearizability
- 2002: Michael’s hazard pointers paper
- 2004: Fraser’s epoch-based reclamation
- 2011: C++11 adds memory model and atomics (making portable lock-free code possible)
- 2015: Folly and other libraries provide production lock-free structures
Common Misconceptions
Misconception 1: “Lock-free is always faster than locks”
Wrong! Lock-free has higher constant overhead per operation. For low-contention cases, a simple mutex is often faster. Lock-free shines under high contention and when you need latency guarantees.
Misconception 2: “I can use std::atomic and be done”
Atomic operations are necessary but not sufficient. You still need to handle memory reclamation, prevent ABA, and reason about all interleavings.
Misconception 3: “CAS failures are rare”
Under high contention, CAS can fail repeatedly. You need exponential backoff or other strategies to avoid livelock.
Misconception 4: “Sequential consistency is the only safe choice”
Relaxed orderings are often sufficient and faster. But choosing correctly requires deep understanding of the memory model.
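For example, a pure statistics counter can safely use relaxed ordering, because nothing else is published through it - a minimal sketch:
#include <atomic>
#include <cstdint>
std::atomic<std::uint64_t> pop_count{0};
// Relaxed is sufficient here: the counter publishes no other data, so
// we only need the increment itself to be atomic, not any ordering.
void record_pop() {
    pop_count.fetch_add(1, std::memory_order_relaxed);
}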
Project Specification
What You’ll Build
A lock-free stack (LIFO) that:
- Supports concurrent push and pop operations from multiple threads
- Uses a CAS-based algorithm with proper memory ordering
- Handles the ABA problem via hazard pointers or epoch-based reclamation
- Passes stress tests with ThreadSanitizer and AddressSanitizer
- Demonstrates performance advantage under high contention
Interface Specification
template<typename T>
class LockFreeStack {
public:
LockFreeStack();
~LockFreeStack();
// Push a value onto the stack (thread-safe)
void push(T value);
// Pop a value from the stack (thread-safe)
// Returns std::nullopt if stack is empty
std::optional<T> pop();
// Check if stack is empty (approximate - may be stale immediately)
bool empty() const;
// For debugging/testing
size_t size() const; // Approximate count
};
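A quick usage sketch against this interface (assuming the default-constructible version from the specification above):
#include <cassert>
#include <optional>
void interface_demo() {
    LockFreeStack<int> stack; // as specified above
    stack.push(42);
    std::optional<int> v = stack.pop();
    assert(v && *v == 42);
    assert(!stack.pop().has_value()); // empty stack -> nullopt
}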
Expected Output
$ ./lockfree_stack_test
=== Lock-Free Stack Test Suite ===
Basic Operations Test:
Push 1000 items sequentially: PASS (1.2ms)
Pop 1000 items (LIFO order verified): PASS (0.8ms)
Pop from empty stack returns nullopt: PASS
Concurrent Stress Test (8 threads, 100K ops each):
Configuration:
Threads: 8
Operations per thread: 100,000
Operation mix: 50% push, 50% pop
Running stress test...
Completed in 245ms
Results:
Total pushes attempted: 400,000
Total pops attempted: 400,000
Successful pops: 400,000
Empty pops (nullopt): 0
Final stack size: 0
Correctness checks:
No lost items: PASS
No duplicate pops: PASS
LIFO order maintained: PASS
Sanitizer reports:
ThreadSanitizer: No races detected
AddressSanitizer: No memory errors
ABA Problem Simulation Test:
Creating ABA-prone scenario...
Thread 1: Reading head, about to pop
Thread 2: Pop A, Pop B, Push new node at A's address
Thread 1: Attempting CAS...
With naive implementation: WOULD CRASH (use-after-free)
With hazard pointers: PASS - ABA prevented
Hazard pointer protected the following accesses:
Node 0x7f8b12340000: 3 hazard references during pop
Node 0x7f8b12340040: 2 hazard references during pop
Memory Reclamation Test:
Push 1,000,000 items
Pop 1,000,000 items
Memory allocation stats:
Nodes allocated: 1,000,000
Nodes reclaimed: 1,000,000
Memory leaked: 0 bytes
Peak memory usage: 48.2 MB
Final memory usage: 0.1 MB (overhead only)
Reclamation timing:
Hazard pointer scans: 847
Average nodes per scan: 1,180
Scan overhead: 2.3% of total time
Performance Comparison:
Test: 8 threads, 1M operations each, 50/50 push/pop
Lock-free stack: 2.1M ops/sec (478ns avg latency)
Mutex-protected stack: 0.8M ops/sec (1250ns avg latency)
std::stack (1 thread): 15.0M ops/sec (67ns avg latency)
Lock-free advantage:
2.6x throughput vs mutex under contention
Consistent latency (no mutex wait spikes)
P99 latency: 890ns (vs 4200ns for mutex)
=== All Tests Passed ===
Solution Architecture
Component Diagram
LockFreeStack<T>
  +-- Node storage: std::atomic<Node*> head linking heap-allocated nodes
  +-- Hazard pointer array: one std::atomic<void*> slot per thread
  +-- Retired list: per-thread deferred deletes, freed once no slot holds them
The ABA Problem - Visual Walkthrough
STEP 1: Initial State
========================
Thread 1: about to pop
Stack: head --> [A|next:B] --> [B|next:C] --> [C|next:null]
Thread 1 reads:
- old_head = A
- next = B
STEP 2: Thread 1 Preempted
========================
Thread 1: (sleeping, holds old_head=A, next=B)
Stack: head --> [A|next:B] --> [B|next:C] --> [C|next:null]
STEP 3: Thread 2 Pops A
========================
Thread 2: CAS(head, A, B) succeeds
Stack: head --> [B|next:C] --> [C|next:null]
[A] is now freed!
STEP 4: Thread 2 Pops B
========================
Thread 2: CAS(head, B, C) succeeds
Stack: head --> [C|next:null]
[B] is now freed!
STEP 5: Thread 2 Pushes X (Memory Reuse!)
========================
Thread 2: new Node X allocated at SAME ADDRESS as A was!
Stack: head --> [X|next:C] --> [C|next:null]
^
|
Same address as A!
STEP 6: Thread 1 Resumes - DISASTER!
========================
Thread 1: still holds old_head=A (now X), next=B (FREED!)
Thread 1 does: CAS(head, A, B)
CAS(head, X, B) <-- X happens to equal A's address!
CAS SUCCEEDS because head still "equals" A (same address)!
Stack: head --> [B|...] --> ???
CORRUPTION:
- head now points to freed memory B!
- Node X is lost!
- Undefined behavior ensues
Hazard Pointer Solution - Visual
STEP 1: Thread 1 Begins Pop (WITH HAZARD POINTER)
========================
Thread 1:
1. Load head -> A
2. PUBLISH hazard_pointer[thread_1] = A <-- PROTECTION!
3. Verify head still == A (double-check)
4. Read A->next = B
5. Attempt CAS
Hazard Array: [T0: null] [T1: A] [T2: null] [T3: null]
STEP 2: Thread 1 Preempted
========================
Hazard Array: [T0: null] [T1: A] [T2: null] [T3: null]
^^^^^
A is protected!
STEP 3: Thread 2 Pops A
========================
Thread 2: CAS(head, A, B) succeeds
But Thread 2 does NOT free A immediately!
Instead: retired_list.push(A)
STEP 4: Thread 2 Tries to Reclaim
========================
Thread 2: scan_and_free()
- Collects hazard pointers: {A} <-- A is hazarded!
- Checks retired_list: [A]
- A is in hazard set: SKIP, don't free
Result: A stays allocated, Thread 1 is safe!
STEP 5: Thread 1 Resumes
========================
Thread 1: CAS(head, A, B)
- head is now B (Thread 2 changed it)
- CAS FAILS (expected A, got B)
- Thread 1 retries with correct value
No corruption! Hazard pointers saved the day.
STEP 6: Thread 1 Finishes and Clears Hazard
========================
Thread 1: hazard_pointer[thread_1] = nullptr
Next scan_and_free() can reclaim A.
Key Components
Node Structure
template<typename T>
struct Node {
T data;
std::atomic<Node*> next;
explicit Node(T value) : data(std::move(value)), next(nullptr) {}
};
Hazard Pointer System
class HazardPointerDomain {
static constexpr size_t MAX_THREADS = 64;
std::array<std::atomic<void*>, MAX_THREADS> hazard_pointers;
// Per-thread retired list: pointer plus its type-erased deleter
// (a thread_local data member must also be static)
static thread_local std::vector<std::pair<void*, std::function<void(void*)>>> retired_list;
public:
// Acquire a hazard pointer slot for this thread
void* protect(std::atomic<void*>& source);
// Release hazard pointer
void release(int slot);
// Defer deletion until safe
void retire(void* ptr, std::function<void(void*)> del);
// Try to reclaim retired nodes
void scan();
};
Data Flow
Push Operation:
1. Allocate new node
2. Load current head
3. Set new_node->next = head
4. CAS(head, old_head, new_node)
5. If failed, goto 2 (retry)
Pop Operation:
1. Load head (protected by hazard pointer)
2. If null, return empty
3. Load head->next
4. CAS(head, old_head, next)
5. If failed, clear hazard pointer, goto 1
6. If success, retire old_head, return data
Implementation Guide
Phase 1: Basic Lock-Free Stack (Days 1-3)
Goal: Implement push/pop with CAS, ignoring memory reclamation.
Note: This version has the ABA problem and will crash under stress. That’s intentional - you need to see the problem before solving it.
template<typename T>
class NaiveLockFreeStack {
struct Node {
T data;
Node* next;
Node(T val) : data(std::move(val)), next(nullptr) {}
};
std::atomic<Node*> head{nullptr};
public:
void push(T value) {
Node* new_node = new Node(std::move(value));
new_node->next = head.load(std::memory_order_relaxed);
while (!head.compare_exchange_weak(
new_node->next, // Expected (updated on failure)
new_node, // Desired
std::memory_order_release,// Success ordering
std::memory_order_relaxed // Failure ordering
)) {
// CAS failed, new_node->next now has current head
// Loop will retry
}
}
std::optional<T> pop() {
Node* old_head = head.load(std::memory_order_acquire);
while (old_head != nullptr) {
Node* next = old_head->next; // DANGER: old_head might be freed!
if (head.compare_exchange_weak(
old_head, // Expected
next, // Desired
std::memory_order_release,
std::memory_order_relaxed
)) {
T value = std::move(old_head->data);
delete old_head; // DANGER: Another thread might be reading this!
return value;
}
// CAS failed, old_head updated to current head
}
return std::nullopt; // Stack was empty
}
};
Why this breaks: After loading old_head (and before our CAS completes), another thread can:
- Pop old_head (free it)
- Pop next (free it)
- Push new nodes that reuse old_head’s address
- Now our CAS succeeds but points to freed memory
Test this phase: Run with AddressSanitizer. It should detect use-after-free under stress.
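A minimal harness for that experiment (our own sketch; ASan is likely, though not guaranteed, to catch the use-after-free on any single run):
#include <thread>
#include <vector>
void naive_stack_crash_test() {
    NaiveLockFreeStack<int> stack; // the Phase 1 class above
    std::vector<std::thread> threads;
    for (int t = 0; t < 8; ++t) {
        threads.emplace_back([&stack, t] {
            for (int i = 0; i < 100000; ++i) {
                stack.push(t * 100000 + i); // mixed push/pop maximizes
                stack.pop();                // pointer reuse and contention
            }
        });
    }
    for (auto& th : threads) th.join();
}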
Phase 2: Hazard Pointer Infrastructure (Days 4-7)
Goal: Build the hazard pointer system.
class HazardPointerDomain {
public:
static constexpr size_t MAX_THREADS = 128;
static constexpr size_t SCAN_THRESHOLD = 100;
private:
struct HPRecord {
std::atomic<std::thread::id> owner;
std::atomic<void*> pointer;
};
std::array<HPRecord, MAX_THREADS> hp_records;
// Thread-local storage
struct ThreadData {
int slot = -1;
std::vector<std::pair<void*, std::function<void(void*)>>> retired;
};
static inline thread_local ThreadData tls; // inline (C++17): no out-of-class definition needed
public:
// Acquire a slot for this thread
int acquire_slot() {
auto tid = std::this_thread::get_id();
for (size_t i = 0; i < MAX_THREADS; ++i) {
std::thread::id empty;
if (hp_records[i].owner.compare_exchange_strong(
empty, tid, std::memory_order_acquire)) {
return static_cast<int>(i);
}
}
throw std::runtime_error("No hazard pointer slots available");
}
void release_slot(int slot) {
hp_records[slot].pointer.store(nullptr, std::memory_order_release);
hp_records[slot].owner.store(std::thread::id{}, std::memory_order_release);
}
// Protect a pointer (publish to hazard array)
template<typename T>
T* protect(int slot, const std::atomic<T*>& source) {
    T* ptr;
    do {
        ptr = source.load(std::memory_order_acquire);
        // seq_cst, not release: the publishing store must be globally
        // visible before the validating re-read (a store-then-load
        // pattern that release/acquire alone does not order)
        hp_records[slot].pointer.store(ptr, std::memory_order_seq_cst);
        // Must re-check: source might have changed between load and store
    } while (ptr != source.load(std::memory_order_seq_cst));
    return ptr;
}
void clear(int slot) {
hp_records[slot].pointer.store(nullptr, std::memory_order_release);
}
// Defer deletion
template<typename T>
void retire(T* ptr) {
tls.retired.emplace_back(ptr, [](void* p) { delete static_cast<T*>(p); });
if (tls.retired.size() >= SCAN_THRESHOLD) {
scan();
}
}
void scan() {
// Collect all active hazard pointers
std::unordered_set<void*> hazards;
for (size_t i = 0; i < MAX_THREADS; ++i) {
void* hp = hp_records[i].pointer.load(std::memory_order_acquire);
if (hp != nullptr) {
hazards.insert(hp);
}
}
// Partition: keep hazarded, delete others
auto it = std::remove_if(tls.retired.begin(), tls.retired.end(),
[&hazards](const auto& entry) {
if (hazards.find(entry.first) == hazards.end()) {
entry.second(entry.first); // Delete
return true; // Remove from list
}
return false; // Keep in list
});
tls.retired.erase(it, tls.retired.end());
}
};
Phase 3: Integrate Hazard Pointers with Stack (Days 8-10)
Goal: Make the stack ABA-safe.
template<typename T>
class LockFreeStack {
struct Node {
T data;
std::atomic<Node*> next;
Node(T val) : data(std::move(val)), next(nullptr) {}
};
std::atomic<Node*> head{nullptr};
HazardPointerDomain& hp_domain;
// Thread-local slot management
static inline thread_local int hp_slot = -1;
static inline thread_local bool slot_initialized = false;
int get_slot() {
if (!slot_initialized) {
hp_slot = hp_domain.acquire_slot();
slot_initialized = true;
}
return hp_slot;
}
public:
explicit LockFreeStack(HazardPointerDomain& domain) : hp_domain(domain) {}
void push(T value) {
    Node* new_node = new Node(std::move(value));
    Node* expected = head.load(std::memory_order_relaxed);
    new_node->next.store(expected, std::memory_order_relaxed);
    // compare_exchange_weak needs an lvalue to write the observed value
    // into, so track the expected head in a local variable
    while (!head.compare_exchange_weak(
               expected,
               new_node,
               std::memory_order_release,
               std::memory_order_relaxed
           )) {
        // CAS failed: expected now holds the current head; re-link and retry
        new_node->next.store(expected, std::memory_order_relaxed);
    }
}
std::optional<T> pop() {
int slot = get_slot();
Node* old_head;
while (true) {
// Protect head with hazard pointer
old_head = hp_domain.protect(slot, head);
if (old_head == nullptr) {
hp_domain.clear(slot);
return std::nullopt;
}
// Now safe to read old_head->next (it's hazard-protected)
Node* next = old_head->next.load(std::memory_order_acquire);
if (head.compare_exchange_weak(
old_head,
next,
std::memory_order_release,
std::memory_order_relaxed
)) {
// Success! We own old_head now
break;
}
// CAS failed, loop will re-protect new head
}
// Clear hazard pointer before retiring
hp_domain.clear(slot);
T value = std::move(old_head->data);
// Don't delete immediately - retire for later reclamation
hp_domain.retire(old_head);
return value;
}
};
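Putting the pieces together, a hypothetical caller looks like this (names match the Phase 2/3 sketches above):
#include <iostream>
int main() {
    HazardPointerDomain domain;
    LockFreeStack<int> stack(domain);
    stack.push(1);
    stack.push(2);
    while (auto v = stack.pop()) { // drains in LIFO order: 2, then 1
        std::cout << *v << "\n";
    }
}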
Phase 4: Stress Testing (Days 11-12)
Goal: Verify correctness under heavy concurrent load.
void stress_test() {
HazardPointerDomain hp_domain;
LockFreeStack<int> stack(hp_domain);
constexpr int NUM_THREADS = 8;
constexpr int OPS_PER_THREAD = 100000;
std::atomic<int> total_pushed{0};
std::atomic<int> total_popped{0};
std::atomic<int> empty_pops{0};
auto worker = [&](int thread_id) {
std::mt19937 rng(thread_id);
std::uniform_int_distribution<int> dist(0, 1);
for (int i = 0; i < OPS_PER_THREAD; ++i) {
if (dist(rng) == 0) {
// Push
stack.push(thread_id * OPS_PER_THREAD + i);
total_pushed.fetch_add(1, std::memory_order_relaxed);
} else {
// Pop
auto result = stack.pop();
if (result) {
total_popped.fetch_add(1, std::memory_order_relaxed);
} else {
empty_pops.fetch_add(1, std::memory_order_relaxed);
}
}
}
};
std::vector<std::thread> threads;
auto start = std::chrono::high_resolution_clock::now();
for (int i = 0; i < NUM_THREADS; ++i) {
threads.emplace_back(worker, i);
}
for (auto& t : threads) {
t.join();
}
auto end = std::chrono::high_resolution_clock::now();
auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(end - start);
// Drain remaining items
int remaining = 0;
while (stack.pop()) {
remaining++;
total_popped.fetch_add(1, std::memory_order_relaxed);
}
std::cout << "Stress test completed in " << duration.count() << "ms\n";
std::cout << "Pushed: " << total_pushed << "\n";
std::cout << "Popped: " << total_popped << "\n";
std::cout << "Empty pops: " << empty_pops << "\n";
std::cout << "Remaining: " << remaining << "\n";
assert(total_pushed == total_popped);
std::cout << "PASS: All items accounted for\n";
}
Phase 5: Performance Benchmarking (Days 13-14)
Goal: Compare against mutex-based implementation.
template<typename T>
class MutexStack {
std::stack<T> stack;
mutable std::mutex mtx;
public:
void push(T value) {
std::lock_guard<std::mutex> lock(mtx);
stack.push(std::move(value));
}
std::optional<T> pop() {
std::lock_guard<std::mutex> lock(mtx);
if (stack.empty()) return std::nullopt;
T value = std::move(stack.top());
stack.pop();
return value;
}
};
void benchmark() {
// Test both implementations under same conditions
// Measure throughput (ops/sec) and latency distribution
}
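One way to fill that in - a minimal throughput harness of our own that works for either stack type (latency histograms are left out for brevity):
#include <chrono>
#include <thread>
#include <vector>
template <typename Stack>
double measure_ops_per_sec(Stack& stack, int num_threads, int ops_per_thread) {
    auto start = std::chrono::steady_clock::now();
    std::vector<std::thread> threads;
    for (int t = 0; t < num_threads; ++t) {
        threads.emplace_back([&stack, ops_per_thread] {
            for (int i = 0; i < ops_per_thread; ++i) {
                stack.push(i); // one push + one pop = two operations
                stack.pop();
            }
        });
    }
    for (auto& th : threads) th.join();
    std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return (2.0 * num_threads * ops_per_thread) / elapsed.count();
}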
Testing Strategy
Compiler Flags
# Development build with sanitizers
clang++ -std=c++20 -g -O1 \
-fsanitize=thread \
-fno-omit-frame-pointer \
lockfree_stack.cpp -o lockfree_stack_tsan
clang++ -std=c++20 -g -O1 \
-fsanitize=address \
-fno-omit-frame-pointer \
lockfree_stack.cpp -o lockfree_stack_asan
# Release build for benchmarking
clang++ -std=c++20 -O3 -DNDEBUG \
-march=native \
lockfree_stack.cpp -o lockfree_stack_release
ThreadSanitizer Testing
ThreadSanitizer detects data races - concurrent accesses where at least one is a write and there’s no synchronization.
$ ./lockfree_stack_tsan
# Should produce NO warnings
# If you see:
# WARNING: ThreadSanitizer: data race
# Read of size 8 at 0x... by thread T1:
# Previous write of size 8 at 0x... by thread T2:
# That indicates a bug in your implementation
Common TSan findings in lock-free code:
- Missing memory ordering (using relaxed when acquire/release needed)
- Accessing non-atomic member through atomic pointer
- Not protecting reads with hazard pointers
AddressSanitizer Testing
AddressSanitizer detects memory errors - use-after-free, buffer overflows, etc.
$ ./lockfree_stack_asan
# Should produce NO warnings
# If you see:
# ==12345==ERROR: AddressSanitizer: heap-use-after-free
# That means your ABA protection is broken
The ABA test should crash WITHOUT hazard pointers:
void aba_test_naive() {
NaiveLockFreeStack<int> stack;
// Orchestrate an ABA scenario
// This SHOULD trigger ASan errors
}
Unit Tests
TEST_CASE("Basic operations") {
LockFreeStack<int> stack;
SECTION("Push and pop single item") {
stack.push(42);
auto result = stack.pop();
REQUIRE(result.has_value());
REQUIRE(*result == 42);
}
SECTION("Pop from empty returns nullopt") {
auto result = stack.pop();
REQUIRE_FALSE(result.has_value());
}
SECTION("LIFO order maintained") {
stack.push(1);
stack.push(2);
stack.push(3);
REQUIRE(*stack.pop() == 3);
REQUIRE(*stack.pop() == 2);
REQUIRE(*stack.pop() == 1);
}
}
TEST_CASE("Concurrent operations") {
// See stress test above
}
Common Pitfalls
Pitfall 1: Wrong Memory Ordering
Problem: Using memory_order_relaxed everywhere for “performance”.
// WRONG: No synchronization between push and pop
void push(T value) {
new_node->next = head.load(memory_order_relaxed);
head.compare_exchange_weak(..., memory_order_relaxed, memory_order_relaxed);
}
// RIGHT: Release on push, acquire on pop
void push(T value) {
new_node->next = head.load(memory_order_relaxed);
head.compare_exchange_weak(..., memory_order_release, memory_order_relaxed);
}
auto pop() {
Node* old_head = head.load(memory_order_acquire); // See push's release
// ...
}
Why it matters: Without acquire/release, the compiler and CPU can reorder operations. A pop might see head but not see the data that was stored in the node!
Pitfall 2: Not Re-checking After Hazard Pointer Publication
Problem: Publishing hazard pointer but not verifying source is still valid.
// WRONG: Race between load and store
T* ptr = source.load();
hazard_pointer = ptr;
// Gap here! source might have changed and ptr might be freed!
Node* next = ptr->next; // Use-after-free possible!
// RIGHT: Double-check pattern
T* ptr;
do {
ptr = source.load();
hazard_pointer = ptr;
} while (ptr != source.load());
// Now ptr is guaranteed to be protected
Pitfall 3: Retiring Before Clearing Hazard Pointer
Problem: If you retire a node while your own hazard pointer still protects it, no scan can free it - including the scan that retire() itself may trigger, which finds your own hazard and must keep the node. Reclamation is needlessly delayed until the slot is cleared; if it never is, the node is effectively leaked.
// WRONG: Still hazard-protected when retired
T value = old_head->data; // (also copies instead of moving)
hp_domain.retire(old_head); // Scan finds our own hazard and keeps the node
hp_domain.clear(slot);
// RIGHT: Clear hazard first
hp_domain.clear(slot);
T value = std::move(old_head->data);
hp_domain.retire(old_head);
// ALSO RIGHT: Clear before extracting data (but be careful)
T value = std::move(old_head->data); // Safe: we won CAS, we own it
hp_domain.clear(slot);
hp_domain.retire(old_head);
Pitfall 4: Thread-Local Slot Leakage
Problem: Not releasing hazard pointer slot when thread exits.
// WRONG: Slot never released
static thread_local int slot = hp_domain.acquire_slot();
// RIGHT: Use RAII or explicit cleanup
class ThreadGuard {
HazardPointerDomain& domain;
int slot;
public:
ThreadGuard(HazardPointerDomain& d) : domain(d), slot(d.acquire_slot()) {}
~ThreadGuard() { domain.release_slot(slot); }
int get_slot() const { return slot; }
};
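A hypothetical way to wire the guard in so every thread acquires (and releases) exactly one slot:
// One guard per thread; its destructor runs at thread exit and returns
// the slot. Assumes a single process-wide domain is passed on every call.
int current_slot(HazardPointerDomain& global_domain) {
    thread_local ThreadGuard guard(global_domain);
    return guard.get_slot();
}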
Pitfall 5: Livelock Under High Contention
Problem: CAS fails repeatedly, threads make no progress.
// PROBLEMATIC: Tight spin loop
while (!head.compare_exchange_weak(old_head, new_node)) {
// Immediate retry - hammers cache line
}
// BETTER: Exponential backoff
int backoff = 1;
while (!head.compare_exchange_weak(old_head, new_node)) {
    for (int i = 0; i < backoff; ++i) {
        _mm_pause(); // x86 spin hint (<immintrin.h>); use std::this_thread::yield() on other targets
    }
    backoff = std::min(backoff * 2, 1024);
}
Extensions and Challenges
Extension 1: Epoch-Based Reclamation
Implement epoch-based reclamation as an alternative to hazard pointers:
class EpochDomain {
std::atomic<uint64_t> global_epoch{0};
struct ThreadData {
std::atomic<uint64_t> epoch;
std::atomic<bool> active;
std::array<std::vector<void*>, 3> retired; // One per epoch (mod 3)
};
// ...
};
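An RAII guard pairs naturally with the enter/exit protocol from earlier (a sketch; it assumes EpochDomain exposes enter_critical() and exit_critical() as in the pseudocode above):
// Marks the current thread active for the duration of one operation.
class EpochGuard {
    EpochDomain& domain_;
public:
    explicit EpochGuard(EpochDomain& d) : domain_(d) { domain_.enter_critical(); }
    ~EpochGuard() { domain_.exit_critical(); }
    EpochGuard(const EpochGuard&) = delete;
    EpochGuard& operator=(const EpochGuard&) = delete;
};
// Usage inside pop(): EpochGuard guard(epoch_domain); ...CAS loop...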
Advantages over hazard pointers:
- Lower overhead per operation (no hazard pointer store)
- Better for read-heavy workloads
- Simpler reasoning about protection
Disadvantages:
- Memory can be held longer (until all threads advance)
- Requires cooperation from all threads
- One stalled thread can prevent all reclamation
Extension 2: Wait-Free Pop
Make pop wait-free (guaranteed to complete in bounded steps):
// Use helping: if you fail CAS, help the other thread complete
// This is significantly more complex
Extension 3: Lock-Free Queue
Build a lock-free FIFO queue using the same techniques:
template<typename T>
class LockFreeQueue {
std::atomic<Node*> head;
std::atomic<Node*> tail;
// Two atomic pointers = more complex synchronization
void enqueue(T value);
std::optional<T> dequeue();
};
Extension 4: Performance Optimization
Optimize for your hardware:
- Cache line padding: Prevent false sharing (see the sketch after this list)
- Prefetching: Hint to CPU about upcoming accesses
- NUMA awareness: Keep data on local memory node
- Specialized allocators: Pool nodes, reduce allocation overhead
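For the padding item, a minimal sketch (alignof 64 is a common fallback; C++17’s std::hardware_destructive_interference_size is the portable constant, but not every standard library defines it):
#include <atomic>
struct Node; // as defined earlier
// Give head its own cache line so contended CAS traffic on the stack
// doesn't false-share with whatever the allocator places next to it.
struct alignas(64) PaddedHead {
    std::atomic<Node*> ptr{nullptr};
};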
Challenge: Formal Verification
Use a model checker to verify your implementation:
- CDSChecker: Specifically designed for C++11 memory model
- Relacy: Race detector for lock-free algorithms
- TLA+/PlusCal: Model the algorithm abstractly
Resources
Essential Reading
| Topic | Book/Resource | Chapter/Section |
|---|---|---|
| Lock-free stack fundamentals | “C++ Concurrency in Action” by Anthony Williams | Chapter 7.2 |
| ABA problem and solutions | “C++ Concurrency in Action” by Anthony Williams | Chapter 7.2.3 |
| Hazard pointers | “C++ Concurrency in Action” by Anthony Williams | Chapter 7.2.4 |
| Memory ordering | “C++ Concurrency in Action” by Anthony Williams | Chapter 5 |
| Lock-free theory | “The Art of Multiprocessor Programming” by Herlihy & Shavit | Chapters 10-11 |
Online Resources
- ModernesCpp Lock-Free Series: Practical C++ lock-free implementation guides
- Preshing on Programming: Excellent articles on memory ordering and lock-free programming
- 1024cores: Deep dive into lock-free algorithms and data structures
- Herb Sutter’s “atomic<> Weapons” talks: Video series on C++ atomics
Papers
- “Safe Memory Reclamation for Dynamic Lock-Free Objects Using Atomic Reads and Writes” by Michael (2002) - Original hazard pointers paper
- “Practical Lock-Freedom” by Fraser (2004) - Epoch-based reclamation
Reference Implementations
- Folly: Facebook’s lock-free data structures
- Boost.Lockfree: Production-ready lock-free containers
- libcds: Concurrent Data Structures library
Self-Assessment Checklist
Before considering this project complete, verify you can:
Understanding
- Explain what “lock-free” means and how it differs from “wait-free”
- Draw the ABA problem scenario step-by-step
- Explain why CAS alone doesn’t prevent ABA
- Describe hazard pointers and when nodes can be safely freed
- Explain why memory ordering matters for correctness
- Identify when lock-free is better than mutex (and when it’s worse)
Implementation
- Implement lock-free push with correct CAS usage
- Implement lock-free pop with hazard pointer protection
- Implement the hazard pointer domain (publish, retire, scan)
- Handle thread-local slot acquisition and release correctly
Testing
- All tests pass with ThreadSanitizer (no data races)
- All tests pass with AddressSanitizer (no memory errors)
- Stress test with 8+ threads shows correct behavior
- Performance is measurably better than mutex under contention
Analysis
- Measured and compared throughput vs mutex-based stack
- Measured memory reclamation overhead
- Identified peak memory usage during stress test
Submission/Completion Criteria
Your lock-free stack is complete when:
- Correctness: Passes all stress tests with sanitizers clean
- ABA Safety: Demonstrably handles the ABA scenario
- Memory Safety: No leaks (all retired nodes eventually freed)
- Performance: Shows throughput advantage over mutex under contention
- Documentation: Code is well-commented explaining the synchronization
Deliverables
- lockfree_stack.hpp - The lock-free stack implementation
- hazard_pointer.hpp - The hazard pointer domain
- test_lockfree_stack.cpp - Unit and stress tests
- benchmark.cpp - Performance comparison vs mutex
- README.md - Build instructions and results
Interview Preparation
After completing this project, you should be able to:
- Explain lock-free guarantees vs mutex vs wait-free
- Whiteboard the ABA problem with a concrete example
- Describe two solutions to safe memory reclamation
- Analyze memory orderings in concurrent code
- Discuss trade-offs of lock-free vs lock-based approaches
Next Project: P06 - Lock-Free MPMC Queue - The “final boss” of lock-free data structures