Project 5: Lock-Free Stack
Quick Reference
| Attribute | Details |
|---|---|
| Difficulty | Expert |
| Time Estimate | 2-4 weeks |
| Primary Language | C++ |
| Alternative Languages | C, Rust |
| Knowledge Area | Lock-Free Data Structures |
| Tools Required | C++17/20 compiler, ThreadSanitizer, AddressSanitizer |
| Primary Reference | “C++ Concurrency in Action, 2nd Edition” by Anthony Williams, Chapter 7 |
Learning Objectives
By completing this project, you will be able to:
- Implement a lock-free stack using atomic compare-and-swap (CAS) operations
- Explain the ABA problem and why naive CAS-based algorithms fail
- Implement hazard pointers for safe memory reclamation in lock-free structures
- Implement epoch-based reclamation as an alternative to hazard pointers
- Choose correct memory orderings (acquire/release/seq_cst) for each atomic operation
- Stress test concurrent code using sanitizers and multi-threaded workloads
- Reason about all possible interleavings of concurrent operations
Theoretical Foundation
The Core Question You’re Answering
“How do you build a concurrent data structure that multiple threads can modify simultaneously without ever blocking, and how do you safely free memory when any thread might be accessing it?”
This is the pinnacle of concurrent programming. After this project, you’ll understand:
- Why locks are sometimes the enemy of performance
- Why “just use CAS” doesn’t work without memory reclamation
- How to think about every possible interleaving of operations
- Why lock-free programming is considered “hard mode”
Most developers use lock-based containers their entire career. After this project, you’ll understand the frontier of concurrent programming that powers high-performance systems like databases, game engines, and trading platforms.
Core Concepts
Lock-Free Guarantees
Lock-free means system-wide progress is guaranteed: as long as some thread is running, at least one operation completes in a finite number of steps. Even if some threads are suspended indefinitely, other threads will continue making progress.
| Lock-Based (Blocking) | Lock-Free (Non-Blocking) |
|---|---|
| Thread 1 holds lock | Thread 1 reading head |
| Thread 1 suspended! | Thread 1 suspended |
| Threads 2, 3, 4: BLOCKED | Threads 2, 3, 4: PROGRESS! |
| System makes NO progress | System continues working |
Why this matters:
- No deadlocks possible (no locks to deadlock on)
- No priority inversion (no lock holder to wait for)
- Better latency under contention (no sleeping/waking)
- Threads can fail without corrupting structure
The trade-off: Lock-free code is dramatically harder to write correctly. Every line requires reasoning about all possible thread interleavings.
Compare-And-Swap (CAS)
CAS is the fundamental building block of lock-free algorithms:
// Atomically: if current value == expected, set to desired, return true
// else update expected to current value, return false
bool compare_exchange_weak(T& expected, T desired);
CAS is a “conditional write” - it succeeds only if the value still equals what we last read. (As we’ll see with ABA, “still equals” is not quite the same as “hasn’t changed”.)
Thread 1: reads head = A, prepares new_node
Thread 2: reads head = A, prepares new_node2
Thread 2: CAS(A, new_node2) - SUCCESS! head = new_node2
Thread 1: CAS(A, new_node) - FAILS! head now = new_node2; expected updated to new_node2
Thread 1: retries: CAS(new_node2, new_node) - SUCCESS!
Key insight: CAS lets threads “race” to modify shared data. Losers detect they lost and retry with updated information.
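To make the retry pattern concrete, here is a minimal sketch of a CAS loop that atomically doubles a shared integer (the function name is ours, purely illustrative):
#include <atomic>
// Minimal CAS retry loop. On failure, compare_exchange_weak writes the
// freshly observed value back into `expected`, so every retry recomputes
// the desired value from up-to-date information.
void atomic_double(std::atomic<int>& value) {
    int expected = value.load(std::memory_order_relaxed);
    while (!value.compare_exchange_weak(expected, expected * 2)) {
        // expected was updated to the current value; the loop recomputes
    }
}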
The ABA Problem
The ABA problem is the critical flaw in naive lock-free algorithms. It occurs when:
- Thread 1 reads value A
- Thread 1 is preempted
- Thread 2 changes A to B
- Thread 2 changes B back to A (or allocates new memory at A’s address)
- Thread 1 resumes, sees A, CAS succeeds
- But the structure has fundamentally changed!
Concrete stack example:
Initial State:
head -> [A, data1] -> [B, data2] -> [C, data3] -> null
Thread 1: Pop operation
1. reads head = A
2. reads A->next = B
3. will do CAS(head, A, B)
4. ** PREEMPTED **
Thread 2: (while Thread 1 sleeps)
5. pops A (frees A's memory)
6. pops B (frees B's memory)
7. pushes Y
8. pushes new node X (allocator reuses A's memory address!)
Stack now:
head -> [X at A's old address, dataX] -> [Y, dataY] -> [C, data3] -> null
Note that X is pushed last, so head once again holds A’s address - that is exactly what lets the stale CAS succeed.
Thread 1: Resumes
9. does CAS(head, A, B)
10. CAS SUCCEEDS! (head still "equals" A... the address)
11. head now points to B
DISASTER:
- B was freed! Use-after-free!
- Lost nodes X and Y entirely!
- Stack is corrupted
Timeline of the ABA problem:
| Thread 1 | Thread 2 | Stack State |
|---|---|---|
| read head=A | | head->[A]->[B]->[C] |
| read A->next=B | | |
| (preempted) | | |
| | pop A, free A | head->[B]->[C] |
| | pop B, free B | head->[C] |
| | push Y | head->[Y]->[C] |
| | push X (reuses A's addr) | head->[A']->[Y]->[C] |
| CAS(A, B) succeeds! | | head->[B] <- FREED! |
| CRASH or corruption | | |
Why CAS doesn’t catch this: CAS only checks if the pointer value (the address) is the same. It doesn’t know that the memory was freed and reused. The address A is the same, but it now points to completely different data!
Memory Reclamation Solutions
The core problem: When is it safe to free a node in a lock-free structure?
You can’t just delete it after a CAS succeeds - another thread might be holding a pointer to it!
Solution 1: Hazard Pointers
Concept: Each thread publishes which pointers it’s currently dereferencing. Before freeing a node, check if any thread has it as a hazard pointer.
Global Hazard Pointer Array (one slot per thread):
[Thread 0: nullptr ]
[Thread 1: 0x1234 ] <- Thread 1 is accessing node at 0x1234
[Thread 2: nullptr ]
[Thread 3: 0x5678 ] <- Thread 3 is accessing node at 0x5678
Before dereferencing:
1. Publish pointer to your hazard slot
2. Verify pointer is still valid (re-read from structure)
3. If not valid, clear slot and retry
Before freeing:
1. Add node to "retired list" instead of freeing
2. Periodically scan retired list
3. For each retired node, check all hazard slots
4. If no thread has it hazarded, safe to free
Hazard Pointer Protocol:
T* acquire_safe_pointer(std::atomic<T*>& source) {
    T* ptr;
    do {
        ptr = source.load();
        hazard_pointer = ptr; // Publish our intent
        // A full (seq_cst) fence is required here: the publishing store
        // must not be reordered after the validating re-read below
    } while (ptr != source.load()); // Verify it's still valid
    return ptr;
}
void retire(T* ptr) {
retired_list.push(ptr);
if (retired_list.size() > threshold) {
scan_and_free();
}
}
void scan_and_free() {
    // Collect all active hazard pointers
    std::unordered_set<T*> hazards;
    for (int i = 0; i < num_threads; i++) {
        T* hp = hazard_pointers[i].load();
        if (hp) hazards.insert(hp);
    }
    // Free nodes not in the hazard set; keep the rest for a later scan
    std::vector<T*> still_hazarded;
    for (T* node : retired_list) {
        if (hazards.find(node) == hazards.end()) {
            delete node;
        } else {
            still_hazarded.push_back(node); // Still hazarded, retry later
        }
    }
    retired_list = std::move(still_hazarded); // Freed nodes must leave the list
}
Solution 2: Epoch-Based Reclamation
Concept: Divide time into “epochs”. A node retired in epoch N can only be freed once all threads have moved past epoch N.
Global epoch: 2
Thread 0: epoch=2, active=true <- Currently accessing structures
Thread 1: epoch=2, active=false <- Quiescent, not accessing anything
Thread 2: epoch=1, active=true <- Still in old epoch!
Thread 3: epoch=2, active=true
Retired lists:
Epoch 0: [node1, node2] <- Can free! All threads past epoch 0
Epoch 1: [node3] <- Cannot free! Thread 2 still in epoch 1
Epoch 2: [node4, node5] <- Cannot free! Current epoch
Epoch Protocol:
void enter_critical() {
thread_epoch = global_epoch.load();
thread_active = true;
}
void exit_critical() {
thread_active = false;
}
void retire(T* ptr) {
    retired[global_epoch % 3].push(ptr);
    try_advance_epoch();
}
void try_advance_epoch() {
    // Check whether every active thread has observed the current epoch
    for (int i = 0; i < num_threads; i++) {
        if (thread_active[i] && thread_epoch[i] != global_epoch) {
            return; // Some thread is behind
        }
    }
    // All active threads caught up, advance epoch
    uint64_t old_epoch = global_epoch++;
    // Nodes retired two epochs ago can no longer be referenced; with
    // three buckets, (old_epoch - 2) and (old_epoch + 1) index the same slot
    for (T* node : retired[(old_epoch + 1) % 3]) {
        delete node;
    }
    retired[(old_epoch + 1) % 3].clear();
}
Solution 3: Reference Counting
Concept: Each node tracks how many threads are accessing it. Only free when count reaches zero.
struct Node {
T data;
std::atomic<Node*> next;
std::atomic<int> ref_count; // How many threads have a pointer to this
};
Problem: Incrementing ref_count and reading the pointer must be atomic together, which is hard without double-width CAS (not available on all platforms).
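A sketch of the packed counter-plus-pointer layout this refers to (the approach behind split reference counting); counted_node_ptr is our illustrative name, and whether std::atomic of this type is actually lock-free depends on the platform providing a double-width CAS:
#include <atomic>
struct node; // forward declaration, for illustration only
// Counter and pointer packed together so a single double-width CAS
// updates both at once.
struct counted_node_ptr {
    int external_count; // bumped before each dereference
    node* ptr;
};
std::atomic<counted_node_ptr> head;
// head.is_lock_free() reports whether the platform really provides a
// double-width CAS (e.g. cmpxchg16b on x86-64) or falls back to a lock.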
Why This Matters
Lock-free stacks are used in:
- Memory allocators (jemalloc, tcmalloc) - Per-thread free lists
- Work-stealing schedulers (Intel TBB, Tokio) - Task queues
- Game engines - Lock-free command buffers
- Databases - Concurrent index structures
- High-frequency trading - Ultra-low-latency queues
Understanding lock-free programming permanently changes how you think about concurrent code. You’ll never look at a mutex the same way again.
Historical Context
- 1986: Treiber’s stack, one of the first practical lock-free algorithms (IBM technical report)
- 1990: Herlihy and Wing formalize linearizability
- 2002: Michael’s hazard pointers paper
- 2004: Fraser’s epoch-based reclamation
- 2011: C++11 adds memory model and atomics (making portable lock-free code possible)
- 2015: Folly and other libraries provide production lock-free structures
Common Misconceptions
Misconception 1: “Lock-free is always faster than locks”
Wrong! Lock-free has higher constant overhead per operation. For low-contention cases, a simple mutex is often faster. Lock-free shines under high contention and when you need latency guarantees.
Misconception 2: “I can use std::atomic and be done”
Atomic operations are necessary but not sufficient. You still need to handle memory reclamation, prevent ABA, and reason about all interleavings.
Misconception 3: “CAS failures are rare”
Under high contention, CAS can fail repeatedly. You need exponential backoff or other strategies to avoid livelock.
Misconception 4: “Sequential consistency is the only safe choice”
Relaxed orderings are often sufficient and faster. But choosing correctly requires deep understanding of the memory model.
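For example, a pure statistics counter can safely use relaxed ordering, because nothing else is published through it - a minimal sketch:
#include <atomic>
#include <cstdint>
std::atomic<std::uint64_t> pop_count{0};
// Relaxed is sufficient here: the counter publishes no other data, so
// we only need the increment itself to be atomic, not any ordering.
void record_pop() {
    pop_count.fetch_add(1, std::memory_order_relaxed);
}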
Project Specification
What You’ll Build
A lock-free stack (LIFO) that:
- Supports concurrent push and pop operations from multiple threads
- Uses a CAS-based algorithm with proper memory ordering
- Handles the ABA problem via hazard pointers or epoch-based reclamation
- Passes stress tests with ThreadSanitizer and AddressSanitizer
- Demonstrates performance advantage under high contention
Interface Specification
template<typename T>
class LockFreeStack {
public:
LockFreeStack();
~LockFreeStack();
// Push a value onto the stack (thread-safe)
void push(T value);
// Pop a value from the stack (thread-safe)
// Returns std::nullopt if stack is empty
std::optional<T> pop();
// Check if stack is empty (approximate - may be stale immediately)
bool empty() const;
// For debugging/testing
size_t size() const; // Approximate count
};
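A quick usage sketch against this interface (assuming the default-constructible version from the specification above):
#include <cassert>
#include <optional>
void interface_demo() {
    LockFreeStack<int> stack; // as specified above
    stack.push(42);
    std::optional<int> v = stack.pop();
    assert(v && *v == 42);
    assert(!stack.pop().has_value()); // empty stack -> nullopt
}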
Expected Output
$ ./lockfree_stack_test
=== Lock-Free Stack Test Suite ===
Basic Operations Test:
Push 1000 items sequentially: PASS (1.2ms)
Pop 1000 items (LIFO order verified): PASS (0.8ms)
Pop from empty stack returns nullopt: PASS
Concurrent Stress Test (8 threads, 100K ops each):
Configuration:
Threads: 8
Operations per thread: 100,000
Operation mix: 50% push, 50% pop
Running stress test...
Completed in 245ms
Results:
Total pushes attempted: 400,000
Total pops attempted: 400,000
Successful pops: 400,000
Empty pops (nullopt): 0
Final stack size: 0
Correctness checks:
No lost items: PASS
No duplicate pops: PASS
LIFO order maintained: PASS
Sanitizer reports:
ThreadSanitizer: No races detected
AddressSanitizer: No memory errors
ABA Problem Simulation Test:
Creating ABA-prone scenario...
Thread 1: Reading head, about to pop
Thread 2: Pop A, Pop B, Push new node at A's address
Thread 1: Attempting CAS...
With naive implementation: WOULD CRASH (use-after-free)
With hazard pointers: PASS - ABA prevented
Hazard pointer protected the following accesses:
Node 0x7f8b12340000: 3 hazard references during pop
Node 0x7f8b12340040: 2 hazard references during pop
Memory Reclamation Test:
Push 1,000,000 items
Pop 1,000,000 items
Memory allocation stats:
Nodes allocated: 1,000,000
Nodes reclaimed: 1,000,000
Memory leaked: 0 bytes
Peak memory usage: 48.2 MB
Final memory usage: 0.1 MB (overhead only)
Reclamation timing:
Hazard pointer scans: 847
Average nodes per scan: 1,180
Scan overhead: 2.3% of total time
Performance Comparison:
Test: 8 threads, 1M operations each, 50/50 push/pop
Lock-free stack: 2.1M ops/sec (478ns avg latency)
Mutex-protected stack: 0.8M ops/sec (1250ns avg latency)
std::stack (1 thread): 15.0M ops/sec (67ns avg latency)
Lock-free advantage:
2.6x throughput vs mutex under contention
Consistent latency (no mutex wait spikes)
P99 latency: 890ns (vs 4200ns for mutex)
=== All Tests Passed ===
Solution Architecture
Component Diagram
LockFreeStack<T>
  +-- Node storage: std::atomic<Node*> head linking heap-allocated nodes
  +-- Hazard pointer array: one std::atomic<void*> slot per thread
  +-- Retired list: per-thread deferred deletes, freed once no slot holds them
The ABA Problem - Visual Walkthrough
STEP 1: Initial State
========================
Thread 1: about to pop
Stack: head --> [A|next:B] --> [B|next:C] --> [C|next:null]
Thread 1 reads:
- old_head = A
- next = B
STEP 2: Thread 1 Preempted
========================
Thread 1: (sleeping, holds old_head=A, next=B)
Stack: head --> [A|next:B] --> [B|next:C] --> [C|next:null]
STEP 3: Thread 2 Pops A
========================
Thread 2: CAS(head, A, B) succeeds
Stack: head --> [B|next:C] --> [C|next:null]
[A] is now freed!
STEP 4: Thread 2 Pops B
========================
Thread 2: CAS(head, B, C) succeeds
Stack: head --> [C|next:null]
[B] is now freed!
STEP 5: Thread 2 Pushes X (Memory Reuse!)
========================
Thread 2: new Node X allocated at SAME ADDRESS as A was!
Stack: head --> [X|next:C] --> [C|next:null]
^
|
Same address as A!
STEP 6: Thread 1 Resumes - DISASTER!
========================
Thread 1: still holds old_head=A (now X), next=B (FREED!)
Thread 1 does: CAS(head, A, B)
CAS(head, X, B) <-- X happens to equal A's address!
CAS SUCCEEDS because head still "equals" A (same address)!
Stack: head --> [B|...] --> ???
CORRUPTION:
- head now points to freed memory B!
- Node X is lost!
- Undefined behavior ensues
Hazard Pointer Solution - Visual
STEP 1: Thread 1 Begins Pop (WITH HAZARD POINTER)
========================
Thread 1:
1. Load head -> A
2. PUBLISH hazard_pointer[thread_1] = A <-- PROTECTION!
3. Verify head still == A (double-check)
4. Read A->next = B
5. Attempt CAS
Hazard Array: [T0: null] [T1: A] [T2: null] [T3: null]
STEP 2: Thread 1 Preempted
========================
Hazard Array: [T0: null] [T1: A] [T2: null] [T3: null]
^^^^^
A is protected!
STEP 3: Thread 2 Pops A
========================
Thread 2: CAS(head, A, B) succeeds
But Thread 2 does NOT free A immediately!
Instead: retired_list.push(A)
STEP 4: Thread 2 Tries to Reclaim
========================
Thread 2: scan_and_free()
- Collects hazard pointers: {A} <-- A is hazarded!
- Checks retired_list: [A]
- A is in hazard set: SKIP, don't free
Result: A stays allocated, Thread 1 is safe!
STEP 5: Thread 1 Resumes
========================
Thread 1: CAS(head, A, B)
- head is now B (Thread 2 changed it)
- CAS FAILS (expected A, got B)
- Thread 1 retries with correct value
No corruption! Hazard pointers saved the day.
STEP 6: Thread 1 Finishes and Clears Hazard
========================
Thread 1: hazard_pointer[thread_1] = nullptr
Next scan_and_free() can reclaim A.
Key Components
Node Structure
template<typename T>
struct Node {
T data;
std::atomic<Node*> next;
explicit Node(T value) : data(std::move(value)), next(nullptr) {}
};
Hazard Pointer System
class HazardPointerDomain {
static constexpr size_t MAX_THREADS = 64;
std::array<std::atomic<void*>, MAX_THREADS> hazard_pointers;
// Per-thread retired list: pointer plus its type-erased deleter
// (a thread_local data member must also be static)
static thread_local std::vector<std::pair<void*, std::function<void(void*)>>> retired_list;
public:
// Acquire a hazard pointer slot for this thread
void* protect(std::atomic<void*>& source);
// Release hazard pointer
void release(int slot);
// Defer deletion until safe
void retire(void* ptr, std::function<void(void*)> del);
// Try to reclaim retired nodes
void scan();
};
Data Flow
Push Operation:
1. Allocate new node
2. Load current head
3. Set new_node->next = head
4. CAS(head, old_head, new_node)
5. If failed, goto 2 (retry)
Pop Operation:
1. Load head (protected by hazard pointer)
2. If null, return empty
3. Load head->next
4. CAS(head, old_head, next)
5. If failed, clear hazard pointer, goto 1
6. If success, retire old_head, return data
Implementation Guide
Phase 1: Basic Lock-Free Stack (Days 1-3)
Goal: Implement push/pop with CAS, ignoring memory reclamation.
Note: This version has the ABA problem and will crash under stress. That’s intentional - you need to see the problem before solving it.
template<typename T>
class NaiveLockFreeStack {
struct Node {
T data;
Node* next;
Node(T val) : data(std::move(val)), next(nullptr) {}
};
std::atomic<Node*> head{nullptr};
public:
void push(T value) {
Node* new_node = new Node(std::move(value));
new_node->next = head.load(std::memory_order_relaxed);
while (!head.compare_exchange_weak(
new_node->next, // Expected (updated on failure)
new_node, // Desired
std::memory_order_release,// Success ordering
std::memory_order_relaxed // Failure ordering
)) {
// CAS failed, new_node->next now has current head
// Loop will retry
}
}
std::optional<T> pop() {
Node* old_head = head.load(std::memory_order_acquire);
while (old_head != nullptr) {
Node* next = old_head->next; // DANGER: old_head might be freed!
if (head.compare_exchange_weak(
old_head, // Expected
next, // Desired
std::memory_order_release,
std::memory_order_relaxed
)) {
T value = std::move(old_head->data);
delete old_head; // DANGER: Another thread might be reading this!
return value;
}
// CAS failed, old_head updated to current head
}
return std::nullopt; // Stack was empty
}
};
Why this breaks: After loading old_head (and before our CAS completes), another thread can:
- Pop old_head (free it)
- Pop next (free it)
- Push new nodes that reuse old_head’s address
- Now our CAS succeeds but points to freed memory
Test this phase: Run with AddressSanitizer. It should detect use-after-free under stress.
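A minimal harness for that experiment (our own sketch; ASan is likely, though not guaranteed, to catch the use-after-free on any single run):
#include <thread>
#include <vector>
void naive_stack_crash_test() {
    NaiveLockFreeStack<int> stack; // the Phase 1 class above
    std::vector<std::thread> threads;
    for (int t = 0; t < 8; ++t) {
        threads.emplace_back([&stack, t] {
            for (int i = 0; i < 100000; ++i) {
                stack.push(t * 100000 + i); // mixed push/pop maximizes
                stack.pop();                // pointer reuse and contention
            }
        });
    }
    for (auto& th : threads) th.join();
}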
Phase 2: Hazard Pointer Infrastructure (Days 4-7)
Goal: Build the hazard pointer system.
class HazardPointerDomain {
public:
static constexpr size_t MAX_THREADS = 128;
static constexpr size_t SCAN_THRESHOLD = 100;
private:
struct HPRecord {
std::atomic<std::thread::id> owner;
std::atomic<void*> pointer;
};
std::array<HPRecord, MAX_THREADS> hp_records;
// Thread-local storage
struct ThreadData {
int slot = -1;
std::vector<std::pair<void*, std::function<void(void*)>>> retired;
};
static inline thread_local ThreadData tls; // inline (C++17): no out-of-class definition needed
public:
// Acquire a slot for this thread
int acquire_slot() {
auto tid = std::this_thread::get_id();
for (size_t i = 0; i < MAX_THREADS; ++i) {
std::thread::id empty;
if (hp_records[i].owner.compare_exchange_strong(
empty, tid, std::memory_order_acquire)) {
return static_cast<int>(i);
}
}
throw std::runtime_error("No hazard pointer slots available");
}
void release_slot(int slot) {
hp_records[slot].pointer.store(nullptr, std::memory_order_release);
hp_records[slot].owner.store(std::thread::id{}, std::memory_order_release);
}
// Protect a pointer (publish to hazard array)
template<typename T>
T* protect(int slot, const std::atomic<T*>& source) {
    T* ptr;
    do {
        ptr = source.load(std::memory_order_acquire);
        // seq_cst, not release: the publishing store must be globally
        // visible before the validating re-read (a store-then-load
        // pattern that release/acquire alone does not order)
        hp_records[slot].pointer.store(ptr, std::memory_order_seq_cst);
        // Must re-check: source might have changed between load and store
    } while (ptr != source.load(std::memory_order_seq_cst));
    return ptr;
}
void clear(int slot) {
hp_records[slot].pointer.store(nullptr, std::memory_order_release);
}
// Defer deletion
template<typename T>
void retire(T* ptr) {
tls.retired.emplace_back(ptr, [](void* p) { delete static_cast<T*>(p); });
if (tls.retired.size() >= SCAN_THRESHOLD) {
scan();
}
}
void scan() {
// Collect all active hazard pointers
std::unordered_set<void*> hazards;
for (size_t i = 0; i < MAX_THREADS; ++i) {
void* hp = hp_records[i].pointer.load(std::memory_order_acquire);
if (hp != nullptr) {
hazards.insert(hp);
}
}
// Partition: keep hazarded, delete others
auto it = std::remove_if(tls.retired.begin(), tls.retired.end(),
[&hazards](const auto& entry) {
if (hazards.find(entry.first) == hazards.end()) {
entry.second(entry.first); // Delete
return true; // Remove from list
}
return false; // Keep in list
});
tls.retired.erase(it, tls.retired.end());
}
};
Phase 3: Integrate Hazard Pointers with Stack (Days 8-10)
Goal: Make the stack ABA-safe.
template<typename T>
class LockFreeStack {
struct Node {
T data;
std::atomic<Node*> next;
Node(T val) : data(std::move(val)), next(nullptr) {}
};
std::atomic<Node*> head{nullptr};
HazardPointerDomain& hp_domain;
// Thread-local slot management
static inline thread_local int hp_slot = -1;
static inline thread_local bool slot_initialized = false;
int get_slot() {
if (!slot_initialized) {
hp_slot = hp_domain.acquire_slot();
slot_initialized = true;
}
return hp_slot;
}
public:
explicit LockFreeStack(HazardPointerDomain& domain) : hp_domain(domain) {}
void push(T value) {
    Node* new_node = new Node(std::move(value));
    Node* expected = head.load(std::memory_order_relaxed);
    new_node->next.store(expected, std::memory_order_relaxed);
    // compare_exchange_weak needs an lvalue to write the observed value
    // into, so track the expected head in a local variable
    while (!head.compare_exchange_weak(
               expected,
               new_node,
               std::memory_order_release,
               std::memory_order_relaxed
           )) {
        // CAS failed: expected now holds the current head; re-link and retry
        new_node->next.store(expected, std::memory_order_relaxed);
    }
}
std::optional<T> pop() {
int slot = get_slot();
Node* old_head;
while (true) {
// Protect head with hazard pointer
old_head = hp_domain.protect(slot, head);
if (old_head == nullptr) {
hp_domain.clear(slot);
return std::nullopt;
}
// Now safe to read old_head->next (it's hazard-protected)
Node* next = old_head->next.load(std::memory_order_acquire);
if (head.compare_exchange_weak(
old_head,
next,
std::memory_order_release,
std::memory_order_relaxed
)) {
// Success! We own old_head now
break;
}
// CAS failed, loop will re-protect new head
}
// Clear hazard pointer before retiring
hp_domain.clear(slot);
T value = std::move(old_head->data);
// Don't delete immediately - retire for later reclamation
hp_domain.retire(old_head);
return value;
}
};
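Putting the pieces together, a hypothetical caller looks like this (names match the Phase 2/3 sketches above):
#include <iostream>
int main() {
    HazardPointerDomain domain;
    LockFreeStack<int> stack(domain);
    stack.push(1);
    stack.push(2);
    while (auto v = stack.pop()) { // drains in LIFO order: 2, then 1
        std::cout << *v << "\n";
    }
}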
Phase 4: Stress Testing (Days 11-12)
Goal: Verify correctness under heavy concurrent load.
void stress_test() {
HazardPointerDomain hp_domain;
LockFreeStack<int> stack(hp_domain);
constexpr int NUM_THREADS = 8;
constexpr int OPS_PER_THREAD = 100000;
std::atomic<int> total_pushed{0};
std::atomic<int> total_popped{0};
std::atomic<int> empty_pops{0};
auto worker = [&](int thread_id) {
std::mt19937 rng(thread_id);
std::uniform_int_distribution<int> dist(0, 1);
for (int i = 0; i < OPS_PER_THREAD; ++i) {
if (dist(rng) == 0) {
// Push
stack.push(thread_id * OPS_PER_THREAD + i);
total_pushed.fetch_add(1, std::memory_order_relaxed);
} else {
// Pop
auto result = stack.pop();
if (result) {
total_popped.fetch_add(1, std::memory_order_relaxed);
} else {
empty_pops.fetch_add(1, std::memory_order_relaxed);
}
}
}
};
std::vector<std::thread> threads;
auto start = std::chrono::high_resolution_clock::now();
for (int i = 0; i < NUM_THREADS; ++i) {
threads.emplace_back(worker, i);
}
for (auto& t : threads) {
t.join();
}
auto end = std::chrono::high_resolution_clock::now();
auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(end - start);
// Drain remaining items
int remaining = 0;
while (stack.pop()) {
remaining++;
total_popped.fetch_add(1, std::memory_order_relaxed);
}
std::cout << "Stress test completed in " << duration.count() << "ms\n";
std::cout << "Pushed: " << total_pushed << "\n";
std::cout << "Popped: " << total_popped << "\n";
std::cout << "Empty pops: " << empty_pops << "\n";
std::cout << "Remaining: " << remaining << "\n";
assert(total_pushed == total_popped);
std::cout << "PASS: All items accounted for\n";
}
Phase 5: Performance Benchmarking (Days 13-14)
Goal: Compare against mutex-based implementation.
template<typename T>
class MutexStack {
std::stack<T> stack;
mutable std::mutex mtx;
public:
void push(T value) {
std::lock_guard<std::mutex> lock(mtx);
stack.push(std::move(value));
}
std::optional<T> pop() {
std::lock_guard<std::mutex> lock(mtx);
if (stack.empty()) return std::nullopt;
T value = std::move(stack.top());
stack.pop();
return value;
}
};
void benchmark() {
// Test both implementations under same conditions
// Measure throughput (ops/sec) and latency distribution
}
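One way to fill that in - a minimal throughput harness of our own that works for either stack type (latency histograms are left out for brevity):
#include <chrono>
#include <thread>
#include <vector>
template <typename Stack>
double measure_ops_per_sec(Stack& stack, int num_threads, int ops_per_thread) {
    auto start = std::chrono::steady_clock::now();
    std::vector<std::thread> threads;
    for (int t = 0; t < num_threads; ++t) {
        threads.emplace_back([&stack, ops_per_thread] {
            for (int i = 0; i < ops_per_thread; ++i) {
                stack.push(i); // one push + one pop = two operations
                stack.pop();
            }
        });
    }
    for (auto& th : threads) th.join();
    std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return (2.0 * num_threads * ops_per_thread) / elapsed.count();
}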
Testing Strategy
Compiler Flags
# Development build with sanitizers
clang++ -std=c++20 -g -O1 \
-fsanitize=thread \
-fno-omit-frame-pointer \
lockfree_stack.cpp -o lockfree_stack_tsan
clang++ -std=c++20 -g -O1 \
-fsanitize=address \
-fno-omit-frame-pointer \
lockfree_stack.cpp -o lockfree_stack_asan
# Release build for benchmarking
clang++ -std=c++20 -O3 -DNDEBUG \
-march=native \
lockfree_stack.cpp -o lockfree_stack_release
ThreadSanitizer Testing
ThreadSanitizer detects data races - concurrent accesses where at least one is a write and there’s no synchronization.
$ ./lockfree_stack_tsan
# Should produce NO warnings
# If you see:
# WARNING: ThreadSanitizer: data race
# Read of size 8 at 0x... by thread T1:
# Previous write of size 8 at 0x... by thread T2:
# That indicates a bug in your implementation
Common TSan findings in lock-free code:
- Missing memory ordering (using relaxed when acquire/release needed)
- Accessing non-atomic member through atomic pointer
- Not protecting reads with hazard pointers
AddressSanitizer Testing
AddressSanitizer detects memory errors - use-after-free, buffer overflows, etc.
$ ./lockfree_stack_asan
# Should produce NO warnings
# If you see:
# ==12345==ERROR: AddressSanitizer: heap-use-after-free
# That means your ABA protection is broken
The ABA test should crash WITHOUT hazard pointers:
void aba_test_naive() {
NaiveLockFreeStack<int> stack;
// Orchestrate an ABA scenario
// This SHOULD trigger ASan errors
}
Unit Tests
TEST_CASE("Basic operations") {
LockFreeStack<int> stack;
SECTION("Push and pop single item") {
stack.push(42);
auto result = stack.pop();
REQUIRE(result.has_value());
REQUIRE(*result == 42);
}
SECTION("Pop from empty returns nullopt") {
auto result = stack.pop();
REQUIRE_FALSE(result.has_value());
}
SECTION("LIFO order maintained") {
stack.push(1);
stack.push(2);
stack.push(3);
REQUIRE(*stack.pop() == 3);
REQUIRE(*stack.pop() == 2);
REQUIRE(*stack.pop() == 1);
}
}
TEST_CASE("Concurrent operations") {
// See stress test above
}
Common Pitfalls
Pitfall 1: Wrong Memory Ordering
Problem: Using memory_order_relaxed everywhere for “performance”.
// WRONG: No synchronization between push and pop
void push(T value) {
new_node->next = head.load(memory_order_relaxed);
head.compare_exchange_weak(..., memory_order_relaxed, memory_order_relaxed);
}
// RIGHT: Release on push, acquire on pop
void push(T value) {
new_node->next = head.load(memory_order_relaxed);
head.compare_exchange_weak(..., memory_order_release, memory_order_relaxed);
}
auto pop() {
Node* old_head = head.load(memory_order_acquire); // See push's release
// ...
}
Why it matters: Without acquire/release, the compiler and CPU can reorder operations. A pop might see head but not see the data that was stored in the node!
Pitfall 2: Not Re-checking After Hazard Pointer Publication
Problem: Publishing hazard pointer but not verifying source is still valid.
// WRONG: Race between load and store
T* ptr = source.load();
hazard_pointer = ptr;
// Gap here! source might have changed and ptr might be freed!
Node* next = ptr->next; // Use-after-free possible!
// RIGHT: Double-check pattern
T* ptr;
do {
ptr = source.load();
hazard_pointer = ptr;
} while (ptr != source.load());
// Now ptr is guaranteed to be protected
Pitfall 3: Retiring Before Clearing Hazard Pointer
Problem: If you retire a node while your own hazard pointer still protects it, no scan can free it - including the scan that retire() itself may trigger, which finds your own hazard and must keep the node. Reclamation is needlessly delayed until the slot is cleared; if it never is, the node is effectively leaked.
// WRONG: Still hazard-protected when retired
T value = old_head->data; // (also copies instead of moving)
hp_domain.retire(old_head); // Scan finds our own hazard and keeps the node
hp_domain.clear(slot);
// RIGHT: Clear hazard first
hp_domain.clear(slot);
T value = std::move(old_head->data);
hp_domain.retire(old_head);
// ALSO RIGHT: Clear before extracting data (but be careful)
T value = std::move(old_head->data); // Safe: we won CAS, we own it
hp_domain.clear(slot);
hp_domain.retire(old_head);
Pitfall 4: Thread-Local Slot Leakage
Problem: Not releasing hazard pointer slot when thread exits.
// WRONG: Slot never released
static thread_local int slot = hp_domain.acquire_slot();
// RIGHT: Use RAII or explicit cleanup
class ThreadGuard {
HazardPointerDomain& domain;
int slot;
public:
ThreadGuard(HazardPointerDomain& d) : domain(d), slot(d.acquire_slot()) {}
~ThreadGuard() { domain.release_slot(slot); }
int get_slot() const { return slot; }
};
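A hypothetical way to wire the guard in so every thread acquires (and releases) exactly one slot:
// One guard per thread; its destructor runs at thread exit and returns
// the slot. Assumes a single process-wide domain is passed on every call.
int current_slot(HazardPointerDomain& global_domain) {
    thread_local ThreadGuard guard(global_domain);
    return guard.get_slot();
}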
Pitfall 5: Livelock Under High Contention
Problem: CAS fails repeatedly, threads make no progress.
// PROBLEMATIC: Tight spin loop
while (!head.compare_exchange_weak(old_head, new_node)) {
// Immediate retry - hammers cache line
}
// BETTER: Exponential backoff
int backoff = 1;
while (!head.compare_exchange_weak(old_head, new_node)) {
    for (int i = 0; i < backoff; ++i) {
        _mm_pause(); // x86 spin hint (<immintrin.h>); use std::this_thread::yield() on other targets
    }
    backoff = std::min(backoff * 2, 1024);
}
Extensions and Challenges
Extension 1: Epoch-Based Reclamation
Implement epoch-based reclamation as an alternative to hazard pointers:
class EpochDomain {
std::atomic<uint64_t> global_epoch{0};
struct ThreadData {
std::atomic<uint64_t> epoch;
std::atomic<bool> active;
std::array<std::vector<void*>, 3> retired; // One per epoch (mod 3)
};
// ...
};
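An RAII guard pairs naturally with the enter/exit protocol from earlier (a sketch; it assumes EpochDomain exposes enter_critical() and exit_critical() as in the pseudocode above):
// Marks the current thread active for the duration of one operation.
class EpochGuard {
    EpochDomain& domain_;
public:
    explicit EpochGuard(EpochDomain& d) : domain_(d) { domain_.enter_critical(); }
    ~EpochGuard() { domain_.exit_critical(); }
    EpochGuard(const EpochGuard&) = delete;
    EpochGuard& operator=(const EpochGuard&) = delete;
};
// Usage inside pop(): EpochGuard guard(epoch_domain); ...CAS loop...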
Advantages over hazard pointers:
- Lower overhead per operation (no hazard pointer store)
- Better for read-heavy workloads
- Simpler reasoning about protection
Disadvantages:
- Memory can be held longer (until all threads advance)
- Requires cooperation from all threads
- One stalled thread can prevent all reclamation
Extension 2: Wait-Free Pop
Make pop wait-free (guaranteed to complete in bounded steps):
// Use helping: if you fail CAS, help the other thread complete
// This is significantly more complex
Extension 3: Lock-Free Queue
Build a lock-free FIFO queue using the same techniques:
template<typename T>
class LockFreeQueue {
std::atomic<Node*> head;
std::atomic<Node*> tail;
// Two atomic pointers = more complex synchronization
void enqueue(T value);
std::optional<T> dequeue();
};
Extension 4: Performance Optimization
Optimize for your hardware:
- Cache line padding: Prevent false sharing (see the sketch after this list)
- Prefetching: Hint to CPU about upcoming accesses
- NUMA awareness: Keep data on local memory node
- Specialized allocators: Pool nodes, reduce allocation overhead
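For the padding item, a minimal sketch (alignof 64 is a common fallback; C++17’s std::hardware_destructive_interference_size is the portable constant, but not every standard library defines it):
#include <atomic>
struct Node; // as defined earlier
// Give head its own cache line so contended CAS traffic on the stack
// doesn't false-share with whatever the allocator places next to it.
struct alignas(64) PaddedHead {
    std::atomic<Node*> ptr{nullptr};
};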
Challenge: Formal Verification
Use a model checker to verify your implementation:
- CDSChecker: Specifically designed for C++11 memory model
- Relacy: Race detector for lock-free algorithms
- TLA+/PlusCal: Model the algorithm abstractly
Resources
Essential Reading
| Topic | Book/Resource | Chapter/Section |
|---|---|---|
| Lock-free stack fundamentals | “C++ Concurrency in Action” by Anthony Williams | Chapter 7.2 |
| ABA problem and solutions | “C++ Concurrency in Action” by Anthony Williams | Chapter 7.2.3 |
| Hazard pointers | “C++ Concurrency in Action” by Anthony Williams | Chapter 7.2.4 |
| Memory ordering | “C++ Concurrency in Action” by Anthony Williams | Chapter 5 |
| Lock-free theory | “The Art of Multiprocessor Programming” by Herlihy & Shavit | Chapters 10-11 |
Online Resources
- ModernesCpp Lock-Free Series: Practical C++ lock-free implementation guides
- Preshing on Programming: Excellent articles on memory ordering and lock-free programming
- 1024cores: Deep dive into lock-free algorithms and data structures
- Herb Sutter’s “atomic<> Weapons” talks: Video series on C++ atomics
Papers
- “Safe Memory Reclamation for Dynamic Lock-Free Objects Using Atomic Reads and Writes” by Michael (2002) - Original hazard pointers paper
- “Practical Lock-Freedom” by Fraser (2004) - Epoch-based reclamation
Reference Implementations
- Folly: Facebook’s lock-free data structures
- Boost.Lockfree: Production-ready lock-free containers
- libcds: Concurrent Data Structures library
Self-Assessment Checklist
Before considering this project complete, verify you can:
Understanding
- Explain what “lock-free” means and how it differs from “wait-free”
- Draw the ABA problem scenario step-by-step
- Explain why CAS alone doesn’t prevent ABA
- Describe hazard pointers and when nodes can be safely freed
- Explain why memory ordering matters for correctness
- Identify when lock-free is better than mutex (and when it’s worse)
Implementation
- Implement lock-free push with correct CAS usage
- Implement lock-free pop with hazard pointer protection
- Implement the hazard pointer domain (publish, retire, scan)
- Handle thread-local slot acquisition and release correctly
Testing
- All tests pass with ThreadSanitizer (no data races)
- All tests pass with AddressSanitizer (no memory errors)
- Stress test with 8+ threads shows correct behavior
- Performance is measurably better than mutex under contention
Analysis
- Measured and compared throughput vs mutex-based stack
- Measured memory reclamation overhead
- Identified peak memory usage during stress test
Submission/Completion Criteria
Your lock-free stack is complete when:
- Correctness: Passes all stress tests with sanitizers clean
- ABA Safety: Demonstrably handles the ABA scenario
- Memory Safety: No leaks (all retired nodes eventually freed)
- Performance: Shows throughput advantage over mutex under contention
- Documentation: Code is well-commented explaining the synchronization
Deliverables
- lockfree_stack.hpp - The lock-free stack implementation
- hazard_pointer.hpp - The hazard pointer domain
- test_lockfree_stack.cpp - Unit and stress tests
- benchmark.cpp - Performance comparison vs mutex
- README.md - Build instructions and results
Interview Preparation
After completing this project, you should be able to:
- Explain lock-free guarantees vs mutex vs wait-free
- Whiteboard the ABA problem with a concrete example
- Describe two solutions to safe memory reclamation
- Analyze memory orderings in concurrent code
- Discuss trade-offs of lock-free vs lock-based approaches
Next Project: P06 - Lock-Free MPMC Queue - The “final boss” of lock-free data structures