Project 4: Spinlock and Read-Write Lock Library

Build production-quality synchronization primitives from scratch using only std::atomic, understanding why “just use a mutex” is often the right answer.


Quick Reference

Attribute              Value
Difficulty             Advanced
Time Estimate          1-2 weeks
Primary Language       C++
Alternative Languages  C, Rust, Assembly (for comparison)
Prerequisites          Projects 1-2 completed, understanding of std::atomic
Main Reference         “C++ Concurrency in Action” Chapter 5 - Anthony Williams
Knowledge Area         Atomics, Synchronization Primitives, Memory Ordering
Tools Required         C++17 compiler, perf (optional), ThreadSanitizer

Learning Objectives

By completing this project, you will be able to:

  1. Implement a spinlock using atomic exchange - Understand how test-and-set creates mutual exclusion
  2. Choose appropriate memory orderings - Know when to use acquire, release, relaxed, and seq_cst
  3. Optimize for contention - Apply test-and-test-and-set and exponential backoff techniques
  4. Design fair locks - Implement ticket locks that guarantee FIFO ordering
  5. Build reader-writer locks - Coordinate multiple readers with exclusive writers using atomic bit manipulation
  6. Benchmark correctly - Measure lock performance and understand micro-benchmark pitfalls
  7. Recognize when NOT to use spinlocks - Understand why std::mutex often outperforms hand-rolled locks

Theoretical Foundation

Why Build Your Own Locks?

You probably shouldn’t use these in production. So why build them?

Understanding atomics: You cannot truly understand atomic operations until you’ve used them to build something real. Reading about memory_order_acquire is abstract; implementing a spinlock makes it concrete.

Understanding contention: Seeing your naive spinlock perform 10x worse than std::mutex under high contention teaches more than any textbook.

Interview preparation: “Implement a spinlock” and “explain memory ordering” are common systems programming interview questions. Building this project prepares you for both.

Debugging expertise: When you hit a concurrency bug in production code, understanding lock internals helps you diagnose whether it’s a lock implementation issue or a logic error.

Core Concepts

1. What Is a Spinlock?

A spinlock is the simplest mutual exclusion primitive. Unlike a mutex, which yields to the OS scheduler when blocked, a spinlock “spins” in a tight loop checking if the lock is available:

Mutex behavior:                     Spinlock behavior:
┌─────────────────────────┐         ┌─────────────────────────┐
│ Thread A holds lock     │         │ Thread A holds lock     │
│                         │         │                         │
│ Thread B tries to lock: │         │ Thread B tries to lock: │
│   → Can't get it        │         │   → Can't get it        │
│   → Syscall to kernel   │         │   → Loop: is it free?   │
│   → Put on wait queue   │         │   → Loop: is it free?   │
│   → Context switch out  │         │   → Loop: is it free?   │
│   → (sleep)             │         │   → Loop: is it free?   │
│                         │         │   → (burning CPU)       │
│ Thread A unlocks:       │         │                         │
│   → Kernel wakes B      │         │ Thread A unlocks:       │
│   → Context switch in   │         │   → B sees it free      │
│   → B has lock          │         │   → B grabs lock        │
└─────────────────────────┘         └─────────────────────────┘

When spinlocks win: Short critical sections (< 100 cycles), multi-core systems, real-time constraints.

When mutexes win: Long critical sections, uniprocessor systems, high contention.

2. The Test-and-Set Operation

The fundamental building block is atomic exchange (test-and-set):

Before:  lock_state = 0 (unlocked)

Thread A executes: exchange(lock_state, 1)
  → Returns old value: 0
  → Sets new value: 1
  → Thread A knows it got the lock (returned 0)

Thread B executes: exchange(lock_state, 1)
  → Returns old value: 1
  → Sets new value: 1 (no change)
  → Thread B knows it didn't get the lock (returned 1)

This is atomic: no matter how threads interleave, exactly one thread sees the 0→1 transition.
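
A small illustration of the exchange() return-value semantics (a single-threaded demo, not a complete lock), assuming C++17:

#include <atomic>
#include <cstdio>

int main() {
    std::atomic<int> lock_state{0};

    int first  = lock_state.exchange(1);  // Returns 0: this call wins the 0 -> 1 transition
    int second = lock_state.exchange(1);  // Returns 1: the "lock" was already taken

    std::printf("first=%d second=%d\n", first, second);  // Prints: first=0 second=1
}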

3. Memory Ordering: The Critical Part

Atomic operations don’t just prevent torn reads/writes—they establish ordering between threads. This is where most lock implementations go wrong.

Incorrect spinlock (no memory ordering):
┌─────────────────────────────────────────────────────────────────┐
│ Thread A:                    │ Thread B:                        │
│   data = 42;                 │   while (lock == 1) spin;        │
│   lock = 1;                  │   print(data);  // might print 0!│
└─────────────────────────────────────────────────────────────────┘

Why? Without memory ordering, the compiler or CPU might reorder:
- Thread A might set lock before data
- Thread B might read data before seeing lock

Correct spinlock (with acquire/release):
┌─────────────────────────────────────────────────────────────────┐
│ Thread A:                    │ Thread B:                        │
│   data = 42;                 │   while (lock.load(acquire) == 0)│
│   lock.store(1, release);    │     spin;                        │
│   // release: all writes     │   // acquire: see all writes     │
│   // before this are visible │   // before the release          │
│                              │   print(data);  // always 42     │
└─────────────────────────────────────────────────────────────────┘

Memory orderings for locks:

  • Acquire (on lock): “I need to see all writes that happened before the release that I’m synchronizing with”
  • Release (on unlock): “All my writes must be visible before anyone sees this unlock”
  • Relaxed: “Just atomicity, no ordering” - NEVER use this alone for lock state
  • Seq_cst: “Total global order” - Correct but potentially slower than acquire/release
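
The correct diagram above, written out as a compilable sketch. It uses a plain release/acquire flag rather than a full lock (assuming C++17; the names are illustrative):

#include <atomic>
#include <cassert>
#include <thread>

int data = 0;
std::atomic<int> flag{0};

void producer() {
    data = 42;                                 // Plain write
    flag.store(1, std::memory_order_release);  // Publish: all earlier writes become visible
}

void consumer() {
    while (flag.load(std::memory_order_acquire) == 0) {
        // Spin until the producer publishes
    }
    assert(data == 42);                        // Guaranteed by the acquire/release pairing
}

int main() {
    std::thread t1(producer), t2(consumer);
    t1.join();
    t2.join();
}

The assert can never fire: the acquire load that reads 1 synchronizes with the release store, so the write to data happens-before the read.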

4. The Cache Coherence Problem

On multi-core CPUs, each core has its own cache. When Thread A writes to a cache line, all other cores must invalidate their cached copy. This is the MESI protocol:

Core 0                     Core 1                     Memory
┌─────────────┐           ┌─────────────┐           ┌─────────────┐
│ Cache       │           │ Cache       │           │             │
│ lock=0 (E)  │           │ (Invalid)   │           │ lock=0      │
└─────────────┘           └─────────────┘           └─────────────┘
      │                         │
      │ Core 0: exchange(1)     │
      ▼                         │
┌─────────────┐           ┌─────────────┐           ┌─────────────┐
│ Cache       │           │ Cache       │           │             │
│ lock=1 (M)  │◀──────────│ (snoop msg) │           │ lock=0      │
└─────────────┘ invalidate└─────────────┘           └─────────────┘
                              │
                              │ Core 1: exchange(1)
                              │  → must fetch from Core 0
                              ▼
┌─────────────┐           ┌─────────────┐           ┌─────────────┐
│ Cache       │ ─────────▶│ Cache       │           │             │
│ lock=1 (S)  │  transfer │ lock=1 (M)  │           │ lock=0      │
└─────────────┘           └─────────────┘           └─────────────┘

The problem: A naive spinlock executes exchange() on every iteration. Each exchange invalidates the cache line on all other cores, causing a “cache line ping-pong” that devastates performance.

The solution: Test-and-test-and-set (TTAS):

  1. First, read the lock with a relaxed load (doesn’t modify, no invalidation)
  2. Only if it looks free, try the exchange

5. Fairness vs Performance

Simple spinlocks are unfair: a thread might never get the lock if others keep grabbing it. Ticket locks provide FIFO fairness:

Ticket Lock State:
┌───────────────────────────────────────┐
│  now_serving: 5                       │
│  next_ticket: 8                       │
└───────────────────────────────────────┘

Waiting threads:
  Thread A: my_ticket = 6  (waiting: 6 != 5)
  Thread B: my_ticket = 7  (waiting: 7 != 5)
  Thread C: my_ticket = 5  (running: 5 == 5)

When C unlocks:
  now_serving.fetch_add(1)  → becomes 6
  Thread A: my_ticket = 6  (running: 6 == 6)
  Thread B: my_ticket = 7  (waiting: 7 != 6)

Ticket locks guarantee that threads acquire the lock in the order they requested it.

6. Read-Write Lock Concepts

Many data structures are read-mostly. A read-write lock allows multiple concurrent readers OR one exclusive writer:

RWLock States:
┌────────────────────────────────────────────────────────────┐
│                                                            │
│  Unlocked        Multiple Readers     Writer              │
│  ┌─────┐         ┌─────┐              ┌─────┐             │
│  │  0  │         │ 3R  │              │  W  │             │
│  └─────┘         └─────┘              └─────┘             │
│     │               │                    │                 │
│     │ reader_lock() │                    │ reader_lock()   │
│     ▼               ▼                    ▼                 │
│  ┌─────┐         ┌─────┐              BLOCK               │
│  │ 1R  │         │ 4R  │                                   │
│  └─────┘         └─────┘                                   │
│     │                                                      │
│     │ writer_lock()                                        │
│     ▼                                                      │
│  BLOCK (wait for readers to drain)                         │
│                                                            │
└────────────────────────────────────────────────────────────┘

Atomic state encoding (one approach):

  • Bits 0-30: Reader count (up to 2 billion concurrent readers)
  • Bit 31: Writer flag (1 = writer holds or is waiting)
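
As a quick sketch of this encoding (the constants reappear in rw_spinlock later; the helper function names here are just for illustration):

#include <cstdint>

constexpr uint32_t WRITER_BIT  = 0x80000000u;  // bit 31
constexpr uint32_t READER_MASK = 0x7FFFFFFFu;  // bits 0-30

constexpr uint32_t reader_count(uint32_t state) { return state & READER_MASK; }
constexpr bool     writer_held(uint32_t state)  { return (state & WRITER_BIT) != 0; }

static_assert(reader_count(0x80000005u) == 5);  // writer waiting, 5 readers
static_assert(writer_held(0x80000005u));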

Historical Context

1970s: Spinlocks emerged on early multiprocessor systems. The test-and-set instruction was added to hardware specifically for mutual exclusion.

1991: Mellor-Crummey and Scott published “Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors” introducing MCS locks—the foundation of modern high-performance locks.

2011: C++11 standardized the memory model, giving portable access to atomic operations. Before this, correct lock-free code required compiler-specific intrinsics and hardware knowledge.

Today: The Linux kernel adopted ticket spinlocks in 2008 and switched to queued spinlocks in 2015. Understanding these helps you read kernel code.

Common Misconceptions

“Spinlocks are always faster than mutexes” False. For any critical section longer than ~100 cycles, or any system with high contention, mutexes win. Sleeping in the kernel frees the CPU for useful work, which more than repays the cost of the occasional syscall.

“Relaxed ordering is fine for spinlocks” Catastrophically wrong. Relaxed provides atomicity but no ordering. Your protected data access might be reordered outside the critical section.

“exchange() is the only way to implement a spinlock” False. Compare-and-swap (CAS) also works, and is required for more complex structures. Exchange is just simpler for basic spinlocks.

“Read-write locks are always better than mutexes for read-heavy workloads” Depends. RWLocks have more overhead (maintaining reader count). For very short critical sections, a simple spinlock might be faster even with occasional writer blocking.


Project Specification

What You’re Building

A synchronization primitive library called sync_primitives containing:

  1. simple_spinlock: Basic test-and-set spinlock
  2. ttas_spinlock: Test-and-test-and-set with exponential backoff
  3. ticket_lock: Fair FIFO spinlock
  4. rw_spinlock: Reader-writer lock with atomic state
  5. benchmark suite: Comparative performance analysis

Functional Requirements

// All locks must support RAII guards
{
    std::lock_guard<simple_spinlock> guard(lock);
    // critical section
}

// Reader-writer lock supports shared and exclusive modes
{
    rw_spinlock::read_guard reader(rwlock);   // shared access
}
{
    rw_spinlock::write_guard writer(rwlock);  // exclusive access
}

Interface Specification

class simple_spinlock {
public:
    void lock();
    bool try_lock();
    void unlock();
};

class ttas_spinlock {
public:
    void lock();      // With exponential backoff
    bool try_lock();
    void unlock();
};

class ticket_lock {
public:
    void lock();      // Returns when it's your turn
    bool try_lock();  // Non-blocking attempt
    void unlock();
};

class rw_spinlock {
public:
    void lock_shared();      // Reader lock
    bool try_lock_shared();
    void unlock_shared();

    void lock_exclusive();   // Writer lock
    bool try_lock_exclusive();
    void unlock_exclusive();

    class read_guard;   // RAII for shared
    class write_guard;  // RAII for exclusive
};

Expected Benchmark Output

$ ./spinlock_benchmark

===============================================================================
                     Spinlock & RWLock Benchmark Suite
===============================================================================
CPU: AMD Ryzen 9 5900X (12 cores, 24 threads)
Compiler: g++ 12.1 with -O3 -march=native

-------------------------------------------------------------------------------
Test 1: Uncontended Lock/Unlock Latency (10M operations, single thread)
-------------------------------------------------------------------------------
                         Mean (cycles)    Std Dev    Min     Max
simple_spinlock                   12        2.1       9       45
ttas_spinlock                     14        2.3      10       48
ticket_lock                       21        3.4      15       62
rw_spinlock (read)                18        2.8      13       55
rw_spinlock (write)               24        3.9      17       71
std::mutex                        41        8.2      32      180

Analysis: Spinlocks are 2-3x faster than std::mutex in the uncontended case
          due to avoiding potential syscall overhead.

-------------------------------------------------------------------------------
Test 2: High Contention Stress Test (8 threads, 1M ops each)
-------------------------------------------------------------------------------
                         Total Time (ms)    Ops/sec      Fairness*
simple_spinlock                 2,340       3,418,803    0.23
ttas_spinlock                     890       8,988,764    0.31
ticket_lock                       920       8,695,652    0.98
std::mutex                        650      12,307,692    0.85

* Fairness = min_ops / max_ops across threads (1.0 = perfect fairness)

Analysis: Under contention, std::mutex WINS! The OS scheduler is smarter
          than busy-waiting. simple_spinlock shows extreme unfairness.

-------------------------------------------------------------------------------
Test 3: Reader-Writer Scalability (95% reads, 5% writes)
-------------------------------------------------------------------------------
Threads     rw_spinlock (ops/s)    std::mutex (ops/s)    Speedup
    1            8,200,000            7,800,000           1.05x
    2           15,800,000            7,100,000           2.23x
    4           30,500,000            6,800,000           4.49x
    8           58,200,000            6,200,000           9.39x
   16          102,000,000            5,900,000          17.29x

Analysis: RWLock scales linearly for read-heavy workloads. Readers don't
          block each other, achieving near-linear speedup.

-------------------------------------------------------------------------------
Test 4: Cache Line Contention Analysis
-------------------------------------------------------------------------------
simple_spinlock: 89,234,000 cache line invalidations (BAD)
ttas_spinlock:    8,912,000 cache line invalidations (GOOD)
                 ↳ 10x reduction from test-and-test-and-set pattern

-------------------------------------------------------------------------------
Test 5: Fairness Under Load (8 threads competing for 1M total acquisitions)
-------------------------------------------------------------------------------
                    Thread Distribution (%)
Lock Type       T0    T1    T2    T3    T4    T5    T6    T7
simple           42    18     8     7     6     7     6     6   ← UNFAIR
ttas             31    19    12    10     9     8     6     5   ← Better
ticket           12.5  12.5  12.5  12.5  12.5  12.5  12.5  12.5 ← FAIR
std::mutex       14    13    13    12    12    12    12    12   ← Good

===============================================================================
                              CONCLUSIONS
===============================================================================
1. For uncontended locks: Spinlocks are 2-3x faster than std::mutex
2. For contended locks: std::mutex is 1.5-3x faster (use the OS!)
3. For read-heavy workloads: RWLock can provide 10-20x scaling
4. For fairness requirements: Use ticket_lock or std::mutex
5. TTAS is 2-10x better than simple_spinlock under contention

RECOMMENDATION: Use std::mutex by default. Only use spinlocks when:
  - Critical section is extremely short (< 100 cycles)
  - You've measured and proven spinlock is faster for YOUR workload
  - You need the RWLock reader concurrency
===============================================================================

Solution Architecture

Lock State Diagrams

Simple Spinlock State

                    ┌──────────────────────────────────┐
                    │         STATE = 0 (Unlocked)     │
                    │                                  │
                    └───────────────┬──────────────────┘
                                    │
            ┌───────────────────────┼───────────────────────┐
            │                       │                       │
            ▼                       ▼                       ▼
     Thread A: lock()        Thread B: lock()        Thread C: lock()
            │                       │                       │
            │ exchange(1)           │ exchange(1)           │ exchange(1)
            │ returns 0 ✓          │ returns 1 ✗           │ returns 1 ✗
            │                       │                       │
            ▼                       ▼                       ▼
    ┌───────────────┐        ┌───────────────┐       ┌───────────────┐
    │ A holds lock  │        │ B spins       │       │ C spins       │
    │ STATE = 1     │        │ exchange(1).. │       │ exchange(1).. │
    └───────────────┘        └───────────────┘       └───────────────┘
            │
            │ unlock(): store(0, release)
            ▼
    ┌──────────────────────────────────┐
    │         STATE = 0 (Unlocked)     │
    │         Next exchange wins       │
    └──────────────────────────────────┘

Ticket Lock State

┌─────────────────────────────────────────────────────────────────────────┐
│                          TICKET LOCK STATE                               │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  Counters:   next_ticket: 6     now_serving: 3                          │
│                                                                          │
│  ┌────────────────────────────────────────────────────────────────────┐ │
│  │                    Waiting Threads                                  │ │
│  ├────────────────────────────────────────────────────────────────────┤ │
│  │                                                                     │ │
│  │   Thread A          Thread B          Thread C          Thread D   │ │
│  │   my_ticket=3       my_ticket=4       my_ticket=5       (none)     │ │
│  │   ★ RUNNING         ○ WAITING         ○ WAITING                    │ │
│  │   (3 == 3)          (4 != 3)          (5 != 3)                     │ │
│  │                                                                     │ │
│  └────────────────────────────────────────────────────────────────────┘ │
│                                                                          │
│  Thread D calls lock():                                                  │
│    1. my_ticket = next_ticket.fetch_add(1) = 6                          │
│    2. next_ticket becomes 7                                              │
│    3. D waits: while (now_serving != 6) spin;                           │
│                                                                          │
│  Thread A calls unlock():                                                │
│    1. now_serving.fetch_add(1) → becomes 4                              │
│    2. Thread B sees now_serving == 4 == my_ticket                       │
│    3. Thread B is now RUNNING                                            │
│                                                                          │
│  Fairness Guarantee: Threads run in ticket order (3, 4, 5, 6, ...)      │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Read-Write Lock State Encoding

State variable (32 bits):
┌─────────────────────────────────────────────────────────────────────────┐
│ Bit 31                        Bits 30-0                                  │
│ ┌────────┐  ┌──────────────────────────────────────────────────────────┐│
│ │ Writer │  │                    Reader Count                          ││
│ │  Flag  │  │               (0 to 2,147,483,647)                       ││
│ └────────┘  └──────────────────────────────────────────────────────────┘│
└─────────────────────────────────────────────────────────────────────────┘

State Examples:
  0x00000000  =  No writer, 0 readers      (Unlocked)
  0x00000003  =  No writer, 3 readers      (Read locked)
  0x80000000  =  Writer flag, 0 readers    (Write locked)
  0x80000005  =  Writer waiting, 5 readers (Readers draining)

Constants:
  WRITER_BIT  = 0x80000000  (bit 31)
  READER_MASK = 0x7FFFFFFF  (bits 0-30)

State Transitions:
┌─────────────────────────────────────────────────────────────────────────┐
│                                                                          │
│    ┌─────────────┐        reader_lock()         ┌─────────────┐         │
│    │             │ ──────────────────────────▶  │             │         │
│    │  UNLOCKED   │                               │  N READERS  │         │
│    │  (0)        │ ◀──────────────────────────  │  (N)        │         │
│    │             │        reader_unlock()        │             │         │
│    └──────┬──────┘                               └──────┬──────┘         │
│           │                                             │                │
│           │ writer_lock()                               │ writer_lock()  │
│           │ (set bit 31)                                │ (set bit 31,   │
│           │                                             │  wait for N=0) │
│           ▼                                             ▼                │
│    ┌─────────────┐                               ┌─────────────┐         │
│    │   WRITER    │                               │   DRAINING  │         │
│    │  (0x8000..) │                               │ (0x8000000N)│         │
│    └─────────────┘                               └─────────────┘         │
│           │                                             │                │
│           │ writer_unlock()                             │ readers exit   │
│           │ (clear bit 31)                              │ (N→0)          │
│           │                                             │                │
│           ▼                                             ▼                │
│    ┌─────────────┐                               ┌─────────────┐         │
│    │  UNLOCKED   │ ◀──────────────────────────  │   WRITER    │         │
│    │  (0)        │                               │  (0x8000..) │         │
│    └─────────────┘                               └─────────────┘         │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Key Data Structures

// Simple spinlock: single atomic bool/int
class simple_spinlock {
private:
    std::atomic<bool> locked_{false};
    // OR: std::atomic<int> locked_{0};
};

// Ticket lock: two atomic counters
class ticket_lock {
private:
    // alignas(64) puts each counter on its own cache line to prevent false sharing
    alignas(64) std::atomic<uint32_t> next_ticket_{0};
    alignas(64) std::atomic<uint32_t> now_serving_{0};
};

// RW spinlock: single atomic with bit manipulation
class rw_spinlock {
private:
    static constexpr uint32_t WRITER_BIT = 0x80000000;
    static constexpr uint32_t READER_MASK = 0x7FFFFFFF;

    std::atomic<uint32_t> state_{0};
};

Component Architecture

┌─────────────────────────────────────────────────────────────────────────┐
│                        sync_primitives Library                           │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐          │
│  │ simple_spinlock │  │  ttas_spinlock  │  │   ticket_lock   │          │
│  │                 │  │                 │  │                 │          │
│  │ • exchange loop │  │ • relaxed read  │  │ • fetch_add for │          │
│  │ • acquire/rel.  │  │ • then exchange │  │   ticket        │          │
│  │ • minimal code  │  │ • exp. backoff  │  │ • FIFO fairness │          │
│  └─────────────────┘  └─────────────────┘  └─────────────────┘          │
│                                                                          │
│  ┌─────────────────────────────────────────────────────────────────────┐│
│  │                          rw_spinlock                                 ││
│  │                                                                      ││
│  │  • Bit-packed state: writer flag + reader count                     ││
│  │  • CAS loops for state transitions                                  ││
│  │  • Writer waits for readers to drain                                ││
│  │  • Provides read_guard and write_guard RAII types                   ││
│  └─────────────────────────────────────────────────────────────────────┘│
│                                                                          │
│  ┌─────────────────────────────────────────────────────────────────────┐│
│  │                         Benchmark Suite                              ││
│  │                                                                      ││
│  │  • Uncontended latency (rdtsc cycle counting)                       ││
│  │  • Contention stress test (N threads, M operations)                 ││
│  │  • Fairness analysis (ops per thread distribution)                  ││
│  │  • Cache analysis (perf counters if available)                      ││
│  │  • Comparison with std::mutex and std::shared_mutex                 ││
│  └─────────────────────────────────────────────────────────────────────┘│
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Implementation Guide

Phase 1: Simple Spinlock (Days 1-2)

Goal: Implement and verify a basic test-and-set spinlock.

Step 1: Basic Implementation

class simple_spinlock {
private:
    std::atomic<bool> locked_{false};

public:
    void lock() {
        // exchange(true) returns previous value
        // If previous was false, we got the lock
        // If previous was true, someone else has it, keep trying
        while (locked_.exchange(true, std::memory_order_acquire)) {
            // Spin until we get the lock
        }
    }

    void unlock() {
        locked_.store(false, std::memory_order_release);
    }

    bool try_lock() {
        // Only succeed if currently unlocked (false -> true)
        return !locked_.exchange(true, std::memory_order_acquire);
    }
};

Why acquire on lock?

  • Ensures all reads after the lock see writes made before the corresponding unlock
  • Creates a “happens-before” relationship with the release in unlock

Why release on unlock?

  • Ensures all writes before unlock are visible to threads that subsequently acquire

Step 2: Verification Test

void test_simple_spinlock() {
    simple_spinlock lock;
    int counter = 0;
    const int iterations = 100000;
    const int num_threads = 4;

    std::vector<std::thread> threads;
    for (int i = 0; i < num_threads; ++i) {
        threads.emplace_back([&]() {
            for (int j = 0; j < iterations; ++j) {
                lock.lock();
                ++counter;  // Critical section
                lock.unlock();
            }
        });
    }

    for (auto& t : threads) t.join();

    // Must equal exactly num_threads * iterations
    assert(counter == num_threads * iterations);
    std::cout << "Simple spinlock test PASSED\n";
}

Checkpoint: Test passes consistently with ThreadSanitizer enabled.

Phase 2: Test-and-Test-and-Set Optimization (Days 3-4)

Goal: Reduce cache line contention.

The Problem with Simple Spinlock

Every call to exchange() performs a write, which invalidates the cache line on all cores:

Simple spinlock: 100 threads spinning
  - Each thread: exchange(true) every iteration
  - 100 cache invalidations per iteration
  - Result: Cache line "ping-pongs" between cores

TTAS spinlock: 100 threads spinning
  - Each thread: load() in a loop (read-only, no invalidation!)
  - Only one exchange() when lock appears free
  - Result: Cache line stays shared until contention clears

Implementation

class ttas_spinlock {
private:
    std::atomic<bool> locked_{false};

public:
    void lock() {
        for (;;) {
            // Phase 1: Test (read-only, no cache invalidation)
            while (locked_.load(std::memory_order_relaxed)) {
                // Spin on local cache, no bus traffic
                // Optional: add pause instruction for hyperthreading
                #if defined(__x86_64__)
                    __builtin_ia32_pause();
                #endif
            }

            // Phase 2: Test-and-Set (only if it looked free)
            if (!locked_.exchange(true, std::memory_order_acquire)) {
                return;  // Got the lock
            }
            // Otherwise, someone else grabbed it; back to Phase 1
        }
    }

    void unlock() {
        locked_.store(false, std::memory_order_release);
    }
};

Why relaxed for the inner load?

  • We’re just checking, not acquiring yet
  • No ordering requirements until we actually get the lock
  • Relaxed is fastest for spin-waiting

Adding Exponential Backoff

Under high contention, even TTAS suffers a thundering herd: the instant the lock looks free, every waiter attempts the exchange at once. Backoff spreads those attempts over time:

void lock() {
    int backoff = 1;
    const int max_backoff = 1024;

    for (;;) {
        while (locked_.load(std::memory_order_relaxed)) {
            for (int i = 0; i < backoff; ++i) {
                #if defined(__x86_64__)
                    __builtin_ia32_pause();
                #endif
            }
        }

        if (!locked_.exchange(true, std::memory_order_acquire)) {
            return;
        }

        // Failed to acquire, increase backoff
        backoff = std::min(backoff * 2, max_backoff);
    }
}

Checkpoint: Benchmark shows 2-10x improvement over simple_spinlock under contention.

Phase 3: Ticket Lock for Fairness (Days 5-6)

Goal: Guarantee FIFO ordering.

Implementation

class ticket_lock {
private:
    alignas(64) std::atomic<uint32_t> next_ticket_{0};
    alignas(64) std::atomic<uint32_t> now_serving_{0};
    // alignas(64) prevents false sharing between the two counters

public:
    void lock() {
        // Get our ticket number
        uint32_t my_ticket = next_ticket_.fetch_add(1, std::memory_order_relaxed);

        // Wait until it's our turn
        while (now_serving_.load(std::memory_order_acquire) != my_ticket) {
            #if defined(__x86_64__)
                __builtin_ia32_pause();
            #endif
        }
    }

    void unlock() {
        // Advance to next ticket
        now_serving_.fetch_add(1, std::memory_order_release);
    }

    bool try_lock() {
        // Acquire: synchronizes with the release in the previous unlock()
        uint32_t expected = now_serving_.load(std::memory_order_acquire);
        // Take the next ticket only if it's currently our turn
        // (i.e., next_ticket_ still equals now_serving_)
        return next_ticket_.compare_exchange_strong(
            expected,
            expected + 1,
            std::memory_order_acquire,
            std::memory_order_relaxed
        );
    }
};

Why fetch_add with relaxed for getting ticket?

  • We just need a unique ticket number
  • The ordering guarantee comes from the acquire load on now_serving

Fairness guarantee:

  • Thread with ticket N runs when now_serving == N
  • now_serving only increases
  • Therefore, threads run in ticket order

Checkpoint: Fairness test shows near-equal distribution across threads.

Phase 4: Read-Write Lock (Days 7-10)

Goal: Allow concurrent readers with exclusive writers.

State Design

class rw_spinlock {
private:
    static constexpr uint32_t WRITER_BIT = 0x80000000;
    static constexpr uint32_t READER_MASK = 0x7FFFFFFF;

    std::atomic<uint32_t> state_{0};

Reader Lock Implementation

void lock_shared() {
    for (;;) {
        // Read current state
        uint32_t current = state_.load(std::memory_order_relaxed);

        // If a writer holds or is waiting, we must wait
        if (current & WRITER_BIT) {
            #if defined(__x86_64__)
                __builtin_ia32_pause();
            #endif
            continue;
        }

        // Try to increment reader count
        // compare_exchange_weak may fail spuriously (that's okay, we'll retry)
        if (state_.compare_exchange_weak(
                current,
                current + 1,
                std::memory_order_acquire,
                std::memory_order_relaxed)) {
            return;  // Got read lock
        }
        // CAS failed, current was updated, loop back
    }
}

void unlock_shared() {
    // Decrement reader count
    state_.fetch_sub(1, std::memory_order_release);
}

Why compare_exchange_weak?

  • We’re in a loop anyway, so spurious failures are fine
  • Weak may be faster on some architectures (no loop in hardware)

The TOCTOU problem: Between checking for writer and incrementing, a writer could appear. That’s why we use CAS: if the state changed, CAS fails and we recheck.

Writer Lock Implementation

void lock_exclusive() {
    for (;;) {
        uint32_t current = state_.load(std::memory_order_relaxed);

        // If any writer is present, wait
        if (current & WRITER_BIT) {
            #if defined(__x86_64__)
                __builtin_ia32_pause();
            #endif
            continue;
        }

        // Try to set writer bit (claiming write intent)
        if (state_.compare_exchange_weak(
                current,
                current | WRITER_BIT,
                std::memory_order_acquire,
                std::memory_order_relaxed)) {

            // We've claimed writer intent, now wait for readers to drain
            while ((state_.load(std::memory_order_relaxed) & READER_MASK) != 0) {
                #if defined(__x86_64__)
                    __builtin_ia32_pause();
                #endif
            }
            return;  // All readers gone, we have exclusive access
        }
    }
}

void unlock_exclusive() {
    // Clear writer bit
    state_.fetch_and(~WRITER_BIT, std::memory_order_release);
}

Writer starvation prevention: Once a writer sets the bit, new readers will block. Existing readers drain out, then the writer proceeds.
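
The interface section also lists try_lock_shared and try_lock_exclusive, which the guide hasn't shown. One possible implementation, sketched here, is a single CAS attempt with no spinning. Note that try_lock_exclusive must succeed only when the lock is completely free, because a failed try must not leave the writer bit set behind:

bool try_lock_shared() {
    uint32_t current = state_.load(std::memory_order_relaxed);
    if (current & WRITER_BIT) {
        return false;  // Writer holds the lock or is waiting
    }
    // One attempt to bump the reader count; fail if the state changed underneath us
    return state_.compare_exchange_strong(
        current, current + 1,
        std::memory_order_acquire, std::memory_order_relaxed);
}

bool try_lock_exclusive() {
    uint32_t expected = 0;  // Succeed only if there are no readers and no writer
    return state_.compare_exchange_strong(
        expected, WRITER_BIT,
        std::memory_order_acquire, std::memory_order_relaxed);
}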

RAII Guards

class read_guard {
private:
    rw_spinlock& lock_;
public:
    explicit read_guard(rw_spinlock& lock) : lock_(lock) {
        lock_.lock_shared();
    }
    ~read_guard() {
        lock_.unlock_shared();
    }
    read_guard(const read_guard&) = delete;
    read_guard& operator=(const read_guard&) = delete;
};

class write_guard {
private:
    rw_spinlock& lock_;
public:
    explicit write_guard(rw_spinlock& lock) : lock_(lock) {
        lock_.lock_exclusive();
    }
    ~write_guard() {
        lock_.unlock_exclusive();
    }
    write_guard(const write_guard&) = delete;
    write_guard& operator=(const write_guard&) = delete;
};

Checkpoint: Multiple readers can proceed concurrently; writer gets exclusive access.

Phase 5: Benchmarking (Days 11-14)

Goal: Comparative performance analysis.

Cycle-Accurate Timing

#if defined(__x86_64__)
inline uint64_t rdtsc() {
    uint32_t lo, hi;
    __asm__ volatile ("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}
#else
inline uint64_t rdtsc() {
    // Fallback for non-x86: a time-based tick count, not CPU cycles
    return std::chrono::high_resolution_clock::now().time_since_epoch().count();
}
#endif

template<typename Lock>
void benchmark_uncontended(Lock& lock, const char* name) {
    const int iterations = 1000000;

    uint64_t start = rdtsc();
    for (int i = 0; i < iterations; ++i) {
        lock.lock();
        lock.unlock();
    }
    uint64_t end = rdtsc();

    double cycles_per_op = (double)(end - start) / iterations;
    std::cout << name << ": " << cycles_per_op << " cycles/op\n";
}

Contention Test

template<typename Lock>
void benchmark_contended(Lock& lock, int num_threads, int ops_per_thread) {
    std::atomic<uint64_t> total_ops{0};
    std::vector<uint64_t> per_thread_ops(num_threads);

    auto start = std::chrono::high_resolution_clock::now();

    std::vector<std::thread> threads;
    for (int t = 0; t < num_threads; ++t) {
        threads.emplace_back([&, t]() {
            uint64_t my_ops = 0;
            for (int i = 0; i < ops_per_thread; ++i) {
                lock.lock();
                // Minimal critical section
                ++my_ops;
                lock.unlock();
            }
            per_thread_ops[t] = my_ops;
            total_ops.fetch_add(my_ops, std::memory_order_relaxed);
        });
    }

    for (auto& th : threads) th.join();

    auto end = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(end - start);

    // Calculate fairness
    uint64_t min_ops = *std::min_element(per_thread_ops.begin(), per_thread_ops.end());
    uint64_t max_ops = *std::max_element(per_thread_ops.begin(), per_thread_ops.end());
    double fairness = (double)min_ops / max_ops;

    std::cout << "Total time: " << duration.count() << "ms\n";
    std::cout << "Ops/sec: " << (total_ops * 1000 / duration.count()) << "\n";
    std::cout << "Fairness: " << fairness << "\n";
}

RWLock Scaling Test

template<typename RWLock>
void benchmark_rwlock_scaling(int read_percent = 95) {
    RWLock lock;
    std::atomic<uint64_t> counter{0};

    auto worker = [&](int thread_id) {
        std::mt19937 rng(thread_id);
        std::uniform_int_distribution<int> dist(1, 100);

        for (int i = 0; i < 100000; ++i) {
            if (dist(rng) <= read_percent) {
                // Read operation
                typename RWLock::read_guard guard(lock);
                volatile auto x = counter.load(std::memory_order_relaxed);
                (void)x;
            } else {
                // Write operation
                typename RWLock::write_guard guard(lock);
                counter.fetch_add(1, std::memory_order_relaxed);
            }
        }
    };

    for (int num_threads : {1, 2, 4, 8, 16}) {
        auto start = std::chrono::high_resolution_clock::now();

        std::vector<std::thread> threads;
        for (int t = 0; t < num_threads; ++t) {
            threads.emplace_back(worker, t);
        }
        for (auto& th : threads) th.join();

        auto end = std::chrono::high_resolution_clock::now();
        auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(end - start);

        uint64_t total_ops = num_threads * 100000;
        std::cout << num_threads << " threads: "
                  << (total_ops * 1000 / duration.count()) << " ops/sec\n";
    }
}

Checkpoint: Benchmarks reveal clear performance differences and guide when to use each lock type.


Testing Strategy

Unit Tests

// Test mutual exclusion
TEST(SimpleSpinlock, MutualExclusion) {
    simple_spinlock lock;
    int counter = 0;
    std::vector<std::thread> threads;

    for (int i = 0; i < 10; ++i) {
        threads.emplace_back([&]() {
            for (int j = 0; j < 10000; ++j) {
                lock.lock();
                int temp = counter;
                temp++;
                counter = temp;  // Non-atomic increment
                lock.unlock();
            }
        });
    }

    for (auto& t : threads) t.join();
    EXPECT_EQ(counter, 100000);
}

// Test try_lock semantics
TEST(SimpleSpinlock, TryLock) {
    simple_spinlock lock;

    EXPECT_TRUE(lock.try_lock());   // Should succeed
    EXPECT_FALSE(lock.try_lock());  // Should fail (already locked)
    lock.unlock();
    EXPECT_TRUE(lock.try_lock());   // Should succeed again
    lock.unlock();
}

// Test RWLock concurrent readers
TEST(RWSpinlock, ConcurrentReaders) {
    rw_spinlock lock;
    std::atomic<int> readers_inside{0};
    std::atomic<int> max_concurrent{0};

    std::vector<std::thread> threads;
    for (int i = 0; i < 10; ++i) {
        threads.emplace_back([&]() {
            for (int j = 0; j < 1000; ++j) {
                lock.lock_shared();
                int current = ++readers_inside;
                int prev_max = max_concurrent.load();
                while (current > prev_max &&
                       !max_concurrent.compare_exchange_weak(prev_max, current));
                std::this_thread::yield();  // Stay inside briefly
                --readers_inside;
                lock.unlock_shared();
            }
        });
    }

    for (auto& t : threads) t.join();
    EXPECT_GT(max_concurrent, 1);  // Multiple readers should overlap
}

ThreadSanitizer Integration

Compile with TSan to detect data races:

g++ -fsanitize=thread -g -O1 spinlock_test.cpp -o spinlock_test
./spinlock_test

A correct implementation produces no TSan warnings.

Stress Tests

void stress_test(int num_threads, int duration_seconds) {
    simple_spinlock lock;
    std::atomic<bool> stop{false};
    std::atomic<uint64_t> operations{0};
    int shared_data = 0;

    std::vector<std::thread> threads;
    for (int i = 0; i < num_threads; ++i) {
        threads.emplace_back([&]() {
            while (!stop) {
                lock.lock();
                // Verify invariant
                assert(shared_data >= 0);
                shared_data++;
                shared_data--;
                operations.fetch_add(1, std::memory_order_relaxed);
                lock.unlock();
            }
        });
    }

    std::this_thread::sleep_for(std::chrono::seconds(duration_seconds));
    stop = true;

    for (auto& t : threads) t.join();

    std::cout << "Completed " << operations << " operations\n";
    assert(shared_data == 0);  // Invariant maintained
}

Common Pitfalls and Debugging

Pitfall 1: Missing Memory Ordering

Symptom: Data corruption despite holding lock. Tests fail intermittently.

Example of bug:

// WRONG: relaxed ordering doesn't synchronize
void lock() {
    while (locked_.exchange(true, std::memory_order_relaxed)) {}
}
void unlock() {
    locked_.store(false, std::memory_order_relaxed);
}

Why it fails: The compiler/CPU can reorder operations across the lock boundary. Protected reads and writes can leak outside the critical section, so two threads can touch the shared data concurrently even though each holds the lock.

Fix: Use acquire on lock, release on unlock:

void lock() {
    while (locked_.exchange(true, std::memory_order_acquire)) {}
}
void unlock() {
    locked_.store(false, std::memory_order_release);
}

Pitfall 2: Livelock in RWLock

Symptom: System appears to hang. Threads are running (burning CPU) but making no progress.

Example scenario:

Thread W1: Tries to write, sets WRITER_BIT, waits for readers
Thread R1: Tries to read, sees WRITER_BIT, spins
Thread R2: Tries to read, sees WRITER_BIT, spins
...
Thread W1: Gives up, clears WRITER_BIT
Thread R1: Acquires read lock
Thread W1: Tries again, sets WRITER_BIT
Thread R2: Sees WRITER_BIT, spins
Thread R1: Still reading, W1 waits
... (W1 starves if readers keep coming)

Fix: Once WRITER_BIT is set, new readers must wait. Existing readers drain, then writer proceeds. (Our implementation does this correctly.)

Pitfall 3: False Sharing in Ticket Lock

Symptom: Ticket lock performs worse than simple spinlock.

Cause: next_ticket and now_serving on same cache line.

Example of bug:

class ticket_lock {
    std::atomic<uint32_t> next_ticket_;   // |
    std::atomic<uint32_t> now_serving_;   // | Same cache line!
};

Fix: Align to separate cache lines:

class ticket_lock {
    alignas(64) std::atomic<uint32_t> next_ticket_;
    alignas(64) std::atomic<uint32_t> now_serving_;
};

Pitfall 4: Forgetting Pause in Spin Loops

Symptom: High CPU usage, hyperthreading performance degradation.

Cause: Tight spin loops prevent the sibling hyperthread from running efficiently.

Fix: Add pause instruction:

while (locked_.load(std::memory_order_relaxed)) {
    #if defined(__x86_64__)
        __builtin_ia32_pause();  // Or _mm_pause()
    #elif defined(__aarch64__)
        __asm__ volatile("yield");
    #endif
}
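
If you'd rather not repeat the #if block in every spin loop, a small wrapper works (a sketch; the helper name is ours):

inline void cpu_pause() {
#if defined(__x86_64__)
    __builtin_ia32_pause();
#elif defined(__aarch64__)
    __asm__ volatile("yield");
#else
    // No architecture-specific spin hint available; do nothing
#endif
}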

Pitfall 5: compare_exchange_strong vs compare_exchange_weak

Symptom: Occasional unexpected failures in CAS loops.

Understanding the difference:

  • compare_exchange_strong: Only fails if value differs
  • compare_exchange_weak: May fail spuriously (even if values match)

When to use each:

  • Weak in loops (we’ll retry anyway, and weak may be faster)
  • Strong when failure has side effects or we’re not looping
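
A short illustration of the two idioms (a standalone sketch, not code from the project above):

#include <atomic>

std::atomic<int> value{0};

// Weak inside a retry loop: a spurious failure just costs one more iteration.
void add_one_cas_loop() {
    int expected = value.load(std::memory_order_relaxed);
    while (!value.compare_exchange_weak(expected, expected + 1,
                                        std::memory_order_relaxed)) {
        // expected has been refreshed with the current value; retry
    }
}

// Strong for a one-shot attempt, where a spurious failure would be
// misread as "someone else got there first".
bool claim_slot_once() {
    int expected = 0;
    return value.compare_exchange_strong(expected, 1,
                                         std::memory_order_acq_rel);
}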

Debugging with Print Statements

When locks misbehave, add tracing:

void lock() {
    // get_thread_id() is a placeholder for your own helper
    // (e.g., a hash of std::this_thread::get_id())
    thread_local static int thread_id = get_thread_id();
    std::cerr << "[T" << thread_id << "] Attempting lock\n";

    while (locked_.exchange(true, std::memory_order_acquire)) {
        // Uncomment to see spin attempts (very verbose!)
        // std::cerr << "[T" << thread_id << "] Spinning...\n";
    }

    std::cerr << "[T" << thread_id << "] Acquired lock\n";
}

Extensions and Challenges

Extension 1: MCS Lock (Advanced Fairness)

Implement the Mellor-Crummey-Scott lock, which spins on local variables instead of shared state:

MCS Lock Structure:
Each thread has a node:
┌─────────┐     ┌─────────┐     ┌─────────┐
│  Node A │────▶│  Node B │────▶│  Node C │
│ locked  │     │ locked  │     │ locked  │
└─────────┘     └─────────┘     └─────────┘
     ↑               ↑               ↑
Thread A         Thread B        Thread C
spins on         spins on         spins on
Node A           Node B           Node C
(local!)         (local!)         (local!)

Benefits: Each thread spins on its own cache line. No global cache line ping-pong.
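
A compact sketch of the idea, assuming C++17. Each thread supplies its own node (for example via a thread_local variable) and passes it to both lock() and unlock(), which is why MCS locks don't fit the std::lock_guard interface directly:

#include <atomic>

struct mcs_node {
    std::atomic<mcs_node*> next{nullptr};
    std::atomic<bool> locked{false};
};

class mcs_lock {
    std::atomic<mcs_node*> tail_{nullptr};

public:
    void lock(mcs_node& me) {
        me.next.store(nullptr, std::memory_order_relaxed);
        me.locked.store(true, std::memory_order_relaxed);

        // Swing the tail to our node; the previous tail (if any) is our predecessor.
        mcs_node* prev = tail_.exchange(&me, std::memory_order_acq_rel);
        if (prev == nullptr) {
            return;  // Queue was empty: we hold the lock.
        }

        // Link behind the predecessor, then spin on our OWN flag (a local cache line).
        prev->next.store(&me, std::memory_order_release);
        while (me.locked.load(std::memory_order_acquire)) {
            // optionally pause here
        }
    }

    void unlock(mcs_node& me) {
        mcs_node* succ = me.next.load(std::memory_order_acquire);
        if (succ == nullptr) {
            // No visible successor: try to swing the tail back to empty.
            mcs_node* expected = &me;
            if (tail_.compare_exchange_strong(expected, nullptr,
                                              std::memory_order_release,
                                              std::memory_order_relaxed)) {
                return;  // Nobody was waiting.
            }
            // A thread is between its tail exchange and its next-pointer store; wait for it.
            while ((succ = me.next.load(std::memory_order_acquire)) == nullptr) {
                // optionally pause here
            }
        }
        succ->locked.store(false, std::memory_order_release);  // Hand the lock to the successor.
    }
};

The handoff happens entirely through the successor's own node: unlock() writes locked = false on that thread's cache line, so waiters never hammer a shared location.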

Extension 2: Adaptive Spinning

Implement a lock that switches between spinning and yielding based on observed contention:

void lock() {
    int spin_count = 0;
    const int MAX_SPIN = 1000;

    while (!try_lock()) {
        if (++spin_count < MAX_SPIN) {
            // Spin
            __builtin_ia32_pause();
        } else {
            // Yield to OS
            std::this_thread::yield();
            spin_count = 0;
        }
    }
}

Extension 3: Hierarchical Lock

For NUMA systems, implement a lock that prefers threads on the same NUMA node:

NUMA Node 0          NUMA Node 1
┌───────────────┐    ┌───────────────┐
│ Local lock 0  │    │ Local lock 1  │
│               │    │               │
│ Thread 0      │    │ Thread 4      │
│ Thread 1      │    │ Thread 5      │
│ Thread 2      │    │ Thread 6      │
│ Thread 3      │    │ Thread 7      │
└───────┬───────┘    └───────┬───────┘
        │                    │
        └─────────┬─────────┘
                  │
            ┌───────────┐
            │Global lock│
            └───────────┘

Extension 4: Sequence Lock (SeqLock)

Implement a lock where readers never block but may retry:

class seqlock {
    std::atomic<uint32_t> seq_{0};  // Even = no writer, odd = writer active

public:
    // Writer (assumes a single writer, or writers serialized by a separate lock)
    void write_lock() {
        seq_.fetch_add(1, std::memory_order_acquire);  // Makes the count odd
    }
    void write_unlock() {
        seq_.fetch_add(1, std::memory_order_release);  // Makes the count even
    }

    // Reader (non-blocking, may retry)
    template<typename F>
    auto read(F&& func) {
        while (true) {
            uint32_t s1 = seq_.load(std::memory_order_acquire);
            if (s1 & 1) continue;  // Writer active, retry

            auto result = func();

            std::atomic_thread_fence(std::memory_order_acquire);
            uint32_t s2 = seq_.load(std::memory_order_relaxed);
            if (s1 == s2) return result;  // No writer during read
            // Otherwise retry
        }
    }
};

Challenge: Lock-Free Comparison

Implement the benchmarks with a lock-free counter (atomic increment) and compare throughput. When does locking beat lock-free?
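
A sketch of the lock-free baseline for this challenge (assuming C++17): the entire critical section collapses into a single atomic read-modify-write.

#include <atomic>
#include <cstdint>
#include <thread>
#include <vector>

std::atomic<uint64_t> counter{0};

void run_lock_free(int num_threads, int ops_per_thread) {
    std::vector<std::thread> threads;
    for (int t = 0; t < num_threads; ++t) {
        threads.emplace_back([&]() {
            for (int i = 0; i < ops_per_thread; ++i) {
                counter.fetch_add(1, std::memory_order_relaxed);  // No lock at all
            }
        });
    }
    for (auto& th : threads) th.join();
}

Time this with the same thread counts as benchmark_contended. The atomic counter usually wins by a wide margin for a single variable; the picture changes once the critical section must update more than one thing consistently, which is where locks (or the lock-free structures of Project 5) come in.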


Resources

Primary References

Academic Papers

  • “Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors” - Mellor-Crummey & Scott, 1991
    • Introduces MCS locks, still relevant today
  • “The Art of Multiprocessor Programming” - Herlihy & Shavit
    • Chapter 7: Spin Locks and Contention
    • Formal treatment of lock correctness

Implementation References

  • Linux kernel spinlock implementation: include/asm-generic/spinlock.h
  • Folly SpinLock: folly/synchronization/SpinLock.h
  • Boost.Atomic documentation

Tools

  • ThreadSanitizer (TSan): -fsanitize=thread - Detects data races
  • perf stat: perf stat -e cache-misses,cache-references - Cache analysis
  • Intel VTune: Detailed lock contention analysis

Self-Assessment Checklist

Before considering this project complete, verify you can:

Implementation

  • Simple spinlock passes all correctness tests with TSan
  • TTAS shows measurably better performance under contention than simple spinlock
  • Ticket lock shows near-perfect fairness in distribution test
  • RWLock allows multiple concurrent readers
  • RWLock writer gets exclusive access (readers drain)
  • All locks work with std::lock_guard

Understanding

  • Explain why memory_order_acquire is needed on lock
  • Explain why memory_order_release is needed on unlock
  • Explain the cache coherence problem that TTAS solves
  • Explain the TOCTOU race in RWLock and how CAS prevents it
  • Explain when spinlocks beat mutexes and vice versa

Benchmarking

  • Measured uncontended latency in cycles
  • Measured contended throughput with multiple threads
  • Compared fairness between lock types
  • Compared against std::mutex and std::shared_mutex
  • Can explain why std::mutex often wins under contention

Interview Readiness

  • Can implement a spinlock on a whiteboard
  • Can explain compare_exchange semantics
  • Can discuss fairness vs performance tradeoffs
  • Can describe real-world lock usage scenarios

Submission/Completion Criteria

Your implementation is complete when:

  1. All tests pass with ThreadSanitizer enabled (no data races)

  2. Benchmark output matches expected patterns:
    • Spinlocks faster than mutex for uncontended
    • Mutex faster than spinlocks under high contention
    • RWLock scales with reader count
    • Ticket lock shows ~1.0 fairness
  3. Code quality:
    • All locks are header-only or single compilation unit
    • RAII guards provided for all lock types
    • Proper use of alignas to prevent false sharing
    • Pause instructions in spin loops
  4. Documentation:
    • README explaining each lock type and when to use it
    • Benchmark results from your machine
    • Your analysis of the results
  5. Bonus points:
    • MCS lock implementation
    • NUMA-aware measurements
    • Comparison with std::atomic_flag spinlock (a minimal sketch follows)
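
A minimal std::atomic_flag spinlock for that comparison (a sketch, assuming C++17; C++20 adds atomic_flag::test(), which would allow a TTAS-style inner loop):

#include <atomic>

class flag_spinlock {
    std::atomic_flag flag_ = ATOMIC_FLAG_INIT;

public:
    void lock() {
        while (flag_.test_and_set(std::memory_order_acquire)) {
            // spin; optionally pause here
        }
    }
    bool try_lock() {
        return !flag_.test_and_set(std::memory_order_acquire);
    }
    void unlock() {
        flag_.clear(std::memory_order_release);
    }
};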

The Core Question You’re Answering

“What does it mean for an operation to be atomic, and how do atomic operations create synchronization?”

When you write locked_.exchange(true), the hardware guarantees that no other core can interleave a read or write to that location. But atomicity alone doesn’t create a lock—a lock requires that operations inside the critical section happen after acquiring and before releasing.

This is what memory ordering provides. The acquire on lock says “nothing I do can move before this.” The release on unlock says “nothing I did can move after this.” Together, they create a happens-before relationship: everything inside the critical section happens after the lock and before the unlock, as observed by any other thread.

Building locks from atomics teaches you that synchronization is not magic—it’s a careful choreography of visibility guarantees. When you understand this deeply, you understand why concurrent programming is hard: the default is chaos, and every guarantee costs performance.

After completing this project, you’ll never again wonder “why does my concurrent code break?” You’ll know exactly which ordering guarantee is missing.


After completing this project, you’ll have internalized why “just use a mutex” is often the right answer—and you’ll know exactly when it isn’t. Project 5 (Lock-Free Stack) builds on these atomic operation skills to create data structures where no thread ever waits for another.