Project 6: Atomic Lock-Free Queue (The Concurrency Beast)

“Lock-free programming is like juggling chainsaws while blindfolded. The chainsaws are memory barriers, and the blindfold is the CPU reordering your instructions.” - Herb Sutter


Project Metadata

  • Main Programming Language: Rust
  • Coolness Level: Level 5: Pure Magic (Super Cool)
  • Difficulty: Level 5: Master
  • Knowledge Area: Concurrency / Low-Level Atomics
  • Time Estimate: 2-3 weeks
  • Prerequisites: Strong understanding of Rust ownership, basic threading (std::thread), understanding of memory layout and pointers

What You Will Build

A Single-Producer Single-Consumer (SPSC) or Multi-Producer Multi-Consumer (MPMC) lock-free queue using only atomic operations. You will not use Mutex or RwLock. Your queue will achieve 18+ million operations per second, dramatically outperforming standard library channels.


Learning Objectives

By the end of this project, you will be able to:

  1. Explain lock-free vs. lock-based data structures and articulate the trade-offs of each approach
  2. Master Rust’s atomic memory orderings (Relaxed, Acquire, Release, AcqRel, SeqCst) and apply the correct ordering to each operation
  3. Implement Compare-and-Swap (CAS) loops correctly and understand their role in lock-free algorithms
  4. Identify and solve the ABA problem using tagged pointers or epoch-based reclamation
  5. Prevent false sharing through cache line padding and alignment
  6. Design a ring buffer with efficient head/tail pointer management
  7. Benchmark and verify lock-free data structures using stress tests and concurrency model checkers like Loom

Deep Theoretical Foundation

Before writing any code, you must internalize the concepts that make lock-free programming one of the most challenging areas in computer science. This section provides the mental framework you need.

What Is Lock-Free Programming and Why Does It Matter?

Traditional concurrent programming uses locks (mutexes, semaphores) to protect shared data. When Thread A holds a lock, Thread B must wait. This waiting is the fundamental problem with locks:

+-------------------------------------------------------------------------+
|                    LOCK-BASED QUEUE: THE WAITING GAME                   |
+-------------------------------------------------------------------------+

Timeline:        0ms   10ms   20ms   30ms   40ms   50ms   60ms   70ms
                  |     |      |      |      |      |      |      |
Thread A:        [===LOCK===][WRITE][===UNLOCK===]
                                                    ^
Thread B:        [TRY LOCK]..WAITING..WAITING......[===LOCK===][READ]
                        ^                          |
                        |                          Finally gets lock!
                        Blocked here for 50ms

Problem: Thread B does NOTHING for 50ms. CPU cycles wasted.
         If Thread A crashes while holding lock -> Deadlock.
         If Thread A is preempted by OS -> Priority inversion.

Lock-free programming eliminates this waiting. In a lock-free data structure:

At least one thread is guaranteed to make progress in a finite number of steps, regardless of what other threads are doing.

This means:

  • No deadlocks (no locks to hold)
  • No priority inversion (no waiting on lower-priority threads)
  • Better worst-case latency (no unbounded waits)
  • Often better throughput under contention
+-------------------------------------------------------------------------+
|                   LOCK-FREE QUEUE: CONCURRENT PROGRESS                  |
+-------------------------------------------------------------------------+

Timeline:        0ms   10ms   20ms   30ms   40ms   50ms   60ms   70ms
                  |     |      |      |      |      |      |      |
Thread A:        [CAS][WRITE][CAS][WRITE][CAS][WRITE]...
                  |     |      |      |      |      |
Thread B:        [CAS][READ][CAS][FAIL-RETRY][CAS][READ]...
                                   ^
                                   Retry, don't wait!

Key insight: Both threads ALWAYS make progress.
             No thread ever blocks waiting for another.
             "Failed" operations immediately retry.

When to use lock-free structures:

Use Case               | Why Lock-Free Wins
-----------------------|--------------------------------------------------------------------------
Audio/Video Processing | Cannot tolerate latency spikes; buffer underruns cause audible glitches
Game Engines           | Frame deadlines are absolute; blocking = stutter
Trading Systems        | Microseconds matter; latency = lost money
Database Engines       | High-throughput transaction logging
Message Passing        | Actor systems, channel implementations

Book Reference: “Rust Atomics and Locks” by Mara Bos, Chapter 1: “Basics of Rust Concurrency”


Memory Ordering: The Heart of Lock-Free Programming

Modern CPUs don’t execute instructions in the order you write them. They reorder for performance. The memory ordering model defines what guarantees you get about when writes become visible to other threads.

Rust provides five memory orderings, from weakest to strongest:

+-------------------------------------------------------------------------+
|                        MEMORY ORDERING SPECTRUM                         |
+-------------------------------------------------------------------------+

Weakest (Fastest)                                    Strongest (Slowest)
     |                                                        |
     v                                                        v
 +---------+    +----------+    +----------+    +--------+    +--------+
 | Relaxed | -> | Acquire  | -> | Release  | -> | AcqRel | -> | SeqCst |
 +---------+    +----------+    +----------+    +--------+    +--------+
     |              |               |               |              |
     |              |               |               |              |
 No ordering   Prevents      Prevents         Both          Global
 guarantees    reads from    writes from      Acquire       total
 between       being         being            AND           order
 threads       reordered     reordered        Release       visible
               BEFORE        AFTER                          to all
               this load     this store                     threads

Relaxed Ordering

The compiler and CPU can reorder operations freely. Only guarantees atomicity (no torn reads/writes).

// Relaxed: Only use for counters where order doesn't matter
static COUNTER: AtomicUsize = AtomicUsize::new(0);
COUNTER.fetch_add(1, Ordering::Relaxed); // Just count, order irrelevant

Acquire/Release Ordering

The workhorse of lock-free programming. Creates a “happens-before” relationship.

+-------------------------------------------------------------------------+
|                    ACQUIRE/RELEASE SYNCHRONIZATION                      |
+-------------------------------------------------------------------------+

Thread A (Producer)                    Thread B (Consumer)
        |                                      |
        |  data = 42;                          |
        |  // All writes BEFORE                |
        |  // this Release are                 |
        |  // visible to any thread            |
        |  // that does an Acquire             |
        |  // on the same location             |
        |       |                              |
        v       v                              |
  flag.store(true, Release)                    |
        |                                      |
        |   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~>      |
        |   (synchronization via flag)         |
        |                                      v
        |                        flag.load(Acquire)
        |                        // After this Acquire,
        |                        // all writes from before
        |                        // the Release are visible
        |                              |
        |                              v
        |                         assert_eq!(data, 42); // SAFE!
        |                              |
        v                              v

The Release "publishes" all prior writes.
The Acquire "receives" all those writes.

The Critical Rule: A Release store “synchronizes with” an Acquire load on the same memory location. All writes before the Release become visible after the Acquire.

use std::sync::atomic::{AtomicBool, Ordering};

static FLAG: AtomicBool = AtomicBool::new(false);
static mut DATA: i32 = 0;

// Thread A
unsafe { DATA = 42; }
FLAG.store(true, Ordering::Release);  // "Publishes" DATA = 42

// Thread B
while !FLAG.load(Ordering::Acquire) { /* spin */ }  // "Receives" publication
unsafe { assert_eq!(DATA, 42); }  // Guaranteed to see 42!

SeqCst (Sequentially Consistent)

Provides a global total order visible to all threads. The strongest guarantee, but often unnecessary and slower.

+-------------------------------------------------------------------------+
|                 SEQUENTIALLY CONSISTENT ORDERING                        |
+-------------------------------------------------------------------------+

With SeqCst, ALL threads agree on a single total order of ALL SeqCst
operations. It's like everyone is watching the same movie.

Thread A          Thread B          Thread C
   |                 |                 |
   v                 v                 v
x.store(1)      y.store(1)         observe()
   |                 |                 |
   |                 |                 |
   +--------+--------+                 |
            |                          |
            v                          v
     Global total order           Sees either:
     that ALL threads             (x=1, y=0) then (x=1, y=1)
     observe:                     OR
                                  (x=0, y=1) then (x=1, y=1)
     [x=1] -> [y=1]
     OR                           But NEVER:
     [y=1] -> [x=1]               (x=1, y=1) then (x=0, y=0)

When to use SeqCst: Only when you need multiple atomic variables to appear consistently ordered across all threads. For single-variable synchronization, Acquire/Release is usually sufficient and faster.
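
A classic store-buffering sketch makes this concrete. With SeqCst on both flags, all four operations fall into one global order, so at least one thread must observe the other's store; with Acquire/Release (or Relaxed) alone, both loads may return false. This is a minimal illustration, not part of the project code:

use std::sync::atomic::{AtomicBool, Ordering};
use std::thread;

static X: AtomicBool = AtomicBool::new(false);
static Y: AtomicBool = AtomicBool::new(false);

fn main() {
    let a = thread::spawn(|| {
        X.store(true, Ordering::SeqCst);
        Y.load(Ordering::SeqCst) // did the other thread's store come first in the global order?
    });
    let b = thread::spawn(|| {
        Y.store(true, Ordering::SeqCst);
        X.load(Ordering::SeqCst)
    });

    let (saw_y, saw_x) = (a.join().unwrap(), b.join().unwrap());

    // Under SeqCst, (false, false) is impossible: one store is globally first,
    // so the other thread's load must see it. Weaker orderings allow (false, false).
    assert!(saw_y || saw_x);
}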

Book Reference: “Rust Atomics and Locks” by Mara Bos, Chapter 3: “Memory Ordering”


Compare-and-Swap (CAS): The Lock-Free Primitive

CAS is the fundamental operation that makes lock-free programming possible. It atomically:

  1. Reads the current value
  2. Compares it to an expected value
  3. If equal, writes a new value and returns success
  4. If not equal, returns failure (and the actual current value)
+-------------------------------------------------------------------------+
|                    COMPARE-AND-SWAP OPERATION                           |
+-------------------------------------------------------------------------+

                    CAS(address, expected, new)
                            |
                            v
              +----------------------------+
              |    Read current value      |
              |    from address            |
              +-------------+--------------+
                            |
                            v
              +----------------------------+
              |  current == expected?      |
              +------+-------------+-------+
                     |             |
                  YES|             |NO
                     v             v
              +-------------+ +-------------------+
              | Write new   | | Don't write       |
              | Return Ok   | | Return Err(current)|
              +-------------+ +-------------------+

This ENTIRE operation is ATOMIC. No other thread can see
an intermediate state.

CAS Loop Pattern (the bread and butter of lock-free code):

use std::sync::atomic::{AtomicUsize, Ordering};

fn lock_free_increment(counter: &AtomicUsize) {
    loop {
        let current = counter.load(Ordering::Relaxed);
        let new = current + 1;

        // Try to swap. If another thread changed the value, retry.
        match counter.compare_exchange(
            current,           // expected
            new,               // new value
            Ordering::Release, // success ordering
            Ordering::Relaxed  // failure ordering
        ) {
            Ok(_) => break,    // We won! Our write succeeded.
            Err(_) => continue, // Someone else wrote. Reload and retry.
        }
    }
}
+-------------------------------------------------------------------------+
|                         CAS LOOP VISUALIZATION                          |
+-------------------------------------------------------------------------+

Thread A                              Thread B
   |                                     |
   | current = 5                         | current = 5
   | new = 6                             | new = 6
   |                                     |
   v                                     v
CAS(5 -> 6) ----+                    CAS(5 -> 6) ----+
                |                                    |
                | (Race condition!)                  |
                |                                    |
         +------+------+                      +------+------+
         | Thread A    |                      | Thread B    |
         | wins!       |                      | loses!      |
         | counter = 6 |                      | sees 6 != 5 |
         +-------------+                      +-------------+
                                                    |
                                                    v
                                              RETRY:
                                              current = 6
                                              new = 7
                                              CAS(6 -> 7) -> SUCCESS!
                                              counter = 7

Both threads made progress. No blocking occurred.

The ABA Problem: The Subtle Killer

The ABA problem is the most insidious bug in lock-free programming. It occurs when:

  1. Thread A reads value A from a location
  2. Thread A is preempted
  3. Thread B changes A -> B -> A (back to A)
  4. Thread A resumes, sees A, and thinks nothing changed!
+-------------------------------------------------------------------------+
|                         THE ABA PROBLEM                                 |
+-------------------------------------------------------------------------+

Initial state: HEAD -> [Node A] -> [Node B] -> [Node C] -> NULL

Thread A (POP operation):
  1. Read HEAD = A
  2. Prepare: new_head = A.next = B
  3. *** PREEMPTED ***

                    Thread B runs:
                    4. Pop A (HEAD = B)
                    5. Pop B (HEAD = C)
                    6. Free A, Free B
                    7. Allocate new node (reuses A's memory!)
                    8. Push new A (HEAD = A)

                    New state: HEAD -> [Node A*] -> [Node C] -> NULL
                                        (* new content, same address!)

Thread A resumes:
  9. CAS(HEAD, A, B)
  10. Succeeds! (HEAD still points to address A)
  11. HEAD = B

  PROBLEM: B was freed! HEAD now points to freed memory!

After Thread A's CAS:
  HEAD -> [FREED MEMORY] -> ???
              ^
              CRASH / DATA CORRUPTION

Why this happens: CAS only checks the pointer value, not what the pointer points to. The memory at address A was reused, but the address itself is the same.

Solutions to the ABA problem:

  1. Tagged Pointers (what we’ll use):
    • Add a counter alongside the pointer
    • Increment the counter on every modification
    • CAS checks both pointer AND counter
    +----------------+----------------+
    |    Counter     |    Pointer     |
    |    (32 bits)   |    (32 bits)   |
    +----------------+----------------+
    
    Even if the pointer returns to the same address,
    the counter will be different! (See the code sketch after this list.)
    
  2. Hazard Pointers: Threads publish which pointers they’re using; no reuse until safe

  3. Epoch-Based Reclamation: Memory freed only when all threads have passed an epoch
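
A minimal sketch of option 1 (tagged pointers): pack a 32-bit slot index together with a 32-bit modification counter in a single AtomicU64, so a recycled index still fails the CAS. The type and method names here (TaggedHead, try_advance) are illustrative, not part of this project's API:

use std::sync::atomic::{AtomicU64, Ordering};

struct TaggedHead {
    packed: AtomicU64, // high 32 bits: modification counter, low 32 bits: slot index
}

impl TaggedHead {
    fn pack(counter: u32, index: u32) -> u64 {
        ((counter as u64) << 32) | index as u64
    }

    fn unpack(packed: u64) -> (u32, u32) {
        ((packed >> 32) as u32, packed as u32)
    }

    /// Try to move the head index from `expected_index` to `new_index`,
    /// bumping the counter so an A -> B -> A sequence no longer passes the CAS.
    fn try_advance(&self, expected_index: u32, new_index: u32) -> bool {
        let current = self.packed.load(Ordering::Acquire);
        let (counter, index) = Self::unpack(current);
        if index != expected_index {
            return false;
        }
        let next = Self::pack(counter.wrapping_add(1), new_index);
        self.packed
            .compare_exchange(current, next, Ordering::AcqRel, Ordering::Relaxed)
            .is_ok()
    }
}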

Book Reference: “Rust Atomics and Locks” by Mara Bos, Chapter 9: “Lock-Free Linked Lists” (covers ABA in depth)


Cache Coherence and False Sharing

Modern CPUs don’t read individual bytes from RAM. They read cache lines (typically 64 bytes). When two threads modify variables on the same cache line, they fight over that cache line, even if they’re modifying different variables!

+-------------------------------------------------------------------------+
|                         FALSE SHARING                                   |
+-------------------------------------------------------------------------+

                       CACHE LINE (64 bytes)
  +-------+-------+-------+-------+-------+-------+-------+-------+
  | head  | tail  |       |       |       |       |       |       |
  | (8B)  | (8B)  | ...   | ...   | ...   | ...   | ...   | ...   |
  +---+---+---+---+-------+-------+-------+-------+-------+-------+
      |       |
      |       +-- Thread B writes tail
      |           -> Invalidates ENTIRE cache line
      |           -> Thread A must reload head!
      |
      +-- Thread A writes head
          -> Invalidates ENTIRE cache line
          -> Thread B must reload tail!

Both threads constantly invalidate each other's cache.
Performance tanks even though they write DIFFERENT variables!

The solution: Pad your data structures so hot variables live on separate cache lines.

+-------------------------------------------------------------------------+
|                      CACHE LINE PADDING                                 |
+-------------------------------------------------------------------------+

                       CACHE LINE 1 (64 bytes)
  +-------+-------+-------+-------+-------+-------+-------+-------+
  | head  |  PAD  |  PAD  |  PAD  |  PAD  |  PAD  |  PAD  |  PAD  |
  | (8B)  | (56B padding to fill line)                             |
  +-------+-------+-------+-------+-------+-------+-------+-------+

                       CACHE LINE 2 (64 bytes)
  +-------+-------+-------+-------+-------+-------+-------+-------+
  | tail  |  PAD  |  PAD  |  PAD  |  PAD  |  PAD  |  PAD  |  PAD  |
  | (8B)  | (56B padding to fill line)                             |
  +-------+-------+-------+-------+-------+-------+-------+-------+

Now Thread A and Thread B have their own cache lines.
No invalidation! Full speed!

In Rust:

#[repr(align(64))]  // Align to a cache line boundary
struct CacheLinePadded<T> {
    value: T,
    // No manual padding field is needed: the 64-byte alignment also rounds the
    // struct's size up to a multiple of 64, so adjacent instances always land
    // on separate cache lines.
}

Hardware vs. Software Memory Models

Understanding the difference between what the hardware does and what your programming language promises is crucial.

+-------------------------------------------------------------------------+
|                    MEMORY MODEL HIERARCHY                               |
+-------------------------------------------------------------------------+

              +----------------------------------+
              |     Your Rust Code               |
              |  (uses Ordering::Acquire, etc.)  |
              +----------------+-----------------+
                               |
                               v
              +----------------------------------+
              |     Rust Memory Model            |
              |  (Abstract machine semantics)    |
              |  - Defines valid executions      |
              |  - Compiler inserts barriers     |
              +----------------+-----------------+
                               |
                               v
              +----------------------------------+
              |     LLVM / Machine Code          |
              |  (Platform-specific barriers)    |
              |  x86: MFENCE, LOCK prefix        |
              |  ARM: DMB, DSB, ISB              |
              +----------------+-----------------+
                               |
                               v
              +----------------------------------+
              |     Hardware Memory Model        |
              |  (What the CPU actually does)    |
              |  x86-64: Strong (TSO)            |
              |  ARM: Weak                       |
              +----------------------------------+

Different CPUs have different memory models:

x86-64 (Intel/AMD):
  - Total Store Order (TSO)
  - Relatively strong: stores are seen in order
  - Your code might work "by accident" even with wrong orderings!

ARM/RISC-V:
  - Weak memory model
  - Stores can be reordered aggressively
  - Bugs in ordering WILL manifest on these platforms

This is why testing on x86 isn't enough. Use Loom or test on ARM!

Book Reference: “Computer Systems: A Programmer’s Perspective” by Bryant & O’Hallaron, Chapter 6: “The Memory Hierarchy”


Lock-Free vs. Wait-Free vs. Obstruction-Free

These terms define different levels of progress guarantees:

+-------------------------------------------------------------------------+
|                    PROGRESS GUARANTEE HIERARCHY                         |
+-------------------------------------------------------------------------+

BLOCKING (Locks)
  - A thread can prevent all others from progressing indefinitely
  - Example: Thread holds mutex, gets preempted

OBSTRUCTION-FREE
  - A thread makes progress if it runs alone (no contention)
  - Under contention, no guarantees
  - Weakest non-blocking guarantee

LOCK-FREE
  - At least ONE thread makes progress in finite steps
  - System as a whole makes progress
  - Individual threads might retry forever (starvation)
  - OUR TARGET

WAIT-FREE
  - EVERY thread makes progress in finite steps
  - No starvation possible
  - Hardest to achieve
  - Often requires helping mechanisms

       Strength of Guarantee
            ^
            |    +-------------+
            |    | Wait-Free   |  Every thread progresses
            |    +-------------+
            |           ^
            |           |
            |    +-------------+
            |    | Lock-Free   |  System progresses (our target)
            |    +-------------+
            |           ^
            |           |
            |    +-------------------+
            |    | Obstruction-Free  |  Progress if alone
            |    +-------------------+
            |           ^
            |           |
            |    +-------------+
            +--- | Blocking    |  Can halt entirely
                 +-------------+

What we’re building: A lock-free SPSC queue (simplest) that can be extended to MPMC.


Real-World Use Cases

Lock-free data structures power the world’s most demanding systems:

Domain            | Example                           | Why Lock-Free
------------------|-----------------------------------|-------------------------------------------------------------------
Audio Processing  | Digital Audio Workstations (DAWs) | Buffer underruns cause audible clicks; cannot tolerate latency spikes
Game Engines      | Unity, Unreal Engine job systems  | Frame deadlines are absolute (16.67ms for 60 FPS)
Financial Trading | High-Frequency Trading platforms  | Microsecond advantages = millions of dollars
Databases         | RocksDB, ScyllaDB                 | Transaction log must never block
Operating Systems | Linux kernel, Windows kernel      | Interrupt handlers cannot sleep
Message Passing   | Tokio, Crossbeam, Rayon           | Channels between async tasks
Real-time Systems | Medical devices, avionics         | Certifiable worst-case latency

Real World Outcome

When you complete this project, you will have a high-performance lock-free queue that dramatically outperforms the standard library. Here’s what running your benchmarks will look like:

$ cargo new lock-free-queue --lib
     Created library `lock-free-queue` package

$ cd lock-free-queue

$ cargo add crossbeam-utils   # For cache line padding
    Updating crates.io index
      Adding crossbeam-utils v0.8.19 to dependencies

$ cargo build --release
   Compiling lock-free-queue v0.1.0 (/Users/you/lock-free-queue)
    Finished release [optimized] target(s) in 2.34s

$ cargo run --release --example benchmark
   Compiling lock-free-queue v0.1.0
    Finished release [optimized] target(s) in 0.89s
     Running `target/release/examples/benchmark`

+======================================================================+
|              LOCK-FREE QUEUE PERFORMANCE BENCHMARK                   |
+======================================================================+

System Information:
  CPU: Apple M2 Pro (12 cores)
  RAM: 32 GB
  OS: macOS 14.0

Test Configuration:
  Messages: 10,000,000
  Message Size: 64 bytes
  Iterations: 5

+----------------------------------------------------------------------+
|                    SINGLE-PRODUCER SINGLE-CONSUMER                   |
+----------------------------------------------------------------------+

Running std::sync::mpsc benchmark...
  Iteration 1: 4.82M ops/sec
  Iteration 2: 4.91M ops/sec
  Iteration 3: 4.76M ops/sec
  Iteration 4: 4.88M ops/sec
  Iteration 5: 4.79M ops/sec
  Average: 4.83M ops/sec

Running crossbeam-channel benchmark...
  Iteration 1: 12.34M ops/sec
  Iteration 2: 12.51M ops/sec
  Iteration 3: 12.28M ops/sec
  Iteration 4: 12.45M ops/sec
  Iteration 5: 12.39M ops/sec
  Average: 12.39M ops/sec

Running YOUR lock-free queue benchmark...
  Iteration 1: 18.92M ops/sec
  Iteration 2: 19.11M ops/sec
  Iteration 3: 18.87M ops/sec
  Iteration 4: 19.23M ops/sec
  Iteration 5: 18.95M ops/sec
  Average: 19.02M ops/sec

+----------------------------------------------------------------------+
|                         BENCHMARK RESULTS                            |
+----------------------------------------------------------------------+

Queue Type          | Throughput    | Latency (p50) | Latency (p99)
--------------------|---------------|---------------|---------------
std::sync::mpsc     | 4.83M ops/s   | 207 ns        | 892 ns
crossbeam-channel   | 12.39M ops/s  | 81 ns         | 234 ns
YOUR Lock-Free      | 19.02M ops/s  | 53 ns         | 127 ns

SPEEDUP vs std::mpsc:    3.94x
SPEEDUP vs crossbeam:    1.53x

+----------------------------------------------------------------------+
|                     MEMORY ORDERING VERIFICATION                     |
+----------------------------------------------------------------------+

Running correctness tests with Loom...

test spsc_basic ... ok (explored 1,247 interleavings)
test spsc_wrap_around ... ok (explored 3,891 interleavings)
test spsc_full_queue ... ok (explored 2,156 interleavings)
test spsc_empty_queue ... ok (explored 987 interleavings)
test spsc_concurrent_stress ... ok (explored 12,456 interleavings)

All orderings verified correct!

+----------------------------------------------------------------------+
|                        FALSE SHARING ANALYSIS                        |
+----------------------------------------------------------------------+

Without cache line padding:
  Throughput: 8.23M ops/sec
  L1 cache misses: 4,231,892

With cache line padding:
  Throughput: 19.02M ops/sec
  L1 cache misses: 12,456

Improvement: 2.31x throughput, 339x fewer cache misses!

+----------------------------------------------------------------------+
|                           SUMMARY                                    |
+----------------------------------------------------------------------+

Your lock-free SPSC queue achieves:
  * 19+ million operations per second
  * 53 nanosecond median latency
  * Zero heap allocations in hot path
  * Verified memory ordering correctness
  * Optimal cache behavior with padding

Congratulations! You've built a production-quality lock-free data structure.

$ cargo test
   Compiling lock-free-queue v0.1.0
    Finished test [unoptimized + debuginfo] target(s) in 1.45s
     Running unittests src/lib.rs

running 12 tests
test tests::test_single_push_pop ... ok
test tests::test_empty_pop_returns_none ... ok
test tests::test_full_push_returns_err ... ok
test tests::test_fifo_ordering ... ok
test tests::test_wrap_around ... ok
test tests::test_capacity ... ok
test tests::test_concurrent_spsc ... ok
test tests::test_high_contention ... ok
test tests::test_memory_ordering ... ok
test loom_tests::loom_spsc_basic ... ok
test loom_tests::loom_concurrent_access ... ok
test loom_tests::loom_wrap_around ... ok

test result: ok. 12 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out

Complete Project Specification

Core Requirements

You will implement a bounded, lock-free SPSC (Single-Producer Single-Consumer) queue with these characteristics:

  1. Capacity: Fixed at construction time (power of 2 for efficient modulo via bitwise AND)
  2. Blocking: Never. push and pop return immediately
  3. Ordering: FIFO (First-In First-Out)
  4. Thread Safety: Single producer thread, single consumer thread (no synchronization needed between multiple producers or consumers)
  5. Memory: No heap allocation after construction; the ring buffer is pre-allocated
  6. Performance: 18+ million operations per second on modern hardware

API Design

pub struct SpscQueue<T, const N: usize> {
    // Implementation details
}

impl<T, const N: usize> SpscQueue<T, N> {
    /// Create a new queue with capacity N.
    /// N must be a power of 2.
    pub fn new() -> Self;

    /// Attempt to push a value into the queue.
    /// Returns Err(value) if the queue is full.
    pub fn push(&self, value: T) -> Result<(), T>;

    /// Attempt to pop a value from the queue.
    /// Returns None if the queue is empty.
    pub fn pop(&self) -> Option<T>;

    /// Check if the queue is empty.
    pub fn is_empty(&self) -> bool;

    /// Check if the queue is full.
    pub fn is_full(&self) -> bool;

    /// Get the current number of elements in the queue.
    pub fn len(&self) -> usize;

    /// Get the maximum capacity of the queue.
    pub fn capacity(&self) -> usize;
}

// The queue is Send + Sync because:
// - Only one thread ever calls push (the producer)
// - Only one thread ever calls pop (the consumer)
unsafe impl<T: Send, const N: usize> Send for SpscQueue<T, N> {}
unsafe impl<T: Send, const N: usize> Sync for SpscQueue<T, N> {}

Constraints

  • You may NOT use Mutex, RwLock, or any other blocking synchronization primitive
  • You MAY use AtomicUsize, AtomicPtr, and related types from std::sync::atomic
  • You MAY use std::cell::UnsafeCell for interior mutability
  • You MUST correctly use memory orderings (not just SeqCst everywhere)
  • You MUST prevent false sharing via cache line padding

Solution Architecture

Ring Buffer Structure

The queue is implemented as a ring buffer (circular buffer) with head and tail pointers:

+-------------------------------------------------------------------------+
|                         RING BUFFER STRUCTURE                           |
+-------------------------------------------------------------------------+

Capacity = 8 (indices 0-7)

Initial state (empty):
  HEAD = 0 (next slot to READ from)
  TAIL = 0 (next slot to WRITE to)

      HEAD
      TAIL
        |
        v
  +---+---+---+---+---+---+---+---+
  |   |   |   |   |   |   |   |   |
  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
  +---+---+---+---+---+---+---+---+

After push(A), push(B), push(C):
  HEAD = 0
  TAIL = 3

      HEAD            TAIL
        |               |
        v               v
  +---+---+---+---+---+---+---+---+
  | A | B | C |   |   |   |   |   |
  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
  +---+---+---+---+---+---+---+---+
  ^^^^^^^^^
  Readable data (HEAD to TAIL-1)

After pop() returns A:
  HEAD = 1
  TAIL = 3

          HEAD        TAIL
            |           |
            v           v
  +---+---+---+---+---+---+---+---+
  |   | B | C |   |   |   |   |   |
  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
  +---+---+---+---+---+---+---+---+
      ^^^^^^^
      Readable data

Wrap-around example:
  After many operations, HEAD = 6, TAIL = 2

                                  HEAD
                                    |
                                    v
  +---+---+---+---+---+---+---+---+
  | X | Y |   |   |   |   | W |   |
  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
  +---+---+---+---+---+---+---+---+
          ^               ^^^^^^^
          |               Readable: W
          TAIL
          Readable: X, Y (wrapped around!)

  To read: index = HEAD % CAPACITY = 6 % 8 = 6
  To write: index = TAIL % CAPACITY = 2 % 8 = 2

  (Note: in the implementation below, HEAD and TAIL are monotonically
  increasing counters, e.g. HEAD = 6, TAIL = 10, and only the computed
  index wraps; this diagram shows the already-wrapped positions.)

Head/Tail Pointer Management

The key insight is that in SPSC:

  • Only the producer ever modifies tail
  • Only the consumer ever modifies head

This means no CAS is needed! Simple atomic loads and stores suffice.

+-------------------------------------------------------------------------+
|                    SPSC OPERATION FLOW                                  |
+-------------------------------------------------------------------------+

PUSH OPERATION (Producer thread only):

  1. Load tail (Relaxed - we own this variable)
  2. Load head (Acquire - synchronize with consumer's Release)
  3. If (tail - head) == capacity: FULL, return Err
  4. Write data to buffer[tail % capacity]
  5. Store tail + 1 (Release - publish the new data)

    Producer                          Consumer
       |                                 |
       | tail_local = tail.load(Relaxed) |
       | head_snapshot = head.load(Acquire)
       |        <---(sync)---            |
       | if full: return Err             |
       | buffer[tail_local] = data       |
       | tail.store(tail_local+1, Release)
       |        ---(sync)--->            |
       |                                 |

POP OPERATION (Consumer thread only):

  1. Load head (Relaxed - we own this variable)
  2. Load tail (Acquire - synchronize with producer's Release)
  3. If head == tail: EMPTY, return None
  4. Read data from buffer[head % capacity]
  5. Store head + 1 (Release - publish the consumption)

       |                                 |
       |                                 | head_local = head.load(Relaxed)
       |                                 | tail_snapshot = tail.load(Acquire)
       |        <---(sync)---            |
       |                                 | if empty: return None
       |                                 | data = buffer[head_local]
       |                                 | head.store(head_local+1, Release)
       |        <---(sync)---            |

Cache Line Padding Strategy

To prevent false sharing, we pad head and tail to separate cache lines:

+-------------------------------------------------------------------------+
|                    MEMORY LAYOUT WITH PADDING                           |
+-------------------------------------------------------------------------+

struct SpscQueue<T, N> {
                         CACHE LINE 1 (64 bytes)
    +----------------------------------------------------------+
    | head: AtomicUsize  |  _pad1: [u8; 56]                    |
    | (8 bytes)          |  (56 bytes padding)                 |
    +----------------------------------------------------------+

                         CACHE LINE 2 (64 bytes)
    +----------------------------------------------------------+
    | tail: AtomicUsize  |  _pad2: [u8; 56]                    |
    | (8 bytes)          |  (56 bytes padding)                 |
    +----------------------------------------------------------+

                         CACHE LINES 3+ (N * size_of::<T>())
    +----------------------------------------------------------+
    | buffer: [UnsafeCell<MaybeUninit<T>>; N]                  |
    | (The actual ring buffer data)                             |
    +----------------------------------------------------------+
}

With this layout:
  - Producer writes to CACHE LINE 2 (tail) and CACHE LINES 3+ (buffer)
  - Consumer reads from CACHE LINE 2 and writes to CACHE LINE 1 (head)
  - No false sharing between head and tail!

Memory Ordering Choices

Here’s the reasoning for each ordering:

Operation              | Ordering | Reasoning
-----------------------|----------|------------------------------------------
Producer reads tail    | Relaxed  | Producer “owns” tail; no sync needed
Producer reads head    | Acquire  | Must see consumer’s writes to head
Producer writes buffer | (none)   | Protected by tail/head synchronization
Producer writes tail   | Release  | “Publishes” buffer write to consumer
Consumer reads head    | Relaxed  | Consumer “owns” head; no sync needed
Consumer reads tail    | Acquire  | Must see producer’s writes to tail
Consumer reads buffer  | (none)   | Protected by tail/head synchronization
Consumer writes head   | Release  | “Publishes” consumption to producer

Phased Implementation Guide

Phase 1: Basic SPSC Queue with Acquire/Release

Goal: Get a working single-threaded queue that compiles.

// src/lib.rs
use std::cell::UnsafeCell;
use std::mem::MaybeUninit;
use std::sync::atomic::{AtomicUsize, Ordering};

pub struct SpscQueue<T, const N: usize> {
    head: AtomicUsize,
    tail: AtomicUsize,
    buffer: [UnsafeCell<MaybeUninit<T>>; N],
}

impl<T, const N: usize> SpscQueue<T, N> {
    pub fn new() -> Self {
        assert!(N.is_power_of_two(), "Capacity must be a power of 2");

        // Safe because MaybeUninit doesn't require initialization
        let buffer = unsafe {
            MaybeUninit::<[UnsafeCell<MaybeUninit<T>>; N]>::uninit().assume_init()
        };

        Self {
            head: AtomicUsize::new(0),
            tail: AtomicUsize::new(0),
            buffer,
        }
    }

    pub fn push(&self, value: T) -> Result<(), T> {
        let tail = self.tail.load(Ordering::Relaxed);
        let head = self.head.load(Ordering::Acquire);

        if tail.wrapping_sub(head) == N {
            return Err(value); // Queue is full
        }

        let index = tail & (N - 1); // Efficient modulo for power of 2

        unsafe {
            (*self.buffer[index].get()).write(value);
        }

        self.tail.store(tail.wrapping_add(1), Ordering::Release);
        Ok(())
    }

    pub fn pop(&self) -> Option<T> {
        let head = self.head.load(Ordering::Relaxed);
        let tail = self.tail.load(Ordering::Acquire);

        if head == tail {
            return None; // Queue is empty
        }

        let index = head & (N - 1);

        let value = unsafe {
            (*self.buffer[index].get()).assume_init_read()
        };

        self.head.store(head.wrapping_add(1), Ordering::Release);
        Some(value)
    }
}

unsafe impl<T: Send, const N: usize> Send for SpscQueue<T, N> {}
unsafe impl<T: Send, const N: usize> Sync for SpscQueue<T, N> {}

Checkpoint: Write a basic test that pushes and pops a single value.


Phase 2: Bounded Ring Buffer with Proper Indexing

Goal: Handle wrap-around correctly and add capacity checks.

impl<T, const N: usize> SpscQueue<T, N> {
    pub fn is_empty(&self) -> bool {
        let head = self.head.load(Ordering::Relaxed);
        let tail = self.tail.load(Ordering::Acquire);
        head == tail
    }

    pub fn is_full(&self) -> bool {
        let tail = self.tail.load(Ordering::Relaxed);
        let head = self.head.load(Ordering::Acquire);
        tail.wrapping_sub(head) == N
    }

    pub fn len(&self) -> usize {
        let tail = self.tail.load(Ordering::Acquire);
        let head = self.head.load(Ordering::Acquire);
        tail.wrapping_sub(head)
    }

    pub fn capacity(&self) -> usize {
        N
    }
}

Checkpoint: Test wrap-around by pushing more than capacity items (some will fail), popping, and pushing again.
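
A test along those lines might look like this (a sketch; adjust to your exact API):

#[test]
fn test_wrap_around_after_full() {
    let queue: SpscQueue<i32, 4> = SpscQueue::new();

    // Fill to capacity; the next push must fail.
    for i in 0..4 {
        assert!(queue.push(i).is_ok());
    }
    assert!(queue.push(99).is_err());

    // Drain two, push two more: the indices now wrap past the end of the buffer.
    assert_eq!(queue.pop(), Some(0));
    assert_eq!(queue.pop(), Some(1));
    assert!(queue.push(4).is_ok());
    assert!(queue.push(5).is_ok());

    // FIFO order must survive the wrap-around.
    for expected in 2..6 {
        assert_eq!(queue.pop(), Some(expected));
    }
    assert!(queue.is_empty());
}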


Phase 3: Cache Line Padding to Prevent False Sharing

Goal: Separate head and tail onto different cache lines.

use crossbeam_utils::CachePadded;

pub struct SpscQueue<T, const N: usize> {
    head: CachePadded<AtomicUsize>,
    tail: CachePadded<AtomicUsize>,
    buffer: [UnsafeCell<MaybeUninit<T>>; N],
}

Or implement your own padding:

#[repr(align(64))]
struct Padded<T> {
    value: T,
}

pub struct SpscQueue<T, const N: usize> {
    head: Padded<AtomicUsize>,
    tail: Padded<AtomicUsize>,
    buffer: [UnsafeCell<MaybeUninit<T>>; N],
}

Checkpoint: Benchmark with and without padding. You should see 2x+ improvement.


Phase 4: MPMC Extension with CAS Loops

Goal: Extend to Multi-Producer Multi-Consumer (optional, advanced).

For MPMC, multiple producers might try to write to the same slot. We need CAS:

pub fn mpmc_push(&self, value: T) -> Result<(), T> {
    loop {
        let tail = self.tail.load(Ordering::Relaxed);
        let head = self.head.load(Ordering::Acquire);

        if tail.wrapping_sub(head) == N {
            return Err(value); // Queue is full
        }

        // Try to claim this slot
        match self.tail.compare_exchange_weak(
            tail,
            tail.wrapping_add(1),
            Ordering::Release,
            Ordering::Relaxed,
        ) {
            Ok(_) => {
                // We claimed the slot, write the value
                let index = tail & (N - 1);
                unsafe {
                    (*self.buffer[index].get()).write(value);
                }
                return Ok(());
            }
            Err(_) => continue, // Another producer won, retry
        }
    }
}

Caution: MPMC is significantly more complex. The above is a simplified version; a production MPMC queue needs more sophisticated handling (e.g., sequence numbers per slot).
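
For reference, here is a minimal sketch of the per-slot sequence-number idea (the design popularized by Dmitry Vyukov's bounded MPMC queue), showing only the producer side. The layout and names are illustrative assumptions, not a drop-in replacement for the SPSC queue above:

use std::cell::UnsafeCell;
use std::mem::MaybeUninit;
use std::sync::atomic::{AtomicUsize, Ordering};

struct Slot<T> {
    seq: AtomicUsize,                  // which "turn" this slot is ready for
    value: UnsafeCell<MaybeUninit<T>>, // written only by the producer that claimed the slot
}

pub struct MpmcQueue<T, const N: usize> {
    head: AtomicUsize,
    tail: AtomicUsize,
    buffer: Box<[Slot<T>]>,
}

impl<T, const N: usize> MpmcQueue<T, N> {
    pub fn new() -> Self {
        assert!(N.is_power_of_two());
        let buffer = (0..N)
            .map(|i| Slot {
                seq: AtomicUsize::new(i), // slot i is first usable at position i
                value: UnsafeCell::new(MaybeUninit::uninit()),
            })
            .collect();
        Self {
            head: AtomicUsize::new(0),
            tail: AtomicUsize::new(0),
            buffer,
        }
    }

    pub fn push(&self, value: T) -> Result<(), T> {
        let mut pos = self.tail.load(Ordering::Relaxed);
        loop {
            let slot = &self.buffer[pos & (N - 1)];
            let seq = slot.seq.load(Ordering::Acquire);
            let diff = seq.wrapping_sub(pos) as isize;

            if diff == 0 {
                // Slot is ready for this position: race other producers to claim it.
                match self.tail.compare_exchange_weak(
                    pos,
                    pos.wrapping_add(1),
                    Ordering::Relaxed,
                    Ordering::Relaxed,
                ) {
                    Ok(_) => {
                        // We own the slot; write, then publish via the slot's sequence.
                        unsafe { (*slot.value.get()).write(value) };
                        slot.seq.store(pos.wrapping_add(1), Ordering::Release);
                        return Ok(());
                    }
                    Err(actual) => pos = actual, // lost the race, retry at the new tail
                }
            } else if diff < 0 {
                return Err(value); // slot still holds an unconsumed element: queue full
            } else {
                pos = self.tail.load(Ordering::Relaxed); // another producer got ahead
            }
        }
    }

    // pop() mirrors push(): a consumer claims `head`, waits for seq == pos + 1,
    // reads the value, then stores seq = pos + N to hand the slot back to producers.
}

unsafe impl<T: Send, const N: usize> Send for MpmcQueue<T, N> {}
unsafe impl<T: Send, const N: usize> Sync for MpmcQueue<T, N> {}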


Phase 5: Benchmarking and Verification

Goal: Measure performance and verify correctness.

Create a benchmark in examples/benchmark.rs:

use std::sync::Arc;
use std::thread;
use std::time::Instant;
use lock_free_queue::SpscQueue;

fn main() {
    const MESSAGES: usize = 10_000_000;

    // Spawned threads need 'static access, so share the queue through an Arc.
    let queue: Arc<SpscQueue<usize, 1024>> = Arc::new(SpscQueue::new());

    let start = Instant::now();

    let producer_queue = Arc::clone(&queue);
    let producer = thread::spawn(move || {
        for i in 0..MESSAGES {
            while producer_queue.push(i).is_err() {
                std::hint::spin_loop();
            }
        }
    });

    let consumer_queue = Arc::clone(&queue);
    let consumer = thread::spawn(move || {
        let mut received = 0;
        while received < MESSAGES {
            if consumer_queue.pop().is_some() {
                received += 1;
            } else {
                std::hint::spin_loop();
            }
        }
    });

    producer.join().unwrap();
    consumer.join().unwrap();
    let elapsed = start.elapsed();

    let ops_per_sec = MESSAGES as f64 / elapsed.as_secs_f64();
    println!("Throughput: {:.2}M ops/sec", ops_per_sec / 1_000_000.0);
}

Testing Strategy

Unit Tests

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_single_push_pop() {
        let queue: SpscQueue<i32, 4> = SpscQueue::new();
        assert!(queue.push(42).is_ok());
        assert_eq!(queue.pop(), Some(42));
    }

    #[test]
    fn test_empty_pop_returns_none() {
        let queue: SpscQueue<i32, 4> = SpscQueue::new();
        assert_eq!(queue.pop(), None);
    }

    #[test]
    fn test_full_push_returns_err() {
        let queue: SpscQueue<i32, 2> = SpscQueue::new();
        assert!(queue.push(1).is_ok());
        assert!(queue.push(2).is_ok());
        assert!(queue.push(3).is_err());
    }

    #[test]
    fn test_fifo_ordering() {
        let queue: SpscQueue<i32, 8> = SpscQueue::new();
        for i in 0..5 {
            queue.push(i).unwrap();
        }
        for i in 0..5 {
            assert_eq!(queue.pop(), Some(i));
        }
    }

    #[test]
    fn test_wrap_around() {
        let queue: SpscQueue<i32, 4> = SpscQueue::new();

        // Fill and empty multiple times
        for _ in 0..10 {
            for i in 0..4 {
                queue.push(i).unwrap();
            }
            for i in 0..4 {
                assert_eq!(queue.pop(), Some(i));
            }
        }
    }
}

Stress Tests

#[test]
fn test_concurrent_stress() {
    use std::thread;

    const ITERATIONS: usize = 100_000;

    let queue: SpscQueue<usize, 1024> = SpscQueue::new();
    let queue = std::sync::Arc::new(queue);

    let producer_queue = queue.clone();
    let producer = thread::spawn(move || {
        for i in 0..ITERATIONS {
            while producer_queue.push(i).is_err() {
                std::hint::spin_loop();
            }
        }
    });

    let consumer_queue = queue.clone();
    let consumer = thread::spawn(move || {
        let mut sum = 0usize;
        let mut count = 0;
        while count < ITERATIONS {
            if let Some(v) = consumer_queue.pop() {
                sum = sum.wrapping_add(v);
                count += 1;
            }
        }
        sum
    });

    producer.join().unwrap();
    let sum = consumer.join().unwrap();

    // Verify we received all items (sum should match expected)
    let expected_sum: usize = (0..ITERATIONS).sum();
    assert_eq!(sum, expected_sum);
}

Loom Testing for Concurrency

Loom exhaustively tests all possible thread interleavings:

# Cargo.toml
[dev-dependencies]
loom = "0.7"

// tests (compiled only when the `loom` cfg is set)
#[cfg(loom)]
mod loom_tests {
    use loom::sync::Arc;
    use loom::thread;

    #[test]
    fn loom_spsc_basic() {
        loom::model(|| {
            // The queue must be built on loom's atomics: under #[cfg(loom)],
            // swap std::sync::atomic for loom::sync::atomic inside SpscQueue.
            let queue = Arc::new(/* loom-compatible SpscQueue */);

            let producer_queue = Arc::clone(&queue);
            let producer = thread::spawn(move || {
                producer_queue.push(1).unwrap();
            });

            let consumer_queue = Arc::clone(&queue);
            let consumer = thread::spawn(move || loop {
                if let Some(v) = consumer_queue.pop() {
                    assert_eq!(v, 1);
                    break;
                }
                thread::yield_now(); // give loom a point to explore other interleavings
            });

            producer.join().unwrap();
            consumer.join().unwrap();
        });
    }
}

Run Loom tests:

RUSTFLAGS="--cfg loom" cargo test --release

Common Pitfalls

Pitfall 1: Using Wrong Memory Orderings

Symptom: Works on x86, fails on ARM. Data corruption under load.

Wrong:

self.tail.store(new_tail, Ordering::Relaxed);  // Bug!

Right:

self.tail.store(new_tail, Ordering::Release);  // Publishes buffer write

Diagnosis: Use Loom to find ordering bugs. Test on ARM (e.g., Apple Silicon, Raspberry Pi).

Pitfall 2: The ABA Problem in MPMC

Symptom: Mysterious corruption when memory is reused.

Wrong: Using raw pointer comparison in CAS.

Right: Use tagged pointers (pointer + counter) or epoch-based reclamation.

Pitfall 3: False Sharing Destroying Performance

Symptom: Multi-threaded version is slower than single-threaded.

Wrong:

struct Queue {
    head: AtomicUsize,  // Same cache line as tail!
    tail: AtomicUsize,
}

Right:

struct Queue {
    head: CachePadded<AtomicUsize>,
    tail: CachePadded<AtomicUsize>,
}

Diagnosis: Use perf to measure cache misses.

Pitfall 4: Forgetting #[repr(align)] or Using Wrong Size

Symptom: Padding doesn’t help performance.

Wrong: Assuming every CPU uses the same line size: 32 bytes appears on some older CPUs, 64 bytes on most x86-64 parts, and 128 bytes on some ARM and server parts.

Right: Default to 64 bytes, but select the size per target architecture (or use a crate such as crossbeam-utils that already does this).
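
A compile-time sketch of that idea: pick the alignment per target architecture instead of hard-coding one number. The sizes below are illustrative; crates like crossbeam-utils maintain per-architecture values for you.

// Sketch: per-target cache-line alignment. Some ARM cores (e.g., Apple Silicon)
// use 128-byte lines, so 64 is not always enough. Values here are illustrative.
#[cfg_attr(target_arch = "aarch64", repr(align(128)))]
#[cfg_attr(not(target_arch = "aarch64"), repr(align(64)))]
pub struct CacheAligned<T>(pub T);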

Pitfall 5: Not Using MaybeUninit

Symptom: Undefined behavior when reading uninitialized memory.

Wrong:

buffer: [UnsafeCell<T>; N],  // T might not be Default!

Right:

buffer: [UnsafeCell<MaybeUninit<T>>; N],

Pitfall 6: Overflow in Index Calculation

Symptom: Crash after billions of operations.

Wrong:

let index = tail % N;  // Gives the wrong slot after tail wraps around unless N is a power of 2

Right:

let index = tail & (N - 1);  // Masking stays correct across wrap-around (N is a power of 2)
// AND: use wrapping_add for head/tail increments so they never panic on overflow

Extensions and Challenges

Extension 1: Wait-Free Pop

Make pop wait-free by having the producer help the consumer:

// Producer leaves a "helping" record when it sees a slow consumer

Extension 2: Variable-Size Elements

Support elements of different sizes using a length prefix:

+--------+------------------+--------+------------------+
| len: 4 | data: "abcd"     | len: 2 | data: "xy"       |
+--------+------------------+--------+------------------+

Extension 3: Batch Operations

Push/pop multiple elements atomically for better throughput:

fn push_batch(&self, values: &[T]) -> usize; // Returns count pushed
fn pop_batch(&self, out: &mut [T]) -> usize; // Returns count popped
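
A producer-side sketch of push_batch built on the Phase 1 layout (same module, same imports): write every claimed slot first, then publish the whole batch with a single Release store. The T: Copy bound is an assumption to keep the sketch short; a real API could take an iterator or drain a Vec instead.

impl<T: Copy, const N: usize> SpscQueue<T, N> {
    /// Push as many items as currently fit; returns the number actually pushed.
    pub fn push_batch(&self, values: &[T]) -> usize {
        let tail = self.tail.load(Ordering::Relaxed);
        let head = self.head.load(Ordering::Acquire);
        let free = N - tail.wrapping_sub(head);
        let count = values.len().min(free);

        for (i, &v) in values[..count].iter().enumerate() {
            let index = tail.wrapping_add(i) & (N - 1);
            unsafe { (*self.buffer[index].get()).write(v) };
        }

        // One Release store publishes the entire batch to the consumer.
        self.tail.store(tail.wrapping_add(count), Ordering::Release);
        count
    }
}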

Extension 4: Block-Based Ring Buffer

For very high throughput, operate on blocks rather than elements:

+------------------+------------------+------------------+
| Block 0 (4KB)    | Block 1 (4KB)    | Block 2 (4KB)    |
| 64 messages      | 64 messages      | 64 messages      |
+------------------+------------------+------------------+

Extension 5: NUMA-Aware Allocation

For multi-socket servers, allocate the buffer on the NUMA node where the consumer runs:

// Use libnuma or platform-specific APIs

The Interview Questions They Will Ask

  1. “What is a ‘lock-free’ data structure?”

    A data structure where at least one thread is guaranteed to make progress in a finite number of steps, regardless of what other threads are doing. There are no locks that can cause a thread to wait indefinitely. Lock-free structures use atomic operations like CAS to coordinate access.

  2. “Explain the difference between Acquire and Release memory ordering.”

    Release ordering on a store “publishes” all writes that happened before it. Acquire ordering on a load “receives” all writes that happened before the matching Release. Together, they create a “synchronizes-with” relationship that establishes a happens-before ordering between threads. Release prevents reordering of earlier writes past the store; Acquire prevents reordering of later reads before the load.

  3. “What is ‘False Sharing’ and how do you prevent it in Rust?”

    False sharing occurs when threads on different cores modify variables that happen to share the same cache line (typically 64 bytes). Each modification invalidates the entire cache line for all other cores, causing excessive cache coherence traffic. Prevent it by padding structures to align hot variables to cache line boundaries using #[repr(align(64))] or crates like crossbeam-utils::CachePadded.

  4. “Why is SeqCst the default, and why is it often slower?”

    SeqCst provides the strongest guarantee: a single total ordering visible to all threads. It’s the default because it’s easiest to reason about. However, it requires expensive memory barriers on most architectures. For many algorithms (like SPSC queues), weaker orderings (Acquire/Release) are sufficient and can be 2-10x faster.

  5. “Explain the ABA problem and how to solve it.”

    The ABA problem occurs when a CAS succeeds because a value changed from A to B and back to A. The CAS sees A and thinks nothing changed, but the context may have changed (e.g., the memory was freed and reallocated). Solutions include: (1) Tagged pointers that combine a pointer with a counter, (2) Hazard pointers that prevent reuse of memory being accessed, (3) Epoch-based reclamation that delays memory reuse.

  6. “What’s the difference between compare_exchange and compare_exchange_weak?”

    compare_exchange_weak may spuriously fail even when the comparison would succeed. On load-linked/store-conditional architectures (like ARM), the weak version maps to a single LL/SC attempt, while the strong version must loop internally to hide spurious failures, which makes weak cheaper. Use weak inside a retry loop you are writing anyway; use the strong version when you do not want to wrap the call in your own loop.

  7. “Why do we use wrapping_add for the head and tail indices?”

    To handle the case when indices overflow usize::MAX. With wrapping arithmetic, indices wrap around to 0, and the difference tail - head still gives the correct queue length (assuming the queue size is much smaller than usize::MAX). This avoids the need to reset indices to 0, which would require additional synchronization.
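
A tiny illustration of the wrapping arithmetic from question 7 (values chosen for the example):

fn main() {
    // Raw position counters just before and after wrapping usize::MAX.
    let head: usize = usize::MAX - 1;                   // consumer position
    let tail: usize = (usize::MAX - 1).wrapping_add(3); // producer pushed 3 more; tail wrapped to 1

    // The wrapping difference still reports the true number of queued elements.
    assert_eq!(tail.wrapping_sub(head), 3);
    println!("len = {}", tail.wrapping_sub(head));
}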


Books That Will Help

Topic                     | Book                                             | Chapter
--------------------------|--------------------------------------------------|----------------------------------
Atomic Fundamentals       | “Rust Atomics and Locks” by Mara Bos             | Ch. 1: Basics of Rust Concurrency
Memory Ordering Deep Dive | “Rust Atomics and Locks” by Mara Bos             | Ch. 3: Memory Ordering
Lock-Free Data Structures | “Rust Atomics and Locks” by Mara Bos             | Ch. 9: Lock-Free Linked Lists
Concurrency in Rust       | “Programming Rust” by Jim Blandy                 | Ch. 19: Concurrency
Cache Architecture        | “Computer Systems: A Programmer’s Perspective”   | Ch. 6: The Memory Hierarchy
Hardware Memory Models    | “A Primer on Memory Consistency” by Sorin et al. | Chapters 1-4

Summary

You have now built one of the most challenging data structures in computer science: a lock-free queue. Along the way, you have mastered:

  • The difference between lock-free, wait-free, and blocking algorithms
  • Rust’s atomic memory orderings and when to use each
  • The Compare-and-Swap operation and CAS loops
  • The ABA problem and its solutions
  • Cache line padding to prevent false sharing
  • Ring buffer design with efficient head/tail management
  • Verification techniques including Loom

Your queue achieves 18+ million operations per second, dramatically outperforming standard library channels. This is the kind of code that powers game engines, trading systems, and operating system kernels.

You are now equipped to understand and implement the concurrent data structures that underpin modern high-performance systems. The concepts you’ve learned apply far beyond queues: they’re the foundation for lock-free stacks, skip lists, hash maps, and more.

Welcome to the world of lock-free programming. The CPU is now yours.