Project 6: Atomic Lock-Free Queue

Build a high-performance Single-Producer Single-Consumer (SPSC) lock-free queue using only atomic operations, achieving 18+ million operations per second, then extend it toward Multi-Producer Multi-Consumer (MPMC) as a challenge.

Quick Reference

Attribute     | Value
--------------|-----------------------------------------------------------------------------
Difficulty    | Level 5: Master
Time Estimate | 2-3 weeks
Language      | Rust
Prerequisites | Rust ownership, basic threading (std::thread), memory layout, pointers
Key Topics    | Lock-free programming, atomics, memory ordering, CAS loops, cache coherence

1. Learning Objectives

By the end of this project, you will be able to:

  1. Explain lock-free vs. lock-based data structures and articulate the trade-offs of each approach
  2. Master Rust’s atomic memory orderings (Relaxed, Acquire, Release, AcqRel, SeqCst) and apply the correct ordering to each operation
  3. Implement Compare-and-Swap (CAS) loops correctly and understand their role in lock-free algorithms
  4. Identify and solve the ABA problem using tagged pointers or epoch-based reclamation
  5. Prevent false sharing through cache line padding and alignment
  6. Design a ring buffer with efficient head/tail pointer management
  7. Benchmark and verify lock-free data structures using stress tests and model-checking tools like Loom
  8. Understand progress guarantees - lock-free, wait-free, and obstruction-free properties
  9. Reason about memory models - hardware vs. software memory models and their implications
  10. Apply lock-free patterns to real-world systems like audio processing, game engines, and trading systems

2. Theoretical Foundation

2.1 Core Concepts

What Is Lock-Free Programming and Why Does It Matter?

Traditional concurrent programming uses locks (mutexes, semaphores) to protect shared data. When Thread A holds a lock, Thread B must wait. This waiting is the fundamental problem with locks:

+-------------------------------------------------------------------------+
|                    LOCK-BASED QUEUE: THE WAITING GAME                   |
+-------------------------------------------------------------------------+

Timeline:        0ms   10ms   20ms   30ms   40ms   50ms   60ms   70ms
                  |     |      |      |      |      |      |      |
Thread A:        [===LOCK===][WRITE][===UNLOCK===]
                                                    ^
Thread B:        [TRY LOCK]..WAITING..WAITING......[===LOCK===][READ]
                        ^                          |
                        |                          Finally gets lock!
                        Blocked here for 50ms

Problem: Thread B does NOTHING for ~50ms. CPU cycles wasted.
         If Thread A dies while holding the lock -> everyone waits forever.
         If a low-priority holder is preempted while a high-priority
         thread waits on the lock -> Priority inversion.

Lock-free programming eliminates this waiting. In a lock-free data structure:

At least one thread is guaranteed to make progress in a finite number of steps, regardless of what other threads are doing.

This means:

  • No deadlocks (no locks to hold)
  • No priority inversion (no waiting on lower-priority threads)
  • Better worst-case latency (no unbounded waits)
  • Often better throughput under contention

+-------------------------------------------------------------------------+
|                   LOCK-FREE QUEUE: CONCURRENT PROGRESS                  |
+-------------------------------------------------------------------------+

Timeline:        0ms   10ms   20ms   30ms   40ms   50ms   60ms   70ms
                  |     |      |      |      |      |      |      |
Thread A:        [CAS][WRITE][CAS][WRITE][CAS][WRITE]...
                  |     |      |      |      |      |
Thread B:        [CAS][READ][CAS][FAIL-RETRY][CAS][READ]...
                                   ^
                                   Retry, don't wait!

Key insight: The SYSTEM always makes progress: a failed CAS means
             another thread's operation just succeeded. The loser
             retries immediately instead of blocking.

Memory Ordering: The Heart of Lock-Free Programming

Modern CPUs (and compilers) don’t execute instructions in the order you write them; both reorder aggressively for performance. The memory ordering model defines what guarantees you get about when writes become visible to other threads.

Rust provides five memory orderings. Listed roughly from weakest to strongest (note that Acquire and Release sit at the same strength level: Acquire applies to loads, Release to stores):

+-------------------------------------------------------------------------+
|                        MEMORY ORDERING SPECTRUM                         |
+-------------------------------------------------------------------------+

Weakest (Fastest)                                    Strongest (Slowest)
     |                                                        |
     v                                                        v
 +---------+    +----------+    +----------+    +--------+    +--------+
 | Relaxed | -> | Acquire  | -> | Release  | -> | AcqRel | -> | SeqCst |
 +---------+    +----------+    +----------+    +--------+    +--------+
     |              |               |               |              |
 No ordering   Later ops      Earlier ops      Both          Global
 guarantees    cannot move    cannot move      Acquire       total
 between       BEFORE         AFTER            AND           order
 threads       this load      this store       Release       visible
                                                             to all
                                                             threads

Acquire/Release: The Workhorse Pattern

+-------------------------------------------------------------------------+
|                    ACQUIRE/RELEASE SYNCHRONIZATION                      |
+-------------------------------------------------------------------------+

Thread A (Producer)                    Thread B (Consumer)
        |                                      |
        |  data = 42;                          |
        |  // All writes BEFORE                |
        |  // this Release are                 |
        |  // visible to any thread            |
        |  // that does an Acquire             |
        |  // on the same location             |
        |       |                              |
        v       v                              |
  flag.store(true, Release)                    |
        |                                      |
        |   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~>      |
        |   (synchronization via flag)         |
        |                                      v
        |                        flag.load(Acquire)
        |                        // After this Acquire,
        |                        // all writes from before
        |                        // the Release are visible
        |                              |
        |                              v
        |                         assert_eq!(data, 42); // SAFE!
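
The diagram translates directly into safe Rust if the shared data is itself an atomic (a sketch; DATA, FLAG, producer, and consumer are illustrative names, not part of the project API):

use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};

static DATA: AtomicU64 = AtomicU64::new(0);
static FLAG: AtomicBool = AtomicBool::new(false);

fn producer() {
    // Ordered before the Release store below, hence visible with it.
    DATA.store(42, Ordering::Relaxed);
    FLAG.store(true, Ordering::Release); // publish
}

fn consumer() {
    // Spin until the Acquire load observes the Release store...
    while !FLAG.load(Ordering::Acquire) {
        std::hint::spin_loop();
    }
    // ...after which everything before that Release is visible.
    assert_eq!(DATA.load(Ordering::Relaxed), 42);
}

fn main() {
    std::thread::scope(|s| {
        s.spawn(consumer);
        s.spawn(producer);
    });
}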

Compare-and-Swap (CAS): The Lock-Free Primitive

CAS is the fundamental operation that makes lock-free programming possible. It atomically:

  1. Reads the current value
  2. Compares it to an expected value
  3. If equal, writes a new value and returns success
  4. If not equal, returns failure (and the actual current value)

+-------------------------------------------------------------------------+
|                    COMPARE-AND-SWAP OPERATION                           |
+-------------------------------------------------------------------------+

                    CAS(address, expected, new)
                            |
                            v
              +----------------------------+
              |    Read current value      |
              |    from address            |
              +-------------+--------------+
                            |
                            v
              +----------------------------+
              |  current == expected?      |
              +------+-------------+-------+
                     |             |
                  YES|             |NO
                     v             v
              +-------------+ +-------------------+
              | Write new   | | Don't write       |
              | Return Ok   | | Return Err(current)|
              +-------------+ +-------------------+

This ENTIRE operation is ATOMIC. No other thread can see
an intermediate state.
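
Rust exposes CAS as compare_exchange (and compare_exchange_weak, which may fail spuriously but can be cheaper inside a loop). A sketch of the canonical retry loop, using a hypothetical atomic-doubling helper:

use std::sync::atomic::{AtomicU64, Ordering};

fn double_it(counter: &AtomicU64) -> u64 {
    let mut current = counter.load(Ordering::Relaxed);
    loop {
        let new = current * 2;
        match counter.compare_exchange_weak(
            current,
            new,
            Ordering::AcqRel,  // on success: acquire + publish
            Ordering::Relaxed, // on failure: we only need a fresh value
        ) {
            Ok(_) => return new,
            // Another thread won the race; Err carries the actual
            // current value, so retry from there.
            Err(actual) => current = actual,
        }
    }
}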

The ABA Problem: The Subtle Killer

The ABA problem is the most insidious bug in lock-free programming. It occurs when:

  1. Thread A reads value A from a location
  2. Thread A is preempted
  3. Thread B changes A -> B -> A (back to A)
  4. Thread A resumes, sees A, and thinks nothing changed!

+-------------------------------------------------------------------------+
|                         THE ABA PROBLEM                                 |
+-------------------------------------------------------------------------+

Initial state: HEAD -> [Node A] -> [Node B] -> [Node C] -> NULL

Thread A (POP operation):
  1. Read HEAD = A
  2. Prepare: new_head = A.next = B
  3. *** PREEMPTED ***

                    Thread B runs:
                    4. Pop A (HEAD = B)
                    5. Pop B (HEAD = C)
                    6. Free A, Free B
                    7. Allocate new node (reuses A's memory!)
                    8. Push new A (HEAD = A)

                    New state: HEAD -> [Node A*] -> [Node C] -> NULL
                                        (* new content, same address!)

Thread A resumes:
  9. CAS(HEAD, A, B)
  10. Succeeds! (HEAD still points to address A)
  11. HEAD = B

  PROBLEM: B was freed! HEAD now points to freed memory!

Solutions:

  1. Tagged Pointers: Add a counter alongside the pointer; CAS checks both
  2. Hazard Pointers: Threads publish which pointers they’re using
  3. Epoch-Based Reclamation: Memory freed only when all threads pass an epoch
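
To make solution 1 concrete: pack a version tag next to the index so a recycled slot no longer compares equal. A minimal sketch (hypothetical TaggedHead helper; the SPSC queue in this project needs no CAS at all):

use std::sync::atomic::{AtomicU64, Ordering};

struct TaggedHead {
    // Low 32 bits: slot index. High 32 bits: version tag.
    state: AtomicU64,
}

impl TaggedHead {
    fn pack(index: u32, tag: u32) -> u64 {
        ((tag as u64) << 32) | index as u64
    }

    /// Advance head from expected_index to new_index; fails if the
    /// index OR the tag changed, which defeats A -> B -> A recycling.
    fn try_advance(&self, expected_index: u32, new_index: u32) -> bool {
        let current = self.state.load(Ordering::Acquire);
        let (index, tag) = (current as u32, (current >> 32) as u32);
        if index != expected_index {
            return false;
        }
        let next = Self::pack(new_index, tag.wrapping_add(1));
        self.state
            .compare_exchange(current, next, Ordering::AcqRel, Ordering::Acquire)
            .is_ok()
    }
}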

Cache Coherence and False Sharing

Modern CPUs transfer memory between cores in units of cache lines (typically 64 bytes). When two threads modify different variables that share a cache line, they fight over that line:

+-------------------------------------------------------------------------+
|                         FALSE SHARING                                   |
+-------------------------------------------------------------------------+

                       CACHE LINE (64 bytes)
  +-------+-------+-------+-------+-------+-------+-------+-------+
  | head  | tail  |       |       |       |       |       |       |
  | (8B)  | (8B)  | ...   | ...   | ...   | ...   | ...   | ...   |
  +---+---+---+---+-------+-------+-------+-------+-------+-------+
      |       |
      |       +-- Thread B writes tail
      |           -> Invalidates ENTIRE cache line
      |           -> Thread A must reload head!
      |
      +-- Thread A writes head
          -> Invalidates ENTIRE cache line
          -> Thread B must reload tail!

Both threads constantly invalidate each other's cache.
Performance tanks even though they write DIFFERENT variables!

Solution: Pad data structures so hot variables live on separate cache lines.
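
The same #[repr(align(64))] trick the queue uses later (section 4.3) can be sanity-checked in isolation, since align(64) also rounds the struct's size up to a full line (a standalone sketch):

use std::sync::atomic::AtomicUsize;

#[repr(align(64))]
struct CachePadded<T>(T);

fn main() {
    // A padded atomic occupies an entire 64-byte line, so two padded
    // fields can never share one.
    assert_eq!(std::mem::align_of::<CachePadded<AtomicUsize>>(), 64);
    assert_eq!(std::mem::size_of::<CachePadded<AtomicUsize>>(), 64);
}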

2.2 Why This Matters

Lock-free data structures power the world’s most demanding systems:

Domain            | Example                    | Why Lock-Free
------------------|----------------------------|---------------------------------------------------
Audio Processing  | Digital Audio Workstations | Buffer underruns cause audible clicks
Game Engines      | Unity, Unreal job systems  | Frame deadlines are absolute (16.67ms for 60 FPS)
Financial Trading | High-Frequency Trading     | Microsecond advantages = millions of dollars
Databases         | RocksDB, ScyllaDB          | Transaction log must never block
Operating Systems | Linux/Windows kernel       | Interrupt handlers cannot sleep
Message Passing   | Tokio, Crossbeam, Rayon    | Channels between async tasks

2.3 Historical Context

Lock-free programming has evolved significantly:

  • 1991: Maurice Herlihy publishes “Wait-Free Synchronization”, establishing the consensus hierarchy and universal constructions
  • 1996: Michael and Scott publish their lock-free queue, still the gold standard
  • 2004: Hazard pointers (Michael) and epoch-based reclamation (Fraser) make safe memory reclamation practical
  • 2015: Rust 1.0 ships; its ownership model catches whole classes of concurrency bugs at compile time

2.4 Common Misconceptions

Misconception                     | Reality
----------------------------------|--------------------------------------------------------------------
“Lock-free is always faster”      | Under low contention, locks can be faster
“SeqCst is always correct”        | It’s correct but often unnecessarily slow
“Works on x86 = works everywhere” | ARM has a weaker memory model; bugs manifest there
“No locks = no bugs”              | ABA and memory-ordering bugs are subtle and dangerous
“Lock-free = wait-free”           | Lock-free guarantees system progress, not individual thread progress

3. Project Specification

3.1 What You Will Build

A bounded, lock-free SPSC (Single-Producer Single-Consumer) queue with these characteristics:

  1. Capacity: Fixed at construction time (power of 2 for efficient modulo)
  2. Blocking: Never. push and pop return immediately
  3. Ordering: FIFO (First-In First-Out)
  4. Thread Safety: Single producer, single consumer
  5. Memory: No heap allocation after construction
  6. Performance: 18+ million operations per second

3.2 Functional Requirements

pub struct SpscQueue<T, const N: usize> {
    // Implementation details
}

impl<T, const N: usize> SpscQueue<T, N> {
    /// Create a new queue with capacity N.
    /// N must be a power of 2.
    pub fn new() -> Self;

    /// Attempt to push a value into the queue.
    /// Returns Err(value) if the queue is full.
    pub fn push(&self, value: T) -> Result<(), T>;

    /// Attempt to pop a value from the queue.
    /// Returns None if the queue is empty.
    pub fn pop(&self) -> Option<T>;

    /// Check if the queue is empty.
    pub fn is_empty(&self) -> bool;

    /// Check if the queue is full.
    pub fn is_full(&self) -> bool;

    /// Get the current number of elements.
    pub fn len(&self) -> usize;

    /// Get the maximum capacity.
    pub fn capacity(&self) -> usize;
}

unsafe impl<T: Send, const N: usize> Send for SpscQueue<T, N> {}
unsafe impl<T: Send, const N: usize> Sync for SpscQueue<T, N> {}

3.3 Non-Functional Requirements

  • No Mutex, RwLock, or blocking primitives
  • May use AtomicUsize, AtomicPtr from std::sync::atomic
  • May use std::cell::UnsafeCell for interior mutability
  • Must use correct memory orderings (not just SeqCst everywhere)
  • Must prevent false sharing via cache line padding

3.4 Example Usage / Output

$ cargo run --release --example benchmark

+======================================================================+
|              LOCK-FREE QUEUE PERFORMANCE BENCHMARK                   |
+======================================================================+

System Information:
  CPU: Apple M2 Pro (12 cores)
  RAM: 32 GB
  OS: macOS 14.0

Test Configuration:
  Messages: 10,000,000
  Message Size: 64 bytes
  Iterations: 5

+----------------------------------------------------------------------+
|                         BENCHMARK RESULTS                            |
+----------------------------------------------------------------------+

Queue Type          | Throughput    | Latency (p50) | Latency (p99)
--------------------|---------------|---------------|---------------
std::sync::mpsc     | 4.83M ops/s   | 207 ns        | 892 ns
crossbeam-channel   | 12.39M ops/s  | 81 ns         | 234 ns
YOUR Lock-Free      | 19.02M ops/s  | 53 ns         | 127 ns

SPEEDUP vs std::mpsc:    3.94x
SPEEDUP vs crossbeam:    1.53x

+----------------------------------------------------------------------+
|                     MEMORY ORDERING VERIFICATION                     |
+----------------------------------------------------------------------+

Running correctness tests with Loom...

test spsc_basic ... ok (explored 1,247 interleavings)
test spsc_wrap_around ... ok (explored 3,891 interleavings)
test spsc_full_queue ... ok (explored 2,156 interleavings)

All orderings verified correct!

3.5 Real World Outcome

Your queue will:

  • Achieve the 18+ million ops/sec target (19.02M in the sample run above)
  • Deliver a ~53 ns median latency
  • Make zero heap allocations on the hot path
  • Have its memory orderings verified with Loom
  • Demonstrate optimal cache behavior through padding

4. Solution Architecture

4.1 High-Level Design

+-------------------------------------------------------------------------+
|                         RING BUFFER STRUCTURE                           |
+-------------------------------------------------------------------------+

Capacity = 8 (indices 0-7)

Initial state (empty):
  HEAD = 0 (next slot to READ from)
  TAIL = 0 (next slot to WRITE to)

      HEAD
      TAIL
        |
        v
  +---+---+---+---+---+---+---+---+
  |   |   |   |   |   |   |   |   |
  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
  +---+---+---+---+---+---+---+---+

After push(A), push(B), push(C):
  HEAD = 0
  TAIL = 3

      HEAD            TAIL
        |               |
        v               v
  +---+---+---+---+---+---+---+---+
  | A | B | C |   |   |   |   |   |
  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
  +---+---+---+---+---+---+---+---+

4.2 Key Components

  1. Ring Buffer: Pre-allocated array with wrap-around indexing
  2. Head Pointer: Owned by consumer, points to next read slot
  3. Tail Pointer: Owned by producer, points to next write slot
  4. Cache Line Padding: Separates head and tail to prevent false sharing

4.3 Data Structures

use std::cell::UnsafeCell;
use std::mem::MaybeUninit;
use std::sync::atomic::AtomicUsize;

#[repr(align(64))]
struct CachePadded<T> {
    value: T,
}

pub struct SpscQueue<T, const N: usize> {
    head: CachePadded<AtomicUsize>,  // Cache line 1
    tail: CachePadded<AtomicUsize>,  // Cache line 2
    buffer: [UnsafeCell<MaybeUninit<T>>; N],  // Data storage
}

4.4 Algorithm Overview

PUSH Operation (Producer only):

  1. Load tail (Relaxed - we own this variable)
  2. Load head (Acquire - synchronize with consumer)
  3. If (tail - head) == capacity: FULL, return Err
  4. Write data to buffer[tail % capacity]
  5. Store tail + 1 (Release - publish the new data)

POP Operation (Consumer only):

  1. Load head (Relaxed - we own this variable)
  2. Load tail (Acquire - synchronize with producer)
  3. If head == tail: EMPTY, return None
  4. Read data from buffer[head % capacity]
  5. Store head + 1 (Release - publish the consumption)

5. Implementation Guide

5.1 Development Environment Setup

cargo new lock-free-queue
cd lock-free-queue

# Add dependencies to Cargo.toml
cargo add crossbeam-utils   # For cache line padding
cargo add criterion --dev   # For benchmarking
cargo add loom --dev        # For concurrency testing

5.2 Project Structure

lock-free-queue/
├── Cargo.toml
├── src/
│   ├── lib.rs           # Main queue implementation
│   └── cache_padded.rs  # Cache line padding utilities
├── examples/
│   └── benchmark.rs     # Performance benchmarks
├── tests/
│   ├── unit_tests.rs    # Basic functionality tests
│   └── loom_tests.rs    # Concurrency verification
└── benches/
    └── throughput.rs    # Criterion benchmarks

5.3 The Core Question You’re Answering

How can multiple threads safely share and modify data without locks?

The answer lies in:

  1. Atomic operations that can’t be interrupted
  2. Memory ordering that ensures visibility guarantees
  3. Careful ownership of which thread modifies which data

5.4 Concepts You Must Understand First

Before starting, verify you can answer:

  • What is an atomic operation and why is it different from a regular operation?
  • Why do CPUs reorder instructions?
  • What does “cache line” mean and why does it matter?
  • How does Arc<T> differ from a raw pointer?
  • What is UnsafeCell and when would you use it?
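
For the last question, a minimal refresher on what UnsafeCell buys you (a sketch; Slot is an illustrative type, not project code):

use std::cell::UnsafeCell;

struct Slot {
    value: UnsafeCell<u64>,
}

// UnsafeCell makes a type !Sync by default; opting back in promises
// that WE prevent data races (here: single-threaded use only).
unsafe impl Sync for Slot {}

fn main() {
    let slot = Slot { value: UnsafeCell::new(0) };
    // get() yields a *mut u64 even from a shared reference - mutation
    // through &self, which is exactly what the queue's buffer needs.
    unsafe { *slot.value.get() = 7 };
    assert_eq!(unsafe { *slot.value.get() }, 7);
}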

5.5 Questions to Guide Your Design

Memory Ordering:

  • Which operations need Acquire? Which need Release?
  • Why can the producer read tail with Relaxed ordering?
  • What would break if you used Relaxed everywhere?

Data Structure:

  • Why use power-of-2 capacity?
  • Why MaybeUninit<T> instead of Option<T>?
  • How does wrap-around work with wrapping arithmetic?

Safety:

  • Why is UnsafeCell needed for the buffer?
  • What invariants does your unsafe impl Sync rely on?
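
On the wrap-around question: head and tail only ever increase and may eventually overflow, but their difference stays correct under modular arithmetic (a standalone sketch):

fn main() {
    // Pretend the consumer has popped nearly usize::MAX items...
    let head: usize = usize::MAX - 1;
    // ...and the producer is 3 pushes ahead, wrapped past zero.
    let tail: usize = head.wrapping_add(3);
    // The wrapping difference is still the queue length.
    assert_eq!(tail.wrapping_sub(head), 3);
}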

5.6 Thinking Exercise

Before coding, trace through this scenario on paper:

Initial: head=5, tail=7, capacity=8

Thread A (Producer):        Thread B (Consumer):
  push(X)                     pop()
    1. tail_local = 7         1. head_local = 5
    2. head_snap = 5          2. tail_snap = ???
    3. not full (7-5=2<8)     3. ???
    4. write buffer[7]        4. ???
    5. tail = 8               5. ???

What values does Thread B see for tail_snap?
What happens if B reads tail before A writes it?
What happens if B reads tail after A writes it?

5.7 Hints in Layers

Hint 1 - Starting Point: Begin with a single-threaded version that doesn’t use atomics. Get the ring buffer logic correct first.

Hint 2 - Next Level: Convert to atomics one variable at a time. Use SeqCst everywhere initially - correctness before performance.

Hint 3 - Technical Details: The key insight for SPSC: only producer writes tail, only consumer writes head. This means no CAS needed - just atomic loads and stores!

Hint 4 - Optimization: Replace SeqCst with appropriate orderings:

  • Owner reads their pointer: Relaxed
  • Reading other’s pointer: Acquire
  • Writing own pointer: Release

5.8 The Interview Questions They’ll Ask

  1. “What is a ‘lock-free’ data structure?”
    • A data structure where at least one thread makes progress in finite steps, regardless of other threads.
  2. “Explain the difference between Acquire and Release memory ordering.”
    • Release publishes writes; Acquire receives them. Together they create a happens-before relationship.
  3. “What is ‘False Sharing’ and how do you prevent it?”
    • When threads modify different variables on the same cache line. Prevent with 64-byte alignment/padding.
  4. “Why is SeqCst often slower than Acquire/Release?”
    • SeqCst requires expensive memory barriers on most architectures; Acquire/Release is often sufficient.
  5. “Explain the ABA problem and how to solve it.”
    • The CAS compares only the current value, so a location changed A -> B -> A still matches and the CAS wrongly succeeds. Solve with tagged pointers or epoch-based reclamation.

5.9 Books That Will Help

Topic                     | Book                                           | Chapter
--------------------------|------------------------------------------------|--------
Atomic Fundamentals       | “Rust Atomics and Locks” by Mara Bos           | Ch. 1, 3
Lock-Free Data Structures | “Rust Atomics and Locks” by Mara Bos           | Ch. 9
Cache Architecture        | “Computer Systems: A Programmer’s Perspective” | Ch. 6
Hardware Memory Models    | “A Primer on Memory Consistency”               | Ch. 1-4

5.10 Implementation Phases

Phase 1: Basic Structure

use std::cell::UnsafeCell;
use std::mem::MaybeUninit;
use std::sync::atomic::AtomicUsize;

pub struct SpscQueue<T, const N: usize> {
    head: AtomicUsize,
    tail: AtomicUsize,
    buffer: [UnsafeCell<MaybeUninit<T>>; N],
}

impl<T, const N: usize> SpscQueue<T, N> {
    pub fn new() -> Self {
        assert!(N.is_power_of_two(), "capacity must be a power of 2");
        // Sound: MaybeUninit<T> has no validity requirements, so an
        // uninitialized array of UnsafeCell<MaybeUninit<T>> is a valid value.
        let buffer = unsafe {
            MaybeUninit::<[UnsafeCell<MaybeUninit<T>>; N]>::uninit()
                .assume_init()
        };
        Self {
            head: AtomicUsize::new(0),
            tail: AtomicUsize::new(0),
            buffer,
        }
    }
}

Phase 2: Push and Pop Operations

use std::sync::atomic::Ordering;

// Both methods live in impl<T, const N: usize> SpscQueue<T, N>.

pub fn push(&self, value: T) -> Result<(), T> {
    // Only the producer writes tail, so reading our own counter is Relaxed.
    let tail = self.tail.load(Ordering::Relaxed);
    // Acquire pairs with the consumer's Release store of head.
    let head = self.head.load(Ordering::Acquire);

    if tail.wrapping_sub(head) == N {
        return Err(value); // Full
    }

    let index = tail & (N - 1); // N is a power of 2, so & replaces %
    unsafe {
        (*self.buffer[index].get()).write(value);
    }

    // Release publishes the slot write above to the consumer.
    self.tail.store(tail.wrapping_add(1), Ordering::Release);
    Ok(())
}

pub fn pop(&self) -> Option<T> {
    // Only the consumer writes head, so reading our own counter is Relaxed.
    let head = self.head.load(Ordering::Relaxed);
    // Acquire pairs with the producer's Release store of tail.
    let tail = self.tail.load(Ordering::Acquire);

    if head == tail {
        return None; // Empty
    }

    let index = head & (N - 1);
    let value = unsafe {
        (*self.buffer[index].get()).assume_init_read()
    };

    // Release tells the producer this slot may be reused.
    self.head.store(head.wrapping_add(1), Ordering::Release);
    Some(value)
}
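
One gap in the code above: if the queue is dropped while it still holds elements, their destructors never run. A minimal draining Drop (a sketch built on the pop above):

impl<T, const N: usize> Drop for SpscQueue<T, N> {
    fn drop(&mut self) {
        // &mut self guarantees exclusive access, so plain pops suffice.
        while self.pop().is_some() {}
    }
}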

Phase 3: Cache Line Padding

use crossbeam_utils::CachePadded;

pub struct SpscQueue<T, const N: usize> {
    head: CachePadded<AtomicUsize>,
    tail: CachePadded<AtomicUsize>,
    buffer: [UnsafeCell<MaybeUninit<T>>; N],
}

Phase 4: Benchmarking and Verification

  • Create benchmark comparing against std::mpsc and crossbeam
  • Add Loom tests for memory ordering verification
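
A minimal throughput harness in that spirit (a rough sketch; assumes the crate is named lock_free_queue and exposes the section 3.2 API - criterion gives statistically sounder numbers):

use std::sync::Arc;
use std::time::Instant;

use lock_free_queue::SpscQueue; // assumed crate name

fn main() {
    const MSGS: usize = 10_000_000;
    let queue: Arc<SpscQueue<usize, 1024>> = Arc::new(SpscQueue::new());

    let producer = {
        let q = Arc::clone(&queue);
        std::thread::spawn(move || {
            for i in 0..MSGS {
                while q.push(i).is_err() {
                    std::hint::spin_loop(); // queue full: let consumer drain
                }
            }
        })
    };

    let start = Instant::now();
    let mut received = 0usize;
    while received < MSGS {
        if queue.pop().is_some() {
            received += 1;
        }
    }
    producer.join().unwrap();

    let secs = start.elapsed().as_secs_f64();
    println!("{:.2}M ops/s", MSGS as f64 / secs / 1e6);
}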

5.11 Key Implementation Decisions

Decision       | Choice              | Reasoning
---------------|---------------------|------------------------------------------
Capacity       | Power of 2          | Enables bitwise AND instead of modulo
Buffer type    | MaybeUninit<T>      | Avoids requiring T: Default
Index overflow | Wrapping arithmetic | Indices stay correct even past usize::MAX
Padding        | 64-byte alignment   | Matches the common cache line size
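
The first row in action: for a power-of-2 N, masking with N - 1 is exactly the remainder (a standalone sketch):

fn main() {
    const N: usize = 8; // any power of 2
    for x in 0..64usize {
        // The low bits of x are x mod N when N is a power of 2.
        assert_eq!(x & (N - 1), x % N);
    }
}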

6. Testing Strategy

6.1 Unit Tests

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_single_push_pop() {
        let queue: SpscQueue<i32, 4> = SpscQueue::new();
        assert!(queue.push(42).is_ok());
        assert_eq!(queue.pop(), Some(42));
    }

    #[test]
    fn test_fifo_ordering() {
        let queue: SpscQueue<i32, 8> = SpscQueue::new();
        for i in 0..5 {
            queue.push(i).unwrap();
        }
        for i in 0..5 {
            assert_eq!(queue.pop(), Some(i));
        }
    }

    #[test]
    fn test_wrap_around() {
        let queue: SpscQueue<i32, 4> = SpscQueue::new();
        for _ in 0..10 {
            for i in 0..4 {
                queue.push(i).unwrap();
            }
            for i in 0..4 {
                assert_eq!(queue.pop(), Some(i));
            }
        }
    }
}

6.2 Integration Tests

#[test]
fn test_concurrent_spsc() {
    use std::thread;
    const ITERATIONS: usize = 100_000;

    let queue: SpscQueue<usize, 1024> = SpscQueue::new();
    let queue = std::sync::Arc::new(queue);

    let producer_queue = queue.clone();
    let producer = thread::spawn(move || {
        for i in 0..ITERATIONS {
            while producer_queue.push(i).is_err() {
                std::hint::spin_loop();
            }
        }
    });

    let consumer_queue = queue.clone();
    let consumer = thread::spawn(move || {
        let mut sum = 0usize;
        let mut count = 0;
        while count < ITERATIONS {
            if let Some(v) = consumer_queue.pop() {
                sum = sum.wrapping_add(v);
                count += 1;
            }
        }
        sum
    });

    producer.join().unwrap();
    let sum = consumer.join().unwrap();
    let expected: usize = (0..ITERATIONS).sum();
    assert_eq!(sum, expected);
}

6.3 Property-Based Tests

Consider testing properties like:

  • All pushed values are eventually popped
  • Pop order matches push order (FIFO)
  • Queue never exceeds capacity
  • No values are lost or duplicated
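
The first two properties can be encoded as a sequential model test against VecDeque (a sketch; assumes the proptest crate as a dev-dependency and the crate name lock_free_queue - concurrent behavior still needs the stress test above):

use std::collections::VecDeque;

use proptest::prelude::*;

use lock_free_queue::SpscQueue; // assumed crate name

proptest! {
    #[test]
    fn matches_vecdeque_model(ops in prop::collection::vec(any::<Option<u32>>(), 0..200)) {
        let queue: SpscQueue<u32, 16> = SpscQueue::new();
        let mut model: VecDeque<u32> = VecDeque::new();

        for op in ops {
            match op {
                // Some(v): push must succeed exactly when there is room.
                Some(v) => {
                    prop_assert_eq!(queue.push(v).is_ok(), model.len() < 16);
                    if model.len() < 16 {
                        model.push_back(v);
                    }
                }
                // None: pop must match the FIFO model exactly.
                None => {
                    prop_assert_eq!(queue.pop(), model.pop_front());
                }
            }
        }
    }
}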

6.4 Loom Concurrency Tests

#[cfg(loom)]
mod loom_tests {
    use loom::thread;

    // Sketch: assumes the queue's atomics are loom's types under cfg(loom),
    // so the model checker controls and explores every interleaving.
    #[test]
    fn loom_spsc_basic() {
        loom::model(|| {
            let queue = loom::sync::Arc::new(crate::SpscQueue::<usize, 2>::new());
            let q = queue.clone();
            let producer = thread::spawn(move || {
                q.push(1).unwrap();
            });
            // Both outcomes are legal; loom checks every ordering.
            match queue.pop() {
                Some(v) => assert_eq!(v, 1),
                None => {} // push not yet visible in this interleaving
            }
            producer.join().unwrap();
        });
    }
}

Run with: RUSTFLAGS="--cfg loom" cargo test --release

7. Common Pitfalls & Debugging

Problem                      | Symptom                           | Root Cause                 | Fix
-----------------------------|-----------------------------------|----------------------------|---------------------------
Works on x86, fails on ARM   | Data corruption under load        | Wrong memory ordering      | Use Loom; test on ARM
Performance regression       | Multi-threaded slower than single | False sharing              | Add cache line padding
Mysterious corruption        | Intermittent crashes              | ABA problem (in MPMC)      | Use tagged pointers
Queue stops working          | Infinite loop                     | Overflow without wrapping  | Use wrapping_add
UB from uninitialized memory | Random values                     | Reading uninitialized slot | Use MaybeUninit correctly
Padding doesn’t help         | Still slow                        | Wrong alignment            | Use #[repr(align(64))]

8. Extensions & Challenges

  1. Wait-Free Pop: Make pop wait-free by having producer help consumer
  2. Variable-Size Elements: Support elements of different sizes with length prefix
  3. Batch Operations: Push/pop multiple elements atomically
  4. Block-Based Ring Buffer: Operate on 4KB blocks for higher throughput
  5. NUMA-Aware Allocation: Allocate buffer on consumer’s NUMA node
  6. MPMC Extension: Add multi-producer multi-consumer support with CAS loops

9. Real-World Connections

  • Crossbeam: Production-quality lock-free data structures for Rust
  • LMAX Disruptor: Famous Java lock-free ring buffer (inspired many designs)
  • Linux kernel: Uses lock-free structures throughout
  • Audio software: All professional DAWs use lock-free queues for audio threads
  • Game engines: Job systems rely on lock-free work queues

10. Resources

10.1 Documentation

  • std::sync::atomic module documentation (atomic types and the Ordering enum)
  • The Rustonomicon, “Atomics” chapter

10.2 Crates

  • crossbeam-utils - Cache padding utilities
  • crossbeam-channel - Production lock-free channels
  • loom - Concurrency testing framework

10.3 Articles

  • “Lock-Free Programming” by Herb Sutter
  • “Memory Barriers: a Hardware View for Software Hackers” by Paul McKenney
  • “What every systems programmer should know about concurrency” (LWN)

11. Self-Assessment Checklist

  • I can explain why lock-free is better than locks for real-time systems
  • I can choose the correct memory ordering for a given operation
  • I understand why CAS loops work and when they might fail
  • I can identify false sharing and know how to fix it
  • I can explain the ABA problem and its solutions
  • My queue passes stress tests with millions of operations
  • My queue’s performance beats std::mpsc
  • I’ve verified correctness with Loom or similar tools
  • I can draw the ring buffer state at any point during execution
  • I understand the difference between lock-free and wait-free

12. Submission / Completion Criteria

Your project is complete when:

  1. Functionality: All API methods work correctly
  2. Performance: Achieves 18+ million ops/sec on modern hardware
  3. Correctness: Passes Loom verification for all orderings
  4. Testing: Comprehensive test suite including stress tests
  5. Documentation: Clear comments explaining memory ordering choices
  6. Benchmarks: Comparison against std::mpsc and crossbeam-channel

Bonus Points:

  • MPMC extension implemented
  • Zero allocations verified with custom allocator
  • Works correctly on ARM (tested on Apple Silicon or Raspberry Pi)