Project 6: Atomic Lock-Free Queue (The Concurrency Beast)
"Lock-free programming is like juggling chainsaws while blindfolded. The chainsaws are memory barriers, and the blindfold is the CPU reordering your instructions." - Herb Sutter
Project Metadata
- Main Programming Language: Rust
- Coolness Level: Level 5: Pure Magic (Super Cool)
- Difficulty: Level 5: Master
- Knowledge Area: Concurrency / Low-Level Atomics
- Time Estimate: 2-3 weeks
- Prerequisites: Strong understanding of Rust ownership, basic threading (std::thread), understanding of memory layout and pointers
What You Will Build
A Single-Producer Single-Consumer (SPSC) or Multi-Producer Multi-Consumer (MPMC) lock-free queue using only atomic operations. You will not use Mutex or RwLock. On modern hardware, your queue can exceed 18 million operations per second, dramatically outperforming standard library channels.
Learning Objectives
By the end of this project, you will be able to:
- Explain lock-free vs. lock-based data structures and articulate the trade-offs of each approach
- Master Rust's atomic memory orderings (Relaxed, Acquire, Release, AcqRel, SeqCst) and apply the correct ordering to each operation
- Implement Compare-and-Swap (CAS) loops correctly and understand their role in lock-free algorithms
- Identify and solve the ABA problem using tagged pointers or epoch-based reclamation
- Prevent false sharing through cache line padding and alignment
- Design a ring buffer with efficient head/tail pointer management
- Benchmark and verify lock-free data structures using stress tests and model-checking tools like Loom
Deep Theoretical Foundation
Before writing any code, you must internalize the concepts that make lock-free programming one of the most challenging areas in computer science. This section provides the mental framework you need.
What Is Lock-Free Programming and Why Does It Matter?
Traditional concurrent programming uses locks (mutexes, semaphores) to protect shared data. When Thread A holds a lock, Thread B must wait. This waiting is the fundamental problem with locks:
+-------------------------------------------------------------------------+
| LOCK-BASED QUEUE: THE WAITING GAME |
+-------------------------------------------------------------------------+
Timeline: 0ms 10ms 20ms 30ms 40ms 50ms 60ms 70ms
| | | | | | | |
Thread A: [===LOCK===][WRITE][===UNLOCK===]
^
Thread B: [TRY LOCK]..WAITING..WAITING......[===LOCK===][READ]
^ |
| Finally gets lock!
Blocked here for 50ms
Problem: Thread B does NOTHING for 50ms. CPU cycles wasted.
If Thread A crashes while holding lock -> Deadlock.
If Thread A is preempted by OS -> Priority inversion.
Lock-free programming eliminates this waiting. In a lock-free data structure:
At least one thread is guaranteed to make progress in a finite number of steps, regardless of what other threads are doing.
This means:
- No deadlocks (no locks to hold)
- No priority inversion (no waiting on lower-priority threads)
- Better worst-case latency (no unbounded waits)
- Often better throughput under contention
+-------------------------------------------------------------------------+
| LOCK-FREE QUEUE: CONCURRENT PROGRESS |
+-------------------------------------------------------------------------+
Timeline: 0ms 10ms 20ms 30ms 40ms 50ms 60ms 70ms
| | | | | | | |
Thread A: [CAS][WRITE][CAS][WRITE][CAS][WRITE]...
| | | | | |
Thread B: [CAS][READ][CAS][FAIL-RETRY][CAS][READ]...
^
Retry, don't wait!
Key insight: Both threads ALWAYS make progress.
No thread ever blocks waiting for another.
"Failed" operations immediately retry.
When to use lock-free structures:
| Use Case | Why Lock-Free Wins |
|---|---|
| Audio/Video Processing | Cannot tolerate latency spikes; buffer underruns cause audible glitches |
| Game Engines | Frame deadlines are absolute; blocking = stutter |
| Trading Systems | Microseconds matter; latency = lost money |
| Database Engines | High-throughput transaction logging |
| Message Passing | Actor systems, channel implementations |
Book Reference: "Rust Atomics and Locks" by Mara Bos, Chapter 1: "Basics of Rust Concurrency"
Memory Ordering: The Heart of Lock-Free Programming
Modern CPUs don't execute instructions in the order you write them. They reorder for performance. The memory ordering model defines what guarantees you get about when writes become visible to other threads.
Rust provides five memory orderings, from weakest to strongest:
+-------------------------------------------------------------------------+
| MEMORY ORDERING SPECTRUM |
+-------------------------------------------------------------------------+
Weakest (Fastest) Strongest (Slowest)
| |
v v
+---------+ +----------+ +----------+ +--------+ +--------+
| Relaxed | -> | Acquire | -> | Release | -> | AcqRel | -> | SeqCst |
+---------+ +----------+ +----------+ +--------+ +--------+
| | | | |
| | | | |
No ordering Prevents Prevents Both Global
guarantees reads from writes from Acquire total
between being being AND order
threads reordered reordered Release visible
BEFORE AFTER to all
this load this store threads
Relaxed Ordering
The compiler and CPU can reorder operations freely. Only guarantees atomicity (no torn reads/writes).
// Relaxed: only use for counters where ordering doesn't matter
use std::sync::atomic::{AtomicUsize, Ordering};

static COUNTER: AtomicUsize = AtomicUsize::new(0);

fn record_event() {
    COUNTER.fetch_add(1, Ordering::Relaxed); // just count; order irrelevant
}
Acquire/Release Ordering
The workhorse of lock-free programming. Creates a "happens-before" relationship.
+-------------------------------------------------------------------------+
| ACQUIRE/RELEASE SYNCHRONIZATION |
+-------------------------------------------------------------------------+
Thread A (Producer) Thread B (Consumer)
| |
| data = 42; |
| // All writes BEFORE |
| // this Release are |
| // visible to any thread |
| // that does an Acquire |
| // on the same location |
| | |
v v |
flag.store(true, Release) |
| |
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~> |
| (synchronization via flag) |
| v
| flag.load(Acquire)
| // After this Acquire,
| // all writes from before
| // the Release are visible
| |
| v
| assert_eq!(data, 42); // SAFE!
| |
v v
The Release "publishes" all prior writes.
The Acquire "receives" all those writes.
The Critical Rule: A Release store "synchronizes with" an Acquire load on the same memory location. All writes before the Release become visible after the Acquire.
use std::sync::atomic::{AtomicBool, Ordering};
use std::thread;

static FLAG: AtomicBool = AtomicBool::new(false);
static mut DATA: i32 = 0; // static mut keeps the example short; real code would use UnsafeCell

fn main() {
    let producer = thread::spawn(|| {
        unsafe { DATA = 42 };
        FLAG.store(true, Ordering::Release); // "publishes" DATA = 42
    });
    let consumer = thread::spawn(|| {
        while !FLAG.load(Ordering::Acquire) { /* spin */ } // "receives" the publication
        unsafe { assert_eq!(DATA, 42) }; // guaranteed to see 42!
    });
    producer.join().unwrap();
    consumer.join().unwrap();
}
SeqCst (Sequentially Consistent)
Provides a global total order visible to all threads. The strongest guarantee, but often unnecessary and slower.
+-------------------------------------------------------------------------+
| SEQUENTIALLY CONSISTENT ORDERING |
+-------------------------------------------------------------------------+
With SeqCst, ALL threads agree on a single total order of ALL SeqCst
operations. It's like everyone is watching the same movie.
Thread A Thread B Thread C
| | |
v v v
x.store(1) y.store(1) observe()
| | |
| | |
+--------+--------+ |
| |
v v
Global total order Sees either:
that ALL threads (x=1, y=0) then (x=1, y=1)
observe: OR
(x=0, y=1) then (x=1, y=1)
[x=1] -> [y=1]
OR But NEVER:
[y=1] -> [x=1] (x=1, y=1) then (x=0, y=0)
When to use SeqCst: Only when you need multiple atomic variables to appear consistently ordered across all threads. For single-variable synchronization, Acquire/Release is usually sufficient and faster.
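To make the two-variable point concrete, here is a minimal litmus-test sketch (the classic store-buffering pattern). With SeqCst the final assertion can never fire; with Acquire/Release alone, both loads could return false on weakly ordered hardware:

use std::sync::atomic::{AtomicBool, Ordering};
use std::thread;

// Store-buffering litmus test: SeqCst imposes one total order on all
// four operations, so the two loads can never BOTH miss the other
// thread's store.
static X: AtomicBool = AtomicBool::new(false);
static Y: AtomicBool = AtomicBool::new(false);

fn main() {
    let a = thread::spawn(|| {
        X.store(true, Ordering::SeqCst);
        Y.load(Ordering::SeqCst) // did we observe B's store?
    });
    let b = thread::spawn(|| {
        Y.store(true, Ordering::SeqCst);
        X.load(Ordering::SeqCst) // did we observe A's store?
    });
    let (saw_y, saw_x) = (a.join().unwrap(), b.join().unwrap());
    assert!(saw_x || saw_y); // (false, false) is impossible under SeqCst
}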
Book Reference: "Rust Atomics and Locks" by Mara Bos, Chapter 3: "Memory Ordering"
Compare-and-Swap (CAS): The Lock-Free Primitive
CAS is the fundamental operation that makes lock-free programming possible. It atomically:
- Reads the current value
- Compares it to an expected value
- If equal, writes a new value and returns success
- If not equal, returns failure (and the actual current value)
+-------------------------------------------------------------------------+
| COMPARE-AND-SWAP OPERATION |
+-------------------------------------------------------------------------+
CAS(address, expected, new)
|
v
+----------------------------+
| Read current value |
| from address |
+-------------+--------------+
|
v
+----------------------------+
| current == expected? |
+------+-------------+-------+
| |
YES| |NO
v v
+-------------+ +-------------------+
| Write new | | Don't write |
| Return Ok | | Return Err(current)|
+-------------+ +-------------------+
This ENTIRE operation is ATOMIC. No other thread can see
an intermediate state.
CAS Loop Pattern (the bread and butter of lock-free code):
use std::sync::atomic::{AtomicUsize, Ordering};
fn lock_free_increment(counter: &AtomicUsize) {
loop {
let current = counter.load(Ordering::Relaxed);
let new = current + 1;
// Try to swap. If another thread changed the value, retry.
match counter.compare_exchange(
current, // expected
new, // new value
Ordering::Release, // success ordering
Ordering::Relaxed // failure ordering
) {
Ok(_) => break, // We won! Our write succeeded.
Err(_) => continue, // Someone else wrote. Reload and retry.
}
}
}
+-------------------------------------------------------------------------+
| CAS LOOP VISUALIZATION |
+-------------------------------------------------------------------------+
Thread A Thread B
| |
| current = 5 | current = 5
| new = 6 | new = 6
| |
v v
CAS(5 -> 6) ----+ CAS(5 -> 6) ----+
| |
| (Race condition!) |
| |
+------+------+ +------+------+
| Thread A | | Thread B |
| wins! | | loses! |
| counter = 6 | | sees 6 != 5 |
+-------------+ +-------------+
|
v
RETRY:
current = 6
new = 7
CAS(6 -> 7) -> SUCCESS!
counter = 7
Both threads made progress. No blocking occurred.
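A variant worth knowing: compare_exchange_weak may fail spuriously, but it is cheaper on LL/SC architectures (ARM, RISC-V), and inside a retry loop a spurious failure just costs one extra iteration. A sketch of the same increment that reuses the value returned in Err instead of reloading:

use std::sync::atomic::{AtomicUsize, Ordering};

fn lock_free_increment_weak(counter: &AtomicUsize) {
    let mut current = counter.load(Ordering::Relaxed);
    // On failure, Err carries the actual current value, so feed it
    // straight back into the next attempt instead of loading again.
    while let Err(actual) = counter.compare_exchange_weak(
        current,
        current + 1,
        Ordering::Release,
        Ordering::Relaxed,
    ) {
        current = actual;
    }
}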
The ABA Problem: The Subtle Killer
The ABA problem is the most insidious bug in lock-free programming. It occurs when:
- Thread A reads value A from a location
- Thread A is preempted
- Thread B changes A -> B -> A (back to A)
- Thread A resumes, sees A, and thinks nothing changed!
+-------------------------------------------------------------------------+
| THE ABA PROBLEM |
+-------------------------------------------------------------------------+
Initial state: HEAD -> [Node A] -> [Node B] -> [Node C] -> NULL
Thread A (POP operation):
1. Read HEAD = A
2. Prepare: new_head = A.next = B
3. *** PREEMPTED ***
Thread B runs:
4. Pop A (HEAD = B)
5. Pop B (HEAD = C)
6. Free A, Free B
7. Allocate new node (reuses A's memory!)
8. Push new A (HEAD = A)
New state: HEAD -> [Node A*] -> [Node C] -> NULL
(* new content, same address!)
Thread A resumes:
9. CAS(HEAD, A, B)
10. Succeeds! (HEAD still points to address A)
11. HEAD = B
PROBLEM: B was freed! HEAD now points to freed memory!
After Thread A's CAS:
HEAD -> [FREED MEMORY] -> ???
^
CRASH / DATA CORRUPTION
Why this happens: CAS only checks the pointer value, not what the pointer points to. The memory at address A was reused, but the address itself is the same.
Solutions to the ABA problem:
- Tagged Pointers (what we'll use): add a counter alongside the pointer, increment the counter on every modification, and have CAS check both pointer AND counter (see the sketch after this list):

  +----------------+----------------+
  |    Counter     |    Pointer     |
  |   (32 bits)    |   (32 bits)    |
  +----------------+----------------+

  Even if the pointer returns to the same address, the counter will be different!

- Hazard Pointers: threads publish which pointers they're using; no memory is reused until it is safe
- Epoch-Based Reclamation: memory is freed only when all threads have passed an epoch
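To make the tagged-pointer idea concrete, here is a minimal sketch packing a 32-bit modification counter next to a 32-bit index (standing in for the pointer) in a single AtomicU64. The TaggedHead name and API are illustrative only, not part of the project:

use std::sync::atomic::{AtomicU64, Ordering};

// High 32 bits: modification counter (the "tag").
// Low 32 bits: index standing in for a pointer.
struct TaggedHead {
    packed: AtomicU64,
}

impl TaggedHead {
    fn new(index: u32) -> Self {
        Self { packed: AtomicU64::new(index as u64) }
    }

    /// CAS that only succeeds if BOTH the index and the tag match.
    fn compare_and_set(&self, expected_index: u32, new_index: u32) -> bool {
        let current = self.packed.load(Ordering::Acquire);
        let (tag, index) = ((current >> 32) as u32, current as u32);
        if index != expected_index {
            return false;
        }
        // Bump the tag on every successful update: an ABA'd index
        // paired with a stale tag will fail this CAS.
        let next = ((tag.wrapping_add(1) as u64) << 32) | new_index as u64;
        self.packed
            .compare_exchange(current, next, Ordering::AcqRel, Ordering::Acquire)
            .is_ok()
    }
}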
Book Reference: "Rust Atomics and Locks" by Mara Bos, Chapter 9: "Lock-Free Linked Lists" (covers ABA in depth)
Cache Coherence and False Sharing
Modern CPUs don't read individual bytes from RAM. They read cache lines (typically 64 bytes). When two threads modify variables on the same cache line, they fight over that cache line, even if they're modifying different variables!
+-------------------------------------------------------------------------+
| FALSE SHARING |
+-------------------------------------------------------------------------+
CACHE LINE (64 bytes)
+-------+-------+-------+-------+-------+-------+-------+-------+
| head | tail | | | | | | |
| (8B) | (8B) | ... | ... | ... | ... | ... | ... |
+---+---+---+---+-------+-------+-------+-------+-------+-------+
| |
| +-- Thread B writes tail
| -> Invalidates ENTIRE cache line
| -> Thread A must reload head!
|
+-- Thread A writes head
-> Invalidates ENTIRE cache line
-> Thread B must reload tail!
Both threads constantly invalidate each other's cache.
Performance tanks even though they write DIFFERENT variables!
The solution: Pad your data structures so hot variables live on separate cache lines.
+-------------------------------------------------------------------------+
| CACHE LINE PADDING |
+-------------------------------------------------------------------------+
CACHE LINE 1 (64 bytes)
+-------+-------+-------+-------+-------+-------+-------+-------+
| head | PAD | PAD | PAD | PAD | PAD | PAD | PAD |
| (8B) | (56B padding to fill line) |
+-------+-------+-------+-------+-------+-------+-------+-------+
CACHE LINE 2 (64 bytes)
+-------+-------+-------+-------+-------+-------+-------+-------+
| tail | PAD | PAD | PAD | PAD | PAD | PAD | PAD |
| (8B) | (56B padding to fill line) |
+-------+-------+-------+-------+-------+-------+-------+-------+
Now Thread A and Thread B have their own cache lines.
No invalidation! Full speed!
In Rust:
// #[repr(align(64))] alone does the job: it rounds the struct's size
// and alignment up to 64 bytes, so no manual padding field is needed.
// (A `[u8; 64 - size_of::<T>()]` padding array, sometimes shown in
// examples, does not compile on stable Rust.)
#[repr(align(64))] // Align to cache line boundary
struct CacheLinePadded<T> {
    value: T,
}
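A quick sanity check of that layout, assuming a 64-byte cache line (the struct is redeclared so the snippet compiles on its own):

use std::sync::atomic::{AtomicUsize, Ordering};

#[repr(align(64))]
struct CacheLinePadded<T> {
    value: T, // redeclared here so the snippet stands alone
}

fn main() {
    // Alignment rounds the size up to 64, so adjacent array elements
    // of this type never share a cache line.
    assert_eq!(std::mem::align_of::<CacheLinePadded<AtomicUsize>>(), 64);
    assert_eq!(std::mem::size_of::<CacheLinePadded<AtomicUsize>>(), 64);
    let head = CacheLinePadded { value: AtomicUsize::new(0) };
    assert_eq!(head.value.load(Ordering::Relaxed), 0);
}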
Hardware vs. Software Memory Models
Understanding the difference between what the hardware does and what your programming language promises is crucial.
+-------------------------------------------------------------------------+
| MEMORY MODEL HIERARCHY |
+-------------------------------------------------------------------------+
+----------------------------------+
| Your Rust Code |
| (uses Ordering::Acquire, etc.) |
+----------------+-----------------+
|
v
+----------------------------------+
| Rust Memory Model |
| (Abstract machine semantics) |
| - Defines valid executions |
| - Compiler inserts barriers |
+----------------+-----------------+
|
v
+----------------------------------+
| LLVM / Machine Code |
| (Platform-specific barriers) |
| x86: MFENCE, LOCK prefix |
| ARM: DMB, DSB, ISB |
+----------------+-----------------+
|
v
+----------------------------------+
| Hardware Memory Model |
| (What the CPU actually does) |
| x86-64: Strong (TSO) |
| ARM: Weak |
+----------------------------------+
Different CPUs have different memory models:
x86-64 (Intel/AMD):
- Total Store Order (TSO)
- Relatively strong: stores are seen in order
- Your code might work "by accident" even with wrong orderings!
ARM/RISC-V:
- Weak memory model
- Stores can be reordered aggressively
- Bugs in ordering WILL manifest on these platforms
This is why testing on x86 isn't enough. Use Loom or test on ARM!
Book Reference: "Computer Systems: A Programmer's Perspective" by Bryant & O'Hallaron, Chapter 6: "The Memory Hierarchy"
Lock-Free vs. Wait-Free vs. Obstruction-Free
These terms define different levels of progress guarantees:
+-------------------------------------------------------------------------+
| PROGRESS GUARANTEE HIERARCHY |
+-------------------------------------------------------------------------+
BLOCKING (Locks)
- A thread can prevent all others from progressing indefinitely
- Example: Thread holds mutex, gets preempted
OBSTRUCTION-FREE
- A thread makes progress if it runs alone (no contention)
- Under contention, no guarantees
- Weakest non-blocking guarantee
LOCK-FREE
- At least ONE thread makes progress in finite steps
- System as a whole makes progress
- Individual threads might retry forever (starvation)
- OUR TARGET
WAIT-FREE
- EVERY thread makes progress in finite steps
- No starvation possible
- Hardest to achieve
- Often requires helping mechanisms
Strength of Guarantee
^
| +-------------+
| | Wait-Free | Every thread progresses
| +-------------+
| ^
| |
| +-------------+
| | Lock-Free | System progresses (our target)
| +-------------+
| ^
| |
| +-------------------+
| | Obstruction-Free | Progress if alone
| +-------------------+
| ^
| |
| +-------------+
+--- | Blocking | Can halt entirely
+-------------+
What we're building: A lock-free SPSC queue (simplest) that can be extended to MPMC.
Real-World Use Cases
Lock-free data structures power the world's most demanding systems:
| Domain | Example | Why Lock-Free |
|---|---|---|
| Audio Processing | Digital Audio Workstations (DAWs) | Buffer underruns cause audible clicks; cannot tolerate latency spikes |
| Game Engines | Unity, Unreal Engine job systems | Frame deadlines are absolute (16.67ms for 60 FPS) |
| Financial Trading | High-Frequency Trading platforms | Microsecond advantages = millions of dollars |
| Databases | RocksDB, ScyllaDB | Transaction log must never block |
| Operating Systems | Linux kernel, Windows kernel | Interrupt handlers cannot sleep |
| Message Passing | Tokio, Crossbeam, Rayon | Channels between async tasks |
| Real-time Systems | Medical devices, avionics | Certifiable worst-case latency |
Real World Outcome
When you complete this project, you will have a high-performance lock-free queue that dramatically outperforms the standard library. Here's what running your benchmarks will look like:
$ cargo new lock-free-queue --lib
Created library `lock-free-queue` package
$ cd lock-free-queue
$ cargo add crossbeam-utils # For cache line padding
Updating crates.io index
Adding crossbeam-utils v0.8.19 to dependencies
$ cargo build --release
Compiling lock-free-queue v0.1.0 (/Users/you/lock-free-queue)
Finished release [optimized] target(s) in 2.34s
$ cargo run --release --example benchmark
Compiling lock-free-queue v0.1.0
Finished release [optimized] target(s) in 0.89s
Running `target/release/examples/benchmark`
+======================================================================+
| LOCK-FREE QUEUE PERFORMANCE BENCHMARK |
+======================================================================+
System Information:
CPU: Apple M2 Pro (12 cores)
RAM: 32 GB
OS: macOS 14.0
Test Configuration:
Messages: 10,000,000
Message Size: 64 bytes
Iterations: 5
+----------------------------------------------------------------------+
| SINGLE-PRODUCER SINGLE-CONSUMER |
+----------------------------------------------------------------------+
Running std::sync::mpsc benchmark...
Iteration 1: 4.82M ops/sec
Iteration 2: 4.91M ops/sec
Iteration 3: 4.76M ops/sec
Iteration 4: 4.88M ops/sec
Iteration 5: 4.79M ops/sec
Average: 4.83M ops/sec
Running crossbeam-channel benchmark...
Iteration 1: 12.34M ops/sec
Iteration 2: 12.51M ops/sec
Iteration 3: 12.28M ops/sec
Iteration 4: 12.45M ops/sec
Iteration 5: 12.39M ops/sec
Average: 12.39M ops/sec
Running YOUR lock-free queue benchmark...
Iteration 1: 18.92M ops/sec
Iteration 2: 19.11M ops/sec
Iteration 3: 18.87M ops/sec
Iteration 4: 19.23M ops/sec
Iteration 5: 18.95M ops/sec
Average: 19.02M ops/sec
+----------------------------------------------------------------------+
| BENCHMARK RESULTS |
+----------------------------------------------------------------------+
Queue Type | Throughput | Latency (p50) | Latency (p99)
--------------------|---------------|---------------|---------------
std::sync::mpsc | 4.83M ops/s | 207 ns | 892 ns
crossbeam-channel | 12.39M ops/s | 81 ns | 234 ns
YOUR Lock-Free | 19.02M ops/s | 53 ns | 127 ns
SPEEDUP vs std::mpsc: 3.94x
SPEEDUP vs crossbeam: 1.53x
+----------------------------------------------------------------------+
| MEMORY ORDERING VERIFICATION |
+----------------------------------------------------------------------+
Running correctness tests with Loom...
test spsc_basic ... ok (explored 1,247 interleavings)
test spsc_wrap_around ... ok (explored 3,891 interleavings)
test spsc_full_queue ... ok (explored 2,156 interleavings)
test spsc_empty_queue ... ok (explored 987 interleavings)
test spsc_concurrent_stress ... ok (explored 12,456 interleavings)
All orderings verified correct!
+----------------------------------------------------------------------+
| FALSE SHARING ANALYSIS |
+----------------------------------------------------------------------+
Without cache line padding:
Throughput: 8.23M ops/sec
L1 cache misses: 4,231,892
With cache line padding:
Throughput: 19.02M ops/sec
L1 cache misses: 12,456
Improvement: 2.31x throughput, 339x fewer cache misses!
+----------------------------------------------------------------------+
| SUMMARY |
+----------------------------------------------------------------------+
Your lock-free SPSC queue achieves:
* 19+ million operations per second
* 53 nanosecond median latency
* Zero heap allocations in hot path
* Verified memory ordering correctness
* Optimal cache behavior with padding
Congratulations! You've built a production-quality lock-free data structure.
$ cargo test
Compiling lock-free-queue v0.1.0
Finished test [unoptimized + debuginfo] target(s) in 1.45s
Running unittests src/lib.rs
running 12 tests
test tests::test_single_push_pop ... ok
test tests::test_empty_pop_returns_none ... ok
test tests::test_full_push_returns_err ... ok
test tests::test_fifo_ordering ... ok
test tests::test_wrap_around ... ok
test tests::test_capacity ... ok
test tests::test_concurrent_spsc ... ok
test tests::test_high_contention ... ok
test tests::test_memory_ordering ... ok
test loom_tests::loom_spsc_basic ... ok
test loom_tests::loom_concurrent_access ... ok
test loom_tests::loom_wrap_around ... ok
test result: ok. 12 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out
Complete Project Specification
Core Requirements
You will implement a bounded, lock-free SPSC (Single-Producer Single-Consumer) queue with these characteristics:
- Capacity: Fixed at construction time (power of 2 for efficient modulo via bitwise AND)
- Blocking: Never. push and pop return immediately
- Ordering: FIFO (First-In First-Out)
- Thread Safety: Single producer thread, single consumer thread (no synchronization needed between multiple producers or consumers)
- Memory: No heap allocation after construction; the ring buffer is pre-allocated
- Performance: 18+ million operations per second on modern hardware
API Design
pub struct SpscQueue<T, const N: usize> {
// Implementation details
}
impl<T, const N: usize> SpscQueue<T, N> {
/// Create a new queue with capacity N.
/// N must be a power of 2.
pub fn new() -> Self;
/// Attempt to push a value into the queue.
/// Returns Err(value) if the queue is full.
pub fn push(&self, value: T) -> Result<(), T>;
/// Attempt to pop a value from the queue.
/// Returns None if the queue is empty.
pub fn pop(&self) -> Option<T>;
/// Check if the queue is empty.
pub fn is_empty(&self) -> bool;
/// Check if the queue is full.
pub fn is_full(&self) -> bool;
/// Get the current number of elements in the queue.
pub fn len(&self) -> usize;
/// Get the maximum capacity of the queue.
pub fn capacity(&self) -> usize;
}
// The queue is Send + Sync because:
// - Only one thread ever calls push (the producer)
// - Only one thread ever calls pop (the consumer)
unsafe impl<T: Send, const N: usize> Send for SpscQueue<T, N> {}
unsafe impl<T: Send, const N: usize> Sync for SpscQueue<T, N> {}
Constraints
- You may NOT use Mutex, RwLock, or any other blocking synchronization primitive
- You MAY use AtomicUsize, AtomicPtr, and related types from std::sync::atomic
- You MAY use std::cell::UnsafeCell for interior mutability
- You MUST correctly use memory orderings (not just SeqCst everywhere)
- You MUST prevent false sharing via cache line padding
Solution Architecture
Ring Buffer Structure
The queue is implemented as a ring buffer (circular buffer) with head and tail pointers:
+-------------------------------------------------------------------------+
| RING BUFFER STRUCTURE |
+-------------------------------------------------------------------------+
Capacity = 8 (indices 0-7)
Initial state (empty):
HEAD = 0 (next slot to READ from)
TAIL = 0 (next slot to WRITE to)
HEAD
TAIL
|
v
+---+---+---+---+---+---+---+---+
| | | | | | | | |
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
+---+---+---+---+---+---+---+---+
After push(A), push(B), push(C):
HEAD = 0
TAIL = 3
HEAD TAIL
| |
v v
+---+---+---+---+---+---+---+---+
| A | B | C | | | | | |
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
+---+---+---+---+---+---+---+---+
^^^^^^^^^
Readable data (HEAD to TAIL-1)
After pop() returns A:
HEAD = 1
TAIL = 3
HEAD TAIL
| |
v v
+---+---+---+---+---+---+---+---+
| | B | C | | | | | |
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
+---+---+---+---+---+---+---+---+
^^^^^^^
Readable data
Wrap-around example:
After many operations, HEAD = 6, TAIL = 2
HEAD
|
v
+---+---+---+---+---+---+---+---+
| X | Y | | | | | W | |
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
+---+---+---+---+---+---+---+---+
^ ^^^^^^^
| Readable: W
TAIL
Readable: X, Y (wrapped around!)
To read: index = HEAD % CAPACITY = 6 % 8 = 6
To write: index = TAIL % CAPACITY = 2 % 8 = 2
Head/Tail Pointer Management
The key insight is that in SPSC:
- Only the producer ever modifies tail
- Only the consumer ever modifies head
This means no CAS is needed! Simple atomic loads and stores suffice.
+-------------------------------------------------------------------------+
| SPSC OPERATION FLOW |
+-------------------------------------------------------------------------+
PUSH OPERATION (Producer thread only):
1. Load tail (Relaxed - we own this variable)
2. Load head (Acquire - synchronize with consumer's Release)
3. If (tail - head) == capacity: FULL, return Err
4. Write data to buffer[tail % capacity]
5. Store tail + 1 (Release - publish the new data)
Producer Consumer
| |
| tail_local = tail.load(Relaxed) |
| head_snapshot = head.load(Acquire)
| <---(sync)--- |
| if full: return Err |
| buffer[tail_local] = data |
| tail.store(tail_local+1, Release)
| ---(sync)---> |
| |
POP OPERATION (Consumer thread only):
1. Load head (Relaxed - we own this variable)
2. Load tail (Acquire - synchronize with producer's Release)
3. If head == tail: EMPTY, return None
4. Read data from buffer[head % capacity]
5. Store head + 1 (Release - publish the consumption)
| |
| | head_local = head.load(Relaxed)
| | tail_snapshot = tail.load(Acquire)
| <---(sync)--- |
| | if empty: return None
| | data = buffer[head_local]
| | head.store(head_local+1, Release)
| <---(sync)--- |
Cache Line Padding Strategy
To prevent false sharing, we pad head and tail to separate cache lines:
+-------------------------------------------------------------------------+
| MEMORY LAYOUT WITH PADDING |
+-------------------------------------------------------------------------+
struct SpscQueue<T, N> {
CACHE LINE 1 (64 bytes)
+----------------------------------------------------------+
| head: AtomicUsize | _pad1: [u8; 56] |
| (8 bytes) | (56 bytes padding) |
+----------------------------------------------------------+
CACHE LINE 2 (64 bytes)
+----------------------------------------------------------+
| tail: AtomicUsize | _pad2: [u8; 56] |
| (8 bytes) | (56 bytes padding) |
+----------------------------------------------------------+
CACHE LINES 3+ (N * size_of::<T>())
+----------------------------------------------------------+
| buffer: [UnsafeCell<MaybeUninit<T>>; N] |
| (The actual ring buffer data) |
+----------------------------------------------------------+
}
With this layout:
- Producer writes to CACHE LINE 2 (tail) and CACHE LINES 3+ (buffer)
- Consumer reads from CACHE LINE 2 and writes to CACHE LINE 1 (head)
- No false sharing between head and tail!
Memory Ordering Choices
Here's the reasoning for each ordering:
| Operation | Ordering | Reasoning |
|---|---|---|
| Producer reads tail | Relaxed | Producer "owns" tail; no sync needed |
| Producer reads head | Acquire | Must see consumer's writes to head |
| Producer writes buffer | (none) | Protected by tail/head synchronization |
| Producer writes tail | Release | "Publishes" buffer write to consumer |
| Consumer reads head | Relaxed | Consumer "owns" head; no sync needed |
| Consumer reads tail | Acquire | Must see producer's writes to tail |
| Consumer reads buffer | (none) | Protected by tail/head synchronization |
| Consumer writes head | Release | "Publishes" consumption to producer |
Phased Implementation Guide
Phase 1: Basic SPSC Queue with Acquire/Release
Goal: Get a working single-threaded queue that compiles.
// src/lib.rs
use std::cell::UnsafeCell;
use std::mem::MaybeUninit;
use std::sync::atomic::{AtomicUsize, Ordering};
pub struct SpscQueue<T, const N: usize> {
head: AtomicUsize,
tail: AtomicUsize,
buffer: [UnsafeCell<MaybeUninit<T>>; N],
}
impl<T, const N: usize> SpscQueue<T, N> {
pub fn new() -> Self {
assert!(N.is_power_of_two(), "Capacity must be a power of 2");
// Safe because MaybeUninit doesn't require initialization
let buffer = unsafe {
MaybeUninit::<[UnsafeCell<MaybeUninit<T>>; N]>::uninit().assume_init()
};
Self {
head: AtomicUsize::new(0),
tail: AtomicUsize::new(0),
buffer,
}
}
pub fn push(&self, value: T) -> Result<(), T> {
let tail = self.tail.load(Ordering::Relaxed);
let head = self.head.load(Ordering::Acquire);
if tail.wrapping_sub(head) == N {
return Err(value); // Queue is full
}
let index = tail & (N - 1); // Efficient modulo for power of 2
unsafe {
(*self.buffer[index].get()).write(value);
}
self.tail.store(tail.wrapping_add(1), Ordering::Release);
Ok(())
}
pub fn pop(&self) -> Option<T> {
let head = self.head.load(Ordering::Relaxed);
let tail = self.tail.load(Ordering::Acquire);
if head == tail {
return None; // Queue is empty
}
let index = head & (N - 1);
let value = unsafe {
(*self.buffer[index].get()).assume_init_read()
};
self.head.store(head.wrapping_add(1), Ordering::Release);
Some(value)
}
}
unsafe impl<T: Send, const N: usize> Send for SpscQueue<T, N> {}
unsafe impl<T: Send, const N: usize> Sync for SpscQueue<T, N> {}
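One gap in this phase's sketch: if elements are still in the buffer when the queue itself is dropped, their destructors never run. A minimal fix consistent with the code above is to drain in Drop:

impl<T, const N: usize> Drop for SpscQueue<T, N> {
    fn drop(&mut self) {
        // Drop runs with exclusive access (&mut self), so plain pop()
        // is safe here; draining runs each remaining destructor.
        while self.pop().is_some() {}
    }
}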
Checkpoint: Write a basic test that pushes and pops a single value.
Phase 2: Bounded Ring Buffer with Proper Indexing
Goal: Handle wrap-around correctly and add capacity checks.
impl<T, const N: usize> SpscQueue<T, N> {
pub fn is_empty(&self) -> bool {
let head = self.head.load(Ordering::Relaxed);
let tail = self.tail.load(Ordering::Acquire);
head == tail
}
pub fn is_full(&self) -> bool {
let tail = self.tail.load(Ordering::Relaxed);
let head = self.head.load(Ordering::Acquire);
tail.wrapping_sub(head) == N
}
pub fn len(&self) -> usize {
let tail = self.tail.load(Ordering::Acquire);
let head = self.head.load(Ordering::Acquire);
tail.wrapping_sub(head)
}
pub fn capacity(&self) -> usize {
N
}
}
Checkpoint: Test wrap-around by pushing more than capacity items (some will fail), popping, and pushing again.
Phase 3: Cache Line Padding to Prevent False Sharing
Goal: Separate head and tail onto different cache lines.
use crossbeam_utils::CachePadded;
pub struct SpscQueue<T, const N: usize> {
head: CachePadded<AtomicUsize>,
tail: CachePadded<AtomicUsize>,
buffer: [UnsafeCell<MaybeUninit<T>>; N],
}
Or implement your own padding:
#[repr(align(64))]
struct Padded<T> {
value: T,
}
pub struct SpscQueue<T, const N: usize> {
head: Padded<AtomicUsize>,
tail: Padded<AtomicUsize>,
buffer: [UnsafeCell<MaybeUninit<T>>; N],
}
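With hand-rolled padding, field accesses become self.head.value.load(...). A small Deref impl (optional sugar, not part of the required API) keeps existing call sites unchanged:

use std::ops::{Deref, DerefMut};

// Forward accesses to the padded inner value so call sites like
// `self.head.load(Ordering::Relaxed)` keep compiling unchanged.
impl<T> Deref for Padded<T> {
    type Target = T;
    fn deref(&self) -> &T {
        &self.value
    }
}

impl<T> DerefMut for Padded<T> {
    fn deref_mut(&mut self) -> &mut T {
        &mut self.value
    }
}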
Checkpoint: Benchmark with and without padding. You should see 2x+ improvement.
Phase 4: MPMC Extension with CAS Loops
Goal: Extend to Multi-Producer Multi-Consumer (optional, advanced).
For MPMC, multiple producers might try to write to the same slot. We need CAS:
pub fn mpmc_push(&self, value: T) -> Result<(), T> {
loop {
let tail = self.tail.load(Ordering::Relaxed);
let head = self.head.load(Ordering::Acquire);
if tail.wrapping_sub(head) == N {
return Err(value); // Queue is full
}
// Try to claim this slot
match self.tail.compare_exchange_weak(
tail,
tail.wrapping_add(1),
Ordering::Release,
Ordering::Relaxed,
) {
Ok(_) => {
// We claimed the slot, write the value
let index = tail & (N - 1);
unsafe {
(*self.buffer[index].get()).write(value);
}
return Ok(());
}
Err(_) => continue, // Another producer won, retry
}
}
}
Caution: MPMC is significantly more complex. The above is a simplified version; a production MPMC queue needs more sophisticated handling (e.g., sequence numbers per slot).
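For reference, here is a sketch of the per-slot layout that "sequence numbers per slot" refers to; the idea comes from Dmitry Vyukov's bounded MPMC queue. This shows only the data layout and state protocol, not a complete implementation:

use std::cell::UnsafeCell;
use std::mem::MaybeUninit;
use std::sync::atomic::AtomicUsize;

// Each slot carries its own sequence number, encoding both its state
// and which "lap" around the ring it belongs to:
//   seq == pos      -> free: a producer may CAS tail and write here
//   seq == pos + 1  -> full: a consumer may CAS head and read here
// After reading, the consumer stores seq = pos + N, marking the slot
// free for the NEXT lap. This closes the window in the simplified
// push above, where tail moves before the value is actually written.
struct Slot<T> {
    seq: AtomicUsize,
    value: UnsafeCell<MaybeUninit<T>>,
}

struct MpmcQueue<T, const N: usize> {
    head: AtomicUsize,
    tail: AtomicUsize,
    buffer: [Slot<T>; N],
}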
Phase 5: Benchmarking and Verification
Goal: Measure performance and verify correctness.
Create a benchmark in examples/benchmark.rs:
use std::thread;
use std::time::Instant;
use lock_free_queue::SpscQueue;
fn main() {
    const MESSAGES: usize = 10_000_000;
    let queue: SpscQueue<usize, 1024> = SpscQueue::new();
    let start = Instant::now();
    // Scoped threads let both closures borrow the queue directly;
    // plain thread::spawn would require 'static (e.g., an Arc).
    thread::scope(|s| {
        s.spawn(|| {
            for i in 0..MESSAGES {
                while queue.push(i).is_err() {
                    std::hint::spin_loop();
                }
            }
        });
        s.spawn(|| {
            let mut received = 0;
            while received < MESSAGES {
                if queue.pop().is_some() {
                    received += 1;
                } else {
                    std::hint::spin_loop();
                }
            }
        });
    });
    let elapsed = start.elapsed();
    let ops_per_sec = MESSAGES as f64 / elapsed.as_secs_f64();
    println!("Throughput: {:.2}M ops/sec", ops_per_sec / 1_000_000.0);
}
Testing Strategy
Unit Tests
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn test_single_push_pop() {
let queue: SpscQueue<i32, 4> = SpscQueue::new();
assert!(queue.push(42).is_ok());
assert_eq!(queue.pop(), Some(42));
}
#[test]
fn test_empty_pop_returns_none() {
let queue: SpscQueue<i32, 4> = SpscQueue::new();
assert_eq!(queue.pop(), None);
}
#[test]
fn test_full_push_returns_err() {
let queue: SpscQueue<i32, 2> = SpscQueue::new();
assert!(queue.push(1).is_ok());
assert!(queue.push(2).is_ok());
assert!(queue.push(3).is_err());
}
#[test]
fn test_fifo_ordering() {
let queue: SpscQueue<i32, 8> = SpscQueue::new();
for i in 0..5 {
queue.push(i).unwrap();
}
for i in 0..5 {
assert_eq!(queue.pop(), Some(i));
}
}
#[test]
fn test_wrap_around() {
let queue: SpscQueue<i32, 4> = SpscQueue::new();
// Fill and empty multiple times
for _ in 0..10 {
for i in 0..4 {
queue.push(i).unwrap();
}
for i in 0..4 {
assert_eq!(queue.pop(), Some(i));
}
}
}
}
Stress Tests
#[test]
fn test_concurrent_stress() {
use std::thread;
const ITERATIONS: usize = 100_000;
let queue: SpscQueue<usize, 1024> = SpscQueue::new();
let queue = std::sync::Arc::new(queue);
let producer_queue = queue.clone();
let producer = thread::spawn(move || {
for i in 0..ITERATIONS {
while producer_queue.push(i).is_err() {
std::hint::spin_loop();
}
}
});
let consumer_queue = queue.clone();
let consumer = thread::spawn(move || {
let mut sum = 0usize;
let mut count = 0;
while count < ITERATIONS {
if let Some(v) = consumer_queue.pop() {
sum = sum.wrapping_add(v);
count += 1;
}
}
sum
});
producer.join().unwrap();
let sum = consumer.join().unwrap();
// Verify we received all items (sum should match expected)
let expected_sum: usize = (0..ITERATIONS).sum();
assert_eq!(sum, expected_sum);
}
Loom Testing for Concurrency
Loom exhaustively tests all possible thread interleavings:
Add Loom as a dev-dependency in Cargo.toml:

[dev-dependencies]
loom = "0.7"
#[cfg(loom)]
mod loom_tests {
    // Assumes SpscQueue is compiled against loom's atomics under
    // #[cfg(loom)] (swap std::sync::atomic for loom::sync::atomic).
    use loom::sync::Arc;
    use loom::thread;

    #[test]
    fn loom_spsc_basic() {
        loom::model(|| {
            let queue = Arc::new(crate::SpscQueue::<i32, 4>::new());
            let producer_queue = queue.clone();
            let producer = thread::spawn(move || {
                producer_queue.push(1).unwrap();
            });
            let consumer_queue = queue.clone();
            let consumer = thread::spawn(move || loop {
                if let Some(v) = consumer_queue.pop() {
                    assert_eq!(v, 1);
                    break;
                }
                // Let loom's scheduler explore other interleavings
                // instead of spinning forever inside one.
                thread::yield_now();
            });
            producer.join().unwrap();
            consumer.join().unwrap();
        });
    }
}
Run Loom tests:
RUSTFLAGS="--cfg loom" cargo test --release
Common Pitfalls
Pitfall 1: Using Wrong Memory Orderings
Symptom: Works on x86, fails on ARM. Data corruption under load.
Wrong:
self.tail.store(new_tail, Ordering::Relaxed); // Bug!
Right:
self.tail.store(new_tail, Ordering::Release); // Publishes buffer write
Diagnosis: Use Loom to find ordering bugs. Test on ARM (e.g., Apple Silicon, Raspberry Pi).
Pitfall 2: The ABA Problem in MPMC
Symptom: Mysterious corruption when memory is reused.
Wrong: Using raw pointer comparison in CAS.
Right: Use tagged pointers (pointer + counter) or epoch-based reclamation.
Pitfall 3: False Sharing Destroying Performance
Symptom: Multi-threaded version is slower than single-threaded.
Wrong:
struct Queue {
head: AtomicUsize, // Same cache line as tail!
tail: AtomicUsize,
}
Right:
struct Queue {
head: CachePadded<AtomicUsize>,
tail: CachePadded<AtomicUsize>,
}
Diagnosis: Use perf to measure cache misses.
Pitfall 4: Forgetting #[repr(align)] or Using Wrong Size
Symptom: Padding doesn't help performance.
Wrong: Assuming cache line is 32 bytes (old CPUs) or 128 bytes (some server CPUs).
Right: Use 64 bytes (most common), or detect at runtime.
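If in doubt, let crossbeam-utils decide: its CachePadded picks the alignment per target at compile time (for example 128 bytes on modern x86-64 and aarch64, whose prefetchers pull cache lines in pairs). A quick check, assuming the crate is already a dependency:

use crossbeam_utils::CachePadded;
use std::sync::atomic::AtomicUsize;

fn main() {
    let head: CachePadded<AtomicUsize> = CachePadded::new(AtomicUsize::new(0));
    // At least one full cache line on every supported target.
    assert!(std::mem::align_of_val(&head) >= 64);
    println!("CachePadded alignment: {}", std::mem::align_of_val(&head));
}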
Pitfall 5: Not Using MaybeUninit
Symptom: Undefined behavior when reading uninitialized memory.
Wrong:
buffer: [UnsafeCell<T>; N], // Needs N valid values of T up front; reading a "logically empty" slot is UB
Right:
buffer: [UnsafeCell<MaybeUninit<T>>; N],
Pitfall 6: Overflow in Index Calculation
Symptom: Crash after billions of operations.
Wrong:
let index = tail % N; // Works until tail overflows
Right:
let index = tail & (N - 1); // Works with wrapping arithmetic
// AND: use wrapping_add for head/tail increments
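A quick worked check of why wrapping arithmetic plus the power-of-two mask stays correct across overflow (the power of two matters because N must evenly divide usize::MAX + 1 for the mask to stay consistent through the wrap):

fn main() {
    const N: usize = 8;
    // head sits just below usize::MAX; tail has wrapped past zero.
    let head: usize = usize::MAX - 1;
    let tail: usize = head.wrapping_add(3); // == 1 after wrapping
    // Occupancy is still correct thanks to wrapping subtraction...
    assert_eq!(tail.wrapping_sub(head), 3);
    // ...and the mask still maps the wrapped counter to a valid slot.
    assert_eq!(tail & (N - 1), 1);
}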
Extensions and Challenges
Extension 1: Wait-Free Pop
Make pop wait-free by having the producer help the consumer:
// Producer leaves a "helping" record when it sees a slow consumer
Extension 2: Variable-Size Elements
Support elements of different sizes using a length prefix:
+--------+------------------+--------+------------------+
| len: 4 | data: "abcd" | len: 2 | data: "xy" |
+--------+------------------+--------+------------------+
Extension 3: Batch Operations
Push/pop multiple elements atomically for better throughput:
fn push_batch(&self, values: &[T]) -> usize; // Returns count pushed
fn pop_batch(&self, out: &mut [T]) -> usize; // Returns count popped
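A minimal sketch of push_batch for the SPSC queue built in Phase 1, assuming T: Copy to keep the copy loop simple; the win is paying for one Release store per batch instead of one per element:

impl<T: Copy, const N: usize> SpscQueue<T, N> {
    /// Push as many of `values` as fit; returns the count pushed.
    pub fn push_batch(&self, values: &[T]) -> usize {
        let tail = self.tail.load(Ordering::Relaxed);
        let head = self.head.load(Ordering::Acquire);
        let free = N - tail.wrapping_sub(head);
        let count = values.len().min(free);
        for (i, &v) in values[..count].iter().enumerate() {
            let index = tail.wrapping_add(i) & (N - 1);
            unsafe { (*self.buffer[index].get()).write(v) };
        }
        // A single Release store publishes every slot written above.
        self.tail.store(tail.wrapping_add(count), Ordering::Release);
        count
    }
}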
Extension 4: Block-Based Ring Buffer
For very high throughput, operate on blocks rather than elements:
+------------------+------------------+------------------+
| Block 0 (4KB) | Block 1 (4KB) | Block 2 (4KB) |
| 64 messages | 64 messages | 64 messages |
+------------------+------------------+------------------+
Extension 5: NUMA-Aware Allocation
For multi-socket servers, allocate the buffer on the NUMA node where the consumer runs:
// Use libnuma or platform-specific APIs
The Interview Questions They Will Ask
-
"What is a 'lock-free' data structure?"
A data structure where at least one thread is guaranteed to make progress in a finite number of steps, regardless of what other threads are doing. There are no locks that can cause a thread to wait indefinitely. Lock-free structures use atomic operations like CAS to coordinate access.
-
"Explain the difference between Acquire and Release memory ordering."
Release ordering on a store "publishes" all writes that happened before it. Acquire ordering on a load "receives" all writes that happened before the matching Release. Together, they create a "synchronizes-with" relationship that establishes a happens-before ordering between threads. Release prevents reordering of earlier writes past the store; Acquire prevents reordering of later reads before the load.
-
"What is 'False Sharing' and how do you prevent it in Rust?"
False sharing occurs when threads on different cores modify variables that happen to share the same cache line (typically 64 bytes). Each modification invalidates the entire cache line for all other cores, causing excessive cache coherence traffic. Prevent it by padding structures to align hot variables to cache line boundaries using #[repr(align(64))] or crates like crossbeam-utils::CachePadded.
-
"Why is SeqCst the default, and why is it often slower?"
SeqCst provides the strongest guarantee: a single total ordering visible to all threads. It's the default because it's easiest to reason about. However, it requires expensive memory barriers on most architectures. For many algorithms (like SPSC queues), weaker orderings (Acquire/Release) are sufficient and can be 2-10x faster.
-
"Explain the ABA problem and how to solve it."
The ABA problem occurs when a CAS succeeds because a value changed from A to B and back to A. The CAS sees A and thinks nothing changed, but the context may have changed (e.g., the memory was freed and reallocated). Solutions include: (1) Tagged pointers that combine a pointer with a counter, (2) Hazard pointers that prevent reuse of memory being accessed, (3) Epoch-based reclamation that delays memory reuse.
-
"What's the difference between compare_exchange and compare_exchange_weak?"
compare_exchange_weak may spuriously fail even when the comparison would succeed. This is cheaper on LL/SC architectures (like ARM), where the strong version must hide spurious failures behind its own retry loop. Use weak in loops where you'll retry anyway; use the strong version when spurious failures would cause correctness issues.
-
"Why do we use wrapping_add for the head and tail indices?"
To handle the case when indices overflow usize::MAX. With wrapping arithmetic, indices wrap around to 0, and the difference tail.wrapping_sub(head) still gives the correct queue length (assuming the capacity is much smaller than usize::MAX). This avoids the need to reset indices to 0, which would require additional synchronization.
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Atomic Fundamentals | "Rust Atomics and Locks" by Mara Bos | Ch. 1: Basics of Rust Concurrency |
| Memory Ordering Deep Dive | "Rust Atomics and Locks" by Mara Bos | Ch. 3: Memory Ordering |
| Lock-Free Data Structures | "Rust Atomics and Locks" by Mara Bos | Ch. 9: Lock-Free Linked Lists |
| Concurrency in Rust | "Programming Rust" by Jim Blandy | Ch. 19: Concurrency |
| Cache Architecture | "Computer Systems: A Programmer's Perspective" | Ch. 6: The Memory Hierarchy |
| Hardware Memory Models | "A Primer on Memory Consistency" by Sorin et al. | Chapters 1-4 |
Summary
You have now built one of the most challenging data structures in computer science: a lock-free queue. Along the way, you have mastered:
- The difference between lock-free, wait-free, and blocking algorithms
- Rust's atomic memory orderings and when to use each
- The Compare-and-Swap operation and CAS loops
- The ABA problem and its solutions
- Cache line padding to prevent false sharing
- Ring buffer design with efficient head/tail management
- Verification techniques including Loom
Your queue achieves 18+ million operations per second, dramatically outperforming standard library channels. This is the kind of code that powers game engines, trading systems, and operating system kernels.
You are now equipped to understand and implement the concurrent data structures that underpin modern high-performance systems. The concepts you've learned apply far beyond queues: they're the foundation for lock-free stacks, skip lists, hash maps, and more.
Welcome to the world of lock-free programming. The CPU is now yours.