Project 2: Lock-Free Market Data Handler
Build a high-performance market data distribution system using lock-free queues - zero mutexes, zero blocking, pure atomic operations. Master the dark art of memory ordering that separates HFT engineers from everyone else.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Advanced |
| Time Estimate | 2-3 weeks |
| Languages | Rust (Primary), C++, C, Zig (Alternatives) |
| Prerequisites | Project 1 (Order Book), basic threading, pointer fundamentals |
| Key Topics | Atomics, memory ordering, SPSC queues, false sharing, cache lines |
| Coolness Level | Level 5: Black Belt Systems Programming |
| Portfolio Value | Top-Tier HFT Resume |
1. Learning Objectives
By completing this project, you will:
- Implement a lock-free SPSC queue from scratch: Build a Single-Producer, Single-Consumer queue using only atomic operations - no mutexes, no locks, no blocking
- Master memory ordering semantics: Understand and correctly apply Acquire/Release ordering, and know when Relaxed is safe vs dangerous
- Eliminate false sharing: Diagnose and fix cache line contention between threads that silently destroys performance
- Parse binary protocols efficiently: Implement zero-copy deserialization of market data feeds
- Measure concurrent performance accurately: Build latency histograms with p50/p99/p999 percentiles and understand measurement pitfalls
- Understand the MESI protocol: Know how CPU caches coordinate and why this matters for lock-free code
- Build production-grade infrastructure: Create the kind of component that actually runs in HFT systems
2. Theoretical Foundation
2.1 Core Concepts
What Are Atomics?
An atomic operation is one that appears to happen instantaneously from the perspective of all other threads. There is no intermediate state that other threads can observe.
Non-Atomic Increment (DANGEROUS):
Thread A reads x = 5
Thread B reads x = 5
Thread A writes x = 6
Thread B writes x = 6 <-- Lost update!
Expected: x = 7
Actual: x = 6
Atomic Increment (SAFE):
Thread A: atomic_fetch_add(&x, 1) // x becomes 6, returns 5
Thread B: atomic_fetch_add(&x, 1) // x becomes 7, returns 6
Result: x = 7 (correct!)
In Rust:
use std::sync::atomic::{AtomicU64, Ordering};
let counter = AtomicU64::new(0);
counter.fetch_add(1, Ordering::SeqCst); // Atomic increment
In C++:
#include <atomic>
std::atomic<uint64_t> counter{0};
counter.fetch_add(1, std::memory_order_seq_cst); // Atomic increment
Memory Ordering: The Hidden Dimension
Modern CPUs and compilers reorder operations for performance. This is invisible to single-threaded code but breaks multi-threaded code in subtle ways.
Memory Ordering Problem:
Thread A (Producer): Thread B (Consumer):
data = 42; if (ready.load()) {
ready.store(true); print(data); // May see 0!
}
Without proper ordering, the compiler or CPU might:
1. Reorder ready.store() before data = 42
2. Thread B might see ready=true but data=0 (stale)
Memory Ordering Levels (from strongest to weakest):
┌──────────────────────────────────────────────────────────────────────┐
│ MEMORY ORDERING SPECTRUM │
├──────────────────────────────────────────────────────────────────────┤
│ │
│ SeqCst (Sequential Consistency) │
│ ═══════════════════════════════ │
│ • Total global ordering │
│ • All threads see same order │
│ • Slowest, but easiest to reason about │
│ • Use when: Learning, or when correctness > performance │
│ │
│ ▼ │
│ │
│ Acquire / Release │
│ ════════════════ │
│ • Producer uses Release: "publish" changes │
│ • Consumer uses Acquire: "synchronize with" producer │
│ • Creates happens-before relationship │
│ • Use when: Producer-consumer patterns (most lock-free code) │
│ │
│ ▼ │
│ │
│ Relaxed │
│ ═══════ │
│ • No ordering guarantees │
│ • Only atomicity guaranteed │
│ • Fastest, but dangerous │
│ • Use when: Counters, statistics, or with explicit fences │
│ │
└──────────────────────────────────────────────────────────────────────┘
Acquire/Release Pattern (The Key to Lock-Free):
Producer/Consumer with Acquire/Release:
Producer (Thread A): Consumer (Thread B):
┌───────────────────────┐ ┌───────────────────────┐
│ data[idx] = value; │ │ │
│ // Many stores here │ │ │
│ │ │ │
│ tail.store( │────────────│ head = tail.load( │
│ new_tail, │ syncs │ Ordering::Acquire │
│ Ordering::Release); │ with ───▶ │ ); │
│ │ │ │
│ │ │ // Safe to read data! │
│ │ │ value = data[idx]; │
└───────────────────────┘ └───────────────────────┘
Release "publishes" all prior writes.
Acquire "sees" all writes before the Release.
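To make the diagram concrete, here is a minimal runnable sketch of the same publish pattern in Rust (the DATA/READY names are illustrative; DATA stands in for the queue slot being published):
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};
use std::thread;

static DATA: AtomicU64 = AtomicU64::new(0);
static READY: AtomicBool = AtomicBool::new(false);

fn main() {
    let producer = thread::spawn(|| {
        DATA.store(42, Ordering::Relaxed);     // write the payload first
        READY.store(true, Ordering::Release);  // then publish it
    });
    let consumer = thread::spawn(|| {
        while !READY.load(Ordering::Acquire) {} // spin until the flag is published
        // The Acquire load synchronizes with the producer's Release store, so every
        // write sequenced before that store (including DATA) is visible here.
        assert_eq!(DATA.load(Ordering::Relaxed), 42);
    });
    producer.join().unwrap();
    consumer.join().unwrap();
}
Downgrade the Release/Acquire pair to Relaxed and the assertion is no longer guaranteed, which is exactly the reordering problem described above.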
Cache Coherency and the MESI Protocol
CPUs maintain coherent caches using protocols like MESI. Understanding this is crucial for lock-free performance.
MESI Protocol States:
┌──────────────────────────────────────────────────────────────────────┐
│ MESI STATES │
├─────────────┬────────────────────────────────────────────────────────┤
│ State │ Meaning │
├─────────────┼────────────────────────────────────────────────────────┤
│ M (Modified)│ Only this cache has it, it's been written │
│ │ Must write back before another CPU can read │
│ │ │
│ E (Exclusive)│ Only this cache has it, matches memory │
│ │ Can write without bus traffic │
│ │ │
│ S (Shared) │ Multiple caches have it, matches memory │
│ │ Must invalidate others before writing │
│ │ │
│ I (Invalid) │ Not in cache, or stale copy │
│ │ Must fetch from memory or another cache │
└─────────────┴────────────────────────────────────────────────────────┘
Cache Line Ping-Pong (The Performance Killer):
Core 0 Cache Memory Core 1 Cache
┌────────────┐ ┌──────┐ ┌────────────┐
│ Line X: M │──────▶│ │◀─────────│ Line X: I │
└────────────┘ 1. └──────┘ 2. └────────────┘
▲ │
Core 0 writes, │ │ Core 1 wants to write
Line goes to M │ │ Line X invalidated in Core 0
│ │ Core 0 writes back to memory
│ ▼ Core 1 fetches, goes to M
┌────────────┐ │ ┌────────────┐
│ Line X: I │◀──────┴────│ Line X: M │
└────────────┘ └────────────┘
3. 4.
This "ping-pong" costs 100+ cycles each time!
Happens when two threads write to same cache line (FALSE SHARING)
Cache Lines and False Sharing
A cache line is the smallest unit of data transferred between memory and cache, typically 64 bytes on modern x86.
False sharing occurs when two threads write to different variables that happen to share a cache line:
FALSE SHARING DISASTER:
struct BadQueue {
head: u64, // 8 bytes ─┐
tail: u64, // 8 bytes ├── Same 64-byte cache line!
// ... rest // ─┘
};
Producer writes tail ──┐
├──▶ Cache line bounces between cores
Consumer writes head ──┘ Each write invalidates the other core's cache
Performance: 10x-100x slower!
SOLUTION: Padding to separate cache lines
struct GoodQueue {
head: u64, // 8 bytes
_pad1: [u8; 56], // Pad to 64 bytes
// --- cache line boundary ---
tail: u64, // 8 bytes
_pad2: [u8; 56], // Pad to 64 bytes
// --- cache line boundary ---
buffer: [T; N], // Separate cache lines
};
Now: head on one cache line, tail on another
Producer and consumer operate independently
No false sharing!
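In Rust, one way to express the padding above is an align(64) wrapper; a minimal sketch (the CachePadded name here is illustrative; the crossbeam crate ships a production version under the same name):
use std::sync::atomic::{AtomicU64, Ordering};

// repr(align(64)) forces the wrapper to start on a 64-byte boundary and to
// occupy a full 64 bytes, so neighbouring fields cannot share its cache line.
#[repr(align(64))]
struct CachePadded<T>(T);

struct Counters {
    produced: CachePadded<AtomicU64>, // hammered by the producer thread
    consumed: CachePadded<AtomicU64>, // hammered by the consumer thread
}

fn main() {
    assert!(std::mem::size_of::<CachePadded<AtomicU64>>() >= 64);
    let c = Counters {
        produced: CachePadded(AtomicU64::new(0)),
        consumed: CachePadded(AtomicU64::new(0)),
    };
    c.produced.0.fetch_add(1, Ordering::Relaxed);
    c.consumed.0.fetch_add(1, Ordering::Relaxed);
}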
2.2 Why This Matters for HFT
In HFT systems, mutex contention is a performance killer:
Mutex Contention Impact:
Uncontended Mutex Contended Mutex
───────────────── ────────────────
Time: ~25-50 nanoseconds ~1-10 microseconds
(just lock/unlock) (waiting for other thread)
At 1M messages/second:
• 1 us of contention per message x 1M msg/s = 1 full second of waiting per second of wall clock = IMPOSSIBLE
• You CANNOT process 1M/sec with contended mutexes
Lock-free Queue:
• ~10-50 nanoseconds per operation
• NO WAITING, NO BLOCKING
• 10M+ messages/second achievable
Real-world example:
Market Data Feed Handler Architecture:
Network Thread Strategy Threads
(Producer) (Consumers)
│ │
▼ ▼
┌─────────────┐ ┌─────────────┐
│ Parse feed │ │ Strategy 1 │
│ 500K msgs/s │ │ │
└──────┬──────┘ └──────┬──────┘
│ │
▼ ▼
┌───────────────────────────────────────┐
│ LOCK-FREE QUEUE │
│ │
│ Producer pushes ───▶ Consumer pops │
│ No locks, no blocking, no waiting │
└───────────────────────────────────────┘
│
▼
┌─────────────┐
│ Strategy 2 │
└─────────────┘
If we used mutexes:
• Network thread blocks waiting for strategies
• Strategies block waiting for each other
• Latency spikes of 10-100 microseconds
• Missed market opportunities
With lock-free:
• Network thread NEVER waits
• Each strategy gets data independently
• Consistent ~50 nanosecond latency
• Zero contention
2.3 Historical Context
Evolution of Lock-Free Programming:
Timeline:
1991: Maurice Herlihy proves "wait-free" universality
Any data structure can be made wait-free
(Theoretical breakthrough, impractical at the time)
1996: Maged Michael and Michael Scott publish MS-Queue
First practical lock-free FIFO queue
Still a reference implementation today
Mid-2000s: The C++ committee designs a formal memory model and std::atomic
Standardized in C++11 as the first formal memory model for a mainstream systems language
2010: LMAX builds the Disruptor (open-sourced in 2011)
Lock-free ring buffer for financial trading
Processes 6 million orders/second on a single thread
Becomes the gold standard for HFT
2011: C++11 formalizes memory ordering
acquire, release, seq_cst become standard vocabulary
2015: Rust 1.0 releases with safe atomics
Compiler-enforced memory safety + zero-cost atomics
Makes lock-free programming more accessible
2023: “Rust Atomics and Locks” by Mara Bos
Becomes the definitive guide to lock-free in Rust
Present: Lock-free is standard practice in HFT
Every major trading firm uses lock-free queues
Sub-microsecond latency is the baseline expectation
The LMAX Disruptor Influence:
The Disruptor Pattern (2010-2011):
Traditional Queue: Disruptor Ring Buffer:
┌────────┐ ┌────────┐ ┌────────────────────────┐
│Producer│ │Consumer│ │ Sequence Numbers │
└───┬────┘ └────┬───┘ │ ┌──┬──┬──┬──┬──┐ │
│ │ │ ──│12│13│14│15│16│── │
▼ ▼ │ └──┴──┴──┴──┴──┘ │
┌──────────────────────┐ │ ▲ ▲ │
│ Linked List + Mutex │ │ Consumer Producer │
│ Allocation per push │ │ Cursor Cursor │
│ Cache-hostile │ │ │
└──────────────────────┘ └────────────────────────┘
Disruptor innovations:
1. Pre-allocated ring buffer (no allocation)
2. Sequence numbers instead of pointers
3. Cache line padding eliminates false sharing
4. Mechanical sympathy with CPU architecture
Result: 6 million events/second
100x faster than traditional queues
2.4 Common Misconceptions
Misconception 1: “volatile is the same as atomic”
// WRONG - volatile does NOT make this safe
volatile bool ready = false;
volatile int data = 0;
// Thread A
data = 42;
ready = true; // May be reordered before data=42!
// Thread B
while (!ready);
print(data); // May see 0!
// CORRECT - use atomics with proper ordering
std::atomic<bool> ready{false};
int data = 0;
// Thread A
data = 42;
ready.store(true, std::memory_order_release);
// Thread B
while (!ready.load(std::memory_order_acquire));
print(data); // Guaranteed to see 42!
volatile only prevents compiler optimizations. It does NOT:
- Prevent CPU reordering
- Provide atomicity (a volatile increment is still a separate read and write)
- Create happens-before relationships
Misconception 2: “SeqCst is always correct, so always use it”
SeqCst is safe but expensive:
SeqCst generates:
- Memory fences on x86 (MFENCE or locked instructions)
- Full barriers on ARM
- ~10-50 cycles overhead
Acquire/Release on x86:
- Release is FREE (x86 stores have release semantics)
- Acquire is FREE (x86 loads have acquire semantics)
- Just the atomic operation itself
For SPSC queue:
SeqCst: ~50-100ns per operation
Acq/Rel: ~10-30ns per operation
Know your memory model!
On x86, acquire/release is essentially free.
On ARM, there's a small cost, but still less than SeqCst.
Misconception 3: “Lock-free means faster”
Not always! Lock-free adds complexity:
When mutex wins:
- Low contention (< 2 threads competing)
- Complex operations (many memory accesses)
- Infrequent operations
When lock-free wins:
- High contention (many threads)
- Simple operations (single CAS)
- High frequency (millions per second)
The HFT sweet spot:
SPSC (single producer, single consumer) is the simplest lock-free pattern
Works perfectly for market data distribution
2-4 threads is typical, not 100
Don't use MPMC (multi-producer, multi-consumer) unless you must!
SPSC is 5-10x faster than MPMC.
Misconception 4: “I can debug lock-free code by adding prints”
Printf debugging HIDES concurrency bugs:
The Heisenbug Problem:
- Adding printf changes timing
- Bugs disappear when you look for them
- Bugs reappear in production
Example:
// Bug exists:
data = 42;
ready.store(true, Ordering::Relaxed); // WRONG ORDERING!
// Add debugging:
data = 42;
println!("set data"); // This adds a memory barrier!
ready.store(true, Ordering::Relaxed);
// Bug is now hidden by the barrier inside println!
Correct approach:
1. Use thread sanitizer (ThreadSanitizer, TSan)
2. Use Loom (Rust model checker)
3. Think through orderings on paper FIRST
4. Add asserts that don't change timing
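To make option 2 concrete, a minimal sketch of a Loom test (assumes loom as a dev-dependency; the test explores the interleavings of the flag-publishing pattern instead of relying on lucky timing):
// Cargo.toml (dev-dependencies): loom = "0.7"   <- version is an assumption
#[cfg(test)]
mod loom_tests {
    #[test]
    fn publish_with_release_acquire() {
        loom::model(|| {
            use loom::sync::atomic::{AtomicBool, AtomicU64, Ordering};
            use loom::sync::Arc;
            use loom::thread;

            let data = Arc::new(AtomicU64::new(0));
            let ready = Arc::new(AtomicBool::new(false));
            let (d, r) = (Arc::clone(&data), Arc::clone(&ready));

            let producer = thread::spawn(move || {
                d.store(42, Ordering::Relaxed);
                r.store(true, Ordering::Release); // weaken this and loom can surface the bug
            });

            // Consumer logic runs on the main loom thread.
            if ready.load(Ordering::Acquire) {
                assert_eq!(data.load(Ordering::Relaxed), 42);
            }
            producer.join().unwrap();
        });
    }
}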
Misconception 5: “The ABA problem only matters for linked structures”
ABA Problem:
Thread 1: Read pointer P pointing to A
(preempted)
Thread 2: Pops A and frees it; later a new node reuses A's address and is pushed, so P again holds that address
Thread 1: (resumes) CAS(P, A, new_value)
          Succeeds because the address matches!
          But the node Thread 1 examined is gone; it is now operating on a different object
For ring buffers (our project):
- No ABA problem! Indices are monotonically increasing
- Index 42 will never wrap to mean the same thing twice
- (Unless you process 2^64 messages, which is... unlikely)
ABA matters for:
- Lock-free linked lists (pointer reuse)
- Lock-free stacks (push/pop same node)
- Memory reclamation (freed memory)
Solutions when ABA matters:
- Tagged pointers (include counter with pointer)
- Hazard pointers
- Epoch-based reclamation
3. Project Specification
3.1 What You Will Build
A lock-free market data distribution system that:
- Receives market data from a simulated feed (or real data if available)
- Parses binary market data into structured messages
- Distributes data through an SPSC lock-free queue to consumers
- Measures and reports latency with accurate percentile statistics
- Handles 1M+ messages per second on commodity hardware
- Demonstrates the performance difference between lock-free and mutex-based approaches
3.2 Functional Requirements
| ID | Requirement | Priority |
|---|---|---|
| FR1 | Implement SPSC lock-free queue with atomic operations only | Must Have |
| FR2 | Producer thread pushes market data messages to queue | Must Have |
| FR3 | Consumer thread pops messages and processes them | Must Have |
| FR4 | Queue uses power-of-2 size for efficient modulo | Must Have |
| FR5 | No memory allocation in push/pop hot path | Must Have |
| FR6 | Parse binary market data messages (quote updates) | Must Have |
| FR7 | Report throughput in messages per second | Must Have |
| FR8 | Report latency histogram (p50, p99, p999) | Must Have |
| FR9 | Compare lock-free vs mutex-based queue performance | Must Have |
| FR10 | Handle queue full gracefully (return error, don’t block) | Must Have |
| FR11 | Handle queue empty gracefully (return None, don’t block) | Must Have |
| FR12 | Cache line padding to prevent false sharing | Should Have |
| FR13 | Support multiple consumer threads (broadcast pattern) | Should Have |
| FR14 | Configurable queue size | Should Have |
| FR15 | Graceful shutdown with producer signaling completion | Should Have |
3.3 Non-Functional Requirements
| Requirement | Target | Measurement |
|---|---|---|
| Push Latency | < 50ns (p50), < 200ns (p99) | rdtsc or high-resolution clock |
| Pop Latency | < 50ns (p50), < 200ns (p99) | Same measurement |
| Throughput | > 10M messages/second (single producer) | Sustained 10-second benchmark |
| Memory | < 1KB overhead beyond buffer | Count struct sizes |
| False Sharing | Zero observed | perf stat cache miss comparison |
| Lock-Free | True wait-freedom for producer | Code review (no spin-wait on consumer state) |
3.4 Example Usage / Output
$ ./market_data_handler --benchmark
=== LOCK-FREE MARKET DATA HANDLER ===
Configuration:
Queue size: 65536 slots (64KB buffer)
Message size: 64 bytes
Benchmark duration: 10 seconds
Starting producer thread...
Starting consumer thread...
Live Statistics (updated every second):
[1s] Throughput: 12.4M msg/sec | p50: 32ns | p99: 89ns | p999: 234ns
[2s] Throughput: 12.6M msg/sec | p50: 31ns | p99: 87ns | p999: 221ns
[3s] Throughput: 12.5M msg/sec | p50: 31ns | p99: 88ns | p999: 228ns
...
[10s] Throughput: 12.5M msg/sec | p50: 31ns | p99: 88ns | p999: 226ns
=== BENCHMARK COMPLETE ===
Total messages: 125,432,891
Average throughput: 12.54M msg/sec
Latency percentiles:
p50: 31 ns
p90: 45 ns
p99: 88 ns
p999: 226 ns
max: 1,234 ns
=== MUTEX BASELINE COMPARISON ===
Running same workload with mutex-based queue...
Mutex-based results:
Throughput: 1.2M msg/sec (10x slower!)
Latency p50: 234ns | p99: 12,456ns | p999: 89,234ns
=== ANALYSIS ===
Lock-free speedup: 10.4x throughput
Latency reduction: 7.5x at p50, 141x at p99
=== SAMPLE MARKET DATA ===
Displaying last 5 messages:
AAPL BID 150.25 x 500 | ASK 150.26 x 300 | ts=14:32:01.234567890
MSFT BID 380.10 x 200 | ASK 380.12 x 450 | ts=14:32:01.234567891
GOOGL BID 140.50 x 1000 | ASK 140.52 x 800 | ts=14:32:01.234567892
AMZN BID 178.30 x 300 | ASK 178.32 x 250 | ts=14:32:01.234567893
NVDA BID 480.00 x 150 | ASK 480.05 x 200 | ts=14:32:01.234567894
3.5 Real World Outcome
When complete, you will have:
- A working lock-free SPSC queue that you understand at the atomic level
- Benchmark results proving the performance difference:
=== LOCK-FREE vs MUTEX COMPARISON ===
Metric       | Lock-Free     | Mutex-Based   | Improvement
-------------+---------------+---------------+------------
Throughput   | 12.5M msg/sec | 1.2M msg/sec  | 10.4x
p50 latency  | 31ns          | 234ns         | 7.5x
p99 latency  | 88ns          | 12,456ns      | 141x
p999 latency | 226ns         | 89,234ns      | 395x
- Deep understanding of:
- Why memory ordering matters
- How false sharing kills performance
- What happens at the cache line level
- When to use Acquire/Release vs SeqCst
- Foundation for Project 3 (Matching Engine) which uses lock-free queues between components
4. Solution Architecture
4.1 High-Level Design
LOCK-FREE MARKET DATA HANDLER ARCHITECTURE
┌──────────────────────────────────────────────────────────────────────────────┐
│ SYSTEM OVERVIEW │
├──────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ MARKET DATA SOURCE │ │
│ │ │ │
│ │ ┌─────────────────┐ ┌─────────────────┐ │ │
│ │ │ Network Feed │ OR │ Simulated Feed │ │ │
│ │ │ (UDP Multicast) │ │ (Benchmark) │ │ │
│ │ └────────┬────────┘ └────────┬────────┘ │ │
│ │ │ │ │ │
│ │ └──────────┬───────────┘ │ │
│ │ ▼ │ │
│ └───────────────────────┬─────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ PRODUCER THREAD │ │
│ │ │ │
│ │ ┌───────────────┐ ┌───────────────┐ ┌───────────────┐ │ │
│ │ │ Receive Raw │───▶│ Parse Binary │───▶│ Push to Queue │ │ │
│ │ │ Bytes │ │ Protocol │ │ (Lock-Free) │ │ │
│ │ └───────────────┘ └───────────────┘ └───────┬───────┘ │ │
│ │ │ │ │
│ └──────────────────────────────────────────────────────┼───────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ SPSC LOCK-FREE QUEUE │ │
│ │ │ │
│ │ ┌────────────────────────────────────────────────────────────┐ │ │
│ │ │ RING BUFFER │ │ │
│ │ │ │ │ │
│ │ │ ┌────┬────┬────┬────┬────┬────┬────┬────┐ │ │ │
│ │ │ │ M0 │ M1 │ M2 │ M3 │ M4 │ M5 │ M6 │ M7 │ (Pre-alloc) │ │ │
│ │ │ └────┴────┴────┴────┴────┴────┴────┴────┘ │ │ │
│ │ │ ▲ ▲ │ │ │
│ │ │ │ │ │ │ │
│ │ │ HEAD TAIL │ │ │
│ │ │ (Consumer) (Producer) │ │ │
│ │ │ │ │ │
│ │ │ head: AtomicU64 ─────────── Cache Line 0 ────────────── │ │ │
│ │ │ [padding 56 bytes] │ │ │
│ │ │ tail: AtomicU64 ─────────── Cache Line 1 ────────────── │ │ │
│ │ │ [padding 56 bytes] │ │ │
│ │ │ buffer ───────────────────── Cache Lines 2..N ────────── │ │ │
│ │ │ │ │ │
│ │ └────────────────────────────────────────────────────────────┘ │ │
│ │ │ │
│ └──────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ CONSUMER THREAD(S) │ │
│ │ │ │
│ │ ┌───────────────┐ ┌───────────────┐ ┌───────────────┐ │ │
│ │ │ Pop from │───▶│ Process │───▶│ Update │ │ │
│ │ │ Queue │ │ Message │ │ Statistics │ │ │
│ │ └───────────────┘ └───────────────┘ └───────────────┘ │ │
│ │ │ │
│ └──────────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ STATISTICS COLLECTOR │ │
│ │ │ │
│ │ • Latency Histogram (p50, p99, p999, max) │ │
│ │ • Throughput Counter (messages per second) │ │
│ │ • Cache Miss Counters (via perf) │ │
│ │ │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
└──────────────────────────────────────────────────────────────────────────────┘
4.2 Key Components
| Component | Responsibility | Performance Requirement |
|---|---|---|
| SPSC Queue | Lock-free message passing between threads | < 50ns push/pop |
| Ring Buffer | Pre-allocated storage, no runtime allocation | Zero allocation |
| Producer | Parse feed data, push to queue | Non-blocking push |
| Consumer | Pop messages, process, measure latency | Non-blocking pop |
| Statistics | Accurate latency measurement | Sub-nanosecond precision |
| Binary Parser | Zero-copy deserialization | < 10ns per message |
4.3 Data Structures
SPSC Queue Structure (with Cache Padding)
SPSC Queue Memory Layout:
Offset 0-63: Cache Line 0 (Producer writes, Consumer reads)
┌────────────────────────────────────────────────────────────────┐
│ tail: AtomicU64 (8 bytes) │
│ _pad_tail: [u8; 56] (padding to fill cache line) │
└────────────────────────────────────────────────────────────────┘
Offset 64-127: Cache Line 1 (Consumer writes, Producer reads)
┌────────────────────────────────────────────────────────────────┐
│ head: AtomicU64 (8 bytes) │
│ _pad_head: [u8; 56] (padding to fill cache line) │
└────────────────────────────────────────────────────────────────┘
Offset 128+: Cache Lines 2..N (Data, written by producer, read by consumer)
┌────────────────────────────────────────────────────────────────┐
│ buffer: [Slot<T>; CAPACITY] │
│ │
│ Each Slot: │
│ ┌────────────────────────────────────────────────────┐ │
│ │ data: MaybeUninit<T> │ │
│ │ (Slot size should be power of 2 or cache-aligned) │ │
│ └────────────────────────────────────────────────────┘ │
│ │
└────────────────────────────────────────────────────────────────┘
Capacity: CAPACITY (power of 2, e.g., 65536)
Mask: CAPACITY - 1 (for fast modulo: index & mask)
Rust Implementation Sketch:
use std::sync::atomic::{AtomicU64, Ordering};
use std::cell::UnsafeCell;
use std::mem::MaybeUninit;
const CACHE_LINE_SIZE: usize = 64;
#[repr(C)]
pub struct SpscQueue<T, const N: usize> {
// Cache line 0: Producer writes tail
tail: AtomicU64,
_pad_tail: [u8; CACHE_LINE_SIZE - 8],
// Cache line 1: Consumer writes head
head: AtomicU64,
_pad_head: [u8; CACHE_LINE_SIZE - 8],
// Remaining cache lines: Buffer
buffer: [UnsafeCell<MaybeUninit<T>>; N],
}
// Safety: Only producer accesses tail and buffer writes
// Only consumer accesses head and buffer reads
unsafe impl<T: Send, const N: usize> Send for SpscQueue<T, N> {}
unsafe impl<T: Send, const N: usize> Sync for SpscQueue<T, N> {}
C++ Implementation Sketch:
#include <atomic>
#include <cstdint>
#include <optional>
constexpr size_t CACHE_LINE_SIZE = 64;
template<typename T, size_t N>
class SpscQueue {
static_assert((N & (N - 1)) == 0, "N must be power of 2");
struct alignas(CACHE_LINE_SIZE) PaddedAtomic {
std::atomic<uint64_t> value{0};
char padding[CACHE_LINE_SIZE - sizeof(std::atomic<uint64_t>)];
};
PaddedAtomic tail_; // Producer writes
PaddedAtomic head_; // Consumer writes
alignas(CACHE_LINE_SIZE) T buffer_[N];
public:
static constexpr size_t MASK = N - 1;
bool try_push(const T& item);
std::optional<T> try_pop();
};
Market Data Message Structure
Market Data Message (64 bytes, cache-line aligned):
┌────────────────────────────────────────────────────────────────┐
│ Offset │ Field │ Type │ Size │ Description │
├────────┼────────────────┼───────────┼──────┼───────────────────┤
│ 0 │ timestamp │ u64 │ 8 │ Nanoseconds epoch │
│ 8 │ symbol_id │ u32 │ 4 │ Symbol identifier │
│ 12 │ msg_type │ u8 │ 1 │ Quote/Trade/etc │
│ 13 │ flags │ u8 │ 1 │ Status flags │
│ 14 │ _reserved │ u16 │ 2 │ Alignment │
│ 16 │ bid_price │ i64 │ 8 │ Price in ticks │
│ 24 │ ask_price │ i64 │ 8 │ Price in ticks │
│ 32 │ bid_size │ u32 │ 4 │ Bid quantity │
│ 36 │ ask_size │ u32 │ 4 │ Ask quantity │
│ 40 │ sequence │ u64 │ 8 │ Sequence number │
│ 48 │ _padding │ [u8; 16] │ 16 │ Pad to 64 bytes │
└────────────────────────────────────────────────────────────────┘
Total: 64 bytes (exactly 1 cache line)
Rust:
#[repr(C, align(64))]
#[derive(Clone, Copy)]
pub struct MarketDataMessage {
pub timestamp: u64,
pub symbol_id: u32,
pub msg_type: u8,
pub flags: u8,
_reserved: u16,
pub bid_price: i64,
pub ask_price: i64,
pub bid_size: u32,
pub ask_size: u32,
pub sequence: u64,
_padding: [u8; 16],
}
const _: () = assert!(std::mem::size_of::<MarketDataMessage>() == 64);
const _: () = assert!(std::mem::align_of::<MarketDataMessage>() == 64);
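A hedged sketch of the zero-copy side of the parser for this exact layout (assumes the wire format is little-endian and byte-for-byte identical to the struct above; a real feed handler would also validate sequence numbers and message type):
// Reinterpret 64 wire bytes as a MarketDataMessage in one unaligned read,
// instead of decoding field by field. Every bit pattern is a valid value
// because the struct is repr(C) and contains only plain integers.
fn parse_message(bytes: &[u8]) -> Option<MarketDataMessage> {
    if bytes.len() < std::mem::size_of::<MarketDataMessage>() {
        return None; // truncated packet
    }
    let msg = unsafe {
        std::ptr::read_unaligned(bytes.as_ptr() as *const MarketDataMessage)
    };
    Some(msg)
}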
4.4 Algorithm Overview
Lock-Free SPSC Push (Producer)
PUSH ALGORITHM (Lock-Free):
function try_push(item):
// Step 1: Read current tail (producer owns this, can use Relaxed)
current_tail = tail.load(Relaxed)
// Step 2: Calculate next position
next_tail = current_tail + 1
// Step 3: Check if queue is full
// Load head with Acquire to synchronize with consumer's Release
current_head = head.load(Acquire)
if next_tail - current_head > CAPACITY:
return QUEUE_FULL // Don't block, return error
// Step 4: Write data to buffer
// Safe because we know this slot is available
index = current_tail & MASK
buffer[index] = item
// Step 5: Publish tail with Release
// This "releases" all the writes we just did
// Consumer will "acquire" this and see the data
tail.store(next_tail, Release)
return SUCCESS
Memory Ordering Diagram for Push:
Producer Thread Consumer Thread
─────────────── ───────────────
buffer[idx] = item;
│
│ Release barrier
▼
tail.store(next, Release) ─────synchronizes────▶ tail.load(Acquire)
│
│ Acquire barrier
▼
read buffer[idx]
All writes before Release are visible after Acquire!
Lock-Free SPSC Pop (Consumer)
POP ALGORITHM (Lock-Free):
function try_pop():
// Step 1: Read current head (consumer owns this, can use Relaxed)
current_head = head.load(Relaxed)
// Step 2: Check if queue is empty
// Load tail with Acquire to synchronize with producer's Release
current_tail = tail.load(Acquire)
if current_head >= current_tail:
return QUEUE_EMPTY // Don't block, return None
// Step 3: Read data from buffer
// Safe because producer has published this slot
index = current_head & MASK
item = buffer[index]
// Step 4: Advance head with Release
// This "releases" the slot for the producer to reuse
head.store(current_head + 1, Release)
return item
Full Synchronization Flow:
SPSC Queue Synchronization Timeline:
Producer Queue Consumer
──────── ───── ────────
T=0: Write data to slot 5
│
▼
buffer[5] = msg
│
T=1: Release tail
│
▼
tail.store(6, Release) ───────────────────▶
T=2: Acquire tail
│
▼
if (head < tail)
│
T=3: Read data
│
▼
msg = buffer[5]
│
T=4: Release head
│
▼
◀─────────────────────────────────── head.store(6, Release)
│
T=5: Acquire head
│
▼
slot 5 now available for reuse!
5. Implementation Guide
5.1 Development Environment Setup
Rust Setup:
# Install Rust
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
source $HOME/.cargo/env
# Create project
cargo new lock_free_handler
cd lock_free_handler
# Add dependencies in Cargo.toml
# [dependencies]
# # (none needed for core implementation)
#
# [dev-dependencies]
# criterion = { version = "0.5", features = ["html_reports"] }
# Verify atomic support
rustc --print target-features | grep atomic
# Install for profiling
cargo install flamegraph
C++ Setup:
# Ubuntu/Debian
sudo apt-get update
sudo apt-get install g++ cmake linux-tools-generic
# Verify C++17 support (for std::optional, if constexpr)
g++ --version # Need GCC 7+ or Clang 5+
# Create project structure
mkdir -p lock_free_handler/{src,include,tests,benchmarks}
cd lock_free_handler
Profiling Tools:
# Linux: perf for cache analysis
sudo apt-get install linux-tools-common linux-tools-generic
# Check perf access
echo 0 | sudo tee /proc/sys/kernel/perf_event_paranoid
# ThreadSanitizer (built into GCC/Clang)
# Build with: -fsanitize=thread
# Rust: Miri for undefined behavior
rustup +nightly component add miri
cargo +nightly miri test
5.2 Project Structure
Rust:
lock_free_handler/
├── Cargo.toml
├── src/
│ ├── main.rs # CLI and benchmark entry
│ ├── lib.rs # Library root
│ ├── queue/
│ │ ├── mod.rs # Queue module
│ │ ├── spsc.rs # SPSC lock-free queue
│ │ └── mutex_queue.rs # Mutex baseline for comparison
│ ├── market_data/
│ │ ├── mod.rs # Market data module
│ │ ├── message.rs # Message struct
│ │ └── parser.rs # Binary protocol parser
│ └── stats/
│ ├── mod.rs # Statistics module
│ └── histogram.rs # Latency histogram
├── benches/
│ └── queue_bench.rs # Criterion benchmarks
├── tests/
│ ├── correctness.rs # Functional tests
│ └── concurrent.rs # Multi-threaded tests
└── README.md
C++:
lock_free_handler/
├── CMakeLists.txt
├── include/
│ ├── spsc_queue.hpp # SPSC queue header
│ ├── mutex_queue.hpp # Mutex baseline
│ ├── market_data.hpp # Message definitions
│ └── histogram.hpp # Latency statistics
├── src/
│ ├── main.cpp # Entry point
│ └── benchmark.cpp # Benchmark implementation
├── tests/
│ ├── test_queue.cpp # Queue tests
│ └── test_concurrent.cpp # Threading tests
└── README.md
5.3 The Core Question You’re Answering
“How do you pass data between threads without locks while maintaining correctness and achieving maximum throughput?”
This question requires understanding:
- Atomicity: What operations are atomic at the hardware level?
- Ordering: How do you ensure writes in one thread are visible in another?
- Progress: How do you guarantee the system makes forward progress?
- Performance: How do you avoid hidden costs like false sharing?
The answer is not just “use atomics” but understanding when different orderings are safe and why cache line padding matters.
5.4 Concepts You Must Understand First
| Concept | Self-Assessment Question | Where to Learn |
|---|---|---|
| Atomics | What makes an atomic operation atomic? (Hardware support, not locks) | Rust Atomics Ch. 1-2 |
| Memory Ordering | When is Relaxed safe? When is Acquire/Release needed? | Rust Atomics Ch. 3 |
| Cache Lines | What size is a cache line? What is false sharing? | CS:APP Ch. 6 |
| Ring Buffer | Why must the size be a power of 2? (Fast modulo with & mask) | Any data structures course |
| Producer-Consumer | What is the happens-before relationship? | Rust Atomics Ch. 3 |
| Spin-waiting | What is busy-waiting? When is it acceptable? | OS textbook |
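The "fast modulo with & mask" entry from the table, shown as a tiny check: for a power-of-2 capacity, seq % N and seq & (N - 1) always agree, and the AND needs no division.
fn main() {
    const N: u64 = 65_536;  // power of 2
    const MASK: u64 = N - 1;
    for seq in [0u64, 1, 7, 65_535, 65_536, 1_000_003] {
        assert_eq!(seq % N, seq & MASK); // same index, cheaper instruction
    }
    println!("slot for sequence 65_537 = {}", 65_537u64 & MASK); // prints 1
}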
5.5 Questions to Guide Your Design
Atomics and Ordering:
- Why can’t you use Ordering::Relaxed everywhere?
- What is the difference between Acquire on a load vs Release on a store?
- Why does the producer load head with Acquire?
- Why does the consumer load tail with Acquire?
- What would break if you used Relaxed for the tail store?
Queue Design:
- Why must capacity be a power of 2?
- How do you detect when the queue is full?
- How do you detect when the queue is empty?
- What happens if producer and consumer access the same slot?
- Why don’t you need a lock even though two threads access the buffer?
Performance:
- What is false sharing and how do you prevent it?
- Why put padding between head and tail?
- What cache misses occur during normal operation?
- How does the ring buffer improve cache efficiency vs linked list?
- What is the expected number of atomic operations per push/pop?
Edge Cases:
- What happens when the queue wraps around after 2^64 messages?
- How do you handle shutdown gracefully?
- What if the consumer is slower than the producer?
- What if the producer is slower than the consumer?
5.6 Thinking Exercise
Before writing code, trace through this scenario on paper:
Initial State:
head = 0
tail = 0
buffer = [_, _, _, _, _, _, _, _] (8 slots, empty)
Operations:
1. Producer: push(A)
2. Producer: push(B)
3. Producer: push(C)
4. Consumer: pop() // What is returned?
5. Consumer: pop() // What is returned?
6. Producer: push(D)
7. Producer: push(E)
8. Producer: push(F)
9. Producer: push(G)
10. Producer: push(H)
11. Producer: push(I) // What happens?
12. Consumer: pop()
13. Producer: push(I) // Now what happens?
Questions to answer:
- After step 3, what are head and tail? What is in the buffer?
- After step 5, what are head and tail?
- After step 10, how many items are in the queue? What is returned by step 11?
- Draw the buffer state at each step using indices.
- At which points does the producer need to check head?
- At which points does the consumer need to check tail?
Memory Ordering Exercise:
Consider this buggy implementation:
fn try_push(&self, item: T) -> bool {
let tail = self.tail.load(Ordering::Relaxed);
let head = self.head.load(Ordering::Relaxed); // BUG!
if tail - head >= N {
return false;
}
self.buffer[tail & MASK] = item;
self.tail.store(tail + 1, Ordering::Relaxed); // BUG!
true
}
- What can go wrong with this code?
- Which orderings need to be changed?
- Draw a timeline showing a specific interleaving that causes a bug.
5.7 Hints in Layers
Hint 1 - Starting Direction
Start with a working but slow version using strong ordering everywhere:
Rust:
use std::sync::atomic::{AtomicUsize, Ordering};
use std::cell::UnsafeCell;
pub struct SpscQueue<T, const N: usize> {
head: AtomicUsize,
tail: AtomicUsize,
buffer: [UnsafeCell<Option<T>>; N],
}
impl<T, const N: usize> SpscQueue<T, N> {
pub fn try_push(&self, item: T) -> bool {
let tail = self.tail.load(Ordering::SeqCst);
let head = self.head.load(Ordering::SeqCst);
if tail.wrapping_sub(head) >= N {
return false; // Full
}
unsafe {
*self.buffer[tail % N].get() = Some(item);
}
self.tail.store(tail.wrapping_add(1), Ordering::SeqCst);
true
}
pub fn try_pop(&self) -> Option<T> {
let head = self.head.load(Ordering::SeqCst);
let tail = self.tail.load(Ordering::SeqCst);
if head >= tail {
return None; // Empty
}
let item = unsafe {
(*self.buffer[head % N].get()).take()
};
self.head.store(head.wrapping_add(1), Ordering::SeqCst);
item
}
}
Get this working first, then optimize orderings.
Hint 2 - Correct Orderings
Now switch to efficient orderings:
pub fn try_push(&self, item: T) -> bool {
// We own tail, can use Relaxed to read our own state
let tail = self.tail.load(Ordering::Relaxed);
// Need Acquire to synchronize with consumer's Release of head
// This ensures we see the consumer's updates to slots it has freed
let head = self.head.load(Ordering::Acquire);
if tail.wrapping_sub(head) >= N {
return false;
}
// Write data (no ordering needed, protected by tail publish)
unsafe {
*self.buffer[tail % N].get() = Some(item);
}
// Release: publish our write to the consumer
// Consumer's Acquire will synchronize with this
self.tail.store(tail.wrapping_add(1), Ordering::Release);
true
}
pub fn try_pop(&self) -> Option<T> {
// We own head, can use Relaxed
let head = self.head.load(Ordering::Relaxed);
// Need Acquire to synchronize with producer's Release of tail
// This ensures we see the data the producer wrote
let tail = self.tail.load(Ordering::Acquire);
if head >= tail {
return None;
}
// Read data (protected by tail acquire)
let item = unsafe {
(*self.buffer[head % N].get()).take()
};
// Release: publish that we've freed this slot
// Producer's Acquire of head will synchronize with this
self.head.store(head.wrapping_add(1), Ordering::Release);
item
}
Key insight: The only cross-thread communication is through head and tail.
Data access is protected by proper ordering of these indices.
Hint 3 - Cache Line Padding
Add padding to prevent false sharing:
use std::sync::atomic::{AtomicU64, Ordering};
use std::cell::UnsafeCell;
use std::mem::MaybeUninit;
const CACHE_LINE: usize = 64;
#[repr(C)] // Ensure field order is preserved
pub struct SpscQueue<T, const N: usize> {
// Producer-owned fields (cache line 0)
tail: AtomicU64,
_pad_tail: [u8; CACHE_LINE - 8],
// Consumer-owned fields (cache line 1)
head: AtomicU64,
_pad_head: [u8; CACHE_LINE - 8],
// Shared buffer (separate cache lines)
buffer: Box<[UnsafeCell<MaybeUninit<T>>; N]>,
}
impl<T, const N: usize> SpscQueue<T, N> {
const MASK: usize = N - 1;
pub fn new() -> Self {
assert!(N.is_power_of_two(), "Capacity must be power of 2");
// Use Box to allocate on heap with proper alignment
let buffer: Box<[UnsafeCell<MaybeUninit<T>>; N]> = {
// Safety: slots are MaybeUninit, so leaving them uninitialized is fine;
// the producer writes each slot before the consumer ever reads it
unsafe {
let layout = std::alloc::Layout::new::<[UnsafeCell<MaybeUninit<T>>; N]>();
let ptr = std::alloc::alloc(layout) as *mut [UnsafeCell<MaybeUninit<T>>; N];
if ptr.is_null() {
std::alloc::handle_alloc_error(layout);
}
Box::from_raw(ptr)
}
};
Self {
tail: AtomicU64::new(0),
_pad_tail: [0; CACHE_LINE - 8],
head: AtomicU64::new(0),
_pad_head: [0; CACHE_LINE - 8],
buffer,
}
}
#[inline]
fn index(&self, seq: u64) -> usize {
(seq as usize) & Self::MASK
}
}
The padding ensures:
- tail is on its own cache line (producer writes)
- head is on its own cache line (consumer writes)
- No false sharing between threads!
Hint 4 - Measurement and Debugging
Accurate Latency Measurement (Rust):
use std::time::Instant;
pub struct LatencyHistogram {
buckets: [u64; 1000], // 1ns granularity up to 1000ns
overflow: u64,
count: u64,
}
impl LatencyHistogram {
pub fn record(&mut self, nanos: u64) {
self.count += 1;
if nanos < 1000 {
self.buckets[nanos as usize] += 1;
} else {
self.overflow += 1;
}
}
pub fn percentile(&self, p: f64) -> u64 {
let target = (self.count as f64 * p) as u64;
let mut cumulative = 0u64;
for (ns, &count) in self.buckets.iter().enumerate() {
cumulative += count;
if cumulative >= target {
return ns as u64;
}
}
1000 // Overflow bucket
}
}
// Usage in benchmark:
let start = Instant::now();
queue.try_push(item);
let elapsed = start.elapsed().as_nanos() as u64;
histogram.record(elapsed);
Using rdtsc for sub-nanosecond precision (x86):
#[cfg(target_arch = "x86_64")]
fn rdtsc() -> u64 {
unsafe { core::arch::x86_64::_rdtsc() }
}
// Calibrate once at startup
fn calibrate_tsc() -> f64 {
let start_tsc = rdtsc();
let start_time = Instant::now();
std::thread::sleep(std::time::Duration::from_millis(100));
let end_tsc = rdtsc();
let end_time = Instant::now();
let cycles = end_tsc - start_tsc;
let nanos = end_time.duration_since(start_time).as_nanos();
cycles as f64 / nanos as f64 // cycles per nanosecond
}
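A usage sketch tying the two helpers together (x86_64 only, and it assumes the calibration above is representative and the TSC is invariant on the benchmark machine):
fn main() {
    let cycles_per_ns = calibrate_tsc();  // helper defined above
    let start = rdtsc();
    std::hint::black_box(42u64);          // stand-in for queue.try_push(msg)
    let end = rdtsc();
    let nanos = (end - start) as f64 / cycles_per_ns;
    println!("~{:.1} ns (includes the measurement overhead itself)", nanos);
}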
Debugging with ThreadSanitizer:
# Rust
RUSTFLAGS="-Z sanitizer=thread" cargo +nightly test
# C++
g++ -fsanitize=thread -g -O1 test.cpp -o test
./test
Cache Analysis with perf:
# Measure cache misses
perf stat -e cache-misses,cache-references,L1-dcache-load-misses ./benchmark
# Compare with and without padding:
# Without padding: ~50% cache miss rate
# With padding: ~1% cache miss rate
5.8 The Interview Questions They’ll Ask
After completing this project, you should be able to answer:
- “Explain memory ordering and when you would use Acquire vs Release.”
- Expected: Clear explanation of happens-before, release publishes writes, acquire sees published writes
- Bonus: Explain why x86 gets acquire/release “for free” on loads/stores
- “What is false sharing and how do you prevent it?”
- Expected: Two threads writing to same cache line causes ping-pong
- Bonus: Mention MESI protocol, cache line size (64 bytes), show padding solution
- “Implement a lock-free SPSC queue.”
- Expected: Ring buffer, atomics for head/tail, proper ordering
- Bonus: Discuss wait-freedom, capacity management, wraparound handling
- “What’s the difference between lock-free and wait-free?”
- Expected: Lock-free = system makes progress; wait-free = every thread makes progress
- Bonus: SPSC queue is wait-free for both producer and consumer (no spin-waiting)
- “How do you benchmark concurrent code correctly?”
- Expected: Mention warmup, percentiles not averages, avoid coordination overhead
- Bonus: Discuss TSC vs system clock, measurement overhead, cache warming
- “What is the ABA problem and does it affect your queue?”
- Expected: Explain ABA with CAS, explain why SPSC with indices doesn’t have it
- Bonus: Discuss hazard pointers or epoch-based reclamation for when it matters
- “Why is SeqCst slower than Acquire/Release?”
- Expected: SeqCst requires memory fences, Acquire/Release is free on x86
- Bonus: Explain x86 TSO model, why ARM is different
- “How would you extend this to multiple consumers?”
- Expected: Discuss SPMC broadcast vs work-stealing, mention Disruptor pattern
- Bonus: Explain why MPMC is much harder and when to avoid it
- “What happens if the producer is faster than the consumer?”
- Expected: Queue fills up, producer gets QUEUE_FULL, must handle backpressure
- Bonus: Discuss flow control strategies, adaptive batching
- “How do you test lock-free code for correctness?”
- Expected: ThreadSanitizer, stress tests, Loom (Rust)
- Bonus: Discuss model checking, formal verification, randomized scheduling
5.9 Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Atomics and memory ordering | Rust Atomics and Locks (Mara Bos) | Ch. 1-3 |
| Building lock-free queues | Rust Atomics and Locks | Ch. 4-5 |
| Memory model deep dive | Rust Atomics and Locks | Ch. 6 |
| CPU cache hierarchy | Computer Systems: A Programmer’s Perspective | Ch. 6 |
| C++ atomics | C++ Concurrency in Action (Anthony Williams) | Ch. 5 |
| Lock-free patterns | The Art of Multiprocessor Programming | Ch. 10 |
| LMAX Disruptor | Building Low Latency Applications with C++ | Ch. 7 |
| Performance measurement | Systems Performance (Brendan Gregg) | Ch. 6 |
5.10 Implementation Phases
Phase 1: Mutex Baseline (Days 1-2)
Goals:
- Implement a simple mutex-based queue for baseline comparison
- Create benchmark harness for latency and throughput
- Establish measurement methodology
Tasks:
- Create project structure
- Implement mutex-based queue: pub struct MutexQueue<T> { data: Mutex<VecDeque<T>> } (a fuller sketch follows after this phase)
- Create simple producer/consumer test
- Implement latency histogram
- Run baseline benchmark, record numbers
Checkpoint: Baseline numbers documented (expect ~200-500ns latency)
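A minimal sketch of the Phase 1 baseline queue (the API mirrors the lock-free queue's try_push/try_pop so the same benchmark harness can drive both; the bounded capacity is an assumption to keep the comparison fair):
use std::collections::VecDeque;
use std::sync::Mutex;

pub struct MutexQueue<T> {
    data: Mutex<VecDeque<T>>,
    capacity: usize,
}

impl<T> MutexQueue<T> {
    pub fn new(capacity: usize) -> Self {
        Self { data: Mutex::new(VecDeque::with_capacity(capacity)), capacity }
    }

    // Same contract as the lock-free queue: return false when full rather than
    // blocking the caller (the mutex acquisition itself is what we are measuring).
    pub fn try_push(&self, item: T) -> bool {
        let mut q = self.data.lock().unwrap();
        if q.len() >= self.capacity {
            return false;
        }
        q.push_back(item);
        true
    }

    pub fn try_pop(&self) -> Option<T> {
        self.data.lock().unwrap().pop_front()
    }
}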
Phase 2: Basic Lock-Free SPSC (Days 3-6)
Goals:
- Implement working SPSC queue with SeqCst ordering
- Verify correctness with tests
- Compare performance to baseline
Tasks:
- Implement SPSC queue structure
- Implement try_push with SeqCst
- Implement try_pop with SeqCst
- Write unit tests for edge cases
- Run concurrent stress test
- Run ThreadSanitizer
- Benchmark and compare to baseline
Checkpoint: 3-5x improvement over mutex, all tests pass, no TSan errors
Phase 3: Optimize Memory Ordering (Days 7-9)
Goals:
- Switch to Acquire/Release ordering
- Measure improvement
- Understand why it’s correct
Tasks:
- Change orderings to Acquire/Release
- Document why each ordering is correct (comments!)
- Re-run ThreadSanitizer to verify
- Benchmark improvement (expect 2-3x over SeqCst)
- Profile with perf for any remaining bottlenecks
Checkpoint: ~10x improvement over mutex baseline
Phase 4: Cache Optimization (Days 10-12)
Goals:
- Add cache line padding
- Eliminate false sharing
- Measure cache behavior with perf
Tasks:
- Add padding between head and tail
- Ensure proper alignment with repr(C)
- Measure cache misses before/after with perf
- Optimize message struct layout
- Consider cache line alignment for buffer slots
- Final benchmark
Checkpoint: Cache miss rate < 5%, additional 2-3x improvement
Phase 5: Market Data Integration (Days 13-16)
Goals:
- Add market data message structure
- Implement binary protocol parsing
- Create realistic benchmark scenario
Tasks:
- Define MarketDataMessage struct (cache-aligned)
- Implement zero-copy parsing
- Create simulated market data generator (see the sketch after this phase)
- Run full system benchmark
- Measure end-to-end latency
- Add live statistics output
Checkpoint: Handle 1M+ messages/second with detailed statistics
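A hedged sketch of the simulated feed for this phase (the tiny xorshift generator, the 8-symbol universe, and the tick values are all made-up placeholders; it assumes this code lives in the same module as MarketDataMessage, whose _reserved/_padding fields are private):
// Produces synthetic quote messages for the benchmark producer thread.
pub struct FeedSimulator {
    state: u64,     // xorshift64 state (avoids pulling in a rand crate)
    sequence: u64,
}

impl FeedSimulator {
    pub fn new(seed: u64) -> Self {
        Self { state: seed.max(1), sequence: 0 }
    }

    fn next_u64(&mut self) -> u64 {
        let mut x = self.state;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.state = x;
        x
    }

    pub fn next_message(&mut self, now_nanos: u64) -> MarketDataMessage {
        self.sequence += 1;
        let bid = 150_000 + (self.next_u64() % 100) as i64; // price in ticks
        MarketDataMessage {
            timestamp: now_nanos,
            symbol_id: (self.next_u64() % 8) as u32, // 8 fake symbols
            msg_type: 1,                             // 1 = quote (assumed encoding)
            flags: 0,
            _reserved: 0,
            bid_price: bid,
            ask_price: bid + 1,
            bid_size: 100 + (self.next_u64() % 900) as u32,
            ask_size: 100 + (self.next_u64() % 900) as u32,
            sequence: self.sequence,
            _padding: [0; 16],
        }
    }
}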
Phase 6: Documentation and Polish (Days 17-21)
Goals:
- Document everything learned
- Create comparison report
- Clean up code for portfolio
Tasks:
- Write comprehensive README
- Create benchmark comparison chart
- Add architecture diagrams
- Document memory ordering rationale
- Add inline documentation
- Clean up code style
- Create presentation/demo
Checkpoint: Portfolio-ready with clear documentation
5.11 Key Implementation Decisions
| Decision | Options | Recommendation | Rationale |
|---|---|---|---|
| Index type | usize, u32, u64 | u64 | No wraparound concerns for practical use |
| Buffer storage | Array, Vec, Box | Box<[..]> | Heap allocation with proper alignment |
| Empty slot marker | Option<T>, MaybeUninit<T> | MaybeUninit<T> | Avoid Option overhead for perf-critical path |
| Ordering strategy | SeqCst everywhere, Acquire/Release | Acquire/Release | Much faster on x86, still correct |
| Capacity | Fixed, configurable | Const generic | Compile-time optimization, no runtime checks |
| Full handling | Block, return error, drop oldest | Return error | Never block in HFT; let caller decide |
6. Testing Strategy
6.1 Unit Tests
| Test Category | What to Test |
|---|---|
| Queue empty | Pop from empty queue returns None |
| Queue full | Push to full queue returns false |
| Single push/pop | Push one item, pop returns same item |
| FIFO order | Items come out in order pushed |
| Capacity boundary | Queue holds exactly N items (full when tail - head == N) |
| Wraparound | Sequence numbers wrap correctly |
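Two of the table's cases as concrete tests, a sketch assuming the SpscQueue and new() constructor from the hints in section 5.7 (and the convention above that a queue of capacity N holds N items):
#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn pop_from_empty_returns_none() {
        let q = SpscQueue::<u64, 8>::new();
        assert_eq!(q.try_pop(), None);
    }

    #[test]
    fn fifo_order_and_full_detection() {
        let q = SpscQueue::<u64, 8>::new();
        for i in 0..8u64 {
            assert!(q.try_push(i), "push {} should succeed", i);
        }
        assert!(!q.try_push(99), "ninth push must fail: queue is full");
        for i in 0..8u64 {
            assert_eq!(q.try_pop(), Some(i), "items must come out in FIFO order");
        }
        assert_eq!(q.try_pop(), None, "queue is drained, so it must be empty again");
    }
}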
6.2 Concurrent Tests
use std::sync::Arc;
use std::thread;

#[test]
fn test_concurrent_correctness() {
let queue = Arc::new(SpscQueue::<u64, 1024>::new());
let q_producer = Arc::clone(&queue);
let q_consumer = Arc::clone(&queue);
const NUM_ITEMS: u64 = 1_000_000;
let producer = thread::spawn(move || {
for i in 0..NUM_ITEMS {
while !q_producer.try_push(i) {
// Queue full, spin
}
}
});
let consumer = thread::spawn(move || {
let mut expected = 0u64;
let mut received = 0u64;
while received < NUM_ITEMS {
if let Some(item) = q_consumer.try_pop() {
assert_eq!(item, expected, "Out of order at {}", expected);
expected += 1;
received += 1;
}
}
});
producer.join().unwrap();
consumer.join().unwrap();
}
6.3 Property-Based Tests
Property 1: Every item pushed is eventually popped (no loss)
Property 2: Items are popped in the same order they were pushed (FIFO)
Property 3: No item is popped more than once (no duplication)
Property 4: Queue never exceeds capacity
Property 5: Queue is always either empty, partially filled, or full (consistent state)
6.4 Stress Tests
Stress Test 1: Throughput Saturation
Producer pushes as fast as possible
Consumer pops as fast as possible
Run for 60 seconds
Verify no corruption
Stress Test 2: Unbalanced Load
Producer 10x faster than consumer (queue fills up)
Producer 10x slower than consumer (queue empties)
Verify graceful handling
Stress Test 3: Burst Traffic
Producer sends 10000 messages in burst
Pause
Consumer catches up
Repeat
Verify all messages received
7. Common Pitfalls and Debugging
| Problem | Symptom | Cause | Solution | Verification |
|---|---|---|---|---|
| Data corruption | Wrong values read | Wrong memory ordering | Use Acquire on loads after Release stores | ThreadSanitizer, stress test |
| Missed messages | Consumer doesn’t see all messages | Relaxed ordering on tail store | Use Release on tail store | Count total pushed vs popped |
| False sharing | High latency, poor scaling | head/tail on same cache line | Add 56-byte padding | perf stat cache-misses |
| Full queue deadlock | Producer spins forever | Consumer not running | Return error instead of spin | Never spin-wait on consumer |
| Index overflow | Panic on wraparound | Using usize with subtraction | Use wrapping_sub or u64 | Test near u64::MAX |
| Memory leak | Memory grows unbounded | Not dropping popped items | Ensure take() or ptr::read | Valgrind or miri |
| Uninitialized read | Garbage data | Reading MaybeUninit before write | Proper synchronization | Miri, TSan |
7.1 Debugging Techniques
Visualizing Queue State:
impl<T: std::fmt::Debug, const N: usize> SpscQueue<T, N> {
pub fn debug_state(&self) {
let head = self.head.load(Ordering::SeqCst);
let tail = self.tail.load(Ordering::SeqCst);
println!("Queue State:");
println!(" head: {} (index {})", head, head as usize & Self::MASK);
println!(" tail: {} (index {})", tail, tail as usize & Self::MASK);
println!(" size: {}", tail.wrapping_sub(head));
println!(" full: {}", tail.wrapping_sub(head) >= N as u64);
println!(" empty: {}", head >= tail);
}
}
Detecting Ordering Bugs:
// Add assertions in debug builds
#[cfg(debug_assertions)]
fn try_push(&self, item: T) -> bool {
let tail_before = self.tail.load(Ordering::SeqCst);
// ... push logic ...
let tail_after = self.tail.load(Ordering::SeqCst);
debug_assert!(
tail_after == tail_before || tail_after == tail_before + 1,
"Tail moved unexpectedly: {} -> {}",
tail_before, tail_after
);
true
}
8. Extensions and Challenges
8.1 Beginner Extensions
- Multiple queue sizes: Template/const generic for different capacities
- Bounded latency mode: Spin for limited time, then fail
- Statistics: Track push/pop counts, full/empty counts
- Shutdown signal: Producer can signal “done” to consumer
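For the shutdown-signal extension, a minimal sketch: the producer sets a flag with Release after its final push, and the consumer only stops once it has seen the flag and then drained the queue one last time (the type and method names are illustrative):
use std::sync::atomic::{AtomicBool, Ordering};

pub struct ShutdownFlag {
    done: AtomicBool,
}

impl ShutdownFlag {
    pub fn new() -> Self {
        Self { done: AtomicBool::new(false) }
    }

    // Producer calls this exactly once, after its last successful try_push.
    pub fn signal_done(&self) {
        self.done.store(true, Ordering::Release);
    }

    pub fn is_done(&self) -> bool {
        self.done.load(Ordering::Acquire)
    }
}

// Consumer loop sketch (queue from earlier sections, shown as comments):
//
// loop {
//     if let Some(msg) = queue.try_pop() {
//         process(msg);
//     } else if shutdown.is_done() {
//         // The flag was stored after the last push (Release/Acquire), so one
//         // final drain is guaranteed to observe everything that was pushed.
//         while let Some(msg) = queue.try_pop() { process(msg); }
//         break;
//     } else {
//         std::hint::spin_loop(); // briefly idle, then retry
//     }
// }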
8.2 Intermediate Extensions
- SPMC broadcast: One producer, multiple consumers each get all messages
- Batched operations: Push/pop multiple items atomically for higher throughput
- Memory-mapped queue: Share queue between processes
- Variable-size messages: Handle different message lengths efficiently
8.3 Advanced Extensions
- MPMC queue: Multiple producers, multiple consumers (significantly harder!)
- Epoch-based reclamation: Safe memory reclamation for linked structures
- Hardware timestamps: Use CPU TSC for sub-nanosecond timing
- NUMA awareness: Pin threads and queues to NUMA nodes
- Kernel bypass: Use DPDK/AF_XDP for network to queue directly
8.4 Research Challenges
- Formal verification: Use TLA+ or Iris to prove correctness
- Weak memory model testing: Use Loom to find ARM-specific bugs
- Custom allocator integration: Pool allocator for message buffers
- Compiler barrier analysis: Study generated assembly for each ordering
9. Real-World Connections
9.1 Production Systems
LMAX Disruptor:
- Java-based lock-free ring buffer
- Powers the LMAX Exchange (6M+ orders/second)
- Pioneered many techniques we use here
- Key innovation: sequence barriers for multi-consumer
CppCon Talks (Search for):
- “Lock-Free Programming” - Herb Sutter
- “The Disruptor Pattern” - Martin Thompson
- “The Bits Between the Bits” - Marshall Clow
- “Latency in Financial Trading Systems” - Carl Cook
Facebook’s Folly Library:
- folly::ProducerConsumerQueue - Similar to what we build
- Used in Facebook infrastructure
- Well-tested production code
9.2 How Production Differs
| Aspect | This Project | Production |
|---|---|---|
| Persistence | In-memory only | Write-ahead log |
| Fault tolerance | None | Replication, recovery |
| Monitoring | Printf | Prometheus, Grafana |
| Testing | Unit/stress tests | Chaos engineering |
| Deployment | Local | Kubernetes, containers |
| Network | Simulated | DPDK/kernel bypass |
9.3 Open Source References
| Project | Description | What to Learn |
|---|---|---|
| crossbeam-deque | Rust concurrent deques | Lock-free queue patterns |
| folly | Facebook C++ library | ProducerConsumerQueue |
| liblfds | Lock-free data structures in C | Many patterns, well-tested |
| LMAX Disruptor | Java ring buffer | Architecture patterns |
| moodycamel | C++ lock-free queue | MPMC implementation |
10. Resources
10.1 Essential Reading
- “Rust Atomics and Locks” by Mara Bos
- THE definitive guide for this project
- Chapters 1-6 cover everything you need
- Examples in Rust that translate to C++
- “Computer Systems: A Programmer’s Perspective” by Bryant & O’Hallaron
- Chapter 6: Memory hierarchy and caching
- Essential for understanding false sharing
10.2 Articles and Papers
- “Memory Barriers: a Hardware View for Software Hackers” - Paul McKenney
- Deep dive into why memory barriers exist
- Hardware perspective on ordering
- “Lock-Free Data Structures” - CMU lecture
- Academic introduction
- Covers ABA problem, hazard pointers
10.3 Videos
- CppCon 2017: Fedor Pikus “C++ atomics, from basic to advanced”
- CppCon 2014: Herb Sutter “Lock-Free Programming (or, Juggling Razor Blades)”
- LMAX Exchange: “How we built a trading system”
10.4 Tools
- ThreadSanitizer (TSan): Built into Clang/GCC, catches data races
- Loom: Rust model checker for concurrency
- perf: Linux profiler for cache analysis
- Miri: Rust undefined behavior detector
11. Self-Assessment Checklist
Understanding
- I can explain what memory ordering is and why it matters
- I can draw the Acquire/Release synchronization pattern
- I understand why false sharing hurts performance
- I know when Relaxed ordering is safe
- I can explain the difference between lock-free and wait-free
- I understand why the queue capacity must be a power of 2
Implementation
- My SPSC queue passes all correctness tests
- ThreadSanitizer reports no data races
- I achieve < 100ns latency at p99
- I handle queue full/empty without blocking
- My code has cache line padding for head/tail
- I can benchmark and measure accurately
Growth
- I profiled cache misses before and after optimization
- I can explain every memory ordering choice in my code
- I compared my implementation to reference implementations
- I could explain my design in a technical interview
- I understand how to extend this to MPMC (even if I haven’t implemented it)
12. Submission / Completion Criteria
Minimum Viable Completion
- SPSC queue implements try_push and try_pop
- Uses atomic operations with some form of ordering
- Basic CLI demonstrates producer/consumer
- At least 5 unit tests pass
- Code compiles without warnings
Full Completion
- All unit tests and concurrent tests pass
- ThreadSanitizer reports no errors
- Acquire/Release ordering correctly applied
- Cache line padding prevents false sharing
- Latency p99 < 200ns
- Throughput > 1M messages/second
- Comparison with mutex baseline documented
Excellence (Going Above and Beyond)
- Latency p99 < 100ns
- Throughput > 10M messages/second
- Market data parsing integrated
- Latency histogram with p50/p99/p999/max
- Cache miss analysis with perf
- SPMC broadcast extension implemented
- Presentation-ready documentation
This guide was expanded from HIGH_FREQUENCY_TRADING_CPP_RUST_LEARNING_PROJECTS_SUMMARY.md. For the complete learning path, see the project index.