Project 2: Lock-Free Market Data Handler
Build a high-performance market data distribution system using lock-free queues - zero mutexes, zero blocking, pure atomic operations. Master the dark art of memory ordering that separates HFT engineers from everyone else.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Advanced |
| Time Estimate | 2-3 weeks |
| Languages | Rust (Primary), C++, C, Zig (Alternatives) |
| Prerequisites | Project 1 (Order Book), basic threading, pointer fundamentals |
| Key Topics | Atomics, memory ordering, SPSC queues, false sharing, cache lines |
| Coolness Level | Level 5: Black Belt Systems Programming |
| Portfolio Value | Top-Tier HFT Resume |
1. Learning Objectives
By completing this project, you will:
- Implement a lock-free SPSC queue from scratch: Build a Single-Producer, Single-Consumer queue using only atomic operations - no mutexes, no locks, no blocking
- Master memory ordering semantics: Understand and correctly apply Acquire/Release ordering, and know when Relaxed is safe vs dangerous
- Eliminate false sharing: Diagnose and fix cache line contention between threads that silently destroys performance
- Parse binary protocols efficiently: Implement zero-copy deserialization of market data feeds
- Measure concurrent performance accurately: Build latency histograms with p50/p99/p999 percentiles and understand measurement pitfalls
- Understand the MESI protocol: Know how CPU caches coordinate and why this matters for lock-free code
- Build production-grade infrastructure: Create the kind of component that actually runs in HFT systems
2. Theoretical Foundation
2.1 Core Concepts
What Are Atomics?
An atomic operation is one that appears to happen instantaneously from the perspective of all other threads. There is no intermediate state that other threads can observe.
Non-Atomic Increment (DANGEROUS):
Thread A reads x = 5
Thread B reads x = 5
Thread A writes x = 6
Thread B writes x = 6 <-- Lost update!
Expected: x = 7
Actual: x = 6
Atomic Increment (SAFE):
Thread A: atomic_fetch_add(&x, 1) // x becomes 6, returns 5
Thread B: atomic_fetch_add(&x, 1) // x becomes 7, returns 6
Result: x = 7 (correct!)
In Rust:
use std::sync::atomic::{AtomicU64, Ordering};
let counter = AtomicU64::new(0);
counter.fetch_add(1, Ordering::SeqCst); // Atomic increment
In C++:
#include <atomic>
std::atomic<uint64_t> counter{0};
counter.fetch_add(1, std::memory_order_seq_cst); // Atomic increment
Memory Ordering: The Hidden Dimension
Modern CPUs and compilers reorder operations for performance. This is invisible to single-threaded code but breaks multi-threaded code in subtle ways.
Memory Ordering Problem:
Thread A (Producer): Thread B (Consumer):
data = 42; if (ready.load()) {
ready.store(true); print(data); // May see 0!
}
Without proper ordering, the compiler or CPU might:
1. Reorder ready.store() before data = 42
2. Thread B might see ready=true but data=0 (stale)
Memory Ordering Levels (from strongest to weakest):
┌──────────────────────────────────────────────────────────────────────┐
│ MEMORY ORDERING SPECTRUM │
├──────────────────────────────────────────────────────────────────────┤
│ │
│ SeqCst (Sequential Consistency) │
│ ═══════════════════════════════ │
│ • Total global ordering │
│ • All threads see same order │
│ • Slowest, but easiest to reason about │
│ • Use when: Learning, or when correctness > performance │
│ │
│ ▼ │
│ │
│ Acquire / Release │
│ ════════════════ │
│ • Producer uses Release: "publish" changes │
│ • Consumer uses Acquire: "synchronize with" producer │
│ • Creates happens-before relationship │
│ • Use when: Producer-consumer patterns (most lock-free code) │
│ │
│ ▼ │
│ │
│ Relaxed │
│ ═══════ │
│ • No ordering guarantees │
│ • Only atomicity guaranteed │
│ • Fastest, but dangerous │
│ • Use when: Counters, statistics, or with explicit fences │
│ │
└──────────────────────────────────────────────────────────────────────┘
Acquire/Release Pattern (The Key to Lock-Free):
Producer/Consumer with Acquire/Release:
Producer (Thread A): Consumer (Thread B):
┌───────────────────────┐ ┌───────────────────────┐
│ data[idx] = value; │ │ │
│ // Many stores here │ │ │
│ │ │ │
│ tail.store( │────────────│ head = tail.load( │
│ new_tail, │ syncs │ Ordering::Acquire │
│ Ordering::Release); │ with ───▶ │ ); │
│ │ │ │
│ │ │ // Safe to read data! │
│ │ │ value = data[idx]; │
└───────────────────────┘ └───────────────────────┘
Release "publishes" all prior writes.
Acquire "sees" all writes before the Release.
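To make the diagram concrete, here is a minimal runnable sketch of the same publish pattern in Rust (the DATA/READY names are illustrative; DATA stands in for the queue slot being published):
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};
use std::thread;

static DATA: AtomicU64 = AtomicU64::new(0);
static READY: AtomicBool = AtomicBool::new(false);

fn main() {
    let producer = thread::spawn(|| {
        DATA.store(42, Ordering::Relaxed);     // write the payload first
        READY.store(true, Ordering::Release);  // then publish it
    });
    let consumer = thread::spawn(|| {
        while !READY.load(Ordering::Acquire) {} // spin until the flag is published
        // The Acquire load synchronizes with the producer's Release store, so every
        // write sequenced before that store (including DATA) is visible here.
        assert_eq!(DATA.load(Ordering::Relaxed), 42);
    });
    producer.join().unwrap();
    consumer.join().unwrap();
}
Downgrade the Release/Acquire pair to Relaxed and the assertion is no longer guaranteed, which is exactly the reordering problem described above.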
Cache Coherency and the MESI Protocol
CPUs maintain coherent caches using protocols like MESI. Understanding this is crucial for lock-free performance.
MESI Protocol States:
┌──────────────────────────────────────────────────────────────────────┐
│ MESI STATES │
├─────────────┬────────────────────────────────────────────────────────┤
│ State │ Meaning │
├─────────────┼────────────────────────────────────────────────────────┤
│ M (Modified)│ Only this cache has it, it's been written │
│ │ Must write back before another CPU can read │
│ │ │
│ E (Exclusive)│ Only this cache has it, matches memory │
│ │ Can write without bus traffic │
│ │ │
│ S (Shared) │ Multiple caches have it, matches memory │
│ │ Must invalidate others before writing │
│ │ │
│ I (Invalid) │ Not in cache, or stale copy │
│ │ Must fetch from memory or another cache │
└─────────────┴────────────────────────────────────────────────────────┘
Cache Line Ping-Pong (The Performance Killer):
Core 0 Cache Memory Core 1 Cache
┌────────────┐ ┌──────┐ ┌────────────┐
│ Line X: M │──────▶│ │◀─────────│ Line X: I │
└────────────┘ 1. └──────┘ 2. └────────────┘
▲ │
Core 0 writes, │ │ Core 1 wants to write
Line goes to M │ │ Line X invalidated in Core 0
│ │ Core 0 writes back to memory
│ ▼ Core 1 fetches, goes to M
┌────────────┐ │ ┌────────────┐
│ Line X: I │◀──────┴────│ Line X: M │
└────────────┘ └────────────┘
3. 4.
This "ping-pong" costs 100+ cycles each time!
Happens when two threads write to same cache line (FALSE SHARING)
Cache Lines and False Sharing
A cache line is the smallest unit of data transferred between memory and cache, typically 64 bytes on modern x86.
False sharing occurs when two threads write to different variables that happen to share a cache line:
FALSE SHARING DISASTER:
struct BadQueue {
head: u64, // 8 bytes ─┐
tail: u64, // 8 bytes ├── Same 64-byte cache line!
// ... rest // ─┘
};
Producer writes tail ──┐
├──▶ Cache line bounces between cores
Consumer writes head ──┘ Each write invalidates the other core's cache
Performance: 10x-100x slower!
SOLUTION: Padding to separate cache lines
struct GoodQueue {
head: u64, // 8 bytes
_pad1: [u8; 56], // Pad to 64 bytes
// --- cache line boundary ---
tail: u64, // 8 bytes
_pad2: [u8; 56], // Pad to 64 bytes
// --- cache line boundary ---
buffer: [T; N], // Separate cache lines
};
Now: head on one cache line, tail on another
Producer and consumer operate independently
No false sharing!
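In Rust, one way to express the padding above is an align(64) wrapper; a minimal sketch (the CachePadded name here is illustrative; the crossbeam crate ships a production version under the same name):
use std::sync::atomic::{AtomicU64, Ordering};

// repr(align(64)) forces the wrapper to start on a 64-byte boundary and to
// occupy a full 64 bytes, so neighbouring fields cannot share its cache line.
#[repr(align(64))]
struct CachePadded<T>(T);

struct Counters {
    produced: CachePadded<AtomicU64>, // hammered by the producer thread
    consumed: CachePadded<AtomicU64>, // hammered by the consumer thread
}

fn main() {
    assert!(std::mem::size_of::<CachePadded<AtomicU64>>() >= 64);
    let c = Counters {
        produced: CachePadded(AtomicU64::new(0)),
        consumed: CachePadded(AtomicU64::new(0)),
    };
    c.produced.0.fetch_add(1, Ordering::Relaxed);
    c.consumed.0.fetch_add(1, Ordering::Relaxed);
}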
2.2 Why This Matters for HFT
In HFT systems, mutex contention is a performance killer:
Mutex Contention Impact:
Uncontended Mutex Contended Mutex
───────────────── ────────────────
Time: ~25-50 nanoseconds ~1-10 microseconds
(just lock/unlock) (waiting for other thread)
At 1M messages/second:
• 1 us of contention per message x 1M msg/s = 1 full second of waiting per second of wall clock = IMPOSSIBLE
• You CANNOT process 1M/sec with contended mutexes
Lock-free Queue:
• ~10-50 nanoseconds per operation
• NO WAITING, NO BLOCKING
• 10M+ messages/second achievable
Real-world example:
Market Data Feed Handler Architecture:
Network Thread Strategy Threads
(Producer) (Consumers)
│ │
▼ ▼
┌─────────────┐ ┌─────────────┐
│ Parse feed │ │ Strategy 1 │
│ 500K msgs/s │ │ │
└──────┬──────┘ └──────┬──────┘
│ │
▼ ▼
┌───────────────────────────────────────┐
│ LOCK-FREE QUEUE │
│ │
│ Producer pushes ───▶ Consumer pops │
│ No locks, no blocking, no waiting │
└───────────────────────────────────────┘
│
▼
┌─────────────┐
│ Strategy 2 │
└─────────────┘
If we used mutexes:
• Network thread blocks waiting for strategies
• Strategies block waiting for each other
• Latency spikes of 10-100 microseconds
• Missed market opportunities
With lock-free:
• Network thread NEVER waits
• Each strategy gets data independently
• Consistent ~50 nanosecond latency
• Zero contention
2.3 Historical Context
Evolution of Lock-Free Programming:
Timeline:
1991: Maurice Herlihy proves "wait-free" universality
Any data structure can be made wait-free
(Theoretical breakthrough, impractical at the time)
1996: Maged Michael and Michael Scott publish MS-Queue
First practical lock-free FIFO queue
Still a reference implementation today
Mid-2000s: The C++ committee designs a formal memory model and std::atomic
Standardized in C++11 as the first formal memory model for a mainstream systems language
2010: LMAX builds the Disruptor (open-sourced in 2011)
Lock-free ring buffer for financial trading
Processes 6 million orders/second on a single thread
Becomes the gold standard for HFT
2011: C++11 formalizes memory ordering
acquire, release, seq_cst become standard vocabulary
2015: Rust 1.0 releases with safe atomics
Compiler-enforced memory safety + zero-cost atomics
Makes lock-free programming more accessible
2023: “Rust Atomics and Locks” by Mara Bos
Becomes the definitive guide to lock-free in Rust
Present: Lock-free is standard practice in HFT
Every major trading firm uses lock-free queues
Sub-microsecond latency is the baseline expectation
The LMAX Disruptor Influence:
The Disruptor Pattern (2010-2011):
Traditional Queue: Disruptor Ring Buffer:
┌────────┐ ┌────────┐ ┌────────────────────────┐
│Producer│ │Consumer│ │ Sequence Numbers │
└───┬────┘ └────┬───┘ │ ┌──┬──┬──┬──┬──┐ │
│ │ │ ──│12│13│14│15│16│── │
▼ ▼ │ └──┴──┴──┴──┴──┘ │
┌──────────────────────┐ │ ▲ ▲ │
│ Linked List + Mutex │ │ Consumer Producer │
│ Allocation per push │ │ Cursor Cursor │
│ Cache-hostile │ │ │
└──────────────────────┘ └────────────────────────┘
Disruptor innovations:
1. Pre-allocated ring buffer (no allocation)
2. Sequence numbers instead of pointers
3. Cache line padding eliminates false sharing
4. Mechanical sympathy with CPU architecture
Result: 6 million events/second
100x faster than traditional queues
2.4 Common Misconceptions
Misconception 1: “volatile is the same as atomic”
// WRONG - volatile does NOT make this safe
volatile bool ready = false;
volatile int data = 0;
// Thread A
data = 42;
ready = true; // May be reordered before data=42!
// Thread B
while (!ready);
print(data); // May see 0!
// CORRECT - use atomics with proper ordering
std::atomic<bool> ready{false};
int data = 0;
// Thread A
data = 42;
ready.store(true, std::memory_order_release);
// Thread B
while (!ready.load(std::memory_order_acquire));
print(data); // Guaranteed to see 42!
volatile only prevents compiler optimizations. It does NOT:
- Prevent CPU reordering
- Provide atomicity (a volatile increment is still a separate read and write)
- Create happens-before relationships
Misconception 2: “SeqCst is always correct, so always use it”
SeqCst is safe but expensive:
SeqCst generates:
- Memory fences on x86 (MFENCE or locked instructions)
- Full barriers on ARM
- ~10-50 cycles overhead
Acquire/Release on x86:
- Release is FREE (x86 stores have release semantics)
- Acquire is FREE (x86 loads have acquire semantics)
- Just the atomic operation itself
For SPSC queue:
SeqCst: ~50-100ns per operation
Acq/Rel: ~10-30ns per operation
Know your memory model!
On x86, acquire/release is essentially free.
On ARM, there's a small cost, but still less than SeqCst.
Misconception 3: “Lock-free means faster”
Not always! Lock-free adds complexity:
When mutex wins:
- Low contention (< 2 threads competing)
- Complex operations (many memory accesses)
- Infrequent operations
When lock-free wins:
- High contention (many threads)
- Simple operations (single CAS)
- High frequency (millions per second)
The HFT sweet spot:
SPSC (single producer, single consumer) is the simplest lock-free pattern
Works perfectly for market data distribution
2-4 threads is typical, not 100
Don't use MPMC (multi-producer, multi-consumer) unless you must!
SPSC is 5-10x faster than MPMC.
Misconception 4: “I can debug lock-free code by adding prints”
Printf debugging HIDES concurrency bugs:
The Heisenbug Problem:
- Adding printf changes timing
- Bugs disappear when you look for them
- Bugs reappear in production
Example:
// Bug exists:
data = 42;
ready.store(true, Ordering::Relaxed); // WRONG ORDERING!
// Add debugging:
data = 42;
println!("set data"); // This adds a memory barrier!
ready.store(true, Ordering::Relaxed);
// Bug is now hidden by the barrier inside println!
Correct approach:
1. Use thread sanitizer (ThreadSanitizer, TSan)
2. Use Loom (Rust model checker)
3. Think through orderings on paper FIRST
4. Add asserts that don't change timing
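To make option 2 concrete, a minimal sketch of a Loom test (assumes loom as a dev-dependency; the test explores the interleavings of the flag-publishing pattern instead of relying on lucky timing):
// Cargo.toml (dev-dependencies): loom = "0.7"   <- version is an assumption
#[cfg(test)]
mod loom_tests {
    #[test]
    fn publish_with_release_acquire() {
        loom::model(|| {
            use loom::sync::atomic::{AtomicBool, AtomicU64, Ordering};
            use loom::sync::Arc;
            use loom::thread;

            let data = Arc::new(AtomicU64::new(0));
            let ready = Arc::new(AtomicBool::new(false));
            let (d, r) = (Arc::clone(&data), Arc::clone(&ready));

            let producer = thread::spawn(move || {
                d.store(42, Ordering::Relaxed);
                r.store(true, Ordering::Release); // weaken this and loom can surface the bug
            });

            // Consumer logic runs on the main loom thread.
            if ready.load(Ordering::Acquire) {
                assert_eq!(data.load(Ordering::Relaxed), 42);
            }
            producer.join().unwrap();
        });
    }
}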
Misconception 5: “The ABA problem only matters for linked structures”
ABA Problem:
Thread 1: Read pointer P pointing to A
(preempted)
Thread 2: Pops A and frees it; later a new node reuses A's address and is pushed, so P again holds that address
Thread 1: (resumes) CAS(P, A, new_value)
          Succeeds because the address matches!
          But the node Thread 1 examined is gone; it is now operating on a different object
For ring buffers (our project):
- No ABA problem! Indices are monotonically increasing
- Index 42 will never wrap to mean the same thing twice
- (Unless you process 2^64 messages, which is... unlikely)
ABA matters for:
- Lock-free linked lists (pointer reuse)
- Lock-free stacks (push/pop same node)
- Memory reclamation (freed memory)
Solutions when ABA matters:
- Tagged pointers (include counter with pointer)
- Hazard pointers
- Epoch-based reclamation
3. Project Specification
3.1 What You Will Build
A lock-free market data distribution system that:
- Receives market data from a simulated feed (or real data if available)
- Parses binary market data into structured messages
- Distributes data through an SPSC lock-free queue to consumers
- Measures and reports latency with accurate percentile statistics
- Handles 1M+ messages per second on commodity hardware
- Demonstrates the performance difference between lock-free and mutex-based approaches
3.2 Functional Requirements
| ID | Requirement | Priority |
|---|---|---|
| FR1 | Implement SPSC lock-free queue with atomic operations only | Must Have |
| FR2 | Producer thread pushes market data messages to queue | Must Have |
| FR3 | Consumer thread pops messages and processes them | Must Have |
| FR4 | Queue uses power-of-2 size for efficient modulo | Must Have |
| FR5 | No memory allocation in push/pop hot path | Must Have |
| FR6 | Parse binary market data messages (quote updates) | Must Have |
| FR7 | Report throughput in messages per second | Must Have |
| FR8 | Report latency histogram (p50, p99, p999) | Must Have |
| FR9 | Compare lock-free vs mutex-based queue performance | Must Have |
| FR10 | Handle queue full gracefully (return error, don’t block) | Must Have |
| FR11 | Handle queue empty gracefully (return None, don’t block) | Must Have |
| FR12 | Cache line padding to prevent false sharing | Should Have |
| FR13 | Support multiple consumer threads (broadcast pattern) | Should Have |
| FR14 | Configurable queue size | Should Have |
| FR15 | Graceful shutdown with producer signaling completion | Should Have |
3.3 Non-Functional Requirements
| Requirement | Target | Measurement |
|---|---|---|
| Push Latency | < 50ns (p50), < 200ns (p99) | rdtsc or high-resolution clock |
| Pop Latency | < 50ns (p50), < 200ns (p99) | Same measurement |
| Throughput | > 10M messages/second (single producer) | Sustained 10-second benchmark |
| Memory | < 1KB overhead beyond buffer | Count struct sizes |
| False Sharing | Zero observed | perf stat cache miss comparison |
| Lock-Free | True wait-freedom for producer | Code review (no spin-wait on consumer state) |
3.4 Example Usage / Output
$ ./market_data_handler --benchmark
=== LOCK-FREE MARKET DATA HANDLER ===
Configuration:
Queue size: 65536 slots (64KB buffer)
Message size: 64 bytes
Benchmark duration: 10 seconds
Starting producer thread...
Starting consumer thread...
Live Statistics (updated every second):
[1s] Throughput: 12.4M msg/sec | p50: 32ns | p99: 89ns | p999: 234ns
[2s] Throughput: 12.6M msg/sec | p50: 31ns | p99: 87ns | p999: 221ns
[3s] Throughput: 12.5M msg/sec | p50: 31ns | p99: 88ns | p999: 228ns
...
[10s] Throughput: 12.5M msg/sec | p50: 31ns | p99: 88ns | p999: 226ns
=== BENCHMARK COMPLETE ===
Total messages: 125,432,891
Average throughput: 12.54M msg/sec
Latency percentiles:
p50: 31 ns
p90: 45 ns
p99: 88 ns
p999: 226 ns
max: 1,234 ns
=== MUTEX BASELINE COMPARISON ===
Running same workload with mutex-based queue...
Mutex-based results:
Throughput: 1.2M msg/sec (10x slower!)
Latency p50: 234ns | p99: 12,456ns | p999: 89,234ns
=== ANALYSIS ===
Lock-free speedup: 10.4x throughput
Latency reduction: 7.5x at p50, 141x at p99
=== SAMPLE MARKET DATA ===
Displaying last 5 messages:
AAPL BID 150.25 x 500 | ASK 150.26 x 300 | ts=14:32:01.234567890
MSFT BID 380.10 x 200 | ASK 380.12 x 450 | ts=14:32:01.234567891
GOOGL BID 140.50 x 1000 | ASK 140.52 x 800 | ts=14:32:01.234567892
AMZN BID 178.30 x 300 | ASK 178.32 x 250 | ts=14:32:01.234567893
NVDA BID 480.00 x 150 | ASK 480.05 x 200 | ts=14:32:01.234567894
3.5 Real World Outcome
When complete, you will have:
- A working lock-free SPSC queue that you understand at the atomic level
- Benchmark results proving the performance difference:
=== LOCK-FREE vs MUTEX COMPARISON ===
Metric       | Lock-Free     | Mutex-Based   | Improvement
-------------+---------------+---------------+------------
Throughput   | 12.5M msg/sec | 1.2M msg/sec  | 10.4x
p50 latency  | 31ns          | 234ns         | 7.5x
p99 latency  | 88ns          | 12,456ns      | 141x
p999 latency | 226ns         | 89,234ns      | 395x
- Deep understanding of:
- Why memory ordering matters
- How false sharing kills performance
- What happens at the cache line level
- When to use Acquire/Release vs SeqCst
- Foundation for Project 3 (Matching Engine) which uses lock-free queues between components
4. Solution Architecture
4.1 High-Level Design
LOCK-FREE MARKET DATA HANDLER ARCHITECTURE
┌──────────────────────────────────────────────────────────────────────────────┐
│ SYSTEM OVERVIEW │
├──────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ MARKET DATA SOURCE │ │
│ │ │ │
│ │ ┌─────────────────┐ ┌─────────────────┐ │ │
│ │ │ Network Feed │ OR │ Simulated Feed │ │ │
│ │ │ (UDP Multicast) │ │ (Benchmark) │ │ │
│ │ └────────┬────────┘ └────────┬────────┘ │ │
│ │ │ │ │ │
│ │ └──────────┬───────────┘ │ │
│ │ ▼ │ │
│ └───────────────────────┬─────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ PRODUCER THREAD │ │
│ │ │ │
│ │ ┌───────────────┐ ┌───────────────┐ ┌───────────────┐ │ │
│ │ │ Receive Raw │───▶│ Parse Binary │───▶│ Push to Queue │ │ │
│ │ │ Bytes │ │ Protocol │ │ (Lock-Free) │ │ │
│ │ └───────────────┘ └───────────────┘ └───────┬───────┘ │ │
│ │ │ │ │
│ └──────────────────────────────────────────────────────┼───────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ SPSC LOCK-FREE QUEUE │ │
│ │ │ │
│ │ ┌────────────────────────────────────────────────────────────┐ │ │
│ │ │ RING BUFFER │ │ │
│ │ │ │ │ │
│ │ │ ┌────┬────┬────┬────┬────┬────┬────┬────┐ │ │ │
│ │ │ │ M0 │ M1 │ M2 │ M3 │ M4 │ M5 │ M6 │ M7 │ (Pre-alloc) │ │ │
│ │ │ └────┴────┴────┴────┴────┴────┴────┴────┘ │ │ │
│ │ │ ▲ ▲ │ │ │
│ │ │ │ │ │ │ │
│ │ │ HEAD TAIL │ │ │
│ │ │ (Consumer) (Producer) │ │ │
│ │ │ │ │ │
│ │ │ head: AtomicU64 ─────────── Cache Line 0 ────────────── │ │ │
│ │ │ [padding 56 bytes] │ │ │
│ │ │ tail: AtomicU64 ─────────── Cache Line 1 ────────────── │ │ │
│ │ │ [padding 56 bytes] │ │ │
│ │ │ buffer ───────────────────── Cache Lines 2..N ────────── │ │ │
│ │ │ │ │ │
│ │ └────────────────────────────────────────────────────────────┘ │ │
│ │ │ │
│ └──────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ CONSUMER THREAD(S) │ │
│ │ │ │
│ │ ┌───────────────┐ ┌───────────────┐ ┌───────────────┐ │ │
│ │ │ Pop from │───▶│ Process │───▶│ Update │ │ │
│ │ │ Queue │ │ Message │ │ Statistics │ │ │
│ │ └───────────────┘ └───────────────┘ └───────────────┘ │ │
│ │ │ │
│ └──────────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ STATISTICS COLLECTOR │ │
│ │ │ │
│ │ • Latency Histogram (p50, p99, p999, max) │ │
│ │ • Throughput Counter (messages per second) │ │
│ │ • Cache Miss Counters (via perf) │ │
│ │ │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
└──────────────────────────────────────────────────────────────────────────────┘
4.2 Key Components
| Component | Responsibility | Performance Requirement |
|---|---|---|
| SPSC Queue | Lock-free message passing between threads | < 50ns push/pop |
| Ring Buffer | Pre-allocated storage, no runtime allocation | Zero allocation |
| Producer | Parse feed data, push to queue | Non-blocking push |
| Consumer | Pop messages, process, measure latency | Non-blocking pop |
| Statistics | Accurate latency measurement | Sub-nanosecond precision |
| Binary Parser | Zero-copy deserialization | < 10ns per message |
4.3 Data Structures
SPSC Queue Structure (with Cache Padding)
SPSC Queue Memory Layout:
Offset 0-63: Cache Line 0 (Producer writes, Consumer reads)
┌────────────────────────────────────────────────────────────────┐
│ tail: AtomicU64 (8 bytes) │
│ _pad_tail: [u8; 56] (padding to fill cache line) │
└────────────────────────────────────────────────────────────────┘
Offset 64-127: Cache Line 1 (Consumer writes, Producer reads)
┌────────────────────────────────────────────────────────────────┐
│ head: AtomicU64 (8 bytes) │
│ _pad_head: [u8; 56] (padding to fill cache line) │
└────────────────────────────────────────────────────────────────┘
Offset 128+: Cache Lines 2..N (Data, written by producer, read by consumer)
┌────────────────────────────────────────────────────────────────┐
│ buffer: [Slot<T>; CAPACITY] │
│ │
│ Each Slot: │
│ ┌────────────────────────────────────────────────────┐ │
│ │ data: MaybeUninit<T> │ │
│ │ (Slot size should be power of 2 or cache-aligned) │ │
│ └────────────────────────────────────────────────────┘ │
│ │
└────────────────────────────────────────────────────────────────┘
Capacity: CAPACITY (power of 2, e.g., 65536)
Mask: CAPACITY - 1 (for fast modulo: index & mask)
Rust Implementation Sketch:
use std::sync::atomic::{AtomicU64, Ordering};
use std::cell::UnsafeCell;
use std::mem::MaybeUninit;
const CACHE_LINE_SIZE: usize = 64;
#[repr(C)]
pub struct SpscQueue<T, const N: usize> {
// Cache line 0: Producer writes tail
tail: AtomicU64,
_pad_tail: [u8; CACHE_LINE_SIZE - 8],
// Cache line 1: Consumer writes head
head: AtomicU64,
_pad_head: [u8; CACHE_LINE_SIZE - 8],
// Remaining cache lines: Buffer
buffer: [UnsafeCell<MaybeUninit<T>>; N],
}
// Safety: Only producer accesses tail and buffer writes
// Only consumer accesses head and buffer reads
unsafe impl<T: Send, const N: usize> Send for SpscQueue<T, N> {}
unsafe impl<T: Send, const N: usize> Sync for SpscQueue<T, N> {}
C++ Implementation Sketch:
#include <atomic>
#include <cstdint>
#include <optional>
constexpr size_t CACHE_LINE_SIZE = 64;
template<typename T, size_t N>
class SpscQueue {
static_assert((N & (N - 1)) == 0, "N must be power of 2");
struct alignas(CACHE_LINE_SIZE) PaddedAtomic {
std::atomic<uint64_t> value{0};
char padding[CACHE_LINE_SIZE - sizeof(std::atomic<uint64_t>)];
};
PaddedAtomic tail_; // Producer writes
PaddedAtomic head_; // Consumer writes
alignas(CACHE_LINE_SIZE) T buffer_[N];
public:
static constexpr size_t MASK = N - 1;
bool try_push(const T& item);
std::optional<T> try_pop();
};
Market Data Message Structure
Market Data Message (64 bytes, cache-line aligned):
┌────────────────────────────────────────────────────────────────┐
│ Offset │ Field │ Type │ Size │ Description │
├────────┼────────────────┼───────────┼──────┼───────────────────┤
│ 0 │ timestamp │ u64 │ 8 │ Nanoseconds epoch │
│ 8 │ symbol_id │ u32 │ 4 │ Symbol identifier │
│ 12 │ msg_type │ u8 │ 1 │ Quote/Trade/etc │
│ 13 │ flags │ u8 │ 1 │ Status flags │
│ 14 │ _reserved │ u16 │ 2 │ Alignment │
│ 16 │ bid_price │ i64 │ 8 │ Price in ticks │
│ 24 │ ask_price │ i64 │ 8 │ Price in ticks │
│ 32 │ bid_size │ u32 │ 4 │ Bid quantity │
│ 36 │ ask_size │ u32 │ 4 │ Ask quantity │
│ 40 │ sequence │ u64 │ 8 │ Sequence number │
│ 48 │ _padding │ [u8; 16] │ 16 │ Pad to 64 bytes │
└────────────────────────────────────────────────────────────────┘
Total: 64 bytes (exactly 1 cache line)
Rust:
#[repr(C, align(64))]
#[derive(Clone, Copy)]
pub struct MarketDataMessage {
pub timestamp: u64,
pub symbol_id: u32,
pub msg_type: u8,
pub flags: u8,
_reserved: u16,
pub bid_price: i64,
pub ask_price: i64,
pub bid_size: u32,
pub ask_size: u32,
pub sequence: u64,
_padding: [u8; 16],
}
const _: () = assert!(std::mem::size_of::<MarketDataMessage>() == 64);
const _: () = assert!(std::mem::align_of::<MarketDataMessage>() == 64);
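A hedged sketch of the zero-copy side of the parser for this exact layout (assumes the wire format is little-endian and byte-for-byte identical to the struct above; a real feed handler would also validate sequence numbers and message type):
// Reinterpret 64 wire bytes as a MarketDataMessage in one unaligned read,
// instead of decoding field by field. Every bit pattern is a valid value
// because the struct is repr(C) and contains only plain integers.
fn parse_message(bytes: &[u8]) -> Option<MarketDataMessage> {
    if bytes.len() < std::mem::size_of::<MarketDataMessage>() {
        return None; // truncated packet
    }
    let msg = unsafe {
        std::ptr::read_unaligned(bytes.as_ptr() as *const MarketDataMessage)
    };
    Some(msg)
}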
4.4 Algorithm Overview
Lock-Free SPSC Push (Producer)
PUSH ALGORITHM (Lock-Free):
function try_push(item):
// Step 1: Read current tail (producer owns this, can use Relaxed)
current_tail = tail.load(Relaxed)
// Step 2: Calculate next position
next_tail = current_tail + 1
// Step 3: Check if queue is full
// Load head with Acquire to synchronize with consumer's Release
current_head = head.load(Acquire)
if next_tail - current_head > CAPACITY:
return QUEUE_FULL // Don't block, return error
// Step 4: Write data to buffer
// Safe because we know this slot is available
index = current_tail & MASK
buffer[index] = item
// Step 5: Publish tail with Release
// This "releases" all the writes we just did
// Consumer will "acquire" this and see the data
tail.store(next_tail, Release)
return SUCCESS
Memory Ordering Diagram for Push:
Producer Thread Consumer Thread
─────────────── ───────────────
buffer[idx] = item;
│
│ Release barrier
▼
tail.store(next, Release) ─────synchronizes────▶ tail.load(Acquire)
│
│ Acquire barrier
▼
read buffer[idx]
All writes before Release are visible after Acquire!
Lock-Free SPSC Pop (Consumer)
POP ALGORITHM (Lock-Free):
function try_pop():
// Step 1: Read current head (consumer owns this, can use Relaxed)
current_head = head.load(Relaxed)
// Step 2: Check if queue is empty
// Load tail with Acquire to synchronize with producer's Release
current_tail = tail.load(Acquire)
if current_head >= current_tail:
return QUEUE_EMPTY // Don't block, return None
// Step 3: Read data from buffer
// Safe because producer has published this slot
index = current_head & MASK
item = buffer[index]
// Step 4: Advance head with Release
// This "releases" the slot for the producer to reuse
head.store(current_head + 1, Release)
return item
Full Synchronization Flow:
SPSC Queue Synchronization Timeline:
Producer Queue Consumer
──────── ───── ────────
T=0: Write data to slot 5
│
▼
buffer[5] = msg
│
T=1: Release tail
│
▼
tail.store(6, Release) ───────────────────▶
T=2: Acquire tail
│
▼
if (head < tail)
│
T=3: Read data
│
▼
msg = buffer[5]
│
T=4: Release head
│
▼
◀─────────────────────────────────── head.store(6, Release)
│
T=5: Acquire head
│
▼
slot 5 now available for reuse!
5. Implementation Guide
5.1 Development Environment Setup
Rust Setup:
# Install Rust
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
source $HOME/.cargo/env
# Create project
cargo new lock_free_handler
cd lock_free_handler
# Add dependencies in Cargo.toml
# [dependencies]
# # (none needed for core implementation)
#
# [dev-dependencies]
# criterion = { version = "0.5", features = ["html_reports"] }
# Verify atomic support
rustc --print target-features | grep atomic
# Install for profiling
cargo install flamegraph
C++ Setup:
# Ubuntu/Debian
sudo apt-get update
sudo apt-get install g++ cmake linux-tools-generic
# Verify C++17 support (for std::optional, if constexpr)
g++ --version # Need GCC 7+ or Clang 5+
# Create project structure
mkdir -p lock_free_handler/{src,include,tests,benchmarks}
cd lock_free_handler
Profiling Tools:
# Linux: perf for cache analysis
sudo apt-get install linux-tools-common linux-tools-generic
# Check perf access
echo 0 | sudo tee /proc/sys/kernel/perf_event_paranoid
# ThreadSanitizer (built into GCC/Clang)
# Build with: -fsanitize=thread
# Rust: Miri for undefined behavior
rustup +nightly component add miri
cargo +nightly miri test
5.2 Project Structure
Rust:
lock_free_handler/
├── Cargo.toml
├── src/
│ ├── main.rs # CLI and benchmark entry
│ ├── lib.rs # Library root
│ ├── queue/
│ │ ├── mod.rs # Queue module
│ │ ├── spsc.rs # SPSC lock-free queue
│ │ └── mutex_queue.rs # Mutex baseline for comparison
│ ├── market_data/
│ │ ├── mod.rs # Market data module
│ │ ├── message.rs # Message struct
│ │ └── parser.rs # Binary protocol parser
│ └── stats/
│ ├── mod.rs # Statistics module
│ └── histogram.rs # Latency histogram
├── benches/
│ └── queue_bench.rs # Criterion benchmarks
├── tests/
│ ├── correctness.rs # Functional tests
│ └── concurrent.rs # Multi-threaded tests
└── README.md
C++:
lock_free_handler/
├── CMakeLists.txt
├── include/
│ ├── spsc_queue.hpp # SPSC queue header
│ ├── mutex_queue.hpp # Mutex baseline
│ ├── market_data.hpp # Message definitions
│ └── histogram.hpp # Latency statistics
├── src/
│ ├── main.cpp # Entry point
│ └── benchmark.cpp # Benchmark implementation
├── tests/
│ ├── test_queue.cpp # Queue tests
│ └── test_concurrent.cpp # Threading tests
└── README.md
5.3 The Core Question You’re Answering
“How do you pass data between threads without locks while maintaining correctness and achieving maximum throughput?”
This question requires understanding:
- Atomicity: What operations are atomic at the hardware level?
- Ordering: How do you ensure writes in one thread are visible in another?
- Progress: How do you guarantee the system makes forward progress?
- Performance: How do you avoid hidden costs like false sharing?
The answer is not just “use atomics” but understanding when different orderings are safe and why cache line padding matters.
5.4 Concepts You Must Understand First
| Concept | Self-Assessment Question | Where to Learn |
|---|---|---|
| Atomics | What makes an atomic operation atomic? (Hardware support, not locks) | Rust Atomics Ch. 1-2 |
| Memory Ordering | When is Relaxed safe? When is Acquire/Release needed? | Rust Atomics Ch. 3 |
| Cache Lines | What size is a cache line? What is false sharing? | CS:APP Ch. 6 |
| Ring Buffer | Why must the size be a power of 2? (Fast modulo with & mask) | Any data structures course |
| Producer-Consumer | What is the happens-before relationship? | Rust Atomics Ch. 3 |
| Spin-waiting | What is busy-waiting? When is it acceptable? | OS textbook |
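The "fast modulo with & mask" entry from the table, shown as a tiny check: for a power-of-2 capacity, seq % N and seq & (N - 1) always agree, and the AND needs no division.
fn main() {
    const N: u64 = 65_536;  // power of 2
    const MASK: u64 = N - 1;
    for seq in [0u64, 1, 7, 65_535, 65_536, 1_000_003] {
        assert_eq!(seq % N, seq & MASK); // same index, cheaper instruction
    }
    println!("slot for sequence 65_537 = {}", 65_537u64 & MASK); // prints 1
}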
5.5 Questions to Guide Your Design
Atomics and Ordering:
- Why can’t you use Ordering::Relaxed everywhere?
- What is the difference between Acquire on a load vs Release on a store?
- Why does the producer load head with Acquire?
- Why does the consumer load tail with Acquire?
- What would break if you used Relaxed for the tail store?
Queue Design:
- Why must capacity be a power of 2?
- How do you detect when the queue is full?
- How do you detect when the queue is empty?
- What happens if producer and consumer access the same slot?
- Why don’t you need a lock even though two threads access the buffer?
Performance:
- What is false sharing and how do you prevent it?
- Why put padding between head and tail?
- What cache misses occur during normal operation?
- How does the ring buffer improve cache efficiency vs linked list?
- What is the expected number of atomic operations per push/pop?
Edge Cases:
- What happens when the queue wraps around after 2^64 messages?
- How do you handle shutdown gracefully?
- What if the consumer is slower than the producer?
- What if the producer is slower than the consumer?
5.6 Thinking Exercise
Before writing code, trace through this scenario on paper:
Initial State:
head = 0
tail = 0
buffer = [_, _, _, _, _, _, _, _] (8 slots, empty)
Operations:
1. Producer: push(A)
2. Producer: push(B)
3. Producer: push(C)
4. Consumer: pop() // What is returned?
5. Consumer: pop() // What is returned?
6. Producer: push(D)
7. Producer: push(E)
8. Producer: push(F)
9. Producer: push(G)
10. Producer: push(H)
11. Producer: push(I) // What happens?
12. Consumer: pop()
13. Producer: push(I) // Now what happens?
Questions to answer:
- After step 3, what are head and tail? What is in the buffer?
- After step 5, what are head and tail?
- After step 10, how many items are in the queue? What is returned by step 11?
- Draw the buffer state at each step using indices.
- At which points does the producer need to check head?
- At which points does the consumer need to check tail?
Memory Ordering Exercise:
Consider this buggy implementation:
fn try_push(&self, item: T) -> bool {
let tail = self.tail.load(Ordering::Relaxed);
let head = self.head.load(Ordering::Relaxed); // BUG!
if tail - head >= N {
return false;
}
self.buffer[tail & MASK] = item;
self.tail.store(tail + 1, Ordering::Relaxed); // BUG!
true
}
- What can go wrong with this code?
- Which orderings need to be changed?
- Draw a timeline showing a specific interleaving that causes a bug.
5.7 Hints in Layers
Hint 1 - Starting Direction
Start with a working but slow version using strong ordering everywhere:
Rust:
use std::sync::atomic::{AtomicUsize, Ordering};
use std::cell::UnsafeCell;
pub struct SpscQueue<T, const N: usize> {
head: AtomicUsize,
tail: AtomicUsize,
buffer: [UnsafeCell<Option<T>>; N],
}
impl<T, const N: usize> SpscQueue<T, N> {
pub fn try_push(&self, item: T) -> bool {
let tail = self.tail.load(Ordering::SeqCst);
let head = self.head.load(Ordering::SeqCst);
if tail.wrapping_sub(head) >= N {
return false; // Full
}
unsafe {
*self.buffer[tail % N].get() = Some(item);
}
self.tail.store(tail.wrapping_add(1), Ordering::SeqCst);
true
}
pub fn try_pop(&self) -> Option<T> {
let head = self.head.load(Ordering::SeqCst);
let tail = self.tail.load(Ordering::SeqCst);
if head >= tail {
return None; // Empty
}
let item = unsafe {
(*self.buffer[head % N].get()).take()
};
self.head.store(head.wrapping_add(1), Ordering::SeqCst);
item
}
}
Get this working first, then optimize orderings.
Hint 2 - Correct Orderings
Now switch to efficient orderings:
pub fn try_push(&self, item: T) -> bool {
// We own tail, can use Relaxed to read our own state
let tail = self.tail.load(Ordering::Relaxed);
// Need Acquire to synchronize with consumer's Release of head
// This ensures we see the consumer's updates to slots it has freed
let head = self.head.load(Ordering::Acquire);
if tail.wrapping_sub(head) >= N {
return false;
}
// Write data (no ordering needed, protected by tail publish)
unsafe {
*self.buffer[tail % N].get() = Some(item);
}
// Release: publish our write to the consumer
// Consumer's Acquire will synchronize with this
self.tail.store(tail.wrapping_add(1), Ordering::Release);
true
}
pub fn try_pop(&self) -> Option<T> {
// We own head, can use Relaxed
let head = self.head.load(Ordering::Relaxed);
// Need Acquire to synchronize with producer's Release of tail
// This ensures we see the data the producer wrote
let tail = self.tail.load(Ordering::Acquire);
if head >= tail {
return None;
}
// Read data (protected by tail acquire)
let item = unsafe {
(*self.buffer[head % N].get()).take()
};
// Release: publish that we've freed this slot
// Producer's Acquire of head will synchronize with this
self.head.store(head.wrapping_add(1), Ordering::Release);
item
}
Key insight: The only cross-thread communication is through head and tail.
Data access is protected by proper ordering of these indices.
Hint 3 - Cache Line Padding
Add padding to prevent false sharing:
use std::sync::atomic::{AtomicU64, Ordering};
use std::cell::UnsafeCell;
use std::mem::MaybeUninit;
const CACHE_LINE: usize = 64;
#[repr(C)] // Ensure field order is preserved
pub struct SpscQueue<T, const N: usize> {
// Producer-owned fields (cache line 0)
tail: AtomicU64,
_pad_tail: [u8; CACHE_LINE - 8],
// Consumer-owned fields (cache line 1)
head: AtomicU64,
_pad_head: [u8; CACHE_LINE - 8],
// Shared buffer (separate cache lines)
buffer: Box<[UnsafeCell<MaybeUninit<T>>; N]>,
}
impl<T, const N: usize> SpscQueue<T, N> {
const MASK: usize = N - 1;
pub fn new() -> Self {
assert!(N.is_power_of_two(), "Capacity must be power of 2");
// Use Box to allocate on heap with proper alignment
let buffer: Box<[UnsafeCell<MaybeUninit<T>>; N]> = {
// Safety: slots are MaybeUninit, so leaving them uninitialized is fine;
// the producer writes each slot before the consumer ever reads it
unsafe {
let layout = std::alloc::Layout::new::<[UnsafeCell<MaybeUninit<T>>; N]>();
let ptr = std::alloc::alloc(layout) as *mut [UnsafeCell<MaybeUninit<T>>; N];
if ptr.is_null() {
std::alloc::handle_alloc_error(layout);
}
Box::from_raw(ptr)
}
};
Self {
tail: AtomicU64::new(0),
_pad_tail: [0; CACHE_LINE - 8],
head: AtomicU64::new(0),
_pad_head: [0; CACHE_LINE - 8],
buffer,
}
}
#[inline]
fn index(&self, seq: u64) -> usize {
(seq as usize) & Self::MASK
}
}
The padding ensures:
- tail is on its own cache line (producer writes)
- head is on its own cache line (consumer writes)
- No false sharing between threads!
Hint 4 - Measurement and Debugging
Accurate Latency Measurement (Rust):
use std::time::Instant;
pub struct LatencyHistogram {
buckets: [u64; 1000], // 1ns granularity up to 1000ns
overflow: u64,
count: u64,
}
impl LatencyHistogram {
pub fn record(&mut self, nanos: u64) {
self.count += 1;
if nanos < 1000 {
self.buckets[nanos as usize] += 1;
} else {
self.overflow += 1;
}
}
pub fn percentile(&self, p: f64) -> u64 {
let target = (self.count as f64 * p) as u64;
let mut cumulative = 0u64;
for (ns, &count) in self.buckets.iter().enumerate() {
cumulative += count;
if cumulative >= target {
return ns as u64;
}
}
1000 // Overflow bucket
}
}
// Usage in benchmark:
let start = Instant::now();
queue.try_push(item);
let elapsed = start.elapsed().as_nanos() as u64;
histogram.record(elapsed);
Using rdtsc for sub-nanosecond precision (x86):
#[cfg(target_arch = "x86_64")]
fn rdtsc() -> u64 {
unsafe { core::arch::x86_64::_rdtsc() }
}
// Calibrate once at startup
fn calibrate_tsc() -> f64 {
let start_tsc = rdtsc();
let start_time = Instant::now();
std::thread::sleep(std::time::Duration::from_millis(100));
let end_tsc = rdtsc();
let end_time = Instant::now();
let cycles = end_tsc - start_tsc;
let nanos = end_time.duration_since(start_time).as_nanos();
cycles as f64 / nanos as f64 // cycles per nanosecond
}
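A usage sketch tying the two helpers together (x86_64 only, and it assumes the calibration above is representative and the TSC is invariant on the benchmark machine):
fn main() {
    let cycles_per_ns = calibrate_tsc();  // helper defined above
    let start = rdtsc();
    std::hint::black_box(42u64);          // stand-in for queue.try_push(msg)
    let end = rdtsc();
    let nanos = (end - start) as f64 / cycles_per_ns;
    println!("~{:.1} ns (includes the measurement overhead itself)", nanos);
}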
Debugging with ThreadSanitizer:
# Rust
RUSTFLAGS="-Z sanitizer=thread" cargo +nightly test
# C++
g++ -fsanitize=thread -g -O1 test.cpp -o test
./test
Cache Analysis with perf:
# Measure cache misses
perf stat -e cache-misses,cache-references,L1-dcache-load-misses ./benchmark
# Compare with and without padding:
# Without padding: ~50% cache miss rate
# With padding: ~1% cache miss rate
5.8 The Interview Questions They’ll Ask
After completing this project, you should be able to answer:
- “Explain memory ordering and when you would use Acquire vs Release.”
- Expected: Clear explanation of happens-before, release publishes writes, acquire sees published writes
- Bonus: Explain why x86 gets acquire/release “for free” on loads/stores
- “What is false sharing and how do you prevent it?”
- Expected: Two threads writing to same cache line causes ping-pong
- Bonus: Mention MESI protocol, cache line size (64 bytes), show padding solution
- “Implement a lock-free SPSC queue.”
- Expected: Ring buffer, atomics for head/tail, proper ordering
- Bonus: Discuss wait-freedom, capacity management, wraparound handling
- “What’s the difference between lock-free and wait-free?”
- Expected: Lock-free = system makes progress; wait-free = every thread makes progress
- Bonus: SPSC queue is wait-free for both producer and consumer (no spin-waiting)
- “How do you benchmark concurrent code correctly?”
- Expected: Mention warmup, percentiles not averages, avoid coordination overhead
- Bonus: Discuss TSC vs system clock, measurement overhead, cache warming
- “What is the ABA problem and does it affect your queue?”
- Expected: Explain ABA with CAS, explain why SPSC with indices doesn’t have it
- Bonus: Discuss hazard pointers or epoch-based reclamation for when it matters
- “Why is SeqCst slower than Acquire/Release?”
- Expected: SeqCst requires memory fences, Acquire/Release is free on x86
- Bonus: Explain x86 TSO model, why ARM is different
- “How would you extend this to multiple consumers?”
- Expected: Discuss SPMC broadcast vs work-stealing, mention Disruptor pattern
- Bonus: Explain why MPMC is much harder and when to avoid it
- “What happens if the producer is faster than the consumer?”
- Expected: Queue fills up, producer gets QUEUE_FULL, must handle backpressure
- Bonus: Discuss flow control strategies, adaptive batching
- “How do you test lock-free code for correctness?”
- Expected: ThreadSanitizer, stress tests, Loom (Rust)
- Bonus: Discuss model checking, formal verification, randomized scheduling
5.9 Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Atomics and memory ordering | Rust Atomics and Locks (Mara Bos) | Ch. 1-3 |
| Building lock-free queues | Rust Atomics and Locks | Ch. 4-5 |
| Memory model deep dive | Rust Atomics and Locks | Ch. 6 |
| CPU cache hierarchy | Computer Systems: A Programmer’s Perspective | Ch. 6 |
| C++ atomics | C++ Concurrency in Action (Anthony Williams) | Ch. 5 |
| Lock-free patterns | The Art of Multiprocessor Programming | Ch. 10 |
| LMAX Disruptor | Building Low Latency Applications with C++ | Ch. 7 |
| Performance measurement | Systems Performance (Brendan Gregg) | Ch. 6 |
5.10 Implementation Phases
Phase 1: Mutex Baseline (Days 1-2)
Goals:
- Implement a simple mutex-based queue for baseline comparison
- Create benchmark harness for latency and throughput
- Establish measurement methodology
Tasks:
- Create project structure
- Implement mutex-based queue: pub struct MutexQueue<T> { data: Mutex<VecDeque<T>> } (a fuller sketch follows after this phase)
- Create simple producer/consumer test
- Implement latency histogram
- Run baseline benchmark, record numbers
Checkpoint: Baseline numbers documented (expect ~200-500ns latency)
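A minimal sketch of the Phase 1 baseline queue (the API mirrors the lock-free queue's try_push/try_pop so the same benchmark harness can drive both; the bounded capacity is an assumption to keep the comparison fair):
use std::collections::VecDeque;
use std::sync::Mutex;

pub struct MutexQueue<T> {
    data: Mutex<VecDeque<T>>,
    capacity: usize,
}

impl<T> MutexQueue<T> {
    pub fn new(capacity: usize) -> Self {
        Self { data: Mutex::new(VecDeque::with_capacity(capacity)), capacity }
    }

    // Same contract as the lock-free queue: return false when full rather than
    // blocking the caller (the mutex acquisition itself is what we are measuring).
    pub fn try_push(&self, item: T) -> bool {
        let mut q = self.data.lock().unwrap();
        if q.len() >= self.capacity {
            return false;
        }
        q.push_back(item);
        true
    }

    pub fn try_pop(&self) -> Option<T> {
        self.data.lock().unwrap().pop_front()
    }
}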
Phase 2: Basic Lock-Free SPSC (Days 3-6)
Goals:
- Implement working SPSC queue with SeqCst ordering
- Verify correctness with tests
- Compare performance to baseline
Tasks:
- Implement SPSC queue structure
- Implement try_push with SeqCst
- Implement try_pop with SeqCst
- Write unit tests for edge cases
- Run concurrent stress test
- Run ThreadSanitizer
- Benchmark and compare to baseline
Checkpoint: 3-5x improvement over mutex, all tests pass, no TSan errors
Phase 3: Optimize Memory Ordering (Days 7-9)
Goals:
- Switch to Acquire/Release ordering
- Measure improvement
- Understand why it’s correct
Tasks:
- Change orderings to Acquire/Release
- Document why each ordering is correct (comments!)
- Re-run ThreadSanitizer to verify
- Benchmark improvement (expect 2-3x over SeqCst)
- Profile with perf for any remaining bottlenecks
Checkpoint: ~10x improvement over mutex baseline
Phase 4: Cache Optimization (Days 10-12)
Goals:
- Add cache line padding
- Eliminate false sharing
- Measure cache behavior with perf
Tasks:
- Add padding between head and tail
- Ensure proper alignment with repr(C)
- Measure cache misses before/after with perf
- Optimize message struct layout
- Consider cache line alignment for buffer slots
- Final benchmark
Checkpoint: Cache miss rate < 5%, additional 2-3x improvement
Phase 5: Market Data Integration (Days 13-16)
Goals:
- Add market data message structure
- Implement binary protocol parsing
- Create realistic benchmark scenario
Tasks:
- Define MarketDataMessage struct (cache-aligned)
- Implement zero-copy parsing
- Create simulated market data generator (see the sketch after this phase)
- Run full system benchmark
- Measure end-to-end latency
- Add live statistics output
Checkpoint: Handle 1M+ messages/second with detailed statistics
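A hedged sketch of the simulated feed for this phase (the tiny xorshift generator, the 8-symbol universe, and the tick values are all made-up placeholders; it assumes this code lives in the same module as MarketDataMessage, whose _reserved/_padding fields are private):
// Produces synthetic quote messages for the benchmark producer thread.
pub struct FeedSimulator {
    state: u64,     // xorshift64 state (avoids pulling in a rand crate)
    sequence: u64,
}

impl FeedSimulator {
    pub fn new(seed: u64) -> Self {
        Self { state: seed.max(1), sequence: 0 }
    }

    fn next_u64(&mut self) -> u64 {
        let mut x = self.state;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.state = x;
        x
    }

    pub fn next_message(&mut self, now_nanos: u64) -> MarketDataMessage {
        self.sequence += 1;
        let bid = 150_000 + (self.next_u64() % 100) as i64; // price in ticks
        MarketDataMessage {
            timestamp: now_nanos,
            symbol_id: (self.next_u64() % 8) as u32, // 8 fake symbols
            msg_type: 1,                             // 1 = quote (assumed encoding)
            flags: 0,
            _reserved: 0,
            bid_price: bid,
            ask_price: bid + 1,
            bid_size: 100 + (self.next_u64() % 900) as u32,
            ask_size: 100 + (self.next_u64() % 900) as u32,
            sequence: self.sequence,
            _padding: [0; 16],
        }
    }
}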
Phase 6: Documentation and Polish (Days 17-21)
Goals:
- Document everything learned
- Create comparison report
- Clean up code for portfolio
Tasks:
- Write comprehensive README
- Create benchmark comparison chart
- Add architecture diagrams
- Document memory ordering rationale
- Add inline documentation
- Clean up code style
- Create presentation/demo
Checkpoint: Portfolio-ready with clear documentation
5.11 Key Implementation Decisions
| Decision | Options | Recommendation | Rationale |
|---|---|---|---|
| Index type | usize, u32, u64 | u64 | No wraparound concerns for practical use |
| Buffer storage | Array, Vec, Box | Box<[..]> | Heap allocation with proper alignment |
| Empty slot marker | Option<T>, MaybeUninit<T> | MaybeUninit<T> | Avoid Option overhead for perf-critical path |
| Ordering strategy | SeqCst everywhere, Acquire/Release | Acquire/Release | Much faster on x86, still correct |
| Capacity | Fixed, configurable | Const generic | Compile-time optimization, no runtime checks |
| Full handling | Block, return error, drop oldest | Return error | Never block in HFT; let caller decide |
6. Testing Strategy
6.1 Unit Tests
| Test Category | What to Test |
|---|---|
| Queue empty | Pop from empty queue returns None |
| Queue full | Push to full queue returns false |
| Single push/pop | Push one item, pop returns same item |
| FIFO order | Items come out in order pushed |
| Capacity boundary | Queue holds exactly N items (full when tail - head == N) |
| Wraparound | Sequence numbers wrap correctly |
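Two of the table's cases as concrete tests, a sketch assuming the SpscQueue and new() constructor from the hints in section 5.7 (and the convention above that a queue of capacity N holds N items):
#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn pop_from_empty_returns_none() {
        let q = SpscQueue::<u64, 8>::new();
        assert_eq!(q.try_pop(), None);
    }

    #[test]
    fn fifo_order_and_full_detection() {
        let q = SpscQueue::<u64, 8>::new();
        for i in 0..8u64 {
            assert!(q.try_push(i), "push {} should succeed", i);
        }
        assert!(!q.try_push(99), "ninth push must fail: queue is full");
        for i in 0..8u64 {
            assert_eq!(q.try_pop(), Some(i), "items must come out in FIFO order");
        }
        assert_eq!(q.try_pop(), None, "queue is drained, so it must be empty again");
    }
}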
6.2 Concurrent Tests
use std::sync::Arc;
use std::thread;

#[test]
fn test_concurrent_correctness() {
let queue = Arc::new(SpscQueue::<u64, 1024>::new());
let q_producer = Arc::clone(&queue);
let q_consumer = Arc::clone(&queue);
const NUM_ITEMS: u64 = 1_000_000;
let producer = thread::spawn(move || {
for i in 0..NUM_ITEMS {
while !q_producer.try_push(i) {
// Queue full, spin
}
}
});
let consumer = thread::spawn(move || {
let mut expected = 0u64;
let mut received = 0u64;
while received < NUM_ITEMS {
if let Some(item) = q_consumer.try_pop() {
assert_eq!(item, expected, "Out of order at {}", expected);
expected += 1;
received += 1;
}
}
});
producer.join().unwrap();
consumer.join().unwrap();
}
6.3 Property-Based Tests
Property 1: Every item pushed is eventually popped (no loss)
Property 2: Items are popped in the same order they were pushed (FIFO)
Property 3: No item is popped more than once (no duplication)
Property 4: Queue never exceeds capacity
Property 5: Queue is always either empty, partially filled, or full (consistent state)
6.4 Stress Tests
Stress Test 1: Throughput Saturation
Producer pushes as fast as possible
Consumer pops as fast as possible
Run for 60 seconds
Verify no corruption
Stress Test 2: Unbalanced Load
Producer 10x faster than consumer (queue fills up)
Producer 10x slower than consumer (queue empties)
Verify graceful handling
Stress Test 3: Burst Traffic
Producer sends 10000 messages in burst
Pause
Consumer catches up
Repeat
Verify all messages received
7. Common Pitfalls and Debugging
| Problem | Symptom | Cause | Solution | Verification |
|---|---|---|---|---|
| Data corruption | Wrong values read | Wrong memory ordering | Use Acquire on loads after Release stores | ThreadSanitizer, stress test |
| Missed messages | Consumer doesn’t see all messages | Relaxed ordering on tail store | Use Release on tail store | Count total pushed vs popped |
| False sharing | High latency, poor scaling | head/tail on same cache line | Add 56-byte padding | perf stat cache-misses |
| Full queue deadlock | Producer spins forever | Consumer not running | Return error instead of spin | Never spin-wait on consumer |
| Index overflow | Panic on wraparound | Using usize with subtraction | Use wrapping_sub or u64 | Test near u64::MAX |
| Memory leak | Memory grows unbounded | Not dropping popped items | Ensure take() or ptr::read | Valgrind or miri |
| Uninitialized read | Garbage data | Reading MaybeUninit before write | Proper synchronization | Miri, TSan |
7.1 Debugging Techniques
Visualizing Queue State:
impl<T: std::fmt::Debug, const N: usize> SpscQueue<T, N> {
pub fn debug_state(&self) {
let head = self.head.load(Ordering::SeqCst);
let tail = self.tail.load(Ordering::SeqCst);
println!("Queue State:");
println!(" head: {} (index {})", head, head as usize & Self::MASK);
println!(" tail: {} (index {})", tail, tail as usize & Self::MASK);
println!(" size: {}", tail.wrapping_sub(head));
println!(" full: {}", tail.wrapping_sub(head) >= N as u64);
println!(" empty: {}", head >= tail);
}
}
Detecting Ordering Bugs:
// Add assertions in debug builds
#[cfg(debug_assertions)]
fn try_push(&self, item: T) -> bool {
let tail_before = self.tail.load(Ordering::SeqCst);
// ... push logic ...
let tail_after = self.tail.load(Ordering::SeqCst);
debug_assert!(
tail_after == tail_before || tail_after == tail_before + 1,
"Tail moved unexpectedly: {} -> {}",
tail_before, tail_after
);
true
}
8. Extensions and Challenges
8.1 Beginner Extensions
- Multiple queue sizes: Template/const generic for different capacities
- Bounded latency mode: Spin for limited time, then fail
- Statistics: Track push/pop counts, full/empty counts
- Shutdown signal: Producer can signal “done” to consumer
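For the shutdown-signal extension, a minimal sketch: the producer sets a flag with Release after its final push, and the consumer only stops once it has seen the flag and then drained the queue one last time (the type and method names are illustrative):
use std::sync::atomic::{AtomicBool, Ordering};

pub struct ShutdownFlag {
    done: AtomicBool,
}

impl ShutdownFlag {
    pub fn new() -> Self {
        Self { done: AtomicBool::new(false) }
    }

    // Producer calls this exactly once, after its last successful try_push.
    pub fn signal_done(&self) {
        self.done.store(true, Ordering::Release);
    }

    pub fn is_done(&self) -> bool {
        self.done.load(Ordering::Acquire)
    }
}

// Consumer loop sketch (queue from earlier sections, shown as comments):
//
// loop {
//     if let Some(msg) = queue.try_pop() {
//         process(msg);
//     } else if shutdown.is_done() {
//         // The flag was stored after the last push (Release/Acquire), so one
//         // final drain is guaranteed to observe everything that was pushed.
//         while let Some(msg) = queue.try_pop() { process(msg); }
//         break;
//     } else {
//         std::hint::spin_loop(); // briefly idle, then retry
//     }
// }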
8.2 Intermediate Extensions
- SPMC broadcast: One producer, multiple consumers each get all messages
- Batched operations: Push/pop multiple items atomically for higher throughput
- Memory-mapped queue: Share queue between processes
- Variable-size messages: Handle different message lengths efficiently
8.3 Advanced Extensions
- MPMC queue: Multiple producers, multiple consumers (significantly harder!)
- Epoch-based reclamation: Safe memory reclamation for linked structures
- Hardware timestamps: Use CPU TSC for sub-nanosecond timing
- NUMA awareness: Pin threads and queues to NUMA nodes
- Kernel bypass: Use DPDK/AF_XDP for network to queue directly
8.4 Research Challenges
- Formal verification: Use TLA+ or Iris to prove correctness
- Weak memory model testing: Use Loom to find ARM-specific bugs
- Custom allocator integration: Pool allocator for message buffers
- Compiler barrier analysis: Study generated assembly for each ordering
9. Real-World Connections
9.1 Production Systems
LMAX Disruptor:
- Java-based lock-free ring buffer
- Powers the LMAX Exchange (6M+ orders/second)
- Pioneered many techniques we use here
- Key innovation: sequence barriers for multi-consumer
CppCon Talks (Search for):
- “Lock-Free Programming” - Herb Sutter
- “The Disruptor Pattern” - Martin Thompson
- “The Bits Between the Bits” - Marshall Clow
- “Latency in Financial Trading Systems” - Carl Cook
Facebook’s Folly Library:
- folly::ProducerConsumerQueue - Similar to what we build
- Used in Facebook infrastructure
- Well-tested production code
9.2 How Production Differs
| Aspect | This Project | Production |
|---|---|---|
| Persistence | In-memory only | Write-ahead log |
| Fault tolerance | None | Replication, recovery |
| Monitoring | Printf | Prometheus, Grafana |
| Testing | Unit/stress tests | Chaos engineering |
| Deployment | Local | Kubernetes, containers |
| Network | Simulated | DPDK/kernel bypass |
9.3 Open Source References
| Project | Description | What to Learn |
|---|---|---|
| crossbeam-deque | Rust concurrent deques | Lock-free queue patterns |
| folly | Facebook C++ library | ProducerConsumerQueue |
| liblfds | Lock-free data structures in C | Many patterns, well-tested |
| LMAX Disruptor | Java ring buffer | Architecture patterns |
| moodycamel | C++ lock-free queue | MPMC implementation |
10. Resources
10.1 Essential Reading
- “Rust Atomics and Locks” by Mara Bos
- THE definitive guide for this project
- Chapters 1-6 cover everything you need
- Examples in Rust that translate to C++
- “Computer Systems: A Programmer’s Perspective” by Bryant & O’Hallaron
- Chapter 6: Memory hierarchy and caching
- Essential for understanding false sharing
10.2 Articles and Papers
- “Memory Barriers: a Hardware View for Software Hackers” - Paul McKenney
- Deep dive into why memory barriers exist
- Hardware perspective on ordering
- “Lock-Free Data Structures” - CMU lecture
- Academic introduction
- Covers ABA problem, hazard pointers
10.3 Videos
- CppCon 2017: Fedor Pikus “C++ atomics, from basic to advanced”
- CppCon 2014: Herb Sutter “Lock-Free Programming (or, Juggling Razor Blades)”
- LMAX Exchange: “How we built a trading system”
10.4 Tools
- ThreadSanitizer (TSan): Built into Clang/GCC, catches data races
- Loom: Rust model checker for concurrency
- perf: Linux profiler for cache analysis
- Miri: Rust undefined behavior detector
11. Self-Assessment Checklist
Understanding
- I can explain what memory ordering is and why it matters
- I can draw the Acquire/Release synchronization pattern
- I understand why false sharing hurts performance
- I know when Relaxed ordering is safe
- I can explain the difference between lock-free and wait-free
- I understand why the queue capacity must be a power of 2
Implementation
- My SPSC queue passes all correctness tests
- ThreadSanitizer reports no data races
- I achieve < 100ns latency at p99
- I handle queue full/empty without blocking
- My code has cache line padding for head/tail
- I can benchmark and measure accurately
Growth
- I profiled cache misses before and after optimization
- I can explain every memory ordering choice in my code
- I compared my implementation to reference implementations
- I could explain my design in a technical interview
- I understand how to extend this to MPMC (even if I haven’t implemented it)
12. Submission / Completion Criteria
Minimum Viable Completion
- SPSC queue implements try_push and try_pop
- Uses atomic operations with some form of ordering
- Basic CLI demonstrates producer/consumer
- At least 5 unit tests pass
- Code compiles without warnings
Full Completion
- All unit tests and concurrent tests pass
- ThreadSanitizer reports no errors
- Acquire/Release ordering correctly applied
- Cache line padding prevents false sharing
- Latency p99 < 200ns
- Throughput > 1M messages/second
- Comparison with mutex baseline documented
Excellence (Going Above and Beyond)
- Latency p99 < 100ns
- Throughput > 10M messages/second
- Market data parsing integrated
- Latency histogram with p50/p99/p999/max
- Cache miss analysis with perf
- SPMC broadcast extension implemented
- Presentation-ready documentation
This guide was expanded from HIGH_FREQUENCY_TRADING_CPP_RUST_LEARNING_PROJECTS_SUMMARY.md. For the complete learning path, see the project index.