Project 2: Lock-Free Market Data Handler

Build a system that receives market data, parses it, and distributes it to consumers using lock-free queues - zero mutexes, zero blocking, pure atomic operations.


Quick Reference

Attribute       Value
---------       -----
Difficulty      Advanced
Time Estimate   2-3 weeks
Language        Rust (recommended) or C++
Prerequisites   Solid pointers/references, basic threading, completed Project 1 (Order Book)
Key Topics      Atomics, memory ordering, SPSC queues, false sharing, binary protocol parsing

1. Learning Objectives

By completing this project, you will be able to:

  1. Explain why locks are unacceptable in HFT systems and articulate the latency costs of mutex contention in nanoseconds
  2. Implement a lock-free SPSC queue using only atomic operations, achieving 1M+ messages per second
  3. Master memory ordering semantics - understand when to use Relaxed, Acquire, Release, and SeqCst, and why each matters
  4. Identify and eliminate false sharing by understanding cache line architecture and applying proper padding
  5. Design a binary protocol parser that operates without heap allocation in the critical path
  6. Measure and interpret latency distributions - understand p50, p99, p999, and why tail latency matters in trading
  7. Debug concurrent code using tools like ThreadSanitizer (TSAN), Loom, and systematic testing approaches
  8. Make informed tradeoffs between different lock-free queue designs (bounded vs unbounded, SPSC vs MPMC)

2. Theoretical Foundation

2.1 Core Concepts

Lock-Free vs Lock-Based: The Fundamental Difference

In traditional concurrent programming, threads coordinate using locks (mutexes, semaphores). When Thread A holds a lock, Thread B must wait. This waiting is the fundamental problem:

                    LOCK-BASED QUEUE: THE WAITING GAME
+------------------------------------------------------------------------+
|                                                                        |
| Timeline:        0us    50us   100us   150us   200us   250us   300us  |
|                   |      |       |       |       |       |       |    |
| Thread A:      [==ACQUIRE LOCK==][==WRITE DATA==][==RELEASE==]        |
|                                                        ^               |
| Thread B:      [TRY LOCK].....BLOCKED.....BLOCKED.....[ACQUIRE]       |
|                         ^                              |               |
|                         |                              |               |
|                    Waiting for A                 Finally runs!         |
|                    (doing NOTHING)                                     |
|                                                                        |
| Problem #1: Thread B wastes 200us doing absolutely nothing             |
| Problem #2: If A crashes while holding lock -> deadlock                |
| Problem #3: If A is preempted by OS -> priority inversion              |
| Problem #4: Lock acquisition has its own latency (~20-50ns minimum)    |
|                                                                        |
+------------------------------------------------------------------------+

In HFT, 200 microseconds is an eternity. Market conditions can change, orders can be filled, and arbitrage opportunities can vanish in that time.

Lock-free programming eliminates blocking entirely:

                    LOCK-FREE QUEUE: CONCURRENT PROGRESS
+------------------------------------------------------------------------+
|                                                                        |
| Timeline:        0ns    50ns   100ns   150ns   200ns   250ns   300ns  |
|                   |      |       |       |       |       |       |    |
| Thread A:      [CAS][WRITE][CAS][WRITE][CAS][WRITE][CAS][WRITE]       |
|                   |      |       |       |       |       |            |
| Thread B:      [CAS][READ][CAS][RETRY][CAS][READ][CAS][READ]          |
|                                    ^                                   |
|                            Contention detected -                       |
|                            retry IMMEDIATELY                           |
|                            (no blocking!)                              |
|                                                                        |
| Key insight: Both threads ALWAYS make progress.                        |
|              No thread ever blocks waiting for another.                |
|              "Failed" CAS operations immediately retry.                |
|              System as a whole guarantees forward progress.            |
|                                                                        |
+------------------------------------------------------------------------+

The formal definition: A data structure is lock-free if at least one thread is guaranteed to make progress in a finite number of steps, regardless of what other threads are doing.

This means:

  • No deadlocks (there are no locks to hold)
  • No priority inversion (no waiting on lower-priority threads)
  • Bounded worst-case latency (no unbounded waits)
  • Better throughput under contention (retries are faster than context switches)
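The "retry immediately" behavior in the diagram above is a compare-and-swap (CAS) loop. Here is a minimal Rust sketch of a lock-free increment; in practice fetch_add does this in a single instruction, but the explicit loop is the pattern that generalizes to real data structures:

use std::sync::atomic::{AtomicU64, Ordering};

fn lock_free_increment(counter: &AtomicU64) -> u64 {
    let mut current = counter.load(Ordering::Relaxed);
    loop {
        match counter.compare_exchange_weak(
            current,            // expected value
            current + 1,        // desired value
            Ordering::AcqRel,   // ordering on success
            Ordering::Relaxed,  // ordering on failure
        ) {
            Ok(prev) => return prev,              // our CAS won
            Err(observed) => current = observed,  // lost the race: retry now
        }
    }
}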

Memory Ordering: The Heart of Lock-Free Programming

Modern CPUs are liars. They don’t execute instructions in the order you write them. They reorder, speculate, buffer writes, and do whatever they can to maximize throughput. The memory ordering model defines what guarantees you get about when writes become visible to other threads.

                      THE CPU'S PERSPECTIVE
+------------------------------------------------------------------------+
|                                                                        |
|  What you wrote:              What the CPU actually does:              |
|                                                                        |
|  data = 42;                   [Store buffer: data = 42]                |
|  flag = true;                 [Store buffer: flag = true]              |
|                                        |                               |
|                               [Reorder for efficiency...]              |
|                                        |                               |
|                               [Commit flag = true first!]              |
|                               [Commit data = 42 later]                 |
|                                                                        |
|  Another thread sees flag=true but data=0!                             |
|  This is NOT a bug - it's the memory model working as designed.        |
|                                                                        |
+------------------------------------------------------------------------+

This is why memory ordering matters. Without proper orderings, your lock-free code may appear to work on x86 (strong memory model) and break mysteriously on ARM (weak memory model).

The Memory Ordering Spectrum:

      Weakest (Fastest)                           Strongest (Slowest)
            |                                            |
            v                                            v
      +---------+    +---------+    +---------+    +---------+
      | Relaxed | -> | Acquire | -> | Release | -> | SeqCst  |
      +---------+    +---------+    +---------+    +---------+
           |              |              |              |
           |              |              |              |
      No ordering   Later ops      Earlier ops    Global
      guarantees    cannot move    cannot move    total order
      between       BEFORE this    AFTER this     visible to
      threads       load           store          ALL threads

Relaxed Ordering: Only guarantees atomicity (no torn reads/writes). The CPU and compiler can reorder freely. Use only for counters where the order truly does not matter.

Acquire Ordering: A load with Acquire ordering prevents the CPU from reordering any reads or writes that appear AFTER the load to BEFORE the load. Think of it as “acquiring” a view of memory that includes all writes before the matching Release.

Release Ordering: A store with Release ordering prevents the CPU from reordering any reads or writes that appear BEFORE the store to AFTER the store. Think of it as “releasing” or “publishing” all your prior writes.

SeqCst (Sequentially Consistent): Provides a single global total order that all threads agree on. The strongest guarantee, but also the slowest because it requires full memory barriers.

                ACQUIRE/RELEASE SYNCHRONIZATION
+------------------------------------------------------------------------+
|                                                                        |
|  Thread A (Producer)                    Thread B (Consumer)            |
|         |                                      |                       |
|         |  buffer[0] = 42;                     |                       |
|         |  buffer[1] = 43;                     |                       |
|         |  buffer[2] = 44;                     |                       |
|         |         |                            |                       |
|         |         | All these writes           |                       |
|         |         | are "published" by         |                       |
|         |         v the Release store          |                       |
|         |                                      |                       |
|   ready.store(true, Release)                   |                       |
|         |                                      |                       |
|         |   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~>      |                       |
|         |   (synchronization happens)          |                       |
|         |                                      v                       |
|         |                      ready.load(Acquire)                     |
|         |                              |                               |
|         |                              | After this Acquire,           |
|         |                              | ALL writes before the         |
|         |                              v Release are visible           |
|         |                                      |                       |
|         |                        x = buffer[0]; // sees 42             |
|         |                        y = buffer[1]; // sees 43             |
|         |                        z = buffer[2]; // sees 44             |
|         v                              v                               |
|                                                                        |
|  The Release "publishes" all prior writes.                             |
|  The Acquire "receives" all those writes.                              |
|  This is the fundamental synchronization primitive of lock-free code.  |
|                                                                        |
+------------------------------------------------------------------------+

The Critical Rule: An Acquire load that reads the value written by a Release store on the same memory location “synchronizes with” that store. All writes that happened before the Release then become visible to any code running after the Acquire.
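A minimal Rust sketch of this publish pattern, with the payload kept in an atomic so the example is entirely safe code (in the real queue the payload is a plain buffer slot, protected by the same Release/Acquire pair):

use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

fn main() {
    let data = Arc::new(AtomicU64::new(0));
    let ready = Arc::new(AtomicBool::new(false));

    let (d, r) = (Arc::clone(&data), Arc::clone(&ready));
    let producer = thread::spawn(move || {
        d.store(42, Ordering::Relaxed);   // payload write (any prior write works)
        r.store(true, Ordering::Release); // publishes everything above
    });

    let consumer = thread::spawn(move || {
        while !ready.load(Ordering::Acquire) {} // spin until published
        // The payload write happens-before the Release store, which
        // synchronizes-with our Acquire load: we are guaranteed to see 42.
        assert_eq!(data.load(Ordering::Relaxed), 42);
    });

    producer.join().unwrap();
    consumer.join().unwrap();
}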


SPSC vs MPSC vs MPMC Queues

Queue designs differ in how many producers and consumers they support:

           QUEUE TYPE COMPARISON
+------------------------------------------------------------------------+
|                                                                        |
|  SPSC (Single-Producer Single-Consumer)                                |
|  ========================================                              |
|                                                                        |
|       Producer -----> [Queue] -----> Consumer                          |
|                                                                        |
|  - Simplest to implement                                               |
|  - No CAS needed! Simple load/store suffices                           |
|  - Best performance (no contention on same variable)                   |
|  - Use when: dedicated threads, audio processing, message passing      |
|                                                                        |
|  Performance: ~50ns per operation                                      |
|                                                                        |
+------------------------------------------------------------------------+
|                                                                        |
|  MPSC (Multi-Producer Single-Consumer)                                 |
|  ======================================                                |
|                                                                        |
|       Producer 1 ----\                                                 |
|       Producer 2 -----> [Queue] -----> Consumer                        |
|       Producer 3 ----/                                                 |
|                                                                        |
|  - Producers need CAS to coordinate writes                             |
|  - Consumer still simple (single reader)                               |
|  - Common pattern: multiple sources -> single aggregator               |
|                                                                        |
|  Performance: ~100-200ns per operation                                 |
|                                                                        |
+------------------------------------------------------------------------+
|                                                                        |
|  MPMC (Multi-Producer Multi-Consumer)                                  |
|  =====================================                                 |
|                                                                        |
|       Producer 1 ----\              /---- Consumer 1                   |
|       Producer 2 -----> [Queue] ---+---- Consumer 2                    |
|       Producer 3 ----/              \---- Consumer 3                   |
|                                                                        |
|  - Both producers and consumers need CAS                               |
|  - Most complex to implement correctly                                 |
|  - Watch out for the ABA problem!                                      |
|  - Use when: work-stealing, thread pools                               |
|                                                                        |
|  Performance: ~200-500ns per operation                                 |
|                                                                        |
+------------------------------------------------------------------------+

For HFT market data handling, SPSC is typically preferred because:

  1. You can dedicate one thread to receiving network data (producer)
  2. You can dedicate one thread to processing each consumer
  3. Multiple SPSC queues avoid all contention between unrelated consumers

False Sharing: The Performance Killer

Modern CPUs don’t operate on individual bytes. They operate on cache lines (typically 64 bytes). When two threads modify variables that happen to share the same cache line, they constantly invalidate each other’s caches:

                    FALSE SHARING EXPLAINED
+------------------------------------------------------------------------+
|                                                                        |
|  CPU Core 0                              CPU Core 1                    |
|  +----------------+                      +----------------+            |
|  | L1 Cache       |                      | L1 Cache       |            |
|  |                |                      |                |            |
|  | Cache Line X:  |                      | Cache Line X:  |            |
|  | [head][tail]   |                      | [head][tail]   |            |
|  +-------+--------+                      +-------+--------+            |
|          |                                       |                     |
|          |    MESI Protocol Traffic              |                     |
|          |    ========================           |                     |
|          |                                       |                     |
|    Core 0 writes head                      Core 1 writes tail          |
|          |                                       |                     |
|          +---> INVALIDATE LINE X ------->        |                     |
|          |                                       |                     |
|          |     <------- REQUEST LINE X <---------+                     |
|          |                                       |                     |
|          +---> SEND LINE X (Modified) ---------->|                     |
|          |                                       |                     |
|          |     <------- INVALIDATE LINE X -------+ (tail modified)     |
|          |                                       |                     |
|    ... and the ping-pong continues ...           |                     |
|                                                                        |
|  RESULT: Each write to head OR tail causes a ~100-cycle penalty        |
|          on the OTHER core. Performance drops 5-10x!                   |
|                                                                        |
+------------------------------------------------------------------------+

The solution is to ensure that variables accessed by different threads live on different cache lines:

                    CACHE LINE PADDING
+------------------------------------------------------------------------+
|                                                                        |
|  BEFORE (False Sharing):                                               |
|                                                                        |
|  +------ Cache Line (64 bytes) ------+                                 |
|  | head (8B) | tail (8B) | unused... |                                 |
|  +------------------------------------+                                |
|       ^            ^                                                   |
|       |            |                                                   |
|    Core 0       Core 1                                                 |
|    (producer)   (consumer)                                             |
|                                                                        |
|  AFTER (Properly Padded):                                              |
|                                                                        |
|  +------ Cache Line 1 (64 bytes) ----+                                 |
|  | head (8B) | padding (56 bytes)    |                                 |
|  +------------------------------------+                                |
|       ^                                                                |
|       |                                                                |
|    Core 0 (owns this line exclusively)                                 |
|                                                                        |
|  +------ Cache Line 2 (64 bytes) ----+                                 |
|  | tail (8B) | padding (56 bytes)    |                                 |
|  +------------------------------------+                                |
|       ^                                                                |
|       |                                                                |
|    Core 1 (owns this line exclusively)                                 |
|                                                                        |
|  RESULT: No cache line sharing, no invalidation traffic, full speed!   |
|                                                                        |
+------------------------------------------------------------------------+
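In Rust, this padding can be expressed with an aligned wrapper type; crossbeam_utils::CachePadded (listed in the setup section) does the same thing. A minimal hand-rolled sketch:

use std::sync::atomic::AtomicUsize;

// align(64) forces both the alignment and the size of the wrapper to a
// multiple of 64 bytes, so each wrapped value owns a full cache line.
// (64 bytes is typical on x86; some CPUs use 128-byte lines.)
#[repr(align(64))]
pub struct CacheAligned<T>(pub T);

// head and tail now live on separate cache lines: no more ping-pong.
pub struct Indices {
    pub head: CacheAligned<AtomicUsize>, // consumer's line
    pub tail: CacheAligned<AtomicUsize>, // producer's line
}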

2.2 Why This Matters in HFT

Real HFT systems process millions of market data messages per second. Each message represents a quote update, trade, or order book change. The system that processes this data fastest has a competitive advantage.

Consider the cost of contention:

                    LATENCY BUDGET IN HFT
+------------------------------------------------------------------------+
|                                                                        |
|  Total latency budget: ~10 microseconds (10,000 nanoseconds)           |
|                                                                        |
|  Network receive:        ~500ns  ████░░░░░░░░░░░░░░░░░░░░░░░░░░░░     |
|  Protocol parsing:       ~200ns  ██░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░     |
|  Queue to strategy:      ~50ns   █░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░     |
|  Strategy calculation:   ~2000ns ████████████████░░░░░░░░░░░░░░░░     |
|  Order creation:         ~100ns  █░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░     |
|  Network send:           ~500ns  ████░░░░░░░░░░░░░░░░░░░░░░░░░░░░     |
|                          -------                                       |
|  Total:                  ~3350ns (leaves margin for spikes)            |
|                                                                        |
+------------------------------------------------------------------------+
|                                                                        |
|  Now add a mutex to the queue:                                         |
|                                                                        |
|  Mutex acquisition:      ~20-50ns minimum (uncontended)                |
|  Mutex with contention:  ~500-2000ns (context switch!)                 |
|                                                                        |
|  A SINGLE contentious mutex can consume your ENTIRE latency budget!    |
|                                                                        |
+------------------------------------------------------------------------+

This is why HFT systems use lock-free queues. Not because they’re “cool” or “theoretically interesting,” but because they’re the only way to meet the latency requirements.


2.3 Historical Context

Lock-free programming has a fascinating history:

1960s-1970s: Early concurrent programming relied on semaphores (Dijkstra) and monitors (Hoare). These were sufficient for early timesharing systems.

1983: Leslie Lamport introduces the concept of wait-free algorithms in his paper “Specifying Concurrent Program Modules.”

1991: Maurice Herlihy publishes “Wait-Free Synchronization,” proving that Compare-and-Swap (CAS) is a universal primitive - you can build wait-free versions of any sequential data structure using CAS - while plain read/write registers cannot solve wait-free consensus for even two processes.

2000s: Lock-free data structures become practical with widespread availability of CAS instructions (CMPXCHG on x86, LL/SC on ARM).

2004: Maged Michael introduces Hazard Pointers as a practical solution to safe memory reclamation in lock-free structures.

2010s: The C++11 memory model formalizes atomics and memory ordering. Languages like Rust adopt similar models with stronger safety guarantees.

Today: Lock-free structures are essential in:

  • Operating system kernels (Linux RCU, Windows interlocked lists)
  • Database systems (RocksDB, ScyllaDB)
  • Game engines (job systems, message passing)
  • HFT systems (market data, order routing)
  • Async runtimes (Tokio, work-stealing schedulers)

2.4 Common Misconceptions

Misconception 1: “Lock-free means no synchronization”

False. Lock-free code uses atomic operations for synchronization. The difference is that these operations don’t cause threads to block - they either succeed or retry immediately.

Misconception 2: “Lock-free is always faster”

False. Lock-free code buys predictable, bounded latency, sometimes at the cost of higher average latency. Under low contention, a mutex might actually be faster because it doesn’t need retry loops. Lock-free shines under high contention and when tail latency matters.

Misconception 3: “SeqCst is safe, so use it everywhere”

Dangerous. SeqCst is safe, but it’s also slow (requires full memory barriers). More importantly, using SeqCst everywhere teaches you nothing about memory ordering. You need to understand Acquire/Release to write correct, performant lock-free code.

Misconception 4: “If it works on my machine, it’s correct”

Extremely dangerous. x86 has a strong memory model (Total Store Order) that hides many ordering bugs. Your code might work perfectly on x86 and fail mysteriously on ARM. Always test on weak memory architectures or use tools like Loom.

Misconception 5: “Lock-free code is simpler because there are no locks”

False. Lock-free code is significantly harder to write correctly. You’re trading the simplicity of mutual exclusion for the complexity of reasoning about memory ordering, ABA problems, and safe memory reclamation.


3. Project Specification

3.1 What You Will Build

A complete market data handling system with these components:

              SYSTEM ARCHITECTURE
+------------------------------------------------------------------------+
|                                                                        |
|   +---------------+     +------------------+     +-----------------+   |
|   |   Network     |     |    Parser        |     |   SPSC Queue    |   |
|   |   Simulator   | --> |    (Zero-copy,   | --> |   (Lock-free,   |   |
|   |   (Producer)  |     |     no alloc)    |     |    padded)      |   |
|   +---------------+     +------------------+     +--------+--------+   |
|                                                          |             |
|                                    +---------------------+             |
|                                    |                                   |
|                                    v                                   |
|   +------------------+     +------------------+     +-----------------+|
|   |   Consumer 1     |     |   Consumer 2     |     |  Stats/Monitor  ||
|   |   (Strategy A)   |     |   (Strategy B)   |     |  (Latency hist) ||
|   +------------------+     +------------------+     +-----------------+|
|                                                                        |
+------------------------------------------------------------------------+

The system will:

  1. Generate simulated market data at configurable rates (10K to 1M+ messages/second)
  2. Parse a binary market data protocol without heap allocation
  3. Distribute parsed data to multiple consumers via lock-free SPSC queues
  4. Measure and report latency statistics (p50, p99, p999)
  5. Compare performance against a mutex-based baseline

3.2 Functional Requirements

FR1: Market Data Generation

  • Generate realistic market data for configurable symbols (AAPL, MSFT, etc.)
  • Support bid/ask quotes with price, size, and timestamp
  • Configurable message rate from 10,000 to 1,000,000+ messages per second
  • Timestamp resolution of nanoseconds

FR2: Binary Protocol Parsing

  • Define and parse a compact binary protocol (fixed-size messages)
  • Zero-copy parsing where possible
  • No heap allocation in the parsing hot path
  • Validate message checksums

FR3: Lock-Free Distribution

  • Implement bounded SPSC queues with configurable capacity
  • Support multiple independent consumers (each with its own queue)
  • Non-blocking push and pop operations
  • Proper memory ordering for correctness across architectures

FR4: Consumer Processing

  • Display received quotes in human-readable format
  • Track per-symbol statistics (message count, last price)
  • Detect sequence gaps (missed messages)

FR5: Latency Measurement

  • Measure message latency from producer to consumer
  • Report percentiles: p50, p75, p90, p95, p99, p999
  • Optionally generate latency histograms (HDR Histogram format)
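A minimal sketch of the percentile math behind these reports (a real handler would record into an HDR histogram rather than keep every sample; the index convention below is one of several in common use):

// Nearest-index percentile over a pre-sorted slice of nanosecond samples.
fn percentile(sorted_ns: &[u64], p: f64) -> u64 {
    assert!(!sorted_ns.is_empty() && (0.0..=100.0).contains(&p));
    let rank = ((p / 100.0) * (sorted_ns.len() - 1) as f64).round() as usize;
    sorted_ns[rank]
}

// Usage:
//   samples.sort_unstable();
//   println!("p99 = {} ns", percentile(&samples, 99.0));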

3.3 Non-Functional Requirements

NFR1: Performance

  • Throughput: 1,000,000+ messages per second on modern hardware
  • Latency (p50): < 100 nanoseconds for queue operations
  • Latency (p99): < 500 nanoseconds for queue operations
  • Zero heap allocation in steady-state operation

NFR2: Correctness

  • No data races (verifiable with ThreadSanitizer)
  • Correct memory ordering (verifiable with Loom for Rust)
  • FIFO ordering preserved
  • No lost or duplicated messages

NFR3: Resource Usage

  • Bounded memory usage (pre-allocated buffers)
  • Predictable CPU usage (no GC pauses, no unexpected allocations)

3.4 Example Usage / Output

$ ./market-data-handler --symbols AAPL,MSFT,GOOGL --rate 1000000 --consumers 3

================================================================================
                    LOCK-FREE MARKET DATA HANDLER
================================================================================

Configuration:
  Symbols:     AAPL, MSFT, GOOGL
  Target rate: 1,000,000 msg/sec
  Consumers:   3
  Queue size:  65,536 (power of 2)
  Duration:    10 seconds

Starting producer thread...
Starting consumer threads...

[PRODUCER] Generating market data...
[CONSUMER-1] AAPL BID 150.25 x 500  | ASK 150.26 x 300  | seq=1
[CONSUMER-2] MSFT BID 380.10 x 1000 | ASK 380.12 x 800  | seq=1
[CONSUMER-3] GOOGL BID 140.05 x 200 | ASK 140.08 x 150  | seq=1
[CONSUMER-1] AAPL BID 150.24 x 600  | ASK 150.26 x 400  | seq=2
...

================================================================================
                         BENCHMARK RESULTS
================================================================================

Duration: 10.00 seconds
Messages produced: 10,000,000
Messages consumed: 10,000,000 (0 dropped)

Throughput: 1.00M msg/sec

Latency Distribution (producer to consumer):
  p50:  45 ns
  p75:  62 ns
  p90:  85 ns
  p95:  98 ns
  p99:  127 ns
  p999: 892 ns
  max:  12,456 ns

Queue Statistics:
  Push operations:  10,000,000
  Pop operations:   10,000,000
  Full events:      0
  Empty polls:      1,247,893 (consumer faster than producer at times)

Cache Performance:
  L1 cache misses:  12,456 (with padding)
  Without padding:  4,231,892 (339x worse!)

================================================================================
                    COMPARISON: LOCK-FREE vs MUTEX
================================================================================

Implementation        Throughput      p50       p99       p999
------------------------------------------------------------------
Lock-Free (yours)     1.00M/s        45 ns     127 ns    892 ns
Mutex-Based           0.15M/s        612 ns    8,934 ns  45,678 ns
Crossbeam SPSC        0.95M/s        52 ns     145 ns    1,023 ns

Your implementation: 6.7x faster than mutex, 5% faster than crossbeam!

================================================================================

3.5 Real World Outcome

When you complete this project, you will have:

  1. A working lock-free market data handler that can process 1M+ messages per second with sub-100ns latency
  2. Deep understanding of atomics and memory ordering that you can apply to any concurrent data structure
  3. Practical experience with performance measurement including latency percentiles and cache analysis
  4. Portfolio-ready code demonstrating advanced systems programming skills
  5. Interview preparation for HFT firms, trading infrastructure roles, and systems programming positions

4. Solution Architecture

4.1 High-Level Design

                    COMPONENT ARCHITECTURE
+------------------------------------------------------------------------+
|                                                                        |
|                        +------------------+                            |
|                        |   Market Data    |                            |
|                        |   Generator      |                            |
|                        |   (Producer)     |                            |
|                        +--------+---------+                            |
|                                 |                                      |
|                                 | MarketDataMessage                    |
|                                 v                                      |
|                        +------------------+                            |
|                        |   Binary         |                            |
|                        |   Encoder        |                            |
|                        |   (Wire format)  |                            |
|                        +--------+---------+                            |
|                                 |                                      |
|                                 | [u8; 64]                             |
|                                 v                                      |
|         +-------------------+---+---+-------------------+              |
|         |                   |       |                   |              |
|         v                   v       v                   v              |
|  +-------------+     +-------------+     +-------------+               |
|  |  SPSC Queue |     |  SPSC Queue |     |  SPSC Queue |               |
|  |  (Consumer1)|     |  (Consumer2)|     |  (Consumer3)|               |
|  +------+------+     +------+------+     +------+------+               |
|         |                   |                   |                      |
|         v                   v                   v                      |
|  +-------------+     +-------------+     +-------------+               |
|  |   Binary    |     |   Binary    |     |   Binary    |               |
|  |   Decoder   |     |   Decoder   |     |   Decoder   |               |
|  +------+------+     +------+------+     +------+------+               |
|         |                   |                   |                      |
|         v                   v                   v                      |
|  +-------------+     +-------------+     +-------------+               |
|  |  Consumer 1 |     |  Consumer 2 |     |  Consumer 3 |               |
|  |  (Strategy) |     |  (Strategy) |     |  (Monitor)  |               |
|  +-------------+     +-------------+     +-------------+               |
|                                                                        |
+------------------------------------------------------------------------+

4.2 Key Components

1. Market Data Generator (Producer)

  • Generates realistic bid/ask quotes for configured symbols
  • Uses high-resolution timestamps (nanoseconds)
  • Maintains sequence numbers for gap detection
  • Encodes data in binary wire format

2. Binary Protocol

  • Fixed-size messages for predictable parsing
  • No variable-length fields in hot path
  • Includes checksum for data integrity
  • Designed for zero-copy parsing

3. SPSC Queue

  • Bounded ring buffer with power-of-2 capacity
  • Cache-line-padded head and tail pointers
  • Acquire/Release memory ordering
  • Returns Option/Result for non-blocking operation

4. Consumer

  • Decodes binary messages
  • Processes market data (display, statistics, strategy)
  • Reports latency measurements

4.3 Data Structures

Market Data Message (Application Level):

+------------------------------------------------------------------------+
|                    MARKET DATA MESSAGE (64 bytes)                      |
+------------------------------------------------------------------------+
|                                                                        |
|  Offset  Size   Field           Type        Description                |
|  ------  ----   -----           ----        -----------                |
|  0       8      timestamp       u64         Nanoseconds since epoch    |
|  8       8      sequence        u64         Monotonic sequence number  |
|  16      8      symbol          [u8; 8]     Symbol (null-padded ASCII) |
|  24      8      bid_price       u64         Bid price (fixed-point)    |
|  32      4      bid_size        u32         Bid quantity               |
|  36      8      ask_price       u64         Ask price (fixed-point)    |
|  44      4      ask_size        u32         Ask quantity               |
|  48      1      message_type    u8          Quote=1, Trade=2, etc.     |
|  49      1      flags           u8          Bitfield for flags         |
|  50      2      source_id       u16         Market/exchange identifier |
|  52      4      checksum        u32         CRC32 of bytes 0-51        |
|  56      8      _reserved       [u8; 8]     Padding to 64 bytes        |
|                                                                        |
|  Total: 64 bytes (one cache line)                                      |
|                                                                        |
+------------------------------------------------------------------------+
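A sketch of decoding this format in Rust. Because ask_price sits at byte offset 36 (not 8-byte aligned), casting the buffer to a #[repr(C)] struct would be unsound; field-by-field reads sidestep alignment entirely and still allocate nothing. Names are illustrative, and little-endian byte order is an assumption (the layout above doesn't fix it):

fn u64_at(buf: &[u8; 64], off: usize) -> u64 {
    u64::from_le_bytes(buf[off..off + 8].try_into().unwrap())
}

fn u32_at(buf: &[u8; 64], off: usize) -> u32 {
    u32::from_le_bytes(buf[off..off + 4].try_into().unwrap())
}

pub struct Quote {
    pub timestamp: u64,
    pub sequence: u64,
    pub symbol: [u8; 8],
    pub bid_price: u64,
    pub bid_size: u32,
    pub ask_price: u64,
    pub ask_size: u32,
    // message_type, flags, source_id, checksum decoded the same way (elided)
}

pub fn decode(buf: &[u8; 64]) -> Quote {
    Quote {
        timestamp: u64_at(buf, 0),
        sequence: u64_at(buf, 8),
        symbol: buf[16..24].try_into().unwrap(),
        bid_price: u64_at(buf, 24),
        bid_size: u32_at(buf, 32),
        ask_price: u64_at(buf, 36), // unaligned offset: byte reads are safe
        ask_size: u32_at(buf, 44),
    }
}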

SPSC Queue Structure:

+------------------------------------------------------------------------+
|                    SPSC QUEUE MEMORY LAYOUT                            |
+------------------------------------------------------------------------+
|                                                                        |
|  +--------------------- Cache Line 1 (64 bytes) ---------------------+ |
|  |                                                                   | |
|  |  head: AtomicUsize (8 bytes)                                      | |
|  |  _pad1: [u8; 56] (padding to fill cache line)                     | |
|  |                                                                   | |
|  +-------------------------------------------------------------------+ |
|                                                                        |
|  +--------------------- Cache Line 2 (64 bytes) ---------------------+ |
|  |                                                                   | |
|  |  tail: AtomicUsize (8 bytes)                                      | |
|  |  _pad2: [u8; 56] (padding to fill cache line)                     | |
|  |                                                                   | |
|  +-------------------------------------------------------------------+ |
|                                                                        |
|  +------------------- Cache Lines 3+ (N * 64 bytes) -----------------+ |
|  |                                                                   | |
|  |  buffer: [UnsafeCell<MaybeUninit<MarketDataMessage>>; N]          | |
|  |                                                                   | |
|  |  +--------+--------+--------+--------+--------+--------+          | |
|  |  | Slot 0 | Slot 1 | Slot 2 | Slot 3 | Slot 4 | ...    |          | |
|  |  | (64B)  | (64B)  | (64B)  | (64B)  | (64B)  |        |          | |
|  |  +--------+--------+--------+--------+--------+--------+          | |
|  |                                                                   | |
|  +-------------------------------------------------------------------+ |
|                                                                        |
|  NOTE: Each message is exactly one cache line (64 bytes).              |
|        This means each slot is cache-line-aligned by default.          |
|                                                                        |
+------------------------------------------------------------------------+
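A minimal Rust sketch of this layout, using crossbeam_utils::CachePadded from the setup section (names are illustrative; Drop handling for items still in the queue is omitted):

use std::cell::UnsafeCell;
use std::mem::MaybeUninit;
use std::sync::atomic::AtomicUsize;

use crossbeam_utils::CachePadded;

/// Bounded SPSC ring buffer. N must be a power of 2.
pub struct SpscQueue<T, const N: usize> {
    head: CachePadded<AtomicUsize>, // consumer-owned, own cache line
    tail: CachePadded<AtomicUsize>, // producer-owned, own cache line
    // UnsafeCell allows writes through &self; MaybeUninit lets empty
    // slots exist without requiring T: Default.
    buffer: Box<[UnsafeCell<MaybeUninit<T>>]>,
}

// SAFETY: all cross-thread coordination goes through head/tail; each
// slot is touched by at most one thread at a time (see push/pop below).
unsafe impl<T: Send, const N: usize> Sync for SpscQueue<T, N> {}

impl<T, const N: usize> SpscQueue<T, N> {
    pub fn new() -> Self {
        assert!(N.is_power_of_two(), "capacity must be a power of 2");
        let buffer = (0..N)
            .map(|_| UnsafeCell::new(MaybeUninit::uninit()))
            .collect::<Vec<_>>()
            .into_boxed_slice();
        Self {
            head: CachePadded::new(AtomicUsize::new(0)),
            tail: CachePadded::new(AtomicUsize::new(0)),
            buffer,
        }
    }
}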

Ring Buffer Operation:

+------------------------------------------------------------------------+
|                    RING BUFFER STATE MACHINE                           |
+------------------------------------------------------------------------+
|                                                                        |
|  Capacity = 8 (power of 2), indices 0-7                                |
|                                                                        |
|  EMPTY STATE: head == tail                                             |
|  +---+---+---+---+---+---+---+---+                                      |
|  |   |   |   |   |   |   |   |   |                                      |
|  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |                                      |
|  +---+---+---+---+---+---+---+---+                                      |
|    ^                                                                   |
|    |                                                                   |
|   head = tail = 0                                                      |
|                                                                        |
|  AFTER push(A), push(B), push(C): tail - head = 3                      |
|  +---+---+---+---+---+---+---+---+                                      |
|  | A | B | C |   |   |   |   |   |                                      |
|  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |                                      |
|  +---+---+---+---+---+---+---+---+                                      |
|    ^           ^                                                       |
|    |           |                                                       |
|   head=0     tail=3                                                    |
|                                                                        |
|  AFTER pop() returns A: head = 1                                       |
|  +---+---+---+---+---+---+---+---+                                      |
|  |   | B | C |   |   |   |   |   |                                      |
|  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |                                      |
|  +---+---+---+---+---+---+---+---+                                      |
|        ^       ^                                                       |
|        |       |                                                       |
|      head=1  tail=3                                                    |
|                                                                        |
|  WRAP-AROUND: head=6, tail=10 (wraps to index 2)                       |
|  +---+---+---+---+---+---+---+---+                                      |
|  | E | F |   |   |   |   | C | D |                                      |
|  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |                                      |
|  +---+---+---+---+---+---+---+---+                                      |
|          ^               ^                                             |
|          |               |                                             |
|        tail=10         head=6                                          |
|        (10%8=2)        (6%8=6)                                         |
|                                                                        |
|  FULL STATE: tail - head == capacity                                   |
|  +---+---+---+---+---+---+---+---+                                      |
|  | H | I | J | K | L | M | N | O |                                      |
|  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |                                      |
|  +---+---+---+---+---+---+---+---+                                      |
|    ^                                                                   |
|    |                                                                   |
|   head=8, tail=16 (16-8=8=capacity)                                    |
|                                                                        |
+------------------------------------------------------------------------+

4.4 Algorithm Overview

Push Operation (Producer):

+------------------------------------------------------------------------+
|                         PUSH ALGORITHM                                 |
+------------------------------------------------------------------------+
|                                                                        |
|  fn push(&self, value: T) -> Result<(), T>                             |
|                                                                        |
|  1. tail_local = tail.load(Relaxed)                                    |
|     // We "own" tail, so Relaxed is sufficient                         |
|                                                                        |
|  2. head_snapshot = head.load(Acquire)                                 |
|     // Acquire synchronizes with consumer's Release on head            |
|     // Ensures we see the consumer's progress                          |
|                                                                        |
|  3. if (tail_local - head_snapshot) == capacity:                       |
|        return Err(value)  // Queue is full                             |
|                                                                        |
|  4. index = tail_local & (capacity - 1)  // Efficient modulo           |
|     buffer[index] = value                // Write the data             |
|                                                                        |
|  5. tail.store(tail_local + 1, Release)                                |
|     // Release "publishes" the buffer write to consumers               |
|                                                                        |
|  6. return Ok(())                                                      |
|                                                                        |
+------------------------------------------------------------------------+
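A Rust rendering of this algorithm, continuing the SpscQueue sketch from section 4.3 (wrapping_sub keeps the full check correct even if the indices ever wrap around usize::MAX):

use std::sync::atomic::Ordering;

impl<T, const N: usize> SpscQueue<T, N> {
    /// Producer-side only. Returns the value back if the queue is full.
    pub fn push(&self, value: T) -> Result<(), T> {
        let tail = self.tail.load(Ordering::Relaxed); // we own tail
        let head = self.head.load(Ordering::Acquire); // see consumer progress

        if tail.wrapping_sub(head) == N {
            return Err(value); // full
        }

        let index = tail & (N - 1); // power-of-2 modulo
        // SAFETY: single producer, and tail - head < N means this slot
        // has been consumed (or never written), so no reader touches it.
        unsafe { (*self.buffer[index].get()).write(value) };

        // Release publishes the slot write to the consumer's Acquire load.
        self.tail.store(tail.wrapping_add(1), Ordering::Release);
        Ok(())
    }
}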

Pop Operation (Consumer):

+------------------------------------------------------------------------+
|                         POP ALGORITHM                                  |
+------------------------------------------------------------------------+
|                                                                        |
|  fn pop(&self) -> Option<T>                                            |
|                                                                        |
|  1. head_local = head.load(Relaxed)                                    |
|     // We "own" head, so Relaxed is sufficient                         |
|                                                                        |
|  2. tail_snapshot = tail.load(Acquire)                                 |
|     // Acquire synchronizes with producer's Release on tail            |
|     // Ensures we see the producer's buffer writes                     |
|                                                                        |
|  3. if head_local == tail_snapshot:                                    |
|        return None  // Queue is empty                                  |
|                                                                        |
|  4. index = head_local & (capacity - 1)  // Efficient modulo           |
|     value = buffer[index]                // Read the data              |
|                                                                        |
|  5. head.store(head_local + 1, Release)                                |
|     // Release "publishes" the consumption to producer                 |
|     // Producer can now reuse this slot                                |
|                                                                        |
|  6. return Some(value)                                                 |
|                                                                        |
+------------------------------------------------------------------------+
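And the matching consumer side, continuing the same sketch:

impl<T, const N: usize> SpscQueue<T, N> {
    /// Consumer-side only. Returns None when the queue is empty.
    pub fn pop(&self) -> Option<T> {
        let head = self.head.load(Ordering::Relaxed); // we own head
        let tail = self.tail.load(Ordering::Acquire); // see producer writes

        if head == tail {
            return None; // empty
        }

        let index = head & (N - 1);
        // SAFETY: single consumer, and head != tail means the producer
        // finished writing this slot before its Release store on tail.
        let value = unsafe { (*self.buffer[index].get()).as_ptr().read() };

        // Release tells the producer this slot may now be reused.
        self.head.store(head.wrapping_add(1), Ordering::Release);
        Some(value)
    }
}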

Why These Orderings Work:

+------------------------------------------------------------------------+
|                    MEMORY ORDERING PROOF                               |
+------------------------------------------------------------------------+
|                                                                        |
|  Producer                              Consumer                        |
|     |                                     |                            |
|     | (1) Write buffer[i] = data          |                            |
|     |          |                          |                            |
|     |          | (happens-before)         |                            |
|     |          v                          |                            |
|     | (2) tail.store(i+1, Release)        |                            |
|     |          |                          |                            |
|     |          | (synchronizes-with)      |                            |
|     |          v                          v                            |
|     |          +-----------------------> (3) tail.load(Acquire)        |
|     |                                     |                            |
|     |                                     | (happens-before)           |
|     |                                     v                            |
|     |                                    (4) Read buffer[i]            |
|     |                                     |                            |
|     v                                     v                            |
|                                                                        |
|  The key relationship:                                                 |
|                                                                        |
|  (1) happens-before (2)  [program order]                               |
|  (2) synchronizes-with (3)  [Release-Acquire on tail]                  |
|  (3) happens-before (4)  [program order]                               |
|                                                                        |
|  Therefore: (1) happens-before (4)                                     |
|  The consumer is GUARANTEED to see the producer's buffer write.        |
|                                                                        |
+------------------------------------------------------------------------+

5. Implementation Guide

5.1 Development Environment Setup

Rust Setup:

# Create project
cargo new lock-free-market-data --lib
cd lock-free-market-data

# Add dependencies to Cargo.toml
# [dependencies]
# crossbeam-utils = "0.8"    # For CachePadded
#
# [dev-dependencies]
# loom = "0.7"               # For concurrency testing
# criterion = "0.5"          # For benchmarking
# rand = "0.8"               # For test data generation

# For debugging (both tools require the nightly toolchain)
rustup +nightly component add miri   # Miri: detects undefined behavior
# ThreadSanitizer: RUSTFLAGS="-Zsanitizer=thread" cargo +nightly test

# Build and test
cargo build --release
cargo test

C++ Setup:

# Create project structure
mkdir lock-free-market-data && cd lock-free-market-data
mkdir src include tests benchmarks

# CMakeLists.txt should include:
# - C++17 or later
# - Threading support (find_package(Threads REQUIRED))
# - Address sanitizer for debug builds
# - Thread sanitizer option

# Recommended compiler flags
# -std=c++17 -O3 -march=native  # Release
# -std=c++17 -g -fsanitize=thread  # Debug with TSAN

5.2 Project Structure

Rust:

lock-free-market-data/
+-- Cargo.toml
+-- src/
|   +-- lib.rs           # Public API and module declarations
|   +-- spsc_queue.rs    # Lock-free SPSC queue implementation
|   +-- protocol.rs      # Binary protocol and message types
|   +-- producer.rs      # Market data generator
|   +-- consumer.rs      # Consumer logic
|   +-- stats.rs         # Latency measurement and statistics
+-- tests/
|   +-- integration.rs   # End-to-end tests
|   +-- loom_tests.rs    # Loom concurrency tests
+-- benches/
|   +-- throughput.rs    # Criterion benchmarks
+-- examples/
    +-- demo.rs          # Full system demonstration

C++:

lock-free-market-data/
+-- CMakeLists.txt
+-- include/
|   +-- spsc_queue.hpp   # Lock-free queue header
|   +-- protocol.hpp     # Message types
|   +-- market_data.hpp  # Public API
+-- src/
|   +-- spsc_queue.cpp
|   +-- protocol.cpp
|   +-- producer.cpp
|   +-- consumer.cpp
+-- tests/
|   +-- queue_test.cpp
|   +-- stress_test.cpp
+-- benchmarks/
    +-- throughput_bench.cpp

5.3 The Core Question You’re Answering

“How do you safely share data between threads without locks while maintaining correctness and maximum performance?”

This question breaks down into several sub-questions:

  1. How do you ensure one thread sees another thread’s writes?
    • Answer: Memory ordering (Acquire/Release semantics)
  2. How do you prevent two threads from corrupting shared state?
    • Answer: Careful design so each variable is “owned” by one thread, with atomic operations for coordination
  3. How do you avoid hidden performance killers?
    • Answer: Cache line padding to prevent false sharing
  4. How do you know your code is correct?
    • Answer: Formal reasoning, testing tools (Loom, TSAN), stress testing

5.4 Concepts You Must Understand First

Before writing any code, verify you can answer these questions:

Memory Model:

  1. What is the difference between program order and execution order?
  2. What does “happens-before” mean in the context of concurrency?
  3. Why does x86 code often work even with incorrect memory ordering?

Atomics:

  1. What does “atomic” mean for a load or store operation?
  2. What is a torn read/write and when can it occur?
  3. What is the difference between compare_exchange and compare_exchange_weak?

Cache Architecture:

  1. What is a cache line and what is its typical size?
  2. How does the MESI protocol work?
  3. What triggers a cache line invalidation?

If you cannot answer these questions, read:

  • “Rust Atomics and Locks” by Mara Bos, Chapters 1-3
  • “Computer Systems: A Programmer’s Perspective” by Bryant & O’Hallaron, Chapter 6

5.5 Questions to Guide Your Design

Data Structure Design:

  1. How will you ensure head and tail are on separate cache lines?
  2. Why must the buffer capacity be a power of 2?
  3. How will you handle the case when indices wrap around usize::MAX?

Memory Ordering:

  1. Which operations need which memory orderings?
  2. Draw the synchronization diagram - which Release pairs with which Acquire?
  3. What would break if you used Relaxed everywhere?

Error Handling:

  1. What should push do when the queue is full?
  2. What should pop do when the queue is empty?
  3. How will you handle backpressure?

Testing:

  1. How will you test that memory ordering is correct?
  2. How will you stress test for race conditions?
  3. How will you verify performance claims?

5.6 Thinking Exercise

Before writing code, complete this mental exercise:

Exercise: Trace Through an Execution

Consider this interleaving:

Initial state: head=0, tail=0, buffer=[_, _, _, _]

Time    Producer                    Consumer
----    --------                    --------
T1      tail_local = 0
T2      head_snapshot = 0
T3      (0-0 != 4, not full)
T4      buffer[0] = "MSG1"
T5                                  head_local = 0
T6                                  tail_snapshot = 0
T7                                  (0 == 0, empty!)
T8                                  return None
T9      tail.store(1, Release)
T10                                 head_local = 0
T11                                 tail_snapshot = 1
T12                                 (0 != 1, not empty)
T13                                 value = buffer[0]
T14                                 head.store(1, Release)
T15                                 return Some("MSG1")

Questions:

  1. At T7, why does the consumer correctly see the queue as empty?
  2. At T13, why is the consumer guaranteed to see “MSG1”?
  3. What memory ordering ensures the answer to question 2?

Exercise: Find the Bug

Consider this incorrect implementation:

fn push(&self, value: T) -> Result<(), T> {
    let tail = self.tail.load(Ordering::Relaxed);
    let head = self.head.load(Ordering::Relaxed);  // BUG!

    if tail - head == N {
        return Err(value);
    }

    let index = tail & (N - 1);
    unsafe { (*self.buffer[index].get()).write(value); }

    self.tail.store(tail + 1, Ordering::Relaxed);  // BUG!
    Ok(())
}

Questions:

  1. Why is using Relaxed on head.load() incorrect?
  2. Why is using Relaxed on tail.store() incorrect?
  3. What could go wrong on ARM but not x86?

5.7 Hints in Layers

Use these hints progressively - only look at the next level when stuck.

Hint Level 1: Starting Point

Begin with a struct containing head, tail, and buffer. Use AtomicUsize for head and tail. Use UnsafeCell<MaybeUninit<T>> for buffer slots to allow mutation through a shared reference.

Hint Level 2: Memory Layout

For cache line padding, you can use crossbeam_utils::CachePadded<AtomicUsize> or create your own wrapper with #[repr(align(64))]. Ensure the struct layout puts head and tail in separate cache lines.

Hint Level 3: Memory Ordering

  • Producer reads tail with Relaxed (it owns tail)
  • Producer reads head with Acquire (synchronize with consumer)
  • Producer writes buffer with no explicit ordering (protected by tail)
  • Producer writes tail with Release (publish buffer write)
  • Consumer is symmetric: Relaxed for head, Acquire for tail, Release for head update
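
Putting these rules together, push and pop might look like the following sketch, continuing the illustrative layout from Hint Level 2 (the .0 reaches through the CacheAligned wrapper):

use std::sync::atomic::Ordering;

impl<T, const N: usize> SpscQueue<T, N> {
    // Producer side: fails with Err(value) when the queue is full.
    pub fn push(&self, value: T) -> Result<(), T> {
        let tail = self.tail.0.load(Ordering::Relaxed); // we own tail
        let head = self.head.0.load(Ordering::Acquire); // see the consumer's frees
        if tail.wrapping_sub(head) == N {
            return Err(value); // full
        }
        unsafe { (*self.buffer[tail & (N - 1)].get()).write(value) };
        self.tail.0.store(tail.wrapping_add(1), Ordering::Release); // publish the write
        Ok(())
    }

    // Consumer side: returns None when the queue is empty.
    pub fn pop(&self) -> Option<T> {
        let head = self.head.0.load(Ordering::Relaxed); // we own head
        let tail = self.tail.0.load(Ordering::Acquire); // see the producer's writes
        if head == tail {
            return None; // empty
        }
        let value = unsafe { (*self.buffer[head & (N - 1)].get()).assume_init_read() };
        self.head.0.store(head.wrapping_add(1), Ordering::Release); // free the slot
        Some(value)
    }
}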

Hint Level 4: Testing and Verification

  • Use cargo miri test to check for undefined behavior
  • Use RUSTFLAGS="--cfg loom" cargo test for Loom tests
  • Create a stress test that runs producer and consumer on separate threads, pushing/popping millions of items, and verify the sum at the end

5.8 The Interview Questions They’ll Ask

Prepare to answer these questions about your implementation:

  1. “Walk me through the memory ordering in your push operation.”

    Expected answer: Discuss Relaxed for reading our own tail, Acquire for reading the consumer’s head (to see their progress), writing to buffer without explicit ordering (protected by the tail store), and Release for storing the updated tail (to publish our buffer write).

  2. “What happens if two producers call push simultaneously?”

    Expected answer: In SPSC, this is undefined behavior - the design assumes a single producer. For MPSC, you’d need CAS to atomically claim slots, handling the case where another producer wins the race.

  3. “How does cache line padding improve performance?”

    Expected answer: Without padding, head and tail may share a cache line. When the producer updates tail, it invalidates the consumer’s cached copy of head (and vice versa). With padding, each variable is on its own cache line, so updates don’t cause cross-core invalidations.

  4. “Why is your buffer capacity required to be a power of 2?”

    Expected answer: It allows using bitwise AND instead of modulo for index calculation. index = counter & (capacity - 1) is equivalent to index = counter % capacity when capacity is a power of 2, but AND is much faster (1 cycle vs 10-20+ cycles).

  5. “How would you modify this for MPMC?”

    Expected answer: Each producer needs to atomically claim a slot using CAS on tail. Each consumer needs to atomically claim an item using CAS on head. You’d also need sequence numbers per slot to handle the case where a producer wins a slot but hasn’t finished writing.

  6. “What is the ABA problem and does your SPSC queue have it?”

    Expected answer: ABA is when a CAS succeeds because a value changed A->B->A. It’s problematic in MPMC queues with memory reuse. SPSC doesn’t have ABA because indices only increment (they never return to previous values during the queue’s lifetime).

  7. “How would you prove your implementation is correct?”

    Expected answer: Use Loom to exhaustively test all possible interleavings. Use ThreadSanitizer to detect data races. Write a stress test that verifies every pushed item is popped exactly once. Test on ARM hardware (or QEMU) to catch ordering bugs that x86 hides.


5.9 Books That Will Help

Topic Book Chapter
Atomics in Rust “Rust Atomics and Locks” by Mara Bos Chapters 1-6
Memory Ordering Deep Dive “Rust Atomics and Locks” by Mara Bos Chapter 3
Cache Architecture “Computer Systems: A Programmer’s Perspective” by Bryant & O’Hallaron Chapter 6
C++ Atomics “C++ Concurrency in Action” by Anthony Williams Chapters 5-7
Lock-Free Algorithms “The Art of Multiprocessor Programming” by Herlihy & Shavit Chapters 10-15
HFT Systems “Building Low Latency Applications with C++” by Sourav Ghosh Chapters 4-6

5.10 Implementation Phases

Phase 1: Mutex-Based Baseline (Days 1-3)

Goal: Create a working queue with a mutex to establish a performance baseline.

Tasks:

  1. Define the message structure (64 bytes, cache-line-aligned)
  2. Implement a simple ring buffer queue with Mutex<VecDeque<T>>
  3. Write a producer thread that generates messages
  4. Write a consumer thread that receives and processes messages
  5. Add latency measurement (capture timestamp at push, measure at pop)
  6. Benchmark: measure throughput and latency percentiles

Checkpoint: You should see throughput around 100K-500K msg/sec with p99 latency in the microsecond range. This is your baseline to beat.
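
A minimal sketch of such a baseline, assuming a plain Mutex<VecDeque<T>> (adding per-message timestamps and the 64-byte message struct is left to you):

use std::collections::VecDeque;
use std::sync::{Arc, Mutex};
use std::thread;

fn main() {
    let queue = Arc::new(Mutex::new(VecDeque::<u64>::new()));
    let producer_q = Arc::clone(&queue);

    // Producer: push a million messages through the mutex.
    let producer = thread::spawn(move || {
        for i in 0..1_000_000u64 {
            producer_q.lock().unwrap().push_back(i);
        }
    });

    // Consumer: drain until every message has been seen.
    let mut received = 0u64;
    while received < 1_000_000 {
        if queue.lock().unwrap().pop_front().is_some() {
            received += 1;
        }
    }
    producer.join().unwrap();
    println!("received {received} messages");
}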

Phase 2: Basic Lock-Free SPSC (Days 4-9)

Goal: Replace the mutex with a lock-free SPSC queue.

Tasks:

  1. Implement the SPSC queue struct with atomic head/tail
  2. Implement push() with correct memory ordering
  3. Implement pop() with correct memory ordering
  4. Write unit tests for basic operations (push, pop, empty, full)
  5. Write a concurrent test that runs producer and consumer in parallel
  6. Verify correctness with Loom (Rust) or TSAN (C++)

Checkpoint: All tests pass, including Loom tests. You should see 5-10x improvement over mutex baseline.

Phase 3: Cache-Optimized Version (Days 10-14)

Goal: Eliminate false sharing and optimize for maximum performance.

Tasks:

  1. Add cache line padding to separate head and tail
  2. Benchmark with and without padding - quantify the improvement
  3. Profile cache behavior using perf stat (Linux) or Instruments (macOS)
  4. Experiment with prefetching (optional)
  5. Add the full market data handler: generator, parser, multiple consumers
  6. Implement comprehensive latency measurement (HDR Histogram)
  7. Compare against crossbeam-channel (Rust) or Boost.Lockfree (C++)
  8. Document your design decisions and benchmark results

Checkpoint: Throughput of 1M+ msg/sec, p50 latency under 100ns, beating or matching established libraries.
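
For task 6, one way to collect latency percentiles is the hdrhistogram crate (assumed as a dependency here; the loop body is a placeholder for your actual queue round-trip):

use hdrhistogram::Histogram;
use std::time::Instant;

fn main() {
    // Three significant figures is plenty for latency percentiles.
    let mut hist = Histogram::<u64>::new(3).unwrap();

    for _ in 0..1_000_000 {
        let start = Instant::now();
        // ... push + pop through the queue here ...
        hist.record((start.elapsed().as_nanos() as u64).max(1)).unwrap();
    }

    for q in [0.50, 0.99, 0.999] {
        println!("p{:.1}: {} ns", q * 100.0, hist.value_at_quantile(q));
    }
}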


5.11 Key Implementation Decisions

Decision 1: Bounded vs Unbounded

Choose bounded (fixed capacity). Unbounded queues require dynamic allocation, which introduces latency spikes. In HFT, predictable performance is more important than convenience.

Decision 2: Power-of-2 Capacity

Always use power-of-2 capacity. The bitwise AND trick for modulo is not just an optimization - it’s essential for correct wrap-around behavior with wrapping arithmetic.

Decision 3: MaybeUninit for Buffer

Use MaybeUninit<T> (Rust) or uninitialized storage (C++). The buffer slots are not always valid - they contain garbage before first write and after consumption. Using a type that doesn’t require initialization avoids undefined behavior.

Decision 4: Wrapping Arithmetic

Use wrapping addition for head and tail. This avoids the need to ever “reset” the indices, which would require additional synchronization. The queue length is always tail.wrapping_sub(head).


6. Testing Strategy

6.1 Unit Tests

Test each operation in isolation:

+------------------------------------------------------------------------+
|                         UNIT TEST CASES                                |
+------------------------------------------------------------------------+
|                                                                        |
|  Test: single_push_pop                                                 |
|    - Push one item, pop one item, verify value                         |
|    - Push another, pop another, verify                                 |
|                                                                        |
|  Test: empty_queue_returns_none                                        |
|    - Fresh queue, pop returns None                                     |
|    - After pushing and popping, pop returns None                       |
|                                                                        |
|  Test: full_queue_returns_err                                          |
|    - Push until capacity, next push returns Err                        |
|    - Pop one, push succeeds                                            |
|                                                                        |
|  Test: fifo_ordering                                                   |
|    - Push 1, 2, 3, 4, 5                                                |
|    - Pop should return 1, 2, 3, 4, 5 in order                          |
|                                                                        |
|  Test: wrap_around                                                     |
|    - Fill and empty queue multiple times                               |
|    - Verify no corruption at wrap-around boundaries                    |
|                                                                        |
|  Test: len_and_capacity                                                |
|    - Verify len() changes correctly with push/pop                      |
|    - Verify capacity() returns correct value                           |
|                                                                        |
+------------------------------------------------------------------------+
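
Two of these cases sketched as Rust tests, assuming the SpscQueue from section 5.7 plus a new() constructor:

#[cfg(test)]
mod tests {
    use super::SpscQueue;

    #[test]
    fn fifo_ordering() {
        let q = SpscQueue::<u32, 8>::new();
        for i in 1..=5 {
            q.push(i).unwrap();
        }
        for i in 1..=5 {
            assert_eq!(q.pop(), Some(i));
        }
        assert_eq!(q.pop(), None); // drained
    }

    #[test]
    fn full_queue_returns_err() {
        let q = SpscQueue::<u32, 2>::new();
        q.push(1).unwrap();
        q.push(2).unwrap();
        assert_eq!(q.push(3), Err(3)); // at capacity
        assert_eq!(q.pop(), Some(1));
        q.push(3).unwrap(); // space freed, push succeeds
    }
}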

6.2 Stress Tests

Run millions of operations across threads:

+------------------------------------------------------------------------+
|                         STRESS TEST DESIGN                             |
+------------------------------------------------------------------------+
|                                                                        |
|  Test: high_throughput                                                 |
|    - Producer pushes 10M messages as fast as possible                  |
|    - Consumer pops all messages                                        |
|    - Verify all messages received (sum of values matches)              |
|    - Measure time, calculate throughput                                |
|                                                                        |
|  Test: sustained_load                                                  |
|    - Run for 60 seconds at target rate (1M msg/sec)                    |
|    - Monitor for any dropped messages                                  |
|    - Check latency percentiles remain stable                           |
|                                                                        |
|  Test: backpressure                                                    |
|    - Producer faster than consumer                                     |
|    - Verify queue full condition handled correctly                     |
|    - Consumer catches up, verify no data loss                          |
|                                                                        |
|  Test: jitter                                                          |
|    - Randomly vary producer/consumer speed                             |
|    - Introduce random sleeps                                           |
|    - Verify correctness under variable conditions                      |
|                                                                        |
+------------------------------------------------------------------------+
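
A sketch of the high_throughput test, assuming the SpscQueue from section 5.7 is Sync and shared via Arc (each thread only calls its own operation, preserving the SPSC contract):

use std::sync::Arc;
use std::thread;

fn main() {
    const COUNT: u64 = 10_000_000;
    let queue = Arc::new(SpscQueue::<u64, 1024>::new());
    let producer_q = Arc::clone(&queue);

    let producer = thread::spawn(move || {
        for i in 0..COUNT {
            // Spin on a full queue; real code might back off.
            while producer_q.push(i).is_err() {}
        }
    });

    let mut sum = 0u64;
    for _ in 0..COUNT {
        loop {
            if let Some(v) = queue.pop() {
                sum += v;
                break;
            }
        }
    }
    producer.join().unwrap();
    assert_eq!(sum, COUNT * (COUNT - 1) / 2); // checksum of 0..COUNT
    println!("all {COUNT} messages accounted for");
}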

6.3 Concurrency Verification

For Rust - Using Loom:

Loom exhaustively tests all possible thread interleavings:

+------------------------------------------------------------------------+
|                         LOOM TEST STRATEGY                             |
+------------------------------------------------------------------------+
|                                                                        |
|  How Loom works:                                                       |
|                                                                        |
|  1. You write a test that spawns threads                               |
|  2. Loom replaces std atomics with its own instrumented versions       |
|  3. Loom explores ALL possible interleavings                           |
|  4. If any interleaving causes a bug, Loom reports it                  |
|                                                                        |
|  Example test:                                                         |
|                                                                        |
|  loom::model(|| {                                                      |
|      let queue = Arc::new(SpscQueue::<i32, 4>::new());                 |
|      let q1 = queue.clone();                                           |
|      let q2 = queue.clone();                                           |
|                                                                        |
|      let producer = loom::thread::spawn(move || {                      |
|          q1.push(42).unwrap();                                         |
|      });                                                               |
|                                                                        |
|      let consumer = loom::thread::spawn(move || {                      |
|          loop {                                                        |
|              if let Some(v) = q2.pop() {                               |
|                  assert_eq!(v, 42);                                    |
|                  break;                                                |
|              }                                                         |
|              loom::thread::yield_now(); // required in spin loops      |
|          }                                                             |
|      });                                                               |
|                                                                        |
|      producer.join().unwrap();                                         |
|      consumer.join().unwrap();                                         |
|  });                                                                   |
|                                                                        |
|  Running this explores hundreds of interleavings automatically.        |
|                                                                        |
+------------------------------------------------------------------------+
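
For Loom to instrument your queue, its atomics must come from loom when the test runs. The standard pattern from Loom's documentation is a cfg switch in your queue module:

// Under RUSTFLAGS="--cfg loom", use Loom's instrumented atomics;
// otherwise compile against the real std atomics.
#[cfg(loom)]
use loom::sync::atomic::{AtomicUsize, Ordering};
#[cfg(not(loom))]
use std::sync::atomic::{AtomicUsize, Ordering};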

For C++ - Using ThreadSanitizer (TSAN):

+------------------------------------------------------------------------+
|                         TSAN USAGE                                     |
+------------------------------------------------------------------------+
|                                                                        |
|  Compile with: -fsanitize=thread                                       |
|  Link with: -fsanitize=thread                                          |
|                                                                        |
|  Run your stress tests. TSAN will report:                              |
|  - Data races (concurrent non-atomic accesses)                         |
|  - Lock order violations                                               |
|  - Use of uninitialized synchronization primitives                     |
|                                                                        |
|  Example TSAN output for a bug:                                        |
|                                                                        |
|  WARNING: ThreadSanitizer: data race (pid=12345)                       |
|    Write of size 8 at 0x7f1234567890 by thread T1:                     |
|      #0 push src/queue.cpp:45                                          |
|    Previous read of size 8 at 0x7f1234567890 by thread T2:             |
|      #0 pop src/queue.cpp:67                                           |
|                                                                        |
|  If TSAN reports no issues after extensive testing, you have high      |
|  confidence (but not proof) of correctness.                            |
|                                                                        |
+------------------------------------------------------------------------+

7. Common Pitfalls & Debugging

7.1 Memory Ordering Bugs

Symptom: Works on x86, fails mysteriously on ARM (or works 99.99% of the time).

Problem: Using Relaxed where Acquire/Release is needed.

Example Bug:

// WRONG: Consumer might not see producer's buffer write
fn pop(&self) -> Option<T> {
    let head = self.head.load(Ordering::Relaxed);
    let tail = self.tail.load(Ordering::Relaxed);  // Should be Acquire!
    // ...
}

Diagnosis:

  • Run on ARM hardware (Raspberry Pi, Apple Silicon, AWS Graviton)
  • Use Loom for Rust
  • Carefully trace happens-before relationships on paper

Fix: Ensure every Release has a matching Acquire on the same variable.
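
For the pop example above, the corrected loads look like this:

// CORRECT: the Acquire load of tail pairs with the producer's Release
// store, so the producer's buffer write is visible before we read the slot.
let head = self.head.load(Ordering::Relaxed); // consumer owns head
let tail = self.tail.load(Ordering::Acquire); // pairs with the producer's Release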


7.2 False Sharing

Symptom: Multi-threaded performance is worse than expected, sometimes worse than single-threaded.

Problem: Head and tail on same cache line.

Diagnosis:

  • Use perf stat -e cache-misses (Linux)
  • Check sizeof your queue struct - head and tail should be 64+ bytes apart
  • Benchmark with and without padding

Fix: Use #[repr(align(64))] or CachePadded<T>.


7.3 Torn Reads/Writes

Symptom: Corrupted data, impossible values.

Problem: Reading/writing a multi-byte value that isn’t naturally atomic.

Example: A 64-bit read on a 32-bit system without atomics might read half of an old value and half of a new value.

Fix: Use atomic types for any shared data. In Rust, AtomicUsize is always atomic. For wider values, use a dedicated atomic type such as AtomicU64 where the target supports it - alignment alone does not make an access atomic.
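
For example, a shared 64-bit timestamp can be made tear-free with a dedicated atomic type (AtomicU64 exists on most targets, though not on every 32-bit platform):

use std::sync::atomic::{AtomicU64, Ordering};

// Loads and stores of an AtomicU64 are indivisible: readers can never
// observe half of an old value and half of a new one.
static LAST_SEEN_NS: AtomicU64 = AtomicU64::new(0);

fn publish(ts_ns: u64) {
    LAST_SEEN_NS.store(ts_ns, Ordering::Release);
}

fn read() -> u64 {
    LAST_SEEN_NS.load(Ordering::Acquire)
}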


7.4 ABA Problem (MPMC Only)

Symptom: Data corruption when memory is reused.

Problem: CAS succeeds because value changed A -> B -> A.

Does this affect SPSC?: No. In SPSC, head and tail only increment. They never return to previous values (within the queue’s practical lifetime).

For MPMC: Use tagged pointers (combine pointer with counter) or epoch-based reclamation.


7.5 Integer Overflow in Index Calculation

Symptom: Crash or incorrect behavior after billions of operations.

Problem: tail + 1 overflows after 2^64 operations - in debug builds it panics, and with a non-power-of-2 capacity, tail % capacity yields the wrong slot after the wrap.

Fix: Use wrapping arithmetic and bitwise AND:

let next_tail = tail.wrapping_add(1);  // Instead of tail + 1
let slot = tail & (capacity - 1);      // Instead of tail % capacity

This works because we only care about the difference tail - head, which wrapping subtraction handles correctly.
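
A quick demonstration of why the difference survives wrap-around (illustrative values):

let head: usize = usize::MAX - 1;       // consumer index just before overflow
let tail = head.wrapping_add(3);        // producer pushed 3 more; tail wrapped to 1
assert_eq!(tail.wrapping_sub(head), 3); // the length is still correct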


7.6 Uninitialized Memory

Symptom: Garbage data, crashes, undefined behavior sanitizer errors.

Problem: Reading from a buffer slot before it’s been written.

Fix: Use MaybeUninit<T> and only read a slot after verifying it has been written and published: check tail.wrapping_sub(head) > 0 rather than tail > head, since a plain comparison breaks once the indices wrap.


7.7 Debugging Techniques

1. Printf Debugging (Careful!)

  • Adding prints can hide race conditions by changing timing
  • Use atomic counters instead for counting events
  • Write to a thread-local buffer, print after test completes

2. Deterministic Testing

  • Use Loom for exhaustive interleaving testing
  • Reduce queue size to minimum for testing (capacity = 2)
  • Use simple values (integers) before complex types

3. Sanitizers

  • TSAN for race detection
  • ASAN for memory errors
  • MSAN for uninitialized reads
  • Miri for Rust undefined behavior

8. Extensions & Challenges

8.1 Basic Extensions

Extension 1: Batch Operations Implement push_batch() and pop_batch() that transfer multiple items atomically. This can improve throughput by amortizing the cost of atomic operations.

Extension 2: Blocking Wait Add push_blocking() and pop_blocking() that use OS primitives (futex, condition variable) to sleep when the queue is full/empty. Measure the latency tradeoff.

Extension 3: Statistics Collection Add internal counters for: push successes, push failures (full), pop successes, pop failures (empty). Make these atomic and cache-padded.

8.2 Advanced Extensions

Extension 4: MPSC Queue Extend to Multi-Producer Single-Consumer. Use CAS for producers to claim slots. Measure the performance difference from SPSC.

Extension 5: Memory-Mapped Buffer Use mmap to create a buffer backed by a file. This enables crash recovery - uncommitted messages survive process restarts.

Extension 6: NUMA-Aware Allocation On multi-socket systems, allocate the buffer on the NUMA node where the consumer runs to minimize memory access latency.

8.3 Challenge Problems

Challenge 1: Wait-Free Pop Modify pop to be wait-free (every call completes in bounded steps). This is harder than lock-free because you can’t retry infinitely.

Challenge 2: Seqlock Alternative Implement a seqlock-protected shared data structure as an alternative to a queue. Compare performance for read-heavy workloads.

Challenge 3: Variable-Size Messages Modify the queue to support variable-length messages. Consider: length prefix, separate length queue, or log-structured design.


9. Real-World Connections

9.1 Where Lock-Free Queues Are Used

Trading Systems: Virtually every HFT firm uses lock-free queues for market data distribution. The queue you're building is a simplified version of the kind of structure that firms like Jane Street, Two Sigma, and Citadel run in production.

Game Engines: Unity and Unreal use lock-free structures in their job systems. Frame deadlines are absolute - you can’t afford to wait on a lock.

Operating Systems: The Linux kernel uses lock-free RCU (Read-Copy-Update) for many kernel data structures. Windows uses interlocked singly-linked lists.

Databases: ScyllaDB, Cassandra, and RocksDB use lock-free structures for write-ahead logging and concurrent access.

Async Runtimes: Tokio’s work-stealing scheduler uses lock-free queues to pass tasks between worker threads.

9.2 Production-Quality Libraries

Study these after completing your implementation:

Rust:

  • crossbeam-channel: High-performance MPMC channels
  • flume: Alternative channel implementation
  • rtrb: Realtime-safe ring buffer

C++:

  • Boost.Lockfree: Comprehensive lock-free data structures
  • liblfds: C library with many lock-free structures
  • LMAX Disruptor: The famous low-latency ring buffer from the trading world (Java, with C++ ports)

9.3 Papers to Read

  • “Simple, Fast, and Practical Non-Blocking and Blocking Concurrent Queue Algorithms” by Michael and Scott (1996)
  • “Wait-Free Synchronization” by Herlihy (1991)
  • “A Scalable, Correct Time-Stamped Stack” by Dodds et al. (2015)

10. Resources

10.1 Books

Book Author Relevance
“Rust Atomics and Locks” Mara Bos Essential for Rust, excellent for concepts
“C++ Concurrency in Action” Anthony Williams C++ atomics and concurrent programming
“The Art of Multiprocessor Programming” Herlihy & Shavit Theoretical foundations
“Computer Systems: A Programmer’s Perspective” Bryant & O’Hallaron Cache architecture (Ch. 6)
“Building Low Latency Applications with C++” Sourav Ghosh HFT-specific techniques

10.2 Online Resources

Memory Ordering:

  • Rust Nomicon: “Atomics” section
  • Preshing on Programming: “Memory Barriers Are Like Source Control Operations”
  • cppreference.com: std::memory_order

Lock-Free Programming:

  • rigtorp/awesome-lockfree (GitHub): Curated resource list
  • “Fear and Loathing in Lock-Free Programming” (Medium article)
  • CppCon talks by Herb Sutter, Fedor Pikus

Tools:

  • Loom documentation (Tokio project)
  • ThreadSanitizer documentation (Google)
  • Miri documentation (Rust)

10.3 Reference Implementations

  • crossbeam-queue::ArrayQueue (Rust)
  • boost::lockfree::spsc_queue (C++)
  • Linux kernel’s kfifo (C)

11. Self-Assessment Checklist

Use this checklist to verify your understanding:

Memory Ordering:

  • I can explain the difference between Relaxed, Acquire, Release, and SeqCst
  • I can draw a happens-before diagram for my push/pop operations
  • I understand why my code would fail on ARM if I used wrong orderings
  • I can explain what “synchronizes-with” means

Implementation:

  • My queue uses cache line padding for head and tail
  • My buffer capacity is a power of 2
  • I use wrapping arithmetic for indices
  • I use MaybeUninit (or equivalent) for buffer slots
  • Push returns Err when full, pop returns None when empty

Testing:

  • Unit tests pass for basic operations
  • Loom tests pass (Rust) or TSAN reports no issues (C++)
  • Stress test verifies all messages received correctly
  • I’ve tested on ARM or in emulation

Performance:

  • I can achieve 1M+ messages per second
  • p50 latency is under 100ns for queue operations
  • I’ve measured the improvement from cache padding
  • I’ve compared against my mutex baseline

Understanding:

  • I can explain every memory ordering choice in my code
  • I can answer the interview questions in section 5.8
  • I understand why SPSC doesn’t have the ABA problem
  • I can explain when to use lock-free vs mutex-based queues

12. Submission / Completion Criteria

Your project is complete when:

Functionality:

  • Market data generator produces realistic quote data
  • Binary protocol encodes/decodes messages correctly
  • SPSC queue handles push/pop without data loss
  • Multiple consumers receive data independently
  • System runs stably for 60+ seconds under load

Performance:

  • Throughput exceeds 1M messages per second
  • Queue p50 latency under 100ns
  • Queue p99 latency under 500ns
  • At least 5x improvement over mutex baseline

Quality:

  • All tests pass (unit, integration, stress)
  • No data races detected by TSAN/Loom
  • Code compiles without warnings
  • README documents design decisions and benchmark results

Documentation:

  • Explanation of memory ordering choices
  • Benchmark results with methodology
  • Comparison against reference implementations
  • Known limitations and future improvements

Summary

You have now completed one of the most challenging and rewarding projects in systems programming: building a lock-free market data handler from scratch.

Along the way, you have mastered:

  1. Lock-free programming fundamentals - understanding why locks are problematic and how atomics provide an alternative
  2. Memory ordering semantics - the critical knowledge that separates correct concurrent code from code that “happens to work”
  3. Cache architecture and false sharing - hardware knowledge that can 10x your performance
  4. SPSC queue design - a fundamental building block used in countless real-world systems
  5. Binary protocol design - efficient data representation for high-performance systems
  6. Performance measurement - latency percentiles, throughput metrics, and meaningful benchmarking

This knowledge directly translates to:

  • HFT and trading systems
  • Game engine development
  • Operating system kernels
  • Database internals
  • Any system where latency matters

The queue you built processes over a million messages per second with sub-100 nanosecond latency. This is production-quality performance.

More importantly, you now understand why it’s fast. You can reason about memory ordering, diagnose false sharing, and design lock-free algorithms. These skills set you apart as a systems programmer.

Welcome to the world of lock-free programming. Go build something fast.


Estimated completion time: 2-3 weeks
Lines of code: 500-1000 (depending on testing thoroughness)
Next project: P03 - Simple Matching Engine (combines order book + lock-free queues)