Project 17: High-Frequency Trading Simulator (Capstone)
Quick Reference
| Attribute | Value |
|---|---|
| Project Number | 17 of 17 (Capstone) |
| Category | C++ Concurrency Mastery |
| Difficulty | Level 5: Master |
| Time Estimate | 8-12 weeks |
| Main Programming Language | C++ |
| Alternative Languages | C, Rust (no GC languages!) |
| Coolness Level | Level 5: Pure Magic (Super Cool) |
| Business Potential | 5. The “Industry Disruptor” |
| Knowledge Area | All Concurrency / Low Latency / Real-time |
| Primary Tool | GCC/Clang with C++20/23, perf, Valgrind |
| Main Books | “C++ Concurrency in Action” by Anthony Williams, “Trading and Exchanges” by Larry Harris |
Summary: Build a simulated high-frequency trading system with market data feed handlers (lock-free), order matching engine (SIMD), strategy execution (coroutines), and risk management - all with microsecond-level latency. This is the ultimate integration test of everything you have learned.
Learning Objectives
By completing this project, you will be able to:
- Architect a low-latency system where every microsecond matters and design decisions directly impact performance
- Integrate lock-free data structures for market data feeds handling millions of messages per second
- Apply SIMD operations for parallel price comparisons and order matching
- Design trading strategies as coroutines with efficient async workflows using co_await
- Implement real-time risk management with atomic operations and lock-free position tracking
- Master low-latency patterns: object pools, core pinning, NUMA awareness, branch elimination
- Measure and optimize using P50/P99/P99.9 latency metrics with sub-microsecond precision
- Answer interview questions about trading systems, low-latency design, and concurrent architecture
Theoretical Foundation
The Core Question You’re Answering
“How do you build a system where every nanosecond matters, combining lock-free data structures, SIMD operations, coroutines, and parallel algorithms into a cohesive architecture that processes millions of events per second with microsecond-level latency?”
This project forces you to answer:
- How do lock-free structures eliminate mutex overhead on the critical path?
- Why does SIMD enable matching 8 prices in the time it takes to compare 1?
- How do coroutines model trading strategies more naturally than callbacks?
- What architectural patterns eliminate memory allocations, system calls, and cache misses?
- How do you measure latency at P99.9 and identify tail latency sources?
Most developers have never built anything this performance-sensitive. After this project, you will understand what it takes to compete at the microsecond level and could interview for roles at trading firms, game engines, or embedded systems.
Why This Matters
LATENCY MATTERS IN HFT
Time Scale | What Happens
--------------------+--------------------------------------------------
1 second | Human reaction time. An eternity in HFT.
100ms | Network round-trip across continent.
10ms | Database query. OS scheduler quantum.
1 ms (1000 us) | Heavy mutex contention. Long GC pause.
100 us | Thread context switch (worst case). Garbage collection pause.
10 us | Thread wakeup. Your target quote-to-trade.
1 us (1000 ns) | Memory allocation (new/malloc). System call.
100 ns | Main memory access. L3 cache miss (~40+ ns).
10 ns | L2 cache hit. Branch misprediction. SIMD operation.
1 ns | CPU cycle at 1 GHz. Register access.
In HFT, being 10us faster can mean winning vs losing every trade.
Historical Context: The “Flash Crash” of May 6, 2010 saw the Dow drop 1000 points in minutes, driven partly by HFT algorithms. Modern markets execute millions of orders per second, with firms competing to shave microseconds off latency. Understanding these systems is essential for:
- Trading firm engineering (Jane Street, Two Sigma, Citadel)
- Game engine development (tick rates, physics updates)
- Embedded systems (automotive, aerospace)
- Real-time audio/video processing
- Any system where latency is business-critical
Common Misconceptions
- "Fast code is just about algorithms" - Wrong. Memory layout, cache behavior, branch prediction, and system architecture matter more than big-O complexity at this level.
- "Lock-free means faster" - Not always. Lock-free has overhead (CAS loops, memory barriers). It wins when contention is high or holding locks is dangerous, not universally.
- "C++ is automatically fast" - C++ gives you control, not speed. You can write slow C++ easily. Speed comes from understanding what the compiler generates and what the hardware does.
- "Latency and throughput are the same" - They are often opposed. Batching improves throughput but hurts latency. HFT cares about latency at P99.9, even at a throughput cost.
- "SIMD is just for math" - SIMD excels at data-parallel conditionals too. Comparing 8 prices at once is as useful as adding 8 numbers.
Concepts You Must Understand First
Stop and research these before coding:
1. Market Microstructure
How do financial markets actually work?
| Concept | Description | Why It Matters |
|---|---|---|
| Order Book | Sorted list of buy/sell orders by price | Central data structure you’ll build |
| Bid/Ask Spread | Gap between best buy and sell prices | Profit opportunity for market makers |
| Market Order | Execute immediately at best available price | Creates latency pressure |
| Limit Order | Execute only at specified price or better | Rests in order book |
| Matching Engine | Core component that matches buyers to sellers | Your SIMD optimization target |
ORDER BOOK VISUALIZATION
     BID SIDE                             ASK SIDE
 (buyers want to buy)               (sellers want to sell)

 Price      Quantity                Price       Quantity
 ------     --------                -------     --------
 $99.97       1,000  <-- best bid   $100.03         500  <-- best ask
 $99.95       2,500                 $100.05       1,200
 $99.90       5,000                 $100.10       3,000
 $99.85       3,200                 $100.15       2,800

 SPREAD = $0.06  (100.03 - 99.97)

 When a BUY market order arrives, it matches against the best ask.
Book Reference: “Trading and Exchanges” by Larry Harris - Ch. 1-4
2. Lock-Free SPSC Queue
The market data path must be lock-free. A Single-Producer Single-Consumer (SPSC) queue is the simplest lock-free structure.
// SPSC Queue: One thread writes, one reads
template<typename T, size_t Size>
class SPSCQueue {
alignas(64) std::array<T, Size> buffer_;
alignas(64) std::atomic<size_t> head_{0}; // Writer increments
alignas(64) std::atomic<size_t> tail_{0}; // Reader increments
// Note: head_ and tail_ on separate cache lines (64 bytes apart)
// This prevents false sharing between producer and consumer
public:
bool push(const T& item) {
size_t h = head_.load(std::memory_order_relaxed);
size_t next = (h + 1) % Size;
if (next == tail_.load(std::memory_order_acquire)) {
return false; // Queue full
}
buffer_[h] = item;
head_.store(next, std::memory_order_release);
return true;
}
std::optional<T> pop() {
size_t t = tail_.load(std::memory_order_relaxed);
if (t == head_.load(std::memory_order_acquire)) {
return std::nullopt; // Queue empty
}
T item = buffer_[t];
tail_.store((t + 1) % Size, std::memory_order_release);
return item;
}
};
Key Points:
- memory_order_acquire on the load synchronizes with memory_order_release on the store
- Cache-line alignment prevents false sharing (producer and consumer don't contend)
- No locks, no CAS loops (possible because there is exactly one producer and one consumer)
Book Reference: “C++ Concurrency in Action” Chapter 7 - Anthony Williams
3. SIMD Price Comparison
Instead of comparing prices one at a time, compare 8 at once:
#include <experimental/simd>
namespace stdx = std::experimental;
// Traditional: O(n) comparisons, one at a time
for (size_t i = 0; i < orders.size(); ++i) {
if ((side == BUY && orders[i].price >= target_price) ||
(side == SELL && orders[i].price <= target_price)) {
matches.push_back(i);
}
}
// SIMD: O(n/8) iterations, 8 comparisons per iteration
using simd_t = stdx::fixed_size_simd<int64_t, 8>;
simd_t target = target_price; // Broadcast to all lanes
for (size_t i = 0; i < orders.size(); i += 8) {
simd_t prices(&order_prices[i], stdx::element_aligned);
auto mask = (side == BUY) ? (prices >= target) : (prices <= target);
if (stdx::any_of(mask)) {
// Extract matching indices using mask
for (size_t j = 0; j < 8; ++j) {
if (mask[j]) matches.push_back(i + j);
}
}
}
Book Reference: Projects 12-14 in this series
4. Coroutines for Strategy Execution
Trading strategies are naturally expressed as async workflows:
// Strategy as a coroutine - reads like sequential code
// but executes asynchronously
Task<void> momentum_strategy(MarketDataFeed& feed, OrderRouter& router) {
while (true) {
// Wait for next price update
auto quote = co_await feed.next_quote();
// Calculate trading signal
double signal = co_await calculate_momentum(quote);
if (signal > buy_threshold) {
// Submit order and wait for fill
auto order = Order{BUY, quote.symbol, 100, quote.ask};
auto result = co_await router.submit(order);
if (result.filled) {
// Hold position
co_await sleep_for(1000ms);
// Close position
co_await router.submit(Order{SELL, quote.symbol, 100});
}
}
// Yield to other strategies
co_await next_tick();
}
}
Why coroutines over callbacks?
- Sequential code is easier to reason about
- State is preserved in coroutine frame (no explicit state machines)
- Cooperative multitasking without OS threads
- Can run thousands of strategies on few threads
Book Reference: Projects 9-11 in this series
5. Low-Latency Patterns
These patterns eliminate the sources of latency:
| Pattern | Problem Solved | Implementation |
|---|---|---|
| Object Pools | new/delete take microseconds | Pre-allocate arrays, use free lists |
| Core Pinning | OS migrates threads, thrashing caches | pthread_setaffinity_np |
| NUMA Awareness | Cross-socket memory access is 100+ ns | Allocate on local NUMA node |
| Branch Elimination | Mispredictions cost 15+ cycles | Use SIMD masks, branchless min/max |
| Cache Prefetching | L3 miss is 40+ ns | __builtin_prefetch |
| Kernel Bypass | System calls cost microseconds | DPDK for networking (optional) |
LATENCY SOURCES TO ELIMINATE
+-----------------+ +-----------------+
| Memory | | System |
| Allocation | <------ | malloc/new |
| ~1-10 us | | heap management |
+-----------------+ +-----------------+
+-----------------+ +-----------------+
| Cache | | Thread |
| Miss (L3) | <------ | Migration |
| ~40 ns | | by OS scheduler |
+-----------------+ +-----------------+
+-----------------+ +-----------------+
| Lock | | Mutex |
| Contention | <------ | contention |
| ~1-100 us | | between threads |
+-----------------+ +-----------------+
+-----------------+ +-----------------+
| Branch | | Unpredictable |
| Misprediction | <------ | if/else in |
| ~15 cycles | | hot paths |
+-----------------+ +-----------------+
Project Specification
What You’ll Build
A complete simulated HFT system with these components:
- Market Data Handler: Lock-free SPSC queue receiving simulated market data at 5M+ messages/second
- Order Book Manager: Per-instrument order books with SIMD-accelerated matching
- Strategy Engine: Multiple trading strategies implemented as coroutines
- Risk Manager: Real-time position and exposure tracking with atomic operations
- Order Router: Submits orders to simulated exchange
- Performance Monitor: Measures latency at P50/P99/P99.9
Performance Targets
| Metric | Target | Description |
|---|---|---|
| Quote-to-Trade P50 | < 5 us | Median latency from quote arrival to order submission |
| Quote-to-Trade P99 | < 10 us | 99th percentile latency |
| Quote-to-Trade P99.9 | < 50 us | 99.9th percentile (1 in 1000 events) |
| Throughput | > 1M quotes/sec | Market data processing capacity |
| Orders/Second | > 50,000 | Order submission rate |
| Cache Miss Rate | < 1% | L3 cache misses in hot path |
| Memory Allocations | 0 | Zero allocations in critical path |
Input/Output Specification
Configuration:
# trading.yaml
market_data:
source: "replay" # or "simulated", "network"
file: "nasdaq_20240115.pcap"
replay_speed: 10 # 10x real-time
instruments:
count: 8000 # Number of stocks
symbols: ["AAPL", "GOOGL", ...] # or "all"
strategies:
- name: "momentum"
enabled: true
params:
lookback_period: 100
threshold: 0.02
- name: "mean_reversion"
enabled: true
params:
window: 50
z_score_threshold: 2.0
- name: "market_making"
enabled: true
params:
spread: 0.001
position_limit: 1000
threads:
market_data: [0, 1] # Pinned to cores 0-1
strategy: [2, 3, 4, 5] # Pinned to cores 2-5
risk: [6] # Pinned to core 6
order_management: [7] # Pinned to core 7
risk:
max_position_per_symbol: 10000
max_total_exposure: 10000000 # $10M
max_loss_per_day: 100000 # $100k
Expected Output:
$ ./hft_simulator --config trading.yaml
=== High-Frequency Trading Simulator ===
Loading market data replay: nasdaq_20240115.pcap
Instruments: 8,000 stocks
Strategies: 5 (momentum, mean-reversion, arbitrage, market-making, statistical)
System Configuration:
Market data threads: 2 (pinned to cores 0-1)
Strategy threads: 4 (pinned to cores 2-5)
Risk thread: 1 (pinned to core 6)
Order management: 1 (pinned to core 7)
Data structures:
Market data queue: Lock-free SPSC, 1M slots
Order book: Lock-free per-instrument, SIMD matching
Position cache: Atomic counters, no locks
Pre-allocation complete:
Order pool: 100,000 orders pre-allocated
Quote pool: 1,000,000 quotes pre-allocated
Message pool: 500,000 messages pre-allocated
Starting replay at 10x speed...
[09:30:00.000] Market open
[09:30:00.001] Received 50,000 quotes
[09:30:00.002] Strategy signals: 23 buy, 17 sell
[09:30:00.003] Orders submitted: 40
[09:30:00.004] Orders filled: 38, rejected: 2 (risk limit)
Performance Metrics (1 second window):
Market data messages: 5.2M
Quote-to-trade latency:
P50: 2.3 us
P99: 8.1 us
P99.9: 24 us
Orders per second: 45,000
Strategy CPU utilization: 78%
Cache miss rate: 0.3%
[10:30:00.000] Replay complete
Summary:
Total PnL: $127,432 (simulated)
Trades: 2.1M
Win rate: 51.3%
Sharpe ratio: 3.2
Max drawdown: -$8,923
Latency Histogram:
< 1us: ████████████░░░░░░░░ 35%
1-5us: ██████████████████░░ 52%
5-10us: ███░░░░░░░░░░░░░░░░░ 10%
10-50us:█░░░░░░░░░░░░░░░░░░░ 2%
> 50us: ░░░░░░░░░░░░░░░░░░░░ 1%
Solution Architecture
System Overview
+-----------------------------------------------------------------------------+
| HIGH-FREQUENCY TRADING SIMULATOR |
+-----------------------------------------------------------------------------+
| |
| +------------------+ |
| | Market Data | Lock-free |
| | Feed Handler |------+ |
| | (Core 0-1) | | |
| +------------------+ | |
| v |
| +---------------+ |
| | SPSC Queue | |
| | (1M slots) | |
| +-------+-------+ |
| | |
| v |
| +-----------------------------------------------------------+ |
| | ORDER BOOK MANAGER | |
| | | |
| | Symbol Order Book (SIMD-optimized) | |
| | ------ ----------------------------------- | |
| | AAPL --> [Bid: 185.50(100), 185.49(500), ...] | |
| | [Ask: 185.52(200), 185.53(300), ...] | |
| | GOOGL --> [Bid: 142.10(50), 142.09(200), ...] | |
| | [Ask: 142.12(100), 142.15(400), ...] | |
| | ... (8000 instruments) | |
| +----------------------------+------------------------------+ |
| | |
| +----------------+----------------+ |
| | | |
| v v |
| +-------------------+ +-------------------+ |
| | Price Updates | | Trade Signals | |
| +--------+----------+ +--------+----------+ |
| | | |
| v v |
| +------------------------------------------------+ |
| | STRATEGY ENGINE (Cores 2-5) | |
| | | |
| | +-------------+ +---------------+ +-------+ | |
| | | Momentum | | Mean-Reversion| | Arb | | |
| | | (coroutine) | | (coroutine) | | (co) | | |
| | +------+------+ +-------+-------+ +---+---+ | |
| | | | | | |
| +---------+-----------------+--------------+------+ |
| | | | |
| v v v |
| +---------------------------------------------------+ |
| | ORDER QUEUE | |
| +----------------------------+----------------------+ |
| | |
| v |
| +---------------------------------------------------+ |
| | RISK MANAGER (Core 6) | |
| | | |
| | Position Limits Exposure Checks P&L Track | |
| | (atomic counters) (lock-free) (real-time) | |
| +----------------------------+-----------------------+ |
| | |
| PASS or REJECT |
| | |
| v |
| +---------------------------------------------------+ |
| | ORDER ROUTER (Core 7) | |
| +----------------------------+-----------------------+ |
| | |
| v |
| +---------------------------------------------------+ |
| | SIMULATED EXCHANGE | |
| | | |
| | Matching Engine Fill Reports Rejections | |
| +---------------------------------------------------+ |
| |
+-----------------------------------------------------------------------------+
Key Components
1. MarketDataFeed
// Market data feed handler - processes raw network data
class MarketDataFeed {
private:
SPSCQueue<Quote, 1'000'000> queue_; // Lock-free, 1M slots
std::atomic<bool> running_{true};
// Object pool for quotes (zero allocation)
ObjectPool<Quote, 1'000'000> quote_pool_;
public:
// Producer thread (network/replay)
void run_producer(DataSource& source) {
// Pin to cores 0-1
set_thread_affinity({0, 1});
while (running_.load(std::memory_order_relaxed)) {
auto raw = source.read();
if (!raw) continue;
// Get quote from pool (no allocation)
Quote* quote = quote_pool_.acquire();
parse_into(raw, *quote);
// Push to queue (lock-free)
while (!queue_.push(*quote)) {
// Queue full - apply backpressure
_mm_pause(); // CPU-friendly spin
}
quote_pool_.release(quote); // queue stores a copy, so return to pool
}
}
// Consumer interface (strategy threads)
std::optional<Quote> next_quote() {
return queue_.pop();
}
};
2. OrderBook
// Per-instrument order book with SIMD matching
class OrderBook {
private:
// Structure of arrays for SIMD access
alignas(64) std::array<int64_t, MAX_LEVELS> bid_prices_;
alignas(64) std::array<int64_t, MAX_LEVELS> bid_quantities_;
alignas(64) std::array<int64_t, MAX_LEVELS> ask_prices_;
alignas(64) std::array<int64_t, MAX_LEVELS> ask_quantities_;
std::atomic<size_t> bid_count_{0};
std::atomic<size_t> ask_count_{0};
public:
// SIMD-accelerated order matching
std::vector<Match> match_order(const Order& order) {
using simd_t = stdx::fixed_size_simd<int64_t, 8>;
std::vector<Match> matches;
const auto& prices = (order.side == BUY) ? ask_prices_ : bid_prices_;
const auto& qtys = (order.side == BUY) ? ask_quantities_ : bid_quantities_;
size_t count = (order.side == BUY) ?
ask_count_.load(std::memory_order_acquire) :
bid_count_.load(std::memory_order_acquire);
simd_t target_price = order.price;
int64_t remaining_qty = order.quantity;
for (size_t i = 0; i < count && remaining_qty > 0; i += 8) {
// Load 8 prices at once
simd_t book_prices(&prices[i], stdx::element_aligned);
// Compare 8 prices at once
auto price_match = (order.side == BUY) ?
(book_prices <= target_price) :
(book_prices >= target_price);
if (stdx::any_of(price_match)) {
// Extract matches
for (size_t j = 0; j < 8 && remaining_qty > 0; ++j) {
if (i + j < count && price_match[j] && qtys[i + j] > 0) { // guard the ragged tail past count
int64_t fill_qty = std::min(remaining_qty, qtys[i + j]);
matches.push_back({prices[i + j], fill_qty});
remaining_qty -= fill_qty;
}
}
}
}
return matches;
}
};
3. StrategyEngine (Coroutine-based)
// Coroutine task type for strategies
template<typename T>
class Task {
struct promise_type {
T value;
std::suspend_always initial_suspend() { return {}; }
std::suspend_always final_suspend() noexcept { return {}; }
Task get_return_object() {
return Task{std::coroutine_handle<promise_type>::from_promise(*this)};
}
void return_value(T v) { value = std::move(v); }
void unhandled_exception() { std::terminate(); }
};
std::coroutine_handle<promise_type> handle_;
// ...
};
// Awaitable for next market data tick
struct NextTick {
MarketDataFeed& feed_;
bool await_ready() { return feed_.has_data(); }
void await_suspend(std::coroutine_handle<> h) {
feed_.register_waiter(h);
}
Quote await_resume() { return feed_.get_quote(); }
};
// Momentum strategy as coroutine
Task<void> momentum_strategy(
MarketDataFeed& feed,
RiskManager& risk,
OrderRouter& router
) {
// State persists across co_await points
std::deque<double> price_history;
constexpr size_t LOOKBACK = 100;
constexpr double THRESHOLD = 0.02;
while (true) {
// Wait for next quote (suspends, no busy-waiting)
auto quote = co_await NextTick{feed};
// Update price history
price_history.push_back(quote.mid_price());
if (price_history.size() > LOOKBACK) {
price_history.pop_front();
}
if (price_history.size() < LOOKBACK) continue;
// Calculate momentum signal
double momentum = (price_history.back() - price_history.front())
/ price_history.front();
if (std::abs(momentum) > THRESHOLD) {
Side side = momentum > 0 ? BUY : SELL;
Order order{side, quote.symbol, 100, quote.mid_price()};
// Check risk (lock-free atomic check)
if (co_await risk.check(order)) {
co_await router.submit(order);
}
}
}
}
4. RiskManager
// Lock-free risk management
class RiskManager {
private:
// Per-symbol position tracking (atomic)
struct SymbolPosition {
alignas(64) std::atomic<int64_t> position{0};
alignas(64) std::atomic<int64_t> exposure{0};
};
std::unordered_map<SymbolId, SymbolPosition> positions_;
std::atomic<int64_t> total_exposure_{0};
std::atomic<int64_t> daily_pnl_{0};
// Limits
const int64_t max_position_per_symbol_;
const int64_t max_total_exposure_;
const int64_t max_daily_loss_;
public:
// Lock-free risk check
// NOTE: positions_ must be fully populated before trading starts;
// operator[] insertion at runtime would not be thread-safe.
bool check(const Order& order) {
auto& pos = positions_[order.symbol];
int64_t current = pos.position.load(std::memory_order_relaxed);
int64_t new_pos = current + (order.side == BUY ? order.quantity : -order.quantity);
// Position limit check
if (std::abs(new_pos) > max_position_per_symbol_) {
return false;
}
// Exposure check
int64_t current_exposure = total_exposure_.load(std::memory_order_relaxed);
int64_t order_value = order.quantity * order.price;
if (current_exposure + order_value > max_total_exposure_) {
return false;
}
// Daily loss check
if (daily_pnl_.load(std::memory_order_relaxed) < -max_daily_loss_) {
return false;
}
return true;
}
// Update position after fill (lock-free)
void update_position(const Fill& fill) {
auto& pos = positions_[fill.symbol];
// Atomic update
pos.position.fetch_add(
fill.side == BUY ? fill.quantity : -fill.quantity,
std::memory_order_relaxed
);
pos.exposure.fetch_add(
fill.quantity * fill.price,
std::memory_order_relaxed
);
total_exposure_.fetch_add(
fill.quantity * fill.price,
std::memory_order_relaxed
);
}
};
Data Structures
// Core data types - all fixed size, no allocations
struct Quote {
SymbolId symbol; // 8 bytes
int64_t bid_price; // Price in cents (fixed point)
int64_t ask_price;
int32_t bid_size;
int32_t ask_size;
uint64_t timestamp; // Nanoseconds since epoch
// Total: 40 bytes, fits in cache line
};
struct Order {
uint64_t order_id;
SymbolId symbol;
Side side; // BUY or SELL
int64_t price;
int32_t quantity;
OrderType type; // MARKET, LIMIT
uint64_t timestamp;
};
struct Fill {
uint64_t order_id;
SymbolId symbol;
Side side;
int64_t price;
int32_t quantity;
uint64_t exchange_timestamp;
uint64_t local_timestamp; // For latency measurement
};
// Object pool for zero-allocation
template<typename T, size_t N>
class ObjectPool {
private:
std::array<T, N> objects_;
std::array<T*, N> free_list_;
std::atomic<size_t> free_count_{N};
public:
ObjectPool() {
for (size_t i = 0; i < N; ++i) {
free_list_[i] = &objects_[i];
}
}
// NOTE: this two-step scheme is only safe when acquire() and release()
// each run on a single thread (as in the SPSC feed path); a multi-threaded
// pool needs a lock-free free list instead.
T* acquire() {
size_t idx = free_count_.fetch_sub(1, std::memory_order_relaxed) - 1;
return free_list_[idx];
}
void release(T* obj) {
size_t idx = free_count_.fetch_add(1, std::memory_order_relaxed);
free_list_[idx] = obj;
}
};
Implementation Guide
Phase 1: Foundation (Weeks 1-2)
Goal: Basic infrastructure without optimizations
- Project Structure:
hft_simulator/
├── CMakeLists.txt
├── src/
│   ├── main.cpp
│   ├── core/
│   │   ├── types.hpp         # Quote, Order, Fill structs
│   │   ├── spsc_queue.hpp    # Lock-free SPSC queue
│   │   ├── object_pool.hpp   # Pre-allocation pool
│   │   └── timing.hpp        # High-resolution timing
│   ├── market_data/
│   │   ├── feed.hpp
│   │   ├── feed.cpp
│   │   └── parser.cpp
│   ├── order_book/
│   │   ├── order_book.hpp
│   │   └── order_book.cpp
│   ├── strategy/
│   │   ├── task.hpp          # Coroutine Task type
│   │   ├── momentum.cpp
│   │   ├── mean_reversion.cpp
│   │   └── market_making.cpp
│   ├── risk/
│   │   ├── risk_manager.hpp
│   │   └── risk_manager.cpp
│   └── routing/
│       ├── order_router.hpp
│       └── simulated_exchange.cpp
├── tests/
│   ├── test_spsc_queue.cpp
│   ├── test_order_book.cpp
│   └── test_risk.cpp
├── benchmarks/
│   ├── bench_latency.cpp
│   └── bench_throughput.cpp
└── data/
    └── sample_data.csv

- Start with simple implementations:
- Mutex-based queue (replace with lock-free later)
- Scalar order matching (replace with SIMD later)
- Callback-based strategies (replace with coroutines later)
- Build the data flow:
- Read CSV market data
- Update order books
- Generate strategy signals
- Submit orders
Checkpoint: System runs end-to-end, but slowly (100s of microseconds latency)
Phase 2: Lock-Free Path (Weeks 3-4)
Goal: Eliminate locks from critical path
- Implement SPSC Queue:

```cpp
// Key insight: separate cache lines for producer and consumer
template<typename T, size_t Size>
class SPSCQueue {
    static_assert((Size & (Size - 1)) == 0, "Size must be power of 2");

    alignas(64) std::atomic<size_t> head_{0};
    char padding1_[64 - sizeof(std::atomic<size_t>)];
    alignas(64) std::atomic<size_t> tail_{0};
    char padding2_[64 - sizeof(std::atomic<size_t>)];
    alignas(64) std::array<T, Size> buffer_;

public:
    bool push(const T& item) {
        const size_t h = head_.load(std::memory_order_relaxed);
        const size_t next = (h + 1) & (Size - 1);

        if (next == tail_.load(std::memory_order_acquire)) {
            return false; // Full
        }
        buffer_[h] = item;
        head_.store(next, std::memory_order_release);
        return true;
    }

    bool pop(T& item) {
        const size_t t = tail_.load(std::memory_order_relaxed);
        if (t == head_.load(std::memory_order_acquire)) {
            return false; // Empty
        }
        item = buffer_[t];
        tail_.store((t + 1) & (Size - 1), std::memory_order_release);
        return true;
    }
};
```
- Lock-free position tracking:
  - Use std::atomic<int64_t> for positions
  - Use fetch_add for updates
  - No locks, no CAS loops for updates
- Benchmark lock-free vs mutex:
  - Expect 10-100x improvement under contention
Checkpoint: Critical path is lock-free, latency drops to 10-50 microseconds
Phase 3: SIMD Optimization (Weeks 5-6)
Goal: Vectorize order matching
1. **Convert to Structure of Arrays**:

```cpp
// Before: Array of Structures (AoS) - bad for SIMD
struct OrderBookLevel {
    int64_t price;
    int32_t quantity;
    uint64_t timestamp;
};
std::vector<OrderBookLevel> levels;

// After: Structure of Arrays (SoA) - SIMD-friendly
struct OrderBook {
    alignas(64) std::array<int64_t, 256> prices;
    alignas(64) std::array<int32_t, 256> quantities;
    alignas(64) std::array<uint64_t, 256> timestamps;
    size_t count;
};
```
2. **Implement SIMD matching**:
```cpp
#include <immintrin.h> // AVX2 intrinsics
// Compare 4 prices at once (AVX2, 256-bit)
std::vector<size_t> find_matching_levels_simd(
const OrderBook& book,
int64_t target_price,
Side order_side
) {
std::vector<size_t> matches;
__m256i target = _mm256_set1_epi64x(target_price);
for (size_t i = 0; i < book.count; i += 4) {
__m256i prices = _mm256_load_si256(
reinterpret_cast<const __m256i*>(&book.prices[i])
);
__m256i cmp;
if (order_side == BUY) {
// For buy: book price <= target (ask side)
cmp = _mm256_cmpgt_epi64(target, prices); // target > book
cmp = _mm256_or_si256(cmp, _mm256_cmpeq_epi64(target, prices));
} else {
// For sell: book price >= target (bid side)
cmp = _mm256_cmpgt_epi64(prices, target);
cmp = _mm256_or_si256(cmp, _mm256_cmpeq_epi64(prices, target));
}
// Extract mask and find matching indices
int mask = _mm256_movemask_pd(_mm256_castsi256_pd(cmp));
for (size_t j = 0; j < 4; ++j) {
if (mask & (1 << j)) {
matches.push_back(i + j);
}
}
}
return matches;
}
```
3. **Benchmark SIMD vs scalar**:
   - Expect 4-8x improvement for price comparisons
Checkpoint: Order matching uses SIMD, latency drops to 5-20 microseconds
Phase 4: Coroutines (Weeks 7-8)
Goal: Implement strategies as coroutines
- Create Task type:

```cpp
template<typename T>
class Task {
public:
    struct promise_type;
    using Handle = std::coroutine_handle<promise_type>;

    struct promise_type {
        T result_;
        std::exception_ptr exception_;

        Task get_return_object() { return Task{Handle::from_promise(*this)}; }
        std::suspend_never initial_suspend() noexcept { return {}; }
        std::suspend_always final_suspend() noexcept { return {}; }
        void return_value(T value) { result_ = std::move(value); }
        void unhandled_exception() { exception_ = std::current_exception(); }
    };

private:
    Handle handle_;

public:
    explicit Task(Handle h) : handle_(h) {}
    ~Task() { if (handle_) handle_.destroy(); }

    T get() {
        while (!handle_.done()) {
            handle_.resume();
        }
        if (handle_.promise().exception_) {
            std::rethrow_exception(handle_.promise().exception_);
        }
        return std::move(handle_.promise().result_);
    }
};
```
- Create awaitables:

```cpp
// Await next market data event
struct AwaitQuote {
    MarketDataFeed& feed_;

    bool await_ready() const noexcept { return feed_.has_pending(); }
    void await_suspend(std::coroutine_handle<> h) noexcept {
        feed_.register_continuation(h);
    }
    Quote await_resume() noexcept { return feed_.pop(); }
};

// Await order submission
struct AwaitOrderSubmit {
    OrderRouter& router_;
    Order order_;
    OrderResult result_;

    bool await_ready() const noexcept { return false; }
    void await_suspend(std::coroutine_handle<> h) noexcept {
        router_.submit_async(order_, [this, h](OrderResult r) {
            result_ = r;
            h.resume();
        });
    }
    OrderResult await_resume() noexcept { return result_; }
};
```
- Convert strategies:

```cpp
Task<void> mean_reversion_strategy(
    MarketDataFeed& feed,
    RiskManager& risk,
    OrderRouter& router
) {
    // Moving average state
    std::deque<double> prices;
    constexpr size_t WINDOW = 50;
    constexpr double Z_THRESHOLD = 2.0;

    while (true) {
        auto quote = co_await AwaitQuote{feed};

        prices.push_back(quote.mid_price());
        if (prices.size() > WINDOW) prices.pop_front();
        if (prices.size() < WINDOW) continue;

        // Calculate z-score
        double mean = std::accumulate(prices.begin(), prices.end(), 0.0) / WINDOW;
        double sq_sum = std::inner_product(prices.begin(), prices.end(),
                                           prices.begin(), 0.0);
        double stdev = std::sqrt(sq_sum / WINDOW - mean * mean);
        double z_score = (quote.mid_price() - mean) / stdev;

        if (std::abs(z_score) > Z_THRESHOLD) {
            Side side = z_score > 0 ? SELL : BUY;  // Mean reversion
            Order order{side, quote.symbol, 100, quote.mid_price()};
            if (risk.check(order)) {
                auto result = co_await AwaitOrderSubmit{router, order};
                // Log result...
            }
        }
    }
}
```
Checkpoint: Strategies are coroutines, code is cleaner, performance maintained
Phase 5: Low-Latency Polish (Weeks 9-10)
Goal: Eliminate remaining latency sources
1. **Core Pinning**:

```cpp
#include <pthread.h>
#include <sched.h>

void pin_to_core(int core_id) {
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(core_id, &cpuset);

    pthread_t thread = pthread_self();
    int result = pthread_setaffinity_np(thread, sizeof(cpuset), &cpuset);
    if (result != 0) {
        throw std::runtime_error("Failed to pin thread to core");
    }
}

// In thread startup:
void market_data_thread() {
    pin_to_core(0); // Pin to core 0
    // ... processing
}
```
2. **NUMA Awareness**:

```cpp
#include <numa.h>

void* numa_local_alloc(size_t size) {
    int node = numa_node_of_cpu(sched_getcpu());
    return numa_alloc_onnode(size, node);
}

// Allocate order book on local NUMA node
OrderBook* book = new (numa_local_alloc(sizeof(OrderBook))) OrderBook();
```
3. **Cache Prefetching**:

```cpp
// Prefetch next order book levels while processing current
void process_orders(OrderBook& book) {
    for (size_t i = 0; i < book.count; i += 8) {
        // Prefetch two cache lines ahead
        __builtin_prefetch(&book.prices[i + 16], 0, 3);
        // Process current batch
        process_batch(&book.prices[i], &book.quantities[i]);
    }
}
```

4. **Branch Elimination**:

```cpp
// Before: branchy code
int64_t best_price = (side == BUY) ? ask_prices_[0] : bid_prices_[0];

// After: branchless array index (side is 0 or 1)
int64_t prices[2] = {bid_prices_[0], ask_prices_[0]};
int64_t best_price = prices[side];
```
5. **Object Pool**:
```cpp
// Zero-allocation order creation
OrderPool pool(100000); // Pre-allocate 100k orders
Order* create_order(/* params */) {
Order* order = pool.acquire();
order->symbol = symbol;
order->side = side;
// ...
return order;
}
void release_order(Order* order) {
pool.release(order);
}
```
Checkpoint: P99 latency < 10 microseconds
Phase 6: Measurement & Tuning (Weeks 11-12)
Goal: Measure, profile, and tune
1. **Latency Histogram**:
```cpp
class LatencyTracker {
    static constexpr size_t BUCKETS = 1000;  // 1ns buckets; latencies >= 1us go to overflow_
    std::array<std::atomic<uint64_t>, BUCKETS> histogram_{};
    std::atomic<uint64_t> overflow_{0};
public:
    void record(uint64_t latency_ns) {
        if (latency_ns < BUCKETS) {
            histogram_[latency_ns].fetch_add(1, std::memory_order_relaxed);
        } else {
            overflow_.fetch_add(1, std::memory_order_relaxed);
        }
    }

    uint64_t percentile(double p) const {
        uint64_t total = 0;
        for (const auto& count : histogram_) {
            total += count.load(std::memory_order_relaxed);
        }
        total += overflow_.load(std::memory_order_relaxed);
        uint64_t target = static_cast<uint64_t>(total * p);
        uint64_t cumulative = 0;
        for (size_t i = 0; i < BUCKETS; ++i) {
            cumulative += histogram_[i].load(std::memory_order_relaxed);
            if (cumulative >= target) {
                return i;
            }
        }
        return BUCKETS;  // Overflow
    }
};
```
2. **Use `perf` for profiling**:
```bash
# Record CPU cycles and cache misses
perf record -e cycles,cache-misses ./hft_simulator

# Analyze hot spots
perf report

# Check for cache misses
perf stat -e L1-dcache-load-misses,LLC-load-misses ./hft_simulator
```
3. **Flame graphs**:
```bash
# Generate flame graph
perf record -g ./hft_simulator
perf script | stackcollapse-perf.pl | flamegraph.pl > flame.svg
```
4. **Tune based on data**:
   - If the cache-miss rate is high: improve data layout, add prefetching
   - If branch misses are high: eliminate branches with SIMD or branchless code
   - If lock contention shows up: move to lock-free structures or reduce sharing
Checkpoint: Meet all performance targets
Testing Strategy
Unit Tests
```cpp
// test_spsc_queue.cpp
TEST(SPSCQueue, BasicPushPop) {
    SPSCQueue<int, 16> queue;
    EXPECT_TRUE(queue.push(42));
    int value;
    EXPECT_TRUE(queue.pop(value));
    EXPECT_EQ(value, 42);
}

TEST(SPSCQueue, FullQueue) {
    SPSCQueue<int, 4> queue;  // 4 slots, but only 3 usable
    EXPECT_TRUE(queue.push(1));
    EXPECT_TRUE(queue.push(2));
    EXPECT_TRUE(queue.push(3));
    EXPECT_FALSE(queue.push(4));  // Full
}

TEST(SPSCQueue, ConcurrentAccess) {
    SPSCQueue<int, 1024> queue;
    std::atomic<int> sum{0};
    std::thread producer([&]() {
        for (int i = 0; i < 10000; ++i) {
            while (!queue.push(i)) {
                std::this_thread::yield();
            }
        }
    });
    std::thread consumer([&]() {
        for (int i = 0; i < 10000; ++i) {
            int value;
            while (!queue.pop(value)) {
                std::this_thread::yield();
            }
            sum += value;
        }
    });
    producer.join();
    consumer.join();
    EXPECT_EQ(sum.load(), 10000 * 9999 / 2);  // Sum of 0..9999
}
```
Integration Tests
```cpp
// test_end_to_end.cpp
TEST(HFTSimulator, EndToEndLatency) {
    HFTSimulator sim("test_config.yaml");
    sim.load_market_data("test_data.csv");
    LatencyTracker tracker;
    sim.on_quote_to_trade([&](const Quote& q, const Order& o) {
        auto latency = o.timestamp - q.timestamp;
        tracker.record(latency);
    });
    sim.run();
    // Verify latency targets (nanoseconds)
    EXPECT_LT(tracker.percentile(0.50), 5000);    // P50 < 5us
    EXPECT_LT(tracker.percentile(0.99), 10000);   // P99 < 10us
    EXPECT_LT(tracker.percentile(0.999), 50000);  // P99.9 < 50us
}

TEST(HFTSimulator, RiskLimitsEnforced) {
    HFTSimulator sim("test_config.yaml");
    sim.set_position_limit("AAPL", 1000);
    // Try to exceed the limit: 100 shares per order, limit 1000
    for (int i = 0; i < 20; ++i) {
        auto result = sim.submit_order(Order{BUY, "AAPL", 100, 100'00});
        if (i < 10) {
            EXPECT_TRUE(result.accepted);
        } else {
            EXPECT_FALSE(result.accepted);
            EXPECT_EQ(result.reject_reason, RejectReason::PositionLimit);
        }
    }
}
```
Performance Benchmarks
```cpp
// bench_latency.cpp
static void BM_OrderMatching(benchmark::State& state) {
    OrderBook book;
    fill_order_book(book, 256);  // 256 levels
    Order order{BUY, "AAPL", 100, 185'50};
    for (auto _ : state) {
        auto matches = book.match_order(order);
        benchmark::DoNotOptimize(matches);
    }
}
BENCHMARK(BM_OrderMatching);

static void BM_SPSCQueue(benchmark::State& state) {
    SPSCQueue<Quote, 65536> queue;
    Quote quote{};
    for (auto _ : state) {
        queue.push(quote);
        Quote out;
        queue.pop(out);
        benchmark::DoNotOptimize(out);
    }
}
BENCHMARK(BM_SPSCQueue);
```
Common Pitfalls
1. False Sharing
Problem: Two threads accessing different atomics on the same cache line
```cpp
// WRONG: head_ and tail_ on same cache line
struct BadQueue {
    std::atomic<size_t> head_;
    std::atomic<size_t> tail_;  // False sharing!
};

// RIGHT: Pad to separate cache lines
struct GoodQueue {
    alignas(64) std::atomic<size_t> head_;
    alignas(64) std::atomic<size_t> tail_;
};
```
Symptoms: High cache miss rate, poor scalability with threads
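Rather than hard-coding 64, a portable variant can derive the padding from `std::hardware_destructive_interference_size` (C++17, in `<new>`; 64 on most x86-64 toolchains) and verify the layout at compile time. A sketch, with `PaddedIndices`/`kCacheLine` as illustrative names:

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <new>

// Prefer the standard constant when the library provides it; fall back to
// the common x86-64 cache line size otherwise.
#ifdef __cpp_lib_hardware_interference_size
constexpr size_t kCacheLine = std::hardware_destructive_interference_size;
#else
constexpr size_t kCacheLine = 64;  // Fallback assumption
#endif

// Producer- and consumer-owned indices on separate cache lines, so stores
// by one thread never invalidate the line the other thread is reading.
struct PaddedIndices {
    alignas(kCacheLine) std::atomic<size_t> head_{0};
    alignas(kCacheLine) std::atomic<size_t> tail_{0};
};

// Compile-time check: the struct spans at least two cache lines
static_assert(sizeof(PaddedIndices) >= 2 * kCacheLine,
              "head_ and tail_ must not share a cache line");
```

The `static_assert` turns a subtle performance bug into a build failure if someone later removes the alignment.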
2. Memory Order Mistakes
Problem: Using relaxed ordering when synchronization needed
```cpp
// WRONG: No synchronization between push and pop
head_.store(next, std::memory_order_relaxed);      // Producer
if (t == head_.load(std::memory_order_relaxed))    // Consumer
// Data race! May read stale data

// RIGHT: Release-acquire pairing
head_.store(next, std::memory_order_release);      // Producer publishes
if (t == head_.load(std::memory_order_acquire))    // Consumer synchronizes
```
3. Allocation in Hot Path
Problem: Calling new/malloc in performance-critical code
```cpp
// WRONG: Allocates on every order
Order* create_order() {
    return new Order();  // ~1-10 microseconds!
}

// RIGHT: Use object pool
Order* create_order() {
    return order_pool_.acquire();  // ~10 nanoseconds
}
```
4. Branch Misprediction
Problem: Unpredictable branches in inner loops
```cpp
// WRONG: Hard-to-predict branch
for (auto& order : orders) {
    if (order.side == BUY) {  // Mispredicted ~50% of the time
        process_buy(order);
    } else {
        process_sell(order);
    }
}

// RIGHT: Separate loops or branchless code
for (auto& order : buy_orders) {
    process_buy(order);
}
for (auto& order : sell_orders) {
    process_sell(order);
}
```
5. System Call Overhead
Problem: Calling the kernel in hot path
```cpp
// WRONG: Clock query in the hot path (vDSO call, tens of ns or more)
auto now = std::chrono::system_clock::now();

// RIGHT: Use RDTSC for low overhead (x86)
inline uint64_t rdtsc() {
    uint32_t lo, hi;
    __asm__ volatile ("rdtsc" : "=a" (lo), "=d" (hi));
    return ((uint64_t)hi << 32) | lo;
}
```
6. Coroutine Heap Allocation
Problem: Coroutine frame allocated on heap by default
```cpp
// Problem: Each coroutine frame is heap-allocated by default
Task<void> strategy() {
    co_await next_tick();  // Frame allocation hits the heap
}

// Solution: Custom allocator in promise_type
struct promise_type {
    static void* operator new(size_t size) {
        return frame_pool.acquire(size);  // Pool allocation
    }
    static void operator delete(void* ptr, size_t size) {
        frame_pool.release(ptr, size);
    }
};
```
Extensions & Challenges
Challenge 1: Kernel Bypass with DPDK
Implement network I/O without kernel involvement:
```cpp
// DPDK provides direct NIC access, bypassing the kernel:
// ~100ns receive latency instead of ~10us through the socket stack
#include <rte_ethdev.h>

void dpdk_receive_loop() {
    struct rte_mbuf* bufs[32];
    while (running) {
        uint16_t nb_rx = rte_eth_rx_burst(port_id, 0, bufs, 32);
        for (int i = 0; i < nb_rx; ++i) {
            process_packet(rte_pktmbuf_mtod(bufs[i], uint8_t*));
            rte_pktmbuf_free(bufs[i]);
        }
    }
}
```
Challenge 2: Hardware Timestamping
Use NIC hardware timestamps for sub-microsecond accuracy:
```cpp
// Enable hardware timestamping on the NIC
struct hwtstamp_config cfg = {
    .tx_type = HWTSTAMP_TX_ON,
    .rx_filter = HWTSTAMP_FILTER_ALL,
};
ioctl(socket_fd, SIOCSHWTSTAMP, &cfg);

// Read hardware timestamp from packet (ts[2] holds the raw HW timestamp)
struct scm_timestamping* ts = (struct scm_timestamping*)CMSG_DATA(cmsg);
uint64_t hw_ns = ts->ts[2].tv_sec * 1000000000ull + ts->ts[2].tv_nsec;
```
Challenge 3: Order Book Reconstruction
Build order book from L3 market data (individual orders, not just levels):
```cpp
struct L3OrderBook {
    struct Order {
        uint64_t order_id;
        int64_t price;
        int32_t quantity;
        uint64_t timestamp;
    };

    // Map from order_id to order details
    std::unordered_map<uint64_t, Order> orders_;

    // Price levels with FIFO order queues: bids sorted high-to-low and
    // asks low-to-high, so begin() is always the best level on each side
    std::map<int64_t, std::deque<uint64_t>, std::greater<>> bid_levels_;
    std::map<int64_t, std::deque<uint64_t>> ask_levels_;

    void add_order(const Order& order, Side side);
    void modify_order(uint64_t order_id, int32_t new_quantity);
    void delete_order(uint64_t order_id);
};
```
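The add/delete bookkeeping can be sketched as follows. This is a self-contained illustration (the `L3Book` name and minimal `Entry` fields are assumptions, not the project's required interface), with the maps ordered so `begin()` is always the best price on each side:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <deque>
#include <functional>
#include <map>
#include <unordered_map>

enum class Side { Buy, Sell };

// Minimal L3 book: every resting order is tracked individually, and each
// price level keeps a FIFO queue of order ids (price-time priority).
class L3Book {
    struct Entry { int64_t price; int32_t qty; Side side; };
    std::unordered_map<uint64_t, Entry> orders_;
    std::map<int64_t, std::deque<uint64_t>, std::greater<>> bids_;  // best = highest
    std::map<int64_t, std::deque<uint64_t>> asks_;                  // best = lowest
public:
    void add(uint64_t id, Side side, int64_t price, int32_t qty) {
        orders_[id] = {price, qty, side};
        if (side == Side::Buy) bids_[price].push_back(id);
        else                   asks_[price].push_back(id);
    }

    void erase(uint64_t id) {
        auto it = orders_.find(id);
        if (it == orders_.end()) return;
        auto remove_from = [&](auto& levels) {
            auto lvl = levels.find(it->second.price);
            auto& q = lvl->second;
            q.erase(std::find(q.begin(), q.end(), id));
            if (q.empty()) levels.erase(lvl);  // Drop the emptied price level
        };
        if (it->second.side == Side::Buy) remove_from(bids_);
        else                              remove_from(asks_);
        orders_.erase(it);
    }

    bool best_bid(int64_t& px) const {
        if (bids_.empty()) return false;
        px = bids_.begin()->first;
        return true;
    }
};
```

The linear `std::find` inside a level is acceptable here because deletes usually hit the front of the queue; a production book would store each order's queue position to make deletion O(1).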
Challenge 4: Market Maker Strategy
Implement a market-making strategy that quotes both sides:
```cpp
Task<void> market_maker_strategy(
    MarketDataFeed& feed,
    RiskManager& risk,
    OrderRouter& router
) {
    constexpr double SPREAD = 0.001;  // 10 basis points
    constexpr int32_t SIZE = 100;
    std::optional<uint64_t> bid_order_id, ask_order_id;

    while (true) {
        auto quote = co_await NextTick{feed};

        // Cancel existing orders
        if (bid_order_id) co_await router.cancel(*bid_order_id);
        if (ask_order_id) co_await router.cancel(*ask_order_id);

        // Calculate new prices around the mid
        double mid = (quote.bid + quote.ask) / 2.0;
        int64_t bid_price = static_cast<int64_t>(mid * (1 - SPREAD / 2));
        int64_t ask_price = static_cast<int64_t>(mid * (1 + SPREAD / 2));

        // Quote both sides while within the position limit
        if (risk.check_position(quote.symbol) < POSITION_LIMIT) {
            auto bid_result = co_await router.submit(
                Order{BUY, quote.symbol, SIZE, bid_price}
            );
            bid_order_id = bid_result.order_id;
            auto ask_result = co_await router.submit(
                Order{SELL, quote.symbol, SIZE, ask_price}
            );
            ask_order_id = ask_result.order_id;
        }
    }
}
```
Challenge 5: Distributed HFT (Multi-Machine)
Scale across multiple machines with RDMA:
```
Machine A (Market Data)          Machine B (Strategy)
        |                                |
        |     RDMA write (no CPU)        |
        +------------------------------->|
                ~1 us latency
```
Resources
Primary Reading
| Topic | Book/Resource | Chapter/Section |
|---|---|---|
| Lock-free programming | “C++ Concurrency in Action” by Anthony Williams | Chapters 5-7 |
| SIMD programming | “Intel Intrinsics Guide” | AVX2 reference |
| Coroutines | “C++ Concurrency in Action” 2nd Ed | Chapter 13 |
| Market microstructure | “Trading and Exchanges” by Larry Harris | Chapters 1-6 |
| Low-latency design | “Trading and Exchanges” by Larry Harris | Chapter 16 |
| Memory model | “C++ Concurrency in Action” | Chapter 5 |
CppCon Talks (Essential Viewing)
| Talk | Speaker | Year | Topic |
|---|---|---|---|
| “Designing Low-Latency Systems” | Carl Cook | 2017 | HFT architecture |
| “When a Microsecond Is an Eternity” | Carl Cook | 2019 | Low-latency patterns |
| “Lock-Free Programming” | Herb Sutter | 2014 | Lock-free fundamentals |
| “C++ Atomics, From Basic to Advanced” | Fedor Pikus | 2017 | Memory model deep dive |
| “SIMD: From Zero to Hero” | Victor Ciura | 2018 | SIMD optimization |
Tools
| Tool | Purpose |
|---|---|
| `perf` | Linux profiling (cycles, cache misses) |
| `flamegraph` | Visualize CPU time distribution |
| `valgrind --tool=cachegrind` | Cache simulation |
| Google Benchmark | Microbenchmarking |
| Catch2/GoogleTest | Unit testing |
| `rdtsc` | Nanosecond-precision timing |
Online Resources
- Intel Intrinsics Guide: https://software.intel.com/sites/landingpage/IntrinsicsGuide/
- Preshing on Programming: https://preshing.com/archives/
- Lock-Free Programming: https://www.1024cores.net/
- LMAX Disruptor Paper: https://lmax-exchange.github.io/disruptor/
Self-Assessment Checklist
Before considering this project complete, verify you can:
Architecture & Design
- Draw the complete system architecture from memory
- Explain why each component uses its specific concurrency primitive
- Justify the thread/core assignment strategy
- Describe the data flow from market data to order submission
Lock-Free Programming
- Implement SPSC queue from scratch
- Explain acquire-release semantics and when to use them
- Identify and fix false sharing issues
- Explain why the queue is wait-free for single producer/consumer
SIMD Optimization
- Convert array-of-structures to structure-of-arrays
- Write SIMD price comparison code
- Explain when SIMD helps and when it doesn’t
- Measure SIMD vs scalar speedup
Coroutines
- Implement a basic `Task` coroutine type
- Create custom awaitables for market data and order submission
- Explain coroutine frame allocation and how to optimize it
- Convert callback-based code to coroutine-based
Low-Latency Patterns
- Implement object pool for zero-allocation
- Apply core pinning and measure impact
- Identify and eliminate branches in hot paths
- Use cache prefetching effectively
Performance Measurement
- Measure latency at P50/P99/P99.9
- Use perf to identify hot spots
- Measure cache miss rates
- Create latency histograms
Interview Readiness
- Explain HFT system architecture in 2 minutes
- Answer questions about lock-free vs wait-free
- Discuss tradeoffs in low-latency design
- Explain why C++ (not Java/Python) for HFT
Submission/Completion Criteria
Minimum Viable Product (8 weeks)
To consider Phase 1-4 complete:
- Functional Requirements:
- System reads market data from file/generator
- Order book updates correctly on quotes
- At least one strategy generates signals
- Risk manager enforces position limits
- Orders submitted to simulated exchange
- Performance Requirements:
- Quote-to-trade latency P99 < 50 microseconds
- Throughput > 100,000 quotes/second
- Zero allocations in critical path (verified)
- Code Quality:
- Unit tests for all core components
- Integration tests for end-to-end flow
- Benchmarks for latency and throughput
- Documentation of architecture
Full Completion (12 weeks)
To consider the project fully complete:
- All MVP requirements plus:
- Quote-to-trade latency P99 < 10 microseconds
- Throughput > 1 million quotes/second
- Multiple strategies running concurrently
- Core pinning and NUMA awareness implemented
- Advanced Features (at least 2):
- SIMD order matching
- Coroutine-based strategies
- Latency histogram with percentile reporting
- Flame graph analysis completed
- Documentation:
- Architecture design document
- Performance optimization log
- Lessons learned summary
Final Thoughts
This capstone project integrates everything from the C++ Concurrency learning path:
- Projects 1-2: Threading fundamentals, futures/promises
- Projects 3-4: Thread pools, custom synchronization primitives
- Projects 5-6: Lock-free data structures
- Projects 7-8: Parallel algorithms
- Projects 9-11: Coroutines
- Projects 12-14: SIMD
- Projects 15-16: Actor model, distributed systems
Building an HFT simulator teaches you that performance engineering is about understanding your hardware: CPU caches, memory bandwidth, branch prediction, and SIMD lanes. It is about measuring obsessively and optimizing surgically.
Even if you never work in finance, these skills transfer directly to:
- Game engines (physics, rendering)
- Database engines (query execution)
- Network infrastructure (packet processing)
- Audio/video processing (real-time)
- Any system where latency matters
The journey from “working” to “fast” to “really fast” is where you become an expert.
Previous Project: P16 - Distributed Task Scheduler
Back to: C++ Concurrency Learning Guide