Project 14: Real-Time Game Physics with Parallelism

“The real-time constraint is not a performance target - it is a hard correctness requirement. A physics engine that misses frame deadlines delivers broken gameplay, no matter how accurate its simulation.” - Erin Catto, Creator of Box2D


Quick Reference

Attribute        | Value
-----------------+--------------------------------------------------
Language         | C++ (alt: Rust, C)
Difficulty       | Expert
Time             | 4-6 weeks
Prerequisites    | Projects 7-8 (Parallel Algorithms), 12-13 (SIMD)
Coolness Level   | 5: Pure Magic
Portfolio Value  | Resume Gold - demonstrates systems mastery

Learning Objectives

By completing this project, you will:

  1. Master real-time systems constraints: Understand frame budgets, deadline-driven design, and how to profile and optimize for consistent frame times

  2. Implement parallel broad-phase collision detection: Use spatial data structures (spatial hashing, grid-based partitioning) with parallel algorithms to find potential collision pairs

  3. Apply SIMD to narrow-phase physics: Vectorize intersection tests and physics integration using std::simd or platform intrinsics

  4. Design cache-friendly data layouts: Transform Array-of-Structures (AoS) to Structure-of-Arrays (SoA) for optimal SIMD and cache performance

  5. Integrate multiple parallel techniques: Combine thread pools, parallel algorithms, SIMD, and lock-free communication in a single coherent system

  6. Handle boundary synchronization: Solve the challenge of particles near region boundaries without introducing races or bottlenecks

  7. Profile and optimize systematically: Use profilers and benchmarks to identify bottlenecks and verify improvements


Theoretical Foundation

Part 1: The Physics Simulation Pipeline

Every real-time physics engine follows a similar pipeline, repeated every frame:

GAME PHYSICS PIPELINE (Per Frame)
====================================================================

Frame Start (t = 0ms)
        |
        v
+------------------------------------------------------------------+
|                    BROAD-PHASE COLLISION                          |
|                                                                   |
|  Goal: Find POTENTIAL collision pairs from N objects              |
|  Complexity target: O(N) or O(N log N), NOT O(N^2)               |
|                                                                   |
|  Techniques:                                                      |
|  - Spatial Hashing: Hash position to grid cell                    |
|  - Sweep and Prune: Sort AABBs along one axis                     |
|  - BVH: Hierarchical bounding volumes                             |
|                                                                   |
|  Output: List of (object_a, object_b) pairs that MIGHT collide   |
+------------------------------------------------------------------+
        |
        v (2-3ms elapsed)
+------------------------------------------------------------------+
|                    NARROW-PHASE COLLISION                         |
|                                                                   |
|  Goal: Exact collision detection for candidate pairs              |
|  Per-pair tests are expensive but count is limited                |
|                                                                   |
|  Techniques:                                                      |
|  - GJK/EPA: General convex shapes                                 |
|  - SAT: Separating Axis Theorem for polygons                      |
|  - Sphere-Sphere: sqrt-free distance test                         |
|  - SIMD: Process 4/8 pairs simultaneously                         |
|                                                                   |
|  Output: Contact points, normals, penetration depths              |
+------------------------------------------------------------------+
        |
        v (5-6ms elapsed)
+------------------------------------------------------------------+
|                  CONSTRAINT RESOLUTION                            |
|                                                                   |
|  Goal: Resolve overlaps and apply joint constraints               |
|                                                                   |
|  Techniques:                                                      |
|  - Sequential Impulses (SI): Iterative solver                     |
|  - Projected Gauss-Seidel (PGS): Matrix-free iteration            |
|  - Position-Based Dynamics (PBD): Direct position correction      |
|                                                                   |
|  Output: Velocity changes to separate colliding objects           |
+------------------------------------------------------------------+
        |
        v (7-8ms elapsed)
+------------------------------------------------------------------+
|                      INTEGRATION                                  |
|                                                                   |
|  Goal: Update positions and velocities                            |
|                                                                   |
|  Techniques:                                                      |
|  - Euler: x += v*dt, v += a*dt (simple, less stable)              |
|  - Verlet: x_new = 2*x - x_old + a*dt^2 (stable, popular)         |
|  - RK4: 4th order Runge-Kutta (accurate, expensive)               |
|  - SIMD: Process 8 particles per instruction                      |
|                                                                   |
|  Output: New positions and velocities for all objects             |
+------------------------------------------------------------------+
        |
        v (8-9ms elapsed)
+------------------------------------------------------------------+
|                      RENDERING SYNC                               |
|                                                                   |
|  Goal: Prepare physics state for renderer                         |
|  Copy/interpolate positions to render thread                      |
|  Remaining time: VSync wait or additional work                    |
+------------------------------------------------------------------+
        |
        v
Frame End (t = 16.67ms for 60 FPS)

FRAME BUDGET BREAKDOWN (60 FPS = 16.67ms total):
================================================
Broad-phase:      2.1ms (12.6%)
Narrow-phase:     3.8ms (22.8%)
Constraints:      1.5ms  (9.0%)
Integration:      0.8ms  (4.8%)
Rendering prep:   0.5ms  (3.0%)
---------------------------------
Total physics:    8.7ms (52.2%)
Remaining:        7.97ms (for rendering, audio, etc.)

Book Reference: “Game Physics Engine Development” by Ian Millington, Chapters 7-9 cover the complete physics pipeline.


Part 2: Spatial Partitioning for Broad-Phase

The naive O(N^2) approach of checking every pair is unacceptable for large object counts. Spatial partitioning reduces this dramatically:

SPATIAL HASHING EXPLAINED
====================================================================

CONCEPT: Divide world into grid cells, hash position to cell ID

World Space (continuous):              Grid Space (discrete):
+---------------------------+          +---+---+---+---+---+
|  .  o              o      |          | 0 | 1 | 2 | 3 | 4 |
|       .      .            |          +---+---+---+---+---+
|    o      .          o    |   --->   | 5 | 6 | 7 | 8 | 9 |
|  .           o    .       |          +---+---+---+---+---+
|      o  .      o          |          |10 |11 |12 |13 |14 |
+---------------------------+          +---+---+---+---+---+

HASH FUNCTION:
cell_x = floor(pos_x / cell_size)
cell_y = floor(pos_y / cell_size)
cell_id = cell_y * grid_width + cell_x

Or for infinite worlds (no grid bounds):
cell_id = hash(cell_x, cell_y)  // e.g., (cell_x * 73856093) ^ (cell_y * 19349663)


DATA STRUCTURE:
====================================================================

Option 1: HashMap<CellID, Vec<ObjectID>>
- Pros: Sparse, handles infinite worlds
- Cons: Hash overhead, cache-unfriendly iteration

Option 2: Dense Grid Array
- Pros: Cache-friendly, simple indexing
- Cons: Memory proportional to world size

Option 3: Compact Hash Table (Robin Hood)
- Pros: Balance of sparse storage and cache locality
- Best choice for most games
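
For reference, a minimal sketch of Option 2's indexing, reusing the row-major
cell_id formula above (the DenseGrid name and bounded-world fields are
illustrative, not part of the project API):

#include <cstdint>
#include <vector>

struct DenseGrid {
    int grid_width = 0;
    int grid_height = 0;
    float cell_size = 1.0f;
    std::vector<std::vector<uint32_t>> cells;  // grid_width * grid_height entries

    // Assumes 0 <= x < grid_width*cell_size and 0 <= y < grid_height*cell_size
    std::vector<uint32_t>& cell_at(float x, float y) {
        int cx = static_cast<int>(x / cell_size);
        int cy = static_cast<int>(y / cell_size);
        return cells[cy * grid_width + cx];  // cell_id = cell_y * grid_width + cell_x
    }
};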


QUERYING NEIGHBORS:
====================================================================

For object at position (px, py) with radius r:

min_cell_x = floor((px - r) / cell_size)
max_cell_x = floor((px + r) / cell_size)
min_cell_y = floor((py - r) / cell_size)
max_cell_y = floor((py + r) / cell_size)

for cy in min_cell_y..=max_cell_y:
    for cx in min_cell_x..=max_cell_x:
        cell_id = hash(cx, cy)
        for other_object in grid[cell_id]:
            if object.id < other_object.id:  // Avoid duplicate pairs
                candidate_pairs.push((object.id, other_object.id))


CHOOSING CELL SIZE:
====================================================================

                Too Small                     Optimal                       Too Large
                (cell < radius)               (cell ~ 2*radius)             (cell >> radius)

Objects per     Few objects per cell          ~4-16 objects/cell            Many objects/cell
cell:           Many cells to check           Balanced                      Few cells, many pairs

Overhead:       High iteration count          Optimal                       O(N^2) within cells

Rule of thumb: cell_size ~ 2 * average_object_radius (i.e. one average diameter)


PARALLEL SPATIAL HASH INSERT/QUERY
====================================================================

CHALLENGE: Multiple threads inserting to same cell = data race!

SOLUTION 1: Per-Thread Grid (Recommended for insert-heavy)
==========================================================

Thread 0 Grid:     Thread 1 Grid:     Thread 2 Grid:
+---+---+---+      +---+---+---+      +---+---+---+
| a |   | b |      |   | c |   |      | d |   | e |
+---+---+---+      +---+---+---+      +---+---+---+
|   | f |   |      | g |   | h |      |   | i |   |
+---+---+---+      +---+---+---+      +---+---+---+

After parallel insert, merge grids:
+---+---+---+
|a,d| c |b,e|
+---+---+---+
| g |f,i| h |
+---+---+---+

Merge step: O(cells) work, parallelizable
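
A minimal sketch of that merge step, assuming each per-thread grid uses the
HashMap<CellID, Vec<ObjectID>> layout from Option 1 above (names are illustrative):

#include <cstdint>
#include <unordered_map>
#include <vector>

using CellMap = std::unordered_map<uint64_t, std::vector<uint32_t>>;

// Append every per-thread cell list onto the matching global cell list
void merge_grids(const std::vector<CellMap>& per_thread, CellMap& global) {
    for (const CellMap& grid : per_thread) {
        for (const auto& [cell_id, ids] : grid) {
            auto& dst = global[cell_id];
            dst.insert(dst.end(), ids.begin(), ids.end());
        }
    }
}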


SOLUTION 2: Atomic Cell Lists (Better for query-heavy)
======================================================

struct Cell {
    std::atomic<ObjectNode*> head;  // Lock-free linked list
};

void insert(Cell& cell, ObjectID id) {
    auto* node = new ObjectNode{id, nullptr};
    node->next = cell.head.load(std::memory_order_relaxed);
    while (!cell.head.compare_exchange_weak(
        node->next, node,
        std::memory_order_release,
        std::memory_order_relaxed)) {
        // Retry on contention
    }
}
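
Reading a cell back is simple once all inserts have finished (build and query are
separated by a barrier in this design); a sketch, assuming ObjectNode has the id
and next fields used by insert() above:

template <typename Fn>
void for_each_in_cell(const Cell& cell, Fn&& fn) {
    // The acquire load pairs with the release CAS in insert(), so every
    // node published before the barrier is fully visible here.
    for (ObjectNode* node = cell.head.load(std::memory_order_acquire);
         node != nullptr; node = node->next) {
        fn(node->id);
    }
}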


SOLUTION 3: Partitioned Cells (Best for balanced workloads)
===========================================================

Assign cell ranges to threads:
- Thread 0 owns cells [0, 99]
- Thread 1 owns cells [100, 199]
- Thread 2 owns cells [200, 299]
- ...

Objects near boundaries:
+---+---+---+---+---+---+
|   |   | A |   |   |   |   <-- A is near T0/T1 boundary
+---+---+---+---+---+---+
        T0    T1

Object A must be:
1. Inserted into T0's grid (primary)
2. Also inserted into T1's grid (ghost copy)
3. Collision pairs involving ghosts handled carefully

This is the "domain decomposition" approach from HPC.

Book Reference: “Real-Time Collision Detection” by Christer Ericson, Chapter 7 covers spatial partitioning in depth.


Part 3: Structure of Arrays (SoA) for SIMD

The data layout dramatically affects SIMD efficiency:

ARRAY OF STRUCTURES (AoS) - Traditional OOP
====================================================================

struct Particle {
    float x, y, z;      // Position
    float vx, vy, vz;   // Velocity
    float ax, ay, az;   // Acceleration (force / mass)
    float mass;
    uint32_t flags;
};

std::vector<Particle> particles;  // [P0, P1, P2, P3, ...]

Memory layout:
+----+----+----+----+----+----+----+----+----+----+----+
| x0 | y0 | z0 |vx0 |vy0 |vz0 |ax0 |ay0 |az0 | m0 |fl0 |
+----+----+----+----+----+----+----+----+----+----+----+
| x1 | y1 | z1 |vx1 |vy1 |vz1 |ax1 |ay1 |az1 | m1 |fl1 |
+----+----+----+----+----+----+----+----+----+----+----+
| x2 | y2 | z2 | ...
+----+----+----+

To update all X positions: x += vx * dt
- Load x0, skip 10 floats, load x1, skip 10 floats, load x2...
- Cache lines wasted loading unused data (vy, vz, ax, ...)
- SIMD impractical: the data is non-contiguous, so loads need slow gather instructions


STRUCTURE OF ARRAYS (SoA) - SIMD-Friendly
====================================================================

struct ParticleSystem {
    std::vector<float> x;   // All X positions contiguous
    std::vector<float> y;   // All Y positions contiguous
    std::vector<float> z;   // All Z positions contiguous
    std::vector<float> vx;  // All X velocities contiguous
    std::vector<float> vy;
    std::vector<float> vz;
    std::vector<float> ax;
    std::vector<float> ay;
    std::vector<float> az;
    std::vector<float> mass;
    std::vector<uint32_t> flags;
    size_t count;
};

Memory layout:
X array:  +----+----+----+----+----+----+----+----+
          | x0 | x1 | x2 | x3 | x4 | x5 | x6 | x7 | ...
          +----+----+----+----+----+----+----+----+

VX array: +----+----+----+----+----+----+----+----+
          |vx0 |vx1 |vx2 |vx3 |vx4 |vx5 |vx6 |vx7 | ...
          +----+----+----+----+----+----+----+----+

To update all X positions: x += vx * dt
- Load 8 consecutive X values into SIMD register
- Load 8 consecutive VX values into SIMD register
- Multiply, add, store - 8 particles at once!


SIMD INTEGRATION (Verlet Method)
====================================================================

Physics equation: x_new = x + v*dt + 0.5*a*dt^2

void integrate_soa(ParticleSystem& ps, float dt) {
    using simd_t = std::experimental::simd<float>;
    constexpr size_t lanes = simd_t::size();  // 4 (SSE), 8 (AVX), etc.

    simd_t dt_vec = dt;
    simd_t half_dt2 = 0.5f * dt * dt;

    // Process 8 particles at a time (AVX)
    for (size_t i = 0; i + lanes <= ps.count; i += lanes) {
        // Load 8 X positions
        simd_t x(ps.x.data() + i, std::experimental::element_aligned);
        simd_t vx(ps.vx.data() + i, std::experimental::element_aligned);
        simd_t ax(ps.ax.data() + i, std::experimental::element_aligned);

        // x_new = x + vx*dt + 0.5*ax*dt^2
        x += vx * dt_vec + ax * half_dt2;

        // v_new = vx + ax*dt
        vx += ax * dt_vec;

        // Store back
        x.copy_to(ps.x.data() + i, std::experimental::element_aligned);
        vx.copy_to(ps.vx.data() + i, std::experimental::element_aligned);
    }

    // Similar for Y, Z components...

    // Scalar cleanup for remaining particles
    for (size_t i = (ps.count / lanes) * lanes; i < ps.count; ++i) {
        ps.x[i] += ps.vx[i] * dt + 0.5f * ps.ax[i] * dt * dt;
        ps.vx[i] += ps.ax[i] * dt;
    }
}


PERFORMANCE COMPARISON
====================================================================

Operation: Update 100,000 particle positions

AoS (scalar):
  - Cache misses: ~100,000 (each particle access spans cache line)
  - Instructions: 100,000 * 6 = 600,000 scalar ops
  - Time: 2.1ms

SoA (SIMD, AVX):
  - Cache misses: ~12,500 (8 particles per cache line, contiguous)
  - Instructions: 12,500 * 6 = 75,000 SIMD ops (8x parallelism)
  - Time: 0.12ms

Speedup: 17.5x (8x from SIMD multiplied by ~2x from cache efficiency)

Book Reference: “Optimizing Software in C++” by Agner Fog, Sections 7-9 cover data layout and SIMD optimization.


Part 4: Parallel Collision Detection Strategy

PARALLEL BROAD-PHASE WITH DOMAIN DECOMPOSITION
====================================================================

World divided into regions, each owned by one thread:

+---------------+---------------+---------------+---------------+
|   Region 0    |   Region 1    |   Region 2    |   Region 3    |
|   (Thread 0)  |   (Thread 1)  |   (Thread 2)  |   (Thread 3)  |
|               |               |               |               |
|  o  o    o    |    o      o   |  o   o    o   |    o    o     |
|      o      o |  o     o      |       o       | o       o  o  |
|   o      o    |     o    o    |  o   o     o  |     o         |
|               |               |               |               |
+---------------+---------------+---------------+---------------+

STEP 1: Classify particles by region (parallel)
================================================
std::for_each(std::execution::par,
    particle_indices.begin(), particle_indices.end(),
    [&](size_t i) {
        int region = compute_region(ps.x[i], ps.y[i]);
        region_lists[region].push_back_atomic(i);  // conceptual lock-free append (e.g. atomic index into a preallocated array)
    });

STEP 2: Process each region independently (parallel)
====================================================
std::for_each(std::execution::par,
    regions.begin(), regions.end(),
    [&](Region& region) {
        // Build spatial hash for this region
        SpatialHash local_hash(cell_size);
        for (size_t i : region.particle_indices) {
            local_hash.insert(i, ps.x[i], ps.y[i]);
        }

        // Find collision pairs within region
        for (size_t i : region.particle_indices) {
            auto neighbors = local_hash.query(ps.x[i], ps.y[i], ps.radius[i]);
            for (size_t j : neighbors) {
                if (i < j) {
                    region.collision_pairs.push_back({i, j});
                }
            }
        }
    });


BOUNDARY HANDLING
====================================================================

Problem: Particles near region boundaries may collide with particles in
         adjacent regions, but each thread only sees its own region.

+---------------+---------------+
|   Region 0    |   Region 1    |
|               |               |
|            A  |  B            |   <-- A and B might collide!
|               |               |
+---------------+---------------+

SOLUTION: Boundary zones processed separately

+---------------+---------------+
|   Region 0    |   Region 1    |
|   Interior    |   Interior    |
|               |               |
|   (parallel)  |   (parallel)  |
|       +---+---+---+           |
|       |   | B |   |           |
|       | A |   |   |           |
|       +---+---+---+           |
|         Boundary Zone         |
|        (serialized or         |
|         with locks)           |
+---------------+---------------+

IMPLEMENTATION:
1. Classify particles as Interior (> collision_radius from boundary) or Boundary
2. Process Interior particles in parallel (no coordination needed)
3. Process Boundary particles with synchronization:
   - Option A: Sequential processing (simple, may be bottleneck)
   - Option B: Fine-grained locks on boundary cells
   - Option C: Duplicate boundary particles in adjacent regions (ghost particles)

GHOST PARTICLE APPROACH (Popular in HPC):
=========================================

Region 0 has:
- 1000 interior particles (owned)
- 50 ghost particles copied from Region 1

Region 1 has:
- 1000 interior particles (owned)
- 50 ghost particles copied from Region 0

Collision pairs involving ghosts:
- (owned, ghost) -> Belongs to region with owned particle
- (ghost, ghost) -> Ignored (will be found by some other region)

This requires a ghost exchange step between regions each frame.
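
A sketch of the interior/boundary classification that drives the ghost exchange
(the region bounds and collision_radius parameters are illustrative):

// A particle is a boundary particle if it sits within one collision radius
// of a region edge; boundary particles are copied to the neighbor as ghosts.
bool is_boundary_particle(float x, float region_min_x, float region_max_x,
                          float collision_radius) {
    return (x - region_min_x) < collision_radius ||
           (region_max_x - x) < collision_radius;
}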

Part 5: Frame Budget Management

REAL-TIME CONSTRAINT ANALYSIS
====================================================================

Target: 60 FPS = 16.67ms per frame
Physics budget: 50% of the frame = 8.33ms
Effective budget after a 20% safety margin: 6.67ms (headroom for worst-case spikes)

FRAME TIMING DIAGRAM:
=====================

Frame N Timeline:
|--------------------16.67ms--------------------|
|--Physics 8ms--|--Render 6ms--|--VSync 2ms--|
                                              ^
                                              Frame deadline

If physics takes 10ms:
|------Physics 10ms-------|--Render 6ms--|
                                         X 16ms of work - no VSync slack left
                          Any extra variance blows the 16.67ms deadline; stutter visible to player


MAINTAINING FRAME BUDGET:
====================================================================

1. FIXED TIMESTEP SIMULATION
   -------------------------
   Never let simulation time vary - always step by fixed dt (e.g., 1/60 sec)

   void game_loop() {
       double accumulator = 0.0;
       const double fixed_dt = 1.0 / 60.0;

       while (running) {
           double frame_time = get_elapsed_time();
           accumulator += frame_time;

           // May run multiple physics steps per frame if behind
           while (accumulator >= fixed_dt) {
               physics_step(fixed_dt);
               accumulator -= fixed_dt;
           }

           // Interpolate for rendering (smooth visuals)
           double alpha = accumulator / fixed_dt;
           render(interpolate_state(prev_state, curr_state, alpha));
       }
   }

2. LEVEL OF DETAIL (LOD) FOR PHYSICS
   ---------------------------------
   Reduce simulation quality when frame budget is tight

   if (frame_budget_remaining < 2ms) {
       // Reduce constraint solver iterations
       solver_iterations = 2;  // Instead of 8
   } else {
       solver_iterations = 8;
   }

   if (object_count > 50000) {
       // Use cheaper collision for distant objects
       distant_objects.use_sphere_bounds();  // Instead of mesh
   }

3. TIME SLICING
   ------------
   Spread expensive work across multiple frames

   Frame 0: Process particles 0-25000
   Frame 1: Process particles 25001-50000
   Frame 2: Process particles 50001-75000
   Frame 3: Process particles 75001-100000
   Frame 4: Process particles 0-25000 (repeat)

   Works for sleeping/stable objects, not active collisions
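
A sketch of that slicing pattern, assuming the deferred work can be expressed per
particle and ParticleSystem is the SoA container from Part 3 (a slice_count of 4
matches the 25,000-particle slices above):

#include <algorithm>
#include <cstddef>

// Run one slice of a non-critical maintenance pass (e.g. sleep detection) per
// frame; after slice_count frames every particle has been visited once.
void update_slice(ParticleSystem& ps, std::size_t frame_index, std::size_t slice_count) {
    std::size_t per_slice = (ps.count + slice_count - 1) / slice_count;
    std::size_t begin = (frame_index % slice_count) * per_slice;
    std::size_t end = std::min(begin + per_slice, ps.count);
    for (std::size_t i = begin; i < end; ++i) {
        // expensive but deferrable per-particle work goes here
    }
}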


PROFILING METRICS TO TRACK:
====================================================================

struct FrameMetrics {
    float broad_phase_ms;
    float narrow_phase_ms;
    float constraints_ms;
    float integration_ms;
    float total_physics_ms;

    size_t collision_pairs_tested;
    size_t actual_collisions;
    size_t active_particles;

    // Thread utilization
    float thread_utilization[MAX_THREADS];

    // SIMD efficiency
    size_t simd_operations;
    size_t scalar_fallback_operations;
};

void log_frame_metrics(const FrameMetrics& m) {
    printf("Frame: %.2fms (budget: %.2fms) %s\n",
           m.total_physics_ms, 8.33,
           m.total_physics_ms > 8.33 ? "OVER!" : "OK");
    printf("  Broad:  %.2fms (%zu pairs)\n", m.broad_phase_ms, m.collision_pairs_tested);
    printf("  Narrow: %.2fms (%zu collisions)\n", m.narrow_phase_ms, m.actual_collisions);
    printf("  Solve:  %.2fms\n", m.constraints_ms);
    printf("  Integrate: %.2fms (%zu particles)\n", m.integration_ms, m.active_particles);
}

Project Specification

What You Will Build

A high-performance particle physics simulation that:

  • Simulates 100,000+ particles at 60 FPS
  • Uses parallel spatial hashing for broad-phase collision detection
  • Implements SIMD-accelerated Verlet integration
  • Renders the simulation in a window (optional - can be headless with metrics output)
  • Provides detailed performance profiling per frame

Functional Requirements

  1. Particle System:
    • Support 100,000+ particles with position, velocity, acceleration, radius
    • SoA data layout for SIMD-friendly access
    • Configurable gravity, damping, and collision restitution
  2. Collision Detection:
    • Spatial hash-based broad-phase with O(N) average complexity
    • Parallel broad-phase using std::execution::par or thread pool
    • Sphere-sphere narrow-phase (SIMD accelerated)
    • Collision response with momentum conservation
  3. Physics Integration:
    • Verlet integration with SIMD (process 8 particles per instruction)
    • Fixed timestep simulation (1/60 second)
    • Boundary collision with walls
  4. Performance:
    • Target: 60 FPS with 100,000 particles on 8-core CPU
    • Parallel efficiency: > 60% on 8 threads
    • SIMD speedup: > 4x compared to scalar baseline

Real World Outcome

When you complete this project, running your simulation will produce:

$ ./physics_demo --particles 100000 --threads 8

Physics Simulation Configuration:
  Particles:         100,000
  Cell size:         2.0 (spatial hash)
  World bounds:      [0, 1000] x [0, 1000]
  Integration:       Verlet (SIMD)
  SIMD width:        8 lanes (AVX)
  Threads:           8
  Target FPS:        60 (16.67ms frame budget)

Initializing particle system... done (45ms)
Building spatial hash... done (3ms)

Running simulation (press Ctrl+C to stop)...

Frame breakdown (target: 16.67ms for 60fps):
+------------------------+--------+--------+
| Phase                  | Time   | Budget |
+------------------------+--------+--------+
| Broad-phase collision  | 2.1ms  | 12.6%  |
| Narrow-phase collision | 3.8ms  | 22.8%  |
| Constraint resolution  | 1.5ms  |  9.0%  |
| Integration            | 0.8ms  |  4.8%  |
| Rendering sync         | 0.5ms  |  3.0%  |
+------------------------+--------+--------+
| Total physics          | 8.7ms  | 52.2%  |
| Frame overhead         | 7.97ms | 47.8%  |
+------------------------+--------+--------+

Performance scaling (100,000 particles):
+----------+------------+--------+------------+
| Threads  | Frame time | FPS    | Efficiency |
+----------+------------+--------+------------+
| 1        | 45.2ms     | 22 fps | 100%       |
| 2        | 24.1ms     | 41 fps | 94%        |
| 4        | 12.8ms     | 78 fps | 88%        |
| 8        | 8.2ms      | 122 fps| 69%        |
+----------+------------+--------+------------+

Collision statistics:
  Broad-phase pairs tested:  245,892
  Actual collisions found:   12,456
  Collision detection ratio: 5.1% (good spatial hash efficiency)

SIMD efficiency:
  SIMD operations:      12,500,000
  Scalar fallbacks:     156 (0.001%)
  Effective speedup:    7.8x (of theoretical 8x)

Average FPS: 60 (vsync limited)
Physics time: 8.2ms, Render time: 2.1ms, Idle: 6.4ms

$ ./physics_demo --benchmark

Running benchmark suite...

Test 1: Scaling with particle count (8 threads)
+------------+------------+--------+-------------+
| Particles  | Frame time | FPS    | Particles/ms|
+------------+------------+--------+-------------+
| 10,000     | 1.2ms      | 833 fps| 8,333       |
| 25,000     | 2.8ms      | 357 fps| 8,929       |
| 50,000     | 5.1ms      | 196 fps| 9,804       |
| 100,000    | 8.2ms      | 122 fps| 12,195      |
| 250,000    | 19.8ms     | 50 fps | 12,626      |
+------------+------------+--------+-------------+

Test 2: Scaling with thread count (100,000 particles)
[Parallel efficiency chart...]

Test 3: SIMD vs Scalar comparison
+------------------+--------+---------+
| Implementation   | Time   | Speedup |
+------------------+--------+---------+
| Scalar (no SIMD) | 6.4ms  | 1.0x    |
| SSE (4-wide)     | 1.8ms  | 3.6x    |
| AVX (8-wide)     | 0.8ms  | 8.0x    |
+------------------+--------+---------+

BENCHMARK COMPLETE
All targets met!

Solution Architecture

High-Level Design

PHYSICS ENGINE ARCHITECTURE
====================================================================

+--------------------------------------------------------------------+
|                         PHYSICS ENGINE                              |
|                                                                     |
|  +-------------------------+     +-----------------------------+   |
|  |     ParticleSystem      |     |      CollisionDetector      |   |
|  |-------------------------|     |-----------------------------|   |
|  | SoA Layout:             |     | SpatialHash:                |   |
|  |   float* pos_x          |<--->|   HashMap<CellID, Vec<ID>>  |   |
|  |   float* pos_y          |     |   cell_size: float          |   |
|  |   float* vel_x          |     |                             |   |
|  |   float* vel_y          |     | Methods:                    |   |
|  |   float* radius         |     |   insert(id, x, y)          |   |
|  |   size_t count          |     |   query(x, y, r) -> Vec<ID> |   |
|  |                         |     |   rebuild()                 |   |
|  | Methods:                |     +-----------------------------+   |
|  |   integrate(dt)         |                   |                   |
|  |   apply_gravity()       |                   |                   |
|  |   apply_damping()       |                   v                   |
|  +-------------------------+     +-----------------------------+   |
|             ^                    |      BroadPhase (parallel)  |   |
|             |                    |-----------------------------|   |
|             |                    | std::for_each(par, ...)     |   |
|             |                    | Per-thread local hash       |   |
|             |                    | Merge + deduplicate         |   |
|             |                    +-----------------------------+   |
|             |                                  |                   |
|             |                                  v                   |
|             |                    +-----------------------------+   |
|             |                    |      NarrowPhase (SIMD)     |   |
|             |                    |-----------------------------|   |
|             |                    | Process pairs in batches    |   |
|             |                    | SIMD distance calculation   |   |
|             |                    | Generate contacts           |   |
|             |                    +-----------------------------+   |
|             |                                  |                   |
|             v                                  v                   |
|  +-------------------------+     +-----------------------------+   |
|  |       Integrator        |     |     ConstraintSolver        |   |
|  |-------------------------|     |-----------------------------|   |
|  | SIMD Verlet:            |     | Sequential Impulses:        |   |
|  |   x += v*dt + 0.5*a*dt^2|     |   for each contact:         |   |
|  | Process 8 particles/op  |     |     compute impulse         |   |
|  |                         |     |     apply velocity change   |   |
|  +-------------------------+     +-----------------------------+   |
|                                                                     |
|  +-------------------------+     +-----------------------------+   |
|  |       ThreadPool        |     |      FrameProfiler          |   |
|  |-------------------------|     |-----------------------------|   |
|  | Worker threads: N       |     | Per-phase timing            |   |
|  | Task queue              |     | Rolling averages            |   |
|  | Work stealing           |     | Bottleneck detection        |   |
|  +-------------------------+     +-----------------------------+   |
|                                                                     |
+--------------------------------------------------------------------+

DATA FLOW PER FRAME:
====================

     ParticleSystem
           |
           v
   +-------+-------+
   |               |
   v               v
 [pos_x,y]     [radius]
   |               |
   +-------+-------+
           |
           v
    SpatialHash.rebuild()
           |
           v
    BroadPhase (parallel)
           |
           v
    Candidate Pairs
           |
           v
    NarrowPhase (SIMD)
           |
           v
    Contact List
           |
           v
    ConstraintSolver
           |
           v
    Velocity Updates
           |
           v
    Integrator (SIMD)
           |
           v
    Updated Positions

Key Components

// ParticleSystem.hpp - SoA Layout
struct ParticleSystem {
    // Position (SoA for SIMD)
    std::vector<float, AlignedAllocator<float, 32>> pos_x;
    std::vector<float, AlignedAllocator<float, 32>> pos_y;

    // Velocity (SoA for SIMD)
    std::vector<float, AlignedAllocator<float, 32>> vel_x;
    std::vector<float, AlignedAllocator<float, 32>> vel_y;

    // Acceleration (SoA for SIMD)
    std::vector<float, AlignedAllocator<float, 32>> acc_x;
    std::vector<float, AlignedAllocator<float, 32>> acc_y;

    // Per-particle properties
    std::vector<float> radius;
    std::vector<float> mass;
    std::vector<float> restitution;

    size_t count;

    void reserve(size_t n);
    void add_particle(float x, float y, float vx, float vy, float r, float m);
    void integrate_verlet_simd(float dt);
    void apply_gravity(float gx, float gy);
    void apply_damping(float factor);
    void enforce_bounds(float min_x, float max_x, float min_y, float max_y);
};
// SpatialHash.hpp - Broad-Phase Acceleration Structure
struct SpatialHash {
    float cell_size;
    float inv_cell_size;  // Precomputed 1/cell_size

    // Option 1: Dense grid (fixed world size)
    // std::vector<std::vector<uint32_t>> cells;

    // Option 2: Sparse hash map (infinite world)
    std::unordered_map<uint64_t, std::vector<uint32_t>> cells;

    SpatialHash(float cell_size);

    uint64_t hash_cell(int cx, int cy) const;
    void clear();
    void insert(uint32_t particle_id, float x, float y);
    void query_neighbors(float x, float y, float radius,
                         std::vector<uint32_t>& out) const;

    // Parallel rebuild
    void rebuild_parallel(const ParticleSystem& ps, size_t num_threads);
};
// CollisionPair.hpp - Contact Information
struct CollisionPair {
    uint32_t a;
    uint32_t b;
};

struct Contact {
    uint32_t a;
    uint32_t b;
    float nx, ny;       // Contact normal (a -> b)
    float penetration;  // Overlap distance
};

Data Flow Diagram

PARALLEL BROAD-PHASE PIPELINE
====================================================================

Input: ParticleSystem with N particles
Output: Vector of CollisionPair candidates

STEP 1: Parallel Spatial Hash Build
------------------------------------

Thread 0          Thread 1          Thread 2          Thread 3
+----------+      +----------+      +----------+      +----------+
| Particles|      | Particles|      | Particles|      | Particles|
| 0-24999  |      | 25000-   |      | 50000-   |      | 75000-   |
|          |      | 49999    |      | 74999    |      | 99999    |
+----+-----+      +----+-----+      +----+-----+      +----+-----+
     |                 |                 |                 |
     v                 v                 v                 v
+----------+      +----------+      +----------+      +----------+
| Local    |      | Local    |      | Local    |      | Local    |
| Hash     |      | Hash     |      | Hash     |      | Hash     |
| (partial)|      | (partial)|      | (partial)|      | (partial)|
+----+-----+      +----+-----+      +----+-----+      +----+-----+
     |                 |                 |                 |
     +--------+--------+--------+--------+
              |
              v (merge step)
     +------------------+
     | Combined Hash    |
     | (all particles)  |
     +------------------+


STEP 2: Parallel Neighbor Query
-------------------------------

Particles divided among threads again:

Thread 0: query neighbors for particles 0-24999
Thread 1: query neighbors for particles 25000-49999
Thread 2: query neighbors for particles 50000-74999
Thread 3: query neighbors for particles 75000-99999

Each thread writes to its own output vector (no synchronization needed)

     Thread 0          Thread 1          Thread 2          Thread 3
     +-------+         +-------+         +-------+         +-------+
     | Pairs |         | Pairs |         | Pairs |         | Pairs |
     | 0-M   |         | M-N   |         | N-O   |         | O-P   |
     +---+---+         +---+---+         +---+---+         +---+---+
         |                 |                 |                 |
         +--------+--------+--------+--------+
                  |
                  v
     +---------------------------+
     | All Candidate Pairs       |
     | (concatenated, no merge)  |
     +---------------------------+


SIMD NARROW-PHASE PIPELINE
====================================================================

Input: Vector of CollisionPair candidates
Output: Vector of Contact (actual collisions)

Process pairs in batches of 8 (AVX) or 4 (SSE):

Pairs:    [p0, p1, p2, p3, p4, p5, p6, p7]
          Each pair = (a, b) indices

Load into SIMD registers:
  ax = [pos_x[a0], pos_x[a1], ..., pos_x[a7]]
  ay = [pos_y[a0], pos_y[a1], ..., pos_y[a7]]
  bx = [pos_x[b0], pos_x[b1], ..., pos_x[b7]]
  by = [pos_y[b0], pos_y[b1], ..., pos_y[b7]]

Compute distances (SIMD):
  dx = bx - ax
  dy = by - ay
  dist_sq = dx*dx + dy*dy

Compare with sum of radii (SIMD):
  ra = [radius[a0], ..., radius[a7]]
  rb = [radius[b0], ..., radius[b7]]
  threshold_sq = (ra + rb)^2

Collision mask:
  mask = dist_sq < threshold_sq
  e.g., [true, false, true, true, false, false, true, false]

Extract colliding pairs using mask:
  for i where mask[i] is true:
      emit Contact{pairs[i].a, pairs[i].b, ...}

Implementation Guide

Development Environment Setup

# Required: C++20 compiler with parallel algorithms and SIMD support
# GCC 11+ (or Clang with libstdc++ 11+) recommended - std::experimental::simd ships with libstdc++

# Ubuntu/Debian
sudo apt-get install g++-12 libtbb-dev cmake ninja-build

# macOS (with Homebrew)
brew install gcc@13 tbb cmake ninja

# Check SIMD support
g++ -march=native -dM -E - < /dev/null | grep -E "AVX|SSE"

# Create project structure
mkdir -p physics_engine/{src,include,tests,benchmarks}
cd physics_engine

# Create CMakeLists.txt
cat > CMakeLists.txt << 'EOF'
cmake_minimum_required(VERSION 3.16)
project(physics_engine CXX)

set(CMAKE_CXX_STANDARD 20)
set(CMAKE_CXX_STANDARD_REQUIRED ON)

# Enable native SIMD instructions
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -march=native")

# Find TBB for parallel algorithms
find_package(TBB REQUIRED)

# Main library
add_library(physics_lib STATIC
    src/ParticleSystem.cpp
    src/SpatialHash.cpp
    src/CollisionDetector.cpp
    src/Integrator.cpp
    src/FrameProfiler.cpp
)
target_include_directories(physics_lib PUBLIC include)
target_link_libraries(physics_lib TBB::tbb)

# Demo executable
add_executable(physics_demo src/main.cpp)
target_link_libraries(physics_demo physics_lib)

# Benchmarks
add_executable(physics_bench benchmarks/benchmark.cpp)
target_link_libraries(physics_bench physics_lib)

# Tests
enable_testing()
add_executable(physics_tests tests/test_spatial_hash.cpp tests/test_integration.cpp)
target_link_libraries(physics_tests physics_lib)
add_test(NAME physics_tests COMMAND physics_tests)
EOF

Implementation Phases

Phase 1: Particle System Foundation (Days 1-4)

Goals:

  • Implement SoA ParticleSystem with aligned memory
  • Basic scalar Verlet integration
  • Boundary collision with walls
  • Visual verification (optional: SDL2 or headless benchmark)
// include/ParticleSystem.hpp
#pragma once
#include <vector>
#include <cstdint>
#include <cstddef>
#include <cstdlib>   // std::aligned_alloc, std::free
#include <memory>

// Aligned allocator for SIMD loads/stores
template<typename T, std::size_t Alignment>
struct AlignedAllocator {
    using value_type = T;

    AlignedAllocator() noexcept = default;
    template<typename U>
    AlignedAllocator(const AlignedAllocator<U, Alignment>&) noexcept {}

    // std::vector rebinds its allocator; the default deduction does not work
    // with a non-type template parameter, so provide rebind explicitly
    template<typename U>
    struct rebind { using other = AlignedAllocator<U, Alignment>; };

    T* allocate(std::size_t n) {
        // std::aligned_alloc requires the size to be a multiple of the alignment
        std::size_t bytes = ((n * sizeof(T) + Alignment - 1) / Alignment) * Alignment;
        return static_cast<T*>(std::aligned_alloc(Alignment, bytes));
    }

    void deallocate(T* p, std::size_t) noexcept {
        std::free(p);
    }
};

template<typename T, typename U, std::size_t A>
bool operator==(const AlignedAllocator<T, A>&, const AlignedAllocator<U, A>&) { return true; }
template<typename T, typename U, std::size_t A>
bool operator!=(const AlignedAllocator<T, A>&, const AlignedAllocator<U, A>&) { return false; }

class ParticleSystem {
public:
    // SoA layout with 32-byte alignment for AVX
    std::vector<float, AlignedAllocator<float, 32>> pos_x;
    std::vector<float, AlignedAllocator<float, 32>> pos_y;
    std::vector<float, AlignedAllocator<float, 32>> vel_x;
    std::vector<float, AlignedAllocator<float, 32>> vel_y;
    std::vector<float, AlignedAllocator<float, 32>> acc_x;
    std::vector<float, AlignedAllocator<float, 32>> acc_y;
    std::vector<float> radius;
    std::vector<float> mass;
    std::vector<float> restitution;  // per-particle bounciness, used by the Phase 6 response

    std::size_t count = 0;

    void reserve(std::size_t n);
    void add_particle(float x, float y, float vx, float vy, float r, float m);

    // Physics operations
    void apply_gravity(float gx, float gy);
    void integrate_scalar(float dt);
    void integrate_simd(float dt);  // Phase 3
    void enforce_bounds(float min_x, float max_x, float min_y, float max_y);
    void apply_damping(float factor);
};
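
A minimal sketch of the matching src/ParticleSystem.cpp for Phase 1. The scalar
Verlet step and the gravity/bounds handling follow the formulas from Part 3; the
perfectly elastic wall reflection and the default restitution of 1.0 are
illustrative choices, not requirements:

// src/ParticleSystem.cpp (Phase 1 sketch)
#include "ParticleSystem.hpp"
#include <algorithm>

void ParticleSystem::add_particle(float x, float y, float vx, float vy, float r, float m) {
    pos_x.push_back(x);    pos_y.push_back(y);
    vel_x.push_back(vx);   vel_y.push_back(vy);
    acc_x.push_back(0.0f); acc_y.push_back(0.0f);
    radius.push_back(r);
    mass.push_back(m);
    restitution.push_back(1.0f);  // default: perfectly elastic
    ++count;
}

void ParticleSystem::apply_gravity(float gx, float gy) {
    for (std::size_t i = 0; i < count; ++i) {
        acc_x[i] += gx;
        acc_y[i] += gy;
    }
}

void ParticleSystem::integrate_scalar(float dt) {
    for (std::size_t i = 0; i < count; ++i) {
        pos_x[i] += vel_x[i] * dt + 0.5f * acc_x[i] * dt * dt;
        pos_y[i] += vel_y[i] * dt + 0.5f * acc_y[i] * dt * dt;
        vel_x[i] += acc_x[i] * dt;
        vel_y[i] += acc_y[i] * dt;
    }
    // Accelerations are re-accumulated from forces next frame
    std::fill(acc_x.begin(), acc_x.end(), 0.0f);
    std::fill(acc_y.begin(), acc_y.end(), 0.0f);
}

void ParticleSystem::enforce_bounds(float min_x, float max_x, float min_y, float max_y) {
    for (std::size_t i = 0; i < count; ++i) {
        float r = radius[i];
        if (pos_x[i] - r < min_x) { pos_x[i] = min_x + r; vel_x[i] = -vel_x[i]; }
        if (pos_x[i] + r > max_x) { pos_x[i] = max_x - r; vel_x[i] = -vel_x[i]; }
        if (pos_y[i] - r < min_y) { pos_y[i] = min_y + r; vel_y[i] = -vel_y[i]; }
        if (pos_y[i] + r > max_y) { pos_y[i] = max_y - r; vel_y[i] = -vel_y[i]; }
    }
}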

Checkpoint: 1000 particles falling with gravity, bouncing off walls.

Phase 2: Spatial Hashing (Days 5-9)

Goals:

  • Implement SpatialHash with insert and query
  • Test with known configurations
  • Integrate with particle system
// include/SpatialHash.hpp
#pragma once
#include <unordered_map>
#include <vector>
#include <cstdint>

class SpatialHash {
public:
    explicit SpatialHash(float cell_size);

    void clear();
    void insert(uint32_t id, float x, float y);
    void query_neighbors(float x, float y, float radius,
                         std::vector<uint32_t>& out) const;

    // Statistics
    size_t cell_count() const { return cells_.size(); }
    size_t max_cell_size() const;

private:
    float cell_size_;
    float inv_cell_size_;

    // Hash cell coordinates to cell ID
    static uint64_t hash_coords(int32_t cx, int32_t cy);

    // Cell ID -> list of particle IDs
    std::unordered_map<uint64_t, std::vector<uint32_t>> cells_;
};

Implementation detail:

// src/SpatialHash.cpp
#include "SpatialHash.hpp"
#include <cmath>  // std::floor

uint64_t SpatialHash::hash_coords(int32_t cx, int32_t cy) {
    // Use large primes to reduce collisions
    constexpr uint64_t PRIME_X = 73856093ULL;
    constexpr uint64_t PRIME_Y = 19349663ULL;
    return (static_cast<uint64_t>(cx) * PRIME_X) ^
           (static_cast<uint64_t>(cy) * PRIME_Y);
}

void SpatialHash::insert(uint32_t id, float x, float y) {
    int32_t cx = static_cast<int32_t>(std::floor(x * inv_cell_size_));
    int32_t cy = static_cast<int32_t>(std::floor(y * inv_cell_size_));
    cells_[hash_coords(cx, cy)].push_back(id);
}

void SpatialHash::query_neighbors(float x, float y, float r,
                                   std::vector<uint32_t>& out) const {
    out.clear();

    int32_t min_cx = static_cast<int32_t>(std::floor((x - r) * inv_cell_size_));
    int32_t max_cx = static_cast<int32_t>(std::floor((x + r) * inv_cell_size_));
    int32_t min_cy = static_cast<int32_t>(std::floor((y - r) * inv_cell_size_));
    int32_t max_cy = static_cast<int32_t>(std::floor((y + r) * inv_cell_size_));

    for (int32_t cy = min_cy; cy <= max_cy; ++cy) {
        for (int32_t cx = min_cx; cx <= max_cx; ++cx) {
            auto it = cells_.find(hash_coords(cx, cy));
            if (it != cells_.end()) {
                out.insert(out.end(), it->second.begin(), it->second.end());
            }
        }
    }
}
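
With insert and query in place, a serial broad-phase pass is only a few lines. A
sketch of what it might look like before the parallel version arrives in Phase 4
(broad_phase_serial and the max_radius parameter are illustrative; querying with
r_i + max_radius is a conservative choice so no overlapping pair is missed):

#include <cstdint>
#include <vector>

void broad_phase_serial(const ParticleSystem& ps, const SpatialHash& hash,
                        float max_radius, std::vector<CollisionPair>& out_pairs) {
    out_pairs.clear();
    std::vector<uint32_t> neighbors;
    for (std::size_t i = 0; i < ps.count; ++i) {
        hash.query_neighbors(ps.pos_x[i], ps.pos_y[i],
                             ps.radius[i] + max_radius, neighbors);
        for (uint32_t j : neighbors) {
            if (static_cast<uint32_t>(i) < j) {  // Avoid duplicate pairs
                out_pairs.push_back({static_cast<uint32_t>(i), j});
            }
        }
    }
}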

Checkpoint: Spatial hash correctly returns neighbors; collision detection works.

Phase 3: SIMD Integration (Days 10-14)

Goals:

  • Implement SIMD Verlet integration using std::simd or intrinsics
  • Benchmark SIMD vs scalar
  • Handle non-multiple-of-8 particle counts
// Using std::experimental::simd (GCC 11+, Clang with libstdc++)
#include <experimental/simd>
namespace stdx = std::experimental;

void ParticleSystem::integrate_simd(float dt) {
    using simd_t = stdx::native_simd<float>;
    constexpr size_t lanes = simd_t::size();

    simd_t dt_vec(dt);
    simd_t half_dt2(0.5f * dt * dt);

    // Process X component
    for (size_t i = 0; i + lanes <= count; i += lanes) {
        simd_t x(pos_x.data() + i, stdx::element_aligned);
        simd_t vx(vel_x.data() + i, stdx::element_aligned);
        simd_t ax(acc_x.data() + i, stdx::element_aligned);

        // Verlet: x_new = x + v*dt + 0.5*a*dt^2
        x += vx * dt_vec + ax * half_dt2;
        vx += ax * dt_vec;

        x.copy_to(pos_x.data() + i, stdx::element_aligned);
        vx.copy_to(vel_x.data() + i, stdx::element_aligned);
    }

    // Process Y component (similar)
    for (size_t i = 0; i + lanes <= count; i += lanes) {
        simd_t y(pos_y.data() + i, stdx::element_aligned);
        simd_t vy(vel_y.data() + i, stdx::element_aligned);
        simd_t ay(acc_y.data() + i, stdx::element_aligned);

        y += vy * dt_vec + ay * half_dt2;
        vy += ay * dt_vec;

        y.copy_to(pos_y.data() + i, stdx::element_aligned);
        vy.copy_to(vel_y.data() + i, stdx::element_aligned);
    }

    // Scalar cleanup for remaining particles
    size_t remainder_start = (count / lanes) * lanes;
    for (size_t i = remainder_start; i < count; ++i) {
        pos_x[i] += vel_x[i] * dt + 0.5f * acc_x[i] * dt * dt;
        pos_y[i] += vel_y[i] * dt + 0.5f * acc_y[i] * dt * dt;
        vel_x[i] += acc_x[i] * dt;
        vel_y[i] += acc_y[i] * dt;
    }

    // Clear accelerations for next frame
    std::fill(acc_x.begin(), acc_x.end(), 0.0f);
    std::fill(acc_y.begin(), acc_y.end(), 0.0f);
}

Checkpoint: SIMD integration is 4-8x faster than scalar for large particle counts.

Phase 4: Parallel Broad-Phase (Days 15-21)

Goals:

  • Parallelize spatial hash rebuild
  • Parallelize neighbor queries
  • Handle thread-safety correctly
#include <execution>
#include <algorithm>
#include <numeric>
#include <mutex>

struct ThreadLocalHash {
    SpatialHash hash;
    std::vector<uint32_t> particle_ids;
};

void CollisionDetector::broad_phase_parallel(
    const ParticleSystem& ps,
    std::vector<CollisionPair>& out_pairs,
    size_t num_threads)
{
    out_pairs.clear();

    // Step 1: Build per-chunk spatial hashes.
    // NOTE: iterate over chunk IDs, not particle indices. Under std::execution::par
    // two iterations that map to the same local hash can run concurrently, so
    // indexing local_hashes by (i % num_threads) would be a data race.
    std::vector<ThreadLocalHash> local_hashes(num_threads,
        ThreadLocalHash{SpatialHash(cell_size_), {}});

    std::vector<size_t> chunk_ids(num_threads);
    std::iota(chunk_ids.begin(), chunk_ids.end(), 0);
    const size_t chunk_size = (ps.count + num_threads - 1) / num_threads;

    std::for_each(std::execution::par,
        chunk_ids.begin(), chunk_ids.end(),
        [&](size_t chunk) {
            auto& local = local_hashes[chunk];  // each task owns exactly one hash
            size_t begin = chunk * chunk_size;
            size_t end = std::min(begin + chunk_size, ps.count);
            for (size_t i = begin; i < end; ++i) {
                local.hash.insert(static_cast<uint32_t>(i), ps.pos_x[i], ps.pos_y[i]);
                local.particle_ids.push_back(static_cast<uint32_t>(i));
            }
        });

    // Step 2: Merge into global hash (serial; O(particles) inserts)
    SpatialHash global_hash(cell_size_);
    for (const auto& local : local_hashes) {
        for (uint32_t id : local.particle_ids) {
            global_hash.insert(id, ps.pos_x[id], ps.pos_y[id]);
        }
    }

    // Step 3: Parallel neighbor query - again one chunk per task, each task
    // writing only to its own output vector (no synchronization needed)
    std::vector<std::vector<CollisionPair>> per_chunk_pairs(num_threads);

    std::for_each(std::execution::par,
        chunk_ids.begin(), chunk_ids.end(),
        [&](size_t chunk) {
            size_t begin = chunk * chunk_size;
            size_t end = std::min(begin + chunk_size, ps.count);
            std::vector<uint32_t> neighbors;
            for (size_t i = begin; i < end; ++i) {
                // 2 * radius assumes roughly uniform radii; use r_i + max_radius otherwise
                global_hash.query_neighbors(
                    ps.pos_x[i], ps.pos_y[i], ps.radius[i] * 2.0f, neighbors);

                for (uint32_t j : neighbors) {
                    if (i < j) {  // Avoid duplicate pairs
                        per_chunk_pairs[chunk].push_back(
                            {static_cast<uint32_t>(i), j});
                    }
                }
            }
        });

    // Step 4: Concatenate results
    for (const auto& pairs : per_chunk_pairs) {
        out_pairs.insert(out_pairs.end(), pairs.begin(), pairs.end());
    }
}

Checkpoint: Parallel broad-phase scales with thread count; no races or duplicates.

Phase 5: SIMD Narrow-Phase (Days 22-28)

Goals:

  • Batch collision pairs for SIMD processing
  • Implement SIMD distance calculation
  • Generate contacts for colliding pairs
void CollisionDetector::narrow_phase_simd(
    const ParticleSystem& ps,
    const std::vector<CollisionPair>& pairs,
    std::vector<Contact>& out_contacts)
{
    out_contacts.clear();

    using simd_t = stdx::native_simd<float>;
    constexpr size_t lanes = simd_t::size();

    size_t pair_count = pairs.size();
    size_t batches = pair_count / lanes;

    for (size_t batch = 0; batch < batches; ++batch) {
        size_t base = batch * lanes;

        // Gather particle data for this batch
        alignas(32) float ax[lanes], ay[lanes], bx[lanes], by[lanes];
        alignas(32) float ra[lanes], rb[lanes];

        for (size_t i = 0; i < lanes; ++i) {
            const auto& pair = pairs[base + i];
            ax[i] = ps.pos_x[pair.a];
            ay[i] = ps.pos_y[pair.a];
            bx[i] = ps.pos_x[pair.b];
            by[i] = ps.pos_y[pair.b];
            ra[i] = ps.radius[pair.a];
            rb[i] = ps.radius[pair.b];
        }

        // Load into SIMD registers
        simd_t vax(ax, stdx::element_aligned);
        simd_t vay(ay, stdx::element_aligned);
        simd_t vbx(bx, stdx::element_aligned);
        simd_t vby(by, stdx::element_aligned);
        simd_t vra(ra, stdx::element_aligned);
        simd_t vrb(rb, stdx::element_aligned);

        // Compute distance squared
        simd_t dx = vbx - vax;
        simd_t dy = vby - vay;
        simd_t dist_sq = dx * dx + dy * dy;

        // Compute threshold (sum of radii squared)
        simd_t sum_r = vra + vrb;
        simd_t threshold_sq = sum_r * sum_r;

        // Collision mask
        auto mask = dist_sq < threshold_sq;

        // Extract colliding pairs
        alignas(32) float dist_sq_arr[lanes];
        dist_sq.copy_to(dist_sq_arr, stdx::element_aligned);

        for (size_t i = 0; i < lanes; ++i) {
            if (mask[i]) {
                float dist = std::sqrt(dist_sq_arr[i]);
                float penetration = (ra[i] + rb[i]) - dist;

                // Normal from a to b
                float nx = (dist > 1e-6f) ? (bx[i] - ax[i]) / dist : 1.0f;
                float ny = (dist > 1e-6f) ? (by[i] - ay[i]) / dist : 0.0f;

                out_contacts.push_back({
                    pairs[base + i].a,
                    pairs[base + i].b,
                    nx, ny,
                    penetration
                });
            }
        }
    }

    // Scalar remainder
    for (size_t i = batches * lanes; i < pair_count; ++i) {
        const auto& pair = pairs[i];
        float dx = ps.pos_x[pair.b] - ps.pos_x[pair.a];
        float dy = ps.pos_y[pair.b] - ps.pos_y[pair.a];
        float dist_sq = dx * dx + dy * dy;
        float sum_r = ps.radius[pair.a] + ps.radius[pair.b];

        if (dist_sq < sum_r * sum_r) {
            float dist = std::sqrt(dist_sq);
            float penetration = sum_r - dist;
            float nx = (dist > 1e-6f) ? dx / dist : 1.0f;
            float ny = (dist > 1e-6f) ? dy / dist : 0.0f;

            out_contacts.push_back({pair.a, pair.b, nx, ny, penetration});
        }
    }
}

Checkpoint: SIMD narrow-phase processes collision pairs efficiently.

Phase 6: Collision Response & Integration (Days 29-35)

Goals:

  • Implement impulse-based collision response
  • Integrate constraint solver
  • Complete frame loop with profiling
void CollisionDetector::resolve_collisions(
    ParticleSystem& ps,
    const std::vector<Contact>& contacts)
{
    for (const auto& c : contacts) {
        // Relative velocity
        float rel_vx = ps.vel_x[c.b] - ps.vel_x[c.a];
        float rel_vy = ps.vel_y[c.b] - ps.vel_y[c.a];

        // Relative velocity along normal
        float rel_v_normal = rel_vx * c.nx + rel_vy * c.ny;

        // Don't resolve if objects are separating
        if (rel_v_normal > 0) continue;

        // Restitution (bounciness)
        float e = std::min(ps.restitution[c.a], ps.restitution[c.b]);

        // Impulse magnitude
        float inv_mass_a = 1.0f / ps.mass[c.a];
        float inv_mass_b = 1.0f / ps.mass[c.b];
        float j = -(1.0f + e) * rel_v_normal / (inv_mass_a + inv_mass_b);

        // Apply impulse
        float impulse_x = j * c.nx;
        float impulse_y = j * c.ny;

        ps.vel_x[c.a] -= impulse_x * inv_mass_a;
        ps.vel_y[c.a] -= impulse_y * inv_mass_a;
        ps.vel_x[c.b] += impulse_x * inv_mass_b;
        ps.vel_y[c.b] += impulse_y * inv_mass_b;

        // Position correction (prevent sinking)
        const float correction_percent = 0.8f;
        const float slop = 0.01f;  // Allowable penetration
        float correction = std::max(c.penetration - slop, 0.0f) /
                          (inv_mass_a + inv_mass_b) * correction_percent;
        float corr_x = correction * c.nx;
        float corr_y = correction * c.ny;

        ps.pos_x[c.a] -= corr_x * inv_mass_a;
        ps.pos_y[c.a] -= corr_y * inv_mass_a;
        ps.pos_x[c.b] += corr_x * inv_mass_b;
        ps.pos_y[c.b] += corr_y * inv_mass_b;
    }
}

Checkpoint: Complete physics simulation running at target FPS.

Phase 7: Profiling & Optimization (Days 36-42)

Goals:

  • Add detailed frame profiling
  • Identify and fix bottlenecks
  • Achieve performance targets
// include/FrameProfiler.hpp
#pragma once
#include <chrono>
#include <string>
#include <vector>
#include <iostream>

class FrameProfiler {
public:
    void begin_phase(const std::string& name);
    void end_phase();
    void end_frame();
    void print_summary() const;

private:
    struct PhaseData {
        std::string name;
        double total_ms = 0.0;
        size_t count = 0;
    };

    std::chrono::high_resolution_clock::time_point phase_start_;
    std::vector<PhaseData> phases_;
    size_t current_phase_ = 0;
    size_t frame_count_ = 0;
    double total_frame_ms_ = 0.0;
};

// Usage in main loop
void simulate_frame(ParticleSystem& ps, CollisionDetector& detector,
                    FrameProfiler& profiler)
{
    profiler.begin_phase("Gravity");
    ps.apply_gravity(0.0f, -9.81f);
    profiler.end_phase();

    profiler.begin_phase("Broad-Phase");
    std::vector<CollisionPair> pairs;
    detector.broad_phase_parallel(ps, pairs, 8);
    profiler.end_phase();

    profiler.begin_phase("Narrow-Phase");
    std::vector<Contact> contacts;
    detector.narrow_phase_simd(ps, pairs, contacts);
    profiler.end_phase();

    profiler.begin_phase("Collision Response");
    detector.resolve_collisions(ps, contacts);
    profiler.end_phase();

    profiler.begin_phase("Integration");
    ps.integrate_simd(1.0f / 60.0f);
    profiler.end_phase();

    profiler.begin_phase("Bounds");
    ps.enforce_bounds(0.0f, 1000.0f, 0.0f, 1000.0f);
    profiler.end_phase();

    profiler.end_frame();
}
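
The profiler's bookkeeping is not shown above; a minimal sketch that matches the
header, assuming phases are entered in the same order every frame so they can be
matched to phases_ by position:

// src/FrameProfiler.cpp (sketch)
#include "FrameProfiler.hpp"

void FrameProfiler::begin_phase(const std::string& name) {
    if (current_phase_ >= phases_.size()) {
        phases_.push_back(PhaseData{name});  // first frame: register the phase
    }
    phase_start_ = std::chrono::high_resolution_clock::now();
}

void FrameProfiler::end_phase() {
    auto now = std::chrono::high_resolution_clock::now();
    double ms = std::chrono::duration<double, std::milli>(now - phase_start_).count();
    phases_[current_phase_].total_ms += ms;
    phases_[current_phase_].count += 1;
    total_frame_ms_ += ms;
    ++current_phase_;
}

void FrameProfiler::end_frame() {
    current_phase_ = 0;  // next frame revisits the phases in the same order
    ++frame_count_;
}

void FrameProfiler::print_summary() const {
    if (frame_count_ == 0) return;
    std::cout << "Average physics frame: " << total_frame_ms_ / frame_count_ << " ms\n";
    for (const auto& p : phases_) {
        if (p.count > 0) {
            std::cout << "  " << p.name << ": " << p.total_ms / p.count << " ms avg\n";
        }
    }
}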

Testing Strategy

Unit Tests

// tests/test_spatial_hash.cpp
TEST_CASE("Spatial hash insertion and query", "[spatial_hash]") {
    SpatialHash hash(10.0f);

    // Insert particles in a grid pattern
    for (int i = 0; i < 100; ++i) {
        float x = (i % 10) * 5.0f;
        float y = (i / 10) * 5.0f;
        hash.insert(i, x, y);
    }

    // Query should return neighbors
    std::vector<uint32_t> neighbors;
    hash.query_neighbors(25.0f, 25.0f, 10.0f, neighbors);

    // The query touches cells 1-3 on each axis, so it returns the 36 particles
    // with coordinates in [10, 35] on both axes (broad-phase over-approximates)
    REQUIRE(neighbors.size() == 36);
}

TEST_CASE("SIMD integration matches scalar", "[simd]") {
    ParticleSystem ps_scalar, ps_simd;

    // Initialize identically
    for (int i = 0; i < 1000; ++i) {
        float x = i * 1.0f, y = i * 2.0f;
        float vx = 1.0f, vy = -1.0f;
        ps_scalar.add_particle(x, y, vx, vy, 1.0f, 1.0f);
        ps_simd.add_particle(x, y, vx, vy, 1.0f, 1.0f);
    }

    ps_scalar.apply_gravity(0.0f, -9.81f);
    ps_simd.apply_gravity(0.0f, -9.81f);

    ps_scalar.integrate_scalar(1.0f / 60.0f);
    ps_simd.integrate_simd(1.0f / 60.0f);

    // Compare results
    for (size_t i = 0; i < ps_scalar.count; ++i) {
        REQUIRE(std::abs(ps_scalar.pos_x[i] - ps_simd.pos_x[i]) < 1e-5f);
        REQUIRE(std::abs(ps_scalar.pos_y[i] - ps_simd.pos_y[i]) < 1e-5f);
    }
}

Stress Tests

// benchmarks/benchmark.cpp
void benchmark_scaling() {
    const std::vector<size_t> particle_counts = {1000, 10000, 50000, 100000};
    const std::vector<size_t> thread_counts = {1, 2, 4, 8};

    for (size_t n : particle_counts) {
        std::cout << "\n=== " << n << " particles ===" << std::endl;

        ParticleSystem ps;
        initialize_random_particles(ps, n);

        for (size_t threads : thread_counts) {
            CollisionDetector detector(2.0f);

            auto start = std::chrono::high_resolution_clock::now();

            // Run 100 frames
            for (int frame = 0; frame < 100; ++frame) {
                std::vector<CollisionPair> pairs;
                detector.broad_phase_parallel(ps, pairs, threads);

                std::vector<Contact> contacts;
                detector.narrow_phase_simd(ps, pairs, contacts);

                detector.resolve_collisions(ps, contacts);
                ps.integrate_simd(1.0f / 60.0f);
            }

            auto end = std::chrono::high_resolution_clock::now();
            double ms = std::chrono::duration<double, std::milli>(end - start).count();

            std::cout << threads << " threads: " << ms / 100.0 << " ms/frame" << std::endl;
        }
    }
}

Verification Tests

// Verify conservation of momentum
TEST_CASE("Momentum conservation in collisions", "[physics]") {
    ParticleSystem ps;

    // Two particles moving toward each other
    ps.add_particle(0.0f, 0.0f, 1.0f, 0.0f, 1.0f, 1.0f);  // Moving right
    ps.add_particle(3.0f, 0.0f, -1.0f, 0.0f, 1.0f, 1.0f); // Moving left

    float initial_momentum_x = ps.vel_x[0] * ps.mass[0] + ps.vel_x[1] * ps.mass[1];

    // Simulate until collision
    CollisionDetector detector(2.0f);
    for (int i = 0; i < 100; ++i) {
        std::vector<CollisionPair> pairs;
        detector.broad_phase(ps, pairs);

        std::vector<Contact> contacts;
        detector.narrow_phase(ps, pairs, contacts);

        detector.resolve_collisions(ps, contacts);
        ps.integrate_simd(1.0f / 60.0f);
    }

    float final_momentum_x = ps.vel_x[0] * ps.mass[0] + ps.vel_x[1] * ps.mass[1];

    REQUIRE(std::abs(initial_momentum_x - final_momentum_x) < 1e-4f);
}

Common Pitfalls

Performance Pitfalls

| Problem | Symptom | Solution |
|---|---|---|
| False sharing | Poor parallel scaling | Pad per-thread data to cache line size (64 bytes) |
| Unaligned SIMD loads | Crashes or slow performance | Use AlignedAllocator, ensure 32-byte alignment |
| Hash table contention | Parallel hash insert is slow | Use per-thread hash tables, merge at end |
| Too many collision pairs | Narrow-phase dominates | Tune cell size to average 4-16 objects/cell |
| Branch misprediction | SIMD masked code is slow | Process collisions in batches, sort by outcome |
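
The false-sharing fix in the first row is mostly a matter of padding. A minimal sketch, assuming one accumulator slot per worker thread (PerThreadPairCount and num_threads are illustrative, not part of the project skeleton):

// Hypothetical per-thread accumulator, padded so adjacent threads never share a cache line.
struct alignas(64) PerThreadPairCount {
    size_t pairs_found = 0;
    // alignas(64) rounds sizeof(PerThreadPairCount) up to 64 bytes, so
    // counts[t] and counts[t + 1] always land on different cache lines.
};

std::vector<PerThreadPairCount> counts(num_threads);  // one slot per worker thread
// Each worker touches only counts[thread_id].pairs_found during the parallel phase;
// the main thread sums the slots afterwards, so no cache line ping-pongs between cores.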

Correctness Pitfalls

| Problem | Symptom | Solution |
|---|---|---|
| Duplicate pairs | Objects bounce twice | Use an i < j check in the broad-phase |
| Missing collisions | Objects pass through each other | Cell size too large; reduce to 2 * max_radius |
| Energy explosion | Objects accelerate over time | Use semi-implicit Euler or Verlet; clamp velocities |
| Penetration stacking | Objects sink into each other | Add a position correction step after the impulse |
| Race conditions | Non-deterministic behavior | Handle region boundaries with locks or ghost particles |
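
For the penetration-stacking row, the usual remedy is a small positional correction applied after the impulse pass. A minimal sketch; the Contact field names (a, b, normal_x, normal_y, penetration) are assumptions about the struct layout, and the constants are typical starting values:

// Hypothetical positional correction, run once per frame after resolve_collisions().
void correct_positions(ParticleSystem& ps, const std::vector<Contact>& contacts) {
    const float percent = 0.8f;   // remove only part of the overlap per frame (avoids jitter)
    const float slop    = 0.01f;  // ignore tiny overlaps entirely
    for (const Contact& c : contacts) {
        float inv_a = 1.0f / ps.mass[c.a];
        float inv_b = 1.0f / ps.mass[c.b];
        float correction = std::max(c.penetration - slop, 0.0f) / (inv_a + inv_b) * percent;
        ps.pos_x[c.a] -= inv_a * correction * c.normal_x;  // push a out along the contact normal
        ps.pos_y[c.a] -= inv_a * correction * c.normal_y;
        ps.pos_x[c.b] += inv_b * correction * c.normal_x;  // push b the opposite way
        ps.pos_y[c.b] += inv_b * correction * c.normal_y;
    }
}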

SIMD Pitfalls

// WRONG: claims vector alignment for an unaligned pointer (undefined behavior, may crash)
simd_t x(pos_x.data() + 1, stdx::vector_aligned);  // pos_x.data() + 1 is not 32-byte aligned!

// RIGHT: guarantee alignment (AlignedAllocator + aligned indices), then load with vector_aligned
assert(reinterpret_cast<uintptr_t>(pos_x.data() + i) % 32 == 0);
simd_t x(pos_x.data() + i, stdx::vector_aligned);
// (stdx::element_aligned is the safe fallback when alignment cannot be guaranteed)

// WRONG: Mixing SIMD and scalar in tight loop
for (size_t i = 0; i < count; ++i) {
    if (should_process(i)) {  // Branching kills SIMD
        process_simd(i);
    }
}

// RIGHT: Batch by condition, then process
std::vector<size_t> to_process;
for (size_t i = 0; i < count; ++i) {
    if (should_process(i)) {
        to_process.push_back(i);
    }
}
// Now process to_process in SIMD batches
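
One way to finish that last comment is to gather the selected elements into a dense scratch buffer so the hot loop stays branch-free. A sketch, assuming the stdx alias and a simd_t typedef along the lines used earlier in this project:

// (assumes: namespace stdx = std::experimental; using simd_t = stdx::native_simd<float>;)
std::vector<float> scratch(to_process.size());
for (size_t k = 0; k < to_process.size(); ++k) {
    scratch[k] = pos_x[to_process[k]];          // scalar gather into contiguous memory
}

size_t k = 0;
for (; k + simd_t::size() <= scratch.size(); k += simd_t::size()) {
    simd_t x(scratch.data() + k, stdx::element_aligned);
    // ... process a full SIMD lane's worth of selected particles, no branches ...
}
// the remaining (scratch.size() - k) elements go through a scalar tail loop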

Extensions & Challenges

Beginner Extensions

  • Visualization: Add SDL2 or OpenGL rendering to see particles
  • Mouse interaction: Click to spawn particles or apply forces
  • Particle trails: Draw motion trails for visual effect
  • Gravity wells: Add attractors that pull particles

Intermediate Extensions

  • Rigid body rotation: Add angular velocity and torque
  • Polygon collision: Implement SAT for convex polygons
  • Constraints: Add distance constraints (springs) between particles (see the sketch after this list)
  • Sleeping objects: Don’t simulate stationary objects
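
The constraints extension above maps cleanly onto the SoA particle data. A minimal position-based sketch; the DistanceConstraint struct and satisfy() helper are illustrative, not part of the project skeleton:

// Hypothetical distance constraint: keeps two particles rest_length apart.
struct DistanceConstraint {
    uint32_t a, b;
    float rest_length;
};

void satisfy(ParticleSystem& ps, const DistanceConstraint& c) {
    float dx = ps.pos_x[c.b] - ps.pos_x[c.a];
    float dy = ps.pos_y[c.b] - ps.pos_y[c.a];
    float dist = std::sqrt(dx * dx + dy * dy);
    if (dist < 1e-6f) return;                      // degenerate: particles coincide
    float w_a = 1.0f / ps.mass[c.a];
    float w_b = 1.0f / ps.mass[c.b];
    // Positive when stretched, negative when compressed; split by inverse mass.
    float correction = (dist - c.rest_length) / (dist * (w_a + w_b));
    ps.pos_x[c.a] += w_a * correction * dx;
    ps.pos_y[c.a] += w_a * correction * dy;
    ps.pos_x[c.b] -= w_b * correction * dx;
    ps.pos_y[c.b] -= w_b * correction * dy;
}

Iterating all constraints a few times per frame after integration gives springs, ropes, and (on a grid of particles) the cloth challenge below.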

Advanced Extensions

  • Continuous collision detection: Prevent tunneling for fast objects
  • GPU acceleration: Port narrow-phase to CUDA/OpenCL
  • Persistent contacts: Cache collision information between frames
  • Hierarchical spatial hash: Multi-resolution for varying object sizes

Challenge Problems

  1. 1 Million Particles: Achieve 30 FPS with 1,000,000 particles
  2. Stack Stability: 100 boxes stacked perfectly stable for 1000 frames
  3. Fluid Simulation: Implement SPH (Smoothed Particle Hydrodynamics)
  4. Cloth Simulation: Particles connected by distance constraints forming cloth

Resources

Essential Reading

| Resource | Topic | Chapters/Sections |
|---|---|---|
| “Game Physics Engine Development” - Ian Millington | Complete physics pipeline | Chapters 7-12 |
| “Real-Time Collision Detection” - Christer Ericson | Spatial partitioning, collision | Chapters 7-8 |
| “Computer Graphics from Scratch” - Gabriel Gambetta | Rendering integration | Chapters on 3D graphics |
| “C++ Concurrency in Action” - Anthony Williams | Parallel algorithms | Chapters 7-8 |

Online Resources

Video Resources

  • “Physics for Game Developers” - YouTube series by Two-Bit Coding
  • “Building a Physics Engine” - Sebastian Lague
  • “SIMD Programming” - Handmade Hero episodes on SIMD

Tools for Debugging

# CPU performance counters
perf stat -e cache-misses,cache-references,instructions,cycles ./physics_demo

# SIMD vectorization report (GCC)
g++ -O3 -march=native -fopt-info-vec-optimized src/Integrator.cpp

# Thread error detection (data races, misused locks)
valgrind --tool=helgrind ./physics_demo

# Memory error detection (invalid reads/writes, leaks)
valgrind --tool=memcheck ./physics_demo

Self-Assessment Checklist

Before considering this project complete, verify:

Understanding

  • I can explain why SoA is better than AoS for SIMD operations
  • I understand how spatial hashing reduces collision detection from O(N^2) to O(N)
  • I can describe the tradeoffs between large and small cell sizes in spatial hashing
  • I know why boundary particles require special handling in parallel partitioning
  • I can explain Verlet integration and why it’s preferred over explicit Euler (compare the integrator sketch after this list)
  • I understand the frame budget constraint and how to stay within it
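
To check the Verlet item above, it helps to write the three update rules side by side. A single-particle sketch with illustrative variables (x, v, a = acceleration, x_prev, dt):

// Explicit Euler: position uses the OLD velocity -> energy tends to grow over time.
x += v * dt;
v += a * dt;

// Semi-implicit (symplectic) Euler: update velocity first -> far better energy behavior.
v += a * dt;
x += v * dt;

// Position Verlet: velocity is implicit in the last two positions; stable and cheap for particles.
float x_new = 2.0f * x - x_prev + a * dt * dt;
x_prev = x;
x      = x_new;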

Implementation

  • ParticleSystem uses SoA layout with aligned memory
  • Spatial hash correctly returns neighbors without duplicates
  • SIMD integration is >4x faster than scalar baseline
  • Parallel broad-phase scales with thread count (>60% efficiency on 8 threads)
  • Collision response conserves momentum (verified by test)
  • Frame time is under 10ms for 100,000 particles on target hardware

Performance

  • Achieved 60 FPS with 100,000 particles
  • SIMD operations dominate (scalar fallbacks < 1%)
  • Cache miss rate is acceptable (profile with perf)
  • Thread utilization is balanced (no straggler threads)

Code Quality

  • Code is well-organized with clear separation of concerns
  • Performance-critical code is documented with complexity analysis
  • Unit tests cover core functionality
  • Benchmarks are reproducible and documented

Submission/Completion Criteria

Minimum Viable Completion

  • 10,000 particles simulated at 60 FPS
  • Basic spatial hashing for broad-phase
  • SIMD integration working
  • Simple collision response

Full Completion

  • 100,000 particles at 60 FPS with 8 threads
  • Parallel spatial hash rebuild and query
  • SIMD narrow-phase collision detection
  • Complete collision response with position correction
  • Frame profiler with per-phase timing
  • Performance scaling benchmarks documented

Excellence (Going Above & Beyond)

  • 250,000+ particles at 30+ FPS
  • GPU acceleration for narrow-phase
  • Continuous collision detection
  • Visualization with real-time rendering
  • Constraint system (springs, joints)
  • Published benchmark comparison with other physics engines

The Core Question You’re Answering

“How do you build a physics simulation that handles 100,000+ interacting objects while guaranteeing consistent frame times under 16 milliseconds?”

This project is the ultimate integration test for everything learned in the C++ Concurrency and Parallelism series. It combines:

  • Thread pools (from Project 3) for parallel work distribution
  • Parallel algorithms (from Projects 7-8) for broad-phase collision
  • SIMD (from Projects 12-13) for narrow-phase and integration
  • Lock-free techniques (from Projects 5-6) for thread-safe data structures
  • Real-time constraints requiring systematic profiling and optimization

When you complete this project, you will have demonstrated mastery of modern C++ parallelism and the ability to apply these techniques to a demanding real-world domain.


Books That Will Help

| Topic | Book | Specific Chapters |
|---|---|---|
| Physics simulation | “Game Physics Engine Development” - Millington | Chapters 7-12: Collision detection, response |
| Collision detection | “Real-Time Collision Detection” - Ericson | Chapter 7: Spatial partitioning |
| SIMD optimization | “Modern X86 Assembly Language Programming” - Kusswurm | Chapters on AVX |
| Parallel algorithms | “C++ Concurrency in Action” - Williams | Chapter 10: Parallel algorithms |
| Performance tuning | “Optimizing Software in C++” - Fog | Sections 7-9: Data layout, SIMD |
| Game architecture | “Game Programming Patterns” - Nystrom | Chapter on Update Method |

This guide was expanded from LEARN_CPP_CONCURRENCY_AND_PARALLELISM.md. For the complete learning path, see the README.