Project 14: Real-Time Game Physics with Parallelism
“The real-time constraint is not a performance target - it is a hard correctness requirement. A physics engine that misses frame deadlines delivers broken gameplay, no matter how accurate its simulation.” - Erin Catto, Creator of Box2D
Quick Reference
| Attribute | Value |
|---|---|
| Language | C++ (alt: Rust, C) |
| Difficulty | Expert |
| Time | 4-6 weeks |
| Prerequisites | Projects 7-8 (Parallel Algorithms), 12-13 (SIMD) |
| Coolness | Level 5: Pure Magic |
| Portfolio Value | Resume Gold - demonstrates systems mastery |
Learning Objectives
By completing this project, you will:
- Master real-time systems constraints: Understand frame budgets, deadline-driven design, and how to profile and optimize for consistent frame times
- Implement parallel broad-phase collision detection: Use spatial data structures (spatial hashing, grid-based partitioning) with parallel algorithms to find potential collision pairs
- Apply SIMD to narrow-phase physics: Vectorize intersection tests and physics integration using std::simd or platform intrinsics
- Design cache-friendly data layouts: Transform Array-of-Structures (AoS) to Structure-of-Arrays (SoA) for optimal SIMD and cache performance
- Integrate multiple parallel techniques: Combine thread pools, parallel algorithms, SIMD, and lock-free communication in a single coherent system
- Handle boundary synchronization: Solve the challenge of particles near region boundaries without introducing races or bottlenecks
- Profile and optimize systematically: Use profilers and benchmarks to identify bottlenecks and verify improvements
Theoretical Foundation
Part 1: The Physics Simulation Pipeline
Every real-time physics engine follows a similar pipeline, repeated every frame:
GAME PHYSICS PIPELINE (Per Frame)
====================================================================
Frame Start (t = 0ms)
|
v
+------------------------------------------------------------------+
| BROAD-PHASE COLLISION |
| |
| Goal: Find POTENTIAL collision pairs from N objects |
| Complexity target: O(N) or O(N log N), NOT O(N^2) |
| |
| Techniques: |
| - Spatial Hashing: Hash position to grid cell |
| - Sweep and Prune: Sort AABBs along one axis |
| - BVH: Hierarchical bounding volumes |
| |
| Output: List of (object_a, object_b) pairs that MIGHT collide |
+------------------------------------------------------------------+
|
v (2-3ms elapsed)
+------------------------------------------------------------------+
| NARROW-PHASE COLLISION |
| |
| Goal: Exact collision detection for candidate pairs |
| Per-pair tests are expensive but count is limited |
| |
| Techniques: |
| - GJK/EPA: General convex shapes |
| - SAT: Separating Axis Theorem for polygons |
| - Sphere-Sphere: sqrt-free distance test |
| - SIMD: Process 4/8 pairs simultaneously |
| |
| Output: Contact points, normals, penetration depths |
+------------------------------------------------------------------+
|
v (5-6ms elapsed)
+------------------------------------------------------------------+
| CONSTRAINT RESOLUTION |
| |
| Goal: Resolve overlaps and apply joint constraints |
| |
| Techniques: |
| - Sequential Impulses (SI): Iterative solver |
| - Projected Gauss-Seidel (PGS): Matrix-free iteration |
| - Position-Based Dynamics (PBD): Direct position correction |
| |
| Output: Velocity changes to separate colliding objects |
+------------------------------------------------------------------+
|
v (7-8ms elapsed)
+------------------------------------------------------------------+
| INTEGRATION |
| |
| Goal: Update positions and velocities |
| |
| Techniques: |
| - Euler: x += v*dt, v += a*dt (simple, less stable) |
| - Verlet: x_new = 2*x - x_old + a*dt^2 (stable, popular) |
| - RK4: 4th order Runge-Kutta (accurate, expensive) |
| - SIMD: Process 8 particles per instruction |
| |
| Output: New positions and velocities for all objects |
+------------------------------------------------------------------+
|
v (8-9ms elapsed)
+------------------------------------------------------------------+
| RENDERING SYNC |
| |
| Goal: Prepare physics state for renderer |
| Copy/interpolate positions to render thread |
| Remaining time: VSync wait or additional work |
+------------------------------------------------------------------+
|
v
Frame End (t = 16.67ms for 60 FPS)
FRAME BUDGET BREAKDOWN (60 FPS = 16.67ms total):
================================================
Broad-phase: 2.1ms (12.6%)
Narrow-phase: 3.8ms (22.8%)
Constraints: 1.5ms (9.0%)
Integration: 0.8ms (4.8%)
Rendering prep: 0.5ms (3.0%)
---------------------------------
Total physics: 8.7ms (52.2%)
Remaining: 7.97ms (for rendering, audio, etc.)
Book Reference: “Game Physics Engine Development” by Ian Millington, Chapters 7-9 cover the complete physics pipeline.
Part 2: Spatial Partitioning for Broad-Phase
The naive O(N^2) approach of checking every pair is unacceptable for large object counts. Spatial partitioning reduces this dramatically:
SPATIAL HASHING EXPLAINED
====================================================================
CONCEPT: Divide world into grid cells, hash position to cell ID
World Space (continuous): Grid Space (discrete):
+---------------------------+ +---+---+---+---+---+
| . o o | | 0 | 1 | 2 | 3 | 4 |
| . . | +---+---+---+---+---+
| o . o | ---> | 5 | 6 | 7 | 8 | 9 |
| . o . | +---+---+---+---+---+
| o . o | |10 |11 |12 |13 |14 |
+---------------------------+ +---+---+---+---+---+
HASH FUNCTION:
cell_x = floor(pos_x / cell_size)
cell_y = floor(pos_y / cell_size)
cell_id = cell_y * grid_width + cell_x
Or for infinite worlds (no grid bounds):
cell_id = hash(cell_x, cell_y) // e.g., (cell_x * 73856093) ^ (cell_y * 19349663)
DATA STRUCTURE:
====================================================================
Option 1: HashMap<CellID, Vec<ObjectID>>
- Pros: Sparse, handles infinite worlds
- Cons: Hash overhead, cache-unfriendly iteration
Option 2: Dense Grid Array
- Pros: Cache-friendly, simple indexing
- Cons: Memory proportional to world size
Option 3: Compact Hash Table (Robin Hood)
- Pros: Balance of sparse storage and cache locality
- Best choice for most games
QUERYING NEIGHBORS:
====================================================================
For object at position (px, py) with radius r:
min_cell_x = floor((px - r) / cell_size)
max_cell_x = floor((px + r) / cell_size)
min_cell_y = floor((py - r) / cell_size)
max_cell_y = floor((py + r) / cell_size)
for cy in min_cell_y..=max_cell_y:
for cx in min_cell_x..=max_cell_x:
cell_id = hash(cx, cy)
for other_object in grid[cell_id]:
if object.id < other_object.id: // Avoid duplicate pairs
candidate_pairs.push((object.id, other_object.id))
CHOOSING CELL SIZE:
====================================================================
                   Too Small              Optimal               Too Large
                   (cell < radius)        (cell ~ 2*radius)     (cell >> radius)
Objects per cell:  few                    ~4-16                 many
Cells to check:    many                   balanced              few, but many pairs per cell
Overhead:          high iteration count   optimal               O(N^2) within cells
Rule of thumb: cell_size ~ 2 * average_object_radius (about one object diameter)
PARALLEL SPATIAL HASH INSERT/QUERY
====================================================================
CHALLENGE: Multiple threads inserting to same cell = data race!
SOLUTION 1: Per-Thread Grid (Recommended for insert-heavy)
==========================================================
Thread 0 Grid: Thread 1 Grid: Thread 2 Grid:
+---+---+---+ +---+---+---+ +---+---+---+
| a | | b | | | c | | | d | | e |
+---+---+---+ +---+---+---+ +---+---+---+
| | f | | | g | | h | | | i | |
+---+---+---+ +---+---+---+ +---+---+---+
After parallel insert, merge grids:
+---+---+---+
|a,d| c |b,e|
+---+---+---+
| g |f,i| h |
+---+---+---+
Merge step: O(cells) work, parallelizable
SOLUTION 2: Atomic Cell Lists (Better for query-heavy)
======================================================
struct Cell {
std::atomic<ObjectNode*> head; // Lock-free linked list
};
void insert(Cell& cell, ObjectID id) {
auto* node = new ObjectNode{id, nullptr};
node->next = cell.head.load(std::memory_order_relaxed);
while (!cell.head.compare_exchange_weak(
node->next, node,
std::memory_order_release,
std::memory_order_relaxed)) {
// Retry on contention
}
}
SOLUTION 3: Partitioned Cells (Best for balanced workloads)
===========================================================
Assign cell ranges to threads:
- Thread 0 owns cells [0, 99]
- Thread 1 owns cells [100, 199]
- Thread 2 owns cells [200, 299]
- ...
Objects near boundaries:
+---+---+---+---+---+---+
| | | A | | | | <-- A is near T0/T1 boundary
+---+---+---+---+---+---+
T0 T1
Object A must be:
1. Inserted into T0's grid (primary)
2. Also inserted into T1's grid (ghost copy)
3. Collision pairs involving ghosts handled carefully
This is the "domain decomposition" approach from HPC.
Book Reference: “Real-Time Collision Detection” by Christer Ericson, Chapter 7 covers spatial partitioning in depth.
Part 3: Structure of Arrays (SoA) for SIMD
The data layout dramatically affects SIMD efficiency:
ARRAY OF STRUCTURES (AoS) - Traditional OOP
====================================================================
struct Particle {
float x, y, z; // Position
float vx, vy, vz; // Velocity
float ax, ay, az; // Acceleration (force / mass)
float mass;
uint32_t flags;
};
std::vector<Particle> particles; // [P0, P1, P2, P3, ...]
Memory layout:
+----+----+----+----+----+----+----+----+----+----+----+
| x0 | y0 | z0 |vx0 |vy0 |vz0 |ax0 |ay0 |az0 | m0 |fl0 |
+----+----+----+----+----+----+----+----+----+----+----+
| x1 | y1 | z1 |vx1 |vy1 |vz1 |ax1 |ay1 |az1 | m1 |fl1 |
+----+----+----+----+----+----+----+----+----+----+----+
| x2 | y2 | z2 | ...
+----+----+----+
To update all X positions: x += vx * dt
- Load x0, skip 10 floats, load x1, skip 10 floats, load x2...
- Cache lines wasted loading unused data (vy, vz, ax, ...)
- SIMD is impractical: the data is non-contiguous, so it requires slow gather/scatter loads
STRUCTURE OF ARRAYS (SoA) - SIMD-Friendly
====================================================================
struct ParticleSystem {
std::vector<float> x; // All X positions contiguous
std::vector<float> y; // All Y positions contiguous
std::vector<float> z; // All Z positions contiguous
std::vector<float> vx; // All X velocities contiguous
std::vector<float> vy;
std::vector<float> vz;
std::vector<float> ax;
std::vector<float> ay;
std::vector<float> az;
std::vector<float> mass;
std::vector<uint32_t> flags;
size_t count;
};
Memory layout:
X array: +----+----+----+----+----+----+----+----+
| x0 | x1 | x2 | x3 | x4 | x5 | x6 | x7 | ...
+----+----+----+----+----+----+----+----+
VX array: +----+----+----+----+----+----+----+----+
|vx0 |vx1 |vx2 |vx3 |vx4 |vx5 |vx6 |vx7 | ...
+----+----+----+----+----+----+----+----+
To update all X positions: x += vx * dt
- Load 8 consecutive X values into SIMD register
- Load 8 consecutive VX values into SIMD register
- Multiply, add, store - 8 particles at once!
SIMD INTEGRATION (Verlet Method)
====================================================================
Physics equation: x_new = x + v*dt + 0.5*a*dt^2
void integrate_soa(ParticleSystem& ps, float dt) {
    using simd_t = std::experimental::native_simd<float>;
    constexpr size_t lanes = simd_t::size(); // 4 (SSE), 8 (AVX), etc.
simd_t dt_vec = dt;
simd_t half_dt2 = 0.5f * dt * dt;
// Process 8 particles at a time (AVX)
for (size_t i = 0; i + lanes <= ps.count; i += lanes) {
// Load 8 X positions
simd_t x(ps.x.data() + i, std::experimental::element_aligned);
simd_t vx(ps.vx.data() + i, std::experimental::element_aligned);
simd_t ax(ps.ax.data() + i, std::experimental::element_aligned);
// x_new = x + vx*dt + 0.5*ax*dt^2
x += vx * dt_vec + ax * half_dt2;
// v_new = vx + ax*dt
vx += ax * dt_vec;
// Store back
x.copy_to(ps.x.data() + i, std::experimental::element_aligned);
vx.copy_to(ps.vx.data() + i, std::experimental::element_aligned);
}
// Similar for Y, Z components...
// Scalar cleanup for remaining particles
for (size_t i = (ps.count / lanes) * lanes; i < ps.count; ++i) {
ps.x[i] += ps.vx[i] * dt + 0.5f * ps.ax[i] * dt * dt;
ps.vx[i] += ps.ax[i] * dt;
}
}
PERFORMANCE COMPARISON
====================================================================
Operation: Update 100,000 particle positions
AoS (scalar):
- Cache misses: ~100,000 (each particle access spans cache line)
- Instructions: 100,000 * 6 = 600,000 scalar ops
- Time: 2.1ms
SoA (SIMD, AVX):
- Cache misses: ~12,500 (8 particles per cache line, contiguous)
- Instructions: 12,500 * 6 = 75,000 SIMD ops (8x parallelism)
- Time: 0.12ms
Speedup: 17.5x (8x from SIMD + 2x from cache efficiency)
Book Reference: “Optimizing Software in C++” by Agner Fog, Sections 7-9 cover data layout and SIMD optimization.
Part 4: Parallel Collision Detection Strategy
PARALLEL BROAD-PHASE WITH DOMAIN DECOMPOSITION
====================================================================
World divided into regions, each owned by one thread:
+---------------+---------------+---------------+---------------+
| Region 0 | Region 1 | Region 2 | Region 3 |
| (Thread 0) | (Thread 1) | (Thread 2) | (Thread 3) |
| | | | |
| o o o | o o | o o o | o o |
| o o | o o | o | o o o |
| o o | o o | o o o | o |
| | | | |
+---------------+---------------+---------------+---------------+
STEP 1: Classify particles by region (parallel)
================================================
std::for_each(std::execution::par,
particle_indices.begin(), particle_indices.end(),
[&](size_t i) {
int region = compute_region(ps.x[i], ps.y[i]);
region_lists[region].push_back_atomic(i); // Lock-free append
});
STEP 2: Process each region independently (parallel)
====================================================
std::for_each(std::execution::par,
regions.begin(), regions.end(),
[&](Region& region) {
// Build spatial hash for this region
SpatialHash local_hash(cell_size);
for (size_t i : region.particle_indices) {
local_hash.insert(i, ps.x[i], ps.y[i]);
}
// Find collision pairs within region
for (size_t i : region.particle_indices) {
auto neighbors = local_hash.query(ps.x[i], ps.y[i], ps.radius[i]);
for (size_t j : neighbors) {
if (i < j) {
region.collision_pairs.push_back({i, j});
}
}
}
});
BOUNDARY HANDLING
====================================================================
Problem: Particles near region boundaries may collide with particles in
adjacent regions, but each thread only sees its own region.
+---------------+---------------+
| Region 0 | Region 1 |
| | |
| A | B | <-- A and B might collide!
| | |
+---------------+---------------+
SOLUTION: Boundary zones processed separately
+---------------+---------------+
| Region 0 | Region 1 |
| Interior | Interior |
| | |
| (parallel) | (parallel) |
| +---+---+---+ |
| | | B | | |
| | A | | | |
| +---+---+---+ |
| Boundary Zone |
| (serialized or |
| with locks) |
+---------------+---------------+
IMPLEMENTATION:
1. Classify particles as Interior (> collision_radius from boundary) or Boundary
2. Process Interior particles in parallel (no coordination needed)
3. Process Boundary particles with synchronization:
- Option A: Sequential processing (simple, may be bottleneck)
- Option B: Fine-grained locks on boundary cells
- Option C: Duplicate boundary particles in adjacent regions (ghost particles)
GHOST PARTICLE APPROACH (Popular in HPC):
=========================================
Region 0 has:
- 1000 interior particles (owned)
- 50 ghost particles copied from Region 1
Region 1 has:
- 1000 interior particles (owned)
- 50 ghost particles copied from Region 0
Collision pairs involving ghosts:
- (owned, ghost) -> Belongs to region with owned particle
- (ghost, ghost) -> Ignored (will be found by some other region)
This requires a ghost exchange step between regions each frame.
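To make the ownership rules concrete, here is a minimal sketch of the classification step, assuming a one-dimensional decomposition along x. GhostPlan, region_width, and interaction_radius are illustrative names, not part of the engine interfaces defined later:
#include <cmath>
#include <cstddef>
#include <vector>
// Hypothetical sketch: decide which region owns a particle and which
// neighbouring regions need a ghost copy (1-D decomposition along x).
struct GhostPlan {
    int owner;                  // region that owns the particle
    std::vector<int> ghost_in;  // regions that receive a ghost copy
};
GhostPlan classify_particle(float x, float region_width,
                            int num_regions, float interaction_radius) {
    GhostPlan plan;
    plan.owner = static_cast<int>(std::floor(x / region_width));
    float local = x - plan.owner * region_width;  // offset inside the owning region
    // Close to the left edge -> the left neighbour needs a ghost copy
    if (plan.owner > 0 && local < interaction_radius)
        plan.ghost_in.push_back(plan.owner - 1);
    // Close to the right edge -> the right neighbour needs a ghost copy
    if (plan.owner < num_regions - 1 && region_width - local < interaction_radius)
        plan.ghost_in.push_back(plan.owner + 1);
    return plan;
}
Each frame, ghost copies are rebuilt from these plans before the per-region broad-phase runs, and (ghost, ghost) pairs are skipped as described above.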
Part 5: Frame Budget Management
REAL-TIME CONSTRAINT ANALYSIS
====================================================================
Target: 60 FPS = 16.67ms per frame
Physics budget: 50% = 8.33ms
Safety margin: 20%, so plan for 6.67ms (headroom for worst-case spikes)
FRAME TIMING DIAGRAM:
=====================
Frame N Timeline:
|--------------------16.67ms--------------------|
|--Physics 8ms--|--Render 6ms--|--VSync 2ms--|
^
Frame deadline
If physics takes 10ms:
|------Physics 10ms-------|--Render 6ms--|
                                          X Frame miss: no time left for the
                                            VSync/present slot before 16.67ms
Stuttering visible to player
MAINTAINING FRAME BUDGET:
====================================================================
1. FIXED TIMESTEP SIMULATION
-------------------------
Never let simulation time vary - always step by fixed dt (e.g., 1/60 sec)
void game_loop() {
double accumulator = 0.0;
const double fixed_dt = 1.0 / 60.0;
while (running) {
double frame_time = get_elapsed_time();
accumulator += frame_time;
// May run multiple physics steps per frame if behind
while (accumulator >= fixed_dt) {
physics_step(fixed_dt);
accumulator -= fixed_dt;
}
// Interpolate for rendering (smooth visuals)
double alpha = accumulator / fixed_dt;
render(interpolate_state(prev_state, curr_state, alpha));
}
}
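The interpolate_state call above is left undefined; a minimal sketch, assuming a simple snapshot of positions (RenderState and its fields are illustrative names), is just a linear blend by alpha:
#include <cstddef>
#include <vector>
// Hedged sketch: blend previous and current physics positions by alpha so the
// renderer stays smooth between fixed physics steps.
struct RenderState { std::vector<float> x, y; };
RenderState interpolate_state(const RenderState& prev, const RenderState& curr, double alpha) {
    RenderState out;
    out.x.resize(curr.x.size());
    out.y.resize(curr.y.size());
    for (std::size_t i = 0; i < curr.x.size(); ++i) {
        out.x[i] = static_cast<float>(prev.x[i] + (curr.x[i] - prev.x[i]) * alpha);
        out.y[i] = static_cast<float>(prev.y[i] + (curr.y[i] - prev.y[i]) * alpha);
    }
    return out;
}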
2. LEVEL OF DETAIL (LOD) FOR PHYSICS
---------------------------------
Reduce simulation quality when frame budget is tight
if (frame_budget_remaining < 2ms) {
// Reduce constraint solver iterations
solver_iterations = 2; // Instead of 8
} else {
solver_iterations = 8;
}
if (object_count > 50000) {
// Use cheaper collision for distant objects
distant_objects.use_sphere_bounds(); // Instead of mesh
}
3. TIME SLICING
------------
Spread expensive work across multiple frames
Frame 0: Process particles 0-25000
Frame 1: Process particles 25001-50000
Frame 2: Process particles 50001-75000
Frame 3: Process particles 75001-100000
Frame 4: Process particles 0-25000 (repeat)
Works for sleeping/stable objects, not active collisions
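As a rough illustration (the cursor, budget, and callback below are placeholders, not part of the engine), time slicing can be as simple as carrying a cursor across frames:
#include <algorithm>
#include <cstddef>
// Hedged sketch: process a fixed-size window of items per frame, wrapping the
// cursor around once a full pass has completed.
struct TimeSlicer {
    std::size_t cursor = 0;
    template <typename Fn>
    void run_slice(std::size_t total, std::size_t budget_per_frame, Fn&& fn) {
        std::size_t end = std::min(cursor + budget_per_frame, total);
        for (std::size_t i = cursor; i < end; ++i) {
            fn(i);  // e.g., a cheap wake-up test for sleeping particle i
        }
        cursor = (end == total) ? 0 : end;  // wrap when the pass completes
    }
};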
PROFILING METRICS TO TRACK:
====================================================================
struct FrameMetrics {
float broad_phase_ms;
float narrow_phase_ms;
float constraints_ms;
float integration_ms;
float total_physics_ms;
size_t collision_pairs_tested;
size_t actual_collisions;
size_t active_particles;
// Thread utilization
float thread_utilization[MAX_THREADS];
// SIMD efficiency
size_t simd_operations;
size_t scalar_fallback_operations;
};
void log_frame_metrics(const FrameMetrics& m) {
printf("Frame: %.2fms (budget: %.2fms) %s\n",
m.total_physics_ms, 8.33,
m.total_physics_ms > 8.33 ? "OVER!" : "OK");
printf(" Broad: %.2fms (%zu pairs)\n", m.broad_phase_ms, m.collision_pairs_tested);
printf(" Narrow: %.2fms (%zu collisions)\n", m.narrow_phase_ms, m.actual_collisions);
printf(" Solve: %.2fms\n", m.constraints_ms);
printf(" Integrate: %.2fms (%zu particles)\n", m.integration_ms, m.active_particles);
}
Project Specification
What You Will Build
A high-performance particle physics simulation that:
- Simulates 100,000+ particles at 60 FPS
- Uses parallel spatial hashing for broad-phase collision detection
- Implements SIMD-accelerated Verlet integration
- Renders the simulation in a window (optional - can be headless with metrics output)
- Provides detailed performance profiling per frame
Functional Requirements
- Particle System:
- Support 100,000+ particles with position, velocity, acceleration, radius
- SoA data layout for SIMD-friendly access
- Configurable gravity, damping, and collision restitution
- Collision Detection:
- Spatial hash-based broad-phase with O(N) average complexity
- Parallel broad-phase using std::execution::par or thread pool
- Sphere-sphere narrow-phase (SIMD accelerated)
- Collision response with momentum conservation
- Physics Integration:
- Verlet integration with SIMD (process 8 particles per instruction)
- Fixed timestep simulation (1/60 second)
- Boundary collision with walls
- Performance:
- Target: 60 FPS with 100,000 particles on 8-core CPU
- Parallel efficiency: > 60% on 8 threads
- SIMD speedup: > 4x compared to scalar baseline
Real World Outcome
When you complete this project, running your simulation will produce:
$ ./physics_demo --particles 100000 --threads 8
Physics Simulation Configuration:
Particles: 100,000
Cell size: 2.0 (spatial hash)
World bounds: [0, 1000] x [0, 1000]
Integration: Verlet (SIMD)
SIMD width: 8 lanes (AVX)
Threads: 8
Target FPS: 60 (16.67ms frame budget)
Initializing particle system... done (45ms)
Building spatial hash... done (3ms)
Running simulation (press Ctrl+C to stop)...
Frame breakdown (target: 16.67ms for 60fps):
+------------------------+--------+--------+
| Phase | Time | Budget |
+------------------------+--------+--------+
| Broad-phase collision | 2.1ms | 12.6% |
| Narrow-phase collision | 3.8ms | 22.8% |
| Constraint resolution | 1.5ms | 9.0% |
| Integration | 0.8ms | 4.8% |
| Rendering sync | 0.5ms | 3.0% |
+------------------------+--------+--------+
| Total physics | 8.7ms | 52.2% |
| Frame overhead | 7.97ms | 47.8% |
+------------------------+--------+--------+
Performance scaling (100,000 particles):
+----------+------------+--------+------------+
| Threads | Frame time | FPS | Efficiency |
+----------+------------+--------+------------+
| 1 | 45.2ms | 22 fps | 100% |
| 2 | 24.1ms | 41 fps | 94% |
| 4 | 12.8ms | 78 fps | 88% |
| 8 | 8.2ms | 122 fps| 69% |
+----------+------------+--------+------------+
Collision statistics:
Broad-phase pairs tested: 245,892
Actual collisions found: 12,456
Collision detection ratio: 5.1% (good spatial hash efficiency)
SIMD efficiency:
SIMD operations: 12,500,000
Scalar fallbacks: 156 (0.001%)
Effective speedup: 7.8x (of theoretical 8x)
Average FPS: 60 (vsync limited)
Physics time: 8.2ms, Render time: 2.1ms, Idle: 6.4ms
$ ./physics_demo --benchmark
Running benchmark suite...
Test 1: Scaling with particle count (8 threads)
+------------+------------+--------+-------------+
| Particles | Frame time | FPS | Particles/ms|
+------------+------------+--------+-------------+
| 10,000 | 1.2ms | 833 fps| 8,333 |
| 25,000 | 2.8ms | 357 fps| 8,929 |
| 50,000 | 5.1ms | 196 fps| 9,804 |
| 100,000 | 8.2ms | 122 fps| 12,195 |
| 250,000 | 19.8ms | 50 fps | 12,626 |
+------------+------------+--------+-------------+
Test 2: Scaling with thread count (100,000 particles)
[Parallel efficiency chart...]
Test 3: SIMD vs Scalar comparison
+------------------+--------+---------+
| Implementation | Time | Speedup |
+------------------+--------+---------+
| Scalar (no SIMD) | 6.4ms | 1.0x |
| SSE (4-wide) | 1.8ms | 3.6x |
| AVX (8-wide) | 0.8ms | 8.0x |
+------------------+--------+---------+
BENCHMARK COMPLETE
All targets met!
Solution Architecture
High-Level Design
PHYSICS ENGINE ARCHITECTURE
====================================================================
+--------------------------------------------------------------------+
| PHYSICS ENGINE |
| |
| +-------------------------+ +-----------------------------+ |
| | ParticleSystem | | CollisionDetector | |
| |-------------------------| |-----------------------------| |
| | SoA Layout: | | SpatialHash: | |
| | float* pos_x |<--->| HashMap<CellID, Vec<ID>> | |
| | float* pos_y | | cell_size: float | |
| | float* vel_x | | | |
| | float* vel_y | | Methods: | |
| | float* radius | | insert(id, x, y) | |
| | size_t count | | query(x, y, r) -> Vec<ID> | |
| | | | rebuild() | |
| | Methods: | +-----------------------------+ |
| | integrate(dt) | | |
| | apply_gravity() | | |
| | apply_damping() | v |
| +-------------------------+ +-----------------------------+ |
| ^ | BroadPhase (parallel) | |
| | |-----------------------------| |
| | | std::for_each(par, ...) | |
| | | Per-thread local hash | |
| | | Merge + deduplicate | |
| | +-----------------------------+ |
| | | |
| | v |
| | +-----------------------------+ |
| | | NarrowPhase (SIMD) | |
| | |-----------------------------| |
| | | Process pairs in batches | |
| | | SIMD distance calculation | |
| | | Generate contacts | |
| | +-----------------------------+ |
| | | |
| v v |
| +-------------------------+ +-----------------------------+ |
| | Integrator | | ConstraintSolver | |
| |-------------------------| |-----------------------------| |
| | SIMD Verlet: | | Sequential Impulses: | |
| | x += v*dt + 0.5*a*dt^2| | for each contact: | |
| | Process 8 particles/op | | compute impulse | |
| | | | apply velocity change | |
| +-------------------------+ +-----------------------------+ |
| |
| +-------------------------+ +-----------------------------+ |
| | ThreadPool | | FrameProfiler | |
| |-------------------------| |-----------------------------| |
| | Worker threads: N | | Per-phase timing | |
| | Task queue | | Rolling averages | |
| | Work stealing | | Bottleneck detection | |
| +-------------------------+ +-----------------------------+ |
| |
+--------------------------------------------------------------------+
DATA FLOW PER FRAME:
====================
ParticleSystem
|
v
+-------+-------+
| |
v v
[pos_x,y] [radius]
| |
+-------+-------+
|
v
SpatialHash.rebuild()
|
v
BroadPhase (parallel)
|
v
Candidate Pairs
|
v
NarrowPhase (SIMD)
|
v
Contact List
|
v
ConstraintSolver
|
v
Velocity Updates
|
v
Integrator (SIMD)
|
v
Updated Positions
Key Components
// ParticleSystem.hpp - SoA Layout
struct ParticleSystem {
// Position (SoA for SIMD)
std::vector<float, AlignedAllocator<float, 32>> pos_x;
std::vector<float, AlignedAllocator<float, 32>> pos_y;
// Velocity (SoA for SIMD)
std::vector<float, AlignedAllocator<float, 32>> vel_x;
std::vector<float, AlignedAllocator<float, 32>> vel_y;
// Acceleration (SoA for SIMD)
std::vector<float, AlignedAllocator<float, 32>> acc_x;
std::vector<float, AlignedAllocator<float, 32>> acc_y;
// Per-particle properties
std::vector<float> radius;
std::vector<float> mass;
std::vector<float> restitution;
size_t count;
void reserve(size_t n);
void add_particle(float x, float y, float vx, float vy, float r, float m);
void integrate_verlet_simd(float dt);
void apply_gravity(float gx, float gy);
void apply_damping(float factor);
void enforce_bounds(float min_x, float max_x, float min_y, float max_y);
};
// SpatialHash.hpp - Broad-Phase Acceleration Structure
struct SpatialHash {
float cell_size;
float inv_cell_size; // Precomputed 1/cell_size
// Option 1: Dense grid (fixed world size)
// std::vector<std::vector<uint32_t>> cells;
// Option 2: Sparse hash map (infinite world)
std::unordered_map<uint64_t, std::vector<uint32_t>> cells;
SpatialHash(float cell_size);
uint64_t hash_cell(int cx, int cy) const;
void clear();
void insert(uint32_t particle_id, float x, float y);
void query_neighbors(float x, float y, float radius,
std::vector<uint32_t>& out) const;
// Parallel rebuild
void rebuild_parallel(const ParticleSystem& ps, size_t num_threads);
};
// CollisionPair.hpp - Contact Information
struct CollisionPair {
uint32_t a;
uint32_t b;
};
struct Contact {
uint32_t a;
uint32_t b;
float nx, ny; // Contact normal (a -> b)
float penetration; // Overlap distance
};
Data Flow Diagram
PARALLEL BROAD-PHASE PIPELINE
====================================================================
Input: ParticleSystem with N particles
Output: Vector of CollisionPair candidates
STEP 1: Parallel Spatial Hash Build
------------------------------------
Thread 0 Thread 1 Thread 2 Thread 3
+----------+ +----------+ +----------+ +----------+
| Particles| | Particles| | Particles| | Particles|
| 0-24999 | | 25000- | | 50000- | | 75000- |
| | | 49999 | | 74999 | | 99999 |
+----+-----+ +----+-----+ +----+-----+ +----+-----+
| | | |
v v v v
+----------+ +----------+ +----------+ +----------+
| Local | | Local | | Local | | Local |
| Hash | | Hash | | Hash | | Hash |
| (partial)| | (partial)| | (partial)| | (partial)|
+----+-----+ +----+-----+ +----+-----+ +----+-----+
| | | |
+--------+--------+--------+--------+
|
v (merge step)
+------------------+
| Combined Hash |
| (all particles) |
+------------------+
STEP 2: Parallel Neighbor Query
-------------------------------
Particles divided among threads again:
Thread 0: query neighbors for particles 0-24999
Thread 1: query neighbors for particles 25000-49999
Thread 2: query neighbors for particles 50000-74999
Thread 3: query neighbors for particles 75000-99999
Each thread writes to its own output vector (no synchronization needed)
Thread 0 Thread 1 Thread 2 Thread 3
+-------+ +-------+ +-------+ +-------+
| Pairs | | Pairs | | Pairs | | Pairs |
| 0-M | | M-N | | N-O | | O-P |
+---+---+ +---+---+ +---+---+ +---+---+
| | | |
+--------+--------+--------+--------+
|
v
+---------------------------+
| All Candidate Pairs |
| (concatenated, no merge) |
+---------------------------+
SIMD NARROW-PHASE PIPELINE
====================================================================
Input: Vector of CollisionPair candidates
Output: Vector of Contact (actual collisions)
Process pairs in batches of 8 (AVX) or 4 (SSE):
Pairs: [p0, p1, p2, p3, p4, p5, p6, p7]
Each pair = (a, b) indices
Load into SIMD registers:
ax = [pos_x[a0], pos_x[a1], ..., pos_x[a7]]
ay = [pos_y[a0], pos_y[a1], ..., pos_y[a7]]
bx = [pos_x[b0], pos_x[b1], ..., pos_x[b7]]
by = [pos_y[b0], pos_y[b1], ..., pos_y[b7]]
Compute distances (SIMD):
dx = bx - ax
dy = by - ay
dist_sq = dx*dx + dy*dy
Compare with sum of radii (SIMD):
ra = [radius[a0], ..., radius[a7]]
rb = [radius[b0], ..., radius[b7]]
threshold_sq = (ra + rb)^2
Collision mask:
mask = dist_sq < threshold_sq
e.g., [true, false, true, true, false, false, true, false]
Extract colliding pairs using mask:
for i where mask[i] is true:
emit Contact{pairs[i].a, pairs[i].b, ...}
Implementation Guide
Development Environment Setup
# Required: C++20 compiler with parallel algorithms and std::experimental::simd
# GCC 11+ or Clang 12+ (with libstdc++) recommended
# Ubuntu/Debian
sudo apt-get install g++-12 libtbb-dev cmake ninja-build
# macOS (with Homebrew)
brew install gcc@13 tbb cmake ninja
# Check SIMD support
g++ -march=native -dM -E - < /dev/null | grep -E "AVX|SSE"
# Create project structure
mkdir -p physics_engine/{src,include,tests,benchmarks}
cd physics_engine
# Create CMakeLists.txt
cat > CMakeLists.txt << 'EOF'
cmake_minimum_required(VERSION 3.16)
project(physics_engine CXX)
set(CMAKE_CXX_STANDARD 20)
set(CMAKE_CXX_STANDARD_REQUIRED ON)
# Enable native SIMD instructions
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -march=native")
# Find TBB for parallel algorithms
find_package(TBB REQUIRED)
# Main library
add_library(physics_lib STATIC
src/ParticleSystem.cpp
src/SpatialHash.cpp
src/CollisionDetector.cpp
src/Integrator.cpp
src/FrameProfiler.cpp
)
target_include_directories(physics_lib PUBLIC include)
target_link_libraries(physics_lib TBB::tbb)
# Demo executable
add_executable(physics_demo src/main.cpp)
target_link_libraries(physics_demo physics_lib)
# Benchmarks
add_executable(physics_bench benchmarks/benchmark.cpp)
target_link_libraries(physics_bench physics_lib)
# Tests
enable_testing()
add_executable(physics_tests tests/test_spatial_hash.cpp tests/test_integration.cpp)
target_link_libraries(physics_tests physics_lib)
add_test(NAME physics_tests COMMAND physics_tests)
EOF
Implementation Phases
Phase 1: Particle System Foundation (Days 1-4)
Goals:
- Implement SoA ParticleSystem with aligned memory
- Basic scalar Verlet integration
- Boundary collision with walls
- Visual verification (optional: SDL2 or headless benchmark)
// include/ParticleSystem.hpp
#pragma once
#include <vector>
#include <cstdint>
#include <cstddef>
#include <memory>
#include <cstdlib> // std::aligned_alloc / std::free
// Aligned allocator for SIMD
template<typename T, std::size_t Alignment>
struct AlignedAllocator {
    using value_type = T;
    AlignedAllocator() = default;
    template<typename U>
    AlignedAllocator(const AlignedAllocator<U, Alignment>&) noexcept {}
    // Needed because the non-type Alignment parameter defeats the default rebind
    template<typename U>
    struct rebind { using other = AlignedAllocator<U, Alignment>; };
    T* allocate(std::size_t n) {
        // aligned_alloc requires the byte count to be a multiple of the alignment
        std::size_t bytes = ((n * sizeof(T) + Alignment - 1) / Alignment) * Alignment;
        return static_cast<T*>(std::aligned_alloc(Alignment, bytes));
    }
    void deallocate(T* p, std::size_t) noexcept {
        std::free(p);
    }
};
class ParticleSystem {
public:
// SoA layout with 32-byte alignment for AVX
std::vector<float, AlignedAllocator<float, 32>> pos_x;
std::vector<float, AlignedAllocator<float, 32>> pos_y;
std::vector<float, AlignedAllocator<float, 32>> vel_x;
std::vector<float, AlignedAllocator<float, 32>> vel_y;
std::vector<float, AlignedAllocator<float, 32>> acc_x;
std::vector<float, AlignedAllocator<float, 32>> acc_y;
std::vector<float> radius;
std::vector<float> mass;
std::size_t count = 0;
void reserve(std::size_t n);
void add_particle(float x, float y, float vx, float vy, float r, float m);
// Physics operations
void apply_gravity(float gx, float gy);
void integrate_scalar(float dt);
void integrate_simd(float dt); // Phase 3
void enforce_bounds(float min_x, float max_x, float min_y, float max_y);
void apply_damping(float factor);
};
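If you want a concrete starting point, here is a minimal sketch of the Phase 1 scalar methods. Assumptions: forces are cleared after each integration step, and wall collisions simply clamp position and reflect velocity (no restitution yet); adjust as needed.
// src/ParticleSystem.cpp - scalar reference implementation (sketch)
#include "ParticleSystem.hpp"
void ParticleSystem::reserve(std::size_t n) {
    pos_x.reserve(n); pos_y.reserve(n);
    vel_x.reserve(n); vel_y.reserve(n);
    acc_x.reserve(n); acc_y.reserve(n);
    radius.reserve(n); mass.reserve(n);
}
void ParticleSystem::add_particle(float x, float y, float vx, float vy, float r, float m) {
    pos_x.push_back(x);    pos_y.push_back(y);
    vel_x.push_back(vx);   vel_y.push_back(vy);
    acc_x.push_back(0.0f); acc_y.push_back(0.0f);
    radius.push_back(r);   mass.push_back(m);
    ++count;
}
void ParticleSystem::apply_gravity(float gx, float gy) {
    for (std::size_t i = 0; i < count; ++i) { acc_x[i] += gx; acc_y[i] += gy; }
}
void ParticleSystem::integrate_scalar(float dt) {
    for (std::size_t i = 0; i < count; ++i) {
        pos_x[i] += vel_x[i] * dt + 0.5f * acc_x[i] * dt * dt;
        pos_y[i] += vel_y[i] * dt + 0.5f * acc_y[i] * dt * dt;
        vel_x[i] += acc_x[i] * dt;
        vel_y[i] += acc_y[i] * dt;
        acc_x[i] = 0.0f;  // clear accumulated forces for the next frame
        acc_y[i] = 0.0f;
    }
}
void ParticleSystem::enforce_bounds(float min_x, float max_x, float min_y, float max_y) {
    for (std::size_t i = 0; i < count; ++i) {
        if (pos_x[i] - radius[i] < min_x) { pos_x[i] = min_x + radius[i]; vel_x[i] = -vel_x[i]; }
        if (pos_x[i] + radius[i] > max_x) { pos_x[i] = max_x - radius[i]; vel_x[i] = -vel_x[i]; }
        if (pos_y[i] - radius[i] < min_y) { pos_y[i] = min_y + radius[i]; vel_y[i] = -vel_y[i]; }
        if (pos_y[i] + radius[i] > max_y) { pos_y[i] = max_y - radius[i]; vel_y[i] = -vel_y[i]; }
    }
}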
Checkpoint: 1000 particles falling with gravity, bouncing off walls.
Phase 2: Spatial Hashing (Days 5-9)
Goals:
- Implement SpatialHash with insert and query
- Test with known configurations
- Integrate with particle system
// include/SpatialHash.hpp
#pragma once
#include <unordered_map>
#include <vector>
#include <cstdint>
class SpatialHash {
public:
explicit SpatialHash(float cell_size);
void clear();
void insert(uint32_t id, float x, float y);
void query_neighbors(float x, float y, float radius,
std::vector<uint32_t>& out) const;
// Statistics
size_t cell_count() const { return cells_.size(); }
size_t max_cell_size() const;
private:
float cell_size_;
float inv_cell_size_;
// Hash cell coordinates to cell ID
static uint64_t hash_coords(int32_t cx, int32_t cy);
// Cell ID -> list of particle IDs
std::unordered_map<uint64_t, std::vector<uint32_t>> cells_;
};
Implementation detail (src/SpatialHash.cpp; needs <cmath> for std::floor):
uint64_t SpatialHash::hash_coords(int32_t cx, int32_t cy) {
// Use large primes to reduce collisions
constexpr uint64_t PRIME_X = 73856093ULL;
constexpr uint64_t PRIME_Y = 19349663ULL;
return (static_cast<uint64_t>(cx) * PRIME_X) ^
(static_cast<uint64_t>(cy) * PRIME_Y);
}
void SpatialHash::insert(uint32_t id, float x, float y) {
int32_t cx = static_cast<int32_t>(std::floor(x * inv_cell_size_));
int32_t cy = static_cast<int32_t>(std::floor(y * inv_cell_size_));
cells_[hash_coords(cx, cy)].push_back(id);
}
void SpatialHash::query_neighbors(float x, float y, float r,
std::vector<uint32_t>& out) const {
out.clear();
int32_t min_cx = static_cast<int32_t>(std::floor((x - r) * inv_cell_size_));
int32_t max_cx = static_cast<int32_t>(std::floor((x + r) * inv_cell_size_));
int32_t min_cy = static_cast<int32_t>(std::floor((y - r) * inv_cell_size_));
int32_t max_cy = static_cast<int32_t>(std::floor((y + r) * inv_cell_size_));
for (int32_t cy = min_cy; cy <= max_cy; ++cy) {
for (int32_t cx = min_cx; cx <= max_cx; ++cx) {
auto it = cells_.find(hash_coords(cx, cy));
if (it != cells_.end()) {
out.insert(out.end(), it->second.begin(), it->second.end());
}
}
}
}
Checkpoint: Spatial hash correctly returns neighbors; collision detection works.
Phase 3: SIMD Integration (Days 10-14)
Goals:
- Implement SIMD Verlet integration using std::simd or intrinsics
- Benchmark SIMD vs scalar
- Handle non-multiple-of-8 particle counts
// Using std::experimental::simd (GCC 11+, Clang with libstdc++)
#include <experimental/simd>
namespace stdx = std::experimental;
void ParticleSystem::integrate_simd(float dt) {
using simd_t = stdx::native_simd<float>;
constexpr size_t lanes = simd_t::size();
simd_t dt_vec(dt);
simd_t half_dt2(0.5f * dt * dt);
// Process X component
for (size_t i = 0; i + lanes <= count; i += lanes) {
simd_t x(pos_x.data() + i, stdx::element_aligned);
simd_t vx(vel_x.data() + i, stdx::element_aligned);
simd_t ax(acc_x.data() + i, stdx::element_aligned);
// Verlet: x_new = x + v*dt + 0.5*a*dt^2
x += vx * dt_vec + ax * half_dt2;
vx += ax * dt_vec;
x.copy_to(pos_x.data() + i, stdx::element_aligned);
vx.copy_to(vel_x.data() + i, stdx::element_aligned);
}
// Process Y component (similar)
for (size_t i = 0; i + lanes <= count; i += lanes) {
simd_t y(pos_y.data() + i, stdx::element_aligned);
simd_t vy(vel_y.data() + i, stdx::element_aligned);
simd_t ay(acc_y.data() + i, stdx::element_aligned);
y += vy * dt_vec + ay * half_dt2;
vy += ay * dt_vec;
y.copy_to(pos_y.data() + i, stdx::element_aligned);
vy.copy_to(vel_y.data() + i, stdx::element_aligned);
}
// Scalar cleanup for remaining particles
size_t remainder_start = (count / lanes) * lanes;
for (size_t i = remainder_start; i < count; ++i) {
pos_x[i] += vel_x[i] * dt + 0.5f * acc_x[i] * dt * dt;
pos_y[i] += vel_y[i] * dt + 0.5f * acc_y[i] * dt * dt;
vel_x[i] += acc_x[i] * dt;
vel_y[i] += acc_y[i] * dt;
}
// Clear accelerations for next frame
std::fill(acc_x.begin(), acc_x.end(), 0.0f);
std::fill(acc_y.begin(), acc_y.end(), 0.0f);
}
Checkpoint: SIMD integration is 4-8x faster than scalar for large particle counts.
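To verify this checkpoint, a small timing harness helps; the sketch below is illustrative (the file name and particle setup are assumptions) and compares the two integrators declared in the Phase 1 header:
// benchmarks/bench_integration.cpp (sketch) - scalar vs SIMD integration timing
#include <chrono>
#include <cstdio>
#include "ParticleSystem.hpp"
static double time_ms(ParticleSystem& ps, void (ParticleSystem::*step)(float), int frames) {
    auto start = std::chrono::high_resolution_clock::now();
    for (int f = 0; f < frames; ++f) {
        ps.apply_gravity(0.0f, -9.81f);
        (ps.*step)(1.0f / 60.0f);  // call the chosen integrator
    }
    auto end = std::chrono::high_resolution_clock::now();
    return std::chrono::duration<double, std::milli>(end - start).count() / frames;
}
int main() {
    ParticleSystem scalar_ps, simd_ps;
    for (int i = 0; i < 100000; ++i) {
        scalar_ps.add_particle(i * 0.01f, 500.0f, 0.0f, 0.0f, 1.0f, 1.0f);
        simd_ps.add_particle(i * 0.01f, 500.0f, 0.0f, 0.0f, 1.0f, 1.0f);
    }
    double scalar_ms = time_ms(scalar_ps, &ParticleSystem::integrate_scalar, 200);
    double simd_ms   = time_ms(simd_ps,   &ParticleSystem::integrate_simd,   200);
    std::printf("scalar: %.3f ms/frame, simd: %.3f ms/frame, speedup: %.1fx\n",
                scalar_ms, simd_ms, scalar_ms / simd_ms);
}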
Phase 4: Parallel Broad-Phase (Days 15-21)
Goals:
- Parallelize spatial hash rebuild
- Parallelize neighbor queries
- Handle thread-safety correctly
#include <execution>
#include <algorithm>
#include <numeric>
#include <mutex>
struct ThreadLocalHash {
SpatialHash hash;
std::vector<uint32_t> particle_ids;
};
void CollisionDetector::broad_phase_parallel(
    const ParticleSystem& ps,
    std::vector<CollisionPair>& out_pairs,
    size_t num_threads)
{
    out_pairs.clear();
    // Step 1: Build per-thread spatial hashes.
    // Parallelize over contiguous chunks so each chunk (and its local hash) is
    // touched by exactly one execution agent. Note: assigning particles to
    // buckets by i % num_threads inside a parallel for_each would NOT be safe,
    // because two particles mapping to the same bucket can be processed
    // concurrently and race on that bucket's hash.
    std::vector<ThreadLocalHash> local_hashes(num_threads,
        ThreadLocalHash{SpatialHash(cell_size_), {}});
    std::vector<size_t> chunk_ids(num_threads);
    std::iota(chunk_ids.begin(), chunk_ids.end(), 0);
    const size_t chunk_size = (ps.count + num_threads - 1) / num_threads;
    std::for_each(std::execution::par,
        chunk_ids.begin(), chunk_ids.end(),
        [&](size_t chunk) {
            auto& local = local_hashes[chunk];
            size_t begin = chunk * chunk_size;
            size_t end = std::min(begin + chunk_size, ps.count);
            for (size_t i = begin; i < end; ++i) {
                // Chunk-local insert (no synchronization needed)
                local.hash.insert(static_cast<uint32_t>(i), ps.pos_x[i], ps.pos_y[i]);
                local.particle_ids.push_back(static_cast<uint32_t>(i));
            }
        });
    // Step 2: Merge into global hash (serial, O(N))
    SpatialHash global_hash(cell_size_);
    for (const auto& local : local_hashes) {
        for (uint32_t id : local.particle_ids) {
            global_hash.insert(id, ps.pos_x[id], ps.pos_y[id]);
        }
    }
    // Step 3: Parallel neighbor query - one output vector per chunk, so no
    // synchronization is needed when appending pairs
    std::vector<std::vector<CollisionPair>> per_thread_pairs(num_threads);
    std::for_each(std::execution::par,
        chunk_ids.begin(), chunk_ids.end(),
        [&](size_t chunk) {
            size_t begin = chunk * chunk_size;
            size_t end = std::min(begin + chunk_size, ps.count);
            std::vector<uint32_t> neighbors;
            for (size_t i = begin; i < end; ++i) {
                global_hash.query_neighbors(
                    ps.pos_x[i], ps.pos_y[i], ps.radius[i] * 2.0f, neighbors);
                for (uint32_t j : neighbors) {
                    if (i < j) { // Avoid duplicate (a,b)/(b,a) pairs
                        per_thread_pairs[chunk].push_back(
                            {static_cast<uint32_t>(i), j});
                    }
                }
            }
        });
    // Step 4: Concatenate results
    for (const auto& pairs : per_thread_pairs) {
        out_pairs.insert(out_pairs.end(), pairs.begin(), pairs.end());
    }
}
Checkpoint: Parallel broad-phase scales with thread count; no races or duplicates.
Phase 5: SIMD Narrow-Phase (Days 22-28)
Goals:
- Batch collision pairs for SIMD processing
- Implement SIMD distance calculation
- Generate contacts for colliding pairs
void CollisionDetector::narrow_phase_simd(
const ParticleSystem& ps,
const std::vector<CollisionPair>& pairs,
std::vector<Contact>& out_contacts)
{
out_contacts.clear();
using simd_t = stdx::native_simd<float>;
constexpr size_t lanes = simd_t::size();
size_t pair_count = pairs.size();
size_t batches = pair_count / lanes;
for (size_t batch = 0; batch < batches; ++batch) {
size_t base = batch * lanes;
// Gather particle data for this batch
alignas(32) float ax[lanes], ay[lanes], bx[lanes], by[lanes];
alignas(32) float ra[lanes], rb[lanes];
for (size_t i = 0; i < lanes; ++i) {
const auto& pair = pairs[base + i];
ax[i] = ps.pos_x[pair.a];
ay[i] = ps.pos_y[pair.a];
bx[i] = ps.pos_x[pair.b];
by[i] = ps.pos_y[pair.b];
ra[i] = ps.radius[pair.a];
rb[i] = ps.radius[pair.b];
}
// Load into SIMD registers
simd_t vax(ax, stdx::element_aligned);
simd_t vay(ay, stdx::element_aligned);
simd_t vbx(bx, stdx::element_aligned);
simd_t vby(by, stdx::element_aligned);
simd_t vra(ra, stdx::element_aligned);
simd_t vrb(rb, stdx::element_aligned);
// Compute distance squared
simd_t dx = vbx - vax;
simd_t dy = vby - vay;
simd_t dist_sq = dx * dx + dy * dy;
// Compute threshold (sum of radii squared)
simd_t sum_r = vra + vrb;
simd_t threshold_sq = sum_r * sum_r;
// Collision mask
auto mask = dist_sq < threshold_sq;
// Extract colliding pairs
alignas(32) float dist_sq_arr[lanes];
dist_sq.copy_to(dist_sq_arr, stdx::element_aligned);
for (size_t i = 0; i < lanes; ++i) {
if (mask[i]) {
float dist = std::sqrt(dist_sq_arr[i]);
float penetration = (ra[i] + rb[i]) - dist;
// Normal from a to b
float nx = (dist > 1e-6f) ? (bx[i] - ax[i]) / dist : 1.0f;
float ny = (dist > 1e-6f) ? (by[i] - ay[i]) / dist : 0.0f;
out_contacts.push_back({
pairs[base + i].a,
pairs[base + i].b,
nx, ny,
penetration
});
}
}
}
// Scalar remainder
for (size_t i = batches * lanes; i < pair_count; ++i) {
const auto& pair = pairs[i];
float dx = ps.pos_x[pair.b] - ps.pos_x[pair.a];
float dy = ps.pos_y[pair.b] - ps.pos_y[pair.a];
float dist_sq = dx * dx + dy * dy;
float sum_r = ps.radius[pair.a] + ps.radius[pair.b];
if (dist_sq < sum_r * sum_r) {
float dist = std::sqrt(dist_sq);
float penetration = sum_r - dist;
float nx = (dist > 1e-6f) ? dx / dist : 1.0f;
float ny = (dist > 1e-6f) ? dy / dist : 0.0f;
out_contacts.push_back({pair.a, pair.b, nx, ny, penetration});
}
}
}
Checkpoint: SIMD narrow-phase processes collision pairs efficiently.
Phase 6: Collision Response & Integration (Days 29-35)
Goals:
- Implement impulse-based collision response
- Integrate constraint solver
- Complete frame loop with profiling
void CollisionDetector::resolve_collisions(
ParticleSystem& ps,
const std::vector<Contact>& contacts)
{
for (const auto& c : contacts) {
// Relative velocity
float rel_vx = ps.vel_x[c.b] - ps.vel_x[c.a];
float rel_vy = ps.vel_y[c.b] - ps.vel_y[c.a];
// Relative velocity along normal
float rel_v_normal = rel_vx * c.nx + rel_vy * c.ny;
// Don't resolve if objects are separating
if (rel_v_normal > 0) continue;
// Restitution (bounciness)
float e = std::min(ps.restitution[c.a], ps.restitution[c.b]);
// Impulse magnitude
float inv_mass_a = 1.0f / ps.mass[c.a];
float inv_mass_b = 1.0f / ps.mass[c.b];
float j = -(1.0f + e) * rel_v_normal / (inv_mass_a + inv_mass_b);
// Apply impulse
float impulse_x = j * c.nx;
float impulse_y = j * c.ny;
ps.vel_x[c.a] -= impulse_x * inv_mass_a;
ps.vel_y[c.a] -= impulse_y * inv_mass_a;
ps.vel_x[c.b] += impulse_x * inv_mass_b;
ps.vel_y[c.b] += impulse_y * inv_mass_b;
// Position correction (prevent sinking)
const float correction_percent = 0.8f;
const float slop = 0.01f; // Allowable penetration
float correction = std::max(c.penetration - slop, 0.0f) /
(inv_mass_a + inv_mass_b) * correction_percent;
float corr_x = correction * c.nx;
float corr_y = correction * c.ny;
ps.pos_x[c.a] -= corr_x * inv_mass_a;
ps.pos_y[c.a] -= corr_y * inv_mass_a;
ps.pos_x[c.b] += corr_x * inv_mass_b;
ps.pos_y[c.b] += corr_y * inv_mass_b;
}
}
Checkpoint: Complete physics simulation running at target FPS.
Phase 7: Profiling & Optimization (Days 36-42)
Goals:
- Add detailed frame profiling
- Identify and fix bottlenecks
- Achieve performance targets
// include/FrameProfiler.hpp
#pragma once
#include <chrono>
#include <string>
#include <vector>
#include <iostream>
class FrameProfiler {
public:
void begin_phase(const std::string& name);
void end_phase();
void end_frame();
void print_summary() const;
private:
struct PhaseData {
std::string name;
double total_ms = 0.0;
size_t count = 0;
};
std::chrono::high_resolution_clock::time_point phase_start_;
std::vector<PhaseData> phases_;
size_t current_phase_ = 0;
size_t frame_count_ = 0;
double total_frame_ms_ = 0.0;
};
// Usage in main loop
void simulate_frame(ParticleSystem& ps, CollisionDetector& detector,
FrameProfiler& profiler)
{
profiler.begin_phase("Gravity");
ps.apply_gravity(0.0f, -9.81f);
profiler.end_phase();
profiler.begin_phase("Broad-Phase");
std::vector<CollisionPair> pairs;
detector.broad_phase_parallel(ps, pairs, 8);
profiler.end_phase();
profiler.begin_phase("Narrow-Phase");
std::vector<Contact> contacts;
detector.narrow_phase_simd(ps, pairs, contacts);
profiler.end_phase();
profiler.begin_phase("Collision Response");
detector.resolve_collisions(ps, contacts);
profiler.end_phase();
profiler.begin_phase("Integration");
ps.integrate_simd(1.0f / 60.0f);
profiler.end_phase();
profiler.begin_phase("Bounds");
ps.enforce_bounds(0.0f, 1000.0f, 0.0f, 1000.0f);
profiler.end_phase();
profiler.end_frame();
}
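The profiler methods above are only declared; a minimal sketch of an implementation (linear search by phase name, rolling totals, no nested phases) could look like this:
// src/FrameProfiler.cpp (sketch)
#include "FrameProfiler.hpp"
void FrameProfiler::begin_phase(const std::string& name) {
    // Find or create the entry for this phase and remember which one is active
    for (std::size_t i = 0; i < phases_.size(); ++i) {
        if (phases_[i].name == name) {
            current_phase_ = i;
            phase_start_ = std::chrono::high_resolution_clock::now();
            return;
        }
    }
    phases_.push_back({name, 0.0, 0});
    current_phase_ = phases_.size() - 1;
    phase_start_ = std::chrono::high_resolution_clock::now();
}
void FrameProfiler::end_phase() {
    auto now = std::chrono::high_resolution_clock::now();
    double ms = std::chrono::duration<double, std::milli>(now - phase_start_).count();
    phases_[current_phase_].total_ms += ms;
    phases_[current_phase_].count += 1;
    total_frame_ms_ += ms;  // accumulate total physics time across all frames
}
void FrameProfiler::end_frame() {
    ++frame_count_;
}
void FrameProfiler::print_summary() const {
    std::cout << "Average over " << frame_count_ << " frames:\n";
    for (const auto& p : phases_) {
        if (p.count > 0)
            std::cout << "  " << p.name << ": " << (p.total_ms / p.count) << " ms\n";
    }
    std::cout << "  Total physics: "
              << (frame_count_ ? total_frame_ms_ / frame_count_ : 0.0)
              << " ms/frame\n";
}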
Testing Strategy
Unit Tests
// tests/test_spatial_hash.cpp (assumes the Catch2 test framework)
#include <catch2/catch_test_macros.hpp>
#include "SpatialHash.hpp"
TEST_CASE("Spatial hash insertion and query", "[spatial_hash]") {
    SpatialHash hash(10.0f);
    // Insert 100 particles on a 10x10 grid with 5-unit spacing (x, y in 0..45)
    for (int i = 0; i < 100; ++i) {
        float x = (i % 10) * 5.0f;
        float y = (i / 10) * 5.0f;
        hash.insert(i, x, y);
    }
    // Query at (25, 25) with radius 10: this touches the 3x3 block of cells
    // covering [10, 40) x [10, 40)
    std::vector<uint32_t> neighbors;
    hash.query_neighbors(25.0f, 25.0f, 10.0f, neighbors);
    // Broad-phase queries return a conservative superset (everything in the
    // touched cells): the 6x6 grid points inside that square, i.e. 36 particles
    REQUIRE(neighbors.size() == 36);
}
TEST_CASE("SIMD integration matches scalar", "[simd]") {
ParticleSystem ps_scalar, ps_simd;
// Initialize identically
for (int i = 0; i < 1000; ++i) {
float x = i * 1.0f, y = i * 2.0f;
float vx = 1.0f, vy = -1.0f;
ps_scalar.add_particle(x, y, vx, vy, 1.0f, 1.0f);
ps_simd.add_particle(x, y, vx, vy, 1.0f, 1.0f);
}
ps_scalar.apply_gravity(0.0f, -9.81f);
ps_simd.apply_gravity(0.0f, -9.81f);
ps_scalar.integrate_scalar(1.0f / 60.0f);
ps_simd.integrate_simd(1.0f / 60.0f);
// Compare results
for (size_t i = 0; i < ps_scalar.count; ++i) {
REQUIRE(std::abs(ps_scalar.pos_x[i] - ps_simd.pos_x[i]) < 1e-5f);
REQUIRE(std::abs(ps_scalar.pos_y[i] - ps_simd.pos_y[i]) < 1e-5f);
}
}
Stress Tests
// benchmarks/benchmark.cpp
void benchmark_scaling() {
const std::vector<size_t> particle_counts = {1000, 10000, 50000, 100000};
const std::vector<size_t> thread_counts = {1, 2, 4, 8};
for (size_t n : particle_counts) {
std::cout << "\n=== " << n << " particles ===" << std::endl;
ParticleSystem ps;
initialize_random_particles(ps, n);
for (size_t threads : thread_counts) {
CollisionDetector detector(2.0f);
auto start = std::chrono::high_resolution_clock::now();
// Run 100 frames
for (int frame = 0; frame < 100; ++frame) {
std::vector<CollisionPair> pairs;
detector.broad_phase_parallel(ps, pairs, threads);
std::vector<Contact> contacts;
detector.narrow_phase_simd(ps, pairs, contacts);
detector.resolve_collisions(ps, contacts);
ps.integrate_simd(1.0f / 60.0f);
}
auto end = std::chrono::high_resolution_clock::now();
double ms = std::chrono::duration<double, std::milli>(end - start).count();
std::cout << threads << " threads: " << ms / 100.0 << " ms/frame" << std::endl;
}
}
}
Verification Tests
// Verify conservation of momentum
TEST_CASE("Momentum conservation in collisions", "[physics]") {
ParticleSystem ps;
// Two particles moving toward each other
ps.add_particle(0.0f, 0.0f, 1.0f, 0.0f, 1.0f, 1.0f); // Moving right
ps.add_particle(3.0f, 0.0f, -1.0f, 0.0f, 1.0f, 1.0f); // Moving left
float initial_momentum_x = ps.vel_x[0] * ps.mass[0] + ps.vel_x[1] * ps.mass[1];
// Simulate until collision
CollisionDetector detector(2.0f);
for (int i = 0; i < 100; ++i) {
std::vector<CollisionPair> pairs;
        detector.broad_phase_parallel(ps, pairs, 1);
        std::vector<Contact> contacts;
        detector.narrow_phase_simd(ps, pairs, contacts);
detector.resolve_collisions(ps, contacts);
ps.integrate_simd(1.0f / 60.0f);
}
float final_momentum_x = ps.vel_x[0] * ps.mass[0] + ps.vel_x[1] * ps.mass[1];
REQUIRE(std::abs(initial_momentum_x - final_momentum_x) < 1e-4f);
}
Common Pitfalls
Performance Pitfalls
| Problem | Symptom | Solution |
|---|---|---|
| False sharing | Poor parallel scaling | Pad per-thread data to cache line size (64 bytes); see the sketch after this table |
| Unaligned SIMD loads | Crashes or slow performance | Use AlignedAllocator, ensure 32-byte alignment |
| Hash table contention | Parallel hash insert is slow | Use per-thread hash tables, merge at end |
| Too many collision pairs | Narrow-phase dominates | Tune cell size to average 4-16 objects/cell |
| Branch misprediction | SIMD masked code is slow | Process collisions in batches, sort by outcome |
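A hedged sketch of the false-sharing fix: give each worker its own cache-line-aligned slot so counters written by different threads never share a 64-byte line (the struct and field names are illustrative):
#include <cstddef>
// Each worker gets its own 64-byte-aligned slot, so counters written by
// different threads never land on the same cache line.
struct alignas(64) PerThreadStats {
    std::size_t pairs_tested = 0;
    std::size_t contacts_found = 0;
};
static_assert(sizeof(PerThreadStats) == 64, "exactly one cache line per slot");
// Usage: one slot per worker, indexed by that worker's own id:
//   std::vector<PerThreadStats> stats(num_threads);
//   stats[thread_id].pairs_tested += batch_size;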
Correctness Pitfalls
| Problem | Symptom | Solution |
|---|---|---|
| Duplicate pairs | Objects bounce twice | Use i < j check in broad-phase |
| Missing collisions | Objects pass through each other | Query with a radius covering both objects (r_i + max_radius); watch for tunneling of fast objects |
| Energy explosion | Objects accelerate over time | Use semi-implicit Euler or Verlet; clamp velocities |
| Penetration stacking | Objects sink into each other | Add position correction step after impulse |
| Race conditions | Non-deterministic behavior | Handle region boundaries with locks or ghost particles |
SIMD Pitfalls
// WRONG: Claiming vector alignment the data does not have
// (vector_aligned loads may fault or be miscompiled)
simd_t x(pos_x.data() + 1, stdx::vector_aligned); // +1 breaks 32-byte alignment!
// RIGHT: Verify alignment before using vector_aligned...
assert(reinterpret_cast<uintptr_t>(pos_x.data() + i) % 32 == 0);
simd_t x(pos_x.data() + i, stdx::vector_aligned);
// ...or use element_aligned, which is valid for any address
simd_t y(pos_x.data() + i, stdx::element_aligned);
// WRONG: Mixing SIMD and scalar in tight loop
for (size_t i = 0; i < count; ++i) {
if (should_process(i)) { // Branching kills SIMD
process_simd(i);
}
}
// RIGHT: Batch by condition, then process
std::vector<size_t> to_process;
for (size_t i = 0; i < count; ++i) {
if (should_process(i)) {
to_process.push_back(i);
}
}
// Now process to_process in SIMD batches
Extensions & Challenges
Beginner Extensions
- Visualization: Add SDL2 or OpenGL rendering to see particles
- Mouse interaction: Click to spawn particles or apply forces
- Particle trails: Draw motion trails for visual effect
- Gravity wells: Add attractors that pull particles
Intermediate Extensions
- Rigid body rotation: Add angular velocity and torque
- Polygon collision: Implement SAT for convex polygons
- Constraints: Add distance constraints (springs) between particles
- Sleeping objects: Don’t simulate stationary objects
Advanced Extensions
- Continuous collision detection: Prevent tunneling for fast objects
- GPU acceleration: Port narrow-phase to CUDA/OpenCL
- Persistent contacts: Cache collision information between frames
- Hierarchical spatial hash: Multi-resolution for varying object sizes
Challenge Problems
- 1 Million Particles: Achieve 30 FPS with 1,000,000 particles
- Stack Stability: 100 boxes stacked perfectly stable for 1000 frames
- Fluid Simulation: Implement SPH (Smoothed Particle Hydrodynamics)
- Cloth Simulation: Particles connected by distance constraints forming cloth
Resources
Essential Reading
| Resource | Topic | Chapters/Sections |
|---|---|---|
| “Game Physics Engine Development” - Ian Millington | Complete physics pipeline | Chapters 7-12 |
| “Real-Time Collision Detection” - Christer Ericson | Spatial partitioning, collision | Chapters 7-8 |
| “Computer Graphics from Scratch” - Gabriel Gambetta | Rendering integration | Chapters on 3D graphics |
| “C++ Concurrency in Action” - Anthony Williams | Parallel algorithms | Chapters 7-8 |
Online Resources
- Box2D source code: https://github.com/erincatto/box2d
- GDC Physics Tutorials: Search “Game Developers Conference physics”
- Erin Catto’s GDC slides on sequential impulses
- Intel Intrinsics Guide: https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html
Video Resources
- “Physics for Game Developers” - YouTube series by Two-Bit Coding
- “Building a Physics Engine” - Sebastian Lague
- “SIMD Programming” - Handmade Hero episodes on SIMD
Tools for Debugging
# CPU performance counters
perf stat -e cache-misses,cache-references,instructions,cycles ./physics_demo
# SIMD vectorization report (GCC)
g++ -O3 -march=native -fopt-info-vec-optimized src/Integrator.cpp
# Thread profiling
valgrind --tool=helgrind ./physics_demo
# Memory error check (invalid reads/writes, leaks)
valgrind --tool=memcheck ./physics_demo
Self-Assessment Checklist
Before considering this project complete, verify:
Understanding
- I can explain why SoA is better than AoS for SIMD operations
- I understand how spatial hashing reduces collision detection from O(N^2) to O(N)
- I can describe the tradeoffs between cell size in spatial hashing
- I know why boundary particles require special handling in parallel partitioning
- I can explain Verlet integration and why it’s preferred over Euler
- I understand the frame budget constraint and how to stay within it
Implementation
- ParticleSystem uses SoA layout with aligned memory
- Spatial hash correctly returns neighbors without duplicates
- SIMD integration is >4x faster than scalar baseline
- Parallel broad-phase scales with thread count (>60% efficiency on 8 threads)
- Collision response conserves momentum (verified by test)
- Frame time is under 10ms for 100,000 particles on target hardware
Performance
- Achieved 60 FPS with 100,000 particles
- SIMD operations dominate (scalar fallbacks < 1%)
- Cache miss rate is acceptable (profile with perf)
- Thread utilization is balanced (no straggler threads)
Code Quality
- Code is well-organized with clear separation of concerns
- Performance-critical code is documented with complexity analysis
- Unit tests cover core functionality
- Benchmarks are reproducible and documented
Submission/Completion Criteria
Minimum Viable Completion
- 10,000 particles simulated at 60 FPS
- Basic spatial hashing for broad-phase
- SIMD integration working
- Simple collision response
Full Completion
- 100,000 particles at 60 FPS with 8 threads
- Parallel spatial hash rebuild and query
- SIMD narrow-phase collision detection
- Complete collision response with position correction
- Frame profiler with per-phase timing
- Performance scaling benchmarks documented
Excellence (Going Above & Beyond)
- 250,000+ particles at 30+ FPS
- GPU acceleration for narrow-phase
- Continuous collision detection
- Visualization with real-time rendering
- Constraint system (springs, joints)
- Published benchmark comparison with other physics engines
The Core Question You’re Answering
“How do you build a physics simulation that handles 100,000+ interacting objects while guaranteeing consistent frame times under 16 milliseconds?”
This project is the ultimate integration test for everything learned in the C++ Concurrency and Parallelism series. It combines:
- Thread pools (from Project 3) for parallel work distribution
- Parallel algorithms (from Projects 7-8) for broad-phase collision
- SIMD (from Projects 12-13) for narrow-phase and integration
- Lock-free techniques (from Projects 5-6) for thread-safe data structures
- Real-time constraints requiring systematic profiling and optimization
When you complete this project, you will have demonstrated mastery of modern C++ parallelism and the ability to apply these techniques to a demanding real-world domain.
Books That Will Help
| Topic | Book | Specific Chapters |
|---|---|---|
| Physics simulation | “Game Physics Engine Development” - Millington | Chapters 7-12: Collision detection, response |
| Collision detection | “Real-Time Collision Detection” - Ericson | Chapter 7: Spatial partitioning |
| SIMD optimization | “Modern X86 Assembly Language Programming” - Kusswurm | Chapters on AVX |
| Parallel algorithms | “C++ Concurrency in Action” - Williams | Chapter 10: Parallel algorithms |
| Performance tuning | “Optimizing Software in C++” - Fog | Sections 7-9: Data layout, SIMD |
| Game architecture | “Game Programming Patterns” - Nystrom | Chapter on Update Method |
This guide was expanded from LEARN_CPP_CONCURRENCY_AND_PARALLELISM.md. For the complete learning path, see the README.