Project 12: SIMD Math Library

“The CPU does not care about your elegant abstractions. It cares about data locality, instruction-level parallelism, and keeping all its execution units busy.” - Casey Muratori


Quick Reference

| Attribute       | Value                                       |
|-----------------|---------------------------------------------|
| Language        | C++ (C++20 or later recommended)            |
| Difficulty      | Advanced                                    |
| Time            | 1-2 weeks                                   |
| Chapters        | SIMD / Vectorization / Performance          |
| Coolness Level  | 4: Hardcore Tech Flex                       |
| Portfolio Value | High - Demonstrates low-level optimization  |

Learning Objectives

By completing this project, you will:

  1. Understand SIMD fundamentals: Learn how Single Instruction Multiple Data (SIMD) allows a CPU to process multiple values simultaneously with a single instruction

  2. Master std::experimental::simd: Use C++’s portable SIMD abstraction to write vectorized code that works across SSE, AVX, AVX-512, and ARM NEON

  3. Implement vectorized math operations: Build optimized vector addition, dot product, and matrix multiplication routines that achieve 4-8x speedups

  4. Handle alignment and remainders: Understand why memory alignment matters for SIMD performance and how to handle array sizes that are not multiples of SIMD width

  5. Compare with auto-vectorization: Learn when the compiler can auto-vectorize and when manual SIMD provides measurable benefits

  6. Benchmark and analyze performance: Develop skills in microbenchmarking to accurately measure and compare performance

  7. Apply SIMD to real problems: Use vectorization for image processing and other data-parallel workloads


Theoretical Foundation

Core Concepts

What is SIMD?

SIMD (Single Instruction, Multiple Data) is a form of parallel processing where one instruction operates on multiple data elements simultaneously. Modern CPUs contain specialized registers and execution units designed for SIMD operations.

SCALAR vs SIMD EXECUTION
====================================================================

SCALAR (Traditional):
    One operation processes one value at a time

    Instruction: ADD
         |
         v
    +------+     +------+     +------+
    |  A1  |  +  |  B1  |  =  |  C1  |
    +------+     +------+     +------+
         |
         | (repeat 8 times)
         v
    Total: 8 instructions, 8 cycles

SIMD (Vectorized):
    One instruction processes multiple values simultaneously

    Instruction: VADDPS (Vector Add Packed Single-precision)
         |
         v
    +------+------+------+------+------+------+------+------+
    |  A1  |  A2  |  A3  |  A4  |  A5  |  A6  |  A7  |  A8  |
    +------+------+------+------+------+------+------+------+
       +      +      +      +      +      +      +      +
    +------+------+------+------+------+------+------+------+
    |  B1  |  B2  |  B3  |  B4  |  B5  |  B6  |  B7  |  B8  |
    +------+------+------+------+------+------+------+------+
       =      =      =      =      =      =      =      =
    +------+------+------+------+------+------+------+------+
    |  C1  |  C2  |  C3  |  C4  |  C5  |  C6  |  C7  |  C8  |
    +------+------+------+------+------+------+------+------+

    Total: 1 instruction, ~1 cycle
    SPEEDUP: 8x (theoretical)

SIMD Register Sizes and Lane Widths

Different CPU architectures support different SIMD register widths:

SIMD REGISTER HIERARCHY
====================================================================

                     +-------------------------------------------+
                     |            AVX-512 (512 bits)            |
                     | 16 floats | 8 doubles | 16 int32 | etc. |
                     +-------------------------------------------+
                              |
          +------------------+------------------+
          |                                     |
+---------------------+              +---------------------+
|    AVX (256 bits)   |              |    AVX (256 bits)   |
| 8 floats | 4 doubles|              | 8 floats | 4 doubles|
+---------------------+              +---------------------+
          |
     +----+----+
     |         |
+----------+ +----------+
|SSE(128b) | |SSE(128b) |
|4 floats  | |4 floats  |
|2 doubles | |2 doubles |
+----------+ +----------+

SIMD WIDTH BY ARCHITECTURE
====================================================================

| Architecture | Register Size | Floats/Op | Doubles/Op | Ints/Op |
|--------------|---------------|-----------|------------|---------|
| SSE/SSE2     | 128 bits      | 4         | 2          | 4       |
| AVX/AVX2     | 256 bits      | 8         | 4          | 8       |
| AVX-512      | 512 bits      | 16        | 8          | 16      |
| ARM NEON     | 128 bits      | 4         | 2          | 4       |
| ARM SVE      | 128-2048 bits | Variable  | Variable   | Variable|
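
To see which of these widths your toolchain actually targets, you can print the size of native_simd. A minimal sketch, assuming GCC 11+ with <experimental/simd> and a build with -march=native:

// width_check.cpp - print the SIMD width this build targets
// Build with: g++ -std=c++20 -O2 -march=native width_check.cpp
#include <experimental/simd>
#include <iostream>

int main() {
    namespace stdx = std::experimental;
    std::cout << "native_simd<float>:  " << stdx::native_simd<float>::size()
              << " lanes\n";   // e.g. 8 on AVX2, 4 on SSE or NEON
    std::cout << "native_simd<double>: " << stdx::native_simd<double>::size()
              << " lanes\n";   // e.g. 4 on AVX2
}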

SIMD Lanes

A SIMD “lane” refers to one element within a SIMD register. Operations apply to all lanes simultaneously:

SIMD LANES VISUALIZATION (AVX: 8 floats)
====================================================================

256-bit AVX Register (YMM0):
+--------+--------+--------+--------+--------+--------+--------+--------+
| Lane 7 | Lane 6 | Lane 5 | Lane 4 | Lane 3 | Lane 2 | Lane 1 | Lane 0 |
| 32-bit | 32-bit | 32-bit | 32-bit | 32-bit | 32-bit | 32-bit | 32-bit |
| float  | float  | float  | float  | float  | float  | float  | float  |
+--------+--------+--------+--------+--------+--------+--------+--------+
     ^        ^        ^        ^        ^        ^        ^        ^
     |        |        |        |        |        |        |        |
   a[7]     a[6]     a[5]     a[4]     a[3]     a[2]     a[1]     a[0]

Each lane is independent: Lane 3 does NOT know what Lane 4 is doing.
Cross-lane operations (like reductions) are EXPENSIVE.

Horizontal vs Vertical Operations

VERTICAL (FAST): All lanes perform the same operation on different data
====================================================================

    YMM0                    YMM1                    YMM2 (result)
+---+---+---+---+      +---+---+---+---+      +---+---+---+---+
| 1 | 2 | 3 | 4 |  +   | 5 | 6 | 7 | 8 |  =   | 6 | 8 |10 |12 |
+---+---+---+---+      +---+---+---+---+      +---+---+---+---+
  |   |   |   |          |   |   |   |          |   |   |   |
  +---+---+---+----------+---+---+---+----------+---+---+---+---> Same lane
       All operations happen in parallel - ONE instruction


HORIZONTAL (SLOW): Combine values WITHIN a single register
====================================================================

    YMM0
+---+---+---+---+---+---+---+---+
| 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
+---+---+---+---+---+---+---+---+
  |   |   |   |   |   |   |   |
  +---+   +---+   +---+   +---+
    |       |       |       |
    3       7       11      15
    |       |       |       |
    +-------+       +-------+
        |               |
       10              26
        |               |
        +-------+-------+
                |
               36  <-- Final sum (requires multiple shuffle + add ops)

Cross-lane communication is expensive!
Horizontal operations require 3-4 instructions on AVX.
This is why dot product is NOT 8x faster - the final reduction limits gains.
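
The same contrast in std::experimental::simd terms: a lane-wise add is a single vector operation, while reduce() has to combine lanes and therefore compiles to a sequence of shuffles and adds. A minimal sketch, assuming GCC 11+:

// vertical_vs_horizontal.cpp (sketch)
#include <experimental/simd>
namespace stdx = std::experimental;
using vfloat = stdx::native_simd<float>;

// VERTICAL: one vector instruction, all lanes computed in parallel
vfloat vertical_add(vfloat a, vfloat b) {
    return a + b;
}

// HORIZONTAL: combine the lanes of one register into a scalar -
// this is the expensive cross-lane step at the end of a reduction
float horizontal_sum(vfloat a) {
    return stdx::reduce(a);
}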

Memory Alignment

SIMD operations work best with aligned memory:

MEMORY ALIGNMENT
====================================================================

ALIGNED (Fast):
    Memory addresses are multiples of the SIMD width.

    Address: 0x1000 (multiple of 32 for AVX)
    +--------+--------+--------+--------+--------+--------+--------+--------+
    | float0 | float1 | float2 | float3 | float4 | float5 | float6 | float7 |
    +--------+--------+--------+--------+--------+--------+--------+--------+
    ^
    |__ Single aligned load: vmovaps (1 cycle)


UNALIGNED (Slower):
    Memory starts at arbitrary address.

    Address: 0x1004 (NOT a multiple of 32)
    +--------+--------+--------+--------+--------+--------+--------+--------+
    | float0 | float1 | float2 | float3 | float4 | float5 | float6 | float7 |
    +--------+--------+--------+--------+--------+--------+--------+--------+
        ^
        |__ May cross cache line boundary
        |__ Unaligned load: vmovups (1-2 cycles, or penalty)


HOW TO ENSURE ALIGNMENT
====================================================================

C++:
    alignas(32) float data[1024];           // Stack allocation
    auto* ptr = std::aligned_alloc(32, n);  // Heap allocation (n must be a multiple of 32)

std::experimental::simd:
    stdx::element_aligned  - Any alignment (safe, may be slower)
    stdx::vector_aligned   - Full SIMD width alignment (fast)
    stdx::overaligned<N>   - Custom alignment (N bytes)

Cache Line Alignment (64 bytes for most CPUs):
    alignas(64) float data[1024];  // Avoid false sharing
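
A short sketch of how the load tags map onto these cases (assumes GCC 11+ with <experimental/simd>). vector_aligned is only legal when you can guarantee the alignment yourself:

// alignment_tags.cpp (sketch)
#include <experimental/simd>
namespace stdx = std::experimental;
using vfloat = stdx::native_simd<float>;

alignas(64) float aligned_buf[1024];   // alignment guaranteed at compile time

float demo(const float* arbitrary /* alignment unknown */) {
    vfloat a(aligned_buf, stdx::vector_aligned);    // OK: we control the alignment
    vfloat b(arbitrary, stdx::element_aligned);     // safe for any pointer
    // vfloat c(arbitrary, stdx::vector_aligned);   // undefined behavior if misaligned
    return stdx::reduce(a + b);
}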

Why SIMD Matters

SIMD is not just an optimization technique - it is fundamental to achieving peak CPU performance. Modern CPUs are designed with SIMD in mind:

  1. CPU Design: Modern CPUs have more SIMD execution units than scalar units
  2. Memory Bandwidth: SIMD helps saturate memory bandwidth by processing more data per instruction
  3. Energy Efficiency: Fewer instructions means less instruction fetch/decode energy
  4. Competitive Parity: Libraries like BLAS, OpenCV, and game engines all use SIMD extensively

PEAK FLOPS COMPARISON (Intel Core i7-12700K)
====================================================================

                        FLOPS/cycle    Operations/second (4.9 GHz)
                        ===========    ===========================
Scalar (1 float)             1              4.9 billion
SSE    (4 floats)            4             19.6 billion
AVX    (8 floats)            8             39.2 billion
AVX-512 (16 floats)         16             78.4 billion (if supported)

With FMA (Fused Multiply-Add), counting each fused multiply-add as 2 FLOPs:
AVX + FMA (8 floats x 2)    16             78.4 billion

To achieve peak performance, you MUST use SIMD.

Historical Context: SSE, AVX, AVX-512

EVOLUTION OF x86 SIMD
====================================================================

1997: MMX (64-bit)
      First x86 SIMD. Integer only. Shared registers with FPU (ugh).

1999: SSE (128-bit)
      Dedicated XMM registers. Float support. Game changer.
      4 floats or 2 doubles per operation.

2000-2008: SSE2 through SSE4.2
      Integer operations, more instructions, horizontal ops.

2011: AVX (256-bit)
      Double the width! YMM registers.
      8 floats or 4 doubles per operation.
      3-operand syntax: c = a + b (not a += b).

2013: AVX2
      Integer operations in 256-bit. FMA3 (Fused Multiply-Add).
      Essential for neural networks and scientific computing.

2016: AVX-512 (512-bit)
      16 floats or 8 doubles. Mask registers for predication.
      Only on server CPUs and some high-end desktop.
      Controversial: Causes frequency throttling on some chips.

ARM:
2004: NEON (128-bit)
      ARM's answer to SSE. Found in all modern ARM chips.

2016: SVE (Scalable Vector Extension)
      Variable length: 128 to 2048 bits. Future-proof.
      Same code runs on different vector widths.

Common Misconceptions

| Misconception | Reality |
|---------------|---------|
| “SIMD always gives 4x/8x speedup” | Horizontal operations, memory bandwidth, and remainder handling reduce gains. Expect 2-6x for most workloads. |
| “The compiler will auto-vectorize” | Compilers struggle with aliasing, non-contiguous access, and complex loops. Manual SIMD is often needed. |
| “AVX-512 is always fastest” | Frequency throttling on some CPUs can make AVX-512 slower than AVX2 for mixed workloads. |
| “Alignment doesn’t matter anymore” | Unaligned access is faster now but still has penalties, especially when crossing cache lines. |
| “SIMD is only for floating-point” | Integer SIMD is equally important for video codecs, cryptography, and string processing. |

Project Specification

What You Will Build

A high-performance math library using std::experimental::simd that provides vectorized implementations of:

  1. Vector operations: Addition, subtraction, multiplication, division, FMA
  2. Reductions: Sum, min, max, dot product
  3. Square roots: Element-wise square root using the hardware SIMD sqrt instruction
  4. Image processing: RGB to grayscale conversion
  5. Matrix operations: Matrix-vector multiply, matrix-matrix multiply

Benchmark Targets

Your library should achieve these approximate speedups on AVX2 hardware:

| Operation | Input Size | Scalar Time | SIMD Time | Expected Speedup |
|-----------|------------|-------------|-----------|------------------|
| Vector Add | 1M floats | ~1.2 ms | ~0.30 ms | 3.5-4x |
| Dot Product | 1M floats | ~1.8 ms | ~0.25 ms | 6-7x |
| Sum (reduction) | 1M floats | ~0.8 ms | ~0.15 ms | 5-6x |
| Matrix Multiply (1024x1024) | ~1M elements | ~2400 ms | ~340 ms | 6-7x |
| RGB to Grayscale | 4K image (8.3M pixels) | ~28 ms | ~4.2 ms | 6-7x |

Real World Outcome

When you complete this project, running your benchmark will produce output like:

$ ./simd_math_benchmark

============================================================
SIMD Math Library Benchmark
============================================================

System Information:
  CPU: Intel Core i7-12700K @ 4.9 GHz
  SIMD Support: SSE SSE2 SSE3 SSSE3 SSE4.1 SSE4.2 AVX AVX2 FMA
  std::simd<float> size: 8 (256-bit / AVX)
  Cache Line Size: 64 bytes

------------------------------------------------------------
Vector Addition (1,000,000 floats)
------------------------------------------------------------
  Scalar:            1.23 ms  (baseline)
  Auto-vectorized:   0.35 ms  (3.5x faster)
  std::simd:         0.31 ms  (4.0x faster)
  Intrinsics (AVX):  0.30 ms  (4.1x faster)

------------------------------------------------------------
Dot Product (1,000,000 floats)
------------------------------------------------------------
  Scalar:            1.82 ms  (baseline)
  std::simd:         0.26 ms  (7.0x faster)
  Note: Horizontal reduction limits gains

------------------------------------------------------------
Sum Reduction (1,000,000 floats)
------------------------------------------------------------
  Scalar:            0.85 ms  (baseline)
  std::simd:         0.14 ms  (6.1x faster)

------------------------------------------------------------
Matrix Multiply (1024 x 1024 floats)
------------------------------------------------------------
  Scalar (naive):    2,412 ms  (baseline)
  std::simd:          342 ms  (7.1x faster)
  BLAS (reference):    85 ms  (28.4x faster)
  Note: BLAS uses blocking + SIMD + cache optimization

------------------------------------------------------------
RGB to Grayscale (3840 x 2160 = 8.3M pixels)
------------------------------------------------------------
  Scalar:            27.8 ms  (baseline)
  std::simd:          4.2 ms  (6.6x faster)

============================================================
Summary: std::simd provides 4-7x speedup over scalar code
         with portable, maintainable C++ code.
============================================================

$ ./simd_demo
Demonstrating std::simd operations:

std::simd<float, 8> a = {1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0}
std::simd<float, 8> b = {8.0, 7.0, 6.0, 5.0, 4.0, 3.0, 2.0, 1.0}

a + b = {9.0, 9.0, 9.0, 9.0, 9.0, 9.0, 9.0, 9.0}
a * b = {8.0, 14.0, 18.0, 20.0, 20.0, 18.0, 14.0, 8.0}

Dot product (a . b) = 120.0
Sum of a = 36.0

Element-wise sqrt(a) = {1.0, 1.41, 1.73, 2.0, 2.24, 2.45, 2.65, 2.83}

API Design

Your library should provide an API like this:

namespace simd_math {

// Vector operations (element-wise)
void add(const float* a, const float* b, float* result, size_t n);
void sub(const float* a, const float* b, float* result, size_t n);
void mul(const float* a, const float* b, float* result, size_t n);
void div(const float* a, const float* b, float* result, size_t n);
void fma(const float* a, const float* b, const float* c, float* result, size_t n);
void sqrt(const float* a, float* result, size_t n);

// Reductions
float sum(const float* a, size_t n);
float dot(const float* a, const float* b, size_t n);
float min(const float* a, size_t n);
float max(const float* a, size_t n);

// Matrix operations
void mat_vec_mul(const float* mat, const float* vec, float* result,
                 size_t rows, size_t cols);
void mat_mat_mul(const float* a, const float* b, float* result,
                 size_t m, size_t n, size_t k);

// Image processing
void rgb_to_grayscale(const uint8_t* rgb, uint8_t* gray, size_t num_pixels);

} // namespace simd_math
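
A hypothetical usage example of this API (the umbrella header name is illustrative; your project may organize headers differently):

// caller.cpp (sketch)
#include <cstdio>
#include <vector>
// #include "simd_math.hpp"   // illustrative umbrella header for the API above

int main() {
    std::vector<float> a(1000, 1.0f), b(1000, 2.0f), c(1000);

    simd_math::add(a.data(), b.data(), c.data(), a.size());   // c[i] = a[i] + b[i]
    float d = simd_math::dot(a.data(), b.data(), a.size());   // 1000 * (1 * 2) = 2000

    std::printf("c[0] = %.1f, dot = %.1f\n", c[0], d);
}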

Solution Architecture

High-Level Design

SIMD MATH LIBRARY ARCHITECTURE
====================================================================

+-------------------------------------------------------------------+
|                         PUBLIC API                                 |
|   add(), sub(), mul(), dot(), mat_mul(), rgb_to_gray(), ...       |
+-------------------------------------------------------------------+
              |                    |                    |
              v                    v                    v
+-------------------+  +-------------------+  +-------------------+
| ELEMENT-WISE OPS  |  |   REDUCTIONS      |  |  MATRIX OPS       |
|                   |  |                   |  |                   |
| - Vertical SIMD   |  | - Partial sums    |  | - Blocked access  |
| - Main + remainder|  | - Horizontal add  |  | - Cache-friendly  |
| - Aligned loads   |  | - Final reduce    |  | - Row-major/col   |
+-------------------+  +-------------------+  +-------------------+
              |                    |                    |
              +--------------------+--------------------+
                                   |
                                   v
+-------------------------------------------------------------------+
|                    SIMD ABSTRACTION LAYER                         |
|   std::experimental::simd (C++26 will make this std::simd)        |
+-------------------------------------------------------------------+
              |                    |                    |
              v                    v                    v
+-------------------+  +-------------------+  +-------------------+
|   SSE / SSE2      |  |   AVX / AVX2      |  |   AVX-512         |
|   (128-bit)       |  |   (256-bit)       |  |   (512-bit)       |
+-------------------+  +-------------------+  +-------------------+
              |                    |                    |
              +--------------------+--------------------+
                                   |
                                   v
+-------------------------------------------------------------------+
|                       CPU HARDWARE                                 |
|   Vector registers (XMM, YMM, ZMM), Vector execution units        |
+-------------------------------------------------------------------+

SIMD Lane Processing Pattern

PROCESSING PATTERN FOR ARRAY OPERATIONS
====================================================================

Input Arrays (n = 18 elements, SIMD width = 8):

a[]:  +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
      | 0| 1| 2| 3| 4| 5| 6| 7| 8| 9|10|11|12|13|14|15|16|17|
      +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
      |<------- SIMD 1 ------>|<------- SIMD 2 ------>| REM |

Processing Strategy:

    size_t i = 0;

    // MAIN LOOP: Process 8 elements at a time
    for (; i + simd_size <= n; i += simd_size) {
        simd_t va(&a[i], stdx::element_aligned);
        simd_t vb(&b[i], stdx::element_aligned);
        simd_t vr = va + vb;
        vr.copy_to(&result[i], stdx::element_aligned);
    }

    // REMAINDER LOOP: Process remaining elements one at a time
    for (; i < n; ++i) {
        result[i] = a[i] + b[i];
    }

Iteration 1 (i = 0):
    Load a[0..7] into SIMD register
    Load b[0..7] into SIMD register
    Add them -> result[0..7]

Iteration 2 (i = 8):
    Load a[8..15] into SIMD register
    Load b[8..15] into SIMD register
    Add them -> result[8..15]

Remainder (i = 16, 17):
    result[16] = a[16] + b[16];  (scalar)
    result[17] = a[17] + b[17];  (scalar)

Key Components

| Component | Responsibility |
|-----------|----------------|
| simd_types.hpp | Type aliases for std::simd, platform detection |
| vector_ops.hpp/cpp | Element-wise operations (add, mul, sqrt) |
| reductions.hpp/cpp | Sum, dot product, min, max |
| matrix_ops.hpp/cpp | Matrix-vector and matrix-matrix multiply |
| image_ops.hpp/cpp | RGB to grayscale and other image operations |
| benchmark.cpp | Performance measurement and comparison |
| tests.cpp | Correctness verification |

Data Structures

namespace simd_math {

// Platform-specific SIMD types
namespace stdx = std::experimental;

// Use the widest SIMD available on this platform
using native_float = stdx::native_simd<float>;
using native_double = stdx::native_simd<double>;
using native_int32 = stdx::native_simd<int32_t>;

// Fixed-size for portable behavior (same on all platforms)
using float4 = stdx::fixed_size_simd<float, 4>;   // SSE-like
using float8 = stdx::fixed_size_simd<float, 8>;   // AVX-like
using float16 = stdx::fixed_size_simd<float, 16>; // AVX-512-like

// Compile-time constants
constexpr size_t native_float_size = native_float::size();
constexpr size_t cache_line_size = 64;

// Aligned buffer for SIMD operations.
// Note: alignas() on the struct would NOT align std::vector's heap storage, so we
// allocate with std::aligned_alloc instead (requires <memory> and <cstdlib>).
// The storage is uninitialized, which is fine for trivially constructible types like float.
template<typename T, size_t Alignment = 64>
struct AlignedBuffer {
    explicit AlignedBuffer(size_t n)
        : count(n),
          data(static_cast<T*>(std::aligned_alloc(Alignment, round_up(n * sizeof(T)))),
               &std::free) {}

    T* get() { return data.get(); }
    const T* get() const { return data.get(); }
    size_t size() const { return count; }

private:
    // std::aligned_alloc requires the byte count to be a multiple of the alignment
    static size_t round_up(size_t bytes) {
        return (bytes + Alignment - 1) / Alignment * Alignment;
    }

    size_t count;
    std::unique_ptr<T[], decltype(&std::free)> data;
};

} // namespace simd_math
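
A minimal usage sketch for this buffer (assuming the add() routine from Phase 1 below); with the data genuinely 64-byte aligned, vector_aligned loads become valid. This fragment belongs inside a test or benchmark function:

// Hypothetical usage: 64-byte-aligned storage for the SIMD routines
simd_math::AlignedBuffer<float> a(1024), b(1024), c(1024);

for (size_t i = 0; i < a.size(); ++i) {   // storage is uninitialized, so fill it first
    a.get()[i] = 1.0f;
    b.get()[i] = 2.0f;
}

simd_math::add(a.get(), b.get(), c.get(), c.size());   // every c[i] becomes 3.0f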

Implementation Guide

Development Environment Setup

# Verify compiler SIMD support
g++ --version  # Need GCC 11+ for std::experimental::simd

# Check CPU SIMD capabilities
cat /proc/cpuinfo | grep -E "sse|avx|avx2|avx512" | head -1
# Or on macOS:
sysctl -a | grep -E "hw.optional.*avx"

# Create project structure
mkdir -p simd_math/{include,src,tests,benchmarks}
cd simd_math

# Create CMakeLists.txt
cat > CMakeLists.txt << 'EOF'
cmake_minimum_required(VERSION 3.16)
project(simd_math CXX)

set(CMAKE_CXX_STANDARD 20)
set(CMAKE_CXX_STANDARD_REQUIRED ON)

# Enable optimizations for benchmarks
set(CMAKE_CXX_FLAGS_RELEASE "-O3 -march=native -DNDEBUG")

# -ffast-math lets the compiler reassociate floating-point math so reductions can
# auto-vectorize, but it relaxes IEEE semantics; drop it if you need strict results
add_compile_options(-ffast-math)

# Main library
add_library(simd_math STATIC
    src/vector_ops.cpp
    src/reductions.cpp
    src/matrix_ops.cpp
    src/image_ops.cpp
)

target_include_directories(simd_math PUBLIC include)

# Benchmark executable
add_executable(benchmark benchmarks/benchmark.cpp)
target_link_libraries(benchmark simd_math)

# Test executable
add_executable(tests tests/tests.cpp)
target_link_libraries(tests simd_math)

# Demo executable
add_executable(demo src/demo.cpp)
target_link_libraries(demo simd_math)
EOF
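
A typical out-of-source build and run sequence for this layout might look like the following (target names follow the CMakeLists above):

# Configure, build, and run (assumes the sources sketched in the phases below exist)
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j
./build/tests
./build/benchmark
./build/demo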

Phase 1: Basic Vector Operations (Days 1-3)

Start with vector addition - the simplest SIMD operation.

Goals:

  • Set up std::experimental::simd
  • Implement vector addition with SIMD
  • Handle remainder elements correctly
  • Verify correctness

Tasks:

  1. Create header with SIMD type aliases
  2. Implement add() function
  3. Write scalar reference implementation for testing
  4. Verify SIMD results match scalar results

Checkpoint: Vector addition produces correct results and is measurably faster.

// include/simd_types.hpp
#pragma once

#include <experimental/simd>
#include <cstddef>
#include <cassert>

namespace simd_math {

namespace stdx = std::experimental;

// Native SIMD width for this platform
using simd_float = stdx::native_simd<float>;
constexpr size_t simd_width = simd_float::size();

// Memory alignment tag
using aligned_tag = stdx::vector_aligned_tag;
using unaligned_tag = stdx::element_aligned_tag;

} // namespace simd_math

// src/vector_ops.cpp
#include "simd_types.hpp"

namespace simd_math {

void add(const float* a, const float* b, float* result, size_t n) {
    size_t i = 0;

    // Main SIMD loop
    for (; i + simd_width <= n; i += simd_width) {
        simd_float va(a + i, stdx::element_aligned);
        simd_float vb(b + i, stdx::element_aligned);
        simd_float vr = va + vb;
        vr.copy_to(result + i, stdx::element_aligned);
    }

    // Remainder (scalar)
    for (; i < n; ++i) {
        result[i] = a[i] + b[i];
    }
}

// Reference scalar implementation for testing
void add_scalar(const float* a, const float* b, float* result, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        result[i] = a[i] + b[i];
    }
}

} // namespace simd_math

Phase 2: Reductions and Dot Product (Days 4-6)

Implement horizontal operations which require accumulating across SIMD lanes.

Goals:

  • Implement sum reduction using SIMD
  • Implement dot product
  • Understand horizontal reduction cost
  • Use stdx::reduce() for final accumulation

Tasks:

  1. Implement sum() with partial sums in SIMD registers
  2. Implement dot() as element-wise multiply + sum
  3. Benchmark to understand horizontal operation overhead
  4. Compare with scalar implementations

Checkpoint: Dot product is 5-7x faster than scalar.

// src/reductions.cpp
#include "simd_types.hpp"

namespace simd_math {

float sum(const float* a, size_t n) {
    simd_float partial_sum(0.0f);  // Accumulator
    size_t i = 0;

    // Main SIMD loop: accumulate vertical sums
    for (; i + simd_width <= n; i += simd_width) {
        simd_float va(a + i, stdx::element_aligned);
        partial_sum += va;  // Vertical add (fast)
    }

    // Horizontal reduce: combine all lanes
    float result = stdx::reduce(partial_sum);  // sum all lanes

    // Add remainder elements
    for (; i < n; ++i) {
        result += a[i];
    }

    return result;
}

float dot(const float* a, const float* b, size_t n) {
    simd_float partial_sum(0.0f);
    size_t i = 0;

    // Main SIMD loop: accumulate element-wise products
    for (; i + simd_width <= n; i += simd_width) {
        simd_float va(a + i, stdx::element_aligned);
        simd_float vb(b + i, stdx::element_aligned);
        partial_sum += va * vb;  // FMA would be even better
    }

    // Horizontal reduce
    float result = stdx::reduce(partial_sum);

    // Remainder
    for (; i < n; ++i) {
        result += a[i] * b[i];
    }

    return result;
}

} // namespace simd_math
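
The comment in dot() notes that FMA would be even better. Below is a sketch of an FMA-flavored variant using the aliases from simd_types.hpp; with -march=native (which implies -mfma on FMA-capable CPUs) the multiply-add in the loop is typically contracted into a single vfmadd instruction per iteration, and the TS also provides an explicit fma() overload for simd types if you prefer not to rely on contraction:

// src/reductions_fma.cpp (sketch) - FMA-flavored dot product
#include "simd_types.hpp"

namespace simd_math {

float dot_fma(const float* a, const float* b, size_t n) {
    simd_float acc(0.0f);
    size_t i = 0;

    for (; i + simd_width <= n; i += simd_width) {
        simd_float va(a + i, stdx::element_aligned);
        simd_float vb(b + i, stdx::element_aligned);
        acc = va * vb + acc;   // typically contracted into one FMA per iteration
    }

    float result = stdx::reduce(acc);

    for (; i < n; ++i) {
        result += a[i] * b[i];
    }
    return result;
}

} // namespace simd_math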

Phase 3: Matrix Operations (Days 7-10)

Matrix multiplication is where SIMD really shines, but naive approaches miss cache opportunities.

Goals:

  • Implement matrix-vector multiplication
  • Implement naive matrix-matrix multiplication with SIMD
  • Understand the gap between naive SIMD and optimized BLAS
  • Learn about cache blocking (conceptually)

Tasks:

  1. Implement mat_vec_mul() with SIMD
  2. Implement mat_mat_mul() with SIMD on inner loop
  3. Benchmark against scalar and (optionally) OpenBLAS
  4. Experiment with loop ordering (ijk, ikj, etc.)

Checkpoint: Matrix-vector multiply achieves ~6x speedup.

// src/matrix_ops.cpp
#include "simd_types.hpp"

namespace simd_math {

// Matrix-vector multiply: result = mat * vec
// mat is rows x cols, vec is cols x 1, result is rows x 1
void mat_vec_mul(const float* mat, const float* vec, float* result,
                 size_t rows, size_t cols) {
    for (size_t row = 0; row < rows; ++row) {
        // Compute dot product of row with vector
        const float* row_ptr = mat + row * cols;

        simd_float partial_sum(0.0f);
        size_t col = 0;

        for (; col + simd_width <= cols; col += simd_width) {
            simd_float vmat(row_ptr + col, stdx::element_aligned);
            simd_float vvec(vec + col, stdx::element_aligned);
            partial_sum += vmat * vvec;
        }

        result[row] = stdx::reduce(partial_sum);

        // Remainder
        for (; col < cols; ++col) {
            result[row] += row_ptr[col] * vec[col];
        }
    }
}

// Matrix-matrix multiply: C = A * B
// A is m x k, B is k x n, C is m x n
// Naive implementation with SIMD on inner loop
void mat_mat_mul(const float* A, const float* B, float* C,
                 size_t m, size_t n, size_t k) {
    // Zero output matrix
    for (size_t i = 0; i < m * n; ++i) {
        C[i] = 0.0f;
    }

    // C[i,j] = sum over p of A[i,p] * B[p,j]
    // Using ikj order for better cache behavior
    for (size_t i = 0; i < m; ++i) {
        for (size_t p = 0; p < k; ++p) {
            float a_ip = A[i * k + p];
            simd_float va(a_ip);  // Broadcast scalar

            size_t j = 0;
            for (; j + simd_width <= n; j += simd_width) {
                simd_float vb(B + p * n + j, stdx::element_aligned);
                simd_float vc(C + i * n + j, stdx::element_aligned);
                vc += va * vb;  // C[i,j] += A[i,p] * B[p,j]
                vc.copy_to(C + i * n + j, stdx::element_aligned);
            }

            // Remainder
            for (; j < n; ++j) {
                C[i * n + j] += a_ip * B[p * n + j];
            }
        }
    }
}

} // namespace simd_math
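
For the "learn about cache blocking (conceptually)" goal, here is a teaching-oriented sketch of a blocked variant. The block sizes are illustrative rather than tuned; reusing tiles of B and C while they are still hot in cache is where most of the gap to BLAS comes from:

// src/matrix_ops_blocked.cpp (sketch) - cache-blocked matrix multiply
#include "simd_types.hpp"
#include <algorithm>   // std::min

namespace simd_math {

void mat_mat_mul_blocked(const float* A, const float* B, float* C,
                         size_t m, size_t n, size_t k) {
    constexpr size_t BI = 64, BP = 64, BJ = 256;   // tile sizes: rows, depth, cols

    for (size_t i = 0; i < m * n; ++i) C[i] = 0.0f;

    for (size_t i0 = 0; i0 < m; i0 += BI)
    for (size_t p0 = 0; p0 < k; p0 += BP)
    for (size_t j0 = 0; j0 < n; j0 += BJ) {
        const size_t i_end = std::min(i0 + BI, m);
        const size_t p_end = std::min(p0 + BP, k);
        const size_t j_end = std::min(j0 + BJ, n);

        // Within a tile, the same ikj + SIMD kernel as mat_mat_mul above
        for (size_t i = i0; i < i_end; ++i)
        for (size_t p = p0; p < p_end; ++p) {
            simd_float va(A[i * k + p]);   // broadcast A[i,p]

            size_t j = j0;
            for (; j + simd_width <= j_end; j += simd_width) {
                simd_float vb(B + p * n + j, stdx::element_aligned);
                simd_float vc(C + i * n + j, stdx::element_aligned);
                vc += va * vb;
                vc.copy_to(C + i * n + j, stdx::element_aligned);
            }
            for (; j < j_end; ++j) {
                C[i * n + j] += A[i * k + p] * B[p * n + j];
            }
        }
    }
}

} // namespace simd_math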

Phase 4: Image Processing (Days 11-12)

Apply SIMD to a real-world application: image processing.

Goals:

  • Implement RGB to grayscale conversion
  • Handle uint8_t data with SIMD
  • Understand integer SIMD operations
  • Process multiple pixels per SIMD operation

Tasks:

  1. Implement RGB to grayscale with SIMD
  2. Handle uint8_t to float conversion
  3. Verify output matches reference implementation
  4. Benchmark with real image sizes

Checkpoint: 4K image conversion is 6x faster.

// src/image_ops.cpp
#include "simd_types.hpp"

namespace simd_math {

// RGB to Grayscale: Y = 0.299*R + 0.587*G + 0.114*B
// Input: RGB pixels (3 bytes per pixel)
// Output: Grayscale pixels (1 byte per pixel)
void rgb_to_grayscale(const uint8_t* rgb, uint8_t* gray, size_t num_pixels) {
    constexpr float r_weight = 0.299f;
    constexpr float g_weight = 0.587f;
    constexpr float b_weight = 0.114f;

    simd_float vr_weight(r_weight);
    simd_float vg_weight(g_weight);
    simd_float vb_weight(b_weight);

    size_t i = 0;

    // Process simd_width pixels at a time
    for (; i + simd_width <= num_pixels; i += simd_width) {
        // Load RGB values and convert to float
        alignas(64) float r[simd_width], g[simd_width], b[simd_width];

        for (size_t lane = 0; lane < simd_width; ++lane) {
            size_t pixel = i + lane;
            r[lane] = static_cast<float>(rgb[pixel * 3 + 0]);
            g[lane] = static_cast<float>(rgb[pixel * 3 + 1]);
            b[lane] = static_cast<float>(rgb[pixel * 3 + 2]);
        }

        simd_float vr(r, stdx::vector_aligned);
        simd_float vg(g, stdx::vector_aligned);
        simd_float vb(b, stdx::vector_aligned);

        // Compute grayscale: Y = 0.299R + 0.587G + 0.114B
        simd_float vy = vr * vr_weight + vg * vg_weight + vb * vb_weight;

        // Store results
        alignas(64) float y[simd_width];
        vy.copy_to(y, stdx::vector_aligned);

        for (size_t lane = 0; lane < simd_width; ++lane) {
            gray[i + lane] = static_cast<uint8_t>(y[lane] + 0.5f);  // Round
        }
    }

    // Remainder
    for (; i < num_pixels; ++i) {
        float r_val = static_cast<float>(rgb[i * 3 + 0]);
        float g_val = static_cast<float>(rgb[i * 3 + 1]);
        float b_val = static_cast<float>(rgb[i * 3 + 2]);
        gray[i] = static_cast<uint8_t>(
            r_val * r_weight + g_val * g_weight + b_val * b_weight + 0.5f
        );
    }
}

} // namespace simd_math

Phase 5: Benchmarking and Optimization (Days 13-14)

Build a proper benchmarking framework and compare implementations.

Goals:

  • Create accurate microbenchmarks
  • Compare scalar, auto-vectorized, std::simd, and intrinsics
  • Identify optimization opportunities
  • Document performance characteristics

Tasks:

  1. Write benchmarking harness with warmup
  2. Measure and report results in a table
  3. Compare with compiler auto-vectorization
  4. Test with different array sizes and alignments

Checkpoint: Complete benchmark suite runs successfully.

// benchmarks/benchmark.cpp
#include "simd_types.hpp"
#include <chrono>
#include <iostream>
#include <random>
#include <vector>
#include <iomanip>

namespace simd_math {

// Normally declared in the library headers; repeated here to keep the example self-contained
void add(const float* a, const float* b, float* result, size_t n);
float sum(const float* a, size_t n);
float dot(const float* a, const float* b, size_t n);
void mat_vec_mul(const float* mat, const float* vec, float* result,
                 size_t rows, size_t cols);
void rgb_to_grayscale(const uint8_t* rgb, uint8_t* gray, size_t num_pixels);

} // namespace simd_math

class Timer {
public:
    using Clock = std::chrono::high_resolution_clock;

    void start() { start_time = Clock::now(); }

    double elapsed_ms() const {
        auto end = Clock::now();
        return std::chrono::duration<double, std::milli>(end - start_time).count();
    }

private:
    Clock::time_point start_time;
};

template<typename Func>
double benchmark(Func&& func, int iterations = 100) {
    // Warmup
    for (int i = 0; i < 10; ++i) {
        func();
    }

    Timer timer;
    timer.start();
    for (int i = 0; i < iterations; ++i) {
        func();
    }
    return timer.elapsed_ms() / iterations;
}

void scalar_add(const float* a, const float* b, float* r, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        r[i] = a[i] + b[i];
    }
}

float scalar_sum(const float* a, size_t n) {
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        sum += a[i];
    }
    return sum;
}

float scalar_dot(const float* a, const float* b, size_t n) {
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        sum += a[i] * b[i];
    }
    return sum;
}

int main() {
    std::cout << "============================================================\n";
    std::cout << "SIMD Math Library Benchmark\n";
    std::cout << "============================================================\n\n";

    std::cout << "std::simd<float> width: " << simd_math::simd_width << " elements\n\n";

    // Generate test data
    const size_t n = 1'000'000;
    std::vector<float> a(n), b(n), result(n);

    std::mt19937 gen(42);
    std::uniform_real_distribution<float> dist(0.0f, 1.0f);
    for (size_t i = 0; i < n; ++i) {
        a[i] = dist(gen);
        b[i] = dist(gen);
    }

    std::cout << std::fixed << std::setprecision(2);

    // Benchmark vector addition
    std::cout << "------------------------------------------------------------\n";
    std::cout << "Vector Addition (" << n << " floats)\n";
    std::cout << "------------------------------------------------------------\n";

    double scalar_time = benchmark([&]() {
        scalar_add(a.data(), b.data(), result.data(), n);
    });
    std::cout << "  Scalar:     " << std::setw(8) << scalar_time << " ms (baseline)\n";

    double simd_time = benchmark([&]() {
        simd_math::add(a.data(), b.data(), result.data(), n);
    });
    std::cout << "  std::simd:  " << std::setw(8) << simd_time
              << " ms (" << (scalar_time / simd_time) << "x faster)\n";

    // Benchmark sum
    std::cout << "\n------------------------------------------------------------\n";
    std::cout << "Sum Reduction (" << n << " floats)\n";
    std::cout << "------------------------------------------------------------\n";

    scalar_time = benchmark([&]() {
        volatile float s = scalar_sum(a.data(), n);
        (void)s;
    });
    std::cout << "  Scalar:     " << std::setw(8) << scalar_time << " ms (baseline)\n";

    simd_time = benchmark([&]() {
        volatile float s = simd_math::sum(a.data(), n);
        (void)s;
    });
    std::cout << "  std::simd:  " << std::setw(8) << simd_time
              << " ms (" << (scalar_time / simd_time) << "x faster)\n";

    // Benchmark dot product
    std::cout << "\n------------------------------------------------------------\n";
    std::cout << "Dot Product (" << n << " floats)\n";
    std::cout << "------------------------------------------------------------\n";

    scalar_time = benchmark([&]() {
        volatile float d = scalar_dot(a.data(), b.data(), n);
        (void)d;
    });
    std::cout << "  Scalar:     " << std::setw(8) << scalar_time << " ms (baseline)\n";

    simd_time = benchmark([&]() {
        volatile float d = simd_math::dot(a.data(), b.data(), n);
        (void)d;
    });
    std::cout << "  std::simd:  " << std::setw(8) << simd_time
              << " ms (" << (scalar_time / simd_time) << "x faster)\n";

    std::cout << "\n============================================================\n";
    std::cout << "Benchmark complete.\n";
    std::cout << "============================================================\n";

    return 0;
}

Testing Strategy

Correctness Testing

Every SIMD function must produce results identical to a scalar reference:

// tests/tests.cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <iostream>
#include <vector>
#include <random>

// Normally provided by the library headers; declared here to keep the example self-contained
namespace simd_math {
void add(const float* a, const float* b, float* result, size_t n);
float dot(const float* a, const float* b, size_t n);
}

// Test that SIMD results match scalar within epsilon
template<typename T>
bool approx_equal(T a, T b, T epsilon = 1e-5) {
    return std::abs(a - b) < epsilon;
}

void test_add() {
    std::cout << "Testing add()... ";

    std::vector<size_t> sizes = {0, 1, 7, 8, 9, 15, 16, 17, 100, 1000, 1000000};

    std::mt19937 gen(42);
    std::uniform_real_distribution<float> dist(-100.0f, 100.0f);

    for (size_t n : sizes) {
        std::vector<float> a(n), b(n), result(n), expected(n);

        for (size_t i = 0; i < n; ++i) {
            a[i] = dist(gen);
            b[i] = dist(gen);
            expected[i] = a[i] + b[i];  // Scalar reference
        }

        simd_math::add(a.data(), b.data(), result.data(), n);

        for (size_t i = 0; i < n; ++i) {
            assert(approx_equal(result[i], expected[i]));
        }
    }

    std::cout << "PASSED\n";
}

void test_dot() {
    std::cout << "Testing dot()... ";

    std::vector<size_t> sizes = {0, 1, 7, 8, 9, 100, 1000};

    std::mt19937 gen(42);
    std::uniform_real_distribution<float> dist(-1.0f, 1.0f);  // Smaller range to reduce error

    for (size_t n : sizes) {
        std::vector<float> a(n), b(n);

        for (size_t i = 0; i < n; ++i) {
            a[i] = dist(gen);
            b[i] = dist(gen);
        }

        // Scalar reference
        float expected = 0.0f;
        for (size_t i = 0; i < n; ++i) {
            expected += a[i] * b[i];
        }

        float result = simd_math::dot(a.data(), b.data(), n);

        // Allow larger epsilon for accumulated floating-point error
        assert(approx_equal(result, expected, 1e-2f * n));
    }

    std::cout << "PASSED\n";
}

void test_alignment() {
    std::cout << "Testing with various alignments... ";

    // Plain vector with extra space for offsets; note that alignas() would not align
    // the vector's heap data anyway, and the test deliberately exercises arbitrary
    // offsets from whatever base address the allocator returns
    std::vector<float> buffer(1000 + 64);

    // Test with different offsets from aligned address
    for (size_t offset = 0; offset < 16; ++offset) {
        float* a = buffer.data() + offset;
        float* b = buffer.data() + offset + 100;
        float* result = buffer.data() + offset + 200;

        // Fill with test data
        for (size_t i = 0; i < 50; ++i) {
            a[i] = static_cast<float>(i);
            b[i] = static_cast<float>(i * 2);
        }

        simd_math::add(a, b, result, 50);

        for (size_t i = 0; i < 50; ++i) {
            assert(approx_equal(result[i], static_cast<float>(i * 3)));
        }
    }

    std::cout << "PASSED\n";
}

int main() {
    std::cout << "Running SIMD Math Library Tests\n";
    std::cout << "================================\n";

    test_add();
    test_dot();
    test_alignment();

    std::cout << "================================\n";
    std::cout << "All tests passed!\n";

    return 0;
}

Edge Cases to Test

| Edge Case | Why It Matters |
|-----------|----------------|
| n = 0 | Empty input should not crash |
| n = 1 | Single element, no SIMD benefit |
| n = simd_width - 1 | All remainder, no main loop |
| n = simd_width | Exactly one SIMD iteration |
| n = simd_width + 1 | One SIMD iteration plus one remainder element |
| Large n | Verify no overflow in loop counters |
| Unaligned pointers | Verify element_aligned works |
| Negative values | Verify signed operations are correct |
| Edge floats (NaN, Inf) | Verify special values are handled |

Common Pitfalls

Remainder Handling

| Problem | Symptom | Solution |
|---------|---------|----------|
| Missing remainder loop | Wrong results when n is not divisible by simd_width | Always add a scalar fallback after the main SIMD loop |
| Wrong loop condition | Buffer overrun or missing elements | Use i + simd_width <= n, not i < n |
| Off-by-one in remainder | Missing last element | Remainder starts at i, not i + 1 |

// WRONG: Will read past end of array
for (size_t i = 0; i < n; i += simd_width) {  // BUG!
    simd_float va(&a[i], stdx::element_aligned);
    // If n=10 and simd_width=8, second iteration reads a[8..15]!
}

// CORRECT: Stop when there's not enough elements for full SIMD
for (size_t i = 0; i + simd_width <= n; i += simd_width) {
    simd_float va(&a[i], stdx::element_aligned);
}
// Handle remainder with scalar loop

Memory Alignment

| Problem | Symptom | Solution |
|---------|---------|----------|
| Using vector_aligned on unaligned data | Crash or wrong results | Use element_aligned for arbitrary pointers |
| Not aligning stack buffers | Slower performance | Use alignas(64) for temporary buffers |
| Loads crossing a cache line boundary | Performance degradation | Align data to 64 bytes when possible |

// BAD: Assumes input is aligned
simd_float va(&a[i], stdx::vector_aligned);  // May crash if a is not aligned!

// SAFE: Works with any alignment (may be slightly slower)
simd_float va(&a[i], stdx::element_aligned);

// BEST: Align your own buffers
alignas(64) float my_buffer[1024];
simd_float va(&my_buffer[i], stdx::vector_aligned);  // Safe and fast

Floating-Point Precision

| Problem | Symptom | Solution |
|---------|---------|----------|
| Expecting bit-exact results | Tests fail despite correct code | Use epsilon comparison, not == |
| Order-dependent precision | Different results from scalar | SIMD may reorder operations; this is expected |
| Accumulation error | Large errors with many elements | Use Kahan summation for high precision |

// BAD: Fails due to floating-point precision
assert(simd_result == scalar_result);

// GOOD: Allow small differences
assert(std::abs(simd_result - scalar_result) < 1e-5f);

// BETTER: Relative error for large values
assert(std::abs(simd_result - scalar_result) / std::abs(scalar_result) < 1e-5f);
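
The table above mentions Kahan summation. Here is a minimal compensated-sum sketch using the aliases from simd_types.hpp; note that -ffast-math (enabled in the sample CMakeLists) licenses the compiler to simplify the compensation away, so compile such a routine without it:

// Kahan-compensated SIMD sum (sketch)
float sum_kahan(const float* a, size_t n) {
    simd_float sum(0.0f), comp(0.0f);   // per-lane running sums and compensation terms
    size_t i = 0;

    for (; i + simd_width <= n; i += simd_width) {
        simd_float v(a + i, stdx::element_aligned);
        simd_float y = v - comp;   // apply the correction from the previous step
        simd_float t = sum + y;    // low-order bits of y may be lost here...
        comp = (t - sum) - y;      // ...recover them as the next correction
        sum = t;
    }

    float result = stdx::reduce(sum) - stdx::reduce(comp);

    for (; i < n; ++i) {
        result += a[i];   // remainder handled in plain scalar
    }
    return result;
}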

Compiler Auto-Vectorization Interference

| Problem | Symptom | Solution |
|---------|---------|----------|
| Compiler optimizes away the scalar baseline | Misleading benchmarks | Use volatile or benchmark::DoNotOptimize |
| Compiler auto-vectorizes the “scalar” code | No visible SIMD benefit | Check the assembly or disable vectorization with pragmas |
| Different results with -O3 | Confusion about what is being measured | Be explicit about compiler flags |

// BAD: Compiler may optimize away the computation
float sum = scalar_sum(data, n);
// (sum is never used, so entire function may be eliminated)

// GOOD: Prevent optimization
volatile float sum = scalar_sum(data, n);
(void)sum;  // Silence unused variable warning

// Or use benchmark library's DoNotOptimize
benchmark::DoNotOptimize(scalar_sum(data, n));

Extensions and Challenges

Beginner Extensions

  • Add more element-wise operations: Implement sub(), mul(), div(), abs(), neg()
  • Add sqrt(): Use SIMD square root instruction
  • Support double precision: Implement the same operations for double
  • Add min/max reductions: Find minimum and maximum elements

Intermediate Extensions

  • Fused Multiply-Add (FMA): Use FMA for better performance and precision
  • Masked operations: Handle conditions within SIMD (e.g., clamp values); see the sketch after this list
  • Complex numbers: SIMD operations on complex number arrays
  • Polynomial evaluation: Evaluate polynomials using Horner’s method with SIMD
  • Compare with intrinsics: Implement the same functions using AVX intrinsics directly
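
A small sketch of the masked-operations idea using stdx::where, which writes only the lanes selected by a mask (the TS also ships a clamp() overload for simd types that does the same in one call). Uses the aliases from simd_types.hpp:

// Masked lane-wise clamp (sketch)
simd_float clamp_lanes(simd_float v, float lo, float hi) {
    const simd_float vlo(lo), vhi(hi);   // broadcast the bounds to all lanes
    stdx::where(v < vlo, v) = vlo;       // write only the lanes below lo
    stdx::where(v > vhi, v) = vhi;       // write only the lanes above hi
    return v;
}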

Advanced Extensions

  • Cache-blocked matrix multiply: Implement blocking for better cache utilization
  • Multithreaded + SIMD: Combine OpenMP/threads with SIMD for maximum throughput
  • Convolution: Implement 1D/2D convolution with SIMD
  • FFT: Implement SIMD-optimized Fast Fourier Transform
  • Compare with BLAS/Eigen: Benchmark against optimized libraries
  • SVE/Scalable vectors: Experiment with ARM SVE’s length-agnostic approach

Expert Extensions

  • Custom SIMD allocator: Write an allocator that guarantees alignment
  • SIMD expression templates: Build a DSL that generates optimal SIMD code
  • Auto-tuning: Automatically select best SIMD width for given hardware
  • Profile-guided optimization: Use profiling to guide vectorization decisions

Resources

Essential Documentation

| Resource | What It Covers |
|----------|----------------|
| cppreference: std::experimental::simd | Official C++ SIMD library documentation |
| ModernesCpp: Data-Parallel Types | Practical tutorial on std::simd |
| Intel Intrinsics Guide | Complete reference for AVX/SSE intrinsics |
| Agner Fog’s Optimization Manuals | CPU architecture and optimization guides |

Books

| Book | Relevance |
|------|-----------|
| “Modern X86 Assembly Language Programming” - Daniel Kusswurm | Low-level SIMD, assembly understanding |
| “Computer Systems: A Programmer’s Perspective” - Bryant & O’Hallaron | CPU architecture, memory hierarchy |
| “Optimizing Software in C++” - Agner Fog (free PDF) | Practical optimization including SIMD |
| “Hacker’s Delight” - Henry S. Warren | Bit manipulation tricks useful for SIMD |

Code Examples and Libraries

| Resource | What to Learn |
|----------|---------------|
| xtensor | Expression templates + SIMD |
| Eigen | Production SIMD in a linear algebra library |
| xsimd | Standalone SIMD wrapper library |
| ISPC | Intel’s SIMD compiler |

Tools for Analysis

# View SIMD instructions in compiled binary
objdump -d ./benchmark | grep -E "vmov|vadd|vmul|vfma" | head -50

# Check which SIMD extensions your CPU supports
cat /proc/cpuinfo | grep flags | head -1 | tr ' ' '\n' | grep -E "sse|avx"

# Generate assembly from source
g++ -O3 -march=native -S vector_ops.cpp -o vector_ops.s

# Use Compiler Explorer (online)
# https://godbolt.org - paste code and see generated assembly

Self-Assessment Checklist

Before considering this project complete, verify:

Conceptual Understanding

  • I can explain what SIMD is and why it provides speedups
  • I understand the difference between horizontal and vertical operations
  • I can explain why dot product doesn’t achieve full 8x speedup
  • I understand memory alignment and its impact on SIMD performance
  • I can explain when to use native_simd vs fixed_size_simd
  • I understand the trade-offs between auto-vectorization and manual SIMD

Implementation

  • Vector addition, subtraction, multiplication work correctly
  • Dot product produces correct results (within floating-point tolerance)
  • All functions handle remainder elements correctly
  • Code compiles with -O3 -march=native without warnings
  • Functions work with any array size (including 0, 1, and non-multiples of SIMD width)
  • Functions work with unaligned input pointers

Performance

  • Vector addition achieves 3-4x speedup over scalar
  • Dot product achieves 5-7x speedup over scalar
  • Benchmarks are properly designed (warmup, multiple iterations, results not optimized away)
  • Performance scales appropriately with input size
  • I understand why my speedup is less than theoretical maximum

Testing

  • All functions pass correctness tests
  • Edge cases tested (size 0, 1, simd_width-1, simd_width, simd_width+1)
  • Alignment edge cases tested
  • Results verified against scalar reference implementation

Code Quality

  • Code is well-organized with proper headers
  • SIMD types are abstracted for portability
  • No undefined behavior (buffer overruns, etc.)
  • Comments explain non-obvious SIMD patterns
  • CMake build system works correctly

Submission/Completion Criteria

Minimum Viable Completion

  • Vector add, subtract, multiply implemented with SIMD
  • Sum reduction implemented with SIMD
  • Dot product implemented with SIMD
  • All basic tests pass
  • Benchmark shows measurable speedup (at least 2x for vector ops)

Full Completion

  • All vector operations implemented (add, sub, mul, div, sqrt)
  • All reductions implemented (sum, dot, min, max)
  • Matrix-vector multiply implemented
  • RGB to grayscale implemented
  • Comprehensive test suite passes
  • Benchmark suite with comparison to scalar
  • Documentation of performance characteristics

Excellence (Going Above and Beyond)

  • Matrix-matrix multiply with cache blocking
  • Comparison with AVX intrinsics implementation
  • Comparison with OpenBLAS/Eigen
  • Multi-threaded + SIMD combination
  • Performance analysis with perf/VTune
  • Support for multiple SIMD architectures (SSE, AVX, AVX-512)
  • Expression template system for lazy evaluation

The Core Question You’re Answering

“How can we leverage the CPU’s SIMD capabilities to process data in parallel within a single thread, and what are the practical considerations (alignment, remainders, horizontal operations) that determine real-world performance?”

SIMD represents data-level parallelism that is orthogonal to thread-level parallelism. Understanding SIMD is essential for anyone who wants to write high-performance code, whether for scientific computing, games, machine learning, or media processing. By building this library, you will develop intuition for when SIMD helps, when it doesn’t, and how to write portable SIMD code that runs efficiently across different CPU architectures.


Hints in Layers

Use these hints progressively if you get stuck.

Hint Layer 1: Getting Started

  • Start with just vector addition. Get the main loop working first.
  • Print simd_width to understand your platform’s capabilities.
  • Test with small arrays (size 16-20) where you can manually verify results.
  • Use element_aligned for all operations initially; optimize alignment later.

Hint Layer 2: The Main Loop Pattern

size_t i = 0;

// Main SIMD loop: process simd_width elements per iteration
for (; i + simd_width <= n; i += simd_width) {
    // Load simd_width elements into SIMD register
    // Perform operation
    // Store simd_width elements back
}

// Remainder: handle leftover elements with scalar code
for (; i < n; ++i) {
    // Scalar operation
}

Hint Layer 3: Reduction Pattern

// Use a SIMD register as accumulator
simd_float accumulator(0.0f);  // Initialize all lanes to zero

for (; i + simd_width <= n; i += simd_width) {
    simd_float v(&a[i], stdx::element_aligned);
    accumulator += v;  // Vertical add (fast)
}

// Horizontal reduce: sum all lanes
float result = stdx::reduce(accumulator);  // This is the expensive part

// Add remainder elements
for (; i < n; ++i) {
    result += a[i];
}

Hint Layer 4: Common Debugging Techniques

  • If results are wrong, first test with n = simd_width (no remainder)
  • If remainder is wrong, test with n = simd_width + 1
  • Use std::cout << va[i] to print individual SIMD lanes
  • Compare against a simple scalar implementation element-by-element
  • Check assembly output with objdump -d or Compiler Explorer


This guide was expanded from LEARN_CPP_CONCURRENCY_AND_PARALLELISM.md. For the complete learning path, see the project index.