Project 12: SIMD Math Library

“The CPU does not care about your elegant abstractions. It cares about data locality, instruction-level parallelism, and keeping all its execution units busy.” - Casey Muratori


Quick Reference

| Attribute       | Value                                       |
|-----------------|---------------------------------------------|
| Language        | C++ (C++20 or later recommended)            |
| Difficulty      | Advanced                                    |
| Time            | 1-2 weeks                                   |
| Chapters        | SIMD / Vectorization / Performance          |
| Coolness Level  | 4: Hardcore Tech Flex                       |
| Portfolio Value | High - Demonstrates low-level optimization  |

Learning Objectives

By completing this project, you will:

  1. Understand SIMD fundamentals: Learn how Single Instruction Multiple Data (SIMD) allows a CPU to process multiple values simultaneously with a single instruction

  2. Master std::experimental::simd: Use C++’s portable SIMD abstraction to write vectorized code that works across SSE, AVX, AVX-512, and ARM NEON

  3. Implement vectorized math operations: Build optimized vector addition, dot product, and matrix multiplication routines that achieve 4-8x speedups

  4. Handle alignment and remainders: Understand why memory alignment matters for SIMD performance and how to handle array sizes that are not multiples of SIMD width

  5. Compare with auto-vectorization: Learn when the compiler can auto-vectorize and when manual SIMD provides measurable benefits

  6. Benchmark and analyze performance: Develop skills in microbenchmarking to accurately measure and compare performance

  7. Apply SIMD to real problems: Use vectorization for image processing and other data-parallel workloads


Theoretical Foundation

Core Concepts

What is SIMD?

SIMD (Single Instruction, Multiple Data) is a form of parallel processing where one instruction operates on multiple data elements simultaneously. Modern CPUs contain specialized registers and execution units designed for SIMD operations.

SCALAR vs SIMD EXECUTION
====================================================================

SCALAR (Traditional):
    One operation processes one value at a time

    Instruction: ADD
         |
         v
    +------+     +------+     +------+
    |  A1  |  +  |  B1  |  =  |  C1  |
    +------+     +------+     +------+
         |
         | (repeat 8 times)
         v
    Total: 8 instructions, 8 cycles

SIMD (Vectorized):
    One instruction processes multiple values simultaneously

    Instruction: VADDPS (Vector Add Packed Single-precision)
         |
         v
    +------+------+------+------+------+------+------+------+
    |  A1  |  A2  |  A3  |  A4  |  A5  |  A6  |  A7  |  A8  |
    +------+------+------+------+------+------+------+------+
       +      +      +      +      +      +      +      +
    +------+------+------+------+------+------+------+------+
    |  B1  |  B2  |  B3  |  B4  |  B5  |  B6  |  B7  |  B8  |
    +------+------+------+------+------+------+------+------+
       =      =      =      =      =      =      =      =
    +------+------+------+------+------+------+------+------+
    |  C1  |  C2  |  C3  |  C4  |  C5  |  C6  |  C7  |  C8  |
    +------+------+------+------+------+------+------+------+

    Total: 1 instruction, ~1 cycle
    SPEEDUP: 8x (theoretical)

SIMD Register Sizes and Lane Widths

Different CPU architectures support different SIMD register widths:

SIMD REGISTER HIERARCHY
====================================================================

                     +-------------------------------------------+
                     |            AVX-512 (512 bits)            |
                     | 16 floats | 8 doubles | 16 int32 | etc. |
                     +-------------------------------------------+
                              |
          +------------------+------------------+
          |                                     |
+---------------------+              +---------------------+
|    AVX (256 bits)   |              |    AVX (256 bits)   |
| 8 floats | 4 doubles|              | 8 floats | 4 doubles|
+---------------------+              +---------------------+
          |
     +----+----+
     |         |
+----------+ +----------+
|SSE(128b) | |SSE(128b) |
|4 floats  | |4 floats  |
|2 doubles | |2 doubles |
+----------+ +----------+

SIMD WIDTH BY ARCHITECTURE
====================================================================

| Architecture | Register Size | Floats/Op | Doubles/Op | Ints/Op |
|--------------|---------------|-----------|------------|---------|
| SSE/SSE2     | 128 bits      | 4         | 2          | 4       |
| AVX/AVX2     | 256 bits      | 8         | 4          | 8       |
| AVX-512      | 512 bits      | 16        | 8          | 16      |
| ARM NEON     | 128 bits      | 4         | 2          | 4       |
| ARM SVE      | 128-2048 bits | Variable  | Variable   | Variable|
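
To see which of these widths your toolchain actually targets, you can print the size of native_simd. A minimal sketch, assuming GCC 11+ with <experimental/simd> and a build with -march=native:

// width_check.cpp - print the SIMD width this build targets
// Build with: g++ -std=c++20 -O2 -march=native width_check.cpp
#include <experimental/simd>
#include <iostream>

int main() {
    namespace stdx = std::experimental;
    std::cout << "native_simd<float>:  " << stdx::native_simd<float>::size()
              << " lanes\n";   // e.g. 8 on AVX2, 4 on SSE or NEON
    std::cout << "native_simd<double>: " << stdx::native_simd<double>::size()
              << " lanes\n";   // e.g. 4 on AVX2
}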

SIMD Lanes

A SIMD “lane” refers to one element within a SIMD register. Operations apply to all lanes simultaneously:

SIMD LANES VISUALIZATION (AVX: 8 floats)
====================================================================

256-bit AVX Register (YMM0):
+--------+--------+--------+--------+--------+--------+--------+--------+
| Lane 7 | Lane 6 | Lane 5 | Lane 4 | Lane 3 | Lane 2 | Lane 1 | Lane 0 |
| 32-bit | 32-bit | 32-bit | 32-bit | 32-bit | 32-bit | 32-bit | 32-bit |
| float  | float  | float  | float  | float  | float  | float  | float  |
+--------+--------+--------+--------+--------+--------+--------+--------+
     ^        ^        ^        ^        ^        ^        ^        ^
     |        |        |        |        |        |        |        |
   a[7]     a[6]     a[5]     a[4]     a[3]     a[2]     a[1]     a[0]

Each lane is independent: Lane 3 does NOT know what Lane 4 is doing.
Cross-lane operations (like reductions) are EXPENSIVE.

Horizontal vs Vertical Operations

VERTICAL (FAST): All lanes perform the same operation on different data
====================================================================

    YMM0                    YMM1                    YMM2 (result)
+---+---+---+---+      +---+---+---+---+      +---+---+---+---+
| 1 | 2 | 3 | 4 |  +   | 5 | 6 | 7 | 8 |  =   | 6 | 8 |10 |12 |
+---+---+---+---+      +---+---+---+---+      +---+---+---+---+
  |   |   |   |          |   |   |   |          |   |   |   |
  +---+---+---+----------+---+---+---+----------+---+---+---+---> Same lane
       All operations happen in parallel - ONE instruction


HORIZONTAL (SLOW): Combine values WITHIN a single register
====================================================================

    YMM0
+---+---+---+---+---+---+---+---+
| 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
+---+---+---+---+---+---+---+---+
  |   |   |   |   |   |   |   |
  +---+   +---+   +---+   +---+
    |       |       |       |
    3       7       11      15
    |       |       |       |
    +-------+       +-------+
        |               |
       10              26
        |               |
        +-------+-------+
                |
               36  <-- Final sum (requires multiple shuffle + add ops)

Cross-lane communication is expensive!
Horizontal operations require 3-4 instructions on AVX.
This is why dot product is NOT 8x faster - the final reduction limits gains.
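
The same contrast in std::experimental::simd terms: a lane-wise add is a single vector operation, while reduce() has to combine lanes and therefore compiles to a sequence of shuffles and adds. A minimal sketch, assuming GCC 11+:

// vertical_vs_horizontal.cpp (sketch)
#include <experimental/simd>
namespace stdx = std::experimental;
using vfloat = stdx::native_simd<float>;

// VERTICAL: one vector instruction, all lanes computed in parallel
vfloat vertical_add(vfloat a, vfloat b) {
    return a + b;
}

// HORIZONTAL: combine the lanes of one register into a scalar -
// this is the expensive cross-lane step at the end of a reduction
float horizontal_sum(vfloat a) {
    return stdx::reduce(a);
}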

Memory Alignment

SIMD operations work best with aligned memory:

MEMORY ALIGNMENT
====================================================================

ALIGNED (Fast):
    Memory addresses are multiples of the SIMD width.

    Address: 0x1000 (multiple of 32 for AVX)
    +--------+--------+--------+--------+--------+--------+--------+--------+
    | float0 | float1 | float2 | float3 | float4 | float5 | float6 | float7 |
    +--------+--------+--------+--------+--------+--------+--------+--------+
    ^
    |__ Single aligned load: vmovaps (1 cycle)


UNALIGNED (Slower):
    Memory starts at arbitrary address.

    Address: 0x1004 (NOT a multiple of 32)
    +--------+--------+--------+--------+--------+--------+--------+--------+
    | float0 | float1 | float2 | float3 | float4 | float5 | float6 | float7 |
    +--------+--------+--------+--------+--------+--------+--------+--------+
        ^
        |__ May cross cache line boundary
        |__ Unaligned load: vmovups (1-2 cycles, or penalty)


HOW TO ENSURE ALIGNMENT
====================================================================

C++:
    alignas(32) float data[1024];           // Stack allocation
    auto* ptr = std::aligned_alloc(32, n);  // Heap allocation (n must be a multiple of 32)

std::experimental::simd:
    stdx::element_aligned  - Any alignment (safe, may be slower)
    stdx::vector_aligned   - Full SIMD width alignment (fast)
    stdx::overaligned<N>   - Custom alignment (N bytes)

Cache Line Alignment (64 bytes for most CPUs):
    alignas(64) float data[1024];  // Avoid false sharing
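
A short sketch of how the load tags map onto these cases (assumes GCC 11+ with <experimental/simd>). vector_aligned is only legal when you can guarantee the alignment yourself:

// alignment_tags.cpp (sketch)
#include <experimental/simd>
namespace stdx = std::experimental;
using vfloat = stdx::native_simd<float>;

alignas(64) float aligned_buf[1024];   // alignment guaranteed at compile time

float demo(const float* arbitrary /* alignment unknown */) {
    vfloat a(aligned_buf, stdx::vector_aligned);    // OK: we control the alignment
    vfloat b(arbitrary, stdx::element_aligned);     // safe for any pointer
    // vfloat c(arbitrary, stdx::vector_aligned);   // undefined behavior if misaligned
    return stdx::reduce(a + b);
}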

Why SIMD Matters

SIMD is not just an optimization technique - it is fundamental to achieving peak CPU performance. Modern CPUs are designed with SIMD in mind:

  1. CPU Design: Modern CPUs have more SIMD execution units than scalar units
  2. Memory Bandwidth: SIMD helps saturate memory bandwidth by processing more data per instruction
  3. Energy Efficiency: Fewer instructions means less instruction fetch/decode energy
  4. Competitive Parity: Libraries like BLAS, OpenCV, and game engines all use SIMD extensively

PEAK FLOPS COMPARISON (Intel Core i7-12700K)
====================================================================

                        FLOPS/cycle    Operations/second (4.9 GHz)
                        ===========    ===========================
Scalar (1 float)             1              4.9 billion
SSE    (4 floats)            4             19.6 billion
AVX    (8 floats)            8             39.2 billion
AVX-512 (16 floats)         16             78.4 billion (if supported)

With FMA (Fused Multiply-Add), counting each fused multiply-add as 2 FLOPs:
AVX + FMA (8 floats x 2)    16             78.4 billion

To achieve peak performance, you MUST use SIMD.

Historical Context: SSE, AVX, AVX-512

EVOLUTION OF x86 SIMD
====================================================================

1997: MMX (64-bit)
      First x86 SIMD. Integer only. Shared registers with FPU (ugh).

1999: SSE (128-bit)
      Dedicated XMM registers. Float support. Game changer.
      4 floats or 2 doubles per operation.

2000-2008: SSE2 through SSE4.2
      Integer operations, more instructions, horizontal ops.

2011: AVX (256-bit)
      Double the width! YMM registers.
      8 floats or 4 doubles per operation.
      3-operand syntax: c = a + b (not a += b).

2013: AVX2
      Integer operations in 256-bit. FMA3 (Fused Multiply-Add).
      Essential for neural networks and scientific computing.

2016: AVX-512 (512-bit)
      16 floats or 8 doubles. Mask registers for predication.
      Only on server CPUs and some high-end desktop.
      Controversial: Causes frequency throttling on some chips.

ARM:
2004: NEON (128-bit)
      ARM's answer to SSE. Found in all modern ARM chips.

2016: SVE (Scalable Vector Extension)
      Variable length: 128 to 2048 bits. Future-proof.
      Same code runs on different vector widths.

Common Misconceptions

| Misconception | Reality |
|---------------|---------|
| “SIMD always gives 4x/8x speedup” | Horizontal operations, memory bandwidth, and remainder handling reduce gains. Expect 2-6x for most workloads. |
| “The compiler will auto-vectorize” | Compilers struggle with aliasing, non-contiguous access, and complex loops. Manual SIMD is often needed. |
| “AVX-512 is always fastest” | Frequency throttling on some CPUs can make AVX-512 slower than AVX2 for mixed workloads. |
| “Alignment doesn’t matter anymore” | Unaligned access is faster now but still has penalties, especially when crossing cache lines. |
| “SIMD is only for floating-point” | Integer SIMD is equally important for video codecs, cryptography, and string processing. |

Project Specification

What You Will Build

A high-performance math library using std::experimental::simd that provides vectorized implementations of:

  1. Vector operations: Addition, subtraction, multiplication, division, FMA
  2. Reductions: Sum, min, max, dot product
  3. Square roots: Element-wise square root using the hardware SIMD sqrt instruction
  4. Image processing: RGB to grayscale conversion
  5. Matrix operations: Matrix-vector multiply, matrix-matrix multiply

Benchmark Targets

Your library should achieve these approximate speedups on AVX2 hardware:

| Operation | Input Size | Scalar Time | SIMD Time | Expected Speedup |
|-----------|------------|-------------|-----------|------------------|
| Vector Add | 1M floats | ~1.2 ms | ~0.30 ms | 3.5-4x |
| Dot Product | 1M floats | ~1.8 ms | ~0.25 ms | 6-7x |
| Sum (reduction) | 1M floats | ~0.8 ms | ~0.15 ms | 5-6x |
| Matrix Multiply (1024x1024) | ~1M elements | ~2400 ms | ~340 ms | 6-7x |
| RGB to Grayscale | 4K image (8.3M pixels) | ~28 ms | ~4.2 ms | 6-7x |

Real World Outcome

When you complete this project, running your benchmark will produce output like:

$ ./simd_math_benchmark

============================================================
SIMD Math Library Benchmark
============================================================

System Information:
  CPU: Intel Core i7-12700K @ 4.9 GHz
  SIMD Support: SSE SSE2 SSE3 SSSE3 SSE4.1 SSE4.2 AVX AVX2 FMA
  std::simd<float> size: 8 (256-bit / AVX)
  Cache Line Size: 64 bytes

------------------------------------------------------------
Vector Addition (1,000,000 floats)
------------------------------------------------------------
  Scalar:            1.23 ms  (baseline)
  Auto-vectorized:   0.35 ms  (3.5x faster)
  std::simd:         0.31 ms  (4.0x faster)
  Intrinsics (AVX):  0.30 ms  (4.1x faster)

------------------------------------------------------------
Dot Product (1,000,000 floats)
------------------------------------------------------------
  Scalar:            1.82 ms  (baseline)
  std::simd:         0.26 ms  (7.0x faster)
  Note: Horizontal reduction limits gains

------------------------------------------------------------
Sum Reduction (1,000,000 floats)
------------------------------------------------------------
  Scalar:            0.85 ms  (baseline)
  std::simd:         0.14 ms  (6.1x faster)

------------------------------------------------------------
Matrix Multiply (1024 x 1024 floats)
------------------------------------------------------------
  Scalar (naive):    2,412 ms  (baseline)
  std::simd:          342 ms  (7.1x faster)
  BLAS (reference):    85 ms  (28.4x faster)
  Note: BLAS uses blocking + SIMD + cache optimization

------------------------------------------------------------
RGB to Grayscale (3840 x 2160 = 8.3M pixels)
------------------------------------------------------------
  Scalar:            27.8 ms  (baseline)
  std::simd:          4.2 ms  (6.6x faster)

============================================================
Summary: std::simd provides 4-7x speedup over scalar code
         with portable, maintainable C++ code.
============================================================

$ ./simd_demo
Demonstrating std::simd operations:

std::simd<float, 8> a = {1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0}
std::simd<float, 8> b = {8.0, 7.0, 6.0, 5.0, 4.0, 3.0, 2.0, 1.0}

a + b = {9.0, 9.0, 9.0, 9.0, 9.0, 9.0, 9.0, 9.0}
a * b = {8.0, 14.0, 18.0, 20.0, 20.0, 18.0, 14.0, 8.0}

Dot product (a . b) = 120.0
Sum of a = 36.0

Element-wise sqrt(a) = {1.0, 1.41, 1.73, 2.0, 2.24, 2.45, 2.65, 2.83}

API Design

Your library should provide an API like this:

namespace simd_math {

// Vector operations (element-wise)
void add(const float* a, const float* b, float* result, size_t n);
void sub(const float* a, const float* b, float* result, size_t n);
void mul(const float* a, const float* b, float* result, size_t n);
void div(const float* a, const float* b, float* result, size_t n);
void fma(const float* a, const float* b, const float* c, float* result, size_t n);
void sqrt(const float* a, float* result, size_t n);

// Reductions
float sum(const float* a, size_t n);
float dot(const float* a, const float* b, size_t n);
float min(const float* a, size_t n);
float max(const float* a, size_t n);

// Matrix operations
void mat_vec_mul(const float* mat, const float* vec, float* result,
                 size_t rows, size_t cols);
void mat_mat_mul(const float* a, const float* b, float* result,
                 size_t m, size_t n, size_t k);

// Image processing
void rgb_to_grayscale(const uint8_t* rgb, uint8_t* gray, size_t num_pixels);

} // namespace simd_math
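
A hypothetical usage example of this API (the umbrella header name is illustrative; your project may organize headers differently):

// caller.cpp (sketch)
#include <cstdio>
#include <vector>
// #include "simd_math.hpp"   // illustrative umbrella header for the API above

int main() {
    std::vector<float> a(1000, 1.0f), b(1000, 2.0f), c(1000);

    simd_math::add(a.data(), b.data(), c.data(), a.size());   // c[i] = a[i] + b[i]
    float d = simd_math::dot(a.data(), b.data(), a.size());   // 1000 * (1 * 2) = 2000

    std::printf("c[0] = %.1f, dot = %.1f\n", c[0], d);
}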

Solution Architecture

High-Level Design

SIMD MATH LIBRARY ARCHITECTURE
====================================================================

+-------------------------------------------------------------------+
|                         PUBLIC API                                 |
|   add(), sub(), mul(), dot(), mat_mul(), rgb_to_gray(), ...       |
+-------------------------------------------------------------------+
              |                    |                    |
              v                    v                    v
+-------------------+  +-------------------+  +-------------------+
| ELEMENT-WISE OPS  |  |   REDUCTIONS      |  |  MATRIX OPS       |
|                   |  |                   |  |                   |
| - Vertical SIMD   |  | - Partial sums    |  | - Blocked access  |
| - Main + remainder|  | - Horizontal add  |  | - Cache-friendly  |
| - Aligned loads   |  | - Final reduce    |  | - Row-major/col   |
+-------------------+  +-------------------+  +-------------------+
              |                    |                    |
              +--------------------+--------------------+
                                   |
                                   v
+-------------------------------------------------------------------+
|                    SIMD ABSTRACTION LAYER                         |
|   std::experimental::simd (C++26 will make this std::simd)        |
+-------------------------------------------------------------------+
              |                    |                    |
              v                    v                    v
+-------------------+  +-------------------+  +-------------------+
|   SSE / SSE2      |  |   AVX / AVX2      |  |   AVX-512         |
|   (128-bit)       |  |   (256-bit)       |  |   (512-bit)       |
+-------------------+  +-------------------+  +-------------------+
              |                    |                    |
              +--------------------+--------------------+
                                   |
                                   v
+-------------------------------------------------------------------+
|                       CPU HARDWARE                                 |
|   Vector registers (XMM, YMM, ZMM), Vector execution units        |
+-------------------------------------------------------------------+

SIMD Lane Processing Pattern

PROCESSING PATTERN FOR ARRAY OPERATIONS
====================================================================

Input Arrays (n = 18 elements, SIMD width = 8):

a[]:  +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
      | 0| 1| 2| 3| 4| 5| 6| 7| 8| 9|10|11|12|13|14|15|16|17|
      +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
      |<------- SIMD 1 ------>|<------- SIMD 2 ------>| REM |

Processing Strategy:

    size_t i = 0;

    // MAIN LOOP: Process 8 elements at a time
    for (; i + simd_size <= n; i += simd_size) {
        simd_t va(&a[i], stdx::element_aligned);
        simd_t vb(&b[i], stdx::element_aligned);
        simd_t vr = va + vb;
        vr.copy_to(&result[i], stdx::element_aligned);
    }

    // REMAINDER LOOP: Process remaining elements one at a time
    for (; i < n; ++i) {
        result[i] = a[i] + b[i];
    }

Iteration 1 (i = 0):
    Load a[0..7] into SIMD register
    Load b[0..7] into SIMD register
    Add them -> result[0..7]

Iteration 2 (i = 8):
    Load a[8..15] into SIMD register
    Load b[8..15] into SIMD register
    Add them -> result[8..15]

Remainder (i = 16, 17):
    result[16] = a[16] + b[16];  (scalar)
    result[17] = a[17] + b[17];  (scalar)

Key Components

| Component | Responsibility |
|-----------|----------------|
| simd_types.hpp | Type aliases for std::simd, platform detection |
| vector_ops.hpp/cpp | Element-wise operations (add, mul, sqrt) |
| reductions.hpp/cpp | Sum, dot product, min, max |
| matrix_ops.hpp/cpp | Matrix-vector and matrix-matrix multiply |
| image_ops.hpp/cpp | RGB to grayscale and other image operations |
| benchmark.cpp | Performance measurement and comparison |
| tests.cpp | Correctness verification |

Data Structures

namespace simd_math {

// Platform-specific SIMD types
namespace stdx = std::experimental;

// Use the widest SIMD available on this platform
using native_float = stdx::native_simd<float>;
using native_double = stdx::native_simd<double>;
using native_int32 = stdx::native_simd<int32_t>;

// Fixed-size for portable behavior (same on all platforms)
using float4 = stdx::fixed_size_simd<float, 4>;   // SSE-like
using float8 = stdx::fixed_size_simd<float, 8>;   // AVX-like
using float16 = stdx::fixed_size_simd<float, 16>; // AVX-512-like

// Compile-time constants
constexpr size_t native_float_size = native_float::size();
constexpr size_t cache_line_size = 64;

// Aligned buffer for SIMD operations.
// Note: alignas() on the struct would NOT align std::vector's heap storage, so we
// allocate with std::aligned_alloc instead (requires <memory> and <cstdlib>).
// The storage is uninitialized, which is fine for trivially constructible types like float.
template<typename T, size_t Alignment = 64>
struct AlignedBuffer {
    explicit AlignedBuffer(size_t n)
        : count(n),
          data(static_cast<T*>(std::aligned_alloc(Alignment, round_up(n * sizeof(T)))),
               &std::free) {}

    T* get() { return data.get(); }
    const T* get() const { return data.get(); }
    size_t size() const { return count; }

private:
    // std::aligned_alloc requires the byte count to be a multiple of the alignment
    static size_t round_up(size_t bytes) {
        return (bytes + Alignment - 1) / Alignment * Alignment;
    }

    size_t count;
    std::unique_ptr<T[], decltype(&std::free)> data;
};

} // namespace simd_math
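
A minimal usage sketch for this buffer (assuming the add() routine from Phase 1 below); with the data genuinely 64-byte aligned, vector_aligned loads become valid. This fragment belongs inside a test or benchmark function:

// Hypothetical usage: 64-byte-aligned storage for the SIMD routines
simd_math::AlignedBuffer<float> a(1024), b(1024), c(1024);

for (size_t i = 0; i < a.size(); ++i) {   // storage is uninitialized, so fill it first
    a.get()[i] = 1.0f;
    b.get()[i] = 2.0f;
}

simd_math::add(a.get(), b.get(), c.get(), c.size());   // every c[i] becomes 3.0f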

Implementation Guide

Development Environment Setup

# Verify compiler SIMD support
g++ --version  # Need GCC 11+ for std::experimental::simd

# Check CPU SIMD capabilities
cat /proc/cpuinfo | grep -E "sse|avx|avx2|avx512" | head -1
# Or on macOS:
sysctl -a | grep -E "hw.optional.*avx"

# Create project structure
mkdir -p simd_math/{include,src,tests,benchmarks}
cd simd_math

# Create CMakeLists.txt
cat > CMakeLists.txt << 'EOF'
cmake_minimum_required(VERSION 3.16)
project(simd_math CXX)

set(CMAKE_CXX_STANDARD 20)
set(CMAKE_CXX_STANDARD_REQUIRED ON)

# Enable optimizations for benchmarks
set(CMAKE_CXX_FLAGS_RELEASE "-O3 -march=native -DNDEBUG")

# -ffast-math lets the compiler reassociate floating-point math so reductions can
# auto-vectorize, but it relaxes IEEE semantics; drop it if you need strict results
add_compile_options(-ffast-math)

# Main library
add_library(simd_math STATIC
    src/vector_ops.cpp
    src/reductions.cpp
    src/matrix_ops.cpp
    src/image_ops.cpp
)

target_include_directories(simd_math PUBLIC include)

# Benchmark executable
add_executable(benchmark benchmarks/benchmark.cpp)
target_link_libraries(benchmark simd_math)

# Test executable
add_executable(tests tests/tests.cpp)
target_link_libraries(tests simd_math)

# Demo executable
add_executable(demo src/demo.cpp)
target_link_libraries(demo simd_math)
EOF
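
A typical out-of-source build and run sequence for this layout might look like the following (target names follow the CMakeLists above):

# Configure, build, and run (assumes the sources sketched in the phases below exist)
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j
./build/tests
./build/benchmark
./build/demo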

Phase 1: Basic Vector Operations (Days 1-3)

Start with vector addition - the simplest SIMD operation.

Goals:

  • Set up std::experimental::simd
  • Implement vector addition with SIMD
  • Handle remainder elements correctly
  • Verify correctness

Tasks:

  1. Create header with SIMD type aliases
  2. Implement add() function
  3. Write scalar reference implementation for testing
  4. Verify SIMD results match scalar results

Checkpoint: Vector addition produces correct results and is measurably faster.

// include/simd_types.hpp
#pragma once

#include <experimental/simd>
#include <cstddef>
#include <cassert>

namespace simd_math {

namespace stdx = std::experimental;

// Native SIMD width for this platform
using simd_float = stdx::native_simd<float>;
constexpr size_t simd_width = simd_float::size();

// Memory alignment tag
using aligned_tag = stdx::vector_aligned_tag;
using unaligned_tag = stdx::element_aligned_tag;

} // namespace simd_math

// src/vector_ops.cpp
#include "simd_types.hpp"

namespace simd_math {

void add(const float* a, const float* b, float* result, size_t n) {
    size_t i = 0;

    // Main SIMD loop
    for (; i + simd_width <= n; i += simd_width) {
        simd_float va(a + i, stdx::element_aligned);
        simd_float vb(b + i, stdx::element_aligned);
        simd_float vr = va + vb;
        vr.copy_to(result + i, stdx::element_aligned);
    }

    // Remainder (scalar)
    for (; i < n; ++i) {
        result[i] = a[i] + b[i];
    }
}

// Reference scalar implementation for testing
void add_scalar(const float* a, const float* b, float* result, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        result[i] = a[i] + b[i];
    }
}

} // namespace simd_math

Phase 2: Reductions and Dot Product (Days 4-6)

Implement horizontal operations which require accumulating across SIMD lanes.

Goals:

  • Implement sum reduction using SIMD
  • Implement dot product
  • Understand horizontal reduction cost
  • Use stdx::reduce() for final accumulation

Tasks:

  1. Implement sum() with partial sums in SIMD registers
  2. Implement dot() as element-wise multiply + sum
  3. Benchmark to understand horizontal operation overhead
  4. Compare with scalar implementations

Checkpoint: Dot product is 5-7x faster than scalar.

// src/reductions.cpp
#include "simd_types.hpp"

namespace simd_math {

float sum(const float* a, size_t n) {
    simd_float partial_sum(0.0f);  // Accumulator
    size_t i = 0;

    // Main SIMD loop: accumulate vertical sums
    for (; i + simd_width <= n; i += simd_width) {
        simd_float va(a + i, stdx::element_aligned);
        partial_sum += va;  // Vertical add (fast)
    }

    // Horizontal reduce: combine all lanes
    float result = stdx::reduce(partial_sum);  // sum all lanes

    // Add remainder elements
    for (; i < n; ++i) {
        result += a[i];
    }

    return result;
}

float dot(const float* a, const float* b, size_t n) {
    simd_float partial_sum(0.0f);
    size_t i = 0;

    // Main SIMD loop: accumulate element-wise products
    for (; i + simd_width <= n; i += simd_width) {
        simd_float va(a + i, stdx::element_aligned);
        simd_float vb(b + i, stdx::element_aligned);
        partial_sum += va * vb;  // FMA would be even better
    }

    // Horizontal reduce
    float result = stdx::reduce(partial_sum);

    // Remainder
    for (; i < n; ++i) {
        result += a[i] * b[i];
    }

    return result;
}

} // namespace simd_math
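
The comment in dot() notes that FMA would be even better. Below is a sketch of an FMA-flavored variant using the aliases from simd_types.hpp; with -march=native (which implies -mfma on FMA-capable CPUs) the multiply-add in the loop is typically contracted into a single vfmadd instruction per iteration, and the TS also provides an explicit fma() overload for simd types if you prefer not to rely on contraction:

// src/reductions_fma.cpp (sketch) - FMA-flavored dot product
#include "simd_types.hpp"

namespace simd_math {

float dot_fma(const float* a, const float* b, size_t n) {
    simd_float acc(0.0f);
    size_t i = 0;

    for (; i + simd_width <= n; i += simd_width) {
        simd_float va(a + i, stdx::element_aligned);
        simd_float vb(b + i, stdx::element_aligned);
        acc = va * vb + acc;   // typically contracted into one FMA per iteration
    }

    float result = stdx::reduce(acc);

    for (; i < n; ++i) {
        result += a[i] * b[i];
    }
    return result;
}

} // namespace simd_math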

Phase 3: Matrix Operations (Days 7-10)

Matrix multiplication is where SIMD really shines, but naive approaches miss cache opportunities.

Goals:

  • Implement matrix-vector multiplication
  • Implement naive matrix-matrix multiplication with SIMD
  • Understand the gap between naive SIMD and optimized BLAS
  • Learn about cache blocking (conceptually)

Tasks:

  1. Implement mat_vec_mul() with SIMD
  2. Implement mat_mat_mul() with SIMD on inner loop
  3. Benchmark against scalar and (optionally) OpenBLAS
  4. Experiment with loop ordering (ijk, ikj, etc.)

Checkpoint: Matrix-vector multiply achieves ~6x speedup.

// src/matrix_ops.cpp
#include "simd_types.hpp"

namespace simd_math {

// Matrix-vector multiply: result = mat * vec
// mat is rows x cols, vec is cols x 1, result is rows x 1
void mat_vec_mul(const float* mat, const float* vec, float* result,
                 size_t rows, size_t cols) {
    for (size_t row = 0; row < rows; ++row) {
        // Compute dot product of row with vector
        const float* row_ptr = mat + row * cols;

        simd_float partial_sum(0.0f);
        size_t col = 0;

        for (; col + simd_width <= cols; col += simd_width) {
            simd_float vmat(row_ptr + col, stdx::element_aligned);
            simd_float vvec(vec + col, stdx::element_aligned);
            partial_sum += vmat * vvec;
        }

        result[row] = stdx::reduce(partial_sum);

        // Remainder
        for (; col < cols; ++col) {
            result[row] += row_ptr[col] * vec[col];
        }
    }
}

// Matrix-matrix multiply: C = A * B
// A is m x k, B is k x n, C is m x n
// Naive implementation with SIMD on inner loop
void mat_mat_mul(const float* A, const float* B, float* C,
                 size_t m, size_t n, size_t k) {
    // Zero output matrix
    for (size_t i = 0; i < m * n; ++i) {
        C[i] = 0.0f;
    }

    // C[i,j] = sum over p of A[i,p] * B[p,j]
    // Using ikj order for better cache behavior
    for (size_t i = 0; i < m; ++i) {
        for (size_t p = 0; p < k; ++p) {
            float a_ip = A[i * k + p];
            simd_float va(a_ip);  // Broadcast scalar

            size_t j = 0;
            for (; j + simd_width <= n; j += simd_width) {
                simd_float vb(B + p * n + j, stdx::element_aligned);
                simd_float vc(C + i * n + j, stdx::element_aligned);
                vc += va * vb;  // C[i,j] += A[i,p] * B[p,j]
                vc.copy_to(C + i * n + j, stdx::element_aligned);
            }

            // Remainder
            for (; j < n; ++j) {
                C[i * n + j] += a_ip * B[p * n + j];
            }
        }
    }
}

} // namespace simd_math
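
For the "learn about cache blocking (conceptually)" goal, here is a teaching-oriented sketch of a blocked variant. The block sizes are illustrative rather than tuned; reusing tiles of B and C while they are still hot in cache is where most of the gap to BLAS comes from:

// src/matrix_ops_blocked.cpp (sketch) - cache-blocked matrix multiply
#include "simd_types.hpp"
#include <algorithm>   // std::min

namespace simd_math {

void mat_mat_mul_blocked(const float* A, const float* B, float* C,
                         size_t m, size_t n, size_t k) {
    constexpr size_t BI = 64, BP = 64, BJ = 256;   // tile sizes: rows, depth, cols

    for (size_t i = 0; i < m * n; ++i) C[i] = 0.0f;

    for (size_t i0 = 0; i0 < m; i0 += BI)
    for (size_t p0 = 0; p0 < k; p0 += BP)
    for (size_t j0 = 0; j0 < n; j0 += BJ) {
        const size_t i_end = std::min(i0 + BI, m);
        const size_t p_end = std::min(p0 + BP, k);
        const size_t j_end = std::min(j0 + BJ, n);

        // Within a tile, the same ikj + SIMD kernel as mat_mat_mul above
        for (size_t i = i0; i < i_end; ++i)
        for (size_t p = p0; p < p_end; ++p) {
            simd_float va(A[i * k + p]);   // broadcast A[i,p]

            size_t j = j0;
            for (; j + simd_width <= j_end; j += simd_width) {
                simd_float vb(B + p * n + j, stdx::element_aligned);
                simd_float vc(C + i * n + j, stdx::element_aligned);
                vc += va * vb;
                vc.copy_to(C + i * n + j, stdx::element_aligned);
            }
            for (; j < j_end; ++j) {
                C[i * n + j] += A[i * k + p] * B[p * n + j];
            }
        }
    }
}

} // namespace simd_math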

Phase 4: Image Processing (Days 11-12)

Apply SIMD to a real-world application: image processing.

Goals:

  • Implement RGB to grayscale conversion
  • Handle uint8_t data with SIMD
  • Understand integer SIMD operations
  • Process multiple pixels per SIMD operation

Tasks:

  1. Implement RGB to grayscale with SIMD
  2. Handle uint8_t to float conversion
  3. Verify output matches reference implementation
  4. Benchmark with real image sizes

Checkpoint: 4K image conversion is 6x faster.

// src/image_ops.cpp
#include "simd_types.hpp"

namespace simd_math {

// RGB to Grayscale: Y = 0.299*R + 0.587*G + 0.114*B
// Input: RGB pixels (3 bytes per pixel)
// Output: Grayscale pixels (1 byte per pixel)
void rgb_to_grayscale(const uint8_t* rgb, uint8_t* gray, size_t num_pixels) {
    constexpr float r_weight = 0.299f;
    constexpr float g_weight = 0.587f;
    constexpr float b_weight = 0.114f;

    simd_float vr_weight(r_weight);
    simd_float vg_weight(g_weight);
    simd_float vb_weight(b_weight);

    size_t i = 0;

    // Process simd_width pixels at a time
    for (; i + simd_width <= num_pixels; i += simd_width) {
        // Load RGB values and convert to float
        alignas(64) float r[simd_width], g[simd_width], b[simd_width];

        for (size_t lane = 0; lane < simd_width; ++lane) {
            size_t pixel = i + lane;
            r[lane] = static_cast<float>(rgb[pixel * 3 + 0]);
            g[lane] = static_cast<float>(rgb[pixel * 3 + 1]);
            b[lane] = static_cast<float>(rgb[pixel * 3 + 2]);
        }

        simd_float vr(r, stdx::vector_aligned);
        simd_float vg(g, stdx::vector_aligned);
        simd_float vb(b, stdx::vector_aligned);

        // Compute grayscale: Y = 0.299R + 0.587G + 0.114B
        simd_float vy = vr * vr_weight + vg * vg_weight + vb * vb_weight;

        // Store results
        alignas(64) float y[simd_width];
        vy.copy_to(y, stdx::vector_aligned);

        for (size_t lane = 0; lane < simd_width; ++lane) {
            gray[i + lane] = static_cast<uint8_t>(y[lane] + 0.5f);  // Round
        }
    }

    // Remainder
    for (; i < num_pixels; ++i) {
        float r_val = static_cast<float>(rgb[i * 3 + 0]);
        float g_val = static_cast<float>(rgb[i * 3 + 1]);
        float b_val = static_cast<float>(rgb[i * 3 + 2]);
        gray[i] = static_cast<uint8_t>(
            r_val * r_weight + g_val * g_weight + b_val * b_weight + 0.5f
        );
    }
}

} // namespace simd_math

Phase 5: Benchmarking and Optimization (Days 13-14)

Build a proper benchmarking framework and compare implementations.

Goals:

  • Create accurate microbenchmarks
  • Compare scalar, auto-vectorized, std::simd, and intrinsics
  • Identify optimization opportunities
  • Document performance characteristics

Tasks:

  1. Write benchmarking harness with warmup
  2. Measure and report results in a table
  3. Compare with compiler auto-vectorization
  4. Test with different array sizes and alignments

Checkpoint: Complete benchmark suite runs successfully.

// benchmarks/benchmark.cpp
#include "simd_types.hpp"
#include <chrono>
#include <iostream>
#include <random>
#include <vector>
#include <iomanip>

namespace simd_math {

// Normally declared in the library headers; repeated here to keep the example self-contained
void add(const float* a, const float* b, float* result, size_t n);
float sum(const float* a, size_t n);
float dot(const float* a, const float* b, size_t n);
void mat_vec_mul(const float* mat, const float* vec, float* result,
                 size_t rows, size_t cols);
void rgb_to_grayscale(const uint8_t* rgb, uint8_t* gray, size_t num_pixels);

} // namespace simd_math

class Timer {
public:
    using Clock = std::chrono::high_resolution_clock;

    void start() { start_time = Clock::now(); }

    double elapsed_ms() const {
        auto end = Clock::now();
        return std::chrono::duration<double, std::milli>(end - start_time).count();
    }

private:
    Clock::time_point start_time;
};

template<typename Func>
double benchmark(Func&& func, int iterations = 100) {
    // Warmup
    for (int i = 0; i < 10; ++i) {
        func();
    }

    Timer timer;
    timer.start();
    for (int i = 0; i < iterations; ++i) {
        func();
    }
    return timer.elapsed_ms() / iterations;
}

void scalar_add(const float* a, const float* b, float* r, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        r[i] = a[i] + b[i];
    }
}

float scalar_sum(const float* a, size_t n) {
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        sum += a[i];
    }
    return sum;
}

float scalar_dot(const float* a, const float* b, size_t n) {
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        sum += a[i] * b[i];
    }
    return sum;
}

int main() {
    std::cout << "============================================================\n";
    std::cout << "SIMD Math Library Benchmark\n";
    std::cout << "============================================================\n\n";

    std::cout << "std::simd<float> width: " << simd_math::simd_width << " elements\n\n";

    // Generate test data
    const size_t n = 1'000'000;
    std::vector<float> a(n), b(n), result(n);

    std::mt19937 gen(42);
    std::uniform_real_distribution<float> dist(0.0f, 1.0f);
    for (size_t i = 0; i < n; ++i) {
        a[i] = dist(gen);
        b[i] = dist(gen);
    }

    std::cout << std::fixed << std::setprecision(2);

    // Benchmark vector addition
    std::cout << "------------------------------------------------------------\n";
    std::cout << "Vector Addition (" << n << " floats)\n";
    std::cout << "------------------------------------------------------------\n";

    double scalar_time = benchmark([&]() {
        scalar_add(a.data(), b.data(), result.data(), n);
    });
    std::cout << "  Scalar:     " << std::setw(8) << scalar_time << " ms (baseline)\n";

    double simd_time = benchmark([&]() {
        simd_math::add(a.data(), b.data(), result.data(), n);
    });
    std::cout << "  std::simd:  " << std::setw(8) << simd_time
              << " ms (" << (scalar_time / simd_time) << "x faster)\n";

    // Benchmark sum
    std::cout << "\n------------------------------------------------------------\n";
    std::cout << "Sum Reduction (" << n << " floats)\n";
    std::cout << "------------------------------------------------------------\n";

    scalar_time = benchmark([&]() {
        volatile float s = scalar_sum(a.data(), n);
        (void)s;
    });
    std::cout << "  Scalar:     " << std::setw(8) << scalar_time << " ms (baseline)\n";

    simd_time = benchmark([&]() {
        volatile float s = simd_math::sum(a.data(), n);
        (void)s;
    });
    std::cout << "  std::simd:  " << std::setw(8) << simd_time
              << " ms (" << (scalar_time / simd_time) << "x faster)\n";

    // Benchmark dot product
    std::cout << "\n------------------------------------------------------------\n";
    std::cout << "Dot Product (" << n << " floats)\n";
    std::cout << "------------------------------------------------------------\n";

    scalar_time = benchmark([&]() {
        volatile float d = scalar_dot(a.data(), b.data(), n);
        (void)d;
    });
    std::cout << "  Scalar:     " << std::setw(8) << scalar_time << " ms (baseline)\n";

    simd_time = benchmark([&]() {
        volatile float d = simd_math::dot(a.data(), b.data(), n);
        (void)d;
    });
    std::cout << "  std::simd:  " << std::setw(8) << simd_time
              << " ms (" << (scalar_time / simd_time) << "x faster)\n";

    std::cout << "\n============================================================\n";
    std::cout << "Benchmark complete.\n";
    std::cout << "============================================================\n";

    return 0;
}

Testing Strategy

Correctness Testing

Every SIMD function must produce results identical to a scalar reference:

// tests/tests.cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <iostream>
#include <vector>
#include <random>

// Normally provided by the library headers; declared here to keep the example self-contained
namespace simd_math {
void add(const float* a, const float* b, float* result, size_t n);
float dot(const float* a, const float* b, size_t n);
}

// Test that SIMD results match scalar within epsilon
template<typename T>
bool approx_equal(T a, T b, T epsilon = 1e-5) {
    return std::abs(a - b) < epsilon;
}

void test_add() {
    std::cout << "Testing add()... ";

    std::vector<size_t> sizes = {0, 1, 7, 8, 9, 15, 16, 17, 100, 1000, 1000000};

    std::mt19937 gen(42);
    std::uniform_real_distribution<float> dist(-100.0f, 100.0f);

    for (size_t n : sizes) {
        std::vector<float> a(n), b(n), result(n), expected(n);

        for (size_t i = 0; i < n; ++i) {
            a[i] = dist(gen);
            b[i] = dist(gen);
            expected[i] = a[i] + b[i];  // Scalar reference
        }

        simd_math::add(a.data(), b.data(), result.data(), n);

        for (size_t i = 0; i < n; ++i) {
            assert(approx_equal(result[i], expected[i]));
        }
    }

    std::cout << "PASSED\n";
}

void test_dot() {
    std::cout << "Testing dot()... ";

    std::vector<size_t> sizes = {0, 1, 7, 8, 9, 100, 1000};

    std::mt19937 gen(42);
    std::uniform_real_distribution<float> dist(-1.0f, 1.0f);  // Smaller range to reduce error

    for (size_t n : sizes) {
        std::vector<float> a(n), b(n);

        for (size_t i = 0; i < n; ++i) {
            a[i] = dist(gen);
            b[i] = dist(gen);
        }

        // Scalar reference
        float expected = 0.0f;
        for (size_t i = 0; i < n; ++i) {
            expected += a[i] * b[i];
        }

        float result = simd_math::dot(a.data(), b.data(), n);

        // Allow larger epsilon for accumulated floating-point error
        assert(approx_equal(result, expected, 1e-2f * n));
    }

    std::cout << "PASSED\n";
}

void test_alignment() {
    std::cout << "Testing with various alignments... ";

    // Plain vector with extra space for offsets; note that alignas() would not align
    // the vector's heap data anyway, and the test deliberately exercises arbitrary
    // offsets from whatever base address the allocator returns
    std::vector<float> buffer(1000 + 64);

    // Test with different offsets from aligned address
    for (size_t offset = 0; offset < 16; ++offset) {
        float* a = buffer.data() + offset;
        float* b = buffer.data() + offset + 100;
        float* result = buffer.data() + offset + 200;

        // Fill with test data
        for (size_t i = 0; i < 50; ++i) {
            a[i] = static_cast<float>(i);
            b[i] = static_cast<float>(i * 2);
        }

        simd_math::add(a, b, result, 50);

        for (size_t i = 0; i < 50; ++i) {
            assert(approx_equal(result[i], static_cast<float>(i * 3)));
        }
    }

    std::cout << "PASSED\n";
}

int main() {
    std::cout << "Running SIMD Math Library Tests\n";
    std::cout << "================================\n";

    test_add();
    test_dot();
    test_alignment();

    std::cout << "================================\n";
    std::cout << "All tests passed!\n";

    return 0;
}

Edge Cases to Test

| Edge Case | Why It Matters |
|-----------|----------------|
| n = 0 | Empty input should not crash |
| n = 1 | Single element, no SIMD benefit |
| n = simd_width - 1 | All remainder, no main loop |
| n = simd_width | Exactly one SIMD iteration |
| n = simd_width + 1 | One SIMD iteration plus one remainder element |
| Large n | Verify no overflow in loop counters |
| Unaligned pointers | Verify element_aligned works |
| Negative values | Verify signed operations are correct |
| Edge floats (NaN, Inf) | Verify special values are handled |

Common Pitfalls

Remainder Handling

| Problem | Symptom | Solution |
|---------|---------|----------|
| Missing remainder loop | Wrong results when n is not divisible by simd_width | Always add a scalar fallback after the main SIMD loop |
| Wrong loop condition | Buffer overrun or missing elements | Use i + simd_width <= n, not i < n |
| Off-by-one in remainder | Missing last element | Remainder starts at i, not i + 1 |

// WRONG: Will read past end of array
for (size_t i = 0; i < n; i += simd_width) {  // BUG!
    simd_float va(&a[i], stdx::element_aligned);
    // If n=10 and simd_width=8, second iteration reads a[8..15]!
}

// CORRECT: Stop when there's not enough elements for full SIMD
for (size_t i = 0; i + simd_width <= n; i += simd_width) {
    simd_float va(&a[i], stdx::element_aligned);
}
// Handle remainder with scalar loop

Memory Alignment

| Problem | Symptom | Solution |
|---------|---------|----------|
| Using vector_aligned on unaligned data | Crash or wrong results | Use element_aligned for arbitrary pointers |
| Not aligning stack buffers | Slower performance | Use alignas(64) for temporary buffers |
| Loads crossing a cache line boundary | Performance degradation | Align data to 64 bytes when possible |

// BAD: Assumes input is aligned
simd_float va(&a[i], stdx::vector_aligned);  // May crash if a is not aligned!

// SAFE: Works with any alignment (may be slightly slower)
simd_float va(&a[i], stdx::element_aligned);

// BEST: Align your own buffers
alignas(64) float my_buffer[1024];
simd_float va(&my_buffer[i], stdx::vector_aligned);  // Safe and fast

Floating-Point Precision

| Problem | Symptom | Solution |
|---------|---------|----------|
| Expecting bit-exact results | Tests fail despite correct code | Use epsilon comparison, not == |
| Order-dependent precision | Different results from scalar | SIMD may reorder operations; this is expected |
| Accumulation error | Large errors with many elements | Use Kahan summation for high precision |

// BAD: Fails due to floating-point precision
assert(simd_result == scalar_result);

// GOOD: Allow small differences
assert(std::abs(simd_result - scalar_result) < 1e-5f);

// BETTER: Relative error for large values
assert(std::abs(simd_result - scalar_result) / std::abs(scalar_result) < 1e-5f);
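
The table above mentions Kahan summation. Here is a minimal compensated-sum sketch using the aliases from simd_types.hpp; note that -ffast-math (enabled in the sample CMakeLists) licenses the compiler to simplify the compensation away, so compile such a routine without it:

// Kahan-compensated SIMD sum (sketch)
float sum_kahan(const float* a, size_t n) {
    simd_float sum(0.0f), comp(0.0f);   // per-lane running sums and compensation terms
    size_t i = 0;

    for (; i + simd_width <= n; i += simd_width) {
        simd_float v(a + i, stdx::element_aligned);
        simd_float y = v - comp;   // apply the correction from the previous step
        simd_float t = sum + y;    // low-order bits of y may be lost here...
        comp = (t - sum) - y;      // ...recover them as the next correction
        sum = t;
    }

    float result = stdx::reduce(sum) - stdx::reduce(comp);

    for (; i < n; ++i) {
        result += a[i];   // remainder handled in plain scalar
    }
    return result;
}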

Compiler Auto-Vectorization Interference

| Problem | Symptom | Solution |
|---------|---------|----------|
| Compiler optimizes away the scalar baseline | Misleading benchmarks | Use volatile or benchmark::DoNotOptimize |
| Compiler auto-vectorizes the “scalar” code | No visible SIMD benefit | Check the assembly or disable vectorization with pragmas |
| Different results with -O3 | Confusion about what is being measured | Be explicit about compiler flags |

// BAD: Compiler may optimize away the computation
float sum = scalar_sum(data, n);
// (sum is never used, so entire function may be eliminated)

// GOOD: Prevent optimization
volatile float sum = scalar_sum(data, n);
(void)sum;  // Silence unused variable warning

// Or use benchmark library's DoNotOptimize
benchmark::DoNotOptimize(scalar_sum(data, n));

Extensions and Challenges

Beginner Extensions

  • Add more element-wise operations: Implement sub(), mul(), div(), abs(), neg()
  • Add sqrt(): Use SIMD square root instruction
  • Support double precision: Implement the same operations for double
  • Add min/max reductions: Find minimum and maximum elements

Intermediate Extensions

  • Fused Multiply-Add (FMA): Use FMA for better performance and precision
  • Masked operations: Handle conditions within SIMD (e.g., clamp values); see the sketch after this list
  • Complex numbers: SIMD operations on complex number arrays
  • Polynomial evaluation: Evaluate polynomials using Horner’s method with SIMD
  • Compare with intrinsics: Implement the same functions using AVX intrinsics directly
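
A small sketch of the masked-operations idea using stdx::where, which writes only the lanes selected by a mask (the TS also ships a clamp() overload for simd types that does the same in one call). Uses the aliases from simd_types.hpp:

// Masked lane-wise clamp (sketch)
simd_float clamp_lanes(simd_float v, float lo, float hi) {
    const simd_float vlo(lo), vhi(hi);   // broadcast the bounds to all lanes
    stdx::where(v < vlo, v) = vlo;       // write only the lanes below lo
    stdx::where(v > vhi, v) = vhi;       // write only the lanes above hi
    return v;
}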

Advanced Extensions

  • Cache-blocked matrix multiply: Implement blocking for better cache utilization
  • Multithreaded + SIMD: Combine OpenMP/threads with SIMD for maximum throughput
  • Convolution: Implement 1D/2D convolution with SIMD
  • FFT: Implement SIMD-optimized Fast Fourier Transform
  • Compare with BLAS/Eigen: Benchmark against optimized libraries
  • SVE/Scalable vectors: Experiment with ARM SVE’s length-agnostic approach

Expert Extensions

  • Custom SIMD allocator: Write an allocator that guarantees alignment
  • SIMD expression templates: Build a DSL that generates optimal SIMD code
  • Auto-tuning: Automatically select best SIMD width for given hardware
  • Profile-guided optimization: Use profiling to guide vectorization decisions

Resources

Essential Documentation

| Resource | What It Covers |
|----------|----------------|
| cppreference: std::experimental::simd | Official C++ SIMD library documentation |
| ModernesCpp: Data-Parallel Types | Practical tutorial on std::simd |
| Intel Intrinsics Guide | Complete reference for AVX/SSE intrinsics |
| Agner Fog’s Optimization Manuals | CPU architecture and optimization guides |

Books

| Book | Relevance |
|------|-----------|
| “Modern X86 Assembly Language Programming” - Daniel Kusswurm | Low-level SIMD, assembly understanding |
| “Computer Systems: A Programmer’s Perspective” - Bryant & O’Hallaron | CPU architecture, memory hierarchy |
| “Optimizing Software in C++” - Agner Fog (free PDF) | Practical optimization including SIMD |
| “Hacker’s Delight” - Henry S. Warren | Bit manipulation tricks useful for SIMD |

Code Examples and Libraries

| Resource | What to Learn |
|----------|---------------|
| xtensor | Expression templates + SIMD |
| Eigen | Production SIMD in a linear algebra library |
| xsimd | Standalone SIMD wrapper library |
| ISPC | Intel’s SIMD compiler |

Tools for Analysis

# View SIMD instructions in compiled binary
objdump -d ./benchmark | grep -E "vmov|vadd|vmul|vfma" | head -50

# Check which SIMD extensions your CPU supports
cat /proc/cpuinfo | grep flags | head -1 | tr ' ' '\n' | grep -E "sse|avx"

# Generate assembly from source
g++ -O3 -march=native -S vector_ops.cpp -o vector_ops.s

# Use Compiler Explorer (online)
# https://godbolt.org - paste code and see generated assembly

Self-Assessment Checklist

Before considering this project complete, verify:

Conceptual Understanding

  • I can explain what SIMD is and why it provides speedups
  • I understand the difference between horizontal and vertical operations
  • I can explain why dot product doesn’t achieve full 8x speedup
  • I understand memory alignment and its impact on SIMD performance
  • I can explain when to use native_simd vs fixed_size_simd
  • I understand the trade-offs between auto-vectorization and manual SIMD

Implementation

  • Vector addition, subtraction, multiplication work correctly
  • Dot product produces correct results (within floating-point tolerance)
  • All functions handle remainder elements correctly
  • Code compiles with -O3 -march=native without warnings
  • Functions work with any array size (including 0, 1, and non-multiples of SIMD width)
  • Functions work with unaligned input pointers

Performance

  • Vector addition achieves 3-4x speedup over scalar
  • Dot product achieves 5-7x speedup over scalar
  • Benchmarks are properly designed (warmup, multiple iterations, results not optimized away)
  • Performance scales appropriately with input size
  • I understand why my speedup is less than theoretical maximum

Testing

  • All functions pass correctness tests
  • Edge cases tested (size 0, 1, simd_width-1, simd_width, simd_width+1)
  • Alignment edge cases tested
  • Results verified against scalar reference implementation

Code Quality

  • Code is well-organized with proper headers
  • SIMD types are abstracted for portability
  • No undefined behavior (buffer overruns, etc.)
  • Comments explain non-obvious SIMD patterns
  • CMake build system works correctly

Submission/Completion Criteria

Minimum Viable Completion

  • Vector add, subtract, multiply implemented with SIMD
  • Sum reduction implemented with SIMD
  • Dot product implemented with SIMD
  • All basic tests pass
  • Benchmark shows measurable speedup (at least 2x for vector ops)

Full Completion

  • All vector operations implemented (add, sub, mul, div, sqrt)
  • All reductions implemented (sum, dot, min, max)
  • Matrix-vector multiply implemented
  • RGB to grayscale implemented
  • Comprehensive test suite passes
  • Benchmark suite with comparison to scalar
  • Documentation of performance characteristics

Excellence (Going Above and Beyond)

  • Matrix-matrix multiply with cache blocking
  • Comparison with AVX intrinsics implementation
  • Comparison with OpenBLAS/Eigen
  • Multi-threaded + SIMD combination
  • Performance analysis with perf/VTune
  • Support for multiple SIMD architectures (SSE, AVX, AVX-512)
  • Expression template system for lazy evaluation

The Core Question You’re Answering

“How can we leverage the CPU’s SIMD capabilities to process data in parallel within a single thread, and what are the practical considerations (alignment, remainders, horizontal operations) that determine real-world performance?”

SIMD represents data-level parallelism that is orthogonal to thread-level parallelism. Understanding SIMD is essential for anyone who wants to write high-performance code, whether for scientific computing, games, machine learning, or media processing. By building this library, you will develop intuition for when SIMD helps, when it doesn’t, and how to write portable SIMD code that runs efficiently across different CPU architectures.


Hints in Layers

Use these hints progressively if you get stuck.

Hint Layer 1: Getting Started

  • Start with just vector addition. Get the main loop working first.
  • Print simd_width to understand your platform’s capabilities.
  • Test with small arrays (size 16-20) where you can manually verify results.
  • Use element_aligned for all operations initially; optimize alignment later.

Hint Layer 2: The Main Loop Pattern

size_t i = 0;

// Main SIMD loop: process simd_width elements per iteration
for (; i + simd_width <= n; i += simd_width) {
    // Load simd_width elements into SIMD register
    // Perform operation
    // Store simd_width elements back
}

// Remainder: handle leftover elements with scalar code
for (; i < n; ++i) {
    // Scalar operation
}

Hint Layer 3: Reduction Pattern

// Use a SIMD register as accumulator
simd_float accumulator(0.0f);  // Initialize all lanes to zero

for (; i + simd_width <= n; i += simd_width) {
    simd_float v(&a[i], stdx::element_aligned);
    accumulator += v;  // Vertical add (fast)
}

// Horizontal reduce: sum all lanes
float result = stdx::reduce(accumulator);  // This is the expensive part

// Add remainder elements
for (; i < n; ++i) {
    result += a[i];
}

Hint Layer 4: Common Debugging Techniques

  • If results are wrong, first test with n = simd_width (no remainder)
  • If remainder is wrong, test with n = simd_width + 1
  • Use std::cout << va[i] to print individual SIMD lanes
  • Compare against a simple scalar implementation element-by-element
  • Check assembly output with objdump -d or Compiler Explorer


This guide was expanded from LEARN_CPP_CONCURRENCY_AND_PARALLELISM.md. For the complete learning path, see the project index.