Project 6: SIMD Throughput Explorer

Project Overview

Attribute             Details
Difficulty            Expert
Time Estimate         1 month+
Primary Language      C
Alternative Languages C++, Rust, Zig
Knowledge Area        SIMD and Vectorization
Tools Required        perf, compiler vectorization reports, objdump
Primary Reference     "Optimizing Software in C++" by Agner Fog

Learning Objectives

By completing this project, you will be able to:

  1. Explain SIMD fundamentals including vector registers, lane width, and instruction sets
  2. Write portable SIMD code using intrinsics and compiler auto-vectorization
  3. Identify vectorization opportunities in existing code
  4. Debug vectorization failures by interpreting compiler reports
  5. Measure and validate SIMD speedups with appropriate benchmarks
  6. Handle edge cases like unaligned data, remainder elements, and masking

Deep Theoretical Foundation

What Is SIMD?

SIMD (Single Instruction, Multiple Data) processes multiple data elements with a single instruction. Instead of:

Scalar (4 instructions):
add r1, a[0], b[0]
add r2, a[1], b[1]
add r3, a[2], b[2]
add r4, a[3], b[3]

SIMD does:

Vector (1 instruction):
vaddps ymm0, ymm1, ymm2  ; Add 8 floats simultaneously

SIMD Instruction Sets Evolution

x86/x64 History:

┌─────────────────┬──────┬────────────────┬──────────────────────┐
│ Instruction Set │ Year │ Register Width │ Example Use          │
├─────────────────┼──────┼────────────────┼──────────────────────┤
│ MMX             │ 1997 │ 64-bit         │ 8×8-bit integers     │
│ SSE             │ 1999 │ 128-bit        │ 4×32-bit floats      │
│ SSE2            │ 2001 │ 128-bit        │ 2×64-bit doubles     │
│ SSE3/SSSE3      │ 2004 │ 128-bit        │ Horizontal ops       │
│ SSE4.1/4.2      │ 2008 │ 128-bit        │ String ops, blend    │
│ AVX             │ 2011 │ 256-bit        │ 8×32-bit floats      │
│ AVX2            │ 2013 │ 256-bit        │ 8×32-bit integers    │
│ AVX-512         │ 2017 │ 512-bit        │ 16×32-bit floats     │
└─────────────────┴──────┴────────────────┴──────────────────────┘

ARM NEON (mobile/embedded):

  • 128-bit vectors
  • Widely available on ARM Cortex-A

ARM SVE/SVE2 (servers):

  • Scalable vector length (128-2048 bits)
  • Runtime-determined width

Data Parallelism Requirements

For SIMD to work efficiently, your computation needs:

1. Independent Operations: Each lane must be independent; no lane reads another lane's result:

// Good: Each iteration independent
for (int i = 0; i < n; i++) {
    c[i] = a[i] + b[i];
}

// Bad: Loop-carried dependency
for (int i = 1; i < n; i++) {
    a[i] = a[i-1] + 1;  // Depends on previous iteration
}

2. Contiguous Memory Access: Vector loads and stores operate on consecutive elements:

// Good: Sequential access
for (int i = 0; i < n; i++) {
    sum += array[i];
}

// Bad: Strided access (gather required)
for (int i = 0; i < n; i += 8) {
    sum += array[i];  // Needs expensive gather
}
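
For reference, here is roughly what that strided sum looks like with an explicit AVX2 gather (a sketch; the function name is illustrative). On most cores _mm256_i32gather_ps issues one load per lane internally, so it rarely beats the scalar loop:

#include <immintrin.h>

// Sketch: strided sum (stride 8) using an AVX2 gather.
// vindex holds element offsets {0, 8, ..., 56}; scale 4 = sizeof(float).
float strided_sum_gather(const float *array, int n) {
    __m256i vindex = _mm256_setr_epi32(0, 8, 16, 24, 32, 40, 48, 56);
    __m256 vsum = _mm256_setzero_ps();
    int i = 0;
    for (; i + 64 <= n; i += 64) {   // 8 gathered lanes cover 64 indices
        vsum = _mm256_add_ps(vsum,
                             _mm256_i32gather_ps(&array[i], vindex, 4));
    }
    float lanes[8];
    _mm256_storeu_ps(lanes, vsum);   // fold the 8 lanes
    float sum = 0.0f;
    for (int k = 0; k < 8; k++) sum += lanes[k];
    for (; i < n; i += 8) sum += array[i];  // leftover strided elements
    return sum;
}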

3. Aligned Data: Best performance comes from aligned memory:

// Aligned allocation for 32-byte (AVX) alignment
// (C11 aligned_alloc expects the size to be a multiple of the alignment)
float *data = aligned_alloc(32, n * sizeof(float));

// Or compiler directive
float data[1024] __attribute__((aligned(32)));

Auto-Vectorization vs Intrinsics

Auto-Vectorization: Let the compiler do it

// Write normal loop, compiler vectorizes
for (int i = 0; i < n; i++) {
    c[i] = a[i] * b[i] + d[i];
}
// Compile with: gcc -O3 -march=native -ftree-vectorize

Intrinsics: Explicit vector operations

#include <immintrin.h>

// AVX2 explicit vectorization (assumes n is a multiple of 8 and
// 32-byte-aligned arrays; remainder handling is shown later)
for (int i = 0; i < n; i += 8) {
    __m256 va = _mm256_load_ps(&a[i]);
    __m256 vb = _mm256_load_ps(&b[i]);
    __m256 vd = _mm256_load_ps(&d[i]);
    __m256 vc = _mm256_fmadd_ps(va, vb, vd);  // FMA: a*b + d
    _mm256_store_ps(&c[i], vc);
}

When to Use Which:

  • Auto-vectorization: Start here, easier to maintain
  • Intrinsics: When compiler fails or for critical inner loops
  • Assembly: Rarely needed, compiler usually as good or better

Common Vectorization Blockers

1. Function Calls in Loop

for (int i = 0; i < n; i++) {
    result[i] = expensive_function(data[i]);  // Can't vectorize
}
// Solution: Inline the function or vectorize it

2. Pointer Aliasing

void add(float *a, float *b, float *c, int n) {
    for (int i = 0; i < n; i++) {
        c[i] = a[i] + b[i];  // Does c overlap a or b?
    }
}
// Solution: Use restrict keyword
void add(float * restrict a, float * restrict b, float * restrict c, int n)

3. Non-Contiguous Access

struct Point { float x, y, z; };
Point points[N];

for (int i = 0; i < N; i++) {
    points[i].x *= 2;  // x values are 12 bytes apart
}
// Solution: Structure of Arrays (SoA) instead of Array of Structures (AoS)
float x[N], y[N], z[N];  // Now x values are contiguous

4. Complex Control Flow

for (int i = 0; i < n; i++) {
    if (data[i] > threshold) {
        result[i] = process_a(data[i]);
    } else {
        result[i] = process_b(data[i]);
    }
}
// Solution: Masked operations (AVX-512) or compute both, blend
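
The compute-both-then-blend option looks like the sketch below, with hypothetical stand-ins for process_a and process_b (doubling and halving): every lane computes both results, then a comparison mask selects per lane.

#include <stddef.h>
#include <immintrin.h>

// Sketch: branch-free select. Doubling/halving stand in for the real work.
void select_blend_avx2(const float *data, float *result,
                       float threshold, size_t n) {
    __m256 vthresh = _mm256_set1_ps(threshold);
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 v  = _mm256_loadu_ps(&data[i]);
        __m256 ra = _mm256_add_ps(v, v);                     // "process_a": 2*x
        __m256 rb = _mm256_mul_ps(v, _mm256_set1_ps(0.5f));  // "process_b": x/2
        __m256 mask = _mm256_cmp_ps(v, vthresh, _CMP_GT_OQ);
        // blendv picks ra in lanes where data[i] > threshold, else rb
        _mm256_storeu_ps(&result[i], _mm256_blendv_ps(rb, ra, mask));
    }
    for (; i < n; i++)  // scalar remainder
        result[i] = data[i] > threshold ? 2.0f * data[i] : 0.5f * data[i];
}

This pays for both computations in every lane, so it only wins when the two paths are cheap or the branch is unpredictable.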

Complete Project Specification

What You're Building

An experimental toolkit called simd_lab that:

  1. Benchmarks scalar vs vectorized code for common operations
  2. Measures vectorization efficiency (achieved vs theoretical speedup)
  3. Tests alignment impact on SIMD performance
  4. Compares auto-vectorization vs intrinsics
  5. Generates vectorization diagnostics explaining success/failure

Functional Requirements

simd_lab benchmark --operation <name> --size <n> --iterations <n>
simd_lab compare --scalar --auto-vec --intrinsic --size <n>
simd_lab align-test --aligned --unaligned --size <n>
simd_lab diagnose --source <file.c> --function <name>
simd_lab report --output <report.md>

Operations to Implement

  1. vector_add: c[i] = a[i] + b[i]
  2. vector_mul: c[i] = a[i] * b[i]
  3. vector_fma: c[i] = a[i] * b[i] + d[i] (fused multiply-add)
  4. sum_reduction: sum = Σ a[i]
  5. dot_product: dot = Σ a[i] * b[i]
  6. saxpy: y[i] = a * x[i] + y[i]
  7. normalize: Euclidean norm and scaling

Example Output

SIMD Throughput Report
โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•
Array size: 10,000,000 elements (float32)
CPU: Intel Core i7-12700K (AVX2)
Vector width: 256-bit (8 floats/vector)

Operation: vector_fma (c = a * b + d)
โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
Version          Time      GFLOPs    Speedup    Efficiency
scalar           45.2 ms   0.44      1.0x       12.5%
auto-vec         6.8 ms    2.94      6.6x       83%
intrinsics       5.9 ms    3.39      7.7x       96%
theoretical      5.6 ms    3.57      8.0x       100%

Alignment Impact:
โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
aligned (32-byte)    5.9 ms    1.00x baseline
unaligned            7.2 ms    0.82x (18% slower)

Vectorization Analysis:
โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
โœ“ Auto-vectorization succeeded for vector_fma
โœ“ Loop unrolled 8x (matching vector width)
โœ“ FMA instruction used (_mm256_fmadd_ps)
โš  Remainder loop uses scalar (last 4 elements)

Recommendations:
โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
โ€ข Align data to 32 bytes for AVX2
โ€ข Pad array length to multiple of 8
โ€ข Consider AVX-512 for 2x theoretical improvement

Solution Architecture

Component Design

┌──────────────────────────────────────────────────────────────┐
│                        CLI Interface                         │
│           Operation selection, size configuration            │
└──────────────────────────────┬───────────────────────────────┘
                               │
              ┌────────────────┼────────────────┐
              │                │                │
              ▼                ▼                ▼
       ┌─────────────┐  ┌─────────────┐  ┌─────────────┐
       │ Scalar      │  │ Auto-Vec    │  │ Intrinsic   │
       │ Kernels     │  │ Kernels     │  │ Kernels     │
       └──────┬──────┘  └──────┬──────┘  └──────┬──────┘
              │                │                │
              └────────────────┼────────────────┘
                               │
                               ▼
               ┌────────────────────────────────┐
               │       Benchmark Harness        │
               │  Warmup → Measure → Validate   │
               └───────────────┬────────────────┘
                               │
              ┌────────────────┼────────────────┐
              │                │                │
              ▼                ▼                ▼
       ┌─────────────┐  ┌─────────────┐  ┌──────────────┐
       │ Timer       │  │ Counter     │  │ Validator    │
       │ Engine      │  │ Collector   │  │ (Correctness)│
       └─────────────┘  └─────────────┘  └──────────────┘

Key Data Structures

// SIMD kernel function signature
typedef void (*simd_kernel_t)(const float *a, const float *b,
                               float *c, size_t n);

typedef struct {
    const char *name;
    simd_kernel_t scalar_fn;
    simd_kernel_t autovec_fn;
    simd_kernel_t intrinsic_fn;
    size_t flops_per_element;  // For GFLOP calculation
    const char *description;
} operation_t;

// Benchmark result
typedef struct {
    double wall_time_ms;
    double gflops;
    double speedup_vs_scalar;
    double efficiency;  // % of theoretical peak
    int vectorized;     // Did compiler vectorize?
    int vector_width;   // Detected vector width
} simd_result_t;

// Memory allocation with alignment tracking
typedef struct {
    void *ptr;
    size_t size;
    size_t alignment;
    int is_aligned;
} aligned_buffer_t;
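
As a usage sketch, one entry in the operation table might be filled in as below (vector_add fits the three-array signature; the FMA kernels take a fourth array, so a real table would use wrappers or a wider signature):

#include <immintrin.h>

// Scalar vector_add matching simd_kernel_t: c[i] = a[i] + b[i].
static void add_scalar(const float *a, const float *b, float *c, size_t n) {
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

// AVX2 vector_add with unaligned loads and a scalar remainder.
static void add_avx2(const float *a, const float *b, float *c, size_t n) {
    size_t i = 0;
    for (; i + 8 <= n; i += 8)
        _mm256_storeu_ps(&c[i], _mm256_add_ps(_mm256_loadu_ps(&a[i]),
                                              _mm256_loadu_ps(&b[i])));
    for (; i < n; i++)
        c[i] = a[i] + b[i];
}

static const operation_t operations[] = {
    { "vector_add",
      add_scalar,
      add_scalar,   // auto-vec: same source, built with -O3 -march=native
      add_avx2,
      1,            // one add per element, for GFLOP math
      "c[i] = a[i] + b[i]" },
};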

Kernel Implementations

// ===== VECTOR FMA: c = a * b + d =====

// Scalar version
void fma_scalar(const float *a, const float *b, const float *d,
                float *c, size_t n) {
    for (size_t i = 0; i < n; i++) {
        c[i] = a[i] * b[i] + d[i];
    }
}

// Auto-vectorization friendly (restrict pointers, simple loop)
void fma_autovec(const float * restrict a, const float * restrict b,
                 const float * restrict d, float * restrict c, size_t n) {
    for (size_t i = 0; i < n; i++) {
        c[i] = a[i] * b[i] + d[i];
    }
}

// AVX2 intrinsics (aligned loads: a, b, c, d must be 32-byte aligned,
// or switch to _mm256_loadu_ps/_mm256_storeu_ps)
void fma_avx2(const float *a, const float *b, const float *d,
              float *c, size_t n) {
    size_t i = 0;
    // Main vectorized loop (8 floats per iteration)
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_load_ps(&a[i]);
        __m256 vb = _mm256_load_ps(&b[i]);
        __m256 vd = _mm256_load_ps(&d[i]);
        __m256 vc = _mm256_fmadd_ps(va, vb, vd);
        _mm256_store_ps(&c[i], vc);
    }
    // Remainder (scalar cleanup)
    for (; i < n; i++) {
        c[i] = a[i] * b[i] + d[i];
    }
}

// ===== SUM REDUCTION =====

// Scalar
float sum_scalar(const float *a, size_t n) {
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++) {
        sum += a[i];
    }
    return sum;
}

// AVX2 with horizontal reduction (32-byte-aligned input assumed)
float sum_avx2(const float *a, size_t n) {
    __m256 vsum = _mm256_setzero_ps();
    size_t i = 0;

    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_load_ps(&a[i]);
        vsum = _mm256_add_ps(vsum, va);
    }

    // Horizontal sum of 8 floats
    __m128 hi = _mm256_extractf128_ps(vsum, 1);
    __m128 lo = _mm256_castps256_ps128(vsum);
    __m128 sum128 = _mm_add_ps(hi, lo);
    sum128 = _mm_hadd_ps(sum128, sum128);
    sum128 = _mm_hadd_ps(sum128, sum128);

    float sum = _mm_cvtss_f32(sum128);

    // Remainder
    for (; i < n; i++) {
        sum += a[i];
    }
    return sum;
}
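
For the dot_product operation listed earlier, the same two ideas combine: FMA accumulation in the main loop and the horizontal fold from sum_avx2 at the end. A sketch, using unaligned loads for safety:

// AVX2 dot product: per-lane FMA accumulation, then horizontal fold
float dot_avx2(const float *a, const float *b, size_t n) {
    __m256 vacc = _mm256_setzero_ps();
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(&a[i]);
        __m256 vb = _mm256_loadu_ps(&b[i]);
        vacc = _mm256_fmadd_ps(va, vb, vacc);  // vacc += a * b, per lane
    }
    __m128 s = _mm_add_ps(_mm256_castps256_ps128(vacc),
                          _mm256_extractf128_ps(vacc, 1));  // 8 -> 4 lanes
    s = _mm_hadd_ps(s, s);                                  // 4 -> 2
    s = _mm_hadd_ps(s, s);                                  // 2 -> 1
    float dot = _mm_cvtss_f32(s);
    for (; i < n; i++)                                      // scalar remainder
        dot += a[i] * b[i];
    return dot;
}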

Phased Implementation Guide

Phase 1: Scalar and Auto-Vec Comparison (Week 1)

Goal: Establish baseline and measure auto-vectorization.

Steps:

  1. Implement scalar kernels for all operations
  2. Compile with vectorization flags: -O3 -march=native -ftree-vectorize
  3. Use compiler reports to verify vectorization:
    gcc -O3 -march=native -fopt-info-vec-all -c kernels.c 2>&1 | grep vectorized
    
  4. Benchmark scalar vs auto-vec (a timing sketch follows this phase)
  5. Calculate theoretical speedup (8x for AVX2 floats)

Validation: Auto-vec should achieve 4-7x speedup on simple ops.
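
A minimal harness for the benchmarking step might look like this sketch (CLOCK_MONOTONIC wall time; bench_gflops and now_ms are illustrative names):

#define _POSIX_C_SOURCE 199309L
#include <time.h>

static double now_ms(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e3 + ts.tv_nsec / 1e6;
}

// Time a kernel over `iters` runs and report GFLOP/s:
// GFLOP/s = n * flops_per_element / (ms_per_run * 1e6)
double bench_gflops(simd_kernel_t fn, const float *a, const float *b,
                    float *c, size_t n, size_t flops_per_element, int iters) {
    fn(a, b, c, n);                        // warmup run
    double t0 = now_ms();
    for (int r = 0; r < iters; r++)
        fn(a, b, c, n);
    double ms = (now_ms() - t0) / iters;   // average per run
    return (double)(n * flops_per_element) / (ms * 1e6);
}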

Phase 2: Intrinsics Implementation (Week 2)

Goal: Write explicit AVX2 implementations.

Steps:

  1. Learn the AVX2 intrinsic naming convention (_mm256_fmadd_ps = 256-bit register, fused multiply-add, packed single-precision)
  2. Implement intrinsic versions of each kernel
  3. Handle unaligned loads/stores
  4. Implement remainder loop for non-vector-multiple sizes
  5. Verify correctness against scalar version

Validation: Intrinsics should match or beat auto-vec.

Phase 3: Alignment Studies (Week 3)

Goal: Quantify alignment impact.

Steps:

  1. Create aligned and unaligned buffers (see the sketch after this phase)
  2. Benchmark same kernel with different alignments
  3. Test unaligned load intrinsics vs aligned loads
  4. Measure impact at different sizes

Validation: Aligned should be 10-20% faster for memory-bound ops.
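
For the buffer-creation step, a deliberately misaligned view can be carved out of one over-sized aligned allocation; a sketch:

#include <stdint.h>
#include <stdlib.h>

// Round the byte count up so aligned_alloc's size-multiple rule is met.
size_t bytes = ((n + 8) * sizeof(float) + 31) / 32 * 32;
float *base      = aligned_alloc(32, bytes);
float *aligned   = base;        // 32-byte aligned
float *unaligned = base + 1;    // 4-byte offset: never 32-byte aligned
// Benchmark the same kernel on both views (use _mm256_loadu_ps so the
// misaligned run doesn't fault), and free(base), not the derived pointers.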

Phase 4: Complex Operations (Week 4)

Goal: Implement challenging patterns.

Steps:

  1. Reduction operations (sum, max, min); a max sketch follows this phase
  2. Horizontal operations (dot product)
  3. Conditional operations with masking
  4. Structure-of-arrays transformations

Validation: Reductions should still show meaningful speedup.
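
For the reduction step, a max reduction follows the sum pattern but folds with max instead of add; a sketch:

#include <float.h>
#include <immintrin.h>

// AVX2 max reduction: per-lane max, then pairwise folds down to one lane.
float max_avx2(const float *a, size_t n) {
    __m256 vmax = _mm256_set1_ps(-FLT_MAX);
    size_t i = 0;
    for (; i + 8 <= n; i += 8)
        vmax = _mm256_max_ps(vmax, _mm256_loadu_ps(&a[i]));
    __m128 m = _mm_max_ps(_mm256_castps256_ps128(vmax),
                          _mm256_extractf128_ps(vmax, 1));  // 8 -> 4 lanes
    m = _mm_max_ps(m, _mm_movehl_ps(m, m));                 // 4 -> 2 lanes
    m = _mm_max_ss(m, _mm_shuffle_ps(m, m, 1));             // 2 -> 1 lane
    float best = _mm_cvtss_f32(m);
    for (; i < n; i++)                                      // scalar remainder
        if (a[i] > best) best = a[i];
    return best;
}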

Phase 5: Diagnostics and Reporting (Week 4+)

Goal: Create analysis tools.

Steps:

  1. Parse compiler vectorization reports (see the sketch after this phase)
  2. Generate human-readable diagnostics
  3. Create comparison reports
  4. Add recommendations based on findings

Validation: Report accurately identifies vectorization blockers.
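
One lightweight approach to report parsing, assuming GCC's file-output form of -fopt-info (remarks look roughly like file:line:col: missed: reason):

# Dump missed-vectorization remarks, then summarize the most common reasons
gcc -O3 -march=native -fopt-info-vec-missed=missed.txt -c kernels.c
sort missed.txt | uniq -c | sort -rn | head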


Testing Strategy

Correctness Tests

  1. Reference comparison: Scalar and SIMD agree bit-exactly for elementwise add/mul; allow a small tolerance for FMA and reductions, whose rounding differs (checker sketch below)
  2. Edge cases: Size = 1, size = 7 (remainder), size = 8 (exact)
  3. Alignment variations: Test all 4-byte offset positions
  4. Large sizes: Verify no overflow in loop counters
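
A simple checker for the tolerance cases can combine absolute and relative error; a sketch (thresholds are illustrative):

#include <math.h>
#include <stdio.h>

// Loose enough to absorb FMA/reduction rounding, tight enough to catch bugs.
static int nearly_equal(float ref, float got) {
    float diff = fabsf(ref - got);
    return diff <= 1e-7f || diff <= 1e-5f * fabsf(ref);
}

int validate(const float *ref, const float *got, size_t n) {
    for (size_t i = 0; i < n; i++) {
        if (!nearly_equal(ref[i], got[i])) {
            fprintf(stderr, "mismatch at %zu: %g vs %g\n", i, ref[i], got[i]);
            return 0;
        }
    }
    return 1;
}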

Performance Tests

  1. Scaling: Performance should scale linearly with size
  2. Warmup stability: Results stable after warmup
  3. Theoretical peak: Compare against calculated maximum

Compiler Validation

  1. Assembly inspection: Verify expected instructions generated
  2. Multiple compilers: Test GCC, Clang, ICC
  3. Optimization levels: Compare -O2 vs -O3

Common Pitfalls and Debugging

Pitfall 1: Auto-Vectorization Silently Fails

Symptom: No speedup despite -O3 -ftree-vectorize.

Diagnosis:

# GCC verbose output
gcc -O3 -march=native -fopt-info-vec-missed -c code.c

# Clang output
clang -O3 -march=native -Rpass-missed=loop-vectorize -c code.c

Common Causes:

  • Pointer aliasing: Add restrict
  • Loop-carried dependency: Restructure algorithm
  • Function calls: Inline or vectorize called function
  • Complex control flow: Simplify or use masking

Pitfall 2: Segfault with Aligned Instructions

Symptom: Crash in _mm256_load_ps.

Cause: Using aligned load on unaligned memory.

Solution:

// Check alignment at runtime (uintptr_t is from <stdint.h>)
__m256 v;
if ((uintptr_t)ptr % 32 == 0) {
    v = _mm256_load_ps(ptr);   // Aligned load
} else {
    v = _mm256_loadu_ps(ptr);  // Unaligned load
}

// Or always use unaligned loads (at most slightly slower on modern CPUs)
v = _mm256_loadu_ps(ptr);

Pitfall 3: Incorrect Reduction

Symptom: Sum is wrong, especially for large arrays.

Cause: Floating-point reduction order differs.

Explanation: SIMD reduces in parallel lanes, then combines. Order differs from sequential addition, changing rounding.

Solution: Accept small differences, or use Kahan summation for high precision.
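
A compensated (Kahan) scalar reference, for when validation needs a more trustworthy baseline than naive summation; a sketch (compile without -ffast-math, which would optimize the compensation away):

// Kahan summation: carry each addition's rounding error in `comp`
// and feed it back into the next addition.
float sum_kahan(const float *a, size_t n) {
    float sum = 0.0f, comp = 0.0f;
    for (size_t i = 0; i < n; i++) {
        float y = a[i] - comp;   // re-inject previously lost low-order bits
        float t = sum + y;       // big + small: low bits of y are lost here...
        comp = (t - sum) - y;    // ...and recovered algebraically
        sum = t;
    }
    return sum;
}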

Pitfall 4: Speedup Less Than Expected

Symptom: 3x speedup instead of expected 8x.

Causes:

  1. Memory-bound: vector_add moves 12 bytes per FLOP (two loads, one store), so DRAM bandwidth, not the ALUs, caps throughput
  2. Overhead: Setup/cleanup dominate for small sizes
  3. Remainder loop: Too many elements in scalar remainder
  4. Dependencies: Latency-bound operations

Debug:

# Check if compute-bound or memory-bound
perf stat -e cycles,instructions,cache-misses ./simd_lab
# High cache-miss rate = memory-bound

Extensions and Challenges

Extension 1: AVX-512 Comparison

If hardware supports, implement AVX-512 versions:

  • 512-bit registers (16 floats)
  • Masking for conditional operations
  • Compare throughput vs AVX2

Extension 2: Multi-Platform SIMD

Create abstraction layer supporting:

  • x86 SSE/AVX/AVX-512
  • ARM NEON
  • Fallback scalar
  • Runtime dispatch based on CPU features
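
A dispatch skeleton for the last point, using GCC/Clang's builtin x86 feature checks (add_scalar/add_avx2 as in the earlier vector_add sketch; add_sse2 is hypothetical):

// Sketch: resolve the best available kernel once at startup (GCC/Clang, x86).
static simd_kernel_t resolve_add_kernel(void) {
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2")) return add_avx2;
    if (__builtin_cpu_supports("sse2")) return add_sse2;  // hypothetical
    return add_scalar;
}

// Cache the choice in a function pointer; every call site goes through it.
static simd_kernel_t add_dispatch;
void simd_lab_init(void) { add_dispatch = resolve_add_kernel(); }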

Extension 3: Auto-Tuner

Build auto-tuner that:

  • Tests multiple vector widths
  • Finds optimal unroll factor
  • Determines aligned vs unaligned breakeven
  • Generates optimized kernel

Challenge: SIMD-Unfriendly Algorithm

Take a fundamentally serial algorithm (e.g., linked list traversal) and:

  1. Explain why it resists vectorization
  2. Redesign data structure for vectorization
  3. Implement and measure improvement

Real-World Connections

Where SIMD Matters

  1. Image Processing: Pixel operations on millions of pixels
  2. Audio Processing: Filter banks, FFTs
  3. Machine Learning: Matrix operations in neural networks
  4. Scientific Computing: Physics simulations, linear algebra
  5. Video Encoding/Decoding: DCT, motion estimation
  6. Databases: Columnar data processing, SIMD scans

Industry Libraries

  • Intel MKL: Highly optimized linear algebra
  • OpenBLAS: Open-source BLAS with SIMD
  • FFTW: SIMD-optimized FFT
  • simdjson: SIMD JSON parser, several times faster than conventional scalar parsers
  • Highway: Google's portable SIMD library

Self-Assessment Checklist

Before considering this project complete, verify:

  • You can explain SIMD concepts (lanes, vector width, alignment)
  • You implemented both auto-vec and intrinsic versions
  • You achieved near-theoretical speedup for compute-bound ops
  • You measured and explained alignment impact
  • You can diagnose auto-vectorization failures
  • You handled remainder elements correctly
  • Results are validated for correctness

Resources

Essential Reading

  • "Optimizing Software in C++" by Agner Fog, Chapter 10
  • Intel Intrinsics Guide: https://www.intel.com/content/www/us/en/docs/intrinsics-guide/
  • "Computer Architecture: A Quantitative Approach" by Hennessy & Patterson (data-level parallelism chapter)

Reference Documentation

  • Intel 64 and IA-32 Optimization Reference Manual
  • ARM NEON Intrinsics Reference
  • GCC Auto-Vectorization Guide

Tools

  • Compiler Explorer (godbolt.org): See generated assembly
  • perf stat: Measure instruction throughput
  • Intel VTune: Detailed vectorization analysis
  • likwid: Performance counters including vector ops