Project 6: SIMD Throughput Explorer
Project Overview
| Attribute | Details |
|---|---|
| Difficulty | Expert |
| Time Estimate | 1 month+ |
| Primary Language | C |
| Alternative Languages | C++, Rust, Zig |
| Knowledge Area | SIMD and Vectorization |
| Tools Required | perf, compiler vectorization reports, objdump |
| Primary Reference | "Optimizing Software in C++" by Agner Fog |
Learning Objectives
By completing this project, you will be able to:
- Explain SIMD fundamentals including vector registers, lane width, and instruction sets
- Write portable SIMD code using intrinsics and compiler auto-vectorization
- Identify vectorization opportunities in existing code
- Debug vectorization failures by interpreting compiler reports
- Measure and validate SIMD speedups with appropriate benchmarks
- Handle edge cases like unaligned data, remainder elements, and masking
Deep Theoretical Foundation
What Is SIMD?
SIMD (Single Instruction, Multiple Data) processes multiple data elements with a single instruction. Instead of:
Scalar (4 instructions):
add r1, a[0], b[0]
add r2, a[1], b[1]
add r3, a[2], b[2]
add r4, a[3], b[3]
SIMD does:
Vector (1 instruction):
vaddps ymm0, ymm1, ymm2 ; Add 8 floats simultaneously
SIMD Instruction Sets Evolution
x86/x64 History:
| Instruction Set | Year | Register Width | Example Use |
|---|---|---|---|
| MMX | 1997 | 64-bit | 8×8-bit integers |
| SSE | 1999 | 128-bit | 4×32-bit floats |
| SSE2 | 2001 | 128-bit | 2×64-bit doubles |
| SSE3/SSSE3 | 2004 | 128-bit | Horizontal ops |
| SSE4.1/4.2 | 2008 | 128-bit | String ops, blend |
| AVX | 2011 | 256-bit | 8×32-bit floats |
| AVX2 | 2013 | 256-bit | 8×32-bit integers |
| AVX-512 | 2017 | 512-bit | 16×32-bit floats |
ARM NEON (mobile/embedded):
- 128-bit vectors
- Widely available on ARM Cortex-A
ARM SVE/SVE2 (servers):
- Scalable vector length (128-2048 bits)
- Runtime-determined width
Data Parallelism Requirements
For SIMD to work efficiently, your computation needs:
1. Independent Operations: Each lane must be independent; no lane may read another lane's result:
// Good: Each iteration independent
for (int i = 0; i < n; i++) {
c[i] = a[i] + b[i];
}
// Bad: Loop-carried dependency
for (int i = 1; i < n; i++) {
a[i] = a[i-1] + 1; // Depends on previous iteration
}
2. Contiguous Memory Access: Vector loads/stores work on consecutive elements:
// Good: Sequential access
for (int i = 0; i < n; i++) {
sum += array[i];
}
// Bad: Strided access (gather required)
for (int i = 0; i < n; i += 8) {
sum += array[i]; // Needs expensive gather
}
3. Aligned Data: SIMD performs best with aligned memory:
// Aligned allocation for 32-byte (AVX) alignment
float *data = aligned_alloc(32, n * sizeof(float));
// Or compiler directive
float data[1024] __attribute__((aligned(32)));
Auto-Vectorization vs Intrinsics
Auto-Vectorization: Let the compiler do it
// Write normal loop, compiler vectorizes
for (int i = 0; i < n; i++) {
c[i] = a[i] * b[i] + d[i];
}
// Compile with: gcc -O3 -march=native -ftree-vectorize
Intrinsics: Explicit vector operations
#include <immintrin.h>
// AVX2 explicit vectorization
for (int i = 0; i < n; i += 8) {
__m256 va = _mm256_load_ps(&a[i]);
__m256 vb = _mm256_load_ps(&b[i]);
__m256 vd = _mm256_load_ps(&d[i]);
__m256 vc = _mm256_fmadd_ps(va, vb, vd); // FMA: a*b+d
_mm256_store_ps(&c[i], vc);
}
When to Use Which:
- Auto-vectorization: Start here, easier to maintain
- Intrinsics: When compiler fails or for critical inner loops
- Assembly: Rarely needed, compiler usually as good or better
Common Vectorization Blockers
1. Function Calls in Loop
for (int i = 0; i < n; i++) {
result[i] = expensive_function(data[i]); // Can't vectorize
}
// Solution: Inline the function or vectorize it
2. Pointer Aliasing
void add(float *a, float *b, float *c, int n) {
for (int i = 0; i < n; i++) {
c[i] = a[i] + b[i]; // Does c overlap a or b?
}
}
// Solution: Use restrict keyword
void add(float * restrict a, float * restrict b, float * restrict c, int n)
3. Non-Contiguous Access
struct Point { float x, y, z; };
Point points[N];
for (int i = 0; i < N; i++) {
points[i].x *= 2; // x values are 12 bytes apart
}
// Solution: Structure of Arrays (SoA) instead of Array of Structures (AoS)
float x[N], y[N], z[N]; // Now x values are contiguous
4. Complex Control Flow
for (int i = 0; i < n; i++) {
if (data[i] > threshold) {
result[i] = process_a(data[i]);
} else {
result[i] = process_b(data[i]);
}
}
// Solution: Masked operations (AVX-512) or compute both, blend
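For the compute-both-and-blend approach, AVX2 can express the per-lane select directly with a compare and blendv. A minimal sketch, assuming both branches reduce to simple per-lane arithmetic (the *2 / *0.5 branch bodies and the function name are illustrative):
#include <immintrin.h>
#include <stddef.h>
// Branch-free version: compute both results for every lane, then pick per lane.
void blend_example(const float *data, float *result, size_t n, float threshold) {
    __m256 vthresh = _mm256_set1_ps(threshold);
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 v    = _mm256_loadu_ps(&data[i]);
        __m256 va   = _mm256_mul_ps(v, _mm256_set1_ps(2.0f));   // "then" branch
        __m256 vb   = _mm256_mul_ps(v, _mm256_set1_ps(0.5f));   // "else" branch
        __m256 mask = _mm256_cmp_ps(v, vthresh, _CMP_GT_OQ);    // lanes where data > threshold
        _mm256_storeu_ps(&result[i], _mm256_blendv_ps(vb, va, mask));
    }
    for (; i < n; i++)                                          // scalar remainder
        result[i] = data[i] > threshold ? data[i] * 2.0f : data[i] * 0.5f;
}
Both branches execute for every element, so this only pays off when the branch bodies are cheap; otherwise AVX-512 masking avoids the wasted work.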
Complete Project Specification
What You're Building
An experimental toolkit called simd_lab that:
- Benchmarks scalar vs vectorized code for common operations
- Measures vectorization efficiency (achieved vs theoretical speedup)
- Tests alignment impact on SIMD performance
- Compares auto-vectorization vs intrinsics
- Generates vectorization diagnostics explaining success/failure
Functional Requirements
simd_lab benchmark --operation <name> --size <n> --iterations <n>
simd_lab compare --scalar --auto-vec --intrinsic --size <n>
simd_lab align-test --aligned --unaligned --size <n>
simd_lab diagnose --source <file.c> --function <name>
simd_lab report --output <report.md>
Operations to Implement
- vector_add: c[i] = a[i] + b[i]
- vector_mul: c[i] = a[i] * b[i]
- vector_fma: c[i] = a[i] * b[i] + d[i] (fused multiply-add)
- sum_reduction: sum = Σ a[i]
- dot_product: dot = Σ a[i] * b[i]
- saxpy: y[i] = a * x[i] + y[i]
- normalize: Euclidean norm and scaling
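As a reference point, the scalar versions of the last two operations are only a few lines each. A minimal sketch (function names are illustrative, not part of the required CLI):
#include <math.h>
#include <stddef.h>
// saxpy: y[i] = a * x[i] + y[i]
void saxpy_scalar(float a, const float *x, float *y, size_t n) {
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
// normalize: divide every element by the Euclidean norm of the vector
void normalize_scalar(float *x, size_t n) {
    float sumsq = 0.0f;
    for (size_t i = 0; i < n; i++)
        sumsq += x[i] * x[i];
    float inv = 1.0f / sqrtf(sumsq);   // assumes a nonzero vector
    for (size_t i = 0; i < n; i++)
        x[i] *= inv;
}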
Example Output
SIMD Throughput Report
───────────────────────────────────────────────────
Array size: 10,000,000 elements (float32)
CPU: Intel Core i7-12700K (AVX2)
Vector width: 256-bit (8 floats/vector)
Operation: vector_fma (c = a * b + d)
───────────────────────────────────────────────────
Version       Time       GFLOP/s   Speedup   Efficiency
scalar        45.2 ms    0.44      1.0x      12.5%
auto-vec       6.8 ms    2.94      6.6x      83%
intrinsics     5.9 ms    3.39      7.7x      96%
theoretical    5.6 ms    3.57      8.0x      100%
Alignment Impact:
───────────────────────────────────────────────────
aligned (32-byte)    5.9 ms    1.00x baseline
unaligned            7.2 ms    0.82x (18% slower)
Vectorization Analysis:
───────────────────────────────────────────────────
✓ Auto-vectorization succeeded for vector_fma
✓ Loop unrolled 8x (matching vector width)
✓ FMA instruction used (_mm256_fmadd_ps)
⚠ Remainder handled by a scalar cleanup loop when size is not a multiple of 8
Recommendations:
───────────────────────────────────────────────────
• Align data to 32 bytes for AVX2
• Pad array length to multiple of 8
• Consider AVX-512 for 2x theoretical improvement
Solution Architecture
Component Design
┌────────────────────────────────────────────────────────────┐
│                       CLI Interface                        │
│          Operation selection, size configuration           │
└───────────────────────────┬────────────────────────────────┘
                            │
          ┌─────────────────┼─────────────────┐
          │                 │                 │
          ▼                 ▼                 ▼
   ┌─────────────┐   ┌─────────────┐   ┌─────────────┐
   │   Scalar    │   │  Auto-Vec   │   │  Intrinsic  │
   │   Kernels   │   │   Kernels   │   │   Kernels   │
   └──────┬──────┘   └──────┬──────┘   └──────┬──────┘
          │                 │                 │
          └─────────────────┼─────────────────┘
                            │
                            ▼
            ┌───────────────────────────────┐
            │       Benchmark Harness       │
            │  Warmup → Measure → Validate  │
            └───────────────┬───────────────┘
                            │
          ┌─────────────────┼─────────────────┐
          │                 │                 │
          ▼                 ▼                 ▼
   ┌─────────────┐   ┌─────────────┐   ┌──────────────┐
   │    Timer    │   │   Counter   │   │  Validator   │
   │   Engine    │   │  Collector  │   │ (Correctness)│
   └─────────────┘   └─────────────┘   └──────────────┘
Key Data Structures
// SIMD kernel function signature
typedef void (*simd_kernel_t)(const float *a, const float *b,
float *c, size_t n);
typedef struct {
const char *name;
simd_kernel_t scalar_fn;
simd_kernel_t autovec_fn;
simd_kernel_t intrinsic_fn;
size_t flops_per_element; // For GFLOP calculation
const char *description;
} operation_t;
// Benchmark result
typedef struct {
double wall_time_ms;
double gflops;
double speedup_vs_scalar;
double efficiency; // % of theoretical peak
int vectorized; // Did compiler vectorize?
int vector_width; // Detected vector width
} simd_result_t;
// Memory allocation with alignment tracking
typedef struct {
void *ptr;
size_t size;
size_t alignment;
int is_aligned;
} aligned_buffer_t;
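A sketch of how one entry of the operation table might be registered; the add_* kernel names are assumptions (only the fma_* kernels appear later in this document):
// Hypothetical kernels matching the simd_kernel_t signature above.
void add_scalar (const float *a, const float *b, float *c, size_t n);
void add_autovec(const float *a, const float *b, float *c, size_t n);
void add_avx2   (const float *a, const float *b, float *c, size_t n);
static const operation_t operations[] = {
    { .name = "vector_add",
      .scalar_fn = add_scalar, .autovec_fn = add_autovec, .intrinsic_fn = add_avx2,
      .flops_per_element = 1,                // one add per element
      .description = "c[i] = a[i] + b[i]" },
    // ... one entry per operation
};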
Kernel Implementations
// ===== VECTOR FMA: c = a * b + d =====
// Scalar version
void fma_scalar(const float *a, const float *b, const float *d,
float *c, size_t n) {
for (size_t i = 0; i < n; i++) {
c[i] = a[i] * b[i] + d[i];
}
}
// Auto-vectorization friendly (restrict pointers, simple loop)
void fma_autovec(const float * restrict a, const float * restrict b,
const float * restrict d, float * restrict c, size_t n) {
for (size_t i = 0; i < n; i++) {
c[i] = a[i] * b[i] + d[i];
}
}
// AVX2 intrinsics
void fma_avx2(const float *a, const float *b, const float *d,
float *c, size_t n) {
size_t i = 0;
// Main vectorized loop (8 floats per iteration)
for (; i + 8 <= n; i += 8) {
__m256 va = _mm256_load_ps(&a[i]);
__m256 vb = _mm256_load_ps(&b[i]);
__m256 vd = _mm256_load_ps(&d[i]);
__m256 vc = _mm256_fmadd_ps(va, vb, vd);
_mm256_store_ps(&c[i], vc);
}
// Remainder (scalar cleanup)
for (; i < n; i++) {
c[i] = a[i] * b[i] + d[i];
}
}
// ===== SUM REDUCTION =====
// Scalar
float sum_scalar(const float *a, size_t n) {
float sum = 0.0f;
for (size_t i = 0; i < n; i++) {
sum += a[i];
}
return sum;
}
// AVX2 with horizontal reduction
float sum_avx2(const float *a, size_t n) {
__m256 vsum = _mm256_setzero_ps();
size_t i = 0;
for (; i + 8 <= n; i += 8) {
__m256 va = _mm256_load_ps(&a[i]);
vsum = _mm256_add_ps(vsum, va);
}
// Horizontal sum of 8 floats
__m128 hi = _mm256_extractf128_ps(vsum, 1);
__m128 lo = _mm256_castps256_ps128(vsum);
__m128 sum128 = _mm_add_ps(hi, lo);
sum128 = _mm_hadd_ps(sum128, sum128);
sum128 = _mm_hadd_ps(sum128, sum128);
float sum = _mm_cvtss_f32(sum128);
// Remainder
for (; i < n; i++) {
sum += a[i];
}
return sum;
}
Phased Implementation Guide
Phase 1: Scalar and Auto-Vec Comparison (Week 1)
Goal: Establish baseline and measure auto-vectorization.
Steps:
- Implement scalar kernels for all operations
- Compile with vectorization flags: -O3 -march=native -ftree-vectorize
- Use compiler reports to verify vectorization:
gcc -O3 -march=native -fopt-info-vec-all -c kernels.c 2>&1 | grep vectorized
- Benchmark scalar vs auto-vec
- Calculate theoretical speedup (8x for AVX2 floats)
Validation: Auto-vec should achieve 4-7x speedup on simple ops.
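For the benchmarking step, a bare-bones wall-clock harness is enough to get started. A sketch using POSIX clock_gettime; the function name and warmup policy are illustrative:
#include <time.h>
// Average time per kernel invocation in milliseconds (CLOCK_MONOTONIC, POSIX).
static double time_kernel_ms(simd_kernel_t fn, const float *a, const float *b,
                             float *c, size_t n, int iterations) {
    struct timespec t0, t1;
    fn(a, b, c, n);                          // warmup run: touch pages, warm caches
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int it = 0; it < iterations; it++)
        fn(a, b, c, n);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double ms = (t1.tv_sec - t0.tv_sec) * 1e3 + (t1.tv_nsec - t0.tv_nsec) / 1e6;
    return ms / iterations;
}
GFLOP/s then follows from flops_per_element * n / (ms * 1e6).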
Phase 2: Intrinsics Implementation (Week 2)
Goal: Write explicit AVX2 implementations.
Steps:
- Learn AVX2 intrinsics naming conventions
- Implement intrinsic versions of each kernel
- Handle unaligned loads/stores
- Implement remainder loop for non-vector-multiple sizes
- Verify correctness against scalar version
Validation: Intrinsics should match or beat auto-vec.
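The fma_avx2 kernel shown later uses aligned loads, which fault on unaligned pointers. For the unaligned-handling step, a variant that tolerates any alignment is a one-intrinsic change per load/store. A sketch with an illustrative name:
#include <immintrin.h>
#include <stddef.h>
// Same FMA kernel, but safe for arbitrarily aligned pointers.
void fma_avx2_unaligned(const float *a, const float *b, const float *d,
                        float *c, size_t n) {
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(&a[i]);   // loadu/storeu: no alignment requirement
        __m256 vb = _mm256_loadu_ps(&b[i]);
        __m256 vd = _mm256_loadu_ps(&d[i]);
        _mm256_storeu_ps(&c[i], _mm256_fmadd_ps(va, vb, vd));
    }
    for (; i < n; i++)                        // scalar remainder
        c[i] = a[i] * b[i] + d[i];
}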
Phase 3: Alignment Studies (Week 3)
Goal: Quantify alignment impact.
Steps:
- Create aligned and unaligned buffers
- Benchmark same kernel with different alignments
- Test unaligned load intrinsics vs aligned loads
- Measure impact at different sizes
Validation: Aligned should be 10-20% faster for memory-bound ops.
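For the aligned and deliberately misaligned buffers, one simple approach is to over-allocate an aligned block and hand out a pointer offset by a few floats. A sketch; make_offset_buffer and offset_floats are illustrative names:
#include <stdlib.h>
#include <stddef.h>
// offset_floats = 0 gives a 32-byte-aligned view; 1..7 give misaligned views.
// Caller keeps (and eventually frees) the base pointer returned via base_out.
static float *make_offset_buffer(size_t n, size_t offset_floats, void **base_out) {
    size_t bytes = (n + 8) * sizeof(float);
    bytes = (bytes + 31) & ~(size_t)31;       // round up: aligned_alloc wants a multiple of 32
    void *base = aligned_alloc(32, bytes);
    *base_out = base;
    return base ? (float *)base + offset_floats : NULL;
}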
Phase 4: Complex Operations (Week 4)
Goal: Implement challenging patterns.
Steps:
- Reduction operations (sum, max, min)
- Horizontal operations (dot product)
- Conditional operations with masking
- Structure-of-arrays transformations
Validation: Reductions should still show meaningful speedup.
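The dot product combines a per-lane FMA accumulator with the horizontal reduction shown in sum_avx2. A minimal AVX2 sketch:
#include <immintrin.h>
#include <stddef.h>
// dot = sum(a[i] * b[i]) using a vector accumulator plus a scalar remainder.
float dot_avx2(const float *a, const float *b, size_t n) {
    __m256 vacc = _mm256_setzero_ps();
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(&a[i]);
        __m256 vb = _mm256_loadu_ps(&b[i]);
        vacc = _mm256_fmadd_ps(va, vb, vacc);          // acc += a*b, per lane
    }
    __m128 lo = _mm256_castps256_ps128(vacc);          // horizontal sum of 8 lanes
    __m128 hi = _mm256_extractf128_ps(vacc, 1);
    __m128 s  = _mm_add_ps(lo, hi);
    s = _mm_hadd_ps(s, s);
    s = _mm_hadd_ps(s, s);
    float dot = _mm_cvtss_f32(s);
    for (; i < n; i++)                                 // scalar remainder
        dot += a[i] * b[i];
    return dot;
}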
Phase 5: Diagnostics and Reporting (Week 4+)
Goal: Create analysis tools.
Steps:
- Parse compiler vectorization reports
- Generate human-readable diagnostics
- Create comparison reports
- Add recommendations based on findings
Validation: Report accurately identifies vectorization blockers.
Testing Strategy
Correctness Tests
- Bit-exact comparison: Elementwise ops (add, mul) should match scalar exactly; FMA and reductions need a small tolerance because rounding order differs
- Edge cases: Size = 1, size = 7 (remainder), size = 8 (exact)
- Alignment variations: Test all 4-byte offset positions
- Large sizes: Verify no overflow in loop counters
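For the tolerance-based checks, a small comparison helper is enough; a sketch, with the tolerance left as a per-operation tuning knob:
#include <math.h>
#include <stddef.h>
#include <stdio.h>
// Compare SIMD output against the scalar reference with a relative tolerance.
// Returns the index of the first mismatch, or -1 if all elements agree.
static long compare_arrays(const float *ref, const float *got, size_t n, float rel_tol) {
    for (size_t i = 0; i < n; i++) {
        float diff  = fabsf(ref[i] - got[i]);
        float scale = fmaxf(fabsf(ref[i]), 1.0f);
        if (diff > rel_tol * scale) {
            fprintf(stderr, "mismatch at %zu: ref=%g got=%g\n", i, ref[i], got[i]);
            return (long)i;
        }
    }
    return -1;
}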
Performance Tests
- Scaling: Time should grow roughly linearly with size; watch for step changes at cache-size boundaries
- Warmup stability: Results stable after warmup
- Theoretical peak: Compare against calculated maximum
Compiler Validation
- Assembly inspection: Verify expected instructions generated
- Multiple compilers: Test GCC, Clang, ICC
- Optimization levels: Compare -O2 vs -O3
Common Pitfalls and Debugging
Pitfall 1: Auto-Vectorization Silently Fails
Symptom: No speedup despite -O3 -ftree-vectorize.
Diagnosis:
# GCC verbose output
gcc -O3 -march=native -fopt-info-vec-missed -c code.c
# Clang output
clang -O3 -march=native -Rpass-missed=loop-vectorize -c code.c
Common Causes:
- Pointer aliasing: Add restrict
- Function calls: Inline or vectorize called function
- Complex control flow: Simplify or use masking
Pitfall 2: Segfault with Aligned Instructions
Symptom: Crash in _mm256_load_ps.
Cause: Using aligned load on unaligned memory.
Solution:
// Check alignment at runtime
if ((uintptr_t)ptr % 32 == 0) {
_mm256_load_ps(ptr); // Aligned
} else {
_mm256_loadu_ps(ptr); // Unaligned
}
// Or always use unaligned (slightly slower)
_mm256_loadu_ps(ptr);
Pitfall 3: Incorrect Reduction
Symptom: Sum is wrong, especially for large arrays.
Cause: Floating-point reduction order differs.
Explanation: SIMD reduces in parallel lanes, then combines. Order differs from sequential addition, changing rounding.
Solution: Accept small differences, or use Kahan summation for high precision.
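A scalar Kahan summation makes a useful high-precision reference to validate against (and can itself be vectorized with one compensation term per lane). A minimal sketch; note that -ffast-math would optimize the compensation away:
#include <stddef.h>
// Kahan (compensated) summation: tracks the running rounding error in 'comp'.
float sum_kahan(const float *a, size_t n) {
    float sum = 0.0f, comp = 0.0f;
    for (size_t i = 0; i < n; i++) {
        float y = a[i] - comp;       // re-inject previously lost low-order bits
        float t = sum + y;           // sum grows; low-order bits of y may be lost
        comp = (t - sum) - y;        // recover what was lost in this step
        sum = t;
    }
    return sum;
}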
Pitfall 4: Speedup Less Than Expected
Symptom: 3x speedup instead of expected 8x.
Causes:
- Memory-bound: Memory bandwidth limits throughput
- Overhead: Setup/cleanup dominate for small sizes
- Remainder loop: Too many elements in scalar remainder
- Dependencies: Latency-bound operations
Debug:
# Check if compute-bound or memory-bound
perf stat -e cycles,instructions,cache-misses ./simd_lab
# High cache-miss rate = memory-bound
Extensions and Challenges
Extension 1: AVX-512 Comparison
If hardware supports, implement AVX-512 versions:
- 512-bit registers (16 floats)
- Masking for conditional operations
- Compare throughput vs AVX2
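One convenient AVX-512 feature for this project is that the scalar remainder loop can disappear: a mask covers the tail elements. A sketch, guarded so it only compiles where AVX-512F is available:
#if defined(__AVX512F__)
#include <immintrin.h>
#include <stddef.h>
// FMA over the whole array, including the tail, via a masked final iteration.
void fma_avx512(const float *a, const float *b, const float *d,
                float *c, size_t n) {
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        __m512 v = _mm512_fmadd_ps(_mm512_loadu_ps(&a[i]),
                                   _mm512_loadu_ps(&b[i]),
                                   _mm512_loadu_ps(&d[i]));
        _mm512_storeu_ps(&c[i], v);
    }
    if (i < n) {                                       // tail: mask off lanes >= n
        __mmask16 m = (__mmask16)((1u << (n - i)) - 1);
        __m512 va = _mm512_maskz_loadu_ps(m, &a[i]);
        __m512 vb = _mm512_maskz_loadu_ps(m, &b[i]);
        __m512 vd = _mm512_maskz_loadu_ps(m, &d[i]);
        _mm512_mask_storeu_ps(&c[i], m, _mm512_fmadd_ps(va, vb, vd));
    }
}
#endif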
Extension 2: Multi-Platform SIMD
Create abstraction layer supporting:
- x86 SSE/AVX/AVX-512
- ARM NEON
- Fallback scalar
- Runtime dispatch based on CPU features
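On x86 with GCC or Clang, runtime dispatch can use __builtin_cpu_supports; ARM needs a different mechanism (e.g. reading hwcaps). A minimal sketch with assumed kernel names:
#include <stddef.h>
// Assumed kernels (implemented elsewhere), all sharing the same signature.
void add_scalar(const float *, const float *, float *, size_t);
void add_sse   (const float *, const float *, float *, size_t);
void add_avx2  (const float *, const float *, float *, size_t);
typedef void (*add_fn_t)(const float *, const float *, float *, size_t);
static add_fn_t select_add_kernel(void) {
#if defined(__x86_64__) || defined(__i386__)
    if (__builtin_cpu_supports("avx2")) return add_avx2;   // best available first
    if (__builtin_cpu_supports("sse2")) return add_sse;
#endif
    return add_scalar;                                     // portable fallback
}
Selecting the kernel once at startup keeps the dispatch cost out of the hot loop.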
Extension 3: Auto-Tuner
Build auto-tuner that:
- Tests multiple vector widths
- Finds optimal unroll factor
- Determines aligned vs unaligned breakeven
- Generates optimized kernel
Challenge: SIMD-Unfriendly Algorithm
Take a fundamentally serial algorithm (e.g., linked list traversal) and:
- Explain why it resists vectorization
- Redesign data structure for vectorization
- Implement and measure improvement
Real-World Connections
Where SIMD Matters
- Image Processing: Pixel operations on millions of pixels
- Audio Processing: Filter banks, FFTs
- Machine Learning: Matrix operations in neural networks
- Scientific Computing: Physics simulations, linear algebra
- Video Encoding/Decoding: DCT, motion estimation
- Databases: Columnar data processing, SIMD scans
Industry Libraries
- Intel MKL: Highly optimized linear algebra
- OpenBLAS: Open-source BLAS with SIMD
- FFTW: SIMD-optimized FFT
- simdjson: SIMD JSON parser (parses gigabytes of JSON per second)
- Highway: Google's portable SIMD library
Self-Assessment Checklist
Before considering this project complete, verify:
- You can explain SIMD concepts (lanes, vector width, alignment)
- You implemented both auto-vec and intrinsic versions
- You achieved near-theoretical speedup for compute-bound ops
- You measured and explained alignment impact
- You can diagnose auto-vectorization failures
- You handled remainder elements correctly
- Results are validated for correctness
Resources
Essential Reading
- "Optimizing Software in C++" by Agner Fog, Chapter 10
- Intel Intrinsics Guide: https://www.intel.com/content/www/us/en/docs/intrinsics-guide/
- "Computer Architecture: A Quantitative Approach" by Hennessy & Patterson, Chapter 4 (Data-Level Parallelism)
Reference Documentation
- Intel 64 and IA-32 Optimization Reference Manual
- ARM NEON Intrinsics Reference
- GCC Auto-Vectorization Guide
Tools
- Compiler Explorer (godbolt.org): See generated assembly
- perf stat: Measure instruction throughput
- Intel VTune: Detailed vectorization analysis
- likwid: Performance counters including vector ops