Project 6: SIMD Throughput Explorer
Project Overview
| Attribute | Details |
|---|---|
| Difficulty | Expert |
| Time Estimate | 1 month+ |
| Primary Language | C |
| Alternative Languages | C++, Rust, Zig |
| Knowledge Area | SIMD and Vectorization |
| Tools Required | perf, compiler vectorization reports, objdump |
| Primary Reference | "Optimizing Software in C++" by Agner Fog |
Learning Objectives
By completing this project, you will be able to:
- Explain SIMD fundamentals including vector registers, lane width, and instruction sets
- Write portable SIMD code using intrinsics and compiler auto-vectorization
- Identify vectorization opportunities in existing code
- Debug vectorization failures by interpreting compiler reports
- Measure and validate SIMD speedups with appropriate benchmarks
- Handle edge cases like unaligned data, remainder elements, and masking
Deep Theoretical Foundation
What Is SIMD?
SIMD (Single Instruction, Multiple Data) processes multiple data elements with a single instruction. Instead of:
Scalar (4 instructions):
add r1, a[0], b[0]
add r2, a[1], b[1]
add r3, a[2], b[2]
add r4, a[3], b[3]
SIMD does:
Vector (1 instruction):
vaddps ymm0, ymm1, ymm2 ; Add 8 floats simultaneously
SIMD Instruction Sets Evolution
x86/x64 History:
| Instruction Set | Year | Register Width | Example Use |
|---|---|---|---|
| MMX | 1997 | 64-bit | 8×8-bit integers |
| SSE | 1999 | 128-bit | 4×32-bit floats |
| SSE2 | 2001 | 128-bit | 2×64-bit doubles |
| SSE3/SSSE3 | 2004 | 128-bit | Horizontal ops |
| SSE4.1/4.2 | 2008 | 128-bit | String ops, blend |
| AVX | 2011 | 256-bit | 8×32-bit floats |
| AVX2 | 2013 | 256-bit | 8×32-bit integers |
| AVX-512 | 2017 | 512-bit | 16×32-bit floats |
ARM NEON (mobile/embedded):
- 128-bit vectors
- Widely available on ARM Cortex-A
ARM SVE/SVE2 (servers):
- Scalable vector length (128-2048 bits)
- Runtime-determined width
Data Parallelism Requirements
For SIMD to work efficiently, your computation needs:
1. Independent Operations: Each lane must be independent; no lane may read another lane's result:
// Good: Each iteration independent
for (int i = 0; i < n; i++) {
c[i] = a[i] + b[i];
}
// Bad: Loop-carried dependency
for (int i = 1; i < n; i++) {
a[i] = a[i-1] + 1; // Depends on previous iteration
}
2. Contiguous Memory Access: Vector loads/stores work on consecutive elements:
// Good: Sequential access
for (int i = 0; i < n; i++) {
sum += array[i];
}
// Bad: Strided access (gather required)
for (int i = 0; i < n; i += 8) {
sum += array[i]; // Needs expensive gather
}
3. Aligned Data: SIMD performs best with aligned memory:
// Aligned allocation for 32-byte (AVX) alignment
float *data = aligned_alloc(32, n * sizeof(float));
// Or compiler directive
float data[1024] __attribute__((aligned(32)));
Auto-Vectorization vs Intrinsics
Auto-Vectorization: Let the compiler do it
// Write normal loop, compiler vectorizes
for (int i = 0; i < n; i++) {
c[i] = a[i] * b[i] + d[i];
}
// Compile with: gcc -O3 -march=native -ftree-vectorize
Intrinsics: Explicit vector operations
#include <immintrin.h>
// AVX2 explicit vectorization
for (int i = 0; i < n; i += 8) {
__m256 va = _mm256_load_ps(&a[i]);
__m256 vb = _mm256_load_ps(&b[i]);
__m256 vd = _mm256_load_ps(&d[i]);
__m256 vc = _mm256_fmadd_ps(va, vb, vd); // FMA: a*b+d
_mm256_store_ps(&c[i], vc);
}
When to Use Which:
- Auto-vectorization: Start here, easier to maintain
- Intrinsics: When compiler fails or for critical inner loops
- Assembly: Rarely needed, compiler usually as good or better
Common Vectorization Blockers
1. Function Calls in Loop
for (int i = 0; i < n; i++) {
result[i] = expensive_function(data[i]); // Can't vectorize
}
// Solution: Inline the function or vectorize it
2. Pointer Aliasing
void add(float *a, float *b, float *c, int n) {
for (int i = 0; i < n; i++) {
c[i] = a[i] + b[i]; // Does c overlap a or b?
}
}
// Solution: Use restrict keyword
void add(float * restrict a, float * restrict b, float * restrict c, int n)
3. Non-Contiguous Access
struct Point { float x, y, z; };
Point points[N];
for (int i = 0; i < N; i++) {
points[i].x *= 2; // x values are 12 bytes apart
}
// Solution: Structure of Arrays (SoA) instead of Array of Structures (AoS)
float x[N], y[N], z[N]; // Now x values are contiguous
4. Complex Control Flow
for (int i = 0; i < n; i++) {
if (data[i] > threshold) {
result[i] = process_a(data[i]);
} else {
result[i] = process_b(data[i]);
}
}
// Solution: Masked operations (AVX-512) or compute both, blend
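For the compute-both-and-blend approach, AVX2 can express the per-lane select directly with a compare and blendv. A minimal sketch, assuming both branches reduce to simple per-lane arithmetic (the *2 / *0.5 branch bodies and the function name are illustrative):
#include <immintrin.h>
#include <stddef.h>
// Branch-free version: compute both results for every lane, then pick per lane.
void blend_example(const float *data, float *result, size_t n, float threshold) {
    __m256 vthresh = _mm256_set1_ps(threshold);
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 v    = _mm256_loadu_ps(&data[i]);
        __m256 va   = _mm256_mul_ps(v, _mm256_set1_ps(2.0f));   // "then" branch
        __m256 vb   = _mm256_mul_ps(v, _mm256_set1_ps(0.5f));   // "else" branch
        __m256 mask = _mm256_cmp_ps(v, vthresh, _CMP_GT_OQ);    // lanes where data > threshold
        _mm256_storeu_ps(&result[i], _mm256_blendv_ps(vb, va, mask));
    }
    for (; i < n; i++)                                          // scalar remainder
        result[i] = data[i] > threshold ? data[i] * 2.0f : data[i] * 0.5f;
}
Both branches execute for every element, so this only pays off when the branch bodies are cheap; otherwise AVX-512 masking avoids the wasted work.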
Complete Project Specification
What You're Building
An experimental toolkit called simd_lab that:
- Benchmarks scalar vs vectorized code for common operations
- Measures vectorization efficiency (achieved vs theoretical speedup)
- Tests alignment impact on SIMD performance
- Compares auto-vectorization vs intrinsics
- Generates vectorization diagnostics explaining success/failure
Functional Requirements
simd_lab benchmark --operation <name> --size <n> --iterations <n>
simd_lab compare --scalar --auto-vec --intrinsic --size <n>
simd_lab align-test --aligned --unaligned --size <n>
simd_lab diagnose --source <file.c> --function <name>
simd_lab report --output <report.md>
Operations to Implement
- vector_add: c[i] = a[i] + b[i]
- vector_mul: c[i] = a[i] * b[i]
- vector_fma: c[i] = a[i] * b[i] + d[i] (fused multiply-add)
- sum_reduction: sum = Σ a[i]
- dot_product: dot = Σ a[i] * b[i]
- saxpy: y[i] = a * x[i] + y[i]
- normalize: Euclidean norm and scaling
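As a reference point, the scalar versions of the last two operations are only a few lines each. A minimal sketch (function names are illustrative, not part of the required CLI):
#include <math.h>
#include <stddef.h>
// saxpy: y[i] = a * x[i] + y[i]
void saxpy_scalar(float a, const float *x, float *y, size_t n) {
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
// normalize: divide every element by the Euclidean norm of the vector
void normalize_scalar(float *x, size_t n) {
    float sumsq = 0.0f;
    for (size_t i = 0; i < n; i++)
        sumsq += x[i] * x[i];
    float inv = 1.0f / sqrtf(sumsq);   // assumes a nonzero vector
    for (size_t i = 0; i < n; i++)
        x[i] *= inv;
}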
Example Output
SIMD Throughput Report
───────────────────────────────────────────────────
Array size: 10,000,000 elements (float32)
CPU: Intel Core i7-12700K (AVX2)
Vector width: 256-bit (8 floats/vector)
Operation: vector_fma (c = a * b + d)
───────────────────────────────────────────────────
Version       Time       GFLOP/s   Speedup   Efficiency
scalar        45.2 ms    0.44      1.0x      12.5%
auto-vec       6.8 ms    2.94      6.6x      83%
intrinsics     5.9 ms    3.39      7.7x      96%
theoretical    5.6 ms    3.57      8.0x      100%
Alignment Impact:
───────────────────────────────────────────────────
aligned (32-byte)    5.9 ms    1.00x baseline
unaligned            7.2 ms    0.82x (18% slower)
Vectorization Analysis:
───────────────────────────────────────────────────
✓ Auto-vectorization succeeded for vector_fma
✓ Loop unrolled 8x (matching vector width)
✓ FMA instruction used (_mm256_fmadd_ps)
⚠ Remainder handled by a scalar cleanup loop when size is not a multiple of 8
Recommendations:
───────────────────────────────────────────────────
• Align data to 32 bytes for AVX2
• Pad array length to multiple of 8
• Consider AVX-512 for 2x theoretical improvement
Solution Architecture
Component Design
┌────────────────────────────────────────────────────────────┐
│                       CLI Interface                        │
│          Operation selection, size configuration           │
└───────────────────────────┬────────────────────────────────┘
                            │
          ┌─────────────────┼─────────────────┐
          │                 │                 │
          ▼                 ▼                 ▼
   ┌─────────────┐   ┌─────────────┐   ┌─────────────┐
   │   Scalar    │   │  Auto-Vec   │   │  Intrinsic  │
   │   Kernels   │   │   Kernels   │   │   Kernels   │
   └──────┬──────┘   └──────┬──────┘   └──────┬──────┘
          │                 │                 │
          └─────────────────┼─────────────────┘
                            │
                            ▼
            ┌───────────────────────────────┐
            │       Benchmark Harness       │
            │  Warmup → Measure → Validate  │
            └───────────────┬───────────────┘
                            │
          ┌─────────────────┼─────────────────┐
          │                 │                 │
          ▼                 ▼                 ▼
   ┌─────────────┐   ┌─────────────┐   ┌──────────────┐
   │    Timer    │   │   Counter   │   │  Validator   │
   │   Engine    │   │  Collector  │   │ (Correctness)│
   └─────────────┘   └─────────────┘   └──────────────┘
Key Data Structures
// SIMD kernel function signature
typedef void (*simd_kernel_t)(const float *a, const float *b,
float *c, size_t n);
typedef struct {
const char *name;
simd_kernel_t scalar_fn;
simd_kernel_t autovec_fn;
simd_kernel_t intrinsic_fn;
size_t flops_per_element; // For GFLOP calculation
const char *description;
} operation_t;
// Benchmark result
typedef struct {
double wall_time_ms;
double gflops;
double speedup_vs_scalar;
double efficiency; // % of theoretical peak
int vectorized; // Did compiler vectorize?
int vector_width; // Detected vector width
} simd_result_t;
// Memory allocation with alignment tracking
typedef struct {
void *ptr;
size_t size;
size_t alignment;
int is_aligned;
} aligned_buffer_t;
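A sketch of how one entry of the operation table might be registered; the add_* kernel names are assumptions (only the fma_* kernels appear later in this document):
// Hypothetical kernels matching the simd_kernel_t signature above.
void add_scalar (const float *a, const float *b, float *c, size_t n);
void add_autovec(const float *a, const float *b, float *c, size_t n);
void add_avx2   (const float *a, const float *b, float *c, size_t n);
static const operation_t operations[] = {
    { .name = "vector_add",
      .scalar_fn = add_scalar, .autovec_fn = add_autovec, .intrinsic_fn = add_avx2,
      .flops_per_element = 1,                // one add per element
      .description = "c[i] = a[i] + b[i]" },
    // ... one entry per operation
};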
Kernel Implementations
// ===== VECTOR FMA: c = a * b + d =====
// Scalar version
void fma_scalar(const float *a, const float *b, const float *d,
float *c, size_t n) {
for (size_t i = 0; i < n; i++) {
c[i] = a[i] * b[i] + d[i];
}
}
// Auto-vectorization friendly (restrict pointers, simple loop)
void fma_autovec(const float * restrict a, const float * restrict b,
const float * restrict d, float * restrict c, size_t n) {
for (size_t i = 0; i < n; i++) {
c[i] = a[i] * b[i] + d[i];
}
}
// AVX2 intrinsics
void fma_avx2(const float *a, const float *b, const float *d,
float *c, size_t n) {
size_t i = 0;
// Main vectorized loop (8 floats per iteration)
for (; i + 8 <= n; i += 8) {
__m256 va = _mm256_load_ps(&a[i]);
__m256 vb = _mm256_load_ps(&b[i]);
__m256 vd = _mm256_load_ps(&d[i]);
__m256 vc = _mm256_fmadd_ps(va, vb, vd);
_mm256_store_ps(&c[i], vc);
}
// Remainder (scalar cleanup)
for (; i < n; i++) {
c[i] = a[i] * b[i] + d[i];
}
}
// ===== SUM REDUCTION =====
// Scalar
float sum_scalar(const float *a, size_t n) {
float sum = 0.0f;
for (size_t i = 0; i < n; i++) {
sum += a[i];
}
return sum;
}
// AVX2 with horizontal reduction
float sum_avx2(const float *a, size_t n) {
__m256 vsum = _mm256_setzero_ps();
size_t i = 0;
for (; i + 8 <= n; i += 8) {
__m256 va = _mm256_load_ps(&a[i]);
vsum = _mm256_add_ps(vsum, va);
}
// Horizontal sum of 8 floats
__m128 hi = _mm256_extractf128_ps(vsum, 1);
__m128 lo = _mm256_castps256_ps128(vsum);
__m128 sum128 = _mm_add_ps(hi, lo);
sum128 = _mm_hadd_ps(sum128, sum128);
sum128 = _mm_hadd_ps(sum128, sum128);
float sum = _mm_cvtss_f32(sum128);
// Remainder
for (; i < n; i++) {
sum += a[i];
}
return sum;
}
Phased Implementation Guide
Phase 1: Scalar and Auto-Vec Comparison (Week 1)
Goal: Establish baseline and measure auto-vectorization.
Steps:
- Implement scalar kernels for all operations
- Compile with vectorization flags: -O3 -march=native -ftree-vectorize
- Use compiler reports to verify vectorization:
gcc -O3 -march=native -fopt-info-vec-all -c kernels.c 2>&1 | grep vectorized
- Benchmark scalar vs auto-vec
- Calculate theoretical speedup (8x for AVX2 floats)
Validation: Auto-vec should achieve 4-7x speedup on simple ops.
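For the benchmarking step, a bare-bones wall-clock harness is enough to get started. A sketch using POSIX clock_gettime; the function name and warmup policy are illustrative:
#include <time.h>
// Average time per kernel invocation in milliseconds (CLOCK_MONOTONIC, POSIX).
static double time_kernel_ms(simd_kernel_t fn, const float *a, const float *b,
                             float *c, size_t n, int iterations) {
    struct timespec t0, t1;
    fn(a, b, c, n);                          // warmup run: touch pages, warm caches
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int it = 0; it < iterations; it++)
        fn(a, b, c, n);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double ms = (t1.tv_sec - t0.tv_sec) * 1e3 + (t1.tv_nsec - t0.tv_nsec) / 1e6;
    return ms / iterations;
}
GFLOP/s then follows from flops_per_element * n / (ms * 1e6).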
Phase 2: Intrinsics Implementation (Week 2)
Goal: Write explicit AVX2 implementations.
Steps:
- Learn AVX2 intrinsics naming conventions
- Implement intrinsic versions of each kernel
- Handle unaligned loads/stores
- Implement remainder loop for non-vector-multiple sizes
- Verify correctness against scalar version
Validation: Intrinsics should match or beat auto-vec.
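The fma_avx2 kernel shown later uses aligned loads, which fault on unaligned pointers. For the unaligned-handling step, a variant that tolerates any alignment is a one-intrinsic change per load/store. A sketch with an illustrative name:
#include <immintrin.h>
#include <stddef.h>
// Same FMA kernel, but safe for arbitrarily aligned pointers.
void fma_avx2_unaligned(const float *a, const float *b, const float *d,
                        float *c, size_t n) {
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(&a[i]);   // loadu/storeu: no alignment requirement
        __m256 vb = _mm256_loadu_ps(&b[i]);
        __m256 vd = _mm256_loadu_ps(&d[i]);
        _mm256_storeu_ps(&c[i], _mm256_fmadd_ps(va, vb, vd));
    }
    for (; i < n; i++)                        // scalar remainder
        c[i] = a[i] * b[i] + d[i];
}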
Phase 3: Alignment Studies (Week 3)
Goal: Quantify alignment impact.
Steps:
- Create aligned and unaligned buffers
- Benchmark same kernel with different alignments
- Test unaligned load intrinsics vs aligned loads
- Measure impact at different sizes
Validation: Aligned should be 10-20% faster for memory-bound ops.
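For the aligned and deliberately misaligned buffers, one simple approach is to over-allocate an aligned block and hand out a pointer offset by a few floats. A sketch; make_offset_buffer and offset_floats are illustrative names:
#include <stdlib.h>
#include <stddef.h>
// offset_floats = 0 gives a 32-byte-aligned view; 1..7 give misaligned views.
// Caller keeps (and eventually frees) the base pointer returned via base_out.
static float *make_offset_buffer(size_t n, size_t offset_floats, void **base_out) {
    size_t bytes = (n + 8) * sizeof(float);
    bytes = (bytes + 31) & ~(size_t)31;       // round up: aligned_alloc wants a multiple of 32
    void *base = aligned_alloc(32, bytes);
    *base_out = base;
    return base ? (float *)base + offset_floats : NULL;
}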
Phase 4: Complex Operations (Week 4)
Goal: Implement challenging patterns.
Steps:
- Reduction operations (sum, max, min)
- Horizontal operations (dot product)
- Conditional operations with masking
- Structure-of-arrays transformations
Validation: Reductions should still show meaningful speedup.
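The dot product combines a per-lane FMA accumulator with the horizontal reduction shown in sum_avx2. A minimal AVX2 sketch:
#include <immintrin.h>
#include <stddef.h>
// dot = sum(a[i] * b[i]) using a vector accumulator plus a scalar remainder.
float dot_avx2(const float *a, const float *b, size_t n) {
    __m256 vacc = _mm256_setzero_ps();
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(&a[i]);
        __m256 vb = _mm256_loadu_ps(&b[i]);
        vacc = _mm256_fmadd_ps(va, vb, vacc);          // acc += a*b, per lane
    }
    __m128 lo = _mm256_castps256_ps128(vacc);          // horizontal sum of 8 lanes
    __m128 hi = _mm256_extractf128_ps(vacc, 1);
    __m128 s  = _mm_add_ps(lo, hi);
    s = _mm_hadd_ps(s, s);
    s = _mm_hadd_ps(s, s);
    float dot = _mm_cvtss_f32(s);
    for (; i < n; i++)                                 // scalar remainder
        dot += a[i] * b[i];
    return dot;
}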
Phase 5: Diagnostics and Reporting (Week 4+)
Goal: Create analysis tools.
Steps:
- Parse compiler vectorization reports
- Generate human-readable diagnostics
- Create comparison reports
- Add recommendations based on findings
Validation: Report accurately identifies vectorization blockers.
Testing Strategy
Correctness Tests
- Bit-exact comparison: Elementwise ops (add, mul) should match scalar exactly; FMA and reductions need a small tolerance because rounding order differs
- Edge cases: Size = 1, size = 7 (remainder), size = 8 (exact)
- Alignment variations: Test all 4-byte offset positions
- Large sizes: Verify no overflow in loop counters
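For the tolerance-based checks, a small comparison helper is enough; a sketch, with the tolerance left as a per-operation tuning knob:
#include <math.h>
#include <stddef.h>
#include <stdio.h>
// Compare SIMD output against the scalar reference with a relative tolerance.
// Returns the index of the first mismatch, or -1 if all elements agree.
static long compare_arrays(const float *ref, const float *got, size_t n, float rel_tol) {
    for (size_t i = 0; i < n; i++) {
        float diff  = fabsf(ref[i] - got[i]);
        float scale = fmaxf(fabsf(ref[i]), 1.0f);
        if (diff > rel_tol * scale) {
            fprintf(stderr, "mismatch at %zu: ref=%g got=%g\n", i, ref[i], got[i]);
            return (long)i;
        }
    }
    return -1;
}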
Performance Tests
- Scaling: Time should grow roughly linearly with size; watch for step changes at cache-size boundaries
- Warmup stability: Results stable after warmup
- Theoretical peak: Compare against calculated maximum
Compiler Validation
- Assembly inspection: Verify expected instructions generated
- Multiple compilers: Test GCC, Clang, ICC
- Optimization levels: Compare -O2 vs -O3
Common Pitfalls and Debugging
Pitfall 1: Auto-Vectorization Silently Fails
Symptom: No speedup despite -O3 -ftree-vectorize.
Diagnosis:
# GCC verbose output
gcc -O3 -march=native -fopt-info-vec-missed -c code.c
# Clang output
clang -O3 -march=native -Rpass-missed=loop-vectorize -c code.c
Common Causes:
- Pointer aliasing: Add restrict
- Function calls: Inline or vectorize called function
- Complex control flow: Simplify or use masking
Pitfall 2: Segfault with Aligned Instructions
Symptom: Crash in _mm256_load_ps.
Cause: Using aligned load on unaligned memory.
Solution:
// Check alignment at runtime
if ((uintptr_t)ptr % 32 == 0) {
_mm256_load_ps(ptr); // Aligned
} else {
_mm256_loadu_ps(ptr); // Unaligned
}
// Or always use unaligned (slightly slower)
_mm256_loadu_ps(ptr);
Pitfall 3: Incorrect Reduction
Symptom: Sum is wrong, especially for large arrays.
Cause: Floating-point reduction order differs.
Explanation: SIMD reduces in parallel lanes, then combines. Order differs from sequential addition, changing rounding.
Solution: Accept small differences, or use Kahan summation for high precision.
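A scalar Kahan summation makes a useful high-precision reference to validate against (and can itself be vectorized with one compensation term per lane). A minimal sketch; note that -ffast-math would optimize the compensation away:
#include <stddef.h>
// Kahan (compensated) summation: tracks the running rounding error in 'comp'.
float sum_kahan(const float *a, size_t n) {
    float sum = 0.0f, comp = 0.0f;
    for (size_t i = 0; i < n; i++) {
        float y = a[i] - comp;       // re-inject previously lost low-order bits
        float t = sum + y;           // sum grows; low-order bits of y may be lost
        comp = (t - sum) - y;        // recover what was lost in this step
        sum = t;
    }
    return sum;
}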
Pitfall 4: Speedup Less Than Expected
Symptom: 3x speedup instead of expected 8x.
Causes:
- Memory-bound: Memory bandwidth limits throughput
- Overhead: Setup/cleanup dominate for small sizes
- Remainder loop: Too many elements in scalar remainder
- Dependencies: Latency-bound operations
Debug:
# Check if compute-bound or memory-bound
perf stat -e cycles,instructions,cache-misses ./simd_lab
# High cache-miss rate = memory-bound
Extensions and Challenges
Extension 1: AVX-512 Comparison
If hardware supports, implement AVX-512 versions:
- 512-bit registers (16 floats)
- Masking for conditional operations
- Compare throughput vs AVX2
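One convenient AVX-512 feature for this project is that the scalar remainder loop can disappear: a mask covers the tail elements. A sketch, guarded so it only compiles where AVX-512F is available:
#if defined(__AVX512F__)
#include <immintrin.h>
#include <stddef.h>
// FMA over the whole array, including the tail, via a masked final iteration.
void fma_avx512(const float *a, const float *b, const float *d,
                float *c, size_t n) {
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        __m512 v = _mm512_fmadd_ps(_mm512_loadu_ps(&a[i]),
                                   _mm512_loadu_ps(&b[i]),
                                   _mm512_loadu_ps(&d[i]));
        _mm512_storeu_ps(&c[i], v);
    }
    if (i < n) {                                       // tail: mask off lanes >= n
        __mmask16 m = (__mmask16)((1u << (n - i)) - 1);
        __m512 va = _mm512_maskz_loadu_ps(m, &a[i]);
        __m512 vb = _mm512_maskz_loadu_ps(m, &b[i]);
        __m512 vd = _mm512_maskz_loadu_ps(m, &d[i]);
        _mm512_mask_storeu_ps(&c[i], m, _mm512_fmadd_ps(va, vb, vd));
    }
}
#endif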
Extension 2: Multi-Platform SIMD
Create abstraction layer supporting:
- x86 SSE/AVX/AVX-512
- ARM NEON
- Fallback scalar
- Runtime dispatch based on CPU features
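On x86 with GCC or Clang, runtime dispatch can use __builtin_cpu_supports; ARM needs a different mechanism (e.g. reading hwcaps). A minimal sketch with assumed kernel names:
#include <stddef.h>
// Assumed kernels (implemented elsewhere), all sharing the same signature.
void add_scalar(const float *, const float *, float *, size_t);
void add_sse   (const float *, const float *, float *, size_t);
void add_avx2  (const float *, const float *, float *, size_t);
typedef void (*add_fn_t)(const float *, const float *, float *, size_t);
static add_fn_t select_add_kernel(void) {
#if defined(__x86_64__) || defined(__i386__)
    if (__builtin_cpu_supports("avx2")) return add_avx2;   // best available first
    if (__builtin_cpu_supports("sse2")) return add_sse;
#endif
    return add_scalar;                                     // portable fallback
}
Selecting the kernel once at startup keeps the dispatch cost out of the hot loop.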
Extension 3: Auto-Tuner
Build auto-tuner that:
- Tests multiple vector widths
- Finds optimal unroll factor
- Determines aligned vs unaligned breakeven
- Generates optimized kernel
Challenge: SIMD-Unfriendly Algorithm
Take a fundamentally serial algorithm (e.g., linked list traversal) and:
- Explain why it resists vectorization
- Redesign data structure for vectorization
- Implement and measure improvement
Real-World Connections
Where SIMD Matters
- Image Processing: Pixel operations on millions of pixels
- Audio Processing: Filter banks, FFTs
- Machine Learning: Matrix operations in neural networks
- Scientific Computing: Physics simulations, linear algebra
- Video Encoding/Decoding: DCT, motion estimation
- Databases: Columnar data processing, SIMD scans
Industry Libraries
- Intel MKL: Highly optimized linear algebra
- OpenBLAS: Open-source BLAS with SIMD
- FFTW: SIMD-optimized FFT
- simdjson: SIMD JSON parser (parses gigabytes of JSON per second)
- Highway: Google's portable SIMD library
Self-Assessment Checklist
Before considering this project complete, verify:
- You can explain SIMD concepts (lanes, vector width, alignment)
- You implemented both auto-vec and intrinsic versions
- You achieved near-theoretical speedup for compute-bound ops
- You measured and explained alignment impact
- You can diagnose auto-vectorization failures
- You handled remainder elements correctly
- Results are validated for correctness
Resources
Essential Reading
- "Optimizing Software in C++" by Agner Fog, Chapter 10
- Intel Intrinsics Guide: https://www.intel.com/content/www/us/en/docs/intrinsics-guide/
- "Computer Architecture: A Quantitative Approach" by Hennessy & Patterson, Chapter 4 (Data-Level Parallelism)
Reference Documentation
- Intel 64 and IA-32 Optimization Reference Manual
- ARM NEON Intrinsics Reference
- GCC Auto-Vectorization Guide
Tools
- Compiler Explorer (godbolt.org): See generated assembly
- perf stat: Measure instruction throughput
- Intel VTune: Detailed vectorization analysis
- likwid: Performance counters including vector ops