Project 9: L1 Bandwidth Stressor (Zen 5 focus)

Build a vectorized microbenchmark that saturates L1 data cache bandwidth and measures GB/s.

Quick Reference

Attribute Value
Difficulty Level 3: Advanced
Time Estimate 1-2 weeks
Main Programming Language C
Alternative Programming Languages C++ with intrinsics
Coolness Level Level 4: Hardcore Tech Flex
Business Potential 2. The “Micro-SaaS / Pro Tool”
Prerequisites C, SIMD basics, cache line concepts, timing
Key Topics L1 bandwidth, SIMD width, load/store ports, alignment, unrolling

1. Learning Objectives

By completing this project, you will:

  1. Explain how L1 bandwidth is limited by load/store ports.
  2. Write vectorized loops that approach peak L1 bandwidth.
  3. Measure GB/s and compare with theoretical limits.
  4. Understand alignment and unrolling effects on bandwidth.
  5. Produce a bandwidth report for multiple SIMD widths.

2. All Theory Needed (Per-Concept Breakdown)

2.1 L1 Data Cache Bandwidth and Port Limits

Fundamentals

L1 bandwidth is the rate at which the core can read and write data from the L1 cache. It is limited by the number of load and store ports and their width. Even if data is in L1, you cannot exceed the maximum bytes per cycle dictated by these ports. This means that code can be bandwidth-bound even without cache misses. Understanding L1 bandwidth helps you reason about why some vectorized loops plateau below theoretical peak.

Before going deeper, fix the simplest mental model: loads and stores move bytes between L1 and registers, the load/store ports are the non-negotiable constraint, and bytes per cycle is the unit of measurement. The kernel changes state (registers and arrays), the timer observes that state, and the port limit bounds everything else.

Deep Dive into the concept

The L1 data cache sits closest to the core and provides low-latency access (typically 4 cycles). But the bandwidth out of L1 is limited by how many loads and stores can be issued per cycle. For example, a core may sustain two 32-byte loads per cycle and one 32-byte store per cycle, giving a peak of 96 bytes per cycle. Multiply by core frequency to get GB/s. This is a hard limit even if the data is fully cached.

The effective bandwidth depends on instruction mix. A loop that only loads can achieve the load port peak. A loop that loads and stores may be limited by store ports or by the ratio of loads to stores. For streaming kernels like C[i] = A[i] + B[i], you have two loads and one store per iteration, which might saturate both load ports and the store port. If the store port is narrower or fewer, it becomes the bottleneck.

Alignment matters. An aligned load/store touches a single cache line, while an access that straddles a line boundary may require two cache-line accesses, doubling traffic and reducing effective bandwidth. Therefore, align arrays to cache-line boundaries and use aligned load/store intrinsics where possible. Additionally, unrolling the loop reduces loop overhead and lets the scheduler issue more loads/stores per cycle.

The L1 bandwidth you measure is also affected by parts of the pipeline other than the load/store ports. If the loop body is too small, loop-control overhead and branch handling can dominate. If the loop body is too large, the uOp cache or instruction cache can become a bottleneck. The ideal microbenchmark uses a tight, unrolled loop with simple arithmetic to keep the load/store ports busy and the front end out of the way.

The Zen 5 focus means you should be aware of its vector width and load/store configuration. If AVX-512 is supported, the vector width is 64 bytes, which can increase bandwidth demand per instruction. This can quickly saturate the ports. Comparing AVX2 (32-byte) vs AVX-512 (64-byte) shows how wider vectors can approach peak bandwidth faster but may also increase power and reduce frequency. Your report should capture these trade-offs.

In real designs, this limit is rarely isolated; it interacts with pipeline depth, power management, compiler decisions, and even microcode updates. When you study this behavior, vary one knob at a time and hold everything else constant: pin the core, fix the frequency if possible, warm up caches and predictors, and record the exact compiler flags. Vendor manuals describe typical behavior, but actual thresholds can shift across steppings or microcode revisions, so empirical measurement is the ground truth. If your results disagree with published numbers, investigate confounders such as alignment, instruction form, address mapping, or hidden dependencies introduced by the compiler. From a software perspective, compilers and JITs implicitly target these port limits via instruction selection, scheduling, and unrolling, so translate your measurements into actionable rules of thumb. Finally, validate with at least two workloads: a synthetic microbenchmark and a slightly more realistic kernel. If both show the same trend, the effect is not an artifact of the test harness.

How this fits in the project

You will use this concept to design your vector load/store loops in §3.2 and to interpret GB/s results in §3.7.

Definitions & key terms

  • L1 bandwidth -> max bytes per cycle from L1 cache
  • load/store port -> execution port that issues memory operations
  • aligned access -> memory access that starts on cache line boundary
  • bytes per cycle -> bandwidth measure used to compute GB/s

Mental model diagram (ASCII)

L1 -> Load Port 0 -> regs
L1 -> Load Port 1 -> regs
regs -> Store Port -> L1

How it works (step-by-step, with invariants and failure modes)

  1. Load multiple cache lines per cycle using vector loads.
  2. Perform minimal arithmetic so iterations stay free of cross-iteration dependencies.
  3. Store results back with aligned stores.
  4. Measure cycles and compute bytes per cycle.

Invariants:

  • Data must be in L1 to measure L1 bandwidth.
  • Loads/stores must be aligned for peak bandwidth.

Failure modes:

  • Misaligned accesses halve effective bandwidth.
  • Insufficient unrolling underutilizes ports.

Minimal concrete example

// Assumes A, B, C are 64-byte aligned float arrays and N is a multiple of 16
for (size_t i = 0; i < N; i += 16) {   // 16 floats per 512-bit vector
  __m512 a = _mm512_load_ps(&A[i]);    // aligned load
  __m512 b = _mm512_load_ps(&B[i]);
  __m512 c = _mm512_add_ps(a, b);      // minimal arithmetic
  _mm512_store_ps(&C[i], c);           // aligned store
}

Common misconceptions

  • “If data is in L1, it is free” -> L1 bandwidth is limited.
  • “Wider vectors always faster” -> they can saturate ports and reduce frequency.

Check-your-understanding questions

  1. Why can a loop be L1 bandwidth-bound even with no cache misses?
  2. How does alignment affect bandwidth?
  3. Why might AVX-512 reduce frequency?

Check-your-understanding answers

  1. Ports limit bytes per cycle regardless of cache hit rate.
  2. Misalignment causes split loads and extra cache line transfers.
  3. Wider vectors increase power and may trigger frequency throttling.

Real-world applications

  • Vectorized kernels in ML and HPC
  • Image and signal processing pipelines

References

  • “Computer Systems” by Bryant and O’Hallaron, memory hierarchy chapters
  • AMD Zen optimization guide

Key insights

  • L1 bandwidth is a port-limited resource; alignment and unrolling decide whether you reach it.

Summary

Even L1 has a bandwidth ceiling. Your benchmark measures how close your vector code gets to it.

Homework/Exercises to practice the concept

  1. Compute theoretical L1 bandwidth given port widths and frequency.
  2. Predict the effect of switching from AVX2 to AVX-512.

Solutions to the homework/exercises

  1. Bandwidth = bytes per cycle * frequency.
  2. AVX-512 doubles width but may reduce frequency; net gain is not guaranteed.

2.2 SIMD Width, Unrolling, and Instruction Scheduling

Fundamentals

SIMD (Single Instruction, Multiple Data) processes multiple data elements per instruction. Wider SIMD increases data processed per instruction, but it also increases pressure on load/store ports and instruction scheduling. Unrolling a loop reduces branch overhead and exposes more independent instructions to the scheduler, improving throughput. The key is to balance SIMD width and unrolling so the core is kept busy without overwhelming the front-end or memory system.

Deep Dive into the concept

SIMD instructions operate on vectors (e.g., 128-bit SSE, 256-bit AVX2, 512-bit AVX-512). A wider vector can process more data per instruction, but it also means each instruction transfers more bytes. If the load/store ports cannot sustain the required bytes per cycle, the loop becomes bandwidth-bound. This is why you often see diminishing returns when increasing SIMD width: the port limit is reached, and additional width provides no benefit.

Unrolling helps by allowing the CPU to issue multiple independent loads and stores in the same cycle. If the loop is not unrolled, the scheduler may be limited by the dependency chain within the loop (e.g., loop counter update, branch). By unrolling by 4 or 8, you reduce branch frequency and allow the backend to overlap more operations. However, unrolling increases code size and can stress the uOp cache or instruction cache. It can also increase register pressure, forcing spills that ruin performance. Therefore, the optimal unroll factor is a balance between exposing parallelism and avoiding register spills.

Instruction scheduling is the placement of instructions to avoid port contention and to maximize throughput. For example, interleaving loads, arithmetic, and stores can help keep different ports busy. If you place all loads first and then all stores, you may create bursts that exceed port capacity and cause stalls. A good scheduling strategy alternates load/compute/store sequences and uses independent registers.

In this benchmark, you want to minimize arithmetic so that bandwidth is the limiter. A simple add or XOR is sufficient. You should also avoid data dependencies between iterations by using independent vectors or by unrolling. If you observe that increasing unroll factor improves performance, it indicates the loop was front-end or dependency limited. If performance does not change, you are already bandwidth-limited.

When comparing SIMD widths, you should normalize results in bytes per cycle or GB/s, not cycles per iteration. This lets you compare apples to apples. You should also note frequency changes when using wider SIMD, as some CPUs reduce frequency under heavy AVX-512 usage. This can show up as lower GB/s despite higher per-iteration data processed.

How this fits in the project

You will apply this concept when designing loop variants in §3.2 and when analyzing AVX2 vs AVX-512 results in §3.7.

Definitions & key terms

  • SIMD width -> number of bits processed per instruction
  • unrolling -> repeating loop body multiple times per iteration
  • register pressure -> demand for registers that can cause spills
  • scheduling -> ordering instructions to maximize throughput

Mental model diagram (ASCII)

Unroll x4:
[Load1][Load2][Load3][Load4]
[Add1 ][Add2 ][Add3 ][Add4 ]
[Store1][Store2][Store3][Store4]

How it works (step-by-step, with invariants and failure modes)

  1. Choose SIMD width (AVX2 or AVX-512).
  2. Unroll loop to increase independent operations.
  3. Interleave loads, compute, and stores.
  4. Measure bytes per cycle and GB/s.

Invariants:

  • Avoid register spills.
  • Maintain alignment for vector loads/stores.

Failure modes:

  • Too much unrolling causes spills and slowdown.
  • Overly wide SIMD reduces frequency.

Minimal concrete example

// Unroll by 2: two independent load/add/store groups per iteration
__m256 a0 = _mm256_load_ps(&A[i]);
__m256 b0 = _mm256_load_ps(&B[i]);
__m256 a1 = _mm256_load_ps(&A[i+8]);   // 8 floats per 256-bit vector
__m256 b1 = _mm256_load_ps(&B[i+8]);
__m256 c0 = _mm256_add_ps(a0, b0);
__m256 c1 = _mm256_add_ps(a1, b1);
_mm256_store_ps(&C[i], c0);
_mm256_store_ps(&C[i+8], c1);

Common misconceptions

  • “More unroll is always better” -> register spills can hurt.
  • “SIMD width equals speedup” -> bandwidth can be the limiter.

Check-your-understanding questions

  1. Why does unrolling help hide loop overhead?
  2. What is the risk of excessive unrolling?
  3. Why normalize results by bytes per cycle?

Check-your-understanding answers

  1. It reduces branch frequency and exposes parallelism.
  2. It increases register pressure and code size.
  3. It allows fair comparison across SIMD widths.

Real-world applications

  • Vectorized math libraries
  • Multimedia and DSP kernels

References

  • Agner Fog’s optimization manuals
  • AMD/Intel SIMD optimization guides

Key insights

  • SIMD width and unrolling must be balanced to reach peak bandwidth.

Summary

SIMD and unrolling expose bandwidth limits. Your benchmark shows where the core saturates.

Homework/Exercises to practice the concept

  1. Compare AVX2 vs AVX-512 with unroll 1 and unroll 4.
  2. Record frequency changes during each test.

Solutions to the homework/exercises

  1. AVX-512 may show higher per-iteration throughput but similar GB/s if bandwidth-limited.
  2. Frequency drops indicate power throttling under wide vectors.

2.3 Roofline Modeling for L1 Bandwidth

Fundamentals

The roofline model links performance to two ceilings: compute throughput and memory bandwidth. For an L1 bandwidth stressor, the roofline is simple: peak bytes per cycle equals the number of load/store ports times the data width per port. Your measured GB/s should approach this ceiling when the loop is bandwidth-bound and uses aligned accesses. If you measure far below the roofline, the limiting factor is elsewhere: instruction overhead, port contention, misalignment, or write-allocate penalties. A good stressor uses the roofline to set expectations and to explain why results change when you switch from AVX2 to AVX-512.

Deep Dive into the concept

A practical roofline estimate starts with the microarchitecture. Suppose the core can issue two 64-byte loads per cycle and one 64-byte store per cycle at 4 GHz. The theoretical peak load bandwidth is 2 * 64 * 4e9 = 512 GB/s, and the combined load+store peak is 768 GB/s if the store port can also be saturated. But those numbers are idealized. In reality, a store to a line that is not already cached in a writable state triggers read-for-ownership (RFO) traffic, which doubles the effective bandwidth demand because the cache line must be read before it is written. Non-temporal stores can avoid RFO but may bypass L1, which changes the measurement goal. Your stressor should therefore include both normal stores and non-temporal stores to illustrate this difference.

Alignment and access pattern also affect roofline attainment. If your loads are misaligned and cross cache-line boundaries, the hardware may perform two loads per access, halving effective bandwidth. Similarly, bank conflicts in the L1 cache can serialize accesses. Many L1 caches are banked; if your access pattern consistently hits the same bank, you will not reach peak. The stressor should include patterns with different strides to detect bank conflicts and to show the importance of alignment.

Instruction overhead is the other silent limiter. A bandwidth-bound loop must do almost nothing besides loads and stores. If your loop includes pointer chasing, branches, or even integer arithmetic, you will spend issue slots on those instructions and reduce the available slots for loads/stores. This is why unrolling matters: it reduces branch overhead and increases the fraction of uOps devoted to memory operations. Your stressor should allow configurable unroll factors and report how bandwidth scales with unrolling.

Finally, the roofline is frequency-dependent. If the CPU downclocks when you use wide vectors (as some cores do for AVX-512), the peak GB/s changes even if bytes/cycle is constant. This is why you should report both bytes/cycle and GB/s. Bytes/cycle is the microarchitectural metric; GB/s is the user-facing metric. A clean stressor reports both and explicitly notes the measured core frequency.

How this fits in the project

You will use this in §3.3 Non-Functional Requirements to set performance goals, in §5.10 Phase 2 to interpret results, and in §7.3 to explain surprising bandwidth drops.

Definitions & key terms

  • roofline model -> performance bound defined by compute and bandwidth ceilings
  • RFO (read for ownership) -> implicit read caused by a store to a cache line
  • bytes/cycle -> normalized bandwidth independent of frequency
  • bank conflict -> contention for a cache bank that reduces throughput
  • non-temporal store -> store that bypasses certain cache levels

Mental model diagram (ASCII)

Peak BW = ports * width * freq
Measured BW <= Peak BW
   gap => overhead, alignment, bank conflicts

How it works (step-by-step, with invariants and failure modes)

  1. Compute theoretical peak bytes/cycle from port counts.
  2. Build a minimal load/store loop and measure bytes/cycle.
  3. Sweep alignment, stride, and unroll factors.
  4. Compare measured to peak and identify gaps.
  5. Repeat with different vector widths and note frequency changes.

Invariants:

  • Use aligned buffers for baseline measurements.
  • Measure frequency to avoid misleading GB/s numbers.

Failure modes:

  • RFO traffic halves effective write bandwidth.
  • Misalignment causes split loads and low throughput.

Minimal concrete example

// Pseudocode for bandwidth loop: a 256-bit vector holds 8 floats
for (i = 0; i < N; i += 8) {
  v = load256(a + i);
  store256(b + i, v);
}

Common misconceptions

  • “AVX-512 always doubles bandwidth” -> it can downclock and reduce GB/s.
  • “Unrolling is optional” -> without unrolling, branch overhead caps throughput.
  • “All stores are equal” -> RFO and non-temporal stores behave differently.

Check-your-understanding questions

  1. Why should you report bytes/cycle as well as GB/s?
  2. How does RFO affect store bandwidth?
  3. What experiment would reveal L1 bank conflicts?

Check-your-understanding answers

  1. Bytes/cycle normalizes out frequency changes and shows true microarchitectural limits.
  2. It adds an implicit read, doubling traffic for store-heavy loops.
  3. Sweep strides and look for throughput drops at specific patterns.

Real-world applications

  • Tuning SIMD kernels in DSP, graphics, or ML inference
  • Comparing cache subsystems across CPU generations
  • Validating vendor bandwidth claims

References

  • “Computer Systems: A Programmer’s Perspective” by Bryant and O’Hallaron, Ch. 6
  • “Optimizing Subroutines in Assembly” by Agner Fog (bandwidth and alignment sections)

Key insights

  • The roofline is the target; the gap is where learning happens.

Summary

A bandwidth stressor is meaningful only when compared to a roofline. Calculating the theoretical peak and analyzing the gaps reveals alignment issues, port limits, and frequency effects.

Homework/Exercises to practice the concept

  1. Compute theoretical L1 bandwidth for your CPU and compare to measured.
  2. Test normal stores vs non-temporal stores and report the difference.

Solutions to the homework/exercises

  1. Measured bandwidth should approach, but rarely reach, the theoretical peak.
  2. Non-temporal stores may reduce RFO overhead but can bypass L1.

3. Project Specification

3.1 What You Will Build

A bandwidth microbenchmark that uses vector loads and stores to saturate L1 cache bandwidth. It supports multiple SIMD widths (AVX2, AVX-512) and unroll factors, reporting GB/s and efficiency vs theoretical peak.

3.2 Functional Requirements

  1. Aligned Buffers: Allocate aligned arrays for A, B, C.
  2. Vectorized Kernels: Implement load/add/store loops for each SIMD width.
  3. Timing Harness: Measure cycles and compute GB/s.
  4. Report Generator: Output efficiency vs peak.

3.3 Non-Functional Requirements

  • Performance: Complete each test in under 0.5 seconds.
  • Reliability: Stable results across trials.
  • Usability: CLI flags for width and unroll factor.

3.4 Example Usage / Output

$ ./l1_stress --width 512 --unroll 8
Mode: AVX-512
Measured: 240 GB/s
Peak: 256 GB/s
Efficiency: 93.7%

3.5 Data Formats / Schemas / Protocols

CSV output:

width,unroll,gbps,efficiency
512,8,240,0.937

3.6 Edge Cases

  • Unaligned buffer allocation
  • CPU without AVX-512 support
  • Frequency changes under heavy SIMD

3.7 Real World Outcome

You will generate a bandwidth report that compares AVX2 and AVX-512 and shows how close you are to L1 peak.

3.7.1 How to Run (Copy/Paste)

cc -O3 -march=native -o l1_stress src/l1_stress.c src/kernels.c src/timing.c
taskset -c 2 ./l1_stress --width 256 --unroll 8

3.7.2 Golden Path Demo (Deterministic)

  • Use --width 256 --unroll 8 and fixed iteration count.
  • Expect stable GB/s within 5 percent across runs.

3.7.3 If CLI: Exact Terminal Transcript

$ taskset -c 2 ./l1_stress --width 256 --unroll 8
Mode: AVX2
Measured: 180 GB/s
Peak: 192 GB/s
Efficiency: 93.7%

$ echo $?
0

Failure demo (unsupported width):

$ ./l1_stress --width 512
Error: AVX-512 not supported on this CPU

$ echo $?
3

Exit codes:

  • 0 success
  • 2 timing init error
  • 3 invalid argument or unsupported ISA

4. Solution Architecture

4.1 High-Level Design

+----------------+   +------------------+   +-----------------+
| Kernel Builder |-> | Timing Harness   |-> | Report Generator|
+----------------+   +------------------+   +-----------------+

4.2 Key Components

Component Responsibility Key Decisions
Buffer Manager Aligned allocation 64-byte alignment
Kernels Vectorized load/store AVX2/AVX-512 versions
Reporter Compute GB/s normalize by cycles

4.3 Data Structures (No Full Code)

struct Result { int width; int unroll; double gbps; double eff; };

4.4 Algorithm Overview

Key Algorithm: Bandwidth Measurement

  1. Warm up buffers to L1.
  2. Run vectorized kernel for N iterations.
  3. Measure cycles and compute GB/s.

Complexity Analysis:

  • Time: O(N)
  • Space: O(buffer size)

5. Implementation Guide

5.1 Development Environment Setup

cc --version

5.2 Project Structure

l1-stressor/
├── src/
│   ├── l1_stress.c
│   ├── kernels.c
│   └── timing.c
└── README.md

5.3 The Core Question You’re Answering

“Can my code actually feed the vector units at full speed?”

5.4 Concepts You Must Understand First

  1. L1 bandwidth limits
  2. SIMD width and alignment
  3. Unrolling and scheduling

5.5 Questions to Guide Your Design

  1. How will you align buffers to 64 bytes?
  2. How will you choose unroll factor to avoid spills?
  3. How will you compute GB/s and efficiency?

5.6 Thinking Exercise

Estimate theoretical GB/s for a core that can do two 32-byte loads and one 32-byte store per cycle at 4 GHz.

5.7 The Interview Questions They’ll Ask

  1. What is the difference between bandwidth-bound and compute-bound?
  2. Why does alignment matter for vector loads?
  3. How can SIMD width reduce frequency?

5.8 Hints in Layers

Hint 1: Use aligned allocation APIs (posix_memalign).

Hint 2: Unroll to at least 4 to reduce loop overhead.

Hint 3: Normalize results by bytes per cycle.

5.9 Books That Will Help

Topic Book Chapter
Memory hierarchy “Computer Systems” Ch. 6
SIMD optimization Agner Fog’s optimization manuals SIMD sections

5.10 Implementation Phases

Phase 1: Foundation (2-3 days)

  • Build aligned buffers and timing harness.
  • Checkpoint: stable timing for simple loop.

Phase 2: Core Functionality (4-6 days)

  • Implement AVX2 and AVX-512 kernels.
  • Checkpoint: AVX2 reaches near-peak bandwidth.

Phase 3: Analysis (2-3 days)

  • Compute efficiency vs theoretical peak.
  • Checkpoint: report with GB/s and efficiency.

5.11 Key Implementation Decisions

Decision Options Recommendation Rationale
Alignment 16 vs 64 64 cache line alignment
Metric cycles/iter vs GB/s GB/s interpretable

6. Testing Strategy

6.1 Test Categories

Category Purpose Examples
Unit Tests Alignment correctness pointer mod 64 == 0
Integration Tests Kernel runs AVX2 loop
Edge Tests AVX-512 unsupported error path

6.2 Critical Test Cases

  1. Unaligned buffers should reduce GB/s.
  2. AVX2 vs AVX-512 should show different GB/s.
  3. Unroll factor changes should shift performance.

6.3 Test Data

width: 256, 512
unroll: 4, 8

7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

Pitfall Symptom Solution
Misalignment low GB/s align buffers
Spills unexpected slowdown reduce unroll
Frequency changes inconsistent GB/s pin and fix freq

7.2 Debugging Strategies

  • Print alignment and verify with asserts.
  • Use perf to measure load/store uOps.

7.3 Performance Traps

  • Over-unrolling increases code size and may hurt front-end.

8. Extensions & Challenges

8.1 Beginner Extensions

  • Add a plot script for GB/s vs unroll.

8.2 Intermediate Extensions

  • Compare float vs integer kernels.

8.3 Advanced Extensions

  • Include a read-only bandwidth test.

9. Real-World Connections

9.1 Industry Applications

  • Vectorized ML kernels
  • Graphics and video processing

9.2 Open Source Projects

  • OpenBLAS: bandwidth-sensitive kernels
  • xsimd: SIMD abstraction layer

9.3 Interview Relevance

  • L1 bandwidth and SIMD performance are common low-level topics.

10. Resources

10.1 Essential Reading

  • “Computer Systems” by Bryant and O’Hallaron
  • AMD and Intel SIMD optimization guides

10.2 Video Resources

  • “SIMD and Cache Bandwidth” lecture

10.3 Tools & Documentation

  • perf: measure load/store events
  • immintrin.h: intrinsic definitions

11. Self-Assessment Checklist

11.1 Understanding

  • I can compute theoretical L1 bandwidth.
  • I can explain alignment effects.
  • I can compare SIMD widths fairly.

11.2 Implementation

  • The benchmark reaches near-peak bandwidth.
  • Results are stable across trials.
  • Report includes efficiency.

11.3 Growth

  • I can explain bandwidth vs compute limits in an interview.

12. Submission / Completion Criteria

Minimum Viable Completion:

  • Measure GB/s for at least one SIMD width.
  • Provide efficiency vs peak.

Full Completion:

  • Compare AVX2 and AVX-512 results.

Excellence (Going Above & Beyond):

  • Add a bandwidth roofline plot and interpret it.