Project 3: Matrix Multiplication Optimizer

A program that multiplies two large matrices using three different algorithms: a naive (i,j,k) loop, a loop-reordered version, and a cache-blocked (tiled) version. You will benchmark all three to see the dramatic speedup from cache optimization.

Quick Reference

Attribute Value
Primary Language C
Alternative Languages C++, Fortran, Rust
Difficulty Level 3: Advanced
Time Estimate 1-2 weeks
Knowledge Area CPU Caches / Algorithms / Linear Algebra
Tooling GCC/Clang, Gnuplot (for visualization)
Prerequisites Project 1 (Locality Benchmarker), solid understanding of pointers and 2D arrays.

What You Will Build

A program that multiplies two large matrices using three different algorithms: a naive (i,j,k) loop, a loop-reordered version, and a cache-blocked (tiled) version. You will benchmark all three to see the dramatic speedup from cache optimization.

Why It Matters

This project builds core skills that appear repeatedly in real-world systems and tooling.

Core Challenges

  • Implementing naive matrix multiplication → maps to the baseline algorithm
  • Reordering loops and analyzing the access pattern → maps to a simple, no-cost optimization
  • Implementing cache-blocking (tiling) → maps to advanced cache-aware algorithm design
  • Benchmarking and plotting results → maps to visualizing the performance gains

Key Concepts

  • Cache Blocking (Tiling): “Computer Systems: A Programmer’s Perspective” Ch. 6.4 - Bryant & O’Hallaron
  • Temporal Locality: Wikipedia article on Locality of Reference
  • Performance Measurement: clock_gettime or similar high-precision timers.

Real-World Outcome

$ ./matrix_optimizer 1024
Matrix size: 1024x1024
Naive (i,j,k) implementation: 15.2 seconds
Transposed (i,k,j) implementation: 9.8 seconds
Cache-blocked (block size 32): 1.3 seconds

GFLOPS (Blocked): 1.65

Implementation Guide

  1. Reproduce the simplest happy-path scenario.
  2. Build the smallest working version of the core feature.
  3. Add input validation and error handling.
  4. Add instrumentation/logging to confirm behavior.
  5. Refactor into clean modules with tests.

Milestones

  • Milestone 1: Minimal working program that runs end-to-end.
  • Milestone 2: Correct outputs for typical inputs.
  • Milestone 3: Robust handling of edge cases.
  • Milestone 4: Clean structure and documented usage.

Validation Checklist

  • Output matches the real-world outcome example
  • Handles invalid inputs safely
  • Provides clear errors and exit codes
  • Repeatable results across runs

References

  • Main guide: LEARN_C_PERFORMANCE_DEEP_DIVE.md
  • “Computer Systems: A Programmer’s Perspective” by Randal E. Bryant & David R. O’Hallaron