Project 3: Matrix Multiplication Optimizer

A program that multiplies two large matrices using three different algorithms: a naive (i,j,k) loop, a loop-reordered version, and a cache-blocked (tiled) version. You will benchmark all three to see the dramatic speedup from cache optimization.

Quick Reference

Attribute	Value
Primary Language	C
Alternative Languages	C++, Fortran, Rust
Difficulty	Level 3: Advanced
Time Estimate	1-2 weeks
Knowledge Area	CPU Caches / Algorithms / Linear Algebra
Tooling	GCC/Clang, Gnuplot (for visualization)
Prerequisites	Project 1 (Locality Benchmarker), solid understanding of pointers and 2D arrays.

What You Will Build

A program that multiplies two large matrices using three different algorithms: a naive (i,j,k) loop, a loop-reordered version, and a cache-blocked (tiled) version. You will benchmark all three to see the dramatic speedup from cache optimization.

Why It Matters

This project builds core skills that appear repeatedly in real-world systems and tooling.

Core Challenges

Implementing naive matrix multiplication → maps to the baseline algorithm
Reordering loops and analyzing the access pattern → maps to a simple, no-cost optimization
Implementing cache-blocking (tiling) → maps to advanced cache-aware algorithm design
Benchmarking and plotting results → maps to visualizing the performance gains

Key Concepts

Cache Blocking (Tiling): “Computer Systems: A Programmer’s Perspective” Ch. 6.4 - Bryant & O’Hallaron
Temporal Locality: Wikipedia article on Locality of Reference
Performance Measurement: clock_gettime or similar high-precision timers.

Real-World Outcome

$ ./matrix_optimizer 1024
Matrix size: 1024x1024
Naive (i,j,k) implementation: 15.2 seconds
Transposed (i,k,j) implementation: 9.8 seconds
Cache-blocked (block size 32): 1.3 seconds

GFLOPS (Blocked): 1.65

Implementation Guide

Reproduce the simplest happy-path scenario.
Build the smallest working version of the core feature.
Add input validation and error handling.
Add instrumentation/logging to confirm behavior.
Refactor into clean modules with tests.

Milestones

Milestone 1: Minimal working program that runs end-to-end.
Milestone 2: Correct outputs for typical inputs.
Milestone 3: Robust handling of edge cases.
Milestone 4: Clean structure and documented usage.

Validation Checklist

Output matches the real-world outcome example
Handles invalid inputs safely
Provides clear errors and exit codes
Repeatable results across runs

References

Main guide: LEARN_C_PERFORMANCE_DEEP_DIVE.md
“Computer Systems: A Programmer’s Perspective” by Randal E. Bryant & David R. O’Hallaron