Project 3: Matrix Multiplication Optimizer
A program that multiplies two large matrices using three different algorithms: a naive
(i,j,k)loop, a loop-reordered version, and a cache-blocked (tiled) version. You will benchmark all three to see the dramatic speedup from cache optimization.
Quick Reference
| Attribute | Value |
|---|---|
| Primary Language | C |
| Alternative Languages | C++, Fortran, Rust |
| Difficulty | Level 3: Advanced |
| Time Estimate | 1-2 weeks |
| Knowledge Area | CPU Caches / Algorithms / Linear Algebra |
| Tooling | GCC/Clang, Gnuplot (for visualization) |
| Prerequisites | Project 1 (Locality Benchmarker), solid understanding of pointers and 2D arrays. |
What You Will Build
A program that multiplies two large matrices using three different algorithms: a naive (i,j,k) loop, a loop-reordered version, and a cache-blocked (tiled) version. You will benchmark all three to see the dramatic speedup from cache optimization.
Why It Matters
This project builds core skills that appear repeatedly in real-world systems and tooling.
Core Challenges
- Implementing naive matrix multiplication → maps to the baseline algorithm
- Reordering loops and analyzing the access pattern → maps to a simple, no-cost optimization
- Implementing cache-blocking (tiling) → maps to advanced cache-aware algorithm design
- Benchmarking and plotting results → maps to visualizing the performance gains
Key Concepts
- Cache Blocking (Tiling): “Computer Systems: A Programmer’s Perspective” Ch. 6.4 - Bryant & O’Hallaron
- Temporal Locality: Wikipedia article on Locality of Reference
- Performance Measurement:
clock_gettimeor similar high-precision timers.
Real-World Outcome
$ ./matrix_optimizer 1024
Matrix size: 1024x1024
Naive (i,j,k) implementation: 15.2 seconds
Transposed (i,k,j) implementation: 9.8 seconds
Cache-blocked (block size 32): 1.3 seconds
GFLOPS (Blocked): 1.65
Implementation Guide
- Reproduce the simplest happy-path scenario.
- Build the smallest working version of the core feature.
- Add input validation and error handling.
- Add instrumentation/logging to confirm behavior.
- Refactor into clean modules with tests.
Milestones
- Milestone 1: Minimal working program that runs end-to-end.
- Milestone 2: Correct outputs for typical inputs.
- Milestone 3: Robust handling of edge cases.
- Milestone 4: Clean structure and documented usage.
Validation Checklist
- Output matches the real-world outcome example
- Handles invalid inputs safely
- Provides clear errors and exit codes
- Repeatable results across runs
References
- Main guide:
LEARN_C_PERFORMANCE_DEEP_DIVE.md - “Computer Systems: A Programmer’s Perspective” by Randal E. Bryant & David R. O’Hallaron