LEARN C PERFORMANCE DEEP DIVE

Learn C Performance: From Cache-Oblivious to Cache-Aware Master

Goal: Deeply understand how modern CPUs execute your C code—from cache behavior and memory layout to locality, false sharing, and struct optimization—by building projects that make performance differences tangible and measurable.


Why Learn Low-Level C Performance?

Most C programming is taught with a simplified model of the computer: there’s a CPU and there’s memory. But the reality is a complex hierarchy of caches (L1, L2, L3) that have a monumental impact on performance. An algorithm with more steps can be faster if it’s “cache-friendly.”

After completing these projects, you will:

  • See memory not as a monolith, but as a hierarchy of caches.
  • Intuitively structure your data and loops for maximum spatial and temporal locality.
  • Diagnose and fix subtle performance killers like false sharing in multithreaded code.
  • Design C structs that are smaller and faster by controlling padding and alignment.
  • Move from writing “correct” C code to writing “fast” C code.

Core Concept Analysis

1. The Memory Hierarchy

Your CPU doesn’t just talk to RAM. It’s a pyramid of speed and size.

                  CPU Core
                    |
      +-------------+-------------+
      | (Registers) | (Registers) |  (~1 cycle latency)
      +-------------+-------------+
                    |
+---------------------------------------+
|              L1 Cache                 |  (~4 cycles)   Fastest, Smallest (e.g., 32-64 KB)
| (Instruction + Data)                  |
+---------------------------------------+
                    |
+---------------------------------------+
|              L2 Cache                 |  (~12 cycles)  Slower, Bigger (e.g., 256 KB - 1 MB)
+---------------------------------------+
                    |
+---------------------------------------+
|              L3 Cache                 |  (~40 cycles)  Shared, Largest (e.g., 8-32 MB)
+---------------------------------------+
                    |
+---------------------------------------+
|          Main Memory (RAM)            |  (~100-200 cycles) Slowest, Biggest (e.g., 16-64 GB)
+---------------------------------------+

Accessing data in L1 is orders of magnitude faster than fetching it from RAM. Cache-aware programming is about keeping your working data in the fastest caches for as long as possible.

2. Locality of Reference

Caches and prefetchers work by anticipating what data you’ll need next. That anticipation relies on two principles:

  • Temporal Locality: If you access a piece of data now, you’re likely to access it again soon.
    • Example: A loop counter variable.
    • Cache Strategy: Keep recently accessed data in the cache.
  • Spatial Locality: If you access a memory location, you’re likely to access nearby memory locations soon.
    • Example: Iterating through an array (arr[0], arr[1], arr[2], …).
    • Cache Strategy: When fetching data from RAM, fetch a whole cache line (typically 64 bytes) at once.

3. Struct Layout and Padding

In C, the compiler is not allowed to reorder your struct fields. Instead, it must insert “padding” bytes to keep each field properly aligned for the CPU architecture.

// Unoptimized Struct (24 bytes)
struct bad {
    char a;      // 1 byte
    // 7 bytes padding
    double c;    // 8 bytes
    char b;      // 1 byte
    // 7 bytes padding
}; // Total size: 24 bytes!

// Optimized Struct (16 bytes)
struct good {
    double c;    // 8 bytes
    char a;      // 1 byte
    char b;      // 1 byte
    // 6 bytes padding
}; // Total size: 16 bytes!

Poor layout wastes memory, which means fewer of your structs can fit in the cache at once, leading to more cache misses.

4. False Sharing

This is a subtle but deadly multithreading performance bug.

       Core 1 wants to write to X          Core 2 wants to write to Y
                |                                    |
+--------------------------------------------------------------------------+
|                        Single Cache Line (64 bytes)                        |
|   ...   [ int X ]   [ int Y ]   ...   ...   ...   ...   ...   ...   ...    |
+--------------------------------------------------------------------------+

Even if X and Y are logically independent, they live on the same cache line.

  1. Core 1 writes to X. It takes ownership of the cache line.
  2. Core 2 wants to write to Y. It must wait for Core 1, then invalidate Core 1’s cache line, and take ownership itself.
  3. Now Core 1 wants to write to X again. It must wait for Core 2, invalidate its cache line, and take ownership back.

The cores spend all their time fighting over the cache line instead of doing useful work.


Project List

The following 10 projects will guide you from observing cache effects to actively controlling them.


Project 1: Locality Benchmarker

  • File: LEARN_C_PERFORMANCE_DEEP_DIVE.md
  • Main Programming Language: C
  • Alternative Programming Languages: C++, Rust
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 1: Beginner
  • Knowledge Area: CPU Caches / Performance Measurement
  • Software or Tool: GCC/Clang
  • Main Book: “Computer Systems: A Programmer’s Perspective” by Randal E. Bryant & David R. O’Hallaron

What you’ll build: A program that sums a large 2D array of integers in two ways: once iterating row-by-row, and once iterating column-by-column. You will then measure and display the massive performance difference.

Why it teaches C Performance: This is the “Hello, World!” of cache performance. It provides an undeniable, measurable demonstration of spatial locality. You will see that changing the order of loops can make your code 10x faster, forcing you to internalize the concept of a row-major memory layout.

Core challenges you’ll face:

  • Setting up a large 2D array → maps to dynamic memory allocation
  • Writing a high-resolution timer → maps to measuring performance accurately
  • Implementing the two loop orders → maps to understanding row-major vs. column-major access
  • Preventing compiler optimizations → maps to ensuring the compiler doesn’t optimize away your benchmark

Key Concepts:

  • Spatial Locality: “Computer Systems: A Programmer’s Perspective” Ch. 6 - Bryant & O’Hallaron
  • Memory Hierarchy: “Computer Systems: A Programmer’s Perspective” Ch. 6 - Bryant & O’Hallaron
  • Row-Major Layout: C programming tutorials on 2D arrays

Difficulty: Beginner Time estimate: Weekend Prerequisites: Basic C programming (loops, arrays, functions)

Real world outcome:

$ ./locality_benchmark
Initializing 10000x10000 matrix...
Row-major access (good spatial locality):
  Sum: 122713344, Time: 0.08 seconds
Column-major access (bad spatial locality):
  Sum: 122713344, Time: 1.12 seconds

Performance Ratio (bad/good): 14.00x

Implementation Hints:

Start by defining a large matrix. A 10000x10000 int matrix is about 400MB.

// How to think about the loops, not the exact code:

// Good spatial locality (row-major access)
for (row = 0; row < N; row++) {
    for (col = 0; col < N; col++) {
        sum += matrix[row][col]; // Accessing memory sequentially
    }
}

// Bad spatial locality (column-major access)
for (col = 0; col < N; col++) {
    for (row = 0; row < N; row++) {
        sum += matrix[row][col]; // Jumping N*sizeof(int) bytes on each access
    }
}

For timing, you can use clock_gettime(CLOCK_MONOTONIC, ...) on Linux/macOS or a similar high-resolution timer. To avoid the compiler optimizing out your sum, make sure you print it or use it in a way that it can’t be discarded (e.g., return it from main).
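
A minimal end-to-end sketch of the benchmark, assuming a single flat allocation indexed as matrix[row * N + col] (the size, names, and output format are illustrative, not prescriptive):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 10000   /* 10000 x 10000 ints is about 400 MB */

static double now_seconds(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void) {
    int *matrix = malloc((size_t)N * N * sizeof(int));
    if (!matrix) return 1;
    for (size_t i = 0; i < (size_t)N * N; i++) matrix[i] = (int)(i & 0xFF);

    double t0 = now_seconds();
    long long sum_row = 0;
    for (size_t row = 0; row < N; row++)
        for (size_t col = 0; col < N; col++)
            sum_row += matrix[row * N + col];   // sequential: walks each cache line in order
    double t1 = now_seconds();

    long long sum_col = 0;
    for (size_t col = 0; col < N; col++)
        for (size_t row = 0; row < N; row++)
            sum_col += matrix[row * N + col];   // strided: jumps N*sizeof(int) bytes per access
    double t2 = now_seconds();

    // Printing the sums keeps the compiler from discarding the loops.
    printf("Row-major:    sum=%lld, time=%.3f s\n", sum_row, t1 - t0);
    printf("Column-major: sum=%lld, time=%.3f s\n", sum_col, t2 - t1);
    free(matrix);
    return 0;
}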

Learning milestones:

  1. The program works → You can allocate and access 2D arrays.
  2. You measure a significant time difference → You have proven spatial locality exists.
  3. You can explain why it’s faster → You understand cache lines and row-major layout.
  4. You experiment with different matrix sizes → You start to get a feel for the size of your CPU’s caches.

Project 2: Struct Layout Analyzer

  • File: LEARN_C_PERFORMANCE_DEEP_DIVE.md
  • Main Programming Language: C
  • Alternative Programming Languages: C++, Rust
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 2. The “Micro-SaaS / Pro Tool”
  • Difficulty: Level 1: Beginner
  • Knowledge Area: Memory Layout / Data Structures
  • Software or Tool: GCC/Clang
  • Main Book: “The C Programming Language” by Brian W. Kernighan and Dennis M. Ritchie

What you’ll build: A C program that defines several structs with identical members but different orderings. The program will then print the sizeof each struct and the memory offset of each member using the offsetof macro, visually demonstrating the impact of padding.

Why it teaches C Performance: This project demystifies memory padding and alignment. You will see how the compiler is forced to add “wasted” bytes to your structs and learn how to reorder fields to create smaller, more cache-friendly data structures.

Core challenges you’ll face:

  • Defining multiple struct variations → maps to organizing your code for comparison
  • Using sizeof correctly → maps to understanding object size
  • Using the offsetof macro → maps to inspecting internal struct layout
  • Printing a visual representation → maps to making the padding visible and understandable

Key Concepts:

  • Struct Padding and Alignment: “Effective C” Ch. 2 - Robert C. Seacord
  • offsetof macro: C standard library documentation (stddef.h)
  • Data Primitives Sizes: “The C Programming Language” Ch. 2 - K&R

Difficulty: Beginner Time estimate: Weekend Prerequisites: Basic C knowledge of structs.

Real world outcome: A command-line tool that produces a report like this:

$ ./struct_analyzer

Analyzing 'struct unoptimized':
  Total size: 24 bytes (8 bytes wasted)
  Layout:
    [ 0] char    c   (1 byte)
    [ 1] --- padding --- (7 bytes)
    [ 8] double  d   (8 bytes)
    [16] int     i   (4 bytes)
    [20] --- padding --- (4 bytes)

Analyzing 'struct optimized':
  Total size: 16 bytes (0 bytes wasted)
  Layout:
    [ 0] double  d   (8 bytes)
    [ 8] int     i   (4 bytes)
    [12] char    c   (1 byte)
    [13] --- padding --- (3 bytes)

Implementation Hints:

You’ll need to include <stddef.h> for the offsetof macro.

The rule of thumb for manual optimization is to order struct members from largest to smallest.

  1. Define a struct with a mix of char, short, int, double, and pointers.
  2. Define a second struct with the exact same members but in a different order (e.g., sorted by size descending).
  3. In main, use printf to display the results of sizeof(struct_name) for both.
  4. For each member of each struct, print its name and its offset using printf("%zu", offsetof(struct_name, member_name));.
  5. Your challenge is to then calculate and display the padding between members based on these offsets.
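
A minimal sketch of steps 1–4, with assumed member names (the padding report in step 5 is left to you):

#include <stdio.h>
#include <stddef.h>

// Same members, different order.
struct unoptimized { char c; double d; int i; };
struct optimized   { double d; int i; char c; };

// Prints one member's offset; the gaps between consecutive offsets are padding.
#define REPORT(type, member) \
    printf("  offset %2zu: %s\n", offsetof(type, member), #member)

int main(void) {
    printf("sizeof(struct unoptimized) = %zu\n", sizeof(struct unoptimized));
    REPORT(struct unoptimized, c);
    REPORT(struct unoptimized, d);
    REPORT(struct unoptimized, i);

    printf("sizeof(struct optimized)   = %zu\n", sizeof(struct optimized));
    REPORT(struct optimized, d);
    REPORT(struct optimized, i);
    REPORT(struct optimized, c);
    return 0;
}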

Learning milestones:

  1. You correctly print the size and offsets → You can inspect basic struct properties.
  2. You identify padding between members → You understand that the compiler adds invisible bytes.
  3. You manually reorder a struct to reduce its size → You have learned the fundamental rule of struct layout optimization.
  4. You can predict the size of a struct before compiling → You have internalized alignment rules.

Project 3: Matrix Multiplication Optimizer

  • File: LEARN_C_PERFORMANCE_DEEP_DIVE.md
  • Main Programming Language: C
  • Alternative Programming Languages: C++, Fortran, Rust
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 3: Advanced
  • Knowledge Area: CPU Caches / Algorithms / Linear Algebra
  • Software or Tool: GCC/Clang, Gnuplot (for visualization)
  • Main Book: “Computer Systems: A Programmer’s Perspective” by Randal E. Bryant & David R. O’Hallaron

What you’ll build: A program that multiplies two large matrices using three different algorithms: a naive (i,j,k) loop, a loop-reordered version, and a cache-blocked (tiled) version. You will benchmark all three to see the dramatic speedup from cache optimization.

Why it teaches C Performance: This is a canonical example of optimizing for the memory hierarchy. The naive implementation thrashes the cache. The tiled version is a masterclass in improving both temporal and spatial locality by breaking a large problem down into smaller chunks that fit neatly into the L1 or L2 cache.

Core challenges you’ll face:

  • Implementing naive matrix multiplication → maps to the baseline algorithm
  • Reordering loops and analyzing the access pattern → maps to a simple, no-cost optimization
  • Implementing cache-blocking (tiling) → maps to advanced cache-aware algorithm design
  • Benchmarking and plotting results → maps to visualizing the performance gains

Key Concepts:

  • Cache Blocking (Tiling): “Computer Systems: A Programmer’s Perspective” Ch. 6.4 - Bryant & O’Hallaron
  • Temporal Locality: Wikipedia article on Locality of Reference
  • Performance Measurement: clock_gettime or similar high-precision timers.

Difficulty: Advanced Time estimate: 1-2 weeks Prerequisites: Project 1 (Locality Benchmarker), solid understanding of pointers and 2D arrays.

Real world outcome: A program that outputs benchmark data, which can then be plotted to create a graph like this, clearly showing the superiority of the cache-blocked algorithm.

$ ./matrix_optimizer 1024
Matrix size: 1024x1024
Naive (i,j,k) implementation: 15.2 seconds
Transposed (i,k,j) implementation: 9.8 seconds
Cache-blocked (block size 32): 1.3 seconds

GFLOPS (Blocked): 1.65

Implementation Hints:

The three algorithms access memory differently. Assume C = A * B.

  1. Naive (i,j,k):
    for i=0..N: for j=0..N: for k=0..N: C[i][j] += A[i][k] * B[k][j]
    
    • Access to A is sequential (good).
    • Access to B is strided/column-wise (bad!).
    • Access to C is repeated (good).
  2. Reordered (i,k,j):
    for i=0..N: for k=0..N: for j=0..N: C[i][j] += A[i][k] * B[k][j]
    
    • Access to A is repeated (good).
    • Access to B is sequential (good!).
    • Access to C is sequential (good!). This version is much better!
  3. Cache-blocked: This is the most complex part. You add extra loops that iterate over “tiles” or “blocks” of the matrices. The inner loops then perform the multiplication just for that small block.
    // BS is the block (tile) size, e.g., 32 -- distinct from matrix B
    for i_tile=0..N by BS: for j_tile=0..N by BS: for k_tile=0..N by BS:
      // Now do matrix multiply on the sub-matrices
      for i=i_tile..i_tile+BS: for j=j_tile..j_tile+BS: for k=k_tile..k_tile+BS:
        C[i][j] += A[i][k] * B[k][j]
    

    The key insight is to choose a block size BS such that the BS x BS sub-matrices of A, B, and C can all fit into your L1 cache at the same time. This maximizes data reuse. (A C sketch of the blocked loops follows below.)
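
One possible C version of the blocked loops, assuming flat row-major N x N arrays of double and a caller-zeroed C (BS and min_sz are our own naming choices):

#include <stddef.h>

#define BS 32   /* tile size; tune per machine */

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

// C = A * B for N x N matrices stored row-major; C must be zero-initialized.
void matmul_blocked(const double *A, const double *B, double *C, size_t N) {
    for (size_t ii = 0; ii < N; ii += BS)
        for (size_t kk = 0; kk < N; kk += BS)
            for (size_t jj = 0; jj < N; jj += BS)
                // Multiply one BS x BS tile; all three tiles stay hot in L1/L2.
                for (size_t i = ii; i < min_sz(ii + BS, N); i++)
                    for (size_t k = kk; k < min_sz(kk + BS, N); k++) {
                        double a = A[i * N + k];
                        for (size_t j = jj; j < min_sz(jj + BS, N); j++)
                            C[i * N + j] += a * B[k * N + j];
                    }
}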

Learning milestones:

  1. All three algorithms produce correct results → You can implement matrix multiplication.
  2. You measure a performance difference between naive and reordered → You understand how loop order affects cache access.
  3. The blocked version is significantly faster than the other two → You have successfully implemented a cache-aware algorithm.
  4. You can find the optimal block size for your machine → You are now tuning algorithms based on hardware characteristics.

Project 4: False Sharing Detector

  • File: LEARN_C_PERFORMANCE_DEEP_DIVE.md
  • Main Programming Language: C
  • Alternative Programming Languages: C++, Rust
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Concurrency / CPU Caches / Multithreading
  • Software or Tool: pthreads (or std::thread in C++), GCC/Clang
  • Main Book: “C++ Concurrency in Action” by Anthony Williams (concepts are transferable)

What you’ll build: A multithreaded program that demonstrates the performance penalty of false sharing. It will have two modes: a “bad” mode where two threads hammer adjacent integer counters, and a “good” mode where the counters are padded to lie on separate cache lines.

Why it teaches C Performance: False sharing is one of the most counter-intuitive performance bugs. Your code is logically correct, but slow. This project makes the invisible contention visible through a stark performance difference, teaching you to think about the physical layout of data in a multithreaded context.

Core challenges you’ll face:

  • Creating and managing threads → maps to pthreads or other threading libraries
  • Designing a data structure to induce false sharing → maps to placing variables next to each other
  • Designing a data structure to prevent false sharing → maps to using padding to separate variables
  • Accurately benchmarking the threaded code → maps to joining threads and using a high-res timer

Key Concepts:

  • False Sharing: Scott Meyers’ “CPU Caches and Why You Care” talk (YouTube)
  • Cache Coherency Protocols (MESI): Wikipedia provides a good overview.
  • Padding and Alignment: “Effective C” Ch. 2 - Robert C. Seacord

Difficulty: Advanced Time estimate: 1-2 weeks Prerequisites: Basic multithreading concepts (creating threads, joining them).

Real world outcome:

$ ./false_sharing_detector
Running with false sharing (adjacent counters):
  Thread 0 and 1 finished in 4.51 seconds.

Running with padding (counters on separate cache lines):
  Thread 0 and 1 finished in 0.35 seconds.

Performance Ratio (bad/good): 12.88x

Implementation Hints:

Your main data structure will hold the counters.

// Bad version (induces false sharing)
struct counters_bad {
    long long counterA; // Thread 1 updates this
    long long counterB; // Thread 2 updates this
};

// Good version (prevents false sharing)
// C11 provides `_Alignas` to control alignment.
// You can also add manual padding.
struct counters_good {
    _Alignas(64) long long counterA; // Force counterA to start on a new cache line
    _Alignas(64) long long counterB; // Force counterB to start on a new cache line
};

// Or with manual padding:
struct counters_good_manual {
    long long counterA;
    char padding[56]; // 64 - sizeof(long long) = 56
    long long counterB;
};

Your thread function will simply be a tight loop that increments one of these counters a few billion times. Create two threads, have one work on counterA and the other on counterB, and measure the total time until both threads complete. Run the benchmark for both the “good” and “bad” structs.
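
A minimal pthread harness sketch along those lines (compile with -pthread; the struct mirrors counters_good_manual above, and the volatile pointer keeps the compiler from collapsing the loop into a single addition):

#include <pthread.h>
#include <stdio.h>
#include <time.h>

// Same layout as struct counters_good_manual above; remove the padding
// (or use struct counters_bad) to see the false-sharing slowdown.
struct counters {
    long long counterA;
    char padding[56];
    long long counterB;
};

static struct counters shared;

#define ITERATIONS 500000000LL   // arbitrary; pick something that runs a few seconds

static void *worker(void *arg) {
    volatile long long *c = arg;          // each thread hammers its own counter
    for (long long i = 0; i < ITERATIONS; i++) (*c)++;
    return NULL;
}

int main(void) {
    pthread_t ta, tb;
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    pthread_create(&ta, NULL, worker, (void *)&shared.counterA);
    pthread_create(&tb, NULL, worker, (void *)&shared.counterB);
    pthread_join(ta, NULL);
    pthread_join(tb, NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("counters: %lld %lld, elapsed: %.2f s\n",
           shared.counterA, shared.counterB, secs);
    return 0;
}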

Learning milestones:

  1. The multithreaded program runs correctly → You can manage threads.
  2. You measure a massive performance drop in the “bad” case → You have successfully demonstrated false sharing.
  3. You can fix the issue with padding/alignment → You know how to solve this specific problem.
  4. You can explain the MESI protocol’s role → You understand why false sharing happens at the hardware level.

Project 5: Image Processing Cache Benchmark

  • File: LEARN_C_PERFORMANCE_DEEP_DIVE.md
  • Main Programming Language: C
  • Alternative Programming Languages: C++, Rust
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: CPU Caches / Image Processing / Data Representation
  • Software or Tool: A simple image library (like stb_image.h) or just a raw 2D array.
  • Main Book: “Write Great Code, Volume 2: Thinking Low-Level, Writing High-Level” by Randall Hyde

What you’ll build: A program that applies a simple filter (like a grayscale conversion or a box blur) to an image represented as a 2D array of pixels. You will compare the performance of two data layouts: an “Array of Structs” (AoS) and a “Struct of Arrays” (SoA).

Why it teaches C Performance: This project shows how data layout choice affects cache performance for real-world algorithms. When converting to grayscale, you only need the R, G, and B components. The AoS layout forces the CPU to load the A (alpha) component into the cache as well, polluting it with unused data. The SoA layout is perfect for this task, leading to better performance.

Core challenges you’ll face:

  • Representing pixel data in AoS format → maps to the “intuitive” struct Pixel { r, g, b, a; } layout
  • Representing pixel data in SoA format → maps to the less intuitive but more performant struct Pixels { r[], g[], b[], a[]; } layout
  • Implementing the same algorithm for both layouts → maps to adapting your code to different data structures
  • Benchmarking the two approaches → maps to measuring the performance impact of data layout

Key Concepts:

  • Array of Structs vs. Struct of Arrays (AoS vs. SoA): Mike Acton’s CppCon 2014 talk “Data-Oriented Design and C++”
  • Cache Pollution: When your cache is filled with data you don’t need for the current operation.
  • Data-Oriented Design: A programming paradigm focused on data layouts for performance.

Difficulty: Intermediate Time estimate: 1-2 weeks

  • Prerequisites: Project 1 (Locality Benchmarker), comfort with pointers and dynamic allocation.

Real world outcome:

$ ./image_benchmark image.png
Processing 1920x1080 image...

Benchmarking Grayscale Conversion:
  Array of Structs (AoS) layout: 0.045 seconds
  Struct of Arrays (SoA) layout: 0.015 seconds
  Performance Ratio (SoA is faster by): 3.00x

Benchmarking Full Image Copy (reads all fields):
  Array of Structs (AoS) layout: 0.020 seconds
  Struct of Arrays (SoA) layout: 0.060 seconds
  Performance Ratio (AoS is faster by): 3.00x

The outcome interestingly shows that the “best” layout depends on the access pattern of your algorithm.

Implementation Hints:

AoS Layout:

struct Pixel_AoS {
    unsigned char r, g, b, a;
};
struct Pixel_AoS* image_aos = malloc(width * height * sizeof(struct Pixel_AoS));

A grayscale algorithm on AoS would look something like this:

for (i = 0; i < width * height; i++) {
    unsigned char gray = 0.299 * image_aos[i].r + 0.587 * image_aos[i].g + 0.114 * image_aos[i].b;
    // ... do something with gray ...
}
// Each access to .r, .g, .b pulls in the unused .a, wasting cache space and bandwidth.

SoA Layout:

struct Image_SoA {
    unsigned char* r;
    unsigned char* g;
    unsigned char* b;
    unsigned char* a;
};
// Malloc each array separately

The same algorithm on SoA:

for (i = 0; i < width * height; i++) {
    unsigned char gray = 0.299 * image_soa.r[i] + 0.587 * image_soa.g[i] + 0.114 * image_soa.b[i];
    // ... do something with gray ...
}
// Each access is sequential within its own array. The unused 'a' array is never loaded into the cache.
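
Allocating the SoA layout is the one piece not shown above; a small sketch, assuming the struct Image_SoA definition from earlier:

#include <stdlib.h>

// Give each channel its own contiguous buffer. During the grayscale pass the
// unused 'a' buffer is simply never touched, so it never enters the cache.
int image_soa_alloc(struct Image_SoA *img, size_t pixel_count) {
    img->r = malloc(pixel_count);
    img->g = malloc(pixel_count);
    img->b = malloc(pixel_count);
    img->a = malloc(pixel_count);
    if (!img->r || !img->g || !img->b || !img->a) {
        free(img->r); free(img->g); free(img->b); free(img->a);
        return -1;
    }
    return 0;
}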

Learning milestones:

  1. You can represent and process an image in both AoS and SoA formats → You understand the two layouts.
  2. You measure a performance difference for algorithms that use a subset of fields → You see the cost of cache pollution.
  3. You can identify which algorithms benefit from which layout → You are thinking in a data-oriented way.
  4. You start designing your structs based on access patterns, not just logical grouping → You have leveled up as a performance-oriented programmer.

Project 6: Data Structure Cache Benchmark

  • File: LEARN_C_PERFORMANCE_DEEP_DIVE.md
  • Main Programming Language: C
  • Alternative Programming Languages: C++, Rust
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: Data Structures / CPU Caches / Memory Access Patterns
  • Software or Tool: GCC/Clang
  • Main Book: “Data Structures and Algorithm Analysis in C” by Mark Allen Weiss

What you’ll build: A program that implements a Doubly Linked List and a Dynamic Array (std::vector equivalent). You will then perform two benchmarks on a large dataset: (1) iterating through and summing all elements, and (2) performing many random insertions and deletions.

Why it teaches C Performance: This project brutally demonstrates why pointer-chasing is a cache performance disaster. You will see that for linear traversal, the array is unbeatable due to its perfect spatial locality. Conversely, you’ll see where the linked list’s O(1) insertion (once the node is found) can shine, teaching you that the “best” data structure depends entirely on the access pattern.

Core challenges you’ll face:

  • Implementing a Doubly Linked List from scratch → maps to manual memory management and pointer manipulation
  • Implementing a Dynamic Array (vector) from scratch → maps to handling growth, realloc
  • Populating both structures with a large dataset → maps to preparing the benchmark
  • Comparing traversal vs. random access performance → maps to understanding algorithmic complexity vs. hardware performance

Key Concepts:

  • Pointer Chasing vs. Sequential Access: “Computer Systems: A Programmer’s Perspective” Ch. 6 - Bryant & O’Hallaron
  • Dynamic Array Implementation: Numerous online tutorials on creating a vector in C.
  • Linked List Implementation: Classic computer science exercise.

Difficulty: Intermediate Time estimate: 1-2 weeks

  • Prerequisites: Solid understanding of malloc/free and pointers.

Real world outcome:

$ ./datastructure_benchmark 1000000
Populating 1,000,000 elements...

Benchmark 1: Full Traversal and Summation
  Dynamic Array: 0.003 seconds
  Linked List:   0.095 seconds
  -> Array is 31.67x faster due to spatial locality.

Benchmark 2: 10,000 Random Insertions/Deletions
  Dynamic Array: 1.25 seconds (due to memory shifting)
  Linked List:   0.05 seconds (pointer manipulation is cheap)
  -> Linked List is 25.00x faster for this access pattern.

Implementation Hints:

Linked List Node:

struct Node {
    int data;
    struct Node* next;
    struct Node* prev;
};

When you malloc each new node, the allocator can place it almost anywhere in the heap. Traversing the list means jumping from one essentially random memory location to another, causing a cache miss for almost every single node. This is called pointer chasing.

Dynamic Array:

struct DynamicArray {
    int* data;
    size_t size;
    size_t capacity;
};

The data pointer points to a single, contiguous block of memory. Traversing this array is the ideal scenario for a CPU’s prefetcher and results in minimal cache misses. However, inserting an element in the middle requires shifting all subsequent elements, which is a very expensive memmove operation.
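
For reference, a sketch of the middle-insertion path, assuming the struct DynamicArray fields above; the memmove is exactly the O(n) cost the second benchmark measures:

#include <stdlib.h>
#include <string.h>

// Insert 'value' at position 'index', shifting everything after it one slot right.
int da_insert(struct DynamicArray *da, size_t index, int value) {
    if (da->size == da->capacity) {
        size_t new_cap = da->capacity ? da->capacity * 2 : 16;
        int *p = realloc(da->data, new_cap * sizeof(int));
        if (!p) return -1;
        da->data = p;
        da->capacity = new_cap;
    }
    memmove(&da->data[index + 1], &da->data[index],
            (da->size - index) * sizeof(int));   // the expensive part
    da->data[index] = value;
    da->size++;
    return 0;
}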

Learning milestones:

  1. You have working implementations of both data structures → You are comfortable with dynamic memory in C.
  2. You measure the massive traversal performance gap → You viscerally understand the cost of pointer chasing.
  3. You see the linked list win at random insertions → You understand the trade-offs between different data structures.
  4. You start choosing data structures based on hardware performance characteristics, not just theoretical Big-O notation → You have achieved a deeper level of programming maturity.

Project 7: Cache-Line Aware Memory Allocator

  • File: LEARN_C_PERFORMANCE_DEEP_DIVE.md
  • Main Programming Language: C
  • Alternative Programming Languages: C++
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 4: Expert
  • Knowledge Area: Memory Management / Systems Programming / CPU Caches
  • Software or Tool: GCC/Clang
  • Main Book: “C Interfaces and Implementations: Techniques for Creating Reusable Software” by David R. Hanson

What you’ll build: A simple, special-purpose memory allocator that is aware of cache lines. It will provide a function like alloc_on_new_line(pool_t* pool, size_t size) which guarantees the returned memory block starts on a new 64-byte cache-line boundary.

Why it teaches C Performance: This project forces you to think about memory not just as bytes, but as an aligned, structured resource. You will go beyond using aligned data to providing it. This is a foundational skill for writing high-performance libraries, especially for multithreading, where it can be used to prevent false sharing.

Core challenges you’ll face:

  • Requesting a large block of memory from the OS → maps to using mmap (on Linux/macOS) or VirtualAlloc (on Windows)
  • Managing your own memory pool → maps to writing a simple slab or bump allocator
  • Calculating aligned memory addresses → maps to bitwise operations and pointer arithmetic
  • Returning memory that satisfies alignment requests → maps to the core logic of the allocator

Key Concepts:

  • Data Alignment: “Write Great Code, Volume 1” Ch. 4 - Randall Hyde
  • Pointer Arithmetic: “Understanding and Using C Pointers” by Richard M Reese
  • Memory Pool / Slab Allocation: Wikipedia provides a good high-level overview.

Difficulty: Expert Time estimate: 2-3 weeks

  • Prerequisites: Strong C skills, including pointer arithmetic and bitwise operations.

Real world outcome: A library with a header file and an implementation that you can use in other projects. Its usage would look like:

#include "cache_allocator.h"

// In another project (like the False Sharing Detector)
int main() {
    pool_t* my_pool = pool_create(1024 * 1024); // 1MB pool

    // These two variables are guaranteed to be on different cache lines
    long long* counterA = alloc_on_new_line(my_pool, sizeof(long long));
    long long* counterB = alloc_on_new_line(my_pool, sizeof(long long));

    // ... run false sharing benchmark ...
    // The benchmark will show no false sharing, proving your allocator works.

    pool_destroy(my_pool);
    return 0;
}

Implementation Hints:

The core trick to alignment is to allocate more memory than requested and then return a pointer inside that block that is correctly aligned.

// Pseudo-code for your allocator's logic:
function my_aligned_alloc(size, alignment):
  // 1. alignment must be a power of 2. 64 for a cache line.
  mask = alignment - 1

  // 2. Allocate more memory than needed.
  // We need space for the data, plus space to align the pointer,
  // plus space to store the original pointer so we can free it later.
  total_size = size + mask + sizeof(void*)
  unaligned_ptr = malloc(total_size)

  // 3. Find the aligned address inside the allocated block.
  // The bitwise trick `(ptr + mask) & ~mask` rounds UP to the next alignment boundary.
  aligned_ptr = (unaligned_ptr + sizeof(void*) + mask) & ~mask

  // 4. Store the original, unaligned pointer just before the aligned block.
  // This is how `free` will find it later.
  ptr_storage_location = aligned_ptr - sizeof(void*)
  *ptr_storage_location = unaligned_ptr

  return aligned_ptr

function my_aligned_free(aligned_ptr):
  // 1. Find the location where we stored the original pointer.
  ptr_storage_location = aligned_ptr - sizeof(void*)

  // 2. Retrieve the original pointer.
  unaligned_ptr = *ptr_storage_location

  // 3. Free the original block.
  free(unaligned_ptr)

Your implementation will be more robust, likely using a single large pool from mmap instead of many malloc calls.
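
A direct C translation of that pseudo-code, as a standalone sketch (a pool-based version would carve blocks out of one mmap'd region instead of calling malloc each time):

#include <stdlib.h>
#include <stdint.h>

// Returns a pointer aligned to a 64-byte cache line; free with aligned_free_64.
void *aligned_alloc_64(size_t size) {
    const size_t alignment = 64;                 // one cache line
    const size_t mask = alignment - 1;

    // Room for the data, the worst-case alignment slack, and the saved pointer.
    void *raw = malloc(size + mask + sizeof(void *));
    if (!raw) return NULL;

    uintptr_t start = (uintptr_t)raw + sizeof(void *);
    uintptr_t aligned = (start + mask) & ~(uintptr_t)mask;   // round up to 64

    ((void **)aligned)[-1] = raw;                // stash the original pointer
    return (void *)aligned;
}

void aligned_free_64(void *ptr) {
    if (ptr) free(((void **)ptr)[-1]);           // recover and free the original block
}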

Learning milestones:

  1. You can allocate a block and return an aligned pointer from it → You understand pointer arithmetic for alignment.
  2. You can correctly free the memory → You have a working allocation/deallocation cycle.
  3. Your allocator works as a drop-in solution to fix the false sharing project → You have validated your work in a real-world scenario.
  4. You start to think about memory allocation as a performance tool → You’ve unlocked a new dimension of optimization.

Project 8: Profiler for Cache Misses

  • File: LEARN_C_PERFORMANCE_DEEP_DIVE.md
  • Main Programming Language: C
  • Alternative Programming Languages: C++, Rust
  • Coolness Level: Level 5: Pure Magic (Super Cool)
  • Business Potential: 2. The “Micro-SaaS / Pro Tool”
  • Difficulty: Level 4: Expert
  • Knowledge Area: Systems Programming / CPU Internals / Linux Kernel
  • Software or Tool: Linux perf_event_open syscall
  • Main Book: “The Linux Programming Interface” by Michael Kerrisk

What you’ll build: A simple command-line profiler that uses the Linux perf_event_open syscall to wrap another command and report on the L1/L2/L3 cache misses it generated.

Why it teaches C Performance: This project is the ultimate ground truth. Instead of inferring cache performance from timing, you will measure it directly by asking the CPU’s own Performance Monitoring Unit (PMU). Building this tool will demystify hardware performance counters and give you a powerful utility to analyze any other program.

Core challenges you’ll face:

  • Understanding the perf_event_open syscall → maps to reading kernel documentation and dealing with a complex API
  • Setting up the perf_event_attr struct → maps to specifying which hardware events to count (e.g., PERF_COUNT_HW_CACHE_MISSES)
  • Forking and executing a child process (execvp) → maps to standard Unix process management
  • Controlling the counters with ioctl → maps to enabling, disabling, and resetting counts around the child process execution
  • Reading and reporting the results → maps to getting the final numbers from the kernel

Key Concepts:

  • perf_event_open syscall: man perf_event_open is the primary source.
  • Performance Monitoring Units (PMU): Intel/AMD developer manuals.
  • fork/exec process model: “Advanced Programming in the UNIX Environment” by Stevens & Rago.

Difficulty: Expert Time estimate: 2-3 weeks

  • Prerequisites: Strong C skills, comfort with Linux syscalls, Project 1 (for a target to profile).

Real world outcome: You will have a tool that works like this:

# Profile the Locality Benchmarker from Project 1
$ ./cache_profiler ./locality_benchmark

> Profiling command: ./locality_benchmark
... (output from the locality_benchmark) ...
> Profiling finished. Hardware Performance Counters:
  L1d Cache Loads:    2,014,589,123
  L1d Cache Misses:   501,887,345 (24.91%)
  L3 Cache Misses:    125,234,876 ( 6.22%)
  Instructions:       10,456,123,789
  CPU Cycles:         25,123,456,901

With this tool, you can definitively prove that the column-major traversal in Project 1 caused more cache misses.

Implementation Hints:

This is a complex syscall. The basic flow is:

  1. In main, parse the command-line arguments (the command to be profiled).
  2. Set up a struct perf_event_attr for each event you want to count (e.g., PERF_TYPE_HW_CACHE, PERF_COUNT_HW_CACHE_LL for Last-Level-Cache misses).
  3. Call perf_event_open for each event. This returns a file descriptor for each counter.
  4. fork() to create a child process.
  5. In the child process:
    • Call execvp to replace the child process with the target program you want to profile.
  6. In the parent process:
    • waitpid for the child to terminate.
    • Before and after the waitpid call, use ioctl(fd, PERF_EVENT_IOC_ENABLE/DISABLE) to control the counters.
    • After the child is done, read the 64-bit count from each file descriptor.
    • Print the results in a formatted report.
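
Before tackling fork/exec, it may help to exercise the counter API on the current process first. A minimal sketch of the perf_event_open / ioctl / read flow, counting last-level cache misses around a single code region (you may need to lower /proc/sys/kernel/perf_event_paranoid to run it):

#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <string.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

// There is no glibc wrapper for this syscall, so provide one.
static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                            int cpu, int group_fd, unsigned long flags) {
    return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void) {
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_CACHE_MISSES;   // last-level cache misses
    attr.disabled = 1;
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;

    int fd = perf_event_open(&attr, 0 /* this process */, -1 /* any CPU */, -1, 0);
    if (fd == -1) { perror("perf_event_open"); return 1; }

    // Work to measure: stride through a large buffer to generate misses.
    size_t n = 64u * 1024 * 1024;
    volatile char *buf = malloc(n);
    if (!buf) return 1;

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    for (size_t i = 0; i < n; i += 64) buf[i] = (char)i;
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    uint64_t count = 0;
    if (read(fd, &count, sizeof(count)) == sizeof(count))
        printf("LL cache misses: %llu\n", (unsigned long long)count);

    close(fd);
    free((void *)buf);
    return 0;
}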

Learning milestones:

  1. You can successfully count a single hardware event for a child process → You have mastered the basics of perf_event_open.
  2. You can count and report multiple events simultaneously → You can build a useful profiler.
  3. You use the tool to analyze your previous projects → You are using your own tool to connect theory (e.g., bad spatial locality) with direct hardware measurement (more cache misses).
  4. You are no longer afraid of performance analysis → You have the ultimate tool for understanding how hardware executes your code.

Project 9: AoS to SoA Converter Tool

  • File: LEARN_C_PERFORMANCE_DEEP_DIVE.md
  • Main Programming Language: C
  • Alternative Programming Languages: C++, Rust, Python (for parsing)
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 2. The “Micro-SaaS / Pro Tool”
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Code Generation / Parsing / Data-Oriented Design
  • Software or Tool: A C parsing library (like libclang) is recommended.
  • Main Book: “Language Implementation Patterns” by Terence Parr

What you’ll build: A command-line utility that reads a C header file, finds a specific struct definition (e.g., marked with a comment annotation such as // @soa_transform, or an __attribute__), and automatically generates a new header file containing the equivalent Struct-of-Arrays (SoA) layout and accessor functions.

Why it teaches C Performance: This project takes your understanding of AoS vs. SoA to the next level: automation. Instead of manually converting data structures, you’ll build a tool that does it for you. This forces you to think about data layouts programmatically and is a great introduction to the world of source-to-source compilers and code generation.

Core challenges you’ll face:

  • Parsing C struct definitions → maps to using a library like libclang or writing a simple, brittle parser yourself
  • Transforming the parsed data → maps to converting the list of fields into a set of arrays
  • Generating the new C code → maps to printf-ing valid C code into a new .h file
  • Generating helper functions/macros → maps to creating an API to make the SoA layout usable (e.g., get_pixel_r(soa_img, i))

Key Concepts:

  • Abstract Syntax Trees (AST): The primary output of a parser like libclang.
  • Code Generation: A core concept in compilers.
  • AoS vs. SoA: The core performance concept being automated.

Difficulty: Advanced Time estimate: 2-3 weeks

  • Prerequisites: Project 5 (Image Processing Benchmark), strong C skills.

Real world outcome: A tool that transforms an input file into a new output file.

Input (image.h):

// @soa_transform
typedef struct Pixel {
    float r, g, b, a;
    int obj_id;
} Pixel;

Command:

$ ./soa_converter image.h -o image_soa.h

Output (image_soa.h):

// Generated by soa_converter. Do not edit.
typedef struct Pixel_SoA {
    float* r;
    float* g;
    float* b;
    float* a;
    int* obj_id;
    size_t count;
    size_t capacity;
} Pixel_SoA;

// Accessor functions/macros
#define Pixel_SoA_get_r(p, i) ((p)->r[i])
#define Pixel_SoA_set_r(p, i, val) ((p)->r[i] = (val))
// ... and so on for g, b, a, obj_id

Implementation Hints:

Using libclang is the robust way to do this, but has a steep learning curve. A simpler, more fragile approach is to write a parser that only handles structs and basic types.

  1. Read the input file line by line.
  2. When you find your target annotation (e.g., // @soa_transform), start parsing the struct definition that follows.
  3. For each line inside the struct (e.g., float r;), store the type (float) and the name (r) in a list.
  4. After parsing the struct, start generating the output file.
  5. For the SoA struct definition, iterate through your list of fields and print a pointer member for each (e.g., float* r;).
  6. For the accessor macros, iterate through the list again and generate a _get_ and _set_ macro for each field.
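
The code-generation half can be surprisingly small once parsing is done. A sketch of steps 5–6, assuming you already have the fields in a list (struct field and emit_soa are our own names):

#include <stdio.h>
#include <stddef.h>

struct field { const char *type; const char *name; };

// Emit the SoA struct definition and get/set macros for each field.
void emit_soa(FILE *out, const char *name, const struct field *fields, size_t n) {
    fprintf(out, "typedef struct %s_SoA {\n", name);
    for (size_t i = 0; i < n; i++)
        fprintf(out, "    %s* %s;\n", fields[i].type, fields[i].name);
    fprintf(out, "    size_t count;\n    size_t capacity;\n} %s_SoA;\n\n", name);

    for (size_t i = 0; i < n; i++) {
        fprintf(out, "#define %s_SoA_get_%s(p, i) ((p)->%s[i])\n",
                name, fields[i].name, fields[i].name);
        fprintf(out, "#define %s_SoA_set_%s(p, i, val) ((p)->%s[i] = (val))\n",
                name, fields[i].name, fields[i].name);
    }
}

// Example use: struct field px[] = {{"float","r"},{"float","g"},{"float","b"},
//                                   {"float","a"},{"int","obj_id"}};
//              emit_soa(stdout, "Pixel", px, 5);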

Learning milestones:

  1. You can parse a simple struct definition → You understand the basics of parsing.
  2. You can generate a valid SoA C header → You understand code generation.
  3. The generated code compiles and works → You have a working source-to-source transformation tool.
  4. You start seeing opportunities to automate other code transformations → You are thinking like a compiler writer.

Project 10: SIMD-Powered String Search

  • File: LEARN_C_PERFORMANCE_DEEP_DIVE.md
  • Main Programming Language: C
  • Alternative Programming Languages: C++, Rust
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 4: Expert
  • Knowledge Area: SIMD / Vectorization / Low-Level Optimization
  • Software or Tool: GCC/Clang with Intrinsics (<immintrin.h>)
  • Main Book: “Modern X86 Assembly Language Programming” by Daniel Kusswurm (for concepts)

What you’ll build: An implementation of the strstr function (find a substring in a larger string) that is accelerated using SIMD (Single Instruction, Multiple Data) intrinsics. You will compare its performance against the standard library’s strstr and a naive C implementation.

Why it teaches C Performance: This project dives into data-parallelism at the instruction level. It shows that the CPU can perform the same operation on multiple pieces of data (e.g., 16 or 32 bytes) in a single instruction. This is a powerful optimization technique that requires understanding data alignment and structuring your algorithm to be vectorizable, which are skills closely related to cache-aware design.

Core challenges you’ll face:

  • Understanding SIMD concepts → maps to vector registers (XMM/YMM/ZMM), packed data, and lanes
  • Using compiler intrinsics → maps to writing C code that maps directly to specific assembly instructions (e.g., _mm256_load_si256)
  • Designing a SIMD-friendly algorithm → maps to rethinking a simple search into a parallel operation
  • Handling edge cases → maps to unaligned data and leftover bytes at the end of a string

Key Concepts:

  • SIMD (Single Instruction, Multiple Data): Intel Intrinsics Guide (web search).
  • AVX/SSE Intrinsics: GCC documentation on x86 intrinsics.
  • Data Alignment for Vectorization: Many blog posts cover why aligned loads are faster.

Difficulty: Expert Time estimate: 2-3 weeks

  • Prerequisites: Strong C and pointer skills, basic assembly knowledge is helpful.

Real world outcome:

$ ./simd_strstr
Searching for a 16-byte needle in a 1GB haystack...

  Naive C strstr: 1.2 seconds
  Standard library strstr: 0.3 seconds
  SIMD-accelerated strstr: 0.08 seconds

  -> SIMD is 15x faster than naive C and 3.75x faster than the optimized standard library.

Implementation Hints:

A common SIMD approach for strstr (finding needle in haystack):

  1. Create a vector where every byte is the first character of the needle.
  2. Create another vector where every byte is the last character of the needle.
  3. Iterate through the haystack one vector-width (e.g., 32 bytes) at a time.
  4. In each step:
    a. Load 32 bytes of the haystack starting at the current position into one vector, and 32 bytes starting needle_len - 1 bytes later into a second vector.
    b. Compare the first vector against the “first character” vector (_mm256_cmpeq_epi8). This gives you a bitmask of potential starts.
    c. Compare the second vector against the “last character” vector. This gives you a bitmask of positions where the needle could end.
    d. AND the two bitmasks. Each set bit marks a position where both the first and last characters line up; if the result is non-zero, you have potential matches.
  5. For each potential match indicated by the final bitmask, perform a full memcmp to confirm it’s not a false positive.
  6. Handle the remaining bytes at the end of the haystack that don’t fill a full vector.

This algorithm quickly disqualifies huge chunks of the haystack that cannot possibly contain the needle.
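
A sketch of that filter using AVX2 intrinsics (compile with -mavx2; the function name, the unaligned loads, and the scalar tail loop are our own choices):

#include <immintrin.h>
#include <stdint.h>
#include <string.h>
#include <stddef.h>

// Returns a pointer to the first occurrence of needle in hay, or NULL.
static const char *simd_strstr(const char *hay, size_t hay_len,
                               const char *needle, size_t needle_len) {
    if (needle_len == 0) return hay;
    if (hay_len < needle_len) return NULL;

    const __m256i first = _mm256_set1_epi8(needle[0]);
    const __m256i last  = _mm256_set1_epi8(needle[needle_len - 1]);
    size_t last_start = hay_len - needle_len;    // last valid start index
    size_t i = 0;

    // Check 32 candidate start positions per iteration.
    for (; i + 31 <= last_start; i += 32) {
        __m256i blk_first = _mm256_loadu_si256((const __m256i *)(hay + i));
        __m256i blk_last  = _mm256_loadu_si256((const __m256i *)(hay + i + needle_len - 1));
        __m256i eq = _mm256_and_si256(_mm256_cmpeq_epi8(first, blk_first),
                                      _mm256_cmpeq_epi8(last, blk_last));
        uint32_t mask = (uint32_t)_mm256_movemask_epi8(eq);
        while (mask) {                            // each set bit is a candidate start
            int bit = __builtin_ctz(mask);
            if (memcmp(hay + i + bit, needle, needle_len) == 0)
                return hay + i + bit;             // confirmed, not a false positive
            mask &= mask - 1;                     // clear lowest set bit
        }
    }
    // Scalar tail: the last few start positions that don't fill a full vector.
    for (; i <= last_start; i++)
        if (memcmp(hay + i, needle, needle_len) == 0)
            return hay + i;
    return NULL;
}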

Learning milestones:

  1. You can load data and perform a basic SIMD comparison → You understand the intrinsics workflow.
  2. Your SIMD search implementation produces correct results → You have a working vectorized algorithm.
  3. Your implementation is significantly faster than the naive C version → You have successfully applied SIMD for optimization.
  4. You can explain the trade-offs and handle unaligned data → You are a competent low-level performance engineer.

Summary

Project                            Main Language   Difficulty     Key Concept Taught
Locality Benchmarker               C               Beginner       Spatial Locality
Struct Layout Analyzer             C               Beginner       Struct Padding & Alignment
Matrix Multiplication Optimizer    C               Advanced       Cache Blocking (Tiling)
False Sharing Detector             C               Advanced       False Sharing in Concurrency
Image Processing Cache Benchmark   C               Intermediate   AoS vs. SoA Data Layouts
Data Structure Cache Benchmark     C               Intermediate   Pointer Chasing vs. Contiguous Access
Cache-Line Aware Allocator         C               Expert         Memory Alignment
Profiler for Cache Misses          C               Expert         Hardware Performance Counters
AoS to SoA Converter Tool          C               Advanced       Data-Oriented Code Generation
SIMD-Powered String Search         C               Expert         Instruction-Level Parallelism