Project 4: False Sharing Detector
Build a microbenchmark that creates false sharing and a profiling workflow that detects and fixes it using `perf c2c`.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Level 2: Intermediate |
| Time Estimate | 1-2 weeks |
| Main Programming Language | C |
| Alternative Programming Languages | C++, Rust |
| Coolness Level | Level 4: Hardcore Tech Flex |
| Business Potential | 2. The “Internal Tool” |
| Prerequisites | Threads, cache basics, Linux perf tooling |
| Key Topics | cache lines, coherence, false sharing, perf c2c |
1. Learning Objectives
By completing this project, you will:
- Explain how cache coherence works and why false sharing happens.
- Build a workload that reliably triggers false sharing.
- Use `perf c2c` to detect HITM-heavy cache lines.
- Apply padding and alignment fixes and quantify improvements.
- Interpret the relationship between NUMA distance and coherence traffic.
- Produce deterministic benchmark output for regression testing.
2. All Theory Needed (Per-Concept Breakdown)
2.1 Cache Coherence, Cache Lines, and False Sharing
Fundamentals
Cache coherence ensures that all cores see a consistent view of memory. It operates at cache-line granularity (typically 64 bytes). False sharing occurs when different threads update different variables that happen to reside on the same cache line. Even though the variables are logically independent, the coherence protocol treats the whole line as a single unit, so each write invalidates the other core’s copy. This causes a ping-pong effect that can reduce throughput dramatically. False sharing is especially harmful on NUMA systems because invalidation traffic must cross the interconnect, amplifying latency and consuming bandwidth.
Deep Dive into the Concept
Most modern CPUs use a MESI or MOESI coherence protocol. Each cache line can be in a state such as Modified, Exclusive, Shared, or Invalid. When a core writes to a line, it must obtain exclusive ownership, invalidating other cores’ copies. If two threads on different cores repeatedly write different variables that share the same line, the line continually bounces between cores. Each bounce requires coherence messages across the interconnect and may trigger writebacks.
False sharing is subtle because the program’s data layout determines which variables share a line. Compilers may pack struct fields together; arrays of small structs can place adjacent elements in the same line. In multi-threaded counters, it is common for per-thread counters to end up adjacent in memory, causing massive coherence traffic when all threads update them frequently.
On NUMA systems, the cost of this coherence traffic is higher because the interconnect between sockets has higher latency and lower bandwidth than on-die connections. A false-sharing hotspot can therefore dominate runtime. The symptom is high cache misses, high HITM (Hit Modified) events, and low throughput. You can eliminate false sharing by padding or aligning data so that each thread’s hot data resides on a different cache line. This is why many performance libraries use alignas(64) or padding arrays to avoid line sharing.
Understanding false sharing requires a mental model of cache-line ownership. A line is not “owned” by a variable but by the cache as a whole. So the solution is not to “avoid writes” but to avoid co-locating hot write-heavy variables on the same line. The benchmark you build will demonstrate this effect clearly by showing a dramatic throughput increase after padding.
How This Fits Into the Project
This concept guides how you build the benchmark and interpret perf c2c results. Your fix (padding/alignment) is directly based on this understanding.
Definitions & Key Terms
- Cache line -> The smallest unit of data transferred between cache and memory.
- Coherence -> Protocol ensuring consistent memory across cores.
- False sharing -> Independent variables sharing a cache line, causing invalidations.
- HITM -> Hit Modified; a cache access that hits a modified line in another cache.
- Ping-pong -> Repeated line invalidations between cores.
Mental Model Diagram (ASCII)
```
Cache Line (64B): [counter0][counter1][counter2][counter3]

Thread0 writes counter0, Thread1 writes counter1
Writes cause the line to bounce between cores:
  Core0: M -> Core1: I  ->  Core1: M -> Core0: I  -> ...
```
How It Works (Step-by-Step)
- Two threads write to different variables that share a cache line.
- Each write requests exclusive ownership of the line.
- The coherence protocol invalidates the other core’s copy.
- The line ping-pongs, causing high HITM and low throughput.
Invariants: Coherence operates on cache lines, not variables.
Failure modes: Padding too small, or data structure reordered by compiler.
Minimal Concrete Example
```c
#include <stdint.h>

typedef struct {
    volatile uint64_t counter;   /* hot write */
} Counter;

Counter counters[2];             /* adjacent elements likely share one cache line */

/* Fix: pad each counter out to a full 64-byte line */
typedef struct {
    volatile uint64_t counter;
    char pad[64 - sizeof(uint64_t)];
} PaddedCounter;
```
Common Misconceptions
- “If variables are different, they don’t interfere” -> They do if they share a line.
- “Cache lines are per-core” -> Coherence is shared across cores.
- “False sharing is rare” -> It is common in naive multi-threaded code.
Check-Your-Understanding Questions
- Why does false sharing reduce throughput if threads touch different variables?
- What does HITM indicate in profiling output?
- How does padding fix false sharing?
Check-Your-Understanding Answers
- Because coherence transfers the entire line, causing invalidations.
- It indicates a cache line owned by another core in Modified state.
- Padding ensures each thread’s data lives in a separate cache line.
Real-World Applications
- High-performance counters in telemetry systems.
- Lock-free queues and ring buffers.
- Thread-local statistics in databases.
Where You’ll Apply It
- In this project: Sec. 3.1, Sec. 4.1, Sec. 7.1.
- Also used in: P06-numa-aware-data-structure-library.
References
- “The Art of Multiprocessor Programming” (Herlihy & Shavit) – Ch. 3-4
- “Computer Architecture” (Hennessy & Patterson) – Ch. 5
Key Insights
False sharing is not a logic bug; it is a layout bug at cache-line granularity.
Summary
False sharing arises when independent hot variables occupy the same cache line. Coherence traffic forces the line to bounce between cores, destroying throughput. Aligning or padding data structures is the simplest fix.
Homework/Exercises to Practice the Concept
- Write a two-thread increment loop with adjacent counters and measure throughput.
- Add padding to separate the counters and measure the improvement.
- Use `alignas(64)` and verify the memory addresses differ by 64 bytes.
Solutions to the Homework/Exercises
- Throughput will be low and HITM counts high.
- Throughput will improve significantly after padding.
- Addresses should differ by at least 64 bytes, preventing sharing.
2.2 Profiling False Sharing with perf c2c and HITM Events
Fundamentals
perf c2c (cache-to-cache) profiling captures cache-line sharing events across cores. It reports lines that are frequently transferred between cores, often labeled with HITM (Hit Modified). High HITM rates on a line are a strong signal of false sharing. Using perf c2c record and perf c2c report, you can identify the hot line, map it to the source code, and confirm that a padding fix eliminates the contention. This workflow turns a vague performance problem into a precise, reproducible diagnosis.
Deep Dive into the Concept
Performance counters in modern CPUs track cache events, including cache-to-cache transfers. When one core accesses a line that another core has in Modified state, the request is serviced by the other core’s cache (HITM). This indicates a coherence transfer and usually implies writes are bouncing the line. perf c2c builds on these counters to provide a report of cache-line addresses, the number of HITM events, and the CPUs involved.
Using perf c2c requires hardware support and kernel configuration. You typically run perf c2c record -- <command> to collect events, then perf c2c report to analyze. The report includes columns for remote HITMs, local HITMs, and a list of CPUs sharing the line. If you see a line with a high HITM percentage and multiple CPUs listed, that line is a prime candidate for false sharing. You can then map the address to a symbol or a struct field using addr2line or by printing the address of your data structure in the program.
A subtle but important aspect is that not all HITM events are bad. Some sharing is legitimate, such as a producer-consumer queue. The key is to correlate HITM hotspots with the code’s intent. For this project, you deliberately create false sharing so you can see the signature clearly. Then you apply padding and rerun. A successful fix will dramatically reduce HITM events and increase throughput. This is a powerful demonstration of how performance tooling can guide code changes.
Another detail: NUMA effects can amplify the cost of HITM events, because coherence traffic traverses the interconnect. Therefore, you should run the benchmark with threads pinned to different nodes to make the penalty visible. You can even compare intra-socket vs inter-socket false sharing, which is a valuable real-world lesson.
Finally, perf c2c output is often noisy. You should collect enough samples, repeat runs, and report averages. Your tool can integrate a “profiler helper” script that wraps the perf c2c commands with consistent flags and stores results alongside the benchmark output for comparison.
How This Fits Into the Project
This concept defines the diagnostic workflow: record, report, interpret, fix. It transforms the benchmark into a repeatable performance investigation.
Definitions & Key Terms
- perf c2c -> perf mode for cache-to-cache transfer analysis.
- HITM -> Hit Modified; access to a line owned by another core.
- Hot cache line -> Cache line with unusually high coherence traffic.
- addr2line -> Tool for mapping addresses to source lines.
Mental Model Diagram (ASCII)
perf c2c report (simplified):
```
Address     HITM%   CPUs
0x7f8c...   92%     0,1,2,3,8,9,10,11
```
High HITM% = likely false sharing
How It Works (Step-by-Step)
- Run the benchmark under `perf c2c record`.
- Generate a report with `perf c2c report`.
- Identify the cache line with the highest HITM.
- Map the address to your data structure.
- Apply padding and rerun to verify reduction.
Invariants: High HITM indicates shared modified lines.
Failure modes: Missing hardware support, insufficient sampling, symbol resolution issues.
Minimal Concrete Example
```sh
perf c2c record -- ./false_sharing_bench --threads=8 --padding=0
perf c2c report --stdio | head -n 20
```
Common Misconceptions
- “perf c2c is always available” -> Some CPUs or kernels lack support.
- “Any HITM is bad” -> Some shared lines are legitimate.
- “One run is enough” -> Noise requires multiple runs for confidence.
Check-Your-Understanding Questions
- What does a high HITM percentage indicate?
- Why should you map hot addresses back to source code?
- How do you confirm a padding fix worked?
Check-Your-Understanding Answers
- A line is frequently transferred between cores in Modified state.
- To identify the specific variable or struct field causing sharing.
- HITM rates drop and throughput increases after padding.
Real-World Applications
- Diagnosing performance regressions in multi-threaded services.
- Optimizing counters and metrics systems.
Where You’ll Apply It
- In this project: Sec. 3.7, Sec. 7.2.
- Also used in: P06-numa-aware-data-structure-library, P09-numa-aware-thread-pool.
References
- “The Art of Multiprocessor Programming” – Ch. 3-4
- “Computer Architecture” – Ch. 5
Key Insights
perf c2c turns a hidden cache-line problem into a visible, fixable hotspot.
Summary
perf c2c exposes cache-line sharing behavior by reporting HITM events. By correlating the hottest line to your data structures, you can fix false sharing with padding and verify improvements quantitatively.
Homework/Exercises to Practice the Concept
- Run the benchmark with padding=0 and collect a c2c report.
- Apply padding and compare HITM rates.
- Run with threads on different sockets and compare HITM cost.
Solutions to the Homework/Exercises
- The report should show a hot line with high HITM and many CPUs.
- HITM rates should drop dramatically after padding.
- Cross-socket sharing should show higher latency and lower throughput.
3. Project Specification
3.1 What You Will Build
A CLI tool false-sharing-bench that generates false sharing with multi-threaded counters, profiles the run with perf c2c, and demonstrates the fix using cache-line padding.
Included: benchmark, padding toggle, profiling workflow instructions, deterministic output. Excluded: full profiler GUI or kernel modifications.
3.2 Functional Requirements
- Launch N threads that repeatedly update per-thread counters.
- Provide a `--padding` flag to control cache-line padding.
- Pin threads to specific CPUs or nodes.
- Print throughput and timing results.
- Provide a `--profile` mode that emits `perf c2c` instructions.
- Support a deterministic seed for any randomized workload.
3.3 Non-Functional Requirements
- Performance: Throughput results stable across runs.
- Reliability: Works on systems without perf c2c (with warning).
- Usability: Clear output and instructions for profiling.
3.4 Example Usage / Output
```
$ ./false-sharing-bench --threads=8 --padding=0
Throughput: 120 M ops/sec
$ ./false-sharing-bench --threads=8 --padding=64
Throughput: 680 M ops/sec
```
3.5 Data Formats / Schemas / Protocols
JSON output:
```json
{
  "threads": 8,
  "padding": 64,
  "throughput_mops": 680,
  "duration_ms": 1000
}
```
3.6 Edge Cases
- perf c2c unavailable (report warning).
- Padding smaller than cache line.
- Thread count exceeds CPU count.
3.7 Real World Outcome
You can demonstrate false sharing and eliminate it, backed by perf c2c evidence and throughput gains.
3.7.1 How to Run (Copy/Paste)
```sh
cc -O2 -pthread -o false-sharing-bench src/main.c
./false-sharing-bench --threads=8 --padding=0
```
3.7.2 Golden Path Demo (Deterministic)
```
$ ./false-sharing-bench --threads=8 --padding=0 --duration=1000
Throughput: 120 M ops/sec
$ ./false-sharing-bench --threads=8 --padding=64 --duration=1000
Throughput: 680 M ops/sec
```
3.7.3 If CLI: Exact Terminal Transcript
```
$ perf c2c record -- ./false-sharing-bench --threads=8 --padding=0
$ perf c2c report --stdio | head -n 5
0x7f8c12345000  HITM: 92%  CPUs: 0,1,2,3,8,9,10,11
```
3.7.4 Failure Demo (Deterministic)
```
$ ./false-sharing-bench --threads=999
ERROR: thread count exceeds CPU count (max=16)
EXIT: 1
```
Exit Codes:
- 0: success
- 1: invalid arguments
- 2: perf unavailable
- 3: runtime error
4. Solution Architecture
4.1 High-Level Design
```
+--------------+   +--------------+   +------------------+
| Thread Setup |-->| Counter Loop |-->| Report Renderer  |
+--------------+   +--------------+   +------------------+
        ^                  |
        +----- perf c2c ---+
```
4.2 Key Components
| Component | Responsibility | Key Decisions |
|---|---|---|
| Thread Manager | Spawn threads, set affinity | Static pinning |
| Counter Array | Shared counters with padding | Cache-line-sized elements |
| Reporter | Measure throughput | Time-based loop |
| Profiler Helper | Guidance for perf c2c | Optional mode |
4.3 Data Structures (No Full Code)
```c
typedef struct {
    volatile uint64_t counter;
    char pad[64 - sizeof(uint64_t)];
} PaddedCounter;
```
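A compile-time size check (sketch) guards against the "padding too small" failure mode noted in Sec. 2.1; the 64-byte figure is an assumption about the target's line size:

```c
#include <stdint.h>

typedef struct {
    volatile uint64_t counter;
    char pad[64 - sizeof(uint64_t)];
} PaddedCounter;

/* Build fails if a field reorder or a padding-arithmetic change ever
   lets the struct drop below one full cache line. */
_Static_assert(sizeof(PaddedCounter) == 64,
               "PaddedCounter must occupy exactly one cache line");
```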
4.4 Algorithm Overview
Key Algorithm: False Sharing Benchmark
- Allocate counter array.
- Spawn threads pinned to CPUs.
- Each thread increments its counter in a tight loop.
- Measure operations per second.
- Compare results with padding on/off.
Complexity Analysis:
- Time: O(ops)
- Space: O(threads)
5. Implementation Guide
5.1 Development Environment Setup
```sh
sudo apt-get install -y build-essential linux-tools-common linux-tools-$(uname -r)
```
5.2 Project Structure
```
false-sharing-bench/
|-- src/
|   |-- main.c
|   |-- affinity.c
|   +-- report.c
|-- scripts/
|   +-- perf_c2c.sh
|-- Makefile
+-- README.md
```
5.3 The Core Question You’re Answering
“How do I detect and eliminate false sharing on real hardware?”
5.4 Concepts You Must Understand First
- Cache lines and coherence.
- False sharing and padding.
- perf c2c profiling.
5.5 Questions to Guide Your Design
- How will you guarantee that counters share a cache line?
- How will you choose padding size?
- How will you quantify improvement?
5.6 Thinking Exercise
Sketch an array of 8 counters and mark which counters share cache lines. Where does false sharing happen?
5.7 The Interview Questions They’ll Ask
- “What is false sharing and how do you fix it?”
- “What is HITM and why does it matter?”
- “How would you detect false sharing without perf c2c?”
5.8 Hints in Layers
Hint 1: Start with adjacent counters in a struct array.
Hint 2: Pin threads and run without padding.
Hint 3: Add 64-byte padding and compare throughput.
Hint 4: Confirm with perf c2c report.
5.9 Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Coherence and sharing | “The Art of Multiprocessor Programming” | Ch. 3-4 |
| Cache hierarchy | “Computer Architecture” | Ch. 5 |
5.10 Implementation Phases
Phase 1: Benchmark Core (2-3 days)
Goals: implement counter loop and throughput measurement.
Tasks:
- Build thread creation and synchronization.
- Measure ops/sec.
Checkpoint: baseline throughput reported.
Phase 2: Padding + Affinity (2-3 days)
Goals: add padding and thread pinning.
Tasks:
- Implement padded counters.
- Pin threads to specific CPUs.
Checkpoint: throughput increases with padding.
Phase 3: Profiling Workflow (2 days)
Goals: integrate perf c2c guidance.
Tasks:
- Add `--profile` output instructions.
- Provide sample report parsing.
Checkpoint: perf c2c report shows HITM reduction.
5.11 Key Implementation Decisions
| Decision | Options | Recommendation | Rationale |
|---|---|---|---|
| Padding size | 64 vs 128 bytes | 64 bytes | Matches the cache-line size |
| Timing | Fixed ops vs time-based | Time-based | Stable throughput measurement |
| Pinning | None vs affinity | Affinity | Reduces run-to-run noise |
6. Testing Strategy
6.1 Test Categories
| Category | Purpose | Examples |
|---|---|---|
| Unit Tests | Validate padding size | Struct size check |
| Integration Tests | Verify throughput improvement | Padding vs no padding |
| Edge Case Tests | Invalid thread count | Exit code 1 |
6.2 Critical Test Cases
- Padding yields 3-5x throughput improvement.
- perf c2c report shows HITM reduction.
- Thread pinning reduces variance.
6.3 Test Data
```
threads=8, duration=1000 ms
expected_throughput_no_padding < 200 M ops/sec
expected_throughput_padding    > 600 M ops/sec
```
7. Common Pitfalls & Debugging
7.1 Frequent Mistakes
| Pitfall | Symptom | Solution |
|---|---|---|
| Padding too small | No improvement | Use 64-byte alignment |
| No pinning | High variance | Set CPU affinity |
| perf missing | No c2c output | Install linux-tools |
7.2 Debugging Strategies
- Use `perf stat` to confirm cache-miss rates.
- Print the addresses of counters to confirm spacing.
7.3 Performance Traps
- Hyper-threading can hide effects; pin to physical cores when possible.
8. Extensions & Challenges
8.1 Beginner Extensions
- Add `--csv` output.
- Add per-thread throughput stats.
8.2 Intermediate Extensions
- Compare intra-socket vs inter-socket false sharing.
- Add padding auto-detection based on cache line size.
8.3 Advanced Extensions
- Integrate hardware counter collection for HITM within the tool.
9. Real-World Connections
9.1 Industry Applications
- Tuning telemetry counters in high-frequency trading systems.
- Optimizing server performance dashboards.
9.2 Related Open Source Projects
- perf – Linux performance tooling.
- folly – uses padded counters to avoid false sharing.
9.3 Interview Relevance
- Identifying false sharing as a root cause of scaling regressions.
10. Resources
10.1 Essential Reading
- “The Art of Multiprocessor Programming” – Ch. 3-4
- “Computer Architecture” – Ch. 5
10.2 Video Resources
- Cache coherence and false sharing talks.
10.3 Tools & Documentation
- perf c2c – cache line sharing profiler.
10.4 Related Projects in This Series
- P06-numa-aware-data-structure-library
- P09-numa-aware-thread-pool
11. Self-Assessment Checklist
11.1 Understanding
- I can explain coherence and false sharing.
- I can interpret a perf c2c report.
11.2 Implementation
- Padding fix improves throughput.
- perf c2c report shows HITM reduction.
11.3 Growth
- I can explain this workflow in an interview.
12. Submission / Completion Criteria
Minimum Viable Completion:
- Benchmark runs with padding toggle.
- Throughput reported for both modes.
Full Completion:
- perf c2c workflow documented and verified.
- Deterministic output with fixed duration.
Excellence (Going Above & Beyond):
- Automated analysis of perf output within the tool.