Project 4: False Sharing Detector
Build a microbenchmark that creates false sharing and a profiling workflow that detects and fixes it using `perf c2c`.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Level 2: Intermediate |
| Time Estimate | 1-2 weeks |
| Main Programming Language | C |
| Alternative Programming Languages | C++, Rust |
| Coolness Level | Level 4: Hardcore Tech Flex |
| Business Potential | 2. The “Internal Tool” |
| Prerequisites | Threads, cache basics, Linux perf tooling |
| Key Topics | cache lines, coherence, false sharing, perf c2c |
1. Learning Objectives
By completing this project, you will:
- Explain how cache coherence works and why false sharing happens.
- Build a workload that reliably triggers false sharing.
- Use `perf c2c` to detect HITM-heavy cache lines.
- Apply padding and alignment fixes and quantify improvements.
- Interpret the relationship between NUMA distance and coherence traffic.
- Produce deterministic benchmark output for regression testing.
2. All Theory Needed (Per-Concept Breakdown)
2.1 Cache Coherence, Cache Lines, and False Sharing
Fundamentals
Cache coherence ensures that all cores see a consistent view of memory. It operates at cache-line granularity (typically 64 bytes). False sharing occurs when different threads update different variables that happen to reside on the same cache line. Even though the variables are logically independent, the coherence protocol treats the whole line as a single unit, so each write invalidates the other core’s copy. This causes a ping-pong effect that can reduce throughput dramatically. False sharing is especially harmful on NUMA systems because invalidation traffic must cross the interconnect, amplifying latency and consuming bandwidth.
Deep Dive into the Concept
Most modern CPUs use a MESI or MOESI coherence protocol. Each cache line can be in a state such as Modified, Exclusive, Shared, or Invalid. When a core writes to a line, it must obtain exclusive ownership, invalidating other cores’ copies. If two threads on different cores repeatedly write different variables that share the same line, the line continually bounces between cores. Each bounce requires coherence messages across the interconnect and may trigger writebacks.
False sharing is subtle because the program’s data layout determines which variables share a line. Compilers may pack struct fields together; arrays of small structs can place adjacent elements in the same line. In multi-threaded counters, it is common for per-thread counters to end up adjacent in memory, causing massive coherence traffic when all threads update them frequently.
On NUMA systems, the cost of this coherence traffic is higher because the interconnect between sockets has higher latency and lower bandwidth than on-die connections. A false-sharing hotspot can therefore dominate runtime. The symptom is high cache misses, high HITM (Hit Modified) events, and low throughput. You can eliminate false sharing by padding or aligning data so that each thread’s hot data resides on a different cache line. This is why many performance libraries use alignas(64) or padding arrays to avoid line sharing.
Understanding false sharing requires a mental model of cache-line ownership. A line is not “owned” by a variable but by the cache as a whole. So the solution is not to “avoid writes” but to avoid co-locating hot write-heavy variables on the same line. The benchmark you build will demonstrate this effect clearly by showing a dramatic throughput increase after padding.
How This Fits Into the Project
This concept guides how you build the benchmark and interpret perf c2c results. Your fix (padding/alignment) is directly based on this understanding.
Definitions & Key Terms
- Cache line -> The smallest unit of data transferred between cache and memory.
- Coherence -> Protocol ensuring consistent memory across cores.
- False sharing -> Independent variables sharing a cache line, causing invalidations.
- HITM -> Hit Modified; a cache access that hits a modified line in another cache.
- Ping-pong -> Repeated line invalidations between cores.
Mental Model Diagram (ASCII)
```
Cache Line (64B): [counter0][counter1][counter2][counter3]

Thread0 writes counter0, Thread1 writes counter1
Writes cause the line to bounce between cores:
  Core0: M -> Core1: I  ->  Core1: M -> Core0: I  -> ...
```
How It Works (Step-by-Step)
- Two threads write to different variables that share a cache line.
- Each write requests exclusive ownership of the line.
- The coherence protocol invalidates the other core’s copy.
- The line ping-pongs, causing high HITM and low throughput.
Invariants: Coherence operates on cache lines, not variables.
Failure modes: Padding too small, or data structure reordered by compiler.
Minimal Concrete Example
```c
#include <stdint.h>

typedef struct {
    volatile uint64_t counter;   /* hot write */
} Counter;

Counter counters[2];             /* adjacent elements likely share one cache line */

/* Fix: pad each counter out to a full 64-byte line */
typedef struct {
    volatile uint64_t counter;
    char pad[64 - sizeof(uint64_t)];
} PaddedCounter;
```
Common Misconceptions
- “If variables are different, they don’t interfere” -> They do if they share a line.
- “Cache lines are per-core” -> Coherence is shared across cores.
- “False sharing is rare” -> It is common in naive multi-threaded code.
Check-Your-Understanding Questions
- Why does false sharing reduce throughput if threads touch different variables?
- What does HITM indicate in profiling output?
- How does padding fix false sharing?
Check-Your-Understanding Answers
- Because coherence transfers the entire line, causing invalidations.
- It indicates a cache line owned by another core in Modified state.
- Padding ensures each thread’s data lives in a separate cache line.
Real-World Applications
- High-performance counters in telemetry systems.
- Lock-free queues and ring buffers.
- Thread-local statistics in databases.
Where You’ll Apply It
- In this project: Sec. 3.1, Sec. 4.1, Sec. 7.1.
- Also used in: P06-numa-aware-data-structure-library.
References
- “The Art of Multiprocessor Programming” (Herlihy & Shavit) – Ch. 3-4
- “Computer Architecture” (Hennessy & Patterson) – Ch. 5
Key Insights
False sharing is not a logic bug; it is a layout bug at cache-line granularity.
Summary
False sharing arises when independent hot variables occupy the same cache line. Coherence traffic forces the line to bounce between cores, destroying throughput. Aligning or padding data structures is the simplest fix.
Homework/Exercises to Practice the Concept
- Write a two-thread increment loop with adjacent counters and measure throughput.
- Add padding to separate the counters and measure the improvement.
- Use `alignas(64)` and verify the memory addresses differ by 64 bytes.
Solutions to the Homework/Exercises
- Throughput will be low and HITM counts high.
- Throughput will improve significantly after padding.
- Addresses should differ by at least 64 bytes, preventing sharing.
2.2 Profiling False Sharing with perf c2c and HITM Events
Fundamentals
perf c2c (cache-to-cache) profiling captures cache-line sharing events across cores. It reports lines that are frequently transferred between cores, often labeled with HITM (Hit Modified). High HITM rates on a line are a strong signal of false sharing. Using perf c2c record and perf c2c report, you can identify the hot line, map it to the source code, and confirm that a padding fix eliminates the contention. This workflow turns a vague performance problem into a precise, reproducible diagnosis.
Deep Dive into the Concept
Performance counters in modern CPUs track cache events, including cache-to-cache transfers. When one core accesses a line that another core has in Modified state, the request is serviced by the other core’s cache (HITM). This indicates a coherence transfer and usually implies writes are bouncing the line. perf c2c builds on these counters to provide a report of cache-line addresses, the number of HITM events, and the CPUs involved.
Using perf c2c requires hardware support and kernel configuration. You typically run perf c2c record -- <command> to collect events, then perf c2c report to analyze. The report includes columns for remote HITMs, local HITMs, and a list of CPUs sharing the line. If you see a line with a high HITM percentage and multiple CPUs listed, that line is a prime candidate for false sharing. You can then map the address to a symbol or a struct field using addr2line or by printing the address of your data structure in the program.
A subtle but important aspect is that not all HITM events are bad. Some sharing is legitimate, such as a producer-consumer queue. The key is to correlate HITM hotspots with the code’s intent. For this project, you deliberately create false sharing so you can see the signature clearly. Then you apply padding and rerun. A successful fix will dramatically reduce HITM events and increase throughput. This is a powerful demonstration of how performance tooling can guide code changes.
Another detail: NUMA effects can amplify the cost of HITM events, because coherence traffic traverses the interconnect. Therefore, you should run the benchmark with threads pinned to different nodes to make the penalty visible. You can even compare intra-socket vs inter-socket false sharing, which is a valuable real-world lesson.
Finally, perf c2c output is often noisy. You should collect enough samples, repeat runs, and report averages. Your tool can integrate a “profiler helper” script that wraps the perf c2c commands with consistent flags and stores results alongside the benchmark output for comparison.
How This Fits Into the Project
This concept defines the diagnostic workflow: record, report, interpret, fix. It transforms the benchmark into a repeatable performance investigation.
Definitions & Key Terms
- perf c2c -> perf mode for cache-to-cache transfer analysis.
- HITM -> Hit Modified; access to a line owned by another core.
- Hot cache line -> Cache line with unusually high coherence traffic.
- addr2line -> Tool for mapping addresses to source lines.
Mental Model Diagram (ASCII)
perf c2c report (simplified):
```
Address     HITM%   CPUs
0x7f8c...   92%     0,1,2,3,8,9,10,11
```
High HITM% = likely false sharing
How It Works (Step-by-Step)
- Run the benchmark under `perf c2c record`.
- Generate a report with `perf c2c report`.
- Identify the cache line with the highest HITM.
- Map the address to your data structure.
- Apply padding and rerun to verify reduction.
Invariants: High HITM indicates shared modified lines.
Failure modes: Missing hardware support, insufficient sampling, symbol resolution issues.
Minimal Concrete Example
```sh
perf c2c record -- ./false_sharing_bench --threads=8 --padding=0
perf c2c report --stdio | head -n 20
```
Common Misconceptions
- “perf c2c is always available” -> Some CPUs or kernels lack support.
- “Any HITM is bad” -> Some shared lines are legitimate.
- “One run is enough” -> Noise requires multiple runs for confidence.
Check-Your-Understanding Questions
- What does a high HITM percentage indicate?
- Why should you map hot addresses back to source code?
- How do you confirm a padding fix worked?
Check-Your-Understanding Answers
- A line is frequently transferred between cores in Modified state.
- To identify the specific variable or struct field causing sharing.
- HITM rates drop and throughput increases after padding.
Real-World Applications
- Diagnosing performance regressions in multi-threaded services.
- Optimizing counters and metrics systems.
Where You’ll Apply It
- In this project: Sec. 3.7, Sec. 7.2.
- Also used in: P06-numa-aware-data-structure-library, P09-numa-aware-thread-pool.
References
- “The Art of Multiprocessor Programming” – Ch. 3-4
- “Computer Architecture” – Ch. 5
Key Insights
perf c2c turns a hidden cache-line problem into a visible, fixable hotspot.
Summary
perf c2c exposes cache-line sharing behavior by reporting HITM events. By correlating the hottest line to your data structures, you can fix false sharing with padding and verify improvements quantitatively.
Homework/Exercises to Practice the Concept
- Run the benchmark with padding=0 and collect a c2c report.
- Apply padding and compare HITM rates.
- Run with threads on different sockets and compare HITM cost.
Solutions to the Homework/Exercises
- The report should show a hot line with high HITM and many CPUs.
- HITM rates should drop dramatically after padding.
- Cross-socket sharing should show higher latency and lower throughput.
3. Project Specification
3.1 What You Will Build
A CLI tool false-sharing-bench that generates false sharing with multi-threaded counters, profiles the run with perf c2c, and demonstrates the fix using cache-line padding.
Included: benchmark, padding toggle, profiling workflow instructions, deterministic output. Excluded: full profiler GUI or kernel modifications.
3.2 Functional Requirements
- Launch N threads that repeatedly update per-thread counters.
- Provide a `--padding` flag to control cache-line padding.
- Pin threads to specific CPUs or nodes.
- Print throughput and timing results.
- Provide a `--profile` mode that emits `perf c2c` instructions.
- Support a deterministic seed for any randomized workload.
3.3 Non-Functional Requirements
- Performance: Throughput results stable across runs.
- Reliability: Works on systems without perf c2c (with warning).
- Usability: Clear output and instructions for profiling.
3.4 Example Usage / Output
```
$ ./false-sharing-bench --threads=8 --padding=0
Throughput: 120 M ops/sec
$ ./false-sharing-bench --threads=8 --padding=64
Throughput: 680 M ops/sec
```
3.5 Data Formats / Schemas / Protocols
JSON output:
```json
{
  "threads": 8,
  "padding": 64,
  "throughput_mops": 680,
  "duration_ms": 1000
}
```
3.6 Edge Cases
- perf c2c unavailable (report warning).
- Padding smaller than cache line.
- Thread count exceeds CPU count.
3.7 Real World Outcome
You can demonstrate false sharing and eliminate it, backed by perf c2c evidence and throughput gains.
3.7.1 How to Run (Copy/Paste)
```sh
cc -O2 -pthread -o false-sharing-bench src/main.c
./false-sharing-bench --threads=8 --padding=0
```
3.7.2 Golden Path Demo (Deterministic)
```
$ ./false-sharing-bench --threads=8 --padding=0 --duration=1000
Throughput: 120 M ops/sec
$ ./false-sharing-bench --threads=8 --padding=64 --duration=1000
Throughput: 680 M ops/sec
```
3.7.3 If CLI: Exact Terminal Transcript
```
$ perf c2c record -- ./false-sharing-bench --threads=8 --padding=0
$ perf c2c report --stdio | head -n 5
0x7f8c12345000  HITM: 92%  CPUs: 0,1,2,3,8,9,10,11
```
3.7.4 Failure Demo (Deterministic)
```
$ ./false-sharing-bench --threads=999
ERROR: thread count exceeds CPU count (max=16)
EXIT: 1
```
Exit Codes:
- 0: success
- 1: invalid arguments
- 2: perf unavailable
- 3: runtime error
4. Solution Architecture
4.1 High-Level Design
```
+--------------+   +--------------+   +------------------+
| Thread Setup |-->| Counter Loop |-->| Report Renderer  |
+--------------+   +--------------+   +------------------+
        ^                  |
        +----- perf c2c ---+
```
4.2 Key Components
| Component | Responsibility | Key Decisions |
|---|---|---|
| Thread Manager | Spawn threads, set affinity | Static pinning |
| Counter Array | Shared counters with padding | Cache-line-sized elements |
| Reporter | Measure throughput | Time-based loop |
| Profiler Helper | Guidance for perf c2c | Optional mode |
4.3 Data Structures (No Full Code)
```c
typedef struct {
    volatile uint64_t counter;
    char pad[64 - sizeof(uint64_t)];
} PaddedCounter;
```
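A compile-time size check (sketch) guards against the "padding too small" failure mode noted in Sec. 2.1; the 64-byte figure is an assumption about the target's line size:

```c
#include <stdint.h>

typedef struct {
    volatile uint64_t counter;
    char pad[64 - sizeof(uint64_t)];
} PaddedCounter;

/* Build fails if a field reorder or a padding-arithmetic change ever
   lets the struct drop below one full cache line. */
_Static_assert(sizeof(PaddedCounter) == 64,
               "PaddedCounter must occupy exactly one cache line");
```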
4.4 Algorithm Overview
Key Algorithm: False Sharing Benchmark
- Allocate counter array.
- Spawn threads pinned to CPUs.
- Each thread increments its counter in a tight loop.
- Measure operations per second.
- Compare results with padding on/off.
Complexity Analysis:
- Time: O(ops)
- Space: O(threads)
5. Implementation Guide
5.1 Development Environment Setup
```sh
sudo apt-get install -y build-essential linux-tools-common linux-tools-$(uname -r)
```
5.2 Project Structure
```
false-sharing-bench/
|-- src/
|   |-- main.c
|   |-- affinity.c
|   +-- report.c
|-- scripts/
|   +-- perf_c2c.sh
|-- Makefile
+-- README.md
```
5.3 The Core Question You’re Answering
“How do I detect and eliminate false sharing on real hardware?”
5.4 Concepts You Must Understand First
- Cache lines and coherence.
- False sharing and padding.
- perf c2c profiling.
5.5 Questions to Guide Your Design
- How will you guarantee that counters share a cache line?
- How will you choose padding size?
- How will you quantify improvement?
5.6 Thinking Exercise
Sketch an array of 8 counters and mark which counters share cache lines. Where does false sharing happen?
5.7 The Interview Questions They’ll Ask
- “What is false sharing and how do you fix it?”
- “What is HITM and why does it matter?”
- “How would you detect false sharing without perf c2c?”
5.8 Hints in Layers
Hint 1: Start with adjacent counters in a struct array.
Hint 2: Pin threads and run without padding.
Hint 3: Add 64-byte padding and compare throughput.
Hint 4: Confirm with perf c2c report.
5.9 Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Coherence and sharing | “The Art of Multiprocessor Programming” | Ch. 3-4 |
| Cache hierarchy | “Computer Architecture” | Ch. 5 |
5.10 Implementation Phases
Phase 1: Benchmark Core (2-3 days)
Goals: implement counter loop and throughput measurement.
Tasks:
- Build thread creation and synchronization.
- Measure ops/sec.
Checkpoint: baseline throughput reported.
Phase 2: Padding + Affinity (2-3 days)
Goals: add padding and thread pinning.
Tasks:
- Implement padded counters.
- Pin threads to specific CPUs.
Checkpoint: throughput increases with padding.
Phase 3: Profiling Workflow (2 days)
Goals: integrate perf c2c guidance.
Tasks:
- Add `--profile` output instructions.
- Provide sample report parsing.
Checkpoint: perf c2c report shows HITM reduction.
5.11 Key Implementation Decisions
| Decision | Options | Recommendation | Rationale |
|---|---|---|---|
| Padding size | 64 vs 128 bytes | 64 bytes | Matches the cache-line size |
| Timing | Fixed ops vs time-based | Time-based | Stable throughput measurement |
| Pinning | None vs affinity | Affinity | Reduces run-to-run noise |
6. Testing Strategy
6.1 Test Categories
| Category | Purpose | Examples |
|---|---|---|
| Unit Tests | Validate padding size | Struct size check |
| Integration Tests | Verify throughput improvement | Padding vs no padding |
| Edge Case Tests | Invalid thread count | Exit code 1 |
6.2 Critical Test Cases
- Padding yields 3-5x throughput improvement.
- perf c2c report shows HITM reduction.
- Thread pinning reduces variance.
6.3 Test Data
```
threads=8, duration=1000 ms
expected_throughput_no_padding < 200 M ops/sec
expected_throughput_padding    > 600 M ops/sec
```
7. Common Pitfalls & Debugging
7.1 Frequent Mistakes
| Pitfall | Symptom | Solution |
|---|---|---|
| Padding too small | No improvement | Use 64-byte alignment |
| No pinning | High variance | Set CPU affinity |
| perf missing | No c2c output | Install linux-tools |
7.2 Debugging Strategies
- Use `perf stat` to confirm cache-miss rates.
- Print the addresses of counters to confirm spacing.
7.3 Performance Traps
- Hyper-threading can hide effects; pin to physical cores when possible.
8. Extensions & Challenges
8.1 Beginner Extensions
- Add `--csv` output.
- Add per-thread throughput stats.
8.2 Intermediate Extensions
- Compare intra-socket vs inter-socket false sharing.
- Add padding auto-detection based on cache line size.
8.3 Advanced Extensions
- Integrate hardware counter collection for HITM within the tool.
9. Real-World Connections
9.1 Industry Applications
- Tuning telemetry counters in high-frequency trading systems.
- Optimizing server performance dashboards.
9.2 Related Open Source Projects
- perf – Linux performance tooling.
- folly – uses padded counters to avoid false sharing.
9.3 Interview Relevance
- Identifying false sharing as a root cause of scaling regressions.
10. Resources
10.1 Essential Reading
- “The Art of Multiprocessor Programming” – Ch. 3-4
- “Computer Architecture” – Ch. 5
10.2 Video Resources
- Cache coherence and false sharing talks.
10.3 Tools & Documentation
- perf c2c – cache line sharing profiler.
10.4 Related Projects in This Series
- P06-numa-aware-data-structure-library
- P09-numa-aware-thread-pool
11. Self-Assessment Checklist
11.1 Understanding
- I can explain coherence and false sharing.
- I can interpret a perf c2c report.
11.2 Implementation
- Padding fix improves throughput.
- perf c2c report shows HITM reduction.
11.3 Growth
- I can explain this workflow in an interview.
12. Submission / Completion Criteria
Minimum Viable Completion:
- Benchmark runs with padding toggle.
- Throughput reported for both modes.
Full Completion:
- perf c2c workflow documented and verified.
- Deterministic output with fixed duration.
Excellence (Going Above & Beyond):
- Automated analysis of perf output within the tool.