Project 6: Memory Disambiguation Probe
Build a microbenchmark that reveals memory aliasing stalls and disambiguation behavior.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Level 4: Expert |
| Time Estimate | 1-2 weeks |
| Main Programming Language | C |
| Alternative Programming Languages | Assembly |
| Coolness Level | Level 4: Hardcore Tech Flex |
| Business Potential | 1. The “Resume Gold” |
| Prerequisites | C, pointers, cache basics, RDTSC timing |
| Key Topics | memory disambiguation, aliasing, store-to-load forwarding, load/store buffers |
1. Learning Objectives
By completing this project, you will:
- Explain why CPUs speculate on load/store ordering.
- Measure 4K aliasing and disambiguation penalties.
- Distinguish store-to-load forwarding from true dependencies.
- Build a stall map over address offsets.
- Produce a reproducible report of aliasing behavior.
2. All Theory Needed (Per-Concept Breakdown)
2.1 Memory Disambiguation and 4K Aliasing
Fundamentals
Memory disambiguation is the CPU’s process of deciding whether a load depends on an earlier store when the addresses are not yet known. The CPU often speculates that a load is independent to keep the pipeline moving, but if the speculation is wrong, it must stall or replay the load. A classic hazard is 4K aliasing: when a store and a later load have the same lower 12 bits of the address, the CPU cannot easily disambiguate, so it may stall or enforce ordering. This creates measurable latency spikes that your benchmark can detect.
Before going deeper, anchor the concept in the simplest mental model and a single unit of measurement: cycles per store-load pair. Identify what changes state (the store), what observes that state (the load), and which constraint is non-negotiable (a load must never return stale data). This keeps the concept grounded before moving to deeper microarchitectural details.
Deep Dive into the concept
Loads and stores are issued out-of-order by modern CPUs to hide memory latency. To preserve correctness, the CPU must ensure that a load does not read stale data that should have been written by an earlier store. If the store address is unknown or partially known, the CPU has to decide whether to allow the load to proceed speculatively. This decision is made by the memory disambiguation logic, which uses partial address information (often the page offset, lower 12 bits) to guess whether the load might alias a store.
The 4K aliasing effect arises because the lower 12 bits correspond to the offset within a page. Many CPUs compare only these bits early in the pipeline. If a load and a store share the same lower 12 bits, the CPU treats them as potentially aliasing until full address resolution. This can trigger a stall or a replay, which appears as a latency spike. By scanning a range of offsets between a store and a load, you can observe that the latency is low for most offsets and spikes at 4K boundaries. This is a strong signal of the disambiguation policy.
Store-to-load forwarding (STLF) is the case where a load reads a value from a previous store that has not yet reached the cache. The CPU forwards the store data directly from the store buffer to the load. This is a fast path, but it is only possible when addresses match exactly and data size/alignment conditions are satisfied. Partial overlaps or misaligned stores can prevent forwarding and cause stalls. In your benchmark, you can create a store followed by a load at various offsets and sizes to see when STLF occurs and when it fails.
The memory subsystem maintains structures like the store buffer, load buffer, and memory order buffer. The store buffer holds pending stores, while the load buffer tracks in-flight loads. The disambiguation logic checks these buffers to ensure ordering. When it cannot resolve dependencies early, it may serialize the load, which reduces out-of-order benefits. Your benchmark reveals these costs by measuring cycles per iteration for different address offsets.
A key experimental challenge is avoiding confounding effects from caches and prefetchers. You want to measure aliasing, not cache misses. Therefore, keep data within L1 and use repeated accesses to warm caches. Use randomization to avoid prefetcher patterns. The observed latency spikes should align with 4K offsets if disambiguation is the cause. If you see broader or irregular spikes, you may be hitting other effects such as TLB misses or page conflicts.
Additional deep dive considerations: in real designs, disambiguation and 4K aliasing are rarely isolated; they interact with pipeline depth, power management, compiler decisions, and even microcode updates. When you study this behavior, vary one knob at a time and hold everything else constant: pin the core, fix the frequency if possible, warm up caches and predictors, and record the exact compiler flags. Vendor manuals describe typical behavior, but the actual thresholds can shift across steppings or microcode revisions, so empirical measurement is the ground truth. If your results disagree with published numbers, investigate confounders such as alignment, instruction form, address mapping, or hidden dependencies introduced by the compiler. Compilers and JITs implicitly target this behavior through instruction selection, scheduling, and unrolling, so translate your measurements into actionable rules of thumb. Finally, validate with at least two workloads: a synthetic microbenchmark and a slightly more realistic kernel. If both show the same trend, the effect is not an artifact of the test harness.
How this fits into the project
You will use this concept to design the offset scanning harness in §3.2 and to interpret the stall map in §3.7.
Definitions & key terms
- disambiguation -> deciding whether a load depends on an earlier store
- 4K aliasing -> false dependency due to matching lower 12 address bits
- store buffer -> structure holding pending stores
- load buffer -> structure tracking in-flight loads
Mental model diagram (ASCII)
Store addr: [page | offset]
Load addr: [page | offset]
Same offset -> possible alias -> stall/replay
How it works (step-by-step, with invariants and failure modes)
- Issue a store, then a load with varying offset.
- CPU compares partial address bits.
- If alias suspected, stall or replay load.
- Measure cycles and plot offset vs latency.
Invariants:
- Correctness requires loads not bypass true dependencies.
- Speculation must be corrected if wrong.
Failure modes:
- Cache misses mask aliasing effects.
- Prefetching reduces measured stalls.
Minimal concrete example
*store_ptr = value;
uint64_t x = *load_ptr; // offset varies
Common misconceptions
- “If addresses are different, no stall” -> partial aliasing can still stall.
- “STLF always works” -> misalignment or size mismatch can break it.
Check-your-understanding questions
- Why does the CPU use only partial address bits early?
- What is 4K aliasing and why 4K specifically?
- When does store-to-load forwarding fail?
Check-your-understanding answers
- The full address may not be computed yet early in the pipeline, so the CPU speculates with the bits it already has rather than stalling every load.
- 4K is page size, so lower 12 bits are the page offset.
- If addresses do not fully match or alignment is incompatible.
Real-world applications
- Memory-intensive code performance tuning
- Understanding stalls in database or HPC workloads
Where you’ll apply it
- In this project: see §3.2 Functional Requirements and §5.10 Phase 2.
- Also used in: P07-the-reorder-buffer-rob-boundary-finder.md.
References
- “Computer Architecture: A Quantitative Approach” by Hennessy and Patterson
- Intel Optimization Manual, memory ordering section
Key insights
- Partial address comparisons create false dependencies that show up as periodic stalls.
Summary
Memory disambiguation balances correctness and speed. 4K aliasing is the signature of this compromise.
Homework/Exercises to practice the concept
- Predict where stalls occur if the offset increments by 64 bytes.
- Explain how STLF can hide true dependencies.
Solutions to the homework/exercises
- Stalls should appear at offsets that are multiples of 4096 bytes, i.e. every 64th step when incrementing by 64 bytes.
- STLF forwards data and avoids waiting for cache writeback.
2.2 Store-to-Load Forwarding and Load/Store Buffers
Fundamentals
Store-to-load forwarding allows a load to read a value from a previous store that has not yet committed to the cache. This prevents unnecessary stalls and is essential for performance. The load and store buffers track in-flight memory operations and make forwarding possible. However, forwarding has strict requirements: addresses must match, and size/alignment must be compatible. If these conditions fail, the CPU may stall the load until the store completes.
Before going deeper, fix the simplest mental model: a store parks its data in a buffer, and a later load either receives that data via forwarding or waits for the store to drain. The unit of measurement is again cycles per store-load pair; the non-negotiable constraint is that the load must see the newest older store to its address.
Deep Dive into the concept
The store buffer is a queue of pending stores that have not yet been committed to the cache hierarchy. The load buffer tracks in-flight loads. When a load is issued, the CPU checks the store buffer to see if any older stores might target the same address. If it finds a matching store, it can forward the data directly, bypassing cache. This is faster than waiting for the store to complete and avoids pipeline stalls.
Forwarding is not always possible. The CPU may only know part of the store address at the time of the load. It may detect a partial match (same 4K offset) and stall until full address resolution. Even when addresses match, alignment and size matter. For example, a 4-byte load from an address that overlaps a previous 8-byte store can be forwarded, but if the load is unaligned or spans multiple store entries, forwarding may fail or be split. Some CPUs handle partial forwarding; others force a stall and replay.
Load and store buffers are limited in size. If too many memory operations are in flight, the buffers can fill, causing stalls. This is another reason memory-heavy loops can bottleneck even if caches are fast. In this project, your microbenchmark is small, so buffer capacity is not the primary limit, but the same mechanisms are at play.
To observe forwarding, you can set up a loop that stores to an address and then loads from the same address with different sizes and alignments. The cases where forwarding succeeds will show low latency; cases where it fails will show spikes. By varying the offset and alignment, you can map the forwarding rules of your CPU. This is especially relevant for systems programming and compiler optimization, where alignment decisions can have large performance impacts.
Additional deep dive considerations: forwarding behavior also interacts with pipeline depth, power management, compiler scheduling, and microcode revisions, so the methodology from §2.1 applies here too. Vary one knob at a time, pin the core, warm up caches and predictors, and record the exact compiler flags; treat empirical measurement as the ground truth when vendor manuals disagree. Translate your measurements into actionable rules of thumb, such as preferred alignments and access widths, and validate with both a synthetic microbenchmark and a slightly more realistic kernel before trusting a trend.
Supplemental note: a practical way to validate your mental model is a tiny A/B experiment that changes only one variable related to forwarding (say, the load width) and keeps all others fixed. Run it several times, record the median, and look for monotonic trends rather than a single magic number. If the trend is unstable, check for hidden dependencies, compiler reordering, or OS activity. These low-level details often shape high-level guidelines such as alignment requirements, preferred loop forms, or safe fallback paths; documenting them in your report turns raw measurements into reusable engineering rules.
How this fits into the project
You will use this concept to interpret your offset-vs-latency plot in §3.7 and to design targeted tests in §6.2.
Definitions & key terms
- store buffer -> queue holding pending stores
- load buffer -> queue tracking in-flight loads
- forwarding -> direct data bypass from store buffer to load
- replay -> re-issuing a load after a dependency was mispredicted
Mental model diagram (ASCII)
Store -> Store Buffer -> (forward) -> Load
\-> Cache (later)
How it works (step-by-step, with invariants and failure modes)
- Issue store; it enters store buffer.
- Issue load; check store buffer for match.
- If match and aligned, forward data to load.
- If uncertain, stall or replay.
Invariants:
- Loads must see the most recent older store.
- Forwarding must preserve correctness.
Failure modes:
- Partial overlap causes forward failure.
- Buffer overflow stalls loads.
Minimal concrete example
*(uint32_t*)p = 0xdeadbeef;
uint32_t x = *(uint32_t*)p; // should forward
Common misconceptions
- “Forwarding is always free” -> forwarding logic can add latency.
- “Alignment doesn’t matter” -> alignment affects forward eligibility.
Check-your-understanding questions
- Why does a load need to check the store buffer?
- What can cause a load to replay?
- Why might a misaligned load fail to forward?
Check-your-understanding answers
- To ensure it does not bypass a prior store.
- A late-discovered dependency or aliasing.
- The CPU cannot assemble the correct bytes efficiently.
Real-world applications
- Compiler alignment optimizations
- Debugging memory stalls in systems code
Where you’ll apply it
- In this project: see §3.7 Real World Outcome and §7.1 Frequent Mistakes.
- Also used in: P09-l1-bandwidth-stressor-zen-5-focus.md.
References
- Intel Optimization Manual, memory ordering
- AMD Software Optimization Guide
Key insights
- Forwarding is fast but fragile; alignment and size determine success.
Summary
Store-to-load forwarding is a critical optimization. Your benchmark makes its success and failure visible as latency spikes.
Homework/Exercises to practice the concept
- Test 4-byte vs 8-byte loads after a store and note latency differences.
- Align and misalign the store address and compare results.
Solutions to the homework/exercises
- Matching sizes (or a load fully contained within the store) usually forward; a load wider than the store typically stalls.
- Misalignment often breaks forwarding and increases latency.
2.3 Store Buffers, Store-to-Load Forwarding, and Alias Prediction
Fundamentals
Modern CPUs let loads execute before older stores when they predict the addresses do not alias. This speculation is handled by the store buffer and load queue. If the prediction is correct, the load can complete early, improving performance. If incorrect, the CPU must squash the load and re-execute it, costing cycles. Store-to-load forwarding is the fast path: if a load reads from an address recently written by a store in the buffer, the CPU can forward the value without waiting for the store to reach cache. Your memory disambiguation probe is essentially a measurement of how well these mechanisms work and where they fail.
Deep Dive into the concept
The store buffer holds pending stores that have not yet been committed to the cache or memory. This allows the CPU to retire stores quickly while deferring the actual write. Loads that are younger than those stores face a question: do they depend on any of the pending stores? If the CPU waited for every store to drain, it would lose much of its out-of-order advantage. Instead, it predicts. The first phase is an address-availability check: if the store address is not yet known (because address generation is still in flight), the CPU may conservatively stall the load, or it may use a partial-address predictor that guesses whether aliasing is likely.
Many CPUs implement a partial-address check using the low bits of the address (for example, 12 bits for the page offset). If a load and a store share the same low bits, they might alias, so the CPU may delay the load or mark it as speculative. This is why aliasing at page offsets is a key stress case: you can create two arrays separated by 4 KB or 64 KB so they share low bits but not the full address. When the predictor sees matching low bits, it may falsely assume a dependency and delay the load, causing a performance cliff even though the addresses differ. Your probe should intentionally sweep address offsets to find these cliffs.
Store-to-load forwarding is the fast path for true dependencies. If a load reads the same address as a store that is still in the buffer, the CPU can forward the store data directly to the load, bypassing the cache. But forwarding has constraints: size must match, alignment must be compatible, and the store must be the most recent writer. Partial overlap can prevent forwarding and force the load to wait until the store drains. This is why mismatched sizes (like a 4-byte store followed by an 8-byte load) are valuable test cases. A good probe includes multiple access widths and alignment offsets to reveal forwarding rules.
When prediction is wrong, the CPU has to “replay” the load. The load executes speculatively, reads stale data, and then later the store address resolves and the CPU detects the alias. It squashes the load, reissues it, and any dependent operations are re-executed. This can create large latency spikes. Some CPUs expose counters for load replays or memory-ordering violations; if available, they are excellent validation tools. If not, you can infer replays from sudden jumps in cycle counts when you introduce aliasing patterns.
A practical experiment uses two pointer streams: one that produces stores and another that produces loads. By varying the distance between them and the alignment of their addresses, you can observe three regimes: no aliasing and high throughput; false aliasing where throughput drops due to conservative prediction; and true aliasing where throughput remains stable because forwarding kicks in. The interesting region is false aliasing: it reveals the predictor’s limitations and can explain real-world performance cliffs in code that uses multiple arrays with power-of-two strides.
How this fits into the project
You will use this in §3.6 Edge Cases to define aliasing stress cases, in §5.10 Phase 2 to build the access patterns, and in §7.3 to debug unexpected stalls.
Definitions & key terms
- store buffer -> queue of pending stores waiting to commit to cache
- store-to-load forwarding -> delivering store data directly to a dependent load
- alias prediction -> heuristic that guesses if a load depends on an older store
- replay -> re-execution of a load after detecting mis-speculation
- partial-address check -> comparing low address bits to detect potential aliasing
Mental model diagram (ASCII)
Store Buffer: [S1 addr=?, data] [S2 addr=0x1000]
Load Q: [L1 addr=0x9000] -> may alias if low bits match
How it works (step-by-step, with invariants and failure modes)
- Issue stores into the store buffer and loads into the load queue.
- If a load address is known, compare with pending stores.
- If match, forward data; if uncertain, speculate.
- If later a store resolves and aliases, replay the load.
- Measure cycles and identify aliasing cliffs.
Invariants:
- A load must see the most recent older store to the same address.
- Replay occurs if a speculative load read was incorrect.
Failure modes:
- False aliasing causes unnecessary stalls.
- Misaligned accesses disable forwarding and increase latency.
Minimal concrete example
// Potential false alias: arrays 4 KB apart share the low 12 address bits
int *a = base;         // assume base is 4 KB-aligned
int *b = base + 1024;  // 1024 ints * 4 bytes = 4 KB later
a[i] = v;              // store
x = b[i];              // load: same page offset, different page
Common misconceptions
- “Alias prediction only matters for unsafe code” -> it affects all loads/stores.
- “Forwarding always works” -> size and alignment can block forwarding.
- “Replays are rare” -> they can be common in pointer-heavy code.
Check-your-understanding questions
- Why do page-offset collisions cause false aliasing?
- When does store-to-load forwarding fail even for the same address?
- What evidence would suggest load replays are occurring?
Check-your-understanding answers
- The predictor often compares only low address bits, which match across pages.
- When access sizes or alignments differ, forwarding cannot supply correct data.
- Sudden spikes in cycles and increased replay-related counters.
Real-world applications
- Optimizing array layouts in high-performance computing
- Understanding performance cliffs in database or analytics kernels
- Designing microbenchmarks for memory-ordering behavior
Where you’ll apply it
- In this project: see §3.6 Edge Cases and §5.10 Phase 2.
- Also used in: P07-the-reorder-buffer-rob-boundary-finder.md, P09-l1-bandwidth-stressor-zen-5-focus.md.
References
- “Memory Systems: Cache, DRAM, Disk” by Jacob, Ng, and Wang, Ch. 8
- “Computer Architecture: A Quantitative Approach” by Hennessy and Patterson, Ch. 3
Key insights
- Disambiguation is prediction; your probe reveals where that prediction fails.
Summary
Store buffers and disambiguation predictors let loads run ahead, but they can also create stalls and replays when aliasing is mispredicted. Measuring those boundaries explains many real-world slowdowns.
Homework/Exercises to practice the concept
- Create two arrays offset by 4 KB and measure throughput vs 2 KB offset.
- Test store->load with mismatched sizes and observe the forwarding penalty.
Solutions to the homework/exercises
- The 4 KB offset often shows a throughput drop due to false aliasing.
- Mismatched sizes usually disable forwarding and increase latency.
3. Project Specification
3.1 What You Will Build
A microbenchmark that scans address offsets between a store and a subsequent load, measuring cycles for each offset. The output is a stall map showing where disambiguation and forwarding fail, including 4K aliasing spikes.
3.2 Functional Requirements
- Offset Scanner: Iterate over offsets from 0 to at least 8192 bytes.
- Timing Harness: Measure cycles per store-load pair.
- Warmup and Pinning: Stabilize caches and run on a single core.
- Report Generator: Output offset vs latency.
3.3 Non-Functional Requirements
- Performance: Complete a full scan in under 2 seconds.
- Reliability: Repeatable spike patterns across trials.
- Usability: CLI flags for offset range and stride.
3.4 Example Usage / Output
$ ./alias_probe --max-offset 8192 --stride 64
offset,cycles
0,4
64,4
4096,18 <-- aliasing spike
3.5 Data Formats / Schemas / Protocols
CSV output:
offset,cycles
0,4
64,4
4096,18
3.6 Edge Cases
- Stride larger than offset range
- Cache miss interference from large buffers
- Prefetcher hiding stalls
3.7 Real World Outcome
You will produce a graph showing periodic latency spikes at 4K boundaries and note how alignment affects them.
3.7.1 How to Run (Copy/Paste)
cc -O2 -Wall -o alias_probe src/alias_probe.c src/timing.c
taskset -c 2 ./alias_probe --max-offset 8192 --stride 64 --trials 5
3.7.2 Golden Path Demo (Deterministic)
- Use fixed stride 64 and 5 trials.
- Expect spikes at offsets 4096 and 8192.
3.7.3 If CLI: Exact Terminal Transcript
$ taskset -c 2 ./alias_probe --max-offset 8192 --stride 64
offset,cycles
0,4
64,4
4096,18
8192,17
$ echo $?
0
Failure demo (bad stride):
$ ./alias_probe --stride 0
Error: stride must be >= 1
$ echo $?
3
Exit codes:
| Code | Meaning |
|---|---|
| 0 | success |
| 2 | timing init error |
| 3 | invalid argument |
4. Solution Architecture
4.1 High-Level Design
+---------------+ +------------------+ +-----------------+
| Offset Scanner|-> | Timing Harness |-> | Report Builder |
+---------------+ +------------------+ +-----------------+
4.2 Key Components
| Component | Responsibility | Key Decisions |
|---|---|---|
| Scanner | Generate offsets | Use 64-byte stride |
| Harness | Measure latency | RDTSCP + fences |
| Reporter | CSV output | easy plotting |
4.3 Data Structures (No Full Code)
struct Sample { int offset; double cycles; };
4.4 Algorithm Overview
Key Algorithm: Alias Scan
- Warm up memory.
- For each offset, perform store + load N times.
- Record median cycles.
Complexity Analysis:
- Time: O(O * N), where O is the number of offsets and N is the number of trials per offset
- Space: O(O)
5. Implementation Guide
5.1 Development Environment Setup
cc --version
5.2 Project Structure
alias-probe/
├── src/
│ ├── alias_probe.c
│ └── timing.c
└── README.md
5.3 The Core Question You’re Answering
“When does my CPU stall because it cannot disambiguate memory?”
The answer is the aliasing spike map.
5.4 Concepts You Must Understand First
- 4K aliasing and disambiguation
- Store-to-load forwarding rules
- Cache warmup and timing
5.5 Questions to Guide Your Design
- How will you keep data in L1 to avoid cache misses?
- How will you choose a stride that reveals 4K aliasing?
- How will you detect forwarding success vs failure?
5.6 Thinking Exercise
If you shift the base address by 32 bytes, how will the aliasing spikes move? Explain why.
5.7 The Interview Questions They’ll Ask
- What is 4K aliasing and how does it affect performance?
- What is store-to-load forwarding?
- Why do CPUs speculate on memory ordering?
5.8 Hints in Layers
Hint 1: Use a small buffer to keep everything in L1.
Hint 2: Measure median of multiple trials per offset.
Hint 3: Plot the results to see spikes clearly.
5.9 Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Memory ordering | “Computer Architecture” | Ch. 3 |
| Cache effects | “Computer Systems” | Ch. 6 |
5.10 Implementation Phases
Phase 1: Foundation (2-3 days)
- Implement timing harness and store/load loop.
- Checkpoint: stable cycles for offset 0.
Phase 2: Core Functionality (4-6 days)
- Add offset scanning and CSV output.
- Checkpoint: spike at 4096 observed.
Phase 3: Analysis (2-3 days)
- Analyze STLF success and failure cases.
- Checkpoint: report includes aliasing explanation.
5.11 Key Implementation Decisions
| Decision | Options | Recommendation | Rationale |
|---|---|---|---|
| Stride | 64 vs 128 | 64 | matches cache line |
| Statistic | mean vs median | median | robust to noise |
6. Testing Strategy
6.1 Test Categories
| Category | Purpose | Examples |
|---|---|---|
| Unit Tests | Timing correctness | cache hit vs miss |
| Integration Tests | Full scan | offsets 0-8192 |
| Edge Tests | Invalid stride | stride 0 |
6.2 Critical Test Cases
- Offset 0 should show low latency (forwarding success).
- Offset 4096 should show spike (alias).
- Misaligned store should show higher latency.
6.3 Test Data
offsets: 0, 64, 4096
7. Common Pitfalls & Debugging
7.1 Frequent Mistakes
| Pitfall | Symptom | Solution |
|---|---|---|
| Prefetcher interference | no spikes | randomize access |
| Cache misses | high baseline | reduce buffer size |
| Not pinning core | noisy results | use taskset |
7.2 Debugging Strategies
- Print timing histogram for a single offset.
- Use perf to verify load/store counts.
7.3 Performance Traps
- Large buffers spill into L2/L3 and hide aliasing behavior.
8. Extensions & Challenges
8.1 Beginner Extensions
- Add a visualization script.
8.2 Intermediate Extensions
- Compare different load sizes (1, 4, 8 bytes).
8.3 Advanced Extensions
- Measure aliasing differences across cores or CPUs.
9. Real-World Connections
9.1 Industry Applications
- Debugging memory stalls in databases and runtimes
- Compiler alignment and alias analysis
9.2 Related Open Source Projects
- uops.info: microbenchmarks
- lmbench: memory latency tools
9.3 Interview Relevance
- Memory disambiguation is a deep performance topic.
10. Resources
10.1 Essential Reading
- “Computer Architecture: A Quantitative Approach”
- Intel Optimization Manual
10.2 Video Resources
- “Memory Ordering in CPUs” lecture
10.3 Tools & Documentation
- perf: event counters
- rdtsc: timing
10.4 Related Projects in This Series
- Next: P07-the-reorder-buffer-rob-boundary-finder.md
- Also: P03-speculative-side-channel-explorer-spectre-lite.md
11. Self-Assessment Checklist
11.1 Understanding
- I can explain 4K aliasing.
- I can describe STLF success conditions.
- I can interpret the stall map.
11.2 Implementation
- The scan completes with stable spikes.
- Output is deterministic and plotted.
- Results are documented with CPU model.
11.3 Growth
- I can explain memory disambiguation in an interview.
12. Submission / Completion Criteria
Minimum Viable Completion:
- Produce offset vs latency CSV with visible spikes.
- Explain 4K aliasing in the report.
Full Completion:
- Include STLF success/failure cases.
Excellence (Going Above & Beyond):
- Compare results across at least two CPU models.