Project 6: Memory Disambiguation Probe
Build a microbenchmark that reveals memory aliasing stalls and disambiguation behavior.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Level 4: Expert |
| Time Estimate | 1-2 weeks |
| Main Programming Language | C |
| Alternative Programming Languages | Assembly |
| Coolness Level | Level 4: Hardcore Tech Flex |
| Business Potential | 1. The “Resume Gold” |
| Prerequisites | C, pointers, cache basics, RDTSC timing |
| Key Topics | memory disambiguation, aliasing, store-to-load forwarding, load/store buffers |
1. Learning Objectives
By completing this project, you will:
- Explain why CPUs speculate on load/store ordering.
- Measure 4K aliasing and disambiguation penalties.
- Distinguish store-to-load forwarding from true dependencies.
- Build a stall map over address offsets.
- Produce a reproducible report of aliasing behavior.
2. All Theory Needed (Per-Concept Breakdown)
2.1 Memory Disambiguation and 4K Aliasing
Fundamentals
Memory disambiguation is the CPU’s process of deciding whether a load depends on an earlier store when the addresses are not yet known. The CPU often speculates that a load is independent to keep the pipeline moving, but if the speculation is wrong, it must stall or replay the load. A classic hazard is 4K aliasing: when a store and a later load have the same lower 12 bits of the address, the CPU cannot easily disambiguate, so it may stall or enforce ordering. This creates measurable latency spikes that your benchmark can detect.
Before going deeper, anchor the concept in the simplest mental model and a single unit of measurement: cycles per store-load pair. Identify what changes state (the store), what observes that state (the load), and which constraint is non-negotiable (a load must never return stale data). This keeps the concept grounded before moving to deeper microarchitectural details.
Deep Dive into the concept
Loads and stores are issued out-of-order by modern CPUs to hide memory latency. To preserve correctness, the CPU must ensure that a load does not read stale data that should have been written by an earlier store. If the store address is unknown or partially known, the CPU has to decide whether to allow the load to proceed speculatively. This decision is made by the memory disambiguation logic, which uses partial address information (often the page offset, lower 12 bits) to guess whether the load might alias a store.
The 4K aliasing effect arises because the lower 12 bits correspond to the offset within a page. Many CPUs compare only these bits early in the pipeline. If a load and a store share the same lower 12 bits, the CPU treats them as potentially aliasing until full address resolution. This can trigger a stall or a replay, which appears as a latency spike. By scanning a range of offsets between a store and a load, you can observe that the latency is low for most offsets and spikes at 4K boundaries. This is a strong signal of the disambiguation policy.
Store-to-load forwarding (STLF) is the case where a load reads a value from a previous store that has not yet reached the cache. The CPU forwards the store data directly from the store buffer to the load. This is a fast path, but it is only possible when addresses match exactly and data size/alignment conditions are satisfied. Partial overlaps or misaligned stores can prevent forwarding and cause stalls. In your benchmark, you can create a store followed by a load at various offsets and sizes to see when STLF occurs and when it fails.
The memory subsystem maintains structures like the store buffer, load buffer, and memory order buffer. The store buffer holds pending stores, while the load buffer tracks in-flight loads. The disambiguation logic checks these buffers to ensure ordering. When it cannot resolve dependencies early, it may serialize the load, which reduces out-of-order benefits. Your benchmark reveals these costs by measuring cycles per iteration for different address offsets.
A key experimental challenge is avoiding confounding effects from caches and prefetchers. You want to measure aliasing, not cache misses. Therefore, keep data within L1 and use repeated accesses to warm caches. Use randomization to avoid prefetcher patterns. The observed latency spikes should align with 4K offsets if disambiguation is the cause. If you see broader or irregular spikes, you may be hitting other effects such as TLB misses or page conflicts.
Additional deep dive considerations: in real designs, disambiguation and 4K aliasing are rarely isolated; they interact with pipeline depth, power management, compiler decisions, and even microcode updates. When you study this behavior, vary one knob at a time and hold everything else constant: pin the core, fix the frequency if possible, warm up caches and predictors, and record the exact compiler flags. Vendor manuals describe typical behavior, but the actual thresholds can shift across steppings or microcode revisions, so empirical measurement is the ground truth. If your results disagree with published numbers, investigate confounders such as alignment, instruction form, address mapping, or hidden dependencies introduced by the compiler. Compilers and JITs implicitly target this behavior through instruction selection, scheduling, and unrolling, so translate your measurements into actionable rules of thumb. Finally, validate with at least two workloads: a synthetic microbenchmark and a slightly more realistic kernel. If both show the same trend, the effect is not an artifact of the test harness.
How this fits into the project
You will use this concept to design the offset scanning harness in §3.2 and to interpret the stall map in §3.7.
Definitions & key terms
- disambiguation -> deciding whether a load depends on an earlier store
- 4K aliasing -> false dependency due to matching lower 12 address bits
- store buffer -> structure holding pending stores
- load buffer -> structure tracking in-flight loads
Mental model diagram (ASCII)
Store addr: [page | offset]
Load addr: [page | offset]
Same offset -> possible alias -> stall/replay
How it works (step-by-step, with invariants and failure modes)
- Issue a store, then a load with varying offset.
- CPU compares partial address bits.
- If alias suspected, stall or replay load.
- Measure cycles and plot offset vs latency.
Invariants:
- Correctness requires loads not bypass true dependencies.
- Speculation must be corrected if wrong.
Failure modes:
- Cache misses mask aliasing effects.
- Prefetching reduces measured stalls.
Minimal concrete example
*store_ptr = value;
uint64_t x = *load_ptr; // offset varies
Common misconceptions
- “If addresses are different, no stall” -> partial aliasing can still stall.
- “STLF always works” -> misalignment or size mismatch can break it.
Check-your-understanding questions
- Why does the CPU use only partial address bits early?
- What is 4K aliasing and why 4K specifically?
- When does store-to-load forwarding fail?
Check-your-understanding answers
- The full address may not be computed yet early in the pipeline, so the CPU speculates with the bits it already has rather than stalling every load.
- 4K is page size, so lower 12 bits are the page offset.
- If addresses do not fully match or alignment is incompatible.
Real-world applications
- Memory-intensive code performance tuning
- Understanding stalls in database or HPC workloads
Where you’ll apply it
- In this project: see §3.2 Functional Requirements and §5.10 Phase 2.
- Also used in: P07-the-reorder-buffer-rob-boundary-finder.md.
References
- “Computer Architecture: A Quantitative Approach” by Hennessy and Patterson
- Intel Optimization Manual, memory ordering section
Key insights
- Partial address comparisons create false dependencies that show up as periodic stalls.
Summary
Memory disambiguation balances correctness and speed. 4K aliasing is the signature of this compromise.
Homework/Exercises to practice the concept
- Predict where stalls occur if the offset increments by 64 bytes.
- Explain how STLF can hide true dependencies.
Solutions to the homework/exercises
- Stalls should appear at offsets that are multiples of 4096 bytes, i.e. every 64th step when incrementing by 64 bytes.
- STLF forwards data and avoids waiting for cache writeback.
2.2 Store-to-Load Forwarding and Load/Store Buffers
Fundamentals
Store-to-load forwarding allows a load to read a value from a previous store that has not yet committed to the cache. This prevents unnecessary stalls and is essential for performance. The load and store buffers track in-flight memory operations and make forwarding possible. However, forwarding has strict requirements: addresses must match, and size/alignment must be compatible. If these conditions fail, the CPU may stall the load until the store completes.
Before going deeper, fix the simplest mental model: a store parks its data in a buffer, and a later load either receives that data via forwarding or waits for the store to drain. The unit of measurement is again cycles per store-load pair; the non-negotiable constraint is that the load must see the newest older store to its address.
Deep Dive into the concept
The store buffer is a queue of pending stores that have not yet been committed to the cache hierarchy. The load buffer tracks in-flight loads. When a load is issued, the CPU checks the store buffer to see if any older stores might target the same address. If it finds a matching store, it can forward the data directly, bypassing cache. This is faster than waiting for the store to complete and avoids pipeline stalls.
Forwarding is not always possible. The CPU may only know part of the store address at the time of the load. It may detect a partial match (same 4K offset) and stall until full address resolution. Even when addresses match, alignment and size matter. For example, a 4-byte load from an address that overlaps a previous 8-byte store can be forwarded, but if the load is unaligned or spans multiple store entries, forwarding may fail or be split. Some CPUs handle partial forwarding; others force a stall and replay.
Load and store buffers are limited in size. If too many memory operations are in flight, the buffers can fill, causing stalls. This is another reason memory-heavy loops can bottleneck even if caches are fast. In this project, your microbenchmark is small, so buffer capacity is not the primary limit, but the same mechanisms are at play.
To observe forwarding, you can set up a loop that stores to an address and then loads from the same address with different sizes and alignments. The cases where forwarding succeeds will show low latency; cases where it fails will show spikes. By varying the offset and alignment, you can map the forwarding rules of your CPU. This is especially relevant for systems programming and compiler optimization, where alignment decisions can have large performance impacts.
Additional deep dive considerations: forwarding behavior also interacts with pipeline depth, power management, compiler scheduling, and microcode revisions, so the methodology from §2.1 applies here too. Vary one knob at a time, pin the core, warm up caches and predictors, and record the exact compiler flags; treat empirical measurement as the ground truth when vendor manuals disagree. Translate your measurements into actionable rules of thumb, such as preferred alignments and access widths, and validate with both a synthetic microbenchmark and a slightly more realistic kernel before trusting a trend.
Supplemental note: a practical way to validate your mental model is a tiny A/B experiment that changes only one variable related to forwarding (say, the load width) and keeps all others fixed. Run it several times, record the median, and look for monotonic trends rather than a single magic number. If the trend is unstable, check for hidden dependencies, compiler reordering, or OS activity. These low-level details often shape high-level guidelines such as alignment requirements, preferred loop forms, or safe fallback paths; documenting them in your report turns raw measurements into reusable engineering rules.
How this fits into the project
You will use this concept to interpret your offset-vs-latency plot in §3.7 and to design targeted tests in §6.2.
Definitions & key terms
- store buffer -> queue holding pending stores
- load buffer -> queue tracking in-flight loads
- forwarding -> direct data bypass from store buffer to load
- replay -> re-issuing a load after a dependency was mispredicted
Mental model diagram (ASCII)
Store -> Store Buffer -> (forward) -> Load
\-> Cache (later)
How it works (step-by-step, with invariants and failure modes)
- Issue store; it enters store buffer.
- Issue load; check store buffer for match.
- If match and aligned, forward data to load.
- If uncertain, stall or replay.
Invariants:
- Loads must see the most recent older store.
- Forwarding must preserve correctness.
Failure modes:
- Partial overlap causes forward failure.
- Buffer overflow stalls loads.
Minimal concrete example
*(uint32_t*)p = 0xdeadbeef;
uint32_t x = *(uint32_t*)p; // should forward
Common misconceptions
- “Forwarding is always free” -> forwarding logic can add latency.
- “Alignment doesn’t matter” -> alignment affects forward eligibility.
Check-your-understanding questions
- Why does a load need to check the store buffer?
- What can cause a load to replay?
- Why might a misaligned load fail to forward?
Check-your-understanding answers
- To ensure it does not bypass a prior store.
- A late-discovered dependency or aliasing.
- The CPU cannot assemble the correct bytes efficiently.
Real-world applications
- Compiler alignment optimizations
- Debugging memory stalls in systems code
Where you’ll apply it
- In this project: see §3.7 Real World Outcome and §7.1 Frequent Mistakes.
- Also used in: P09-l1-bandwidth-stressor-zen-5-focus.md.
References
- Intel Optimization Manual, memory ordering
- AMD Software Optimization Guide
Key insights
- Forwarding is fast but fragile; alignment and size determine success.
Summary
Store-to-load forwarding is a critical optimization. Your benchmark makes its success and failure visible as latency spikes.
Homework/Exercises to practice the concept
- Test 4-byte vs 8-byte loads after a store and note latency differences.
- Align and misalign the store address and compare results.
Solutions to the homework/exercises
- Matching sizes (or a load fully contained within the store) usually forward; a load wider than the store typically stalls.
- Misalignment often breaks forwarding and increases latency.
2.3 Store Buffers, Store-to-Load Forwarding, and Alias Prediction
Fundamentals
Modern CPUs let loads execute before older stores when they predict the addresses do not alias. This speculation is handled by the store buffer and load queue. If the prediction is correct, the load can complete early, improving performance. If incorrect, the CPU must squash the load and re-execute it, costing cycles. Store-to-load forwarding is the fast path: if a load reads from an address recently written by a store in the buffer, the CPU can forward the value without waiting for the store to reach cache. Your memory disambiguation probe is essentially a measurement of how well these mechanisms work and where they fail.
Deep Dive into the concept
The store buffer holds pending stores that have not yet been committed to the cache or memory. This allows the CPU to retire stores quickly while deferring the actual write. Loads that are younger than those stores face a question: do they depend on any of the pending stores? If the CPU waited for every store to drain, it would lose much of its out-of-order advantage. Instead, it predicts. The first phase is an address-availability check: if the store address is not yet known (because address generation is still in flight), the CPU may conservatively stall the load, or it may use a partial-address predictor that guesses whether aliasing is likely.
Many CPUs implement a partial-address check using the low bits of the address (for example, 12 bits for the page offset). If a load and a store share the same low bits, they might alias, so the CPU may delay the load or mark it as speculative. This is why aliasing at page offsets is a key stress case: you can create two arrays separated by 4 KB or 64 KB so they share low bits but not the full address. When the predictor sees matching low bits, it may falsely assume a dependency and delay the load, causing a performance cliff even though the addresses differ. Your probe should intentionally sweep address offsets to find these cliffs.
Store-to-load forwarding is the fast path for true dependencies. If a load reads the same address as a store that is still in the buffer, the CPU can forward the store data directly to the load, bypassing the cache. But forwarding has constraints: size must match, alignment must be compatible, and the store must be the most recent writer. Partial overlap can prevent forwarding and force the load to wait until the store drains. This is why mismatched sizes (like a 4-byte store followed by an 8-byte load) are valuable test cases. A good probe includes multiple access widths and alignment offsets to reveal forwarding rules.
When prediction is wrong, the CPU has to “replay” the load. The load executes speculatively, reads stale data, and then later the store address resolves and the CPU detects the alias. It squashes the load, reissues it, and any dependent operations are re-executed. This can create large latency spikes. Some CPUs expose counters for load replays or memory-ordering violations; if available, they are excellent validation tools. If not, you can infer replays from sudden jumps in cycle counts when you introduce aliasing patterns.
A practical experiment uses two pointer streams: one that produces stores and another that produces loads. By varying the distance between them and the alignment of their addresses, you can observe three regimes: no aliasing and high throughput; false aliasing where throughput drops due to conservative prediction; and true aliasing where throughput remains stable because forwarding kicks in. The interesting region is false aliasing: it reveals the predictor’s limitations and can explain real-world performance cliffs in code that uses multiple arrays with power-of-two strides.
How this fits into the project
You will use this in §3.6 Edge Cases to define aliasing stress cases, in §5.10 Phase 2 to build the access patterns, and in §7.3 to debug unexpected stalls.
Definitions & key terms
- store buffer -> queue of pending stores waiting to commit to cache
- store-to-load forwarding -> delivering store data directly to a dependent load
- alias prediction -> heuristic that guesses if a load depends on an older store
- replay -> re-execution of a load after detecting mis-speculation
- partial-address check -> comparing low address bits to detect potential aliasing
Mental model diagram (ASCII)
Store Buffer: [S1 addr=?, data] [S2 addr=0x1000]
Load Q: [L1 addr=0x9000] -> may alias if low bits match
How it works (step-by-step, with invariants and failure modes)
- Issue stores into the store buffer and loads into the load queue.
- If a load address is known, compare with pending stores.
- If match, forward data; if uncertain, speculate.
- If later a store resolves and aliases, replay the load.
- Measure cycles and identify aliasing cliffs.
Invariants:
- A load must see the most recent older store to the same address.
- Replay occurs if a speculative load read was incorrect.
Failure modes:
- False aliasing causes unnecessary stalls.
- Misaligned accesses disable forwarding and increase latency.
Minimal concrete example
// Potential false alias: arrays 4 KB apart share the low 12 address bits
int *a = base;         // assume base is 4 KB-aligned
int *b = base + 1024;  // 1024 ints * 4 bytes = 4 KB later
a[i] = v;              // store
x = b[i];              // load: same page offset, different page
Common misconceptions
- “Alias prediction only matters for unsafe code” -> it affects all loads/stores.
- “Forwarding always works” -> size and alignment can block forwarding.
- “Replays are rare” -> they can be common in pointer-heavy code.
Check-your-understanding questions
- Why do page-offset collisions cause false aliasing?
- When does store-to-load forwarding fail even for the same address?
- What evidence would suggest load replays are occurring?
Check-your-understanding answers
- The predictor often compares only low address bits, which match across pages.
- When access sizes or alignments differ, forwarding cannot supply correct data.
- Sudden spikes in cycles and increased replay-related counters.
Real-world applications
- Optimizing array layouts in high-performance computing
- Understanding performance cliffs in database or analytics kernels
- Designing microbenchmarks for memory-ordering behavior
Where you’ll apply it
- In this project: see §3.6 Edge Cases and §5.10 Phase 2.
- Also used in: P07-the-reorder-buffer-rob-boundary-finder.md, P09-l1-bandwidth-stressor-zen-5-focus.md.
References
- “Memory Systems: Cache, DRAM, Disk” by Jacob, Ng, and Wang, Ch. 8
- “Computer Architecture: A Quantitative Approach” by Hennessy and Patterson, Ch. 3
Key insights
- Disambiguation is prediction; your probe reveals where that prediction fails.
Summary
Store buffers and disambiguation predictors let loads run ahead, but they can also create stalls and replays when aliasing is mispredicted. Measuring those boundaries explains many real-world slowdowns.
Homework/Exercises to practice the concept
- Create two arrays offset by 4 KB and measure throughput vs 2 KB offset.
- Test store->load with mismatched sizes and observe the forwarding penalty.
Solutions to the homework/exercises
- The 4 KB offset often shows a throughput drop due to false aliasing.
- Mismatched sizes usually disable forwarding and increase latency.
3. Project Specification
3.1 What You Will Build
A microbenchmark that scans address offsets between a store and a subsequent load, measuring cycles for each offset. The output is a stall map showing where disambiguation and forwarding fail, including 4K aliasing spikes.
3.2 Functional Requirements
- Offset Scanner: Iterate over offsets from 0 to at least 8192 bytes.
- Timing Harness: Measure cycles per store-load pair.
- Warmup and Pinning: Stabilize caches and run on a single core.
- Report Generator: Output offset vs latency.
3.3 Non-Functional Requirements
- Performance: Complete a full scan in under 2 seconds.
- Reliability: Repeatable spike patterns across trials.
- Usability: CLI flags for offset range and stride.
3.4 Example Usage / Output
$ ./alias_probe --max-offset 8192 --stride 64
offset,cycles
0,4
64,4
4096,18 <-- aliasing spike
3.5 Data Formats / Schemas / Protocols
CSV output:
offset,cycles
0,4
64,4
4096,18
3.6 Edge Cases
- Stride larger than offset range
- Cache miss interference from large buffers
- Prefetcher hiding stalls
3.7 Real World Outcome
You will produce a graph showing periodic latency spikes at 4K boundaries and note how alignment affects them.
3.7.1 How to Run (Copy/Paste)
cc -O2 -Wall -o alias_probe src/alias_probe.c src/timing.c
taskset -c 2 ./alias_probe --max-offset 8192 --stride 64 --trials 5
3.7.2 Golden Path Demo (Deterministic)
- Use fixed stride 64 and 5 trials.
- Expect spikes at offsets 4096 and 8192.
3.7.3 If CLI: Exact Terminal Transcript
$ taskset -c 2 ./alias_probe --max-offset 8192 --stride 64
offset,cycles
0,4
64,4
4096,18
8192,17
$ echo $?
0
Failure demo (bad stride):
$ ./alias_probe --stride 0
Error: stride must be >= 1
$ echo $?
3
Exit codes:
| Code | Meaning |
|---|---|
| 0 | success |
| 2 | timing init error |
| 3 | invalid argument |
4. Solution Architecture
4.1 High-Level Design
+---------------+ +------------------+ +-----------------+
| Offset Scanner|-> | Timing Harness |-> | Report Builder |
+---------------+ +------------------+ +-----------------+
4.2 Key Components
| Component | Responsibility | Key Decisions |
|---|---|---|
| Scanner | Generate offsets | Use 64-byte stride |
| Harness | Measure latency | RDTSCP + fences |
| Reporter | CSV output | easy plotting |
4.3 Data Structures (No Full Code)
struct Sample { int offset; double cycles; };
4.4 Algorithm Overview
Key Algorithm: Alias Scan
- Warm up memory.
- For each offset, perform store + load N times.
- Record median cycles.
Complexity Analysis:
- Time: O(O * N), where O is the number of offsets and N is the number of trials per offset
- Space: O(O)
5. Implementation Guide
5.1 Development Environment Setup
cc --version
5.2 Project Structure
alias-probe/
├── src/
│ ├── alias_probe.c
│ └── timing.c
└── README.md
5.3 The Core Question You’re Answering
“When does my CPU stall because it cannot disambiguate memory?”
The answer is the aliasing spike map.
5.4 Concepts You Must Understand First
- 4K aliasing and disambiguation
- Store-to-load forwarding rules
- Cache warmup and timing
5.5 Questions to Guide Your Design
- How will you keep data in L1 to avoid cache misses?
- How will you choose a stride that reveals 4K aliasing?
- How will you detect forwarding success vs failure?
5.6 Thinking Exercise
If you shift the base address by 32 bytes, how will the aliasing spikes move? Explain why.
5.7 The Interview Questions They’ll Ask
- What is 4K aliasing and how does it affect performance?
- What is store-to-load forwarding?
- Why do CPUs speculate on memory ordering?
5.8 Hints in Layers
Hint 1: Use a small buffer to keep everything in L1.
Hint 2: Measure median of multiple trials per offset.
Hint 3: Plot the results to see spikes clearly.
5.9 Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Memory ordering | “Computer Architecture” | Ch. 3 |
| Cache effects | “Computer Systems” | Ch. 6 |
5.10 Implementation Phases
Phase 1: Foundation (2-3 days)
- Implement timing harness and store/load loop.
- Checkpoint: stable cycles for offset 0.
Phase 2: Core Functionality (4-6 days)
- Add offset scanning and CSV output.
- Checkpoint: spike at 4096 observed.
Phase 3: Analysis (2-3 days)
- Analyze STLF success and failure cases.
- Checkpoint: report includes aliasing explanation.
5.11 Key Implementation Decisions
| Decision | Options | Recommendation | Rationale |
|---|---|---|---|
| Stride | 64 vs 128 | 64 | matches cache line |
| Statistic | mean vs median | median | robust to noise |
6. Testing Strategy
6.1 Test Categories
| Category | Purpose | Examples |
|---|---|---|
| Unit Tests | Timing correctness | cache hit vs miss |
| Integration Tests | Full scan | offsets 0-8192 |
| Edge Tests | Invalid stride | stride 0 |
6.2 Critical Test Cases
- Offset 0 should show low latency (forwarding success).
- Offset 4096 should show spike (alias).
- Misaligned store should show higher latency.
6.3 Test Data
offsets: 0, 64, 4096
7. Common Pitfalls & Debugging
7.1 Frequent Mistakes
| Pitfall | Symptom | Solution |
|---|---|---|
| Prefetcher interference | no spikes | randomize access |
| Cache misses | high baseline | reduce buffer size |
| Not pinning core | noisy results | use taskset |
7.2 Debugging Strategies
- Print timing histogram for a single offset.
- Use perf to verify load/store counts.
7.3 Performance Traps
- Large buffers spill into L2/L3 and hide aliasing behavior.
8. Extensions & Challenges
8.1 Beginner Extensions
- Add a visualization script.
8.2 Intermediate Extensions
- Compare different load sizes (1, 4, 8 bytes).
8.3 Advanced Extensions
- Measure aliasing differences across cores or CPUs.
9. Real-World Connections
9.1 Industry Applications
- Debugging memory stalls in databases and runtimes
- Compiler alignment and alias analysis
9.2 Related Open Source Projects
- uops.info: microbenchmarks
- lmbench: memory latency tools
9.3 Interview Relevance
- Memory disambiguation is a deep performance topic.
10. Resources
10.1 Essential Reading
- “Computer Architecture: A Quantitative Approach”
- Intel Optimization Manual
10.2 Video Resources
- “Memory Ordering in CPUs” lecture
10.3 Tools & Documentation
- perf: event counters
- rdtsc: timing
10.4 Related Projects in This Series
- Next: P07-the-reorder-buffer-rob-boundary-finder.md
- Also: P03-speculative-side-channel-explorer-spectre-lite.md
11. Self-Assessment Checklist
11.1 Understanding
- I can explain 4K aliasing.
- I can describe STLF success conditions.
- I can interpret the stall map.
11.2 Implementation
- The scan completes with stable spikes.
- Output is deterministic and plotted.
- Results are documented with CPU model.
11.3 Growth
- I can explain memory disambiguation in an interview.
12. Submission / Completion Criteria
Minimum Viable Completion:
- Produce offset vs latency CSV with visible spikes.
- Explain 4K aliasing in the report.
Full Completion:
- Include STLF success/failure cases.
Excellence (Going Above & Beyond):
- Compare results across at least two CPU models.