Project 7: The Reorder Buffer (ROB) Boundary Finder

Build a benchmark that discovers how many uOps your CPU can keep in flight before stalling.

Quick Reference

Attribute | Value
Difficulty | Level 4: Expert
Time Estimate | 1-2 weeks
Main Programming Language | C (with inline assembly)
Alternative Programming Languages | Assembly
Coolness Level | Level 4: Hardcore Tech Flex
Business Potential | Level 1: The “Resume Gold”
Prerequisites | C, assembly basics, cache latency concepts, RDTSC timing
Key Topics | ROB capacity, instruction window, register renaming, latency hiding

1. Learning Objectives

By completing this project, you will:

  1. Explain the role of the ROB in out-of-order execution.
  2. Measure the effective instruction window size of your CPU.
  3. Build a benchmark that isolates long-latency loads.
  4. Identify the stall cliff where the ROB fills.
  5. Translate results into guidance for loop unrolling.

2. All Theory Needed (Per-Concept Breakdown)

2.1 Reorder Buffer and In-Order Retirement

Fundamentals

The reorder buffer (ROB) is the structure that allows a CPU to execute instructions out of order while retiring them in order. Each in-flight instruction reserves an entry in the ROB until it finishes and is safe to commit architecturally. The size of the ROB limits how many instructions can be in flight, which in turn limits how much latency can be hidden. If the ROB fills, the frontend must stall, even if execution units are idle. The ROB is therefore a key throughput limiter in memory-latency-bound code.

Keep the simplest mental model first: the ROB is a circular buffer of in-flight uOps; an entry is allocated when a uOp issues and freed when it retires, and occupancy is measured in uOps. What changes state is allocation and retirement, what observes that state is your cycle measurement, and the non-negotiable constraint is in-order retirement.

Deep Dive into the concept

Out-of-order execution separates the concepts of “execute” and “retire.” Instructions can execute as soon as operands are ready, but they must retire in program order to preserve correctness and handle exceptions. The ROB tracks each instruction’s completion status and holds results until it is safe to commit. When the head of the ROB is complete, it retires; if it is not complete, retirement stalls, and the ROB fills.

The ROB size is measured in uOps (or micro-ops), not instructions. A single instruction can produce multiple uOps, so the effective window size depends on instruction mix. When a long-latency load occurs, later independent instructions can execute while the load is outstanding, but only until the ROB fills with in-flight uOps. After that, the pipeline stalls, and the CPU cannot issue further instructions. This is the “stall cliff” your benchmark detects.

Your benchmark creates a long-latency load (e.g., an L3 or memory miss) and then issues a sequence of independent instructions. As you increase the number of independent uOps between the load and a dependent use, you are effectively increasing the number of uOps that can be in flight while waiting for the load. If the ROB is large enough, the CPU can keep executing the independent uOps and hide the load latency. If not, it stalls when the ROB fills, and your measured cycles will jump. The point where this jump occurs approximates the ROB capacity.
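A hedged sketch of that probe in C, assuming x86 RDTSC via <x86intrin.h>. Everything here (`probe`, `sink`, the four accumulators, `ARRAY_BYTES`) is illustrative, not the definitive implementation; the real tool should emit the filler as straight-line assembly (filler.S) so the compiler cannot collapse the loop, and should report the median of many pinned-core trials.

```c
#include <stdint.h>
#include <stdlib.h>
#include <x86intrin.h>

enum { ARRAY_BYTES = 64 << 20 };   /* 64 MiB: far larger than a typical LLC */

static volatile uint64_t sink;     /* keeps results live after the timed region */

/* One probe: a likely-miss load, then `filler` independent ALU uOps. */
uint64_t probe(const volatile uint8_t *buf, size_t idx, int filler)
{
    uint64_t r0 = 1, r1 = 2, r2 = 3, r3 = 5;  /* independent accumulators */

    _mm_lfence();                  /* keep rdtsc from drifting into the region */
    uint64_t t0 = __rdtsc();

    uint8_t x = buf[idx];          /* long-latency load (randomized idx misses) */
    for (int i = 0; i < filler; i += 4) {
        r0 += 1; r1 += 1; r2 += 1; r3 += 1;   /* independent filler uOps */
    }

    uint64_t t1 = __rdtsc();
    _mm_lfence();

    sink = x + r0 + r1 + r2 + r3;  /* outside the timed region */
    return t1 - t0;
}
```

Sweep `filler` from small to large with a fresh random `idx` each trial; the cycle count should stay roughly flat until the filler no longer fits in flight.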

The ROB also interacts with other structures like the reservation stations and issue queues. Sometimes the issue queue or scheduler fills before the ROB, so the observed cliff might reflect those limits instead. To interpret results correctly, you should keep uOps simple and avoid heavy use of ports that could saturate the scheduler. Using simple integer instructions and avoiding dependencies helps isolate the ROB.

The ROB size is a key parameter for compiler optimizations and loop unrolling. If you unroll too far and create too many in-flight uOps without progress, the ROB fills and performance plateaus. Your benchmark gives a concrete number you can use as a heuristic for unroll factors in latency-bound loops.
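That heuristic can be made concrete with a tiny helper. This is an illustrative sketch, not vendor guidance: `suggest_unroll` and its divide-by-two headroom factor are assumptions to tune against your own measurements.

```c
/* Given a measured ROB estimate and the uOp cost of one loop iteration,
   suggest an unroll factor that keeps the window busy without overflowing
   it. The /2 leaves headroom for loads, branches, and loop overhead. */
int suggest_unroll(int rob_uops, int uops_per_iter)
{
    if (uops_per_iter <= 0)
        return 1;
    int n = (rob_uops / 2) / uops_per_iter;
    return n > 1 ? n : 1;
}
```

For example, a 224-entry ROB estimate and 8 uOps per iteration would suggest unrolling about 14 times.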

In real designs the ROB is never observed in isolation; it interacts with pipeline depth, power management, and compiler decisions. When you measure it, vary one knob at a time and hold everything else constant: pin the core, fix the frequency if possible, warm up caches and predictors, and record the exact compiler flags. Vendor manuals describe typical behavior, but thresholds can shift across steppings and microcode revisions, so empirical measurement is the ground truth. If your results disagree with published numbers, look for confounders such as alignment, instruction form, address mapping, or hidden dependencies introduced by the compiler. Finally, validate with two workloads, a synthetic microbenchmark and a slightly more realistic kernel; if both show the same trend, the effect is not an artifact of the harness.

How this fits in the project

You will apply this concept to design the filler uOps sequence in §3.2 and to interpret the stall cliff in §3.7.

Definitions & key terms

  • ROB -> reorder buffer that tracks in-flight instructions
  • retire -> commit instruction results in program order
  • instruction window -> number of in-flight uOps the CPU can hold
  • stall cliff -> sudden latency increase when ROB fills

Mental model diagram (ASCII)

[Fetch] -> [Rename/Alloc ROB] -> [Execute OoO] -> [Retire in order]
                ^ entry held from allocation to retirement; window <= ROB size

How it works (step-by-step, with invariants and failure modes)

  1. Issue a long-latency load that will miss cache.
  2. Issue independent filler uOps.
  3. When ROB fills, the frontend stalls.
  4. When the load returns, retirement resumes.

Invariants:

  • Retirement order must match program order.
  • ROB entries are allocated per uOp.

Failure modes:

  • Filler uOps are not independent and create dependencies.
  • Scheduler fills before ROB, masking the true limit.

Minimal concrete example

x = array[random_index]; // long-latency load
// many independent adds here
sum += x; // dependent use

Common misconceptions

  • “ROB size equals instruction window for all code” -> uOp expansion changes it.
  • “Only memory misses fill ROB” -> any long-latency op can.

Check-your-understanding questions

  1. Why does the ROB need to retire in order?
  2. What causes the stall cliff in your benchmark?
  3. Why might your measured window be smaller than the advertised ROB size?

Check-your-understanding answers

  1. In-order retirement preserves program semantics and exception handling.
  2. The ROB fills with in-flight uOps while waiting for the load to complete.
  3. Other structures (issue queues) or uOp expansion can reduce effective size.

Real-world applications

  • Compiler unroll factor tuning
  • Latency hiding in memory-bound kernels

Where you’ll apply it

In §3.2 (designing the filler uOps sequence) and §3.7 (interpreting the stall cliff).

References

  • “Computer Architecture: A Quantitative Approach” by Hennessy and Patterson
  • “Inside the Machine” by Jon Stokes

Key insights

  • The ROB is the capacity that limits how much latency can be hidden by independent work.

Summary

The ROB enables out-of-order execution but imposes a hard limit on in-flight uOps. Your benchmark finds this limit empirically.

Homework/Exercises to practice the concept

  1. Predict how the cliff shifts if your filler loop uses more complex instructions.
  2. Explain why a dependent chain hides ROB effects.

Solutions to the homework/exercises

  1. Complex instructions expand into more uOps and fill the ROB sooner.
  2. A dependent chain serializes execution, so throughput is set by chain latency rather than window size; the ROB limit never becomes the visible bottleneck.

2.2 Register Renaming and Latency Hiding

Fundamentals

Register renaming removes false dependencies by giving each write a new physical register. This allows independent instructions to execute out of order even if they reuse architectural register names. Renaming is essential for maintaining a large instruction window; without it, the ROB would quickly stall due to WAW or WAR hazards. Latency hiding relies on renaming because it allows the CPU to keep issuing independent instructions while waiting on slow operations.

Keep the simplest mental model first: renaming is a lookup table from architectural names to physical registers, updated on every write and backed by a finite free list. What changes state is each new write, what observes that state is how many independent writes stay in flight, and the non-negotiable constraint is that a physical register is freed only after its replacement retires.

Deep Dive into the concept

Architectural registers are limited in number, but the CPU internally has many more physical registers. When an instruction writes to a register, the rename logic assigns a new physical register for that write and updates a mapping table. Subsequent instructions read from the latest mapping. This breaks false dependencies (write-after-write and write-after-read), enabling out-of-order scheduling of independent instructions. The ROB keeps track of these mappings so that when an instruction retires, the architectural state is updated and old physical registers can be freed.

Latency hiding depends on the ability to issue independent uOps while a long-latency load is outstanding. If your filler instructions reuse the same architectural registers, you may inadvertently create dependencies or run out of rename resources. For this reason, your filler loop should use a large set of registers and avoid chains that feed into each other. If you run out of physical registers, the rename stage will stall even if the ROB has space. This is another source of a cliff in your measurement. Therefore, register choice and unroll factor matter.

The rename stage also interacts with the scheduler and ROB. Each uOp consumes a rename entry and a ROB entry. If either resource runs out, the pipeline stalls. The measured instruction window is thus bounded by the minimum of these resources. Many CPUs report ROB size but do not publish rename register counts, which can make your measurement smaller than expected. Observing this in your benchmark is an opportunity to learn how microarchitectural limits interact.

You can validate renaming effects by comparing a filler loop that reuses a small set of registers with one that uses many registers. The loop with more registers should allow more in-flight uOps before stalling, producing a larger apparent window. This also teaches you a practical optimization: avoid artificial dependencies if you want to hide latency.
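That comparison can be sketched as two C filler variants: one that funnels all work through two accumulators (creating chains) and one that spreads it over eight independent ones. The function names are illustrative; time both with the same harness, and the spread version should sustain more in-flight uOps before stalling.

```c
#include <stdint.h>

/* Chained version: every add reads the result of a previous add, so
   renaming cannot expose any parallelism. */
uint64_t filler_two_regs(int n)
{
    uint64_t a = 0, b = 0;
    for (int i = 0; i < n; i += 2) {
        a += b + 1;   /* depends on b */
        b += a + 1;   /* depends on the new a */
    }
    return a + b;
}

/* Spread version: eight accumulators with no cross-dependencies, so all
   eight adds per round can be in flight at once. */
uint64_t filler_eight_regs(int n)
{
    uint64_t r0 = 0, r1 = 0, r2 = 0, r3 = 0,
             r4 = 0, r5 = 0, r6 = 0, r7 = 0;
    for (int i = 0; i < n; i += 8) {
        r0 += 1; r1 += 1; r2 += 1; r3 += 1;
        r4 += 1; r5 += 1; r6 += 1; r7 += 1;
    }
    return r0 + r1 + r2 + r3 + r4 + r5 + r6 + r7;
}
```

Both do the same number of adds; only the dependency structure differs, which is exactly the variable the A/B experiment should isolate.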

In real designs register renaming is never observed in isolation; it shares resources with the scheduler and ROB and is shaped by compiler instruction selection, scheduling, and unrolling. Measure it the same way as the ROB: vary one knob at a time, pin the core, fix the frequency if possible, warm up caches and predictors, and record compiler flags. Published rename-register counts, where they exist at all, can shift across steppings and microcode revisions, so empirical measurement is the ground truth. If results disagree with expectations, check for hidden dependencies introduced by the compiler, then validate with both a synthetic microbenchmark and a slightly more realistic kernel before trusting the trend.

A practical way to validate your mental model is a tiny A/B experiment that changes only one variable, here the number of distinct filler registers, and keeps everything else fixed. Run it several times, record the median, and look for monotonic trends rather than a single magic number; if the trend is unstable, check for hidden dependencies, compiler reordering, or OS activity. Low-level limits like rename capacity often surface as high-level performance guidelines (preferred loop forms, unroll caps), so document your findings as reusable engineering rules.

How this fits in the project

You will use this concept to design filler loops in §5.10 Phase 1 and to interpret surprising cliffs in §7.1.

Definitions & key terms

  • register renaming -> mapping architectural registers to physical registers
  • physical register -> internal register used by the CPU
  • false dependency -> WAW or WAR hazard that can be removed by renaming
  • rename stall -> pipeline stall due to lack of free physical registers

Mental model diagram (ASCII)

Arch R1 -> Phys P5
Arch R1 -> Phys P9 (new write)
Rename map updates each write

How it works (step-by-step, with invariants and failure modes)

  1. Decode instruction and identify src/dst registers.
  2. Allocate a new physical register for the destination.
  3. Update rename map; sources read from previous mappings.
  4. Free old physical registers on retirement.
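The four steps above can be sketched as a toy rename table. The sizes (4 architectural, 8 physical registers) and names are invented for illustration; real renamers also handle branch-misprediction recovery, which is omitted here.

```c
#define NARCH 4
#define NPHYS 8

typedef struct {
    int map[NARCH];          /* arch reg -> current phys reg */
    int free_list[NPHYS];    /* stack of free phys regs */
    int free_top;
} Renamer;

void renamer_init(Renamer *r)
{
    for (int i = 0; i < NARCH; i++) r->map[i] = i;   /* identity at reset */
    r->free_top = 0;
    for (int p = NPHYS - 1; p >= NARCH; p--)
        r->free_list[r->free_top++] = p;             /* the rest start free */
}

/* Rename a write to arch reg `a`. Returns the previous phys reg, which
   must be freed at retirement, or -1 if no phys reg is free: a rename
   stall even though the ROB may still have space. */
int rename_write(Renamer *r, int a)
{
    if (r->free_top == 0) return -1;
    int old = r->map[a];
    r->map[a] = r->free_list[--r->free_top];
    return old;
}

void retire(Renamer *r, int old_phys)
{
    r->free_list[r->free_top++] = old_phys;  /* safe to recycle only now */
}
```

With 4 spare physical registers, the fifth in-flight write stalls until an older instruction retires and frees its predecessor, which is exactly the cliff source described above.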

Invariants:

  • Each architectural register maps to exactly one physical register at a time.
  • Physical registers are freed only after retirement.

Failure modes:

  • Too few physical registers leads to rename stalls.
  • Reusing few registers reduces independent work.

Minimal concrete example

; Independent adds using different registers
add r8, r9
add r10, r11
add r12, r13

Common misconceptions

  • “Renaming removes all dependencies” -> true dependencies still matter.
  • “More registers always help” -> only if instructions are independent.

Check-your-understanding questions

  1. Why do false dependencies limit out-of-order execution?
  2. How does register renaming enable latency hiding?
  3. What happens when physical registers run out?

Check-your-understanding answers

  1. They force ordering even when data is independent.
  2. It allows independent instructions to execute without waiting for name reuse.
  3. The rename stage stalls and no new uOps are issued.

Real-world applications

  • Compiler register allocation and instruction scheduling
  • Performance tuning of tight loops

Where you’ll apply it

In §5.10 Phase 1 (designing the filler loops) and §7.1 (diagnosing surprisingly early cliffs).

References

  • “Computer Organization and Design” by Patterson and Hennessy
  • “Computer Architecture” by Hennessy and Patterson

Key insights

  • Register renaming is the enabler that turns a deep ROB into real latency hiding.

Summary

Renaming eliminates false dependencies and keeps the instruction window full. Without it, your benchmark would stall early.

Homework/Exercises to practice the concept

  1. Write two versions of a filler loop: one reusing two registers, one using eight.
  2. Predict which will have a larger effective window and why.

Solutions to the homework/exercises

  1. The loop with eight registers will allow more independent uOps.
  2. It avoids false dependencies and rename stalls, expanding the window.

2.3 Instruction Window, MLP, and MSHR Interaction

Fundamentals

The reorder buffer (ROB) defines how many uOps can be in flight, but memory-level parallelism (MLP) depends on more than ROB size. Loads also require entries in the load queue and MSHRs (miss status holding registers) in the cache hierarchy. If you create many independent cache misses, you can fill the ROB but still be limited by MSHRs or load queue size. A boundary finder must therefore distinguish between “ROB full” and “memory subsystem full.” The key is to design experiments that either avoid cache misses or generate controlled independent misses, then observe where throughput stops scaling.

Deep Dive into the concept

The instruction window is the set of in-flight uOps tracked by the ROB, scheduler, and load/store queues. Its size determines how far ahead the CPU can look to find independent work. But when that work is memory loads that miss in cache, progress depends on the memory system’s ability to track those misses. The L1 and L2 caches have a limited number of MSHRs, each representing an outstanding miss. When MSHRs are full, new misses must wait, and the pipeline can stall even if the ROB has space. This creates a second boundary that can look like a ROB limit if you are not careful.

To separate these effects, you need two classes of microbenchmarks. First, a pure compute chain with no memory misses. This isolates the ROB and scheduler because execution is bound by dependency depth and window size. You can create this by generating long independent ALU operations and then introducing dependency chains at controlled intervals. The throughput should scale with window size until you hit the ROB limit. Second, a memory-heavy benchmark that creates many independent cache misses (e.g., by using multiple streams that each miss the LLC). Here the throughput will scale until you hit MSHR or load queue limits. The plateau in this regime tells you more about the memory subsystem than the ROB.

Another important concept is effective window size. Even if the ROB can hold 300 uOps, the effective window for a particular instruction mix might be smaller because of other constraints: issue queue entries, physical registers, or load queue capacity. For example, a loop that uses many registers may run out of physical registers before filling the ROB, causing stalls. Similarly, a loop with heavy loads may fill the load queue even when the ROB still has free slots. A good boundary finder will therefore include counters or diagnostic signals (if available) such as “rob_full” or “ldq_full” to attribute the stall source.

This is where statistical inference helps. If you vary only the number of independent misses (by changing the number of streams) while keeping compute constant, you can plot cycles per iteration versus stream count. The first knee in the curve is often the MSHR limit; a later knee may indicate the ROB limit. Likewise, if you increase the number of independent ALU ops between misses, you can see how the ROB window allows you to hide latency. The point at which adding more ALU ops stops improving throughput is another clue about window size.

Finally, remember that modern CPUs do speculative execution. The CPU may issue loads before the addresses are fully resolved or before older stores. This can inflate the apparent window because speculative loads count as in-flight. But if disambiguation predicts wrong, replays can reduce effective throughput. So your boundary finder should also log replays or use patterns that minimize aliasing. The goal is not a single number but a profile: ROB size, load queue size, MSHR limit, and how they interact.
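To make the independent-miss regime concrete, here is a minimal sketch of a multi-stream access kernel. `STRIDE`, the `len / 8` stream spacing, and the function name are illustrative assumptions; a real probe would use a buffer much larger than the LLC, randomize offsets to defeat prefetchers, and time each stream count with the same harness as the ROB probe.

```c
#include <stdint.h>
#include <stddef.h>

enum { STRIDE = 4096 };   /* a page apart: successive touches hit new lines */

/* Touch `nstreams` widely separated locations per iteration. No access
   depends on a previous one, so the CPU can keep several misses
   outstanding, up to the MSHR / load-queue limit. */
uint64_t sweep_streams(const uint8_t *buf, size_t len, int nstreams, int iters)
{
    uint64_t sum = 0;
    for (int i = 0; i < iters; i++) {
        size_t base = (size_t)i * STRIDE;
        for (int s = 0; s < nstreams; s++) {
            size_t idx = (base + (size_t)s * (len / 8)) % len;
            sum += buf[idx];          /* independent: no pointer chasing */
        }
    }
    return sum;
}
```

Plot cycles per iteration against `nstreams`: the first knee is usually the MSHR limit, and a later knee (if any) reflects the ROB or load queue.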

How this fits in the project

You will use this in §3.6 Edge Cases to include both compute-only and memory-heavy regimes, and in §5.10 Phase 2 to interpret the scaling curves.

Definitions & key terms

  • instruction window -> the set of in-flight uOps visible to the scheduler
  • MSHR -> hardware entry tracking an outstanding cache miss
  • MLP -> memory-level parallelism (independent misses in flight)
  • load queue -> structure holding in-flight loads
  • knee point -> point where scaling stops and a new bottleneck appears

Mental model diagram (ASCII)

ROB window -> Issue -> Loads -> L1/L2 MSHRs -> DRAM
           \-> ALU ops (no MSHR usage)

How it works (step-by-step, with invariants and failure modes)

  1. Build a compute-only loop to probe ROB/scheduler limits.
  2. Build a multi-stream miss loop to probe MSHR limits.
  3. Sweep independent stream count and plot cycles per iter.
  4. Identify knee points and attribute to ROB vs MSHR.
  5. Validate with counters if available.

Invariants:

  • Compute-only test must avoid cache misses.
  • Memory test must ensure independent misses (no pointer chasing).

Failure modes:

  • Prefetchers reduce miss rates and hide MSHR limits.
  • Alias replays distort the scaling curve.

Minimal concrete example

// Three independent miss streams (no pointer chasing);
// a, b, c are large arrays and stride spans at least a cache line
for (size_t i = 0; i < N; i++) {
  sum += a[i * stride];
  sum += b[i * stride];
  sum += c[i * stride];
}

Common misconceptions

  • “ROB size equals maximum misses” -> MSHR count often limits misses first.
  • “Pointer chasing reveals ROB” -> pointer chasing serializes and hides window size.
  • “One curve proves the limit” -> you need multiple regimes to disambiguate.

Check-your-understanding questions

  1. Why does pointer chasing hide MLP even with a large ROB?
  2. What does a knee in the cycles-per-iter curve usually indicate?
  3. How can prefetchers distort a ROB boundary experiment?

Check-your-understanding answers

  1. Each miss depends on the previous, so only one is in flight.
  2. A new bottleneck like MSHR or ROB saturation.
  3. They reduce misses, delaying or removing the MSHR knee.

Real-world applications

  • Understanding why memory-bound code does not scale with unrolling
  • Tuning database or analytics kernels for better MLP
  • Designing CPU microbenchmarks for architectural characterization

Where you’ll apply it

In §3.6 (covering both compute-only and memory-heavy regimes) and §5.10 Phase 2 (interpreting the scaling curves).

References

  • “Computer Architecture: A Quantitative Approach” by Hennessy and Patterson, Ch. 3
  • “What Every Programmer Should Know About Memory” by Ulrich Drepper

Key insights

  • The “window” is a system of queues, not just the ROB.

Summary

ROB size is only one limit on out-of-order execution. MSHRs and load queues create their own ceilings, and a proper boundary finder separates these effects with targeted microbenchmarks.

Homework/Exercises to practice the concept

  1. Build two benchmarks: one compute-only and one with 4 independent miss streams. Compare scaling.
  2. Disable prefetching (if possible) and observe how the knee points shift.

Solutions to the homework/exercises

  1. The compute-only test scales to ROB size; the miss test plateaus earlier due to MSHRs.
  2. Disabling prefetching typically makes the MSHR knee more visible.

3. Project Specification

3.1 What You Will Build

A benchmark that issues a long-latency load followed by a configurable number of independent filler uOps. By sweeping the filler length, the tool finds the point where latency spikes, indicating the ROB capacity. The output includes an estimated ROB size and recommended unroll factors.

3.2 Functional Requirements

  1. Long-Latency Load: Force a cache miss using a large, randomized array.
  2. Filler Generator: Emit independent uOps using many registers.
  3. Timing Harness: Measure cycles per iteration for each filler length.
  4. Boundary Detector: Identify the cliff where latency jumps.
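A minimal sketch of requirement 3, assuming an x86 machine with <x86intrin.h>; the names (`median_cycles`, `nop_probe`) are illustrative. A production harness would additionally pin the core and consider cpuid serialization where lfence is insufficient.

```c
#include <stdint.h>
#include <stdlib.h>
#include <x86intrin.h>

static int cmp_u64(const void *a, const void *b)
{
    uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
    return (x > y) - (x < y);
}

/* Placeholder probe doing no work; replace with the load+filler probe. */
uint64_t nop_probe(void *arg) { (void)arg; return 0; }

/* Median of `trials` serialized RDTSC measurements of fn(arg). The
   median rejects outliers from interrupts and migrations. */
uint64_t median_cycles(uint64_t (*fn)(void *), void *arg, int trials)
{
    uint64_t t[64];
    if (trials < 1)  trials = 1;
    if (trials > 64) trials = 64;
    for (int i = 0; i < trials; i++) {
        _mm_lfence();                 /* fence so rdtsc brackets the region */
        uint64_t t0 = __rdtsc();
        _mm_lfence();
        fn(arg);
        _mm_lfence();
        uint64_t t1 = __rdtsc();
        t[i] = t1 - t0;
    }
    qsort(t, (size_t)trials, sizeof t[0], cmp_u64);
    return t[trials / 2];
}
```

Each CSV row then becomes one `median_cycles` call per filler size.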

3.3 Non-Functional Requirements

  • Performance: Sweep up to 1024 uOps in under 2 seconds.
  • Reliability: Repeatable cliff across trials.
  • Usability: CLI flags for filler size and stride.

3.4 Example Usage / Output

$ ./rob_finder --max-uops 1024
uops,cycles
128,300
256,305
512,315
640,620  <-- cliff

3.5 Data Formats / Schemas / Protocols

CSV output:

uops,cycles
256,305
512,315
640,620

3.6 Edge Cases

  • Filler uOps not independent
  • Cache misses too large causing noise
  • Prefetchers hiding latency

3.7 Real World Outcome

You will estimate your CPU’s ROB capacity in uOps and record the latency cliff location.

3.7.1 How to Run (Copy/Paste)

cc -O2 -Wall -o rob_finder src/rob_finder.c src/timing.c src/filler.S
taskset -c 2 ./rob_finder --max-uops 1024 --trials 5

3.7.2 Golden Path Demo (Deterministic)

  • Use --max-uops 1024 and a fixed random seed.
  • Expect a clear jump in cycles near the ROB capacity.

3.7.3 If CLI: Exact Terminal Transcript

$ taskset -c 2 ./rob_finder --max-uops 1024
uops,cycles
128,300
256,305
512,315
640,620

$ echo $?
0

Failure demo (bad args):

$ ./rob_finder --max-uops 0
Error: max-uops must be >= 64

$ echo $?
3

Exit codes:

  • 0 success
  • 2 timing init error
  • 3 invalid argument

4. Solution Architecture

4.1 High-Level Design

+---------------+   +------------------+   +-----------------+
| Load Generator|-> | Filler Generator |-> | Timing + Report |
+---------------+   +------------------+   +-----------------+

4.2 Key Components

Component | Responsibility | Key Decisions
Load Generator | Force a cache miss | random stride
Filler Generator | Emit independent ops | many registers
Analyzer | Detect the cliff | slope threshold

4.3 Data Structures (No Full Code)

struct Result { int uops; double cycles; };

4.4 Algorithm Overview

Key Algorithm: Cliff Detection

  1. Sweep filler size from small to large.
  2. Measure median cycles per size.
  3. Detect first large slope jump.
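The three steps can be sketched as a slope-jump scan over the swept results (`Result` matches the shape in §4.3). The 1.5x ratio is an assumed threshold to tune against your own noise floor.

```c
typedef struct { int uops; double cycles; } Result;

/* Scan (uops, cycles) samples sorted by uops and report the last size
   before the first jump of more than `ratio`, or -1 if none is found. */
int find_cliff(const Result *r, int n, double ratio)
{
    for (int i = 1; i < n; i++)
        if (r[i].cycles > r[i - 1].cycles * ratio)
            return r[i - 1].uops;   /* last flat size ~ ROB estimate */
    return -1;
}
```

On the §3.4 sample data, the jump from 315 to 620 cycles exceeds 1.5x, so the estimate is the 512-uOp point.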

Complexity Analysis:

  • Time: O(S * T), where S is the number of filler sizes and T the trials per size
  • Space: O(S)

5. Implementation Guide

5.1 Development Environment Setup

cc --version

5.2 Project Structure

rob-finder/
├── src/
│   ├── rob_finder.c
│   ├── timing.c
│   └── filler.S
└── README.md

5.3 The Core Question You’re Answering

“How many uOps can my CPU keep in flight while waiting on memory?”

Your cliff reveals the answer.

5.4 Concepts You Must Understand First

  1. ROB and in-order retirement
  2. Register renaming
  3. Memory latency and cache misses

5.5 Questions to Guide Your Design

  1. How will you force a long-latency load reliably?
  2. How will you ensure the filler is independent?
  3. How will you avoid front-end bottlenecks?

5.6 Thinking Exercise

If you double the uOps in the filler loop, what happens to the cliff position? Explain why.

5.7 The Interview Questions They’ll Ask

  1. Why does a full ROB stall the frontend?
  2. How does register renaming affect the instruction window?
  3. What limits the amount of latency you can hide?

5.8 Hints in Layers

Hint 1: Use a large array with random strides for cache misses.

Hint 2: Unroll and use many registers to keep instructions independent.

Hint 3: Plot cycles vs uOps to see the cliff.

5.9 Books That Will Help

Topic | Book | Chapter
Out-of-order execution | “Computer Architecture: A Quantitative Approach” | Ch. 3
Cache latency | “Computer Systems: A Programmer’s Perspective” | Ch. 6

5.10 Implementation Phases

Phase 1: Foundation (2-3 days)

  • Implement cache-miss generator and timing harness.
  • Checkpoint: stable high latency for load alone.

Phase 2: Core Functionality (4-6 days)

  • Add filler generator and sweep sizes.
  • Checkpoint: observe slope change.

Phase 3: Analysis (2-3 days)

  • Detect cliff and estimate ROB size.
  • Checkpoint: report includes inferred capacity.

5.11 Key Implementation Decisions

Decision | Options | Recommendation | Rationale
Load pattern | sequential vs random | random | ensures cache misses
Cliff detection | manual vs slope threshold | slope threshold | deterministic

6. Testing Strategy

6.1 Test Categories

Category | Purpose | Examples
Unit Tests | Filler independence | no dependency chain
Integration Tests | End-to-end sweep | uops 128-1024
Edge Tests | max-uops too low | error path

6.2 Critical Test Cases

  1. Filler with dependencies should reduce window.
  2. Random stride should increase latency.
  3. Cliff detection should be stable across trials.

6.3 Test Data

uops: 128, 256, 512, 640

7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

Pitfall | Symptom | Solution
Dependencies in filler | no cliff | use more registers
Prefetcher hits | low latency | randomize stride
Scheduler limit | early cliff | reduce port pressure

7.2 Debugging Strategies

  • Inspect assembly to confirm independent ops.
  • Use perf to check cache miss rate.

7.3 Performance Traps

  • Over-unrolling increases code size and may trigger front-end limits.

8. Extensions & Challenges

8.1 Beginner Extensions

  • Add a visualization of the cliff.

8.2 Intermediate Extensions

  • Compare different cores or CPU generations.

8.3 Advanced Extensions

  • Separate ROB vs scheduler limits using PMU counters.

9. Real-World Connections

9.1 Industry Applications

  • Memory latency hiding in compilers and runtimes
  • CPU microarchitecture characterization

9.2 Similar Tools

  • lmbench: memory latency benchmarks
  • uops.info: microbenchmark data

9.3 Interview Relevance

  • ROB and OoO behavior are deep performance topics.

10. Resources

10.1 Essential Reading

  • “Computer Architecture: A Quantitative Approach”
  • “Inside the Machine” by Jon Stokes

10.2 Video Resources

  • “Out-of-Order Execution” lecture

10.3 Tools & Documentation

  • perf: cache miss counters
  • rdtsc: timing

11. Self-Assessment Checklist

11.1 Understanding

  • I can explain what the ROB does.
  • I can describe how renaming affects the window.
  • I can identify the cliff in the data.

11.2 Implementation

  • Benchmark produces a stable cliff.
  • Results are reproducible.
  • Report includes CPU model and settings.

11.3 Growth

  • I can explain latency hiding in an interview.

12. Submission / Completion Criteria

Minimum Viable Completion:

  • Produce a uOps vs cycles curve with a clear cliff.
  • Estimate ROB size from that curve.

Full Completion:

  • Compare results with different unroll factors.

Excellence (Going Above & Beyond):

  • Validate ROB size with PMU counters or vendor data.