Project 2: The Branch Predictor Torture Test

Build a branch microbenchmark suite that reveals predictor history limits and misprediction penalties.

Quick Reference

Attribute                          Value
Difficulty                         Level 2: Intermediate
Time Estimate                      1-2 weeks
Main Programming Language          C++
Alternative Programming Languages  C, Rust
Coolness Level                     Level 4: Hardcore Tech Flex
Business Potential                 1. The “Resume Gold”
Prerequisites                      C/C++ basics, control flow, basic pipeline ideas, CLI tooling
Key Topics                         branch prediction, history length, TAGE intuition, microbenchmarking, RDTSC

1. Learning Objectives

By completing this project, you will:

  1. Explain why branch predictors use history and hysteresis.
  2. Design branch patterns that selectively defeat prediction mechanisms.
  3. Measure misprediction penalties in cycles with tight error bars.
  4. Infer an approximate predictor history depth from timing cliffs.
  5. Produce a repeatable benchmark report for a real CPU.

2. All Theory Needed (Per-Concept Breakdown)

2.1 Branch Prediction: History, Counters, and TAGE Intuition

Fundamentals

A branch predictor guesses whether a branch will be taken or not taken before the CPU knows the real outcome. Its goal is to keep the pipeline full by speculatively fetching the most likely path. The simplest predictors use small saturating counters keyed by the branch address. But modern predictors also consider recent branch outcomes, stored as history. This history allows the predictor to learn patterns like “taken every 3rd time” or “depends on the previous branch.” A predictor is therefore both a table of guesses and a memory of the past. Understanding it means understanding how history and indexing work together.
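The saturating-counter idea is easy to see in simulation. The sketch below (illustrative, not any real CPU's implementation) shows why a single anomalous outcome in a taken-dominated stream costs one miss rather than two:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// A 2-bit saturating counter: states 0-1 predict not-taken, 2-3 predict
// taken. Two consecutive wrong outcomes are needed to flip the prediction.
struct TwoBitCounter {
    uint8_t state = 2;  // start weakly taken
    bool predict() const { return state >= 2; }
    void update(bool taken) {
        if (taken && state < 3) ++state;
        if (!taken && state > 0) --state;
    }
};

// Count mispredictions of a single counter over an outcome sequence.
int mispredicts(const std::vector<bool>& outcomes) {
    TwoBitCounter c;
    int misses = 0;
    for (bool t : outcomes) {
        if (c.predict() != t) ++misses;
        c.update(t);
    }
    return misses;
}
```

On T,T,T,N,T,T the counter misses only the lone N; a 1-bit predictor would also miss the T that follows it.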

Deep Dive into the concept

Predictors evolved because pipelines got deeper and penalties for mistakes grew. A static predictor (always taken, always not taken) is cheap but weak. A 2-bit counter improves accuracy by adding hysteresis: it takes two wrong outcomes to flip the prediction, which filters noise. Each branch address indexes a counter, so loops with consistent behavior become highly predictable. But real programs have correlations across branches. Global history captures recent branch outcomes as a bitstring. This history is folded into the index of predictor tables so that the same branch can map to different entries depending on the prior path. This allows the predictor to learn context-dependent behavior.
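Global-history indexing can be sketched with a toy gshare-style model (table size and the XOR hash are illustrative assumptions, not any real design). Because the history changes the index, a strictly alternating branch trains two different counters and becomes perfectly predictable, even though a single per-PC counter would keep flipping:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy gshare-style predictor: XOR the branch PC with a global history
// register (GHR) to index a table of 2-bit counters.
struct Gshare {
    static constexpr int kBits = 12;                    // 4096 entries
    std::vector<uint8_t> table = std::vector<uint8_t>(1u << kBits, 2);
    uint32_t ghr = 0;                                   // recent outcomes, LSB newest

    size_t index(uint64_t pc) const {
        return (pc ^ ghr) & ((1u << kBits) - 1);
    }
    bool predict(uint64_t pc) const { return table[index(pc)] >= 2; }
    void update(uint64_t pc, bool taken) {
        uint8_t& c = table[index(pc)];
        if (taken && c < 3) ++c;
        if (!taken && c > 0) --c;
        ghr = ((ghr << 1) | (taken ? 1 : 0)) & ((1u << kBits) - 1);
    }
};
```

After a short warm-up the alternating history settles into two stable index values, one per phase of the pattern, and each entry learns its own outcome.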

TAGE (Tagged Geometric History Length) predictors extend this idea by using multiple tables, each indexed by a different history length. Short history tables learn simple patterns quickly; long history tables capture complex correlations. A tag stored with each entry reduces aliasing by checking whether the entry likely matches the current branch and history. The predictor chooses the “best” matching table based on tags and confidence, falling back to shorter tables if the long-history table does not match. The geometric lengths (e.g., 4, 8, 16, 32, 64, 128, … outcomes) let it cover a wide range of patterns without a single massive table.

Your benchmark exploits these mechanisms by constructing synthetic patterns. A periodic pattern of length 1024, for example, exceeds the history length of a small predictor but may still be learned by a large TAGE table. When you see a sudden cliff in cycles per iteration as you increase period length, you are likely exceeding the predictor’s effective history or table capacity. Nested branches and mixed patterns can also expose aliasing: two branches that map to the same table entry will interfere, causing accuracy drops even if each branch is individually predictable. This is why you pin your test to a single branch instruction and control the exact pattern.
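The cliff experiment can be driven by a generator like the following sketch: one period of P random-but-fixed outcomes, repeated forever, with a fixed seed so every trial replays the identical sequence (std::mt19937 is an implementation choice, not mandated by the project):

```cpp
#include <cstdint>
#include <random>
#include <vector>

// One period of P pseudo-random outcomes; looping over it yields a periodic
// pattern. Sweep P (e.g., 64, 128, ..., 16384) and time each run: the
// cycles/iter cliff marks the predictor's effective history or capacity.
std::vector<uint8_t> periodic_pattern(std::size_t period, std::uint32_t seed) {
    std::mt19937 rng(seed);            // fixed seed => reproducible trials
    std::vector<uint8_t> p(period);
    for (auto& b : p) b = rng() & 1;
    return p;
}
```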

Branch predictors include several structures: a Branch Target Buffer (BTB) to predict the target address of taken branches, a pattern history table (PHT) of counters, and a global history register (GHR). When a branch is fetched, the BTB predicts the target if taken; the PHT predicts taken/not-taken; the global history updates after resolution. Misprediction triggers a pipeline flush: speculative instructions are discarded, and the correct path is fetched. The penalty depends on pipeline depth and frontend latency, usually 10-20 cycles on modern cores, sometimes more. Your measurement captures this penalty by comparing predictable patterns (low penalty) to random patterns (high penalty).

A key modeling idea is that the predictor is a cache. It has limited entries, suffers from aliasing, and has a finite training time. A pattern with a huge period might never stabilize because the table cannot store all states. Your torture test is really a cache thrash test for the predictor. Understanding that gives you a way to interpret the results: cliffs indicate capacity or history length limits; noisy regions indicate aliasing and partial training.

How this fits in the project

You will use this concept to design your pattern generator in §3.2 and to interpret measurement cliffs in §3.7. It informs your analysis section and the report in §5.10 Phase 3.

Definitions & key terms

  • branch predictor -> hardware that guesses branch direction before resolution
  • saturating counter -> small state machine that changes prediction with hysteresis
  • global history -> bitstring of recent branch outcomes used for indexing
  • TAGE -> predictor that uses multiple tables with different history lengths and tags
  • misprediction penalty -> cycles lost when speculation is wrong

Mental model diagram (ASCII)

Branch PC ---> [BTB] ---> target
          \-> [PHT + GHR] ---> taken/not-taken

How it works (step-by-step, with invariants and failure modes)

  1. Fetch a branch at PC.
  2. Index predictor tables using PC and history bits.
  3. Predict direction and target; fetch speculatively.
  4. Resolve branch in EX; update predictor state.
  5. If mispredicted, flush and restart at correct target.

Invariants:

  • Predictor state updates only after resolution.
  • Speculative path must be flushed on mispredict.

Failure modes:

  • Aliasing between branches corrupts predictor entries.
  • Insufficient history length breaks long patterns.

Minimal concrete example

// Simple periodic pattern: taken 3 times, not taken once
if ((i % 4) != 3) { sum += i; }

Common misconceptions

  • “Predictors are perfect for loops” -> Long or irregular loops can still break them.
  • “More history always helps” -> Too much history can increase aliasing and noise.

Check-your-understanding questions

  1. Why does a 2-bit counter reduce noise compared to a 1-bit predictor?
  2. What does a TAGE predictor gain by using multiple history lengths?
  3. Why does aliasing cause accuracy loss even for predictable branches?

Check-your-understanding answers

  1. It requires two opposite outcomes to flip prediction, smoothing single anomalies.
  2. It can learn both short and long patterns without a single huge table.
  3. Different branches share the same table entry and overwrite each other’s state.

Real-world applications

  • CPU front-end design and verification
  • Compiler optimizations that reduce unpredictable branches
  • High-frequency trading and low-latency systems where mispredict cost dominates

References

  • “Computer Architecture: A Quantitative Approach” by Hennessy and Patterson, Ch. 3
  • “The TAGE Branch Predictor” by Seznec and Michaud

Key insights

  • A branch predictor behaves like a cache over history, so capacity and aliasing shape accuracy.

Summary

Predictors trade memory for accuracy. History length and table size determine which patterns are learnable, and your benchmark exposes those limits.

Homework/Exercises to practice the concept

  1. Design a pattern of length 32 and predict whether it will be learned quickly.
  2. Explain how two different branches could interfere if they map to the same table.

Solutions to the homework/exercises

  1. A 32-length pattern should be learned by mid-size history tables; it stabilizes quickly.
  2. Shared entries overwrite counters, so one branch’s training corrupts the other.

2.2 Cycle-Accurate Microbenchmarking with RDTSC

Fundamentals

Microbenchmarking measures extremely small performance differences, often a handful of cycles. The RDTSC instruction (or RDTSCP) reads a high-resolution timestamp counter. It is the tool that makes these measurements possible, but only if you control noise. Noise comes from OS scheduling, frequency scaling, interrupts, and out-of-order effects. The core idea is to measure a tight loop many times, isolate the code under test, and take statistics (median, percentile) instead of a single measurement. You are building an experiment, not just timing a function.

Before adding microarchitectural detail, keep the simplest mental model in view: the timestamp counter is a free-running clock, your unit of measurement is cycles per iteration, the loop under test changes state, and the timing bracket observes it. The non-negotiable constraints are a pinned core, a stable frequency, and enough iterations to amortize measurement overhead.

Deep Dive into the concept

The timestamp counter increments at a constant rate on most modern x86 systems, but it can still be affected by core migration or power states. RDTSCP is preferred because it waits for prior instructions to complete and, on most implementations, also returns a core ID. You can bracket your code with lfence; rdtsc at the start and rdtscp; lfence at the end to reduce reordering. In tight loops, you should unroll and perform many iterations so the overhead of reading the counter is amortized. Another key technique is to subtract the loop overhead by running a “baseline” loop without the branch under test. The difference between the baseline and test loops estimates the branch cost.
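The bracketing recipe can be written with compiler intrinsics; this sketch assumes GCC or Clang on x86-64, where the intrinsics live in <x86intrin.h>:

```cpp
#include <cstdint>
#include <x86intrin.h>  // __rdtsc, __rdtscp, _mm_lfence (GCC/Clang, x86 only)

// Serialized timestamp bracket: lfence;rdtsc before the region and
// rdtscp;lfence after it, so surrounding instructions cannot drift in.
static inline uint64_t tsc_start() {
    _mm_lfence();                 // wait for prior instructions to finish
    return __rdtsc();
}

static inline uint64_t tsc_stop() {
    unsigned aux;                 // receives IA32_TSC_AUX (often a core ID)
    uint64_t t = __rdtscp(&aux);  // waits for the timed region to complete
    _mm_lfence();                 // keep later instructions out of the region
    return t;
}
```

Usage is simply t0 = tsc_start(); ...loop...; t1 = tsc_stop(); and dividing the delta by the iteration count.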

Branch microbenchmarks also require defeating compiler optimizations. A compiler might remove a branch if it can prove the condition. You must prevent this by using volatile variables, data-dependent conditions, or inline assembly that forces the branch. You also need to ensure the branch actually executes with the desired pattern. For a pattern generator, you can precompute an array of outcomes and index into it so the branch sees the intended sequence. The array should fit in L1 to avoid memory effects, or you should measure and control for memory overhead separately.

Noise control is critical. Pin the process to a single core (taskset or sched_setaffinity) and disable turbo or use a fixed frequency if possible. Warm up the predictor and cache before measurement. Use a large number of iterations (millions) and compute the median cycles per iteration. Report confidence by showing min/median/max or standard deviation. If you see bimodal distributions, you likely have OS noise or power state shifts.
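On Linux, pinning and robust aggregation can be sketched as follows (sched_setaffinity is Linux-specific; treat both helpers as illustrative):

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <algorithm>
#include <sched.h>   // sched_setaffinity, cpu_set_t (Linux)
#include <vector>

// Pin the calling thread to one core; returns false if the kernel refuses.
bool pin_to_core(int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return sched_setaffinity(0, sizeof(set), &set) == 0;
}

// Median of repeated trials: robust to interrupt and turbo spikes,
// unlike the mean, which a single outlier can drag upward.
double median(std::vector<double> v) {
    std::sort(v.begin(), v.end());
    size_t n = v.size();
    return n % 2 ? v[n / 2] : 0.5 * (v[n / 2 - 1] + v[n / 2]);
}
```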

Because branch prediction is path-sensitive, you must carefully separate training and measurement. For example, run a warmup phase to train the predictor, then measure a steady-state loop for a fixed number of iterations. Repeat the entire experiment multiple times and aggregate. The benchmark should also include a “random” pattern (e.g., 50/50) to estimate misprediction penalty, because a random branch is close to worst-case predictability. By comparing the predictable and random cases, you can approximate the cost of mispredicts without needing hardware counters.
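The predictable-versus-random comparison reduces to a small formula: if the random branch mispredicts at rate r (about 0.5 for a 50/50 pattern), the penalty is roughly the cycle difference divided by r. A sketch:

```cpp
// Estimate the mispredict penalty without hardware counters:
// penalty ~= (cycles_random - cycles_predictable) / miss_rate,
// where miss_rate is the assumed misprediction rate of the random case.
double estimate_penalty(double cyc_random, double cyc_predictable,
                        double miss_rate = 0.5) {
    return (cyc_random - cyc_predictable) / miss_rate;
}
```

With perf available, you can replace the assumed miss rate by the measured branch-misses count and cross-check the estimate.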

A microbenchmark is only as credible as its protocol. Document every parameter: iterations, warmup count, pinning method, compiler flags, and CPU model. This is the difference between a toy and a real performance experiment.

In real designs, timing behavior is rarely isolated; it interacts with pipeline depth, power management, compiler decisions, and even microcode updates. When you study it, vary one knob at a time and hold everything else constant: pin the core, fix the frequency if possible, warm up caches and predictors, and record the exact compiler flags. Vendor manuals describe typical behavior, but actual thresholds can shift across steppings or microcode revisions, so empirical measurement is the ground truth. If your results disagree with published numbers, investigate confounders such as alignment, instruction form, address mapping, or hidden dependencies introduced by the compiler. Compilers and JITs also shape what you measure through instruction selection, scheduling, and unrolling, so translate measurements into actionable rules of thumb rather than raw numbers. Finally, validate with at least two workloads: a synthetic microbenchmark and a slightly more realistic kernel. If both show the same trend, you can trust that the effect is not an artifact of the test harness.

How this fits in the project

You will use these techniques to structure your benchmarking harness in §3.7 and to ensure deterministic results in §5.1 and §5.10 Phase 3.

Definitions & key terms

  • RDTSC/RDTSCP -> instructions that read the timestamp counter
  • serialization -> preventing reordering around timing instructions
  • baseline loop -> control loop used to subtract overhead
  • pinning -> forcing execution on a single CPU core

Mental model diagram (ASCII)

[Warmup] -> [Measure Loop] -> [Aggregate Stats]
   |             |                |
  train        time             median

How it works (step-by-step, with invariants and failure modes)

  1. Pin the process to a core and warm up caches and predictor.
  2. Run baseline loop and record cycles.
  3. Run test loop with branch pattern and record cycles.
  4. Subtract baseline and divide by iterations.
  5. Repeat and aggregate statistics.

Invariants:

  • Use the same iteration count for baseline and test.
  • Do not allow core migration.

Failure modes:

  • Compiler removes the branch or hoists it out of the loop.
  • Frequency scaling changes between runs.

Minimal concrete example

unsigned aux;
uint64_t t0 = __rdtscp(&aux);   // serialized TSC read (GCC/Clang intrinsic)
for (int i = 0; i < N; i++) {
  if (pattern[i & mask]) sum++; // branch under test
}
uint64_t t1 = __rdtscp(&aux);
double cycles_per_iter = (t1 - t0) / (double)N;

Common misconceptions

  • “One timing run is enough” -> you need multiple runs and statistics.
  • “RDTSC is always stable” -> core migration and power states break it.

Check-your-understanding questions

  1. Why do you subtract a baseline loop?
  2. Why is pinning to a single core important?
  3. How does compiler optimization threaten measurement validity?

Check-your-understanding answers

  1. To remove loop overhead and isolate the branch cost.
  2. Different cores may have different TSC offsets or frequencies.
  3. The compiler may remove or transform the branch, changing the experiment.

Real-world applications

  • Benchmarking libraries and micro-optimizations
  • CPU vendor performance characterization
  • Academic microarchitecture research

References

  • “Agner Fog’s Optimization Manuals”
  • “Computer Systems: A Programmer’s Perspective” Ch. 5

Key insights

  • Microbenchmarking is experimental science: control, repeatability, and statistics matter.

Summary

RDTSC gives you the resolution to see misprediction costs, but only disciplined methodology turns it into truth.

Homework/Exercises to practice the concept

  1. Measure a tight empty loop and compute overhead per iteration.
  2. Compare median vs mean timing for a noisy test and explain the difference.

Solutions to the homework/exercises

  1. The overhead is the baseline cycles per iteration; subtract it from your test.
  2. The median is robust to spikes; the mean is distorted by outliers.

2.3 Branch Pattern Synthesis and Statistical Validation

Fundamentals

A branch predictor is a pattern recognizer, so a serious test suite must control the patterns it sees. Random branches stress the worst case, but they are not the only case. Real programs have loops, periodic behaviors, and correlated branches. Your torture test needs deterministic generators that can emit specific histories: alternating, periodic with long periods, biased-but-noisy, and correlated streams. You also need a measurement plan that separates warm-up from steady state and produces reproducible accuracy metrics. If you measure a predictor without controlling input sequences, you will learn more about your workload than the predictor. A good suite treats branch outcomes as test vectors and treats misprediction rate as a statistical outcome with confidence bounds.

Deep Dive into the concept

The core challenge is that predictor state is sticky. Direction predictors keep counters; BTBs cache targets; global history registers shift in recent outcomes. This means the first few thousand branches are a warm-up period that does not reflect steady-state accuracy. Your generator should therefore produce three phases: a cold-start warm-up, a steady-state region you measure, and a cool-down region you ignore. For example, for a periodic pattern of length 1024, you should run at least 10 full periods before you measure, otherwise you may be sampling the predictor while it is still adapting.

Pattern synthesis should cover at least four families. First, simple alternation (T, N, T, N) tests short-history and 2-bit saturating counters. Second, long-period patterns (e.g., a 17-bit LFSR) test whether a predictor can capture long correlations or whether it aliases. Third, biased-but-noisy patterns (e.g., 90 percent taken with random noise) reveal how quickly a predictor recovers from occasional deviations. Fourth, correlated patterns across multiple branch sites test global history usage: you can have branch A’s outcome equal branch B’s outcome delayed by k iterations. To implement this, you can generate a vector of outcomes and then drive two separate branches with shifted windows of the same vector.
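The long-period family can be generated with the 17-bit LFSR mentioned above; this sketch uses a Galois form with taps 17 and 14 (a standard maximal-length polynomial, giving period 2^17 - 1 = 131071):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Galois LFSR over 17 bits, polynomial x^17 + x^14 + 1 (maximal length).
// Emits one outcome bit per step; the sequence repeats every 131071 steps.
std::vector<uint8_t> lfsr_pattern(std::size_t n, std::uint32_t seed = 1) {
    std::vector<uint8_t> out;
    out.reserve(n);
    uint32_t s = seed & 0x1FFFF;        // keep state in 17 bits, nonzero
    for (std::size_t i = 0; i < n; ++i) {
        uint8_t bit = s & 1;
        out.push_back(bit);
        s >>= 1;
        if (bit) s ^= 0x12000;          // feedback into bits 16 and 13
    }
    return out;
}
```

A useful sanity check: over one full period exactly 2^16 = 65536 of the outcomes are taken, since every nonzero 17-bit state occurs exactly once.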

Statistics matter because the misprediction rate can fluctuate by several percent due to noise and short runs. Use long runs and compute the misprediction rate as mispredictions / total branches. Also compute MPKI (mispredictions per thousand instructions) to compare across different instruction mixes. If possible, record confidence intervals by running each pattern multiple times and using the median or 95th percentile. Medians are often more stable on systems with turbo, OS noise, and background interrupts.

Microbenchmark design must separate branch costs from loop overhead. A typical pattern is to put the branch inside a tight loop with a fixed amount of non-branch work per iteration (often a few independent ALU ops) and then subtract the baseline cost of the same loop without branches. This lets you estimate the marginal misprediction penalty. You should also control for instruction cache and uOp cache effects by keeping the loop small and aligned. If your loop spills into decode or gets evicted, you may attribute front-end stalls to the predictor.

Finally, be aware of aliasing. Predictors use limited tables indexed by branch PC and history. If two different branches map to the same entry, they can pollute each other. In a torture test, this is a feature, not a bug. You can intentionally create aliasing by placing branch sites at addresses that share the same lower bits, then observe accuracy degradation. This tells you about predictor table sizes and indexing strategies. In short, a branch test suite is not just about random noise; it is about controlled, theory-driven stress patterns and statistically meaningful measurement.

How this fits in the project

You will use this in §3.5 to define input pattern formats, in §5.10 Phase 2 to build pattern generators, and in §6.2 to design repeatable predictor tests.

Definitions & key terms

  • warm-up -> initial phase where predictor state is trained but not measured
  • MPKI -> mispredictions per thousand instructions
  • aliasing -> multiple branches mapping to the same predictor entry
  • pattern length -> the number of outcomes before a sequence repeats
  • correlation -> dependency between outcomes of different branches

Mental model diagram (ASCII)

Pattern Generator --> Branch A --> Predictor
                    \-> Branch B (shifted)
Measure: accuracy, MPKI, penalty cycles

How it works (step-by-step, with invariants and failure modes)

  1. Generate an outcome vector for each pattern family.
  2. Run a warm-up phase that primes predictor state.
  3. Execute the measured phase and record mispredictions.
  4. Compute accuracy, MPKI, and penalty per miss.
  5. Repeat and take median results.

Invariants:

  • Warm-up outcomes are not counted in metrics.
  • Measurement length is long enough to capture steady state.

Failure modes:

  • Short runs lead to noisy or misleading accuracy.
  • Loop overhead dominates and hides predictor effects.
  • OS noise changes timing and inflates penalties.

Minimal concrete example

// Alternating pattern
for (int i = 0; i < N; i++) {
    if (i & 1) taken(); else not_taken();
}

Common misconceptions

  • “Random patterns are the only real test” -> Real code has structure and correlation.
  • “Accuracy alone is enough” -> MPKI and penalty cycles tell different stories.
  • “Warm-up does not matter” -> Predictor state must be trained.

Check-your-understanding questions

  1. Why does a long-period pattern reveal different behavior than a simple alternating pattern?
  2. What is the difference between accuracy and MPKI?
  3. How does aliasing affect predictor accuracy?

Check-your-understanding answers

  1. Long periods require longer history; short predictors will alias and mispredict.
  2. Accuracy is percent correct; MPKI normalizes misses per thousand instructions.
  3. Aliasing causes unrelated branches to overwrite each other’s predictor entries.

Real-world applications

  • Designing branch prediction microbenchmarks for CPU evaluation
  • Understanding unpredictable branches in workloads like parsers and interpreters
  • Tuning branch-heavy code in performance-critical systems

References

  • “Computer Architecture: A Quantitative Approach” by Hennessy and Patterson, Ch. 3
  • “The Microarchitecture of Branch Predictors” (Seznec and Michaud) survey papers

Key insights

  • A good predictor test is a controlled experiment, not just a random loop.

Summary

Branch prediction is about learned history, so your tests must be designed around history. Controlled patterns, proper warm-up, and statistically stable metrics are what turn a toy branch loop into a real predictor study.

Homework/Exercises to practice the concept

  1. Generate a 257-length pattern and predict how a 2-bit saturating counter would perform on it.
  2. Build two correlated branches and measure how accuracy changes when they alias.

Solutions to the homework/exercises

  1. The 2-bit counter will fail to track the long period and show a higher miss rate.
  2. Aliasing increases miss rate; separating the branch addresses reduces it.

3. Project Specification

3.1 What You Will Build

A suite of branch prediction microbenchmarks that generate controlled branch outcome patterns and measure cycles per iteration. The tool reports estimated misprediction penalty and the approximate history depth where prediction quality collapses. It outputs a machine-readable report (CSV/JSON) and a human-readable summary.

3.2 Functional Requirements

  1. Pattern Generator: Create patterns (periodic, nested, random, correlated) with configurable length.
  2. Benchmark Harness: Time each pattern using RDTSC with warmup and multiple trials.
  3. Compiler Defense: Ensure branches are not optimized away.
  4. Report Generator: Summarize cycles/iter and inferred predictor limits.

3.3 Non-Functional Requirements

  • Performance: Each pattern should complete in under 1 second for N=1e7 iterations.
  • Reliability: Stable results across 5 runs on the same core.
  • Usability: CLI flags to select patterns and output formats.

3.4 Example Usage / Output

$ ./branch_torture --patterns periodic,random --period 16384 --trials 5

Pattern: period=16384   cycles/iter=12.9
Pattern: random 50/50   cycles/iter=18.3
Estimated mispredict penalty: 16-20 cycles
Estimated history depth: ~8K outcomes

3.5 Data Formats / Schemas / Protocols

CSV output:

pattern,period,cycles_per_iter,stdev
periodic,16384,12.9,0.3
random,0,18.3,0.5

3.6 Edge Cases

  • Very small periods (1, 2, 3) that are trivially predictable
  • Extremely large periods that exceed array size
  • Random patterns with low entropy (biased distributions)

3.7 Real World Outcome

You will produce a predictor profile report that shows a clear cliff where prediction accuracy collapses.

3.7.1 How to Run (Copy/Paste)

c++ -O2 -Wall -o branch_torture src/main.cpp src/patterns.cpp src/timing.cpp
sudo taskset -c 2 ./branch_torture --patterns all --trials 5 --warmup 10000

3.7.2 Golden Path Demo (Deterministic)

  • Use --pattern periodic --period 1024 with fixed seed --seed 42.
  • Expect low cycles/iter and stable median across runs.

3.7.3 If CLI: Exact Terminal Transcript

$ taskset -c 2 ./branch_torture --pattern periodic --period 1024 --trials 3 --seed 42
pattern=periodic period=1024 median=1.12 cycles/iter stdev=0.02

$ echo $?
0

Failure demo (invalid pattern):

$ ./branch_torture --pattern zigzag
Error: unknown pattern 'zigzag'

$ echo $?
3

Exit codes:

  • 0 success
  • 2 timing initialization error
  • 3 invalid argument

4. Solution Architecture

4.1 High-Level Design

+----------------+    +------------------+    +-----------------+
| Pattern Builder| -> | Benchmark Harness| -> | Report Generator|
+----------------+    +------------------+    +-----------------+

4.2 Key Components

Component        Responsibility               Key Decisions
Pattern Builder  Create outcome arrays        Keep in L1 to avoid memory noise
Harness          Measure cycles accurately    Use RDTSCP + fences
Reporter         Compute stats and inference  Median over mean

4.3 Data Structures (No Full Code)

struct Pattern {
  std::vector<uint8_t> outcomes; // 0 or 1
  std::string name;
};

4.4 Algorithm Overview

Key Algorithm: Pattern Runner

  1. Warm up loop to train predictor.
  2. Time loop over outcome array for N iterations.
  3. Compute cycles/iter and aggregate stats.

Complexity Analysis:

  • Time: O(N) per pattern
  • Space: O(P) for pattern length

5. Implementation Guide

5.1 Development Environment Setup

c++ --version
# Optional: pin to core
sudo taskset -c 2 ./branch_torture ...

5.2 Project Structure

branch-torture/
├── src/
│   ├── main.cpp
│   ├── patterns.cpp
│   └── timing.cpp
├── tests/
│   └── test_patterns.cpp
└── README.md

5.3 The Core Question You’re Answering

“How much history can my CPU’s branch predictor actually learn?”

Your benchmark should reveal a clear inflection point where accuracy collapses.

5.4 Concepts You Must Understand First

  1. Predictor hysteresis
  2. Global history indexing
  3. Timing protocol with RDTSC

5.5 Questions to Guide Your Design

  1. How will you ensure the branch is not optimized away?
  2. How will you separate predictor training from measurement?
  3. How will you detect a cliff in cycles/iter?

5.6 Thinking Exercise

Design a pattern that is predictable for a 64-length history but fails at 256. How will you encode it?

5.7 The Interview Questions They’ll Ask

  1. Why is a 2-bit counter better than 1-bit?
  2. What is the difference between local and global history?
  3. How do you measure misprediction penalty without hardware counters?

5.8 Hints in Layers

Hint 1: Use a precomputed array of outcomes to force the sequence.

Hint 2: Use volatile or inline asm to prevent branch removal.

Hint 3: Report median cycles to reduce noise.

5.9 Books That Will Help

Topic              Book                     Chapter
Branch predictors  “Computer Architecture”  Ch. 3
Microbenchmarking  “Agner Fog’s Manuals”    Timing section

5.10 Implementation Phases

Phase 1: Foundation (2-3 days)

  • Implement pattern generator and validate sequences.
  • Checkpoint: pattern arrays match expected periodicity.

Phase 2: Core Functionality (4-6 days)

  • Build timing harness with warmup and trials.
  • Checkpoint: predictable patterns show low cycles/iter.

Phase 3: Polish & Analysis (2-3 days)

  • Add report generation and inference logic.
  • Checkpoint: report includes history depth estimate.

5.11 Key Implementation Decisions

Decision       Options                 Recommendation  Rationale
Timing method  clock_gettime vs RDTSC  RDTSC           Needed resolution
Stats          mean vs median          median          Robust to noise

6. Testing Strategy

6.1 Test Categories

Category           Purpose              Examples
Unit Tests         Pattern correctness  period=4 matches TTTN
Integration Tests  Timing harness       baseline vs test loop
Edge Tests         Large period         period=65536

6.2 Critical Test Cases

  1. Period-2 pattern should be near-perfect prediction.
  2. Random pattern should show high cycles/iter.
  3. Large period should show a cliff.

6.3 Test Data

pattern: T T T N (period 4)

7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

Pitfall                  Symptom              Solution
Compiler removes branch  cycles/iter too low  use volatile / inline asm
Core migration           noisy results        pin core
Turbo scaling            bimodal timings      fix frequency

7.2 Debugging Strategies

  • Log the first 16 outcomes of each pattern to confirm correctness.
  • Use perf to confirm branch instructions execute.

7.3 Performance Traps

  • Pattern array too large -> memory misses distort branch timing.

8. Extensions & Challenges

8.1 Beginner Extensions

  • Add a simple local-history predictor model.

8.2 Intermediate Extensions

  • Add nested branch patterns to explore correlation.

8.3 Advanced Extensions

  • Implement a small software TAGE and compare with hardware.

9. Real-World Connections

9.1 Industry Applications

  • Performance tuning in branch-heavy workloads
  • CPU vendor characterization suites

9.2 Related Tools

  • perf: can validate branch mispredict events
  • llvm-mca: provides modeled predictor cost assumptions

9.3 Interview Relevance

  • Branch prediction questions are common in systems interviews.

10. Resources

10.1 Essential Reading

  • “Computer Architecture” by Hennessy and Patterson - Ch. 3
  • “Agner Fog’s Optimization Manuals” - branch prediction chapter

10.2 Video Resources

  • “Modern Branch Predictors” - architecture lecture series

10.3 Tools & Documentation

  • perf: measure branch-misses
  • taskset: pin to core

11. Self-Assessment Checklist

11.1 Understanding

  • I can explain why history length matters.
  • I can design a pattern that defeats a small predictor.
  • I can estimate misprediction penalty from timing.

11.2 Implementation

  • All functional requirements are met.
  • Results are stable across trials.
  • Reports are reproducible.

11.3 Growth

  • I can explain my benchmark methodology to others.
  • I documented CPU model and configuration.

12. Submission / Completion Criteria

Minimum Viable Completion:

  • At least 3 patterns measured with stable results.
  • CLI reports cycles/iter and misprediction penalty.

Full Completion:

  • History-length cliff detected and documented.

Excellence (Going Above & Beyond):

  • Include a software predictor model and compare to hardware.