Project 5: Branch Predictor and Pipeline Lab

Project Overview

Attribute             Details
Difficulty            Advanced
Time Estimate         1-2 weeks
Primary Language      C
Alternative Languages C++, Rust, Zig
Knowledge Area        CPU Microarchitecture
Tools Required        perf, CPU topology tools, objdump
Primary Reference     "Computer Architecture: A Quantitative Approach" by Hennessy & Patterson

Learning Objectives

By completing this project, you will be able to:

  1. Explain how branch prediction works including common predictor algorithms
  2. Measure branch misprediction rates using hardware performance counters
  3. Design workloads that defeat branch prediction to understand worst cases
  4. Calculate the performance impact of misprediction on pipeline throughput
  5. Apply branchless programming techniques for unpredictable branches
  6. Interpret CPI and IPC metrics in the context of pipeline stalls

Deep Theoretical Foundation

The Pipelining Problem

Modern CPUs execute instructions in stages: fetch, decode, execute, memory, writeback. Pipelining allows starting a new instruction every cycle, achieving 1+ instructions/cycle throughput.

Classic 5-Stage Pipeline:
Time →  1    2    3    4    5    6    7    8
Inst 1: [F]  [D]  [E]  [M]  [W]
Inst 2:      [F]  [D]  [E]  [M]  [W]
Inst 3:           [F]  [D]  [E]  [M]  [W]
Inst 4:                [F]  [D]  [E]  [M]  [W]
Inst 5:                     [F]  [D]  [E]  [M]  [W]

After startup, one instruction completes every cycle

CPU Pipeline Stages

The Branch Problem: When the CPU fetches a branch instruction, it doesn't know which instruction comes next until the branch is evaluated (cycle 3 in the example above). But by then, it has already fetched two more instructions, which may be from the wrong path.

What happens at a branch:
Time →  1    2    3    4    5    6    7    8
Branch: [F]  [D]  [E]  → Branch resolved (taken)
Inst A:      [F]  [D]  [X]  ← WRONG PATH - must flush
Inst B:           [F]  [X]  ← WRONG PATH - must flush
Target:                [F]  [D]  [E]  [M]  [W]  ← Correct path starts
Next:                       [F]  [D]  [E]  [M]  [W]

[X] = Wasted work (pipeline flush)

Modern CPUs have 15-20 pipeline stages, meaning a misprediction wastes 15-20 cycles of work.

How Branch Prediction Works

Static Prediction (Simple)

  • Backward branches predicted taken (loops usually continue)
  • Forward branches predicted not taken (error paths usually not taken)
  • ~65-70% accuracy

Two-Level Adaptive Prediction (Modern)

CPUs maintain history of recent branches:

Branch History Table (BHT):
┌──────────────────────────────────────────────────────┐
│ Branch Address → History Pattern → Prediction        │
├──────────────────────────────────────────────────────┤
│ 0x401234       → TTTTTTTN         → Taken (strong)   │
│ 0x401256       → TNTNTNTN         → Not Taken        │
│ 0x401278       → NNNNNNNN         → Not Taken        │
└──────────────────────────────────────────────────────┘

T = Taken, N = Not Taken (last 8 outcomes)

The predictor learns patterns. A branch that alternates TNTN can be predicted with 100% accuracy once the pattern is learned.

Branch Target Buffer (BTB)

For taken branches, the CPU also needs to know where to jump. The BTB caches recent branch targets:

BTB:
┌────────────────────────────────────────────────┐
│ Branch Address → Target Address                │
├────────────────────────────────────────────────┤
│ 0x401234       → 0x401100 (loop start)         │
│ 0x401256       → 0x405000 (function call)      │
└────────────────────────────────────────────────┘

What Makes Branches Predictable

Highly Predictable (>99% accuracy):

// Loop control - always taken except last iteration
for (int i = 0; i < 1000; i++) { ... }
// 999 taken, 1 not taken = 99.9% predictable

// Data-dependent but sorted
for (int i = 0; i < n; i++) {
    if (sorted_array[i] < threshold) { sum += 1; }
}
// First K iterations taken, rest not taken
// Predictor learns the pattern

Unpredictable (~50% accuracy):

// Random data
for (int i = 0; i < n; i++) {
    if (random_array[i] & 1) { sum += 1; }
}
// Outcome depends on random bits - no pattern to learn

// Hash-dependent
for (size_t i = 0; i < nkeys; i++) {
    if (hash(keys[i]) % 2 == 0) { process(keys[i]); }
}
// Hash output is pseudorandom - effectively a coin flip per key

The Cost of Misprediction

Direct Cost: Pipeline flush and refill = 15-20 cycles

Indirect Cost: Lost instruction-level parallelism (ILP)

Consider:

  • CPU can execute 4+ instructions per cycle (superscalar)
  • With perfect prediction, IPC = 2-4
  • With 50% misprediction on frequent branches, IPC drops to <1

The Math:

Given:
- Branch every 5 instructions
- 50% misprediction rate
- 18-cycle misprediction penalty

Effective cycles per branch:
  (0.50 × 1 cycle) + (0.50 × 18 cycles) = 9.5 cycles

For every 5 instructions (4-wide issue):
  4 other instructions ≈ 1 cycle, plus 9.5 cycles for the branch
  ≈ 10.5 cycles per 5 instructions → CPI ≈ 2.1

Result: IPC ≈ 0.5 (vs theoretical 4.0)

Complete Project Specification

What You're Building

An experimental toolkit called branch_lab that:

  1. Generates workloads with controlled branch patterns (sorted, random, alternating)
  2. Measures branch misprediction rates and CPI using hardware counters
  3. Compares predictable vs unpredictable patterns quantitatively
  4. Demonstrates branchless alternatives and their performance characteristics
  5. Produces reports connecting branch behavior to throughput

Functional Requirements

branch_lab run --pattern <name> --size <n> --iters <n>
branch_lab compare --patterns <list> --size <n>
branch_lab branchless --pattern <name> --compare
branch_lab report --output <report.md>

Branch Patterns to Implement

  1. always_taken: Branch condition always true
  2. never_taken: Branch condition always false
  3. sorted: Threshold on sorted data (predictable transition)
  4. random_50: 50/50 random outcomes
  5. biased_90: 90% one way, 10% other
  6. alternating: Strict TNTNTN pattern
  7. period_4: TTNNTTNN repeating

Example Output

Branch Prediction Report
โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•
Array size: 100,000,000 elements
Iterations: 10
CPU: Intel Core i7-12700K

Pattern Comparison:
─────────────────────────────────────────────────────
Pattern        Time     CPI    IPC    Mispredict%
always_taken   124 ms   0.78   2.94   0.01%
sorted         127 ms   0.81   2.87   0.12%
biased_90      168 ms   1.02   2.12   8.7%
alternating    245 ms   1.34   1.54   0.8%
random_50      487 ms   2.94   0.89   48.3%

Performance Impact Analysis:
─────────────────────────────────────────────────────
• random_50 is 3.9x slower than always_taken
• 48.3% misprediction rate costs 2.16 extra cycles/instruction
• Sorting the data would provide 3.8x speedup

Branchless Comparison (random_50 pattern):
─────────────────────────────────────────────────────
Implementation    Time     Relative
branchy           487 ms   1.0x (baseline)
cmov              298 ms   1.6x faster
bitwise_mask      312 ms   1.5x faster

Branchless wins for unpredictable data!

Solution Architecture

Component Design

┌──────────────────────────────────────────────────────────────┐
│                    CLI Interface                             │
│   Pattern selection, size configuration                      │
└─────────────────────────────┬────────────────────────────────┘
                              │
           ┌──────────────────┼────────────────┐
           │                  │                │
           ▼                  ▼                ▼
    ┌─────────────┐  ┌─────────────┐  ┌─────────────┐
    │ Pattern     │  │ Counter     │  │ Branchless  │
    │ Generator   │  │ Collector   │  │ Variants    │
    │             │  │ (perf)      │  │             │
    └──────┬──────┘  └──────┬──────┘  └──────┬──────┘
           │                │                │
           └────────────────┼────────────────┘
                            │
                            ▼
            ┌───────────────────────────────┐
            │      Measurement Loop         │
            │  Run workload → Collect stats │
            └───────────────┬───────────────┘
                            │
                            ▼
            ┌───────────────────────────────┐
            │      Analysis & Report        │
            │  Calculate CPI, compare       │
            └───────────────────────────────┘

Key Data Structures

// Branch pattern generator
typedef void (*pattern_generator_t)(int *data, size_t count, int threshold);

typedef struct {
    const char *name;
    pattern_generator_t generator;
    float expected_mispredict_rate;  // For validation
    const char *description;
} pattern_t;

// Branch measurement result
typedef struct {
    double wall_time_ms;
    uint64_t instructions;
    uint64_t cycles;
    uint64_t branches;
    uint64_t branch_misses;
    double cpi;  // cycles per instruction
    double ipc;  // instructions per cycle
    double mispredict_rate;
} branch_result_t;

Workload Implementation

// The core branch-heavy workload.
// 'sum' is volatile so the compiler cannot optimize the loop away.
int64_t benchmark_branchy(int *data, size_t n, int threshold) {
    volatile int64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (data[i] < threshold) {  // THE BRANCH
            sum += data[i];
        }
    }
    return sum;
}

// Branchless equivalent using a bitwise mask
int64_t benchmark_branchless(int *data, size_t n, int threshold) {
    volatile int64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        int mask = -(data[i] < threshold);  // all-ones (-1) or 0
        sum += data[i] & mask;  // No branch!
    }
    return sum;
}

Pattern Generators

// Sorted data - highly predictable
void generate_sorted(int *data, size_t count, int threshold) {
    for (size_t i = 0; i < count; i++) {
        data[i] = i;  // Sorted ascending
    }
}

// Random data - unpredictable
void generate_random(int *data, size_t count, int threshold) {
    for (size_t i = 0; i < count; i++) {
        data[i] = rand();
    }
}

// Biased random (90% below threshold)
void generate_biased_90(int *data, size_t count, int threshold) {
    for (size_t i = 0; i < count; i++) {
        if (rand() % 10 < 9) {
            data[i] = rand() % threshold;  // Below threshold
        } else {
            data[i] = threshold + 1 + rand() % 1000;  // Above threshold (avoids int overflow)
        }
    }
}

// Alternating TNTN pattern
void generate_alternating(int *data, size_t count, int threshold) {
    for (size_t i = 0; i < count; i++) {
        data[i] = (i % 2) ? threshold + 1 : 0;
    }
}
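The period_4 pattern from the requirements list can follow the same template. This sketch assumes a positive threshold and encodes "taken" as `data[i] < threshold`, matching the workload's comparison:

```c
#include <stddef.h>

// period_4: TTNNTTNN... Taken when (i / 2) is even.
// Assumes threshold > 0 so that threshold - 1 stays below it.
void generate_period_4(int *data, size_t count, int threshold) {
    for (size_t i = 0; i < count; i++) {
        int taken = ((i / 2) % 2) == 0;               // TTNN repeating
        data[i] = taken ? threshold - 1 : threshold + 1;
    }
}
```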

Phased Implementation Guide

Phase 1: Basic Branch Benchmark (Days 1-3)

Goal: Measure branch misprediction for sorted vs random data.

Steps:

  1. Create array with configurable data pattern
  2. Implement the branchy workload with threshold comparison
  3. Use perf stat to measure branch-misses
  4. Compare sorted (predictable) vs random (unpredictable)
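For step 3, wrapping the binary with perf stat is enough to start. The event names below are perf's generic hardware aliases; the CLI flags are this project's own and should be adjusted to your implementation:

```shell
# Predictable pattern
perf stat -e instructions,cycles,branches,branch-misses \
    ./branch_lab run --pattern sorted --size 100000000 --iters 10

# Unpredictable pattern - expect branch-misses near 50% of branches
perf stat -e instructions,cycles,branches,branch-misses \
    ./branch_lab run --pattern random_50 --size 100000000 --iters 10
```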

Validation: Random should have ~50% misprediction, sorted ~0%.

Phase 2: Hardware Counter Integration (Days 4-6)

Goal: Use perf_event_open for detailed branch metrics.

Steps:

  1. Configure counters for:
    • PERF_COUNT_HW_BRANCH_INSTRUCTIONS
    • PERF_COUNT_HW_BRANCH_MISSES
    • PERF_COUNT_HW_CPU_CYCLES
    • PERF_COUNT_HW_INSTRUCTIONS
  2. Calculate CPI and IPC
  3. Verify counter accuracy against perf stat

Validation: CPI should increase with misprediction rate.

Phase 3: Branchless Alternatives (Days 7-9)

Goal: Implement and compare branchless versions.

Steps:

  1. Implement CMOV-based branchless version
  2. Implement bitwise mask version
  3. Verify correctness (same results)
  4. Compare performance for each pattern

Validation: Branchless should win for random data.

Phase 4: Pattern Library (Days 10-11)

Goal: Comprehensive pattern coverage.

Steps:

  1. Implement all listed patterns
  2. Add expected misprediction rate annotations
  3. Create comparison framework
  4. Generate pattern analysis report

Validation: Measured rates match expectations (ยฑ5%).

Phase 5: Assembly Verification (Days 12-14)

Goal: Verify compiler generates expected code.

Steps:

  1. Use objdump to inspect generated assembly
  2. Verify branchy version has conditional jumps
  3. Verify branchless version has CMOV
  4. Document optimization flags that change behavior

Validation: Assembly matches intent.


Testing Strategy

Micro-Benchmarks

  1. Always-taken: Should have ~0% misprediction
  2. Always-not-taken: Should have ~0% misprediction
  3. Alternating: Should have 0-2% (modern predictors learn patterns)
  4. Random 50/50: Should have 45-50% misprediction

Cross-Validation

  1. Compare with perf stat: Counters should match
  2. Vary array size: Results should be consistent
  3. Multiple runs: Low variance expected

Compiler Verification

  1. Check for auto-vectorization: May bypass branches entirely
  2. Verify no dead-code elimination: Results must be used
  3. Test multiple optimization levels: -O0, -O1, -O2, -O3

Common Pitfalls and Debugging

Pitfall 1: Compiler Optimizes Away the Branch

Symptom: Branchy and branchless have identical performance.

Cause: Compiler auto-vectorized or used CMOV automatically.

Solution:

# Compile with specific flags to preserve branches
gcc -O2 -fno-if-conversion -fno-tree-vectorize branch_lab.c

# Verify assembly has jcc (conditional jump)
objdump -d branch_lab | grep -E 'j(e|ne|l|le|g|ge)'

Pitfall 2: Data Not Actually Random

Symptom: "Random" pattern shows low misprediction.

Cause: Pseudo-random generator has predictable patterns.

Solution:

// Seed with multiple sources of randomness
#include <time.h>
#include <unistd.h>
srand((unsigned)(time(NULL) ^ getpid()));

// Or read unpredictable bytes from /dev/urandom
#include <fcntl.h>
int fd = open("/dev/urandom", O_RDONLY);
if (fd >= 0) {
    read(fd, data, count * sizeof(int));
    close(fd);
}

Pitfall 3: Cache Effects Masking Branch Effects

Symptom: Random access is slow but not due to branches.

Cause: Random data access pattern causes cache misses.

Solution: Ensure sequential memory access regardless of branch outcome:

// Good: Sequential access, only branch varies
for (i = 0; i < n; i++) {
    if (data[i] < threshold) sum += data[i];
}

// Bad: Random access through pointers
for (i = 0; i < n; i++) {
    if (ptrs[i]->value < threshold) sum += ptrs[i]->value;
}

Pitfall 4: Warmup Effects

Symptom: First run shows different misprediction rate.

Cause: Branch predictor tables cold on first run.

Solution: Always include warmup runs:

// Warmup - run workload but discard results
for (int warm = 0; warm < 5; warm++) {
    benchmark_branchy(data, n, threshold);
}

// Measured runs
for (int run = 0; run < 10; run++) {
    start_counters();
    benchmark_branchy(data, n, threshold);
    stop_counters();
}

Extensions and Challenges

Extension 1: Branch Predictor Capacity Test

Determine the size of the branch history table:

  1. Create program with N unique branches
  2. Increase N until misprediction rate jumps
  3. That's approximately the BTB capacity

Extension 2: Indirect Branch Study

Measure the impact of indirect branches (function pointers, virtual calls):

typedef int (*op_t)(int);
op_t operations[N];  // Array of function pointers

for (int i = 0; i < n; i++) {
    result = operations[i % N](data[i]);  // Indirect call
}

Vary N to test indirect branch predictor capacity.

Extension 3: Speculative Execution Effects

Explore how speculative execution affects branch prediction:

  • Transient execution paths
  • Speculation barriers (lfence)
  • Performance impact of mitigations (Spectre)

Challenge: Optimal Threshold Finding

Given unsorted data and a threshold-based filter:

  1. Is it faster to sort first, then filter?
  2. At what selectivity (% passing filter) does sorting help?
  3. Implement auto-tuning that chooses optimal strategy

Real-World Connections

Where Branch Prediction Matters

  1. Parsers: Tokenizers with many character-type checks
  2. Databases: Query predicate evaluation
  3. Compression: Huffman decoding with variable-length codes
  4. Networking: Packet parsing and protocol handling

Industry Techniques

  • Profile-guided optimization (PGO): Compiler uses runtime data to optimize branch layout
  • __builtin_expect: Hint to compiler about likely branch direction
  • Branch-free data structures: B-trees designed to minimize branches
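The __builtin_expect hint (GCC/Clang) is typically wrapped in likely/unlikely macros, as in the Linux kernel. Note that it influences code layout and static prediction, not the dynamic predictor's tables:

```c
#include <stdint.h>
#include <stddef.h>

/* GCC/Clang builtin: tell the compiler which way the branch usually goes
   so it places the hot path on the fall-through. */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

int64_t sum_below(const int *data, size_t n, int threshold) {
    int64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (likely(data[i] < threshold))   /* hint: usually true */
            sum += data[i];
    }
    return sum;
}
```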

Famous Examples

  • Sorting networks: Branchless sorting for small arrays
  • Binary search on sorted array: Classic example of unpredictable branches
  • std::partition: STL implementation uses branchless comparisons

Self-Assessment Checklist

Before considering this project complete, verify:

  • You can explain why branch misprediction is expensive
  • You measured ~50% misprediction for random data
  • You demonstrated branchless is faster for unpredictable patterns
  • You understand the CPI impact of misprediction
  • You verified assembly code matches your intent
  • You can explain when branchless is and isnโ€™t appropriate
  • Your measurements are consistent across runs

Resources

Essential Reading

  • "Computer Architecture: A Quantitative Approach" by Hennessy & Patterson, Chapter 3
  • "Performance Analysis and Tuning on Modern CPUs" by Denis Bakhvalov
  • "What Every Programmer Should Know About Memory" by Ulrich Drepper

Reference Documentation

  • Intel 64 and IA-32 Optimization Reference Manual
  • AMD64 Architecture Programmer's Manual
  • Agner Fogโ€™s microarchitecture guides

Tools

  • perf stat: Branch counter measurement
  • objdump: Assembly inspection
  • Intel VTune: Branch analysis
  • likwid: Performance counter suite

Classic Papers

  • "Two-Level Adaptive Training Branch Prediction" (Yeh & Patt)
  • "Dynamic Branch Prediction with Perceptrons" (Jiménez & Lin)