Modern CPU Internals: 2025 Deep Dive

Goal: Deeply understand the microarchitecture of modern CPUs (Zen 5, Lunar Lake, Apple M4) through hands-on implementation and measurement. You will move past the abstraction of “instructions” to the reality of micro-operations, out-of-order execution windows, TAGE branch predictors, and 512-bit vector pathways. By the end, you will be able to write code that achieves 90%+ of theoretical hardware peak performance.


Why Modern CPU Internals Matter in 2025

In 2025, we are in the era of “Architectural Specialization.” The speed of a single thread is no longer determined by clock frequency (which has plateaued at ~5-6GHz), but by how many instructions the CPU can execute in parallel (Instruction Level Parallelism - ILP) and how well it can guess the future (Speculative Execution).

The gap between a developer who understands the pipeline and one who doesn’t is no longer a few percent; it’s an order of magnitude. Modern CPUs are effectively supercomputers on a chip, with instruction windows of 400+ entries and multiple 512-bit vector engines.

High-level Code (C/C++)       What you think happens: "Add x to y"
        ↓
    x = x + y;          →     One instruction, one cycle.
        ↓
Microarchitectural Reality    What actually happens (Zen 5/Lunar Lake):
        ↓
Fetch & Branch Prediction →   Predict "if" result, fetch 32-64 bytes of code.
Decode (x86 -> uOps)      →   Break complex x86 into simple RISC uOps.
Rename & Allocate         →   Map RAX to Physical Register #187 to break dependencies.
Schedule (Out-of-Order)   →   Wait for data. Execute the MOMENT data is ready.
Execute (Port Pressure)   →   Compete for the specific ALU that can do ADD.
Retire (In-Order)         →   Put the result back in RAX only if prediction was right.

Microarchitectural Reality


Core Concept Analysis

1. The Front-End: x86 to uOp Translation

Modern CPUs are RISC machines wearing a CISC mask. x86 instructions are variable-length (1-15 bytes), making them a nightmare to decode. The “Front-End” is responsible for fetching these bytes and turning them into fixed-length “Micro-ops” (uOps) that the execution engine can understand.

[ L1 Instruction Cache ]
          ↓
[ Instruction Fetch Unit ]  ← Fetches 32-64 bytes/cycle
          ↓
[ Pre-Decode / Length Calc ] ← Find where instructions start/end (hard!)
          ↓
[ Micro-op Decoders ]        ← Zen 5: Dual 4-wide decoders
          ↓                    (Turns ADD rax, [mem] -> LOAD + ADD)
[ uOp Cache (DSB) ]          ← The "Shortcut": Stores already decoded uOps
          ↓
[ uOp Queue ]                ← Buffers uOps before the storm

Front-End Architecture

Key Insight: If your loop fits in the uOp Cache (DSB), you bypass the slow, power-hungry decoders entirely. This is why small, tight loops are dramatically faster. Zen 5 doubled down on this with a massive increase in uOp cache throughput.

2. The Out-of-Order (OoO) Engine & The ROB

The Reorder Buffer (ROB) is the “Waiting Room” of the CPU. It allows the CPU to look ahead hundreds of instructions to find things to do while waiting for slow RAM. This is “Out-of-Order” execution: instructions that are ready to run (operands available) jump ahead of instructions that are stalled.

            +---------------------------------------+
            |          REORDER BUFFER (ROB)         |
            | (Zen 5: 448 entries, Lion Cove: 576)  |
            +---------------------------------------+
              /      |      |      |      |      \
    [ Issue ] [ Issue ] [ Issue ] [ Issue ] [ Issue ]  ← Dispatch uOps
      |          |          |          |          |
    [ Port 0 ] [ Port 1 ] [ Port 2 ] [ Port 3 ] [ Port 4 ] ← ALUs/FPUs
      |          |          |          |          |
      \----------\----------\----------\----------/
                            |
                     [ COMMIT / RETIRE ] ← Make it "official" in-order

OoO Engine & ROB

Key Insight: The size of the ROB determines the CPU’s “Instruction Window.” If you have a memory miss that takes 300 cycles, and the CPU can only look 448 uOps ahead, it will eventually run out of things to do and stall.

3. Branch Prediction: TAGE Predictors

Modern CPUs predict the outcome of every if statement and loop condition before they even know the values. If the prediction is right, the CPU keeps flying. If it’s wrong, it has to “nuke” the entire pipeline and start over. In 2025, we use TAGE (Tagged Geometric) predictors.

Pattern: T, T, N, T, T, N...
[ Short History Table ]  ← Catches "T, N"
[ Medium History Table ] ← Catches "T, T, N"
[ Long History Table ]   ← Catches complex patterns across function calls

TAGE Branch Prediction
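
To get a feel for how even the simplest predictor “learns” a rhythm, here is a tiny C sketch (illustrative only, not a model of any real CPU) that runs a single 2-bit saturating counter, the building block underneath TAGE’s tables, against the T, T, N pattern shown above:

#include <stdio.h>

/* One 2-bit saturating counter: states 0-1 predict "not taken",
 * states 2-3 predict "taken". Train it on the pattern T, T, N. */
int main(void) {
    int counter = 2;                    /* start in "weakly taken" */
    const int pattern[] = {1, 1, 0};    /* T, T, N repeating */
    int correct = 0, total = 300;

    for (int i = 0; i < total; i++) {
        int outcome    = pattern[i % 3];
        int prediction = (counter >= 2);
        if (prediction == outcome) correct++;
        if (outcome  && counter < 3) counter++;   /* saturate at 3 */
        if (!outcome && counter > 0) counter--;   /* saturate at 0 */
    }
    printf("2-bit counter accuracy on T,T,N: %.1f%%\n",
           100.0 * correct / total);
    return 0;
}

A lone counter tops out around two-thirds accuracy on this pattern; adding history-indexed tables (a GHR and, ultimately, TAGE) is exactly what closes that gap to near 100%.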

4. Port Pressure and Execution Units

A CPU isn’t just one giant brain; it’s a team of specialists. There are specific ALUs for addition, others for multiplication, and others for memory loads. These specialists are accessed through “Ports.” If all your code does is ADD, you might be limited by the number of ADD ports, even if the rest of the CPU is idle. This is called Port Pressure.


Concept Summary Table

Concept Cluster      | What You Need to Internalize
---------------------|------------------------------------------------------------------------------------------
uOps & Decoding      | x86 is just a frontend; the backend is a RISC engine. Decode width is the first bottleneck.
Out-of-Order (OoO)   | The “Instruction Window” (ROB size) determines how much memory latency the CPU can hide.
Speculation          | Speculative execution is why modern CPUs are fast, but also why Spectre exists.
Execution Ports      | Multiple ALUs exist, but they are specialized. “Port Pressure” is the 2nd bottleneck.
Memory Aliasing      | The CPU only checks ~12 bits of address for quick overlap checks (4K Aliasing).
Heterogeneous (P/E)  | Schedulers must map tasks based on core-specific decode widths and vector support.
Bandwidth vs Latency | High-throughput 512-bit vector units require massive L1 cache bandwidth to stay fed.

Deep Dive Reading by Concept

Microarchitecture Fundamentals

Concept                      | Book & Chapter
-----------------------------|-----------------------------------------------------------------
The 5-stage classic pipeline | Computer Systems: A Programmer’s Perspective (CS:APP) — Ch. 4
Superscalar & OoO Execution  | Modern Processor Design by Shen & Lipasti — Ch. 4-5
Front-end & uOp Caching      | Agner Fog’s Microarchitecture Manual — Section: “The Pipeline”
Intel/AMD Optimization       | Intel 64 and IA-32 Optimization Manual — Ch. 2 (Architecture)

Branch Prediction & Speculation

Concept                        | Book & Chapter
-------------------------------|------------------------------------------------------------------------------
TAGE and Advanced Prediction   | Computer Architecture: A Quantitative Approach (Hennessy & Patterson) — Ch. 3
Speculative Security (Spectre) | Practical Binary Analysis by Dennis Andriesse — Ch. 11

Memory Hierarchy & Vectorization

Concept                     | Book & Chapter
----------------------------|-------------------------------------------------------------
Cache design and bandwidth  | Computer Systems: A Programmer’s Perspective — Ch. 6
SIMD and Vector performance | The Art of Writing Efficient Programs by Fedor Pikus — Ch. 7
Memory Disambiguation       | Modern Processor Design — Ch. 6.4

Project 1: The Human Pipeline Trace

  • File: MODERN_CPU_INTERNALS_2025_DEEP_DIVE.md
  • Main Programming Language: C
  • Alternative Programming Languages: Assembly, Rust
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 1: Beginner
  • Knowledge Area: Pipelining / Hazards
  • Software or Tool: objdump, gdb
  • Main Book: “Computer Systems: A Programmer’s Perspective” by Bryant & O’Hallaron

What you’ll build: A CLI tool that parses a sequence of assembly instructions and generates a cycle-by-cycle “Gantt chart” of how they move through a 5-stage pipeline, highlighting stalls and forwarding.

Why it teaches CPU internals: You will move from thinking of code as “instant” to seeing it as a physical movement of data through latches. You’ll understand why ADD R1, R2, R3 followed by SUB R4, R1, R5 causes a “Data Hazard.”

Core challenges you’ll face:

  • Parsing Data Dependencies: You must track which registers are “written” and when they are “available.”
  • Modeling Forwarding Paths: Deciding if the result of an EX stage can be sent directly to the next instruction’s EX stage.
  • Visualizing the “Bubble”: Representing a pipeline stall in a way that shows why the fetch unit stopped.

Real World Outcome

By completing this project, you will have a micro-architectural simulator that reveals the “heartbeat” of a CPU. You’ll see how instructions flow through the 5 classic stages (Fetch, Decode, Execute, Memory, Write-back) and, more importantly, where they stop and why.

You’ll be able to take real assembly code generated by gcc or clang and see exactly why it’s not running at 1 instruction per cycle.

Example Output:

$ gcc -S -O0 my_loop.c -o my_loop.s
$ ./pipe_trace my_loop.s --forwarding=on

Analyzing: 
1: ADD R1, R2, R3
2: SUB R4, R1, R5
3: LDR R6, [R1]

Cycle | IF  | ID  | EX  | MEM | WB  | Notes
------+-----+-----+-----+-----+-----+
  1   | I1  |     |     |     |     | Fetching ADD
  2   | I2  | I1  |     |     |     | Fetching SUB, Decoding ADD
  3   | I3  | I2  | I1  |     |     | I1 starts Execute
  4   | I4  | I3  | I2  | I1  |     | I2 enters EX (Forwarded R1 from I1)
  5   | I5  | I4  | I3  | I2  | I1  | I1 writes back. I3 stalled? NO.
  6   | ... | ... | ... | ... | ... |

[Pipeline Hazards Detected]
- RAW Hazard: I2 depends on I1 (R1). Status: Resolved via Forwarding (EX -> EX).
- Structural Hazard: None.

[Final Stats]
Total Cycles: 8
Instructions: 3
IPC (Instructions Per Cycle): 0.375
Efficiency: 37.5% of peak.

The Core Question You’re Answering

“Why does adding a single instruction sometimes make my code twice as slow?”

Before you write any code, sit with this. Most developers think code is “instant” once it hits the CPU. This project proves that code is a physical movement of data. If one instruction is waiting for a result that hasn’t been written yet, the whole factory floor (the pipeline) grinds to a halt. You are answering: How does the physical design of the CPU create wait-times (hazards)?

Concepts You Must Understand First

Stop and research these before coding:

  1. The 5-Stage Pipeline (RISC model)
    • IF (Instruction Fetch): Getting the bits from memory.
    • ID (Instruction Decode): Figuring out what the bits mean and reading registers.
    • EX (Execute): The ALU doing the actual math.
    • MEM (Memory Access): Reading or writing to RAM.
    • WB (Write Back): Putting the result back in the register file.
    • Book Reference: “Computer Systems” Ch. 4.1
  2. Data Hazards (RAW, WAR, WAW)
    • Read-After-Write (RAW): Trying to read a value before it’s finished being calculated.
    • Forwarding (Bypassing): The “short-circuit” that lets a result jump from the end of the ALU directly back to the input of the next ALU.
    • Book Reference: “Computer Organization and Design” Ch. 4.7
  3. Control Hazards (Branching)
    • What happens to the instructions already in the pipeline when a jump occurs?

Questions to Guide Your Design

Before implementing, think through these:

  1. Modeling Time
    • How do you represent “one cycle” in a C program?
    • Should your simulator loop over instructions or loop over cycles? (Hint: Cycle-based is the only way to model parallel stages).
  2. Instruction Representation
    • What metadata does an instruction need to carry as it moves through the stages? (Source registers, destination register, cycle it entered).
  3. Hazard Detection Logic
    • How does the “Decode” stage know if it should stall? It needs to look “ahead” at the EX, MEM, and WB stages. How do you implement this “look-ahead”?

Thinking Exercise

Trace the Flow by Hand

Before coding, trace this sequence on paper for a 5-stage pipeline WITHOUT forwarding:

1: ADD R1, R2, R3
2: SUB R4, R1, R5

Questions:

  • In which cycle is the result of ADD (R1) actually written to the register file?
  • In which cycle does SUB need to read R1 from the register file?
  • How many cycles of “bubbles” (stalls) are needed?
  • Now, add “Forwarding” from the EX stage. How does the diagram change?

The Interview Questions They’ll Ask

  1. “What is the difference between a structural hazard and a data hazard?”
  2. “How does a pipeline increase throughput but not necessarily decrease latency?”
  3. “Explain how forwarding (bypassing) works in a CPU pipeline.”
  4. “What is a ‘branch delay slot’ and why was it used in older architectures?”
  5. “What happens to the pipeline when a page fault occurs in the MEM stage?”

Hints in Layers

Hint 1: The Pipeline Array Represent the pipeline as an array of 5 elements: Instruction* pipeline[5];. In each “cycle”, move pointers from index 4 down to 0 (Writeback to Fetch).

Hint 2: Move Backwards When updating the pipeline for a new cycle, process the stages from WB to IF. If you move IF to ID first, you might move the same instruction through the whole pipeline in one cycle!

Hint 3: The Stall Condition A stall happens in the ID stage. If pipeline[1] (ID) needs a register that pipeline[2] (EX) is about to write, and you don’t have forwarding, you must “freeze” the IF and ID stages while letting EX, MEM, and WB continue.

Hint 4: Tracking Register Availability Keep an array int register_ready_at_cycle[32]. Every time an instruction enters the pipeline, update when its destination register will be “legal” to read.
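
To make the hints concrete, here is a minimal, cycle-based C skeleton (no forwarding, three hard-coded instructions, placeholder struct and field names); parsing real assembly, the forwarding paths, and the Gantt-chart output from the example above are left for you to add:

#include <stdio.h>

#define STAGES 5                 /* IF, ID, EX, MEM, WB */
enum { IF, ID, EX, MEM, WB };

typedef struct {
    const char *text;            /* assembly text, for printing */
    int dst;                     /* destination register, -1 if none */
    int src1, src2;              /* source registers, -1 if unused */
} Insn;

/* Hard-coded program: ADD R1,R2,R3 ; SUB R4,R1,R5 ; LDR R6,[R1] */
static Insn prog[] = {
    { "ADD R1,R2,R3", 1, 2, 3 },
    { "SUB R4,R1,R5", 4, 1, 5 },
    { "LDR R6,[R1]",  6, 1, -1 },
};
static const int nprog = sizeof prog / sizeof prog[0];

int main(void) {
    Insn *pipe[STAGES] = {0};    /* pipe[IF] .. pipe[WB] */
    int next = 0, cycle = 0, retired = 0;

    while (retired < nprog) {
        cycle++;
        /* Hazard check (no forwarding): the instruction in ID must wait
         * while a producer of one of its sources is still in EX or MEM. */
        Insn *id = pipe[ID];
        int stall = 0;
        for (int s = EX; s <= MEM; s++) {
            Insn *p = pipe[s];
            if (id && p && p->dst >= 0 &&
                (p->dst == id->src1 || p->dst == id->src2))
                stall = 1;
        }
        /* Advance stages back-to-front so each instruction moves once. */
        pipe[WB]  = pipe[MEM];
        pipe[MEM] = pipe[EX];
        if (!stall) {
            pipe[EX] = pipe[ID];
            pipe[ID] = pipe[IF];
            pipe[IF] = (next < nprog) ? &prog[next++] : NULL;
        } else {
            pipe[EX] = NULL;     /* bubble; IF and ID stay frozen */
        }
        if (pipe[WB]) retired++; /* this instruction completes WB now */

        printf("cycle %2d |", cycle);
        for (int s = 0; s < STAGES; s++)
            printf(" %-13s|", pipe[s] ? pipe[s]->text
                              : (s == EX && stall) ? "(bubble)" : "");
        printf("%s\n", stall ? " <- RAW stall" : "");
    }
    printf("Total cycles: %d, IPC = %.2f\n", cycle, (double)nprog / cycle);
    return 0;
}

Note how the stages are updated back-to-front (Hint 2) and how a stall freezes IF and ID while inserting a bubble into EX (Hint 3).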

Books That Will Help

Topic                 | Book                                           | Chapter
----------------------|------------------------------------------------|------------
Pipeline Architecture | “Computer Systems: A Programmer’s Perspective” | Ch. 4.4
Hazard & Forwarding   | “Computer Organization and Design”             | Ch. 4.7-4.8
RISC-V Implementation | “Digital Design and Computer Architecture”     | Ch. 7

Project 2: The Branch Predictor Torture Test

  • File: MODERN_CPU_INTERNALS_2025_DEEP_DIVE.md
  • Main Programming Language: C++
  • Alternative Programming Languages: C, Rust
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: Branch Prediction / TAGE
  • Software or Tool: perf, RDTSC
  • Main Book: “Modern Processor Design” by Shen & Lipasti

What you’ll build: A suite of micro-benchmarks that execute loops with conditional branches following specific patterns (alternating, nested loops, long-period sequences). You’ll measure the exact cycle cost of each pattern to find the limits of your CPU’s predictor.

Why it teaches CPU internals: You will realize that the CPU has a “memory” of your code’s history. You’ll discover that a pattern like T, T, N, T, T, N is essentially free, but a truly random sequence is 15-20x slower. You’ll explore the limits of modern TAGE (Tagged Geometric) predictors found in Zen 5 and Lunar Lake.

Core challenges you’ll face:

  • Defeating the Compiler: Compilers love turning branches into CMOV (conditional moves) which have no prediction. You must use volatile or assembly to keep the branch.
  • High-Resolution Timing: Using RDTSC (Read Time-Stamp Counter) to measure differences of just a few clock cycles.
  • Pattern Complexity: Generating a pattern that is too long for the CPU’s Branch History Table (BHT) or TAGE tables.

Real World Outcome

You will create a “Map of Predictability” for your CPU. This tool will identify the exact “saturation point” where your CPU’s branch predictor (like the TAGE-L in Zen 5) can no longer find patterns in your code. You’ll be able to quantify the cycle-cost of a misprediction—proving why “branchless” code is often superior for performance.

Example Output:

$ ./branch_torture --patterns all --warmup 10000

[Target: Intel Core Ultra 200V (Lunar Lake)]
[Testing Pattern: Periodic Binary]
Pattern: [T, N]                 -> 1.01 cycles/iter (Predictor: 100% Correct)
Pattern: [T, T, T, N]           -> 1.01 cycles/iter (Predictor: 100% Correct)
Pattern: [T, T, N, N, T, N]     -> 1.02 cycles/iter (Predictor: 100% Correct)

[Testing Pattern: Random Chaos]
Pattern: [Rand 50/50]           -> 18.45 cycles/iter (Predictor: 0% Mastery)

[Testing Pattern: Deep History (TAGE Test)]
Pattern: [Period 1024]          -> 1.05 cycles/iter (Predictor: Still Mastering)
Pattern: [Period 16384]         -> 15.22 cycles/iter (Predictor: SATURATED)

# ANALYSIS: Your Skymont E-core predictor saturates at ~8192 bits of global history.
# Misprediction Penalty: 17.4 cycles (approx. pipeline depth).

The Core Question You’re Answering

“Is my CPU’s branch predictor smart enough to learn a pattern of 1,000 branches?”

Most developers think branch prediction is a simple “yes/no” guess based on the last result. In reality, it’s a sophisticated machine-learning engine that can track thousands of previous outcomes to identify complex rhythms. This project answers: What are the physical limits of my CPU’s ‘memory’ of my code’s behavior?

Concepts You Must Understand First

Stop and research these before coding:

  1. Two-Bit Saturating Counters
    • Why is one bit of history not enough? (The “hysteresis” problem—changing prediction too fast on a single outlier).
    • Book Reference: “Modern Processor Design” Ch. 5.2
  2. Global History Register (GHR)
    • How does the outcome of a branch in a completely different function affect the prediction of the current branch?
    • Book Reference: “Computer Architecture: A Quantitative Approach” Ch. 3.3
  3. TAGE Predictors (Zen 5 / Lunar Lake)
    • What does “Tagged Geometric” mean? (Using multiple tables with exponentially increasing history lengths).
    • Why do TAGE predictors allow for 1000+ branch history tracking?
  4. RDTSC and Timing Noise
    • Using __builtin_ia32_rdtsc() vs rdtscp.
    • Why you must pin your process to a single core (pthread_setaffinity_np) to avoid “core hopping” noise.

Questions to Guide Your Design

  1. Defeating the Compiler
    • If you write if (x < 10) count++;, the compiler might turn it into a CMOV (Conditional Move). CMOV has no prediction—it just runs. How do you force the compiler to use a real branch (JCC)? (Hint: Use asm goto or volatile).
  2. Warming the Engine
    • Why do you need a “Training Phase” before you start your timer? What happens to the predictor’s state during the first 1,000 iterations?
  3. Overhead Subtraction
    • How do you measure just the branch? You need a “Baseline” loop that does the same work but has no branch. How do you ensure the baseline isn’t optimized away?

Thinking Exercise

The Hidden State

Trace this logic:

for (int i = 0; i < 1000000; i++) {
    if (i % 3 == 0) { // T, N, N, T, N, N...
        do_work();
    }
}

Questions:

  • After how many iterations will a “2-bit saturating counter” reach the “Strongly Not Taken” state for the N parts of the pattern?
  • If the pattern was 1,000 elements long before repeating, would a simple local predictor catch it?
  • Draw how the Global History Register would look after 4 iterations.

The Interview Questions They’ll Ask

  1. “Why is a mispredicted branch so much more expensive than a predicted one?” (Pipeline flush).
  2. “What is speculative execution, and how does it relate to branch prediction?”
  3. “Explain the difference between a Branch Target Buffer (BTB) and a Branch History Table (BHT).”
  4. “How does a TAGE predictor handle ‘aliasing’ between different branches?”
  5. “Can one thread’s branch history affect another thread on the same physical core (SMT/Hyper-threading)?”

Hints in Layers

Hint 1: The Measurement Loop Wrap your test in a function and use RDTSC at the start and end. Run it for at least 100 million iterations to average out system jitter.

Hint 2: The Inline Assembly Branch To guarantee a branch, use inline assembly:

__asm__ volatile (
    "test %%eax, %%eax\n"
    "jz 1f\n"              /* numeric local label: safe if this block is emitted more than once */
    "add $1, %%ebx\n"
    "1:\n"
    : "+b"(counter) : "a"(condition) : "cc"
);

Hint 3: Training Pattern Generator Create a buffer of uint8_t and fill it with your pattern. Access it using pattern[i & (N-1)] inside the loop. Make sure N is a power of 2 for maximum speed.

Hint 4: Toplev and Perf Use perf stat -e branches,branch-misses ./your_program to verify that your “Random” test is actually missing 50% of the time, and your “Periodic” test is missing 0% of the time.
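
As a starting point, here is a rough harness that combines Hints 1 and 3 (core pinning, pattern buffer, RDTSC timing). It uses a plain C if, which an optimizing compiler may turn into CMOV, so for the real experiment you would drop in the inline-assembly branch from Hint 2; the pattern length and iteration counts are placeholder values:

#define _GNU_SOURCE
#include <sched.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <x86intrin.h>                       /* __rdtsc() */

#define N     (1u << 14)                     /* pattern length, power of two */
#define ITERS (100u * 1000u * 1000u)

int main(void) {
    /* Pin to core 0 so we always exercise the same core's predictor. */
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);
    sched_setaffinity(0, sizeof set, &set);

    /* Fill the pattern buffer: swap this line to test periodic patterns. */
    uint8_t *pattern = malloc(N);
    for (unsigned i = 0; i < N; i++)
        pattern[i] = rand() & 1;             /* the "Random Chaos" case */

    volatile uint64_t sink = 0;              /* keeps the work observable */

    /* Warm-up phase: let the predictor train before timing starts. */
    for (unsigned i = 0; i < ITERS / 10; i++)
        if (pattern[i & (N - 1)]) sink++;

    uint64_t t0 = __rdtsc();
    for (unsigned i = 0; i < ITERS; i++)
        if (pattern[i & (N - 1)]) sink++;    /* the branch under test */
    uint64_t t1 = __rdtsc();

    printf("%.2f cycles/iter (sink=%llu)\n",
           (double)(t1 - t0) / ITERS, (unsigned long long)sink);
    free(pattern);
    return 0;
}

Compare a random fill against a periodic one such as pattern[i] = (i % 3 == 0), and confirm the difference with perf stat -e branches,branch-misses as in Hint 4.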

Books That Will Help

Topic                    | Book                                             | Chapter
-------------------------|--------------------------------------------------|-----------------------------
Branch Prediction Theory | “Computer Architecture: A Quantitative Approach” | Ch. 3.3
Predictor Implementation | “Modern Processor Design”                        | Ch. 5.2
Intel/AMD Specifics      | “Agner Fog’s Microarchitecture Manual”           | “Branch Prediction” section

Project 3: Speculative Side-Channel Explorer (Spectre-lite)

  • File: MODERN_CPU_INTERNALS_2025_DEEP_DIVE.md
  • Main Programming Language: C
  • Alternative Programming Languages: Assembly
  • Coolness Level: Level 5: Pure Magic
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 4: Expert
  • Knowledge Area: Security / Speculative Execution
  • Software or Tool: clflush, mfence, RDTSCP
  • Main Book: “Practical Binary Analysis” by Dennis Andriesse

What you’ll build: A demonstration of the Spectre Variant 1 attack. You will write code that leaks a “secret string” from a protected area of memory, even though the secret is never logically accessed by your program. This project proves that software isolation can be broken by hardware performance optimizations.

Why it teaches CPU internals: You will realize that “Code” doesn’t just do what’s on the page. You’ll see that the CPU speculatively executes instructions that were logically impossible to reach, and you’ll learn how to “detect” the footprints left behind in the L1 cache.

Core challenges you’ll face:

  • Defeating the “Bounds Check”: You must train the branch predictor to expect valid indices, then suddenly pass an invalid one.
  • Cache Timing: Distinguishing between an L1 cache “hit” (~4 cycles) and a “miss” (~200+ cycles) with enough precision to recover data.
  • Microarchitectural Noise: Handling the fact that other processes and OS interrupts can “pollute” the cache while you’re measuring.

Real World Outcome

You will demonstrate a “ghost” in the machine. You’ll leak a “secret string” from a protected area of memory, even though the secret is never logically accessed by your program. This project proves that software isolation (like sandboxing) can be broken by hardware behavior.

Example Output:

$ ./spectre_lite

[Attacker Initialization]
Target Secret: "SUPER_SECRET_KEY_2025"
Training Predictor with valid indices (0-15)...

[Phase 1: The Speculative Strike]
Triggering OOB read with index 40... Branch predicted TRUE!
Speculative Load: secret_array[40] ('S') -> loaded into L1 cache.

[Phase 2: The Side-Channel Measurement]
Scanning Cache...
Char 'S' (83): Hit (15 cycles) - EXPOSED!
Char 'U' (85): Hit (12 cycles) - EXPOSED!
Char 'P' (80): Hit (14 cycles) - EXPOSED!

Recovered Secret: "SUPER_SECRET"
# WARNING: This demonstrates why modern OSes use KPTI and LFENCE mitigations.

The Core Question You’re Answering

“If my code has a bounds check (if (i < size)), why is it still unsafe?”

Modern CPUs prioritize speed over architectural purity. Speculative execution is so powerful that we accept its “leakiness” (leaving footprints in the cache) as a trade-off for performance. This project answers: How can microarchitectural side-effects be used to bypass logical security boundaries?

Concepts You Must Understand First

Stop and research these before coding:

  1. The Cache Timing Side-Channel
    • What is the difference in cycles between L1 Hit (~4 cycles) and RAM Miss (~200+ cycles)?
  2. Speculative Reversion
    • When a branch is mispredicted, the CPU “reverts” the registers. Why does it not revert the cache state? (Complexity and power cost).
  3. Flush+Reload Technique
    • How do you clear a specific memory address from the cache (clflush) and later measure how long it takes to read it back?
  4. Memory Barriers (LFENCE)
    • How does LFENCE stop the CPU from looking ahead?
    • Book Reference: “Practical Binary Analysis” Ch. 11

Questions to Guide Your Design

  1. Probe Array Spacing
    • Why must each entry in your “Probe Array” be 4096 bytes apart? (Hint: Page boundaries and prefetchers).
  2. Training vs. Attacking
    • How do you “trick” the predictor? (Hint: 5 good requests, 1 bad request).
  3. Timer Precision
    • Is clock() or gettimeofday() enough? (No, you need RDTSCP for cycle-level precision).

Thinking Exercise

The Speculative Path

Look at this logic:

if (index < 16) {
    uint8_t value = secret_data[index];
    temp &= probe_array[value * 4096];
}

Questions:

  • If index is 100, the if is logically false. Does the CPU fetch secret_data[100]? (Yes, if the predictor says “True”).
  • If the CPU speculates, it uses the result of secret_data[100] to calculate an address in probe_array. Does that address get loaded into the cache?
  • When the CPU realizes the branch was false, it throws away value. Is the entry in probe_array still in the cache?

The Interview Questions They’ll Ask

  1. “What is the fundamental difference between Spectre and Meltdown?”
  2. “Why are side-channel attacks so difficult to mitigate in software?”
  3. “How does KPTI (Kernel Page Table Isolation) help protect against speculation attacks?”
  4. “What is a ‘gadget’ in the context of a Spectre attack?”
  5. “Can you perform this attack in a high-level language like JavaScript?”

Hints in Layers

Hint 1: The Setup Allocate a probe_array of 256 * 4096 bytes. Use memset to ensure it’s actually mapped in RAM.

Hint 2: Clearing the Cache Use _mm_clflush(&probe_array[i * 4096]) for all 256 entries before you trigger the speculative read. This ensures that any “Hit” you find later was caused by the speculation.

Hint 3: The Attack Loop Run a loop 30 times. For 29 iterations, pass a safe index. On the 30th, pass the index of the “Secret.”

Hint 4: Measuring the Footprint After the attack, time how long it takes to read probe_array[i * 4096] for i from 0 to 255. The index i that returns the fastest is the value of the secret byte!
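
Here is a minimal sketch of just the Flush+Reload side of Hints 2 and 4 (probe layout per Hint 1, serialized RDTSCP timing). The speculative out-of-bounds read itself is omitted; a dummy access stands in for it so the scan has something to find, and the slot count, stride, and threshold handling are simplifications:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <x86intrin.h>              /* _mm_clflush, _mm_lfence, __rdtscp */

#define SLOTS  256
#define STRIDE 4096

/* Time a single read with RDTSCP so the load is not reordered
 * around the timestamps. */
static uint64_t time_read(volatile uint8_t *addr) {
    unsigned aux;
    uint64_t t0 = __rdtscp(&aux);
    (void)*addr;
    uint64_t t1 = __rdtscp(&aux);
    return t1 - t0;
}

int main(void) {
    uint8_t *probe = malloc(SLOTS * STRIDE);
    for (int i = 0; i < SLOTS * STRIDE; i++) probe[i] = 1;   /* fault it in */

    /* Flush every slot so a later hit can only come from the "attack". */
    for (int i = 0; i < SLOTS; i++)
        _mm_clflush(&probe[i * STRIDE]);
    _mm_lfence();

    /* ...the speculative out-of-bounds read would go here; for this demo,
       touch one slot directly so the scan below has something to find... */
    volatile uint8_t touch = probe['S' * STRIDE];
    (void)touch;

    /* Reload scan: the slot that reads back fastest reveals the byte.
       (A real probe visits slots in a shuffled order to defeat the
       hardware prefetcher.) */
    int best = -1;
    uint64_t best_t = ~0ull;
    for (int i = 0; i < SLOTS; i++) {
        uint64_t t = time_read(&probe[i * STRIDE]);
        if (t < best_t) { best_t = t; best = i; }
    }
    printf("fastest slot: %d ('%c'), %llu cycles\n",
           best, best, (unsigned long long)best_t);
    free(probe);
    return 0;
}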

Books That Will Help

Topic                  | Book                                                | Chapter
-----------------------|-----------------------------------------------------|---------------
Side-Channel Attacks   | “Practical Binary Analysis”                         | Ch. 11
Cache Internals        | “Computer Systems: A Programmer’s Perspective”      | Ch. 6.4
Spectre Original Paper | “Spectre Attacks: Exploiting Speculative Execution” | Kocher et al.

Project 4: The uOp Cache Prober

  • File: MODERN_CPU_INTERNALS_2025_DEEP_DIVE.md
  • Main Programming Language: Assembly (x86-64)
  • Alternative Programming Languages: C (with __asm__)
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Front-end / uOp Cache
  • Software or Tool: perf, RDTSC
  • Main Book: “Agner Fog’s Microarchitecture Manual”

What you’ll build: A tool that executes “hot loops” of varying sizes (from 10 instructions to 10,000 instructions) and measures the IPC (Instructions Per Cycle). You’ll identify the exact point where the loop stops fitting in the uOp Cache (DSB) and the CPU switches to the slower legacy decoder.

Why it teaches CPU internals: You will learn that the bottleneck of your CPU is often the “Front-end” (the decoder). You’ll see that Zen 5’s dual uOp caches allow for much larger loops to remain fast compared to older architectures.

Core challenges you’ll face:

  • NOP Sleds: Using precise NOP or dummy instructions to grow the loop size without changing its behavior.
  • Instruction Alignment: Aligning your loop start on a 64-byte boundary to avoid cache-line crossing penalties.
  • Counter Overflow: Handling RDTSC overflows during long-running tests.

Real World Outcome

You will generate a “Performance DNA Profile” of your CPU’s front-end. You’ll identify the “Magic Boundary”—the exact number of instructions where your code stops being “Fast” and starts being “Legacy.” This knowledge allows you to size your critical hot-loops perfectly to stay within the uOp Cache (DSB).

Example Output:

$ ./uop_prober --arch zen5 --step 16

[Analyzing AMD Zen 5 Front-End]
Loop Size (uOps) | IPC    | Path Taken | Efficiency
-----------------|--------|------------|-----------
64               | 6.2    | DSB        | 100%
512              | 6.1    | DSB        | 98%
2048             | 5.8    | DSB        | 93%
4096             | 5.2    | DSB/MITE   | 84% (Transition)
8192             | 2.4    | MITE       | 38% (Legacy Decoder)

# RESULT: Your Zen 5 uOp cache saturates at ~4096 entries.
# Optimization: For maximum IPC, keep innermost loops under 32KB of binary code.

The Core Question You’re Answering

“Why are small loops so much faster than big ones, even if they do the same work?”

The x86 instruction set is a variable-length nightmare (1-15 bytes). The “Legacy Decoder” is slow and power-hungry because it has to scan for instruction boundaries. The uOp Cache is the “cheat code” that makes modern x86 CPUs competitive with RISC by storing instructions after they’ve been decoded. This project answers: How big can my ‘Fast’ path be before the CPU is forced back to its legacy roots?

Concepts You Must Understand First

Stop and research these before coding:

  1. MITE (Legacy Decoder)
    • Why can the legacy decoder only handle ~4-6 instructions per cycle? (Serial bottleneck of length decoding).
  2. DSB (Decoded Stream Buffer / uOp Cache)
    • Why is the uOp cache measured in “uOps” rather than “Bytes”?
    • How does the CPU switch between MITE and DSB?
  3. Instruction Fetch Unit (IFU)
    • How does the CPU pre-fetch 32-64 bytes of code per cycle into the L1i cache?
  4. Zen 5 Dual-Front End
    • How does Zen 5 use two parallel decode paths to reach 8-wide dispatch?

Questions to Guide Your Design

  1. Instruction Variety
    • Does a 1-byte NOP (0x90) take as much uOp cache space as a 10-byte MOV? (Hint: In the uOp cache, a uOp is a uOp, but original byte size affects the MITE path).
  2. Alignment Penalties
    • What happens if your loop starts at the very end of a 64-byte cache line? Does it split the fetch window?
  3. Loop Control
    • If your loop is only 10 instructions, the “Jump” at the end happens every few cycles. Does the branch predictor bottleneck the test before the uOp cache does?

Thinking Exercise

The Decoder Bottleneck

Imagine an x86 instruction that is 15 bytes long.

Questions:

  • If your CPU fetches 32 bytes per cycle, how many of these can it fetch at once? (Only 2).
  • If these are already in the uOp cache, does their 15-byte size still matter? (No, they are already uOps).
  • Why do compilers like gcc add “junk” NOP instructions to align loop targets to 32 or 64 byte boundaries?

The Interview Questions They’ll Ask

  1. “What is the Decoded I-Cache (uOp cache) and why is it critical for x86 performance?”
  2. “Why is x86 decoding harder than ARM decoding?”
  3. “What happens when a loop ‘thrashes’ the uOp cache?”
  4. “How does Zen 5’s dual-decode pipe change loop optimization strategy?”
  5. “What is uOp Fusion and where does it occur?”

Hints in Layers

Hint 1: The Loop Body Use a series of simple, single-uop instructions such as NOP or PXOR XMM0, XMM0 (a zeroing idiom), or ADDs that rotate through different destination registers. Keeping them independent ensures you measure the front-end, not a dependency chain or the execution ports.

Hint 2: Alignment Use the .p2align 6 assembly directive to align your loop to a 64-byte boundary.

Hint 3: perf Counters Use perf stat -e dsb_coverage.any,idq.mite_uops (Intel) or equivalent AMD counters to see exactly which path the CPU is taking.

Hint 4: Manual Unrolling To test specific sizes, use a macro to repeat an instruction N times:

#define REP10(x) x x x x x x x x x x
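
For instance, the macros might compose into a timed block like the sketch below (GCC/Clang extended asm assumed). NOPs are used for the body so there is no data-dependency chain and the limit you measure is the front-end, not the ALUs; to sweep sizes you would generate variants with different repeat counts:

#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>                     /* __rdtsc */

#define REP10(x)  x x x x x x x x x x
#define REP100(x) REP10(REP10(x))
#define BODY "nop\n\t"                     /* 1 uOp, no dependencies */

int main(void) {
    const uint64_t iters = 10 * 1000 * 1000;
    uint64_t t0 = __rdtsc();
    for (uint64_t i = 0; i < iters; i++)
        __asm__ volatile (REP100(BODY));   /* 100-instruction loop body */
    uint64_t t1 = __rdtsc();

    double cycles = (double)(t1 - t0) / iters;
    /* The outer loop adds a couple of uOps of overhead per iteration. */
    printf("%.1f cycles per 100-NOP block -> ~%.2f NOPs/cycle\n",
           cycles, 100.0 / cycles);
    return 0;
}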

Books That Will Help

Topic                   | Book                                           | Chapter
------------------------|------------------------------------------------|--------------------
Front-end & uOp Cache   | “Modern Processor Design”                      | Ch. 3.4
Zen 5 Microarchitecture | “AMD Zen 5 Optimization Guide”                 | Front-end Section
x86 Instruction Formats | “Computer Systems: A Programmer’s Perspective” | Ch. 3.1

Project 5: Execution Port Pressure Map

  • File: MODERN_CPU_INTERNALS_2025_DEEP_DIVE.md
  • Main Programming Language: C++ (with inline assembly)
  • Alternative Programming Languages: Rust, Assembly
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 2. The “Micro-SaaS / Pro Tool”
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Execution Units / Throughput
  • Software or Tool: perf, pmu-tools (toplev)
  • Main Book: “Intel 64 and IA-32 Architectures Optimization Reference Manual”

What you’ll build: A tool that characterizes the “Execution Ports” of your CPU. You will write loops that execute mixtures of instructions (e.g., Integer ADD vs Floating Point MUL) and measure the throughput to find which instructions compete for the same hardware resources.

Why it teaches CPU internals: You will understand that “the CPU” is actually a collection of specialized “Execution Units.” You’ll discover why you can do 4-6 additions per cycle but only 1 division, and how mixing independent instruction types (Int + FP) can lead to higher performance.

Core challenges you’ll face:

  • Identifying Port Binding: Creating “Conflict Tests” (two instructions that use the same port) vs “Parallel Tests” (two instructions that use different ports).
  • Managing Dependency Chains: Ensuring your test instructions don’t depend on each other (RAW hazards) so you are measuring port limits, not latency limits.
  • Interpreting PMU Counters: Learning to read UOPS_EXECUTED.PORT_0, PORT_1, etc. on Intel or OP_DISPATCH on AMD.

Real World Outcome

You will create a “Port Conflict Matrix” for your specific processor. This tool reveals the “floor plan” of your CPU’s execution engine. You’ll know exactly which instruction types (e.g., SIMD vs Integer) can run in parallel and which ones are fighting for the same specialized ports.

Example Output:

$ ./port_mapper --arch zen5

Analyzing Throughput (uOps/cycle):
- [ADD, ADD]: 4.0 uOps/c  (Status: Perfect Parallelism)
- [ADD, MUL]: 4.0 uOps/c  (Status: Perfect Parallelism)
- [MUL, MUL]: 1.0 uOps/c  (CONFLICT: Port 1 Only)
- [FADD, FADD]: 2.0 uOps/c (Parallel Vector Pipes)

Conclusion for Zen 5:
ALUs: 4 main ports (0-3)
IMUL Bottleneck: Dedicated to Port 1.
# Strategy: Spread IMUL instructions out; they can't be issued together!

The Core Question You’re Answering

“Why is my code slow even though CPU usage is only 50%?”

A CPU can be “busy” even if its transistors are idle, simply because every instruction is fighting for the same specialized “door” (port) to the ALU. This project answers: How are the execution units of my CPU physically wired to the instruction scheduler?

Concepts You Must Understand First

Stop and research these before coding:

  1. Superscalar Execution
    • What does “8-wide issue” mean in hardware? (The ability to send 8 uOps to ports in a single cycle).
  2. Execution Ports vs. Functional Units
    • Why do we use ports as an abstraction? (One port might lead to an Adder and a Multiplier).
  3. Zen 5 Integer Engine
    • Why did AMD move to 6 Integer ALUs and 4 AGUs?
  4. Port Binding
    • The “Port Conflict” test: If two instructions take 2x as long as one, they share a port.

Questions to Guide Your Design

  1. Dependency Chains
    • If Instruction B uses the result of Instruction A, they must run one after another (latency). How do you ensure your instructions are 100% independent so you measure throughput? (Hint: Use different registers for everything).
  2. Instruction Mix
    • If you mix ADD and SUB, do they compete? Check the “ALU port map” in your optimization manual.
  3. Register Pressure
    • If you use 50 different registers to avoid dependencies, will you hit the limit of the Physical Register File (PRF)?

Thinking Exercise

The Port Bottleneck

Your hypothetical CPU has 4 ports.

  • Ports 0, 1, 2, 3 can all do ADD.
  • Only Port 0 can do DIV.

Questions:

  • What is the max IPC if your code is 100% ADD? (4.0).
  • What is the max IPC if your code is 100% DIV? (1.0).
  • What happens if your code is 50% ADD and 50% DIV? (Port 0 is busy with DIV, leaving only 3 ports for the ADDs).

The Interview Questions They’ll Ask

  1. “What is an execution port and why do modern CPUs have so many?”
  2. “Explain the difference between instruction latency and throughput.”
  3. “What is Port Contention and how do you detect it in code?”
  4. “How does the scheduler decide which port to send a uOp to?”
  5. “Compare Zen 5 vs Intel Lion Cove port layouts (Hint: Both are widening significantly).”

Hints in Layers

Hint 1: Independent Accumulators Use different registers for every instruction. Example: ADD RAX, 1; ADD RBX, 1; ADD RCX, 1; ...

Hint 2: The Conflict Test Measure the cycles for 1,000 instructions of type A. Then 1,000 of type B. Then 1,000 of a mixed A, B, A, B sequence. If the mixed sequence is faster than the sum, they are on different ports.

Hint 3: Use asm volatile C++ compilers will “optimize” your test by noticing you’re adding to registers you never use. Use asm volatile to force the CPU to actually perform the work.

Hint 4: Intel IACA / llvm-mca Use the llvm-mca tool to predict what your results should be. If your real code is slower, you might have a hidden dependency.
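
A compressed version of the “Conflict Test” from Hint 2 might look like the sketch below: independent accumulators (Hint 1), asm volatile blocks (Hint 3), and a comparison of pure ADD, pure IMUL, and a 50/50 mix. Register choices and iteration counts are arbitrary, and results should be sanity-checked against llvm-mca or the perf port counters (Hint 4):

#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>                     /* __rdtsc */

#define ITERS (50u * 1000u * 1000u)

/* Each helper issues 8 independent instructions per iteration, spread
 * over 8 registers so no instruction waits on another's result. */
static void run_adds(void) {
    for (uint32_t i = 0; i < ITERS; i++)
        __asm__ volatile (
            "add $1,%%r8 \n\t add $1,%%r9 \n\t add $1,%%r10\n\t add $1,%%r11\n\t"
            "add $1,%%r12\n\t add $1,%%r13\n\t add $1,%%r14\n\t add $1,%%r15\n\t"
            : : : "r8","r9","r10","r11","r12","r13","r14","r15");
}

static void run_imuls(void) {
    for (uint32_t i = 0; i < ITERS; i++)
        __asm__ volatile (
            "imul $3,%%r8,%%r8  \n\t imul $3,%%r9,%%r9  \n\t"
            "imul $3,%%r10,%%r10\n\t imul $3,%%r11,%%r11\n\t"
            "imul $3,%%r12,%%r12\n\t imul $3,%%r13,%%r13\n\t"
            "imul $3,%%r14,%%r14\n\t imul $3,%%r15,%%r15\n\t"
            : : : "r8","r9","r10","r11","r12","r13","r14","r15");
}

static void run_mix(void) {
    for (uint32_t i = 0; i < ITERS; i++)
        __asm__ volatile (
            "add $1,%%r8 \n\t imul $3,%%r9,%%r9  \n\t add $1,%%r10\n\t imul $3,%%r11,%%r11\n\t"
            "add $1,%%r12\n\t imul $3,%%r13,%%r13\n\t add $1,%%r14\n\t imul $3,%%r15,%%r15\n\t"
            : : : "r8","r9","r10","r11","r12","r13","r14","r15");
}

static void measure(const char *name, void (*fn)(void)) {
    uint64_t t0 = __rdtsc();
    fn();
    uint64_t t1 = __rdtsc();
    double c = (double)(t1 - t0) / ITERS;
    printf("%-10s %5.2f cycles per 8 uOps -> %.2f uOps/cycle\n", name, c, 8.0 / c);
}

int main(void) {
    measure("ADD",  run_adds);
    measure("IMUL", run_imuls);
    /* If the mix is faster than the average of the two pure runs, the
     * two instruction types are being issued on different ports. */
    measure("MIX",  run_mix);
    return 0;
}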

Books That Will Help

Topic             | Book                                | Chapter
------------------|-------------------------------------|-----------
Execution Units   | “Inside the Machine”                | Ch. 4
Throughput Tables | “Agner Fog’s Instruction Latencies” | PDF Tables
Scheduler Design  | “Modern Processor Design”           | Ch. 4.4

Project 6: Memory Disambiguation Probe

  • File: MODERN_CPU_INTERNALS_2025_DEEP_DIVE.md
  • Main Programming Language: C
  • Alternative Programming Languages: Assembly
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 4: Expert
  • Knowledge Area: Load/Store Buffers / Speculation
  • Software or Tool: RDTSC
  • Main Book: “Modern Processor Design” by Shen & Lipasti

What you’ll build: A tool that tests the CPU’s Memory Disambiguation engine—the logic that decides if a LOAD can safely pass a STORE in the out-of-order engine. You’ll use addresses that “alias” (look the same to the CPU’s quick check) to trigger performance penalties.

Why it teaches CPU internals: You will understand that memory access is speculative too! You’ll see that the CPU guesses that ptrA and ptrB are different, only to “panic” and restart the instruction if they actually overlap.

Core challenges you’ll face:

  • 4K Aliasing: Understanding that x86 CPUs only check the bottom 12 bits of an address to decide if two pointers might overlap.
  • Store-to-Load Forwarding (STLF): Measuring the speed difference when a LOAD reads directly from the STORE buffer instead of the L1 cache.
  • Precise Timing: Detecting a 10-20 cycle “Store Buffer Stall.”

Real World Outcome

You will generate a “Disambiguation Heatmap” for your CPU. This tool reveals the “Blind Spot” of your CPU’s memory predictor. You’ll identify the exact address offsets (like the infamous 4K Aliasing) that cause the CPU to stall its out-of-order engine, helping you avoid performance pitfalls in high-performance data structures.

Example Output:

$ ./alias_probe --arch zen5

[Testing Memory Aliasing Bottleneck]
Offset (Bytes) | Latency (Cycles) | Status
---------------|------------------|-------
0              | 3                | STLF (Fast Forwarding)
64             | 5                | L1 Hit
2048           | 5                | L1 Hit
4096           | 22               | 4K ALIASING STALL!
8192           | 21               | 4K ALIASING STALL!

# ANALYSIS: Your Zen 5 CPU uses a 12-bit quick check for address aliasing.
# Performance Tip: Ensure pointers in tight read-after-write loops do not have the same bottom 12 bits.

The Core Question You’re Answering

“Why is writing to array[0] and then reading from array[1024] slower than it should be?”

The CPU’s out-of-order engine is a “look-ahead” machine. Memory addresses are the one thing it can’t always predict perfectly because the address isn’t known until the very last moment. This project answers: How does the CPU guess if two memory locations overlap, and what happens when it’s wrong?

Concepts You Must Understand First

Stop and research these before coding:

  1. Load Buffer & Store Buffer
    • How does the CPU track “in-flight” memory operations that haven’t hit the cache yet?
  2. Memory Disambiguation
    • How does the CPU decide if a LOAD is “ready” if there are STOREs still waiting for their addresses to be calculated?
  3. Store-to-Load Forwarding (STLF)
    • How does data get from a STORE to a LOAD without ever touching the L1 cache? (The data is passed in a register-like way inside the store buffer).
  4. 4K Aliasing (Address Aliasing)
    • Why do modern x86 CPUs only compare the lower 12 bits of an address for the initial “Safety check”?

Questions to Guide Your Design

  1. Creating the Conflict
    • How can you make the STORE address take a long time to calculate (e.g., a long dependency chain) while the LOAD address is known immediately? (This forces the CPU to guess).
  2. Aliasing the Bits
    • What happens if you use ptr and ptr + 4096? Why does the hardware think they are the same? (Hint: The bottom 12 bits are identical).
  3. The Stall Cost
    • How many cycles does a “Disambiguation Violation” cost compared to a normal L1 hit?

Thinking Exercise

The Aliasing Trap

Consider this code:

*ptrA = 100;
int x = *ptrB;

Questions:

  • If ptrA == ptrB, the CPU must wait for the store to finish.
  • If ptrA != ptrB, the CPU can start loading x immediately even before the *ptrA = 100 finishes.
  • What if ptrB = ptrA + 4096? The bottom 12 bits are the same. Does the CPU realize they are different pages immediately? (No, it stalls while it performs the full address check).

The Interview Questions They’ll Ask

  1. “What is Memory Disambiguation and why is it necessary for OoO execution?”
  2. “Explain the ‘4K Aliasing’ problem in x86 architectures.”
  3. “What is Store-to-Load Forwarding and when does it fail?” (Fails on unaligned access or partial overlaps).
  4. “How does the ‘Load/Store Buffer’ handle speculative memory accesses?”
  5. “What is a ‘Memory Dependency Violation’ and what is the cost of fixing it in hardware?”

Hints in Layers

Hint 1: The Address Mask To trigger 4K aliasing, use two pointers p1 and p2 where ((uintptr_t)p1 & 0xFFF) == ((uintptr_t)p2 & 0xFFF).

Hint 2: The Slow Store Force the store address to be slow by making it depend on a long chain of IMUL or DIV instructions. This prevents the “Store Address Generation” unit from knowing the address until the load is already ready to run.

Hint 3: Use RDTSC Measure the time for the store+load sequence. Compare the case where the offset is 64 bytes (no alias) to 4096 bytes (alias).

Hint 4: Prevent Prefetching Use _mm_clflush to ensure the data isn’t sitting in the L1 cache in a way that masks the buffer logic you’re trying to measure.
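
Putting Hints 1 and 3 together, a rough timing sketch might look like this; buffer size, offsets, and repetition counts are placeholders, and a more faithful version would also delay the store address with the dependency chain from Hint 2:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <x86intrin.h>                     /* __rdtsc */

/* Time a store-then-load pair, repeated many times, for one offset. */
static double probe(volatile char *p1, volatile char *p2) {
    enum { REPS = 1000000 };
    uint64_t t0 = __rdtsc();
    for (int i = 0; i < REPS; i++) {
        *p1 = (char)i;                     /* store ... */
        (void)*p2;                         /* ... immediately followed by a load */
    }
    uint64_t t1 = __rdtsc();
    return (double)(t1 - t0) / REPS;
}

int main(void) {
    char *buf = aligned_alloc(4096, 1 << 20);
    memset(buf, 1, 1 << 20);               /* fault the pages in first */

    const long offsets[] = { 0, 64, 2048, 4096, 8192 };
    printf("offset (bytes) | cycles per store+load\n");
    for (size_t i = 0; i < sizeof offsets / sizeof offsets[0]; i++) {
        /* Offsets that are multiples of 4096 share their low 12 bits with
         * the store address: the 4K-aliasing case from Hint 1. */
        printf("%14ld | %.1f\n", offsets[i], probe(buf, buf + offsets[i]));
    }
    free(buf);
    return 0;
}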

Books That Will Help

Topic              | Book                        | Chapter
-------------------|-----------------------------|--------------------------
Load/Store Units   | “Modern Processor Design”   | Ch. 6.4
Memory Speculation | “Computer Architecture”     | Ch. 3.9
x86 Memory Model   | “Intel Optimization Manual” | Memory Subsystem section

Project 7: The Reorder Buffer (ROB) Boundary Finder

Real World Outcome

You will determine the physical limit of your CPU’s “Instruction Horizon.” You’ll discover exactly how many instructions (uOps) your CPU can keep “in the air” while waiting for a slow memory load. This project proves why modern CPUs like Lunar Lake (576 entries) can hide more latency than older ones.

Example Output:

$ ./rob_prober --max-window 1024

[Measuring Reorder Buffer Capacity: Intel Lion Cove]
Window Size (uOps) | Latency (Cycles) | Status
-------------------|------------------|-------
128                | 305              | Waiting for L3 Miss
256                | 308              | Waiting for L3 Miss
512                | 315              | Window still open
576                | 320              | SATURATION POINT
640                | 610              | ROB FULL (Stalled Front-end)

# RESULT: Your Lion Cove ROB size is ~576 entries.
# Insight: This is the limit of your CPU's "patience."

The Core Question You’re Answering

“How far into the future can my CPU see?”

Modern CPUs like Zen 5 have ROBs with 400+ entries. This means the CPU is processing instructions that are 400+ instructions “ahead” of the oldest instruction still waiting to commit. This project answers: What is the maximum number of independent operations I can pack into a loop to hide a 300-cycle RAM access?

Concepts You Must Understand First

Stop and research these before coding:

  1. The Reorder Buffer (ROB)
    • What is the difference between “Issue,” “Execute,” and “Commit/Retire”?
    • Why must instructions commit in-order even if they execute out-of-order?
  2. Instruction Latency vs. Throughput
    • How does a 300-cycle load “clog” the retirement engine?
  3. Speculative Execution Retirement
    • What happens to a ROB entry if the instruction was on a mispredicted branch?
  4. The Physical Register File (PRF)
    • How does the PRF size limit the ROB? (If you run out of physical registers, the ROB can’t grow).

Questions to Guide Your Design

  1. Creating the “Wall”
    • How do you create an instruction that definitely takes 300 cycles? (Hint: Pointer chasing through a large linked list that misses L3 cache).
  2. Independent “Filler”
    • How do you create uOps that have zero dependencies on the slow load and zero dependencies on each other? (Hint: Use 16+ different registers).
  3. Detecting the Saturation
    • Why does the total time suddenly double when you exceed the ROB size? (Hint: The next slow load can’t enter the ROB until the first one retires).

Thinking Exercise

The Full Buffer

Imagine a ROB with 10 slots.

  • Slot 1: A LOAD that takes 100 cycles.
  • Slots 2-10: ADD instructions that take 1 cycle.

Questions:

  • In cycle 5, are slots 2-10 “done” executing? (Yes).
  • Can slots 2-10 be removed from the ROB in cycle 6? (No, because Slot 1 is still busy and retirement is in-order).
  • If the CPU encounters instruction 11 in cycle 7, can it enter the ROB? (No, the buffer is full).

The Interview Questions They’ll Ask

  1. “What is the primary function of the Reorder Buffer?”
  2. “How does the ROB size affect the CPU’s ability to hide memory latency?”
  3. “Explain the process of ‘retirement’ or ‘commit’ in an OoO processor.”
  4. “What is the ‘Instruction Window’ and how is it calculated?”
  5. “Why can’t we just make the ROB infinite?” (Power, area, and complexity of the renaming logic).

Hints in Layers

Hint 1: The Pointer Chase Create a linked list of 100MB. Each node’s next pointer should be at least 4KB away from the previous one to avoid the hardware prefetcher. This creates a “Memory Wall.”

Hint 2: The Filler Block Between two pointer-chase loads, insert N number of ADD RAX, 0. Use as many different registers as possible (RAX, RBX, RCX, RDX, RSI, RDI, R8-R15).

Hint 3: Measuring the Jump Loop the whole test 1,000 times. Plot N (number of filler uOps) vs Cycles. You will see a flat line that suddenly jumps vertically when N > ROB_SIZE.

Hint 4: Zen 5 vs Intel On Zen 5, look for ~448. On Intel Lion Cove, look for ~576. If you hit the jump at ~180-200, you are likely hitting the Physical Register File limit instead of the ROB limit.
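
The sketch below compresses the hints into one runnable experiment: a prefetcher-defeating pointer chase (Hint 1) with a variable amount of independent integer filler between loads (Hint 2), printed so you can spot the jump from Hint 3. The array size, stride, and filler step are assumptions to tune for your machine:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <x86intrin.h>                     /* __rdtsc */

#define NODES (1 << 22)                    /* 32 MB of pointers: size this above your L3 */

int main(void) {
    /* Pointer chase with a roughly page-sized stride so the hardware
     * prefetcher cannot follow it (Hint 1). */
    void **chase = malloc(NODES * sizeof *chase);
    for (size_t i = 0; i < NODES; i++)
        chase[i] = &chase[(i + 1021) % NODES];

    printf("filler uOps | cycles per miss\n");
    for (int filler = 0; filler <= 1024; filler += 64) {
        void **p = chase;
        const int steps = 20000;
        uint64_t t0 = __rdtsc();
        for (int s = 0; s < steps; s++) {
            p = (void **)*p;               /* the slow, dependent load */
            /* Independent filler (4 adds per pass, plus a little loop
             * overhead); these uOps pile up in the ROB because they
             * cannot retire past the pending load. */
            for (int f = 0; f < filler; f += 4)
                __asm__ volatile ("add $1,%%r8\n\t add $1,%%r9\n\t"
                                  "add $1,%%r10\n\t add $1,%%r11\n\t"
                                  : : : "r8","r9","r10","r11");
        }
        uint64_t t1 = __rdtsc();
        printf("%11d | %.0f\n", filler, (double)(t1 - t0) / steps);
        if (p == NULL) return 1;           /* keeps the chase from being optimized out */
    }
    free(chase);
    return 0;
}

Plot the output: cycles per miss stay roughly flat while the filler fits in the instruction window, then climb once it exceeds the ROB (or, per Hint 4, the PRF).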

Books That Will Help

Topic                     | Book                        | Chapter
--------------------------|-----------------------------|--------
ROB Architecture          | “Modern Processor Design”   | Ch. 4.3
Register Renaming         | “Computer Architecture”     | Ch. 3.2
Intel Instruction Windows | “Intel Optimization Manual” | Ch. 2.2

Project 8: Macro-op Fusion Detector

  • File: MODERN_CPU_INTERNALS_2025_DEEP_DIVE.md
  • Main Programming Language: Assembly
  • Alternative Programming Languages: C
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 2. The “Micro-SaaS / Pro Tool”
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: Decoding / Instruction Fusion
  • Software or Tool: perf (uops_retired.slots), RDTSC
  • Main Book: “Agner Fog’s Optimization Manual”

What you’ll build: A benchmark that compares the retirement throughput of common instruction pairs (e.g., CMP and JZ) against non-fusable pairs. You’ll measure if the CPU treats the pair as a single “Macro-op” in the pipeline.

Why it teaches CPU internals: You’ll understand how the CPU “shrinks” your code in the front-end to save energy and bandwidth. You’ll learn which coding patterns are “hardware-friendly” for Zen 5 vs Lunar Lake.

Core challenges you’ll face:

  • Precise retirement counting: Using hardware performance counters to see the number of “uops retired” vs “instructions retired.”
  • Isolating Fusion from Caching: Ensuring the uop cache doesn’t mask the fusion effect.
  • Architectural differences: Finding why certain fusions work on Intel (Lion Cove) but not on AMD (Zen 5), or vice versa.

Real World Outcome

You will generate a “Fusion Catalog” for your processor. This report identifies which instruction pairs (like TEST/JZ or CMP/JE) your CPU’s decoder can physically “glue” into a single operation. Maximizing fusion effectively increases your CPU’s fetch and retirement width by up to 2x without increasing power consumption.

Example Output:

$ ./fusion_detect --arch lunar_lake

[Instruction Pair Analysis: Intel Lion Cove]
Pair: TEST EAX, EAX; JZ label; -> 1 uOp retired (SUCCESS: FUSED)
Pair: CMP RAX, 100; JG label;   -> 1 uOp retired (SUCCESS: FUSED)
Pair: ADD RAX, 1; JZ label;     -> 1 uOp retired (SUCCESS: FUSED)
Pair: INC RAX; JZ label;        -> 1 uOp retired (SUCCESS: FUSED)
Pair: MOV RAX, [RBX]; JZ label; -> 2 uOps retired (FAILED: Load cannot fuse with Jump)

# EFFICIENCY GAINED: 50% reduction in Front-end Pressure for control flow.
# Tip: Modern Intel cores fuse almost all ALU+Branch pairs.

The Core Question You’re Answering

“Does the order of my CMP and JMP instructions actually matter?”

Yes. Modern CPUs have specific logic in the decoder that looks for pairs of instructions to “glue” together. This is “Macro-op Fusion.” This project answers: How can I write assembly that the decoder can ‘shrink’ to save energy and bandwidth?

Concepts You Must Understand First

Stop and research these before coding:

  1. Macro-op Fusion (MOP Fusion)
    • How does the decoder merge two x86 instructions into one internal uop?
  2. Micro-op Fusion
    • How does the CPU merge a memory load and an ALU operation into one uop?
  3. The Decoder’s “Peephole”
    • How many instructions “ahead” does the decoder look to find fusion candidates?
  4. Instruction Retirement Slots
    • Why is counting “uops retired” the key to detecting fusion?

Questions to Guide Your Design

  1. Instruction Spacing
    • If you put a NOP between CMP and JZ, does fusion still happen? (Hint: No, they must be adjacent).
  2. Register Usage
    • Does CMP RAX, RBX; JZ label; fuse the same way as CMP RAX, 0; JZ label;?
  3. The Counter Signal
    • Which specific perf event allows you to see the “fused” state? (Hint: uops_retired.slots vs inst_retired.any).

Thinking Exercise

The Fusion Logic

The CPU decoder sees:

CMP RAX, RCX
JE label

Questions:

  • If these are fused, how many entries do they take in the ROB? (Only 1).
  • Does the execution unit (ALU) treat them as a single operation (compare-and-branch)? (Yes).
  • Why would fusing them save power? (Fewer bits to move through the pipeline).

The Interview Questions They’ll Ask

  1. “What is the difference between Micro-op fusion and Macro-op fusion?”
  2. “Why is fusion beneficial for the front-end of the CPU?”
  3. “List three instruction pairs that are commonly fused in x86-64.”
  4. “Does fusion happen in the decoder or in the uOp cache?”
  5. “How does the ‘Instruction Decoder’ width limit the number of fusions per cycle?”

Hints in Layers

Hint 1: The Measurement Loop Execute 100 million pairs in a tight loop. Use perf to count total instructions and total uops. If uops < instructions, fusion occurred.

Hint 2: Subtraction of Loop Overhead The loop itself has a DEC and JNZ. Measure a loop of just NOPs first to find the baseline cost of the loop control.

Hint 3: Boundary Crossings Test if fusion fails if the first instruction is at the end of a 16-byte fetch window and the second is at the start of the next.

Hint 4: Intel vs AMD AMD Zen 5 and Intel Lion Cove have different “Peephole” widths. Experiment with putting a non-fusing instruction between the pair to see when fusion breaks.
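
A minimal measurement loop in the spirit of Hint 1 is sketched below, using a TEST/JNZ pair in extended inline asm (numeric local labels so the block can be emitted repeatedly). The cycle number is only a hint; the real signal is running it under perf and comparing instructions retired against uops retired, as the hints describe:

#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>                     /* __rdtsc */

/* 100 million TEST+Jcc pairs, a classic macro-fusion candidate. Run as:
 *   perf stat -e instructions,uops_retired.slots ./a.out
 * Fewer uops than instructions retired means the decoder fused pairs. */
int main(void) {
    const uint64_t iters = 100u * 1000u * 1000u;
    uint64_t t0 = __rdtsc();
    for (uint64_t i = 0; i < iters; i++)
        __asm__ volatile (
            "test %%rax, %%rax\n\t"        /* candidate pair: TEST ...        */
            "jnz 1f\n\t"                   /* ... immediately followed by Jcc */
            "1:\n\t"
            : : "a"(i) : "cc");
    uint64_t t1 = __rdtsc();
    printf("%.2f cycles per TEST+JNZ pair (loop overhead included)\n",
           (double)(t1 - t0) / iters);
    return 0;
}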

Books That Will Help

Topic           | Book                                 | Chapter
----------------|--------------------------------------|---------------------------
Fusion Rules    | “Agner Fog’s Optimization Manual”    | Instruction Fusion section
Intel Front-end | “Intel Optimization Manual”          | Ch. 2.1
Zen 5 Front-end | “AMD Zen 5 Microarchitecture Manual” | Decoder section

Project 9: L1 Bandwidth Stressor (Zen 5 focus)

Real World Outcome

You will attempt to saturate the massive 512-bit wide memory pipes of the Zen 5 architecture. You’ll measure “Bytes per Cycle” and compare it to the theoretical peak (e.g., 512 GB/s). This project proves if your code can actually “feed the beast” of modern AVX-512 vector units.

Example Output:

$ ./l1_stress --width 512 --unroll 8

[Zen 5 L1 Bandwidth Benchmark]
AVX-256 (2 Loads/cycle): 124.2 GB/s
AVX-512 (2 Loads/cycle): 248.8 GB/s
AVX-512 (3 Loads/cycle - Zen 5 Optimized): 480.5 GB/s

# RESULT: You are hitting 98% of the theoretical 512 GB/s bandwidth.
# Efficiency: This loop moves roughly 480 bytes of data every nanosecond.

The Core Question You’re Answering

“Why did AMD double the cache bandwidth on Zen 5, and can my code actually use it?”

Hardware features are useless if software can’t feed them. To hit peak speed, you must understand alignment, vector width, and AGU (Address Generation Unit) limits. This project answers: How do I structure my memory access to match the physical ‘pipe’ diameter of the L1 cache?

Concepts You Must Understand First

Stop and research these before coding:

  1. L1 Data Cache Ports
    • How many loads and stores can Zen 5 handle in one cycle? (Hint: 3 Loads, 2 Stores).
  2. Vector Width (AVX2 vs AVX-512)
    • Why does moving to 512-bit registers require doubling the bandwidth to keep the ALUs busy?
  3. Cache Line Alignment
    • What is a “Split Load” and why is it a performance killer?
  4. AGU (Address Generation Units)
    • Why does the CPU need a specialized unit just to calculate an address?

Questions to Guide Your Design

  1. Data Alignment
    • Why must your buffers be aligned to 64-byte boundaries (posix_memalign)?
  2. Read/Write Ratio
    • Is bandwidth higher for 100% Reads or a mix of Reads and Writes? (Check the port counts).
  3. Loop Unrolling
    • If you don’t unroll your loop, is the “branch” at the end the bottleneck? (Hint: Yes).

Thinking Exercise

The Pipe Diameter

Imagine the L1 cache as a water tank and the FPU as a thirsty engine.

  • Zen 4 has a 32-byte “pipe.”
  • Zen 5 has a 64-byte “pipe.”

Questions:

  • If you use 32-byte (AVX2) instructions on Zen 5, do you get the benefit of the bigger pipe?
  • How many instructions must you issue per cycle to “fill” the 64-byte pipe?
  • What happens if the data is in the L2 cache instead of L1?

The Interview Questions They’ll Ask

  1. “What is the difference between L1 bandwidth and L1 latency?”
  2. “How many Load/Store units does the Zen 5 architecture have?”
  3. “Explain why unaligned memory access is slower.”
  4. “What is a Cache Bank Conflict?”
  5. “Does AVX-512 cause clock downclocking on Zen 5?” (Answer: No, Zen 5 is designed for it).

Hints in Layers

Hint 1: Aligned Allocation Use aligned_alloc(64, size) to ensure your memory starts at the beginning of a cache line.

Hint 2: Intrinsics Use _mm512_load_ps to move 64 bytes in one instruction. Ensure you use the “aligned” version of the load to avoid penalties.

Hint 3: Unrolling Manually unroll your loop 8 times. This ensures the CPU spends 99% of its time moving data and 1% of its time checking the loop counter.

Hint 4: Parallel Pointers Use 4 different pointers to 4 different memory areas to avoid “Store-to-Load” false dependencies in the buffer.
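
Here is a stripped-down load-bandwidth kernel following Hints 1-3, assuming a compiler flag like -mavx512f and a CPU with AVX-512. Buffer size, the four independent accumulators, and the pass count are all tunable assumptions; it reports GB/s from wall-clock time, so you would still convert to bytes per cycle for a comparison like the example output:

#define _POSIX_C_SOURCE 199309L
#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define BUF_BYTES (16 * 1024)              /* fits comfortably inside L1D */
#define PASSES    (1 << 20)

int main(void) {
    uint8_t *buf = aligned_alloc(64, BUF_BYTES);   /* Hint 1: 64-byte aligned */
    memset(buf, 1, BUF_BYTES);

    /* Four independent accumulators (Hint 3) so no load waits on another. */
    __m512i a0 = _mm512_setzero_si512(), a1 = a0, a2 = a0, a3 = a0;

    struct timespec ts0, ts1;
    clock_gettime(CLOCK_MONOTONIC, &ts0);
    for (int p = 0; p < PASSES; p++) {
        for (size_t i = 0; i < BUF_BYTES; i += 256) {   /* 4 x 64-byte loads */
            a0 = _mm512_or_si512(a0, _mm512_load_si512(buf + i));
            a1 = _mm512_or_si512(a1, _mm512_load_si512(buf + i + 64));
            a2 = _mm512_or_si512(a2, _mm512_load_si512(buf + i + 128));
            a3 = _mm512_or_si512(a3, _mm512_load_si512(buf + i + 192));
        }
    }
    clock_gettime(CLOCK_MONOTONIC, &ts1);

    double secs  = (ts1.tv_sec - ts0.tv_sec) + (ts1.tv_nsec - ts0.tv_nsec) * 1e-9;
    double bytes = (double)BUF_BYTES * PASSES;
    /* Reduce the accumulators so the loads cannot be optimized away. */
    long long sink = _mm512_reduce_add_epi64(
        _mm512_or_si512(_mm512_or_si512(a0, a1), _mm512_or_si512(a2, a3)));
    printf("%.1f GB/s (sink=%lld)\n", bytes / secs / 1e9, sink);
    free(buf);
    return 0;
}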

Books That Will Help

Topic             | Book                                     | Chapter
------------------|------------------------------------------|--------------
Cache Bandwidth   | “Computer Systems”                       | Ch. 6.4
Zen 5 Design      | “AMD Zen 5 Microarchitecture Whitepaper” | Cache section
SIMD Optimization | “The Art of Writing Efficient Programs”  | Ch. 7

Project 10: Lunar Lake P vs E Core Profiler

Real World Outcome

You will generate an “Architectural Efficiency” report for Intel’s Lunar Lake. You’ll compare the performance of Lion Cove (Performance) and Skymont (Efficiency) cores on identical code. You’ll discover the trade-offs in decode width, branch accuracy, and ROB size between the two designs.

Example Output:

$ ./hybrid_profile

[Intel Lunar Lake Topology]
P-Cores: 4 (Lion Cove)
E-Cores: 4 (Skymont)

Metric       | Lion Cove (P) | Skymont (E) | Gap
-------------|---------------|-------------|-----
Integer IPC  | 6.2           | 4.1         | 1.5x
Branch Acc   | 98.2%         | 95.1%       | 1.03x
L2 Latency   | 12 cycles     | 18 cycles   | 1.5x
ROB Size     | 576 entries   | 416 entries | 1.38x

# CONCLUSION: Skymont E-cores deliver 70% of P-core IPC at 1/4 the power.
# Optimization: Target P-cores for UI/Latency, E-cores for Background/Batch.

The Core Question You’re Answering

“If my CPU has 8 cores, why is my code slower on 4 of them?”

Intel’s Lunar Lake represents the future of heterogeneous computing. This project answers: How do I write code that performs well on both an 8-wide P-core and a 9-wide (clustered) E-core?

Concepts You Must Understand First

Stop and research these before coding:

  1. Heterogeneous Computing
    • Why did Intel remove SMT (Hyperthreading) from Lunar Lake? (Power and die area efficiency).
  2. Lion Cove vs Skymont Architecture
    • Compare the decoder widths (Lion Cove: a single 8-wide decoder; Skymont: three 3-wide decode clusters, 9-wide total).
  3. Intel Thread Director
    • How does the hardware “nudge” the OS to move a thread to a different core type?
  4. Frequency Normalization
    • To compare architectures fairly, divide performance by clock speed; in other words, measure IPC rather than raw runtime.

Questions to Guide Your Design

  1. Thread Pinning
    • How do you use pthread_setaffinity_np to force a thread onto Core 0 (P) vs Core 4 (E)?
  2. Cache Latency
    • Is it faster to share data between two P-cores than between a P and an E core?
  3. ISA Symmetry
    • Are there instructions that run on P but fail or emulate on E?

Thinking Exercise

The Efficient Worker

Imagine a P-core as a specialized engineer and an E-core as a general worker. Questions:

  • If you have 100 simple math tasks, is it better to use 1 P-core or 4 E-cores?
  • If the P-core has an 8-wide decoder and the E-core has a 9-wide clustered decoder, which one is more sensitive to “messy” code?
  • Why did Intel give the E-cores a massive L2 cache cluster?

The Interview Questions They’ll Ask

  1. “What is Intel Lunar Lake’s architectural strategy for mobile devices?”
  2. “Why is branch prediction accuracy different between P and E cores?”
  3. “Explain how thread affinity affects high-concurrency performance.”
  4. “What are the advantages of removing SMT from performance cores?”
  5. “How does the Ring Bus connect P and E cores?”

Hints in Layers

Hint 1: CPUID Detection Use CPUID Leaf 0x1A to identify core types: the core-type field (EAX bits 31:24) reports 0x40 for a P-core and 0x20 for an E-core.
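
Below is a minimal detection sketch, assuming GCC/Clang’s <cpuid.h> helper. The leaf describes the core currently executing the instruction, so pin the thread first (Hint 2 below).

// core_type.c — CPUID leaf 0x1A, core type in EAX bits 31:24
#include <cpuid.h>
#include <stdio.h>

int main(void) {
    unsigned eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (!__get_cpuid_count(0x1A, 0, &eax, &ebx, &ecx, &edx) || eax == 0) {
        puts("Hybrid core-type leaf not supported on this CPU");
        return 1;
    }
    unsigned core_type = eax >> 24;
    if (core_type == 0x40)      puts("P-core (Core / Lion Cove class)");
    else if (core_type == 0x20) puts("E-core (Atom / Skymont class)");
    else                        printf("Unknown core type: 0x%02x\n", core_type);
    return 0;
}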

Hint 2: Pinning API In Linux, use sched_setaffinity. In macOS, use thread_policy_set.
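
A minimal Linux pinning sketch is shown below; which CPU index maps to a P-core or an E-core varies by system, so verify with lscpu or the CPUID check above after pinning.

// pin.c — pin the calling thread to one logical CPU (Linux)
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int pin_to_cpu(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof(set), &set);   /* 0 = calling thread */
}

int main(void) {
    if (pin_to_cpu(0) != 0) { perror("sched_setaffinity"); return 1; }
    printf("Pinned to logical CPU 0; run the benchmark kernel from here\n");
    return 0;
}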

Hint 3: The Workload Run a variety of tasks: Integer math, Floating point, and Linked List walks. Measure IPC for each on both core types.

Hint 4: Power Monitoring If on Linux, read the RAPL interface (/sys/class/powercap/intel-rapl) to measure the “Energy Per Instruction” (EPI) for both cores.
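
A small sketch of the RAPL read is below; the domain path is an assumption (intel-rapl:0 is usually the package domain), it typically requires root, and the counter wraps, which this sketch ignores.

// rapl.c — read package energy (microjoules) before and after a workload (Linux)
#include <stdio.h>

static long long read_energy_uj(void) {
    /* Assumed path; the exact RAPL domain name varies by platform */
    FILE *f = fopen("/sys/class/powercap/intel-rapl:0/energy_uj", "r");
    if (!f) return -1;
    long long uj = -1;
    if (fscanf(f, "%lld", &uj) != 1) uj = -1;
    fclose(f);
    return uj;
}

int main(void) {
    long long before = read_energy_uj();
    /* ... run the pinned workload here ... */
    long long after = read_energy_uj();
    if (before < 0 || after < 0) { puts("RAPL not readable (try running as root)"); return 1; }
    printf("Energy consumed: %lld uJ\n", after - before);
    return 0;
}

Divide the energy delta by the retired instruction count from your IPC measurement to get Energy Per Instruction for each core type.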

Books That Will Help

Topic             | Book                                    | Chapter
------------------|-----------------------------------------|---------------------
Intel Hybrid Arch | “Intel Optimization Manual”             | Ch. 18
Core Differences  | “AnandTech / Chips and Cheese Analysis” | Lunar Lake Deep Dive
Linux Scheduling  | “How Linux Works”                       | Ch. 4

Project Comparison Table

Project                  | Difficulty   | Time    | Depth of Understanding | Fun Factor
-------------------------|--------------|---------|------------------------|-----------
1. Human Pipeline Trace  | Beginner     | Weekend | Medium                 | 3/5
2. Branch Torture Test   | Intermediate | 1 Week  | High                   | 4/5
3. Spectre-lite          | Expert       | 2 Weeks | Extremely High         | 5/5
4. uOp Cache Prober      | Advanced     | 1 Week  | High                   | 4/5
5. Port Pressure Map     | Advanced     | 1 Week  | High                   | 4/5
6. Memory Disambiguation | Expert       | 2 Weeks | High                   | 3/5
7. ROB Boundary Finder   | Expert       | 1 Week  | High                   | 3/5
8. Macro-op Fusion       | Intermediate | 3 Days  | Medium                 | 4/5
9. L1 Bandwidth Stress   | Advanced     | 1 Week  | High                   | 3/5
10. Lunar Lake Profiler  | Advanced     | 1 Week  | High                   | 5/5

Recommendation

Based on your level:

  1. For the Absolute Beginner: Start with Project 1 (Human Pipeline Trace). You cannot understand modern out-of-order execution without first understanding the “simple” pipeline it evolved from.
  2. For the Aspiring Systems Engineer: Focus on Project 6 (Memory Disambiguation) and Project 7 (ROB Boundary Finder). These address the most common bottlenecks in real-world database and engine code.
  3. For the Security Obsessed: Go straight to Project 3 (Spectre-lite). It’s the most challenging but the most rewarding, as it bridges the gap between hardware architecture and cybersecurity.

Final Overall Project: The uArch-Aware JIT Engine

What you’ll build: A “Micro-JIT” (Just-In-Time) compiler for a simple math language. This JIT is unique: it is Micro-architecture Aware. It won’t just generate generic machine code; it will optimize itself on-the-fly based on the CPU it detects.

  • On Zen 5: It will use AVX-512 for parallel operations and unroll loops to match the 8-wide dispatch engine.
  • On Lunar Lake: It will detect if it’s running on a P-core or E-core. On E-cores, it will favor instructions with lower port pressure and smaller instruction lengths.
  • Across both: It will automatically align loop targets to 64-byte boundaries to maximize uOp Cache (DSB) hits and avoid cache-line split penalties.

Real World Outcome

You will have a high-performance JIT engine that outperforms generic compilers by tailoring code to the specific silicon. You’ll be able to demonstrate a 20-30% performance boost just by changing the code-generation strategy based on CPUID results.

Example Output:

$ ./uarch_jit --run my_math_script.math

[JIT: Target Detection]
Found: AMD Zen 5 (8-wide Front-end, 512-bit FPU)
[JIT: Optimization Strategy]
- Strategy: AVX-512 Vectorization
- Strategy: 8x Loop Unrolling
- Strategy: 64-byte Target Alignment

[Execution Result]
Result: 42.0000
Time: 450 cycles (Generic JIT took 680 cycles)
Performance Gain: 1.51x

The Core Question You’re Answering

“How do the world’s fastest runtimes (like V8 or the JVM) achieve such high performance?”

Modern runtimes aren’t just interpreters; they are sophisticated compilers that know the secrets of the hardware they run on. This project answers: How do I build a system that dynamically adapts its machine-code generation to exploit the hidden strengths of 2025 CPU architectures?

Concepts You Must Understand First

  1. JIT Compilation Basics
    • Generating machine code at runtime (mmap with PROT_EXEC).
  2. Instruction Selection & Scheduling
    • Choosing the “cheapest” instructions for a specific core.
  3. Runtime CPU Dispatching
    • Using the CPUID instruction to branch between different code-gen paths.
  4. Binary Patching
    • How to align instructions and insert NOPs for padding at runtime.

Thinking Exercise

The Adaptive Compiler

Imagine your JIT sees a loop that adds 8 numbers. Questions:

  • On a CPU with 256-bit SIMD, how many instructions do you generate?
  • On Zen 5 with 512-bit SIMD, how many?
  • If the CPU has a 576-entry ROB, how many iterations of the loop can you “look ahead” if you unroll it?

The Interview Questions They’ll Ask

  1. “What are the security risks of generating executable memory at runtime (W^X)?”
  2. “How do you detect CPU features in a portable way?”
  3. “Why does loop alignment matter for the uOp cache?”
  4. “Explain the trade-off between JIT compilation time and final code execution time.”

Hints in Layers

Hint 1: The Code Buffer Use mmap to allocate a page, then use mprotect to make it executable after you’ve written your machine code bytes.
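
A minimal code-buffer sketch (Linux/macOS, x86-64) is below; the emitted bytes are a hard-coded “return 42” stand-in for real code generation.

// jit_buf.c — write code into an RW page, then flip it to RX (W^X-friendly)
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void) {
    unsigned char code[] = { 0xB8, 0x2A, 0x00, 0x00, 0x00, 0xC3 };  /* mov eax, 42 ; ret */

    size_t size = 4096;
    void *buf = mmap(NULL, size, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) { perror("mmap"); return 1; }

    memcpy(buf, code, sizeof(code));

    /* Drop write permission before executing the buffer */
    if (mprotect(buf, size, PROT_READ | PROT_EXEC) != 0) { perror("mprotect"); return 1; }

    int (*fn)(void) = (int (*)(void))buf;
    printf("JIT returned: %d\n", fn());

    munmap(buf, size);
    return 0;
}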

Hint 2: Feature Flags Create a struct CPUFeatures and fill it once at startup. Use these flags inside your emit_instruction() functions.
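
One possible shape for that struct, assuming GCC/Clang builtins for feature detection (the field names are placeholders):

// features.c — fill the flags once, then branch on them inside the emitters
#include <stdbool.h>
#include <stdio.h>

typedef struct {
    bool has_avx2;
    bool has_avx512f;
    bool is_ecore;    /* fill via CPUID leaf 0x1A, as in Project 10 */
} CPUFeatures;

static CPUFeatures detect_features(void) {
    CPUFeatures f = {0};
    __builtin_cpu_init();                          /* required before __builtin_cpu_supports */
    f.has_avx2    = __builtin_cpu_supports("avx2");
    f.has_avx512f = __builtin_cpu_supports("avx512f");
    return f;
}

int main(void) {
    CPUFeatures f = detect_features();
    printf("avx2=%d avx512f=%d\n", f.has_avx2, f.has_avx512f);
    return 0;
}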

Hint 3: Alignment Logic To align a loop, check the current write_pointer. If (ptr % 64) != 0, emit a “multi-byte NOP” until you hit the boundary.
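
One way to do the padding, sketched below; single-byte 0x90 NOPs keep it simple, though production JITs emit the denser multi-byte NOP encodings.

// align.c — pad the write pointer up to the next 64-byte boundary
#include <stdint.h>
#include <stdio.h>

static uint8_t *pad_to_64(uint8_t *write_ptr) {
    while (((uintptr_t)write_ptr % 64) != 0)
        *write_ptr++ = 0x90;                       /* NOP */
    return write_ptr;
}

int main(void) {
    static uint8_t buf[256] __attribute__((aligned(64)));
    uint8_t *p = buf + 5;                          /* pretend 5 bytes are already emitted */
    p = pad_to_64(p);
    printf("emitted %td NOP bytes; write pointer is now 64-byte aligned\n", p - (buf + 5));
    return 0;
}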

Hint 4: Benchmarking Run your math script 1 million times and use RDTSC to compare your “uArch-aware” JIT against a “Naïve” JIT that uses the same code for every CPU.
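
A sketch of the timing harness, assuming GCC/Clang’s __rdtsc intrinsic; note that RDTSC counts reference cycles, not core cycles, and serialization (lfence) is omitted for brevity.

// bench.c — compare two generated functions with RDTSC
#include <stdio.h>
#include <x86intrin.h>

static unsigned long long time_calls(int (*fn)(void), long reps) {
    unsigned long long start = __rdtsc();
    for (long i = 0; i < reps; i++) fn();
    return __rdtsc() - start;
}

static int placeholder(void) { return 42; }        /* swap in the two JIT'd entry points */

int main(void) {
    long reps = 1000000;
    unsigned long long cycles = time_calls(placeholder, reps);
    printf("%.1f reference cycles per call\n", (double)cycles / reps);
    return 0;
}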

Books That Will Help

Topic              | Book                      | Chapter
-------------------|---------------------------|------------------
JIT Implementation | “Writing a C Compiler”    | Ch. 12 (Codegen)
Runtime Code Gen   | “Linkers and Loaders”     | Ch. 11
CPU Specifics      | “Modern Processor Design” | Ch. 4-6

Summary

This learning path covers Modern CPU Internals through 10 hands-on projects designed for the 2025 hardware landscape.

#  | Project Name          | Main Language | Difficulty   | Time Estimate
---|-----------------------|---------------|--------------|--------------
1  | Human Pipeline Trace  | C             | Beginner     | Weekend
2  | Branch Torture Test   | C++           | Intermediate | 1 Week
3  | Spectre-lite          | C             | Expert       | 2 Weeks
4  | uOp Cache Prober      | Assembly      | Advanced     | 1 Week
5  | Port Pressure Map     | C++           | Advanced     | 1 Week
6  | Memory Disambiguation | C             | Expert       | 2 Weeks
7  | ROB Boundary Finder   | Assembly      | Expert       | 1 Week
8  | Macro-op Fusion       | Assembly      | Intermediate | 3 Days
9  | L1 Bandwidth Stress   | C             | Advanced     | 1 Week
10 | Lunar Lake Profiler   | C++           | Advanced     | 1 Week
F  | uArch-Aware JIT       | C             | Expert       | 1 Month

Expected Outcomes

After completing these projects, you will:

  • Understand every stage of the 2025 CPU execution pipeline.
  • Be able to diagnose bottlenecks in front-end (decoding) and back-end (execution).
  • Know how to write code that avoids branch mispredictions and port contention.
  • Understand the architectural trade-offs between Zen 5 and Intel Lunar Lake.
  • Be familiar with speculative execution security and how to measure mitigation impact.