Modern CPU Internals: 2025 Deep Dive
Goal: Deeply understand the microarchitecture of modern CPUs (Zen 5, Lunar Lake, Apple M4) through hands-on implementation and measurement. You will move past the abstraction of "instructions" to the reality of micro-operations, out-of-order execution windows, TAGE branch predictors, and 512-bit vector pathways. By the end, you will be able to write code that achieves 90%+ of theoretical hardware peak performance.
Why Modern CPU Internals Matter in 2025
In 2025, we are in the era of "Architectural Specialization." The speed of a single thread is no longer determined by clock frequency (which has plateaued at ~5-6 GHz), but by how many instructions the CPU can execute in parallel (Instruction-Level Parallelism, ILP) and how well it can guess the future (Speculative Execution).
The gap between a developer who understands the pipeline and one who doesn't is no longer a few percent; it's an order of magnitude. Modern CPUs are effectively supercomputers on a chip, with 400+ entry instruction windows and multiple 512-bit vector engines.
High-level Code (C/C++). What you think happens: "Add x to y"
          ↓
x = x + y;   → one instruction, one cycle.
          ↓
Microarchitectural Reality. What actually happens (Zen 5/Lunar Lake):
          ↓
Fetch & Branch Prediction → Predict the "if" result, fetch 32-64 bytes of code.
Decode (x86 -> uOps)      → Break complex x86 into simple RISC uOps.
Rename & Allocate         → Map RAX to Physical Register #187 to break dependencies.
Schedule (Out-of-Order)   → Wait for data. Execute the MOMENT data is ready.
Execute (Port Pressure)   → Compete for the specific ALU that can do ADD.
Retire (In-Order)         → Put the result back in RAX only if the prediction was right.

Core Concept Analysis
1. The Front-End: x86 to uOp Translation
Modern CPUs are RISC machines wearing a CISC mask. x86 instructions are variable-length (1-15 bytes), making them a nightmare to decode. The "Front-End" is responsible for fetching these bytes and turning them into fixed-length "Micro-ops" (uOps) that the execution engine can understand.
[ L1 Instruction Cache ]
          ↓
[ Instruction Fetch Unit ]    ← Fetches 32-64 bytes/cycle
          ↓
[ Pre-Decode / Length Calc ]  ← Find where instructions start/end (hard!)
          ↓
[ Micro-op Decoders ]         ← Zen 5: dual 4-wide decoders
          ↓                     (turns ADD [mem], rax into LOAD + ADD)
[ uOp Cache (DSB) ]           ← The "shortcut": stores already-decoded uOps
          ↓
[ uOp Queue ]                 ← Buffers uOps before the storm

Key Insight: If your loop fits in the uOp Cache (DSB), you bypass the slow, power-hungry decoders entirely. This is why small, tight loops are dramatically faster. Zen 5 doubled down on this with a massive increase in uOp cache throughput.
2. The Out-of-Order (OoO) Engine & The ROB
The Reorder Buffer (ROB) is the "Waiting Room" of the CPU. It allows the CPU to look ahead hundreds of instructions to find things to do while waiting for slow RAM. This is "Out-of-Order" execution: instructions that are ready to run (operands available) jump ahead of instructions that are stalled.
          +---------------------------------------+
          |          REORDER BUFFER (ROB)         |
          | (Zen 5: 448 entries, Lion Cove: 576)  |
          +---------------------------------------+
            /         |         |         |        \
     [ Issue ]  [ Issue ]  [ Issue ]  [ Issue ]  [ Issue ]   ← Dispatch uOps
         |          |          |          |          |
     [ Port 0 ] [ Port 1 ] [ Port 2 ] [ Port 3 ] [ Port 4 ]  ← ALUs/FPUs
         |          |          |          |          |
          \---------+----------+----------+---------/
                               |
                    [ COMMIT / RETIRE ]  ← Make it "official", in-order

Key Insight: The size of the ROB determines the CPU's "Instruction Window." If you have a memory miss that takes 300 cycles, and the CPU can only look 448 uOps ahead, it will eventually run out of things to do and stall.
3. Branch Prediction: TAGE Predictors
Modern CPUs predict the outcome of every if statement and loop condition before they even know the values. If the prediction is right, the CPU keeps flying. If it's wrong, it has to "nuke" the entire pipeline and start over. In 2025, we use TAGE (Tagged Geometric) predictors.
Pattern: T, T, N, T, T, N...
[ Short History Table ]  ← Catches "T, N"
[ Medium History Table ] ← Catches "T, T, N"
[ Long History Table ]   ← Catches complex patterns across function calls
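Before taking on TAGE itself, it helps to simulate its smallest building block. The sketch below (plain C; the starting state and pattern are illustrative) runs one 2-bit saturating counter against the T, T, N pattern above and shows why a single counter cannot learn the period, which is exactly the gap the tagged history tables fill:

#include <stdio.h>

/* One 2-bit saturating counter: states 0-1 predict Not-Taken, 2-3 Taken.
   Real predictors index thousands of these by branch address and history. */
int main(void) {
    int counter = 2;                              /* start "weakly taken" */
    int pattern[] = {1, 1, 0, 1, 1, 0, 1, 1, 0};  /* T, T, N repeating */
    int n = sizeof pattern / sizeof *pattern, hits = 0;

    for (int i = 0; i < n; i++) {
        int prediction = (counter >= 2);          /* predict taken? */
        if (prediction == pattern[i]) hits++;
        /* saturating update: nudge toward the actual outcome */
        if (pattern[i] && counter < 3) counter++;
        if (!pattern[i] && counter > 0) counter--;
    }
    printf("2-bit counter on T,T,N: %d/%d correct\n", hits, n);
    /* Prints 6/9: every 'N' is missed. A history table keyed on the last
       two outcomes catches the rhythm and approaches 100%. */
    return 0;
}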

4. Port Pressure and Execution Units
A CPU isn't just one giant brain; it's a team of specialists. There are specific ALUs for addition, others for multiplication, and others for memory loads. These specialists are accessed through "Ports." If all your code is doing is ADD, you might be limited by the number of ADD ports, even if the rest of the CPU is idle. This is called Port Pressure.
Concept Summary Table
| Concept Cluster | What You Need to Internalize |
|---|---|
| uOps & Decoding | x86 is just a frontend; the backend is a RISC engine. Decode width is the first bottleneck. |
| Out-of-Order (OoO) | The "Instruction Window" (ROB size) determines how much memory latency the CPU can hide. |
| Speculation | Speculative execution is why modern CPUs are fast, but also why Spectre exists. |
| Execution Ports | Multiple ALUs exist, but they are specialized. "Port Pressure" is the 2nd bottleneck. |
| Memory Aliasing | The CPU only checks ~12 bits of address for quick overlap checks (4K Aliasing). |
| Heterogeneous (P/E) | Schedulers must map tasks based on core-specific decode widths and vector support. |
| Bandwidth vs Latency | High-throughput 512-bit vector units require massive L1 cache bandwidth to stay fed. |
Deep Dive Reading by Concept
Microarchitecture Fundamentals
| Concept | Book & Chapter |
|---|---|
| The 5-stage classic pipeline | Computer Systems: A Programmer's Perspective (CS:APP) – Ch. 4 |
| Superscalar & OoO Execution | Modern Processor Design by Shen & Lipasti – Ch. 4-5 |
| Front-end & uOp Caching | Agner Fog's Microarchitecture Manual – Section: "The Pipeline" |
| Intel/AMD Optimization | Intel 64 and IA-32 Optimization Manual – Ch. 2 (Architecture) |
Branch Prediction & Speculation
| Concept | Book & Chapter |
|---|---|
| TAGE and Advanced Prediction | Computer Architecture: A Quantitative Approach (Hennessy & Patterson) – Ch. 3 |
| Speculative Security (Spectre) | Practical Binary Analysis by Dennis Andriesse – Ch. 11 |
Memory Hierarchy & Vectorization
| Concept | Book & Chapter |
|---|---|
| Cache design and bandwidth | Computer Systems: A Programmer's Perspective – Ch. 6 |
| SIMD and Vector performance | The Art of Writing Efficient Programs by Fedor Pikus – Ch. 7 |
| Memory Disambiguation | Modern Processor Design – Ch. 6.4 |
Project 1: The Human Pipeline Trace
- File: MODERN_CPU_INTERNALS_2025_DEEP_DIVE.md
- Main Programming Language: C
- Alternative Programming Languages: Assembly, Rust
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 1. The "Resume Gold"
- Difficulty: Level 1: Beginner
- Knowledge Area: Pipelining / Hazards
- Software or Tool: objdump, gdb
- Main Book: "Computer Systems: A Programmer's Perspective" by Bryant & O'Hallaron
What you'll build: A CLI tool that parses a sequence of assembly instructions and generates a cycle-by-cycle "Gantt chart" of how they move through a 5-stage pipeline, highlighting stalls and forwarding.
Why it teaches CPU internals: You will move from thinking of code as "instant" to seeing it as a physical movement of data through latches. You'll understand why ADD R1, R2, R3 followed by SUB R4, R1, R5 causes a "Data Hazard."
Core challenges you'll face:
- Parsing Data Dependencies: You must track which registers are "written" and when they are "available."
- Modeling Forwarding Paths: Deciding if the result of an EX stage can be sent directly to the next instruction's EX stage.
- Visualizing the "Bubble": Representing a pipeline stall in a way that shows why the fetch unit stopped.
Real World Outcome
By completing this project, you will have a micro-architectural simulator that reveals the "heartbeat" of a CPU. You'll see how instructions flow through the 5 classic stages (Fetch, Decode, Execute, Memory, Write-back) and, more importantly, where they stop and why.
You'll be able to take real assembly code generated by gcc or clang and see exactly why it's not running at 1 instruction per cycle.
Example Output:
$ gcc -S -O0 my_loop.c -o my_loop.s
$ ./pipe_trace my_loop.s --forwarding=on
Analyzing:
1: ADD R1, R2, R3
2: SUB R4, R1, R5
3: LDR R6, [R1]
Cycle | IF | ID | EX | MEM | WB | Notes
------+-----+-----+-----+-----+-----+
1 | I1 | | | | | Fetching ADD
2 | I2 | I1 | | | | Fetching SUB, Decoding ADD
3 | I3 | I2 | I1 | | | I1 starts Execute
4 | I4 | I3 | I2 | I1 | | I2 enters EX (Forwarded R1 from I1)
5 | I5 | I4 | I3 | I2 | I1 | I1 writes back. I3 stalled? NO.
6 | ... | ... | ... | ... | ... |
[Pipeline Hazards Detected]
- RAW Hazard: I2 depends on I1 (R1). Status: Resolved via Forwarding (EX -> EX).
- Structural Hazard: None.
[Final Stats]
Total Cycles: 8
Instructions: 3
IPC (Instructions Per Cycle): 0.375
Efficiency: 37.5% of peak.
The Core Question You're Answering
"Why does adding a single instruction sometimes make my code twice as slow?"
Before you write any code, sit with this. Most developers think code is "instant" once it hits the CPU. This project proves that code is a physical movement of data. If one instruction is waiting for a result that hasn't been written yet, the whole factory floor (the pipeline) grinds to a halt. You are answering: How does the physical design of the CPU create wait-times (hazards)?
Concepts You Must Understand First
Stop and research these before coding:
- The 5-Stage Pipeline (RISC model)
- IF (Instruction Fetch): Getting the bits from memory.
- ID (Instruction Decode): Figuring out what the bits mean and reading registers.
- EX (Execute): The ALU doing the actual math.
- MEM (Memory Access): Reading or writing to RAM.
- WB (Write Back): Putting the result back in the register file.
- Book Reference: "Computer Systems" Ch. 4.1
- Data Hazards (RAW, WAR, WAW)
- Read-After-Write (RAW): Trying to read a value before it's finished being calculated.
- Forwarding (Bypassing): The "short-circuit" that lets a result jump from the end of the ALU directly back to the input of the next ALU.
- Book Reference: "Computer Organization and Design" Ch. 4.7
- Control Hazards (Branching)
- What happens to the instructions already in the pipeline when a jump occurs?
Questions to Guide Your Design
Before implementing, think through these:
- Modeling Time
- How do you represent "one cycle" in a C program?
- Should your simulator loop over instructions or loop over cycles? (Hint: Cycle-based is the only way to model parallel stages).
- Instruction Representation
- What metadata does an instruction need to carry as it moves through the stages? (Source registers, destination register, cycle it entered).
- Hazard Detection Logic
- How does the "Decode" stage know if it should stall? It needs to look "ahead" at the EX, MEM, and WB stages. How do you implement this "look-ahead"?
Thinking Exercise
Trace the Flow by Hand
Before coding, trace this sequence on paper for a 5-stage pipeline WITHOUT forwarding:
1: ADD R1, R2, R3
2: SUB R4, R1, R5
Questions:
- In which cycle is the result of ADD (R1) actually written to the register file?
- In which cycle does SUB need to read R1 from the register file?
- How many cycles of "bubbles" (stalls) are needed?
- Now, add "Forwarding" from the EX stage. How does the diagram change?
The Interview Questions They'll Ask
- "What is the difference between a structural hazard and a data hazard?"
- "How does a pipeline increase throughput but not necessarily decrease latency?"
- "Explain how forwarding (bypassing) works in a CPU pipeline."
- "What is a 'branch delay slot' and why was it used in older architectures?"
- "What happens to the pipeline when a page fault occurs in the MEM stage?"
Hints in Layers
Hint 1: The Pipeline Array
Represent the pipeline as an array of 5 elements: Instruction* pipeline[5];. In each "cycle", move pointers from index 4 down to 0 (Writeback to Fetch).
Hint 2: Move Backwards
When updating the pipeline for a new cycle, process the stages from WB to IF. If you move IF to ID first, you might move the same instruction through the whole pipeline in one cycle!
Hint 3: The Stall Condition
A stall happens in the ID stage. If pipeline[1] (ID) needs a register that pipeline[2] (EX) is about to write, and you don't have forwarding, you must "freeze" the IF and ID stages while letting EX, MEM, and WB continue.
Hint 4: Tracking Register Availability
Keep an array int register_ready_at_cycle[32]. Every time an instruction enters the pipeline, update when its destination register will be "legal" to read.
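A minimal cycle-driven skeleton that stitches Hints 1-4 together (C, no forwarding; the Instr layout, the two hard-coded instructions, and the cycle + 3 availability convention are assumptions of this sketch, not a prescribed design):

#include <stdio.h>

typedef struct { int id, dst, src1, src2; } Instr;
enum { IF, ID, EX, MEM, WB, NSTAGES };

int main(void) {
    Instr prog[] = { {1, 1, 2, 3},     /* I1: ADD R1, R2, R3 */
                     {2, 4, 1, 5} };   /* I2: SUB R4, R1, R5 */
    int nprog = 2, next = 0;
    Instr *pipe[NSTAGES] = {0};
    int reg_ready[32] = {0};           /* cycle each register becomes readable */

    for (int cycle = 1; cycle <= 9; cycle++) {
        pipe[WB] = pipe[MEM];          /* advance back-to-front (Hint 2) */
        pipe[MEM] = pipe[EX];

        Instr *d = pipe[ID];           /* no forwarding: check readiness here */
        int stall = d && (reg_ready[d->src1] > cycle || reg_ready[d->src2] > cycle);
        if (!stall) {
            pipe[EX] = d;
            if (d) reg_ready[d->dst] = cycle + 3;  /* readable after its WB */
            pipe[ID] = pipe[IF];
            pipe[IF] = (next < nprog) ? &prog[next++] : NULL;
        } else {
            pipe[EX] = NULL;           /* bubble; IF and ID freeze (Hint 3) */
        }

        printf("cycle %d:", cycle);
        for (int s = 0; s < NSTAGES; s++) {
            if (pipe[s]) printf("  I%d", pipe[s]->id);
            else printf("  --");
        }
        printf("%s\n", stall ? "   <- stall (RAW on R1)" : "");
    }
    return 0;
}

Adding forwarding amounts to relaxing the reg_ready convention (results usable right after EX instead of after WB), a one-line change worth experimenting with.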
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Pipeline Architecture | "Computer Systems: A Programmer's Perspective" | Ch. 4.4 |
| Hazard & Forwarding | "Computer Organization and Design" | Ch. 4.7-4.8 |
| RISC-V Implementation | "Digital Design and Computer Architecture" | Ch. 7 |
Project 2: The Branch Predictor Torture Test
- File: MODERN_CPU_INTERNALS_2025_DEEP_DIVE.md
- Main Programming Language: C++
- Alternative Programming Languages: C, Rust
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 1. The "Resume Gold"
- Difficulty: Level 2: Intermediate
- Knowledge Area: Branch Prediction / TAGE
- Software or Tool: perf, RDTSC
- Main Book: "Modern Processor Design" by Shen & Lipasti
What you'll build: A suite of micro-benchmarks that execute loops with conditional branches following specific patterns (alternating, nested loops, long-period sequences). You'll measure the exact cycle cost of each pattern to find the limits of your CPU's predictor.
Why it teaches CPU internals: You will realize that the CPU has a "memory" of your code's history. You'll discover that a pattern like T, T, N, T, T, N is essentially free, but a truly random sequence is 15-20x slower. You'll explore the limits of modern TAGE (Tagged Geometric) predictors found in Zen 5 and Lunar Lake.
Core challenges you'll face:
- Defeating the Compiler: Compilers love turning branches into CMOV (conditional moves), which have no prediction. You must use volatile or assembly to keep the branch.
- High-Resolution Timing: Using RDTSC (Read Time-Stamp Counter) to measure differences of just a few clock cycles.
- Pattern Complexity: Generating a pattern that is too long for the CPU's Branch History Table (BHT) or TAGE tables.
Real World Outcome
You will create a "Map of Predictability" for your CPU. This tool will identify the exact "saturation point" where your CPU's branch predictor (like the TAGE-L in Zen 5) can no longer find patterns in your code. You'll be able to quantify the cycle-cost of a misprediction, proving why "branchless" code is often superior for performance.
Example Output:
$ ./branch_torture --patterns all --warmup 10000
[Target: Intel Core Ultra 200V (Lunar Lake)]
[Testing Pattern: Periodic Binary]
Pattern: [T, N] -> 1.01 cycles/iter (Predictor: 100% Correct)
Pattern: [T, T, T, N] -> 1.01 cycles/iter (Predictor: 100% Correct)
Pattern: [T, T, N, N, T, N] -> 1.02 cycles/iter (Predictor: 100% Correct)
[Testing Pattern: Random Chaos]
Pattern: [Rand 50/50] -> 18.45 cycles/iter (Predictor: 0% Mastery)
[Testing Pattern: Deep History (TAGE Test)]
Pattern: [Period 1024] -> 1.05 cycles/iter (Predictor: Still Mastering)
Pattern: [Period 16384] -> 15.22 cycles/iter (Predictor: SATURATED)
# ANALYSIS: Your Skymont E-core predictor saturates at ~8192 bits of global history.
# Misprediction Penalty: 17.4 cycles (approx. pipeline depth).
The Core Question You're Answering
"Is my CPU's branch predictor smart enough to learn a pattern of 1,000 branches?"
Most developers think branch prediction is a simple "yes/no" guess based on the last result. In reality, it's a sophisticated machine-learning engine that can track thousands of previous outcomes to identify complex rhythms. This project answers: What are the physical limits of my CPU's "memory" of my code's behavior?
Concepts You Must Understand First
Stop and research these before coding:
- Two-Bit Saturating Counters
- Why is one bit of history not enough? (The "hysteresis" problem: changing prediction too fast on a single outlier).
- Book Reference: "Modern Processor Design" Ch. 5.2
- Global History Register (GHR)
- How does the outcome of a branch in a completely different function affect the prediction of the current branch?
- Book Reference: "Computer Architecture: A Quantitative Approach" Ch. 3.3
- TAGE Predictors (Zen 5 / Lunar Lake)
- What does "Tagged Geometric" mean? (Using multiple tables with exponentially increasing history lengths).
- Why do TAGE predictors allow for 1000+ branch history tracking?
- RDTSC and Timing Noise
- Using __builtin_ia32_rdtsc() vs rdtscp.
- Why you must pin your process to a single core (pthread_setaffinity_np) to avoid "core hopping" noise.
Questions to Guide Your Design
- Defeating the Compiler
- If you write if (x < 10) count++;, the compiler might turn it into a CMOV (Conditional Move). CMOV has no prediction; it just runs. How do you force the compiler to use a real branch (JCC)? (Hint: Use asm goto or volatile).
- Warming the Engine
- Why do you need a "Training Phase" before you start your timer? What happens to the predictor's state during the first 1,000 iterations?
- Overhead Subtraction
- How do you measure just the branch? You need a "Baseline" loop that does the same work but has no branch. How do you ensure the baseline isn't optimized away?
Thinking Exercise
The Hidden State
Trace this logic:
for (int i = 0; i < 1000000; i++) {
    if (i % 3 == 0) {   // T, N, N, T, N, N...
        do_work();
    }
}
Questions:
- After how many iterations will a "2-bit saturating counter" reach the "Strongly Not Taken" state for the N parts of the pattern?
- If the pattern was 1,000 elements long before repeating, would a simple local predictor catch it?
- Draw how the Global History Register would look after 4 iterations.
The Interview Questions They'll Ask
- "Why is a mispredicted branch so much more expensive than a predicted one?" (Pipeline flush).
- "What is speculative execution, and how does it relate to branch prediction?"
- "Explain the difference between a Branch Target Buffer (BTB) and a Branch History Table (BHT)."
- "How does a TAGE predictor handle 'aliasing' between different branches?"
- "Can one thread's branch history affect another thread on the same physical core (SMT/Hyper-threading)?"
Hints in Layers
Hint 1: The Measurement Loop
Wrap your test in a function and use RDTSC at the start and end. Run it for at least 100 million iterations to average out system jitter.
Hint 2: The Inline Assembly Branch
To guarantee a branch, use inline assembly (a numeric local label like 1: avoids duplicate-symbol errors if the block gets inlined more than once):
__asm__ volatile (
    "test %%eax, %%eax\n"
    "jz 1f\n"
    "add $1, %%ebx\n"
    "1:\n"
    : "+b"(counter) : "a"(condition) : "cc"
);
Hint 3: Training Pattern Generator
Create a buffer of uint8_t and fill it with your pattern. Access it using pattern[i & (N-1)] inside the loop. Make sure N is a power of 2 for maximum speed.
Hint 4: Toplev and Perf
Use perf stat -e branches,branch-misses ./your_program to verify that your "Random" test is actually missing 50% of the time, and your "Periodic" test is missing 0% of the time.
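Pulling the hints together, here is a minimal sketch of the measurement loop for GCC/Clang on x86-64. The volatile accumulator is what keeps the compiler from converting the branch into CMOV; core pinning, warmup, and the perf cross-check above are left out for brevity:

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <x86intrin.h>   /* __rdtsc */

#define N 4096           /* pattern period; power of two for cheap masking */

int main(void) {
    static uint8_t pattern[N];
    for (int i = 0; i < N; i++)
        pattern[i] = rand() & 1;       /* 50/50 random: the worst case */

    volatile uint64_t sink = 0;        /* volatile access forces a real JCC */
    const uint64_t iters = 100000000ull;

    uint64_t t0 = __rdtsc();
    for (uint64_t i = 0; i < iters; i++) {
        if (pattern[i & (N - 1)])      /* the branch under test (Hint 3) */
            sink++;
    }
    uint64_t t1 = __rdtsc();

    printf("%.2f cycles/iter (sink=%llu)\n",
           (double)(t1 - t0) / iters, (unsigned long long)sink);
    return 0;
}

Swap the rand() fill for a short repeating pattern and the cycles/iter figure should collapse toward ~1, reproducing the gap in the example output.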
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Branch Prediction Theory | "Computer Architecture: A Quantitative Approach" | Ch. 3.3 |
| Predictor Implementation | "Modern Processor Design" | Ch. 5.2 |
| Intel/AMD Specifics | "Agner Fog's Microarchitecture Manual" | "Branch Prediction" section |
Project 3: Speculative Side-Channel Explorer (Spectre-lite)
- File: MODERN_CPU_INTERNALS_2025_DEEP_DIVE.md
- Main Programming Language: C
- Alternative Programming Languages: Assembly
- Coolness Level: Level 5: Pure Magic
- Business Potential: 3. The "Service & Support" Model
- Difficulty: Level 4: Expert
- Knowledge Area: Security / Speculative Execution
- Software or Tool: clflush, mfence, RDTSCP
- Main Book: "Practical Binary Analysis" by Dennis Andriesse
What you'll build: A demonstration of the Spectre Variant 1 attack. You will write code that leaks a "secret string" from a protected area of memory, even though the secret is never logically accessed by your program. This project proves that software isolation can be broken by hardware performance optimizations.
Why it teaches CPU internals: You will realize that "Code" doesn't just do what's on the page. You'll see that the CPU speculatively executes instructions that were logically impossible to reach, and you'll learn how to "detect" the footprints left behind in the L1 cache.
Core challenges you'll face:
- Defeating the "Bounds Check": You must train the branch predictor to expect valid indices, then suddenly pass an invalid one.
- Cache Timing: Distinguishing between an L1 cache "hit" (~4 cycles) and a "miss" (~200+ cycles) with enough precision to recover data.
- Microarchitectural Noise: Handling the fact that other processes and OS interrupts can "pollute" the cache while you're measuring.
Real World Outcome
You will demonstrate a "ghost" in the machine. You'll leak a "secret string" from a protected area of memory, even though the secret is never logically accessed by your program. This project proves that software isolation (like sandboxing) can be broken by hardware behavior.
Example Output:
$ ./spectre_lite
[Attacker Initialization]
Target Secret: "SUPER_SECRET_KEY_2025"
Training Predictor with valid indices (0-15)...
[Phase 1: The Speculative Strike]
Triggering OOB read with index 40... Branch predicted TRUE!
Speculative Load: secret_array[40] ('S') -> loaded into L1 cache.
[Phase 2: The Side-Channel Measurement]
Scanning Cache...
Char 'S' (83): Hit (15 cycles) - EXPOSED!
Char 'U' (85): Hit (12 cycles) - EXPOSED!
Char 'P' (80): Hit (14 cycles) - EXPOSED!
Recovered Secret: "SUPER_SECRET"
# WARNING: This demonstrates why modern OSes use KPTI and LFENCE mitigations.
The Core Question You're Answering
"If my code has a bounds check (if (i < size)), why is it still unsafe?"
Modern CPUs prioritize speed over architectural purity. Speculative execution is so powerful that we accept its "leakiness" (leaving footprints in the cache) as a trade-off for performance. This project answers: How can microarchitectural side-effects be used to bypass logical security boundaries?
Concepts You Must Understand First
Stop and research these before coding:
- The Cache Timing Side-Channel
- What is the difference in cycles between L1 Hit (~4 cycles) and RAM Miss (~200+ cycles)?
- Speculative Reversion
- When a branch is mispredicted, the CPU "reverts" the registers. Why does it not revert the cache state? (Complexity and power cost).
- Flush+Reload Technique
- How do you clear a specific memory address from the cache (clflush) and later measure how long it takes to read it back?
- Memory Barriers (LFENCE)
- How does LFENCE stop the CPU from looking ahead?
- Book Reference: "Practical Binary Analysis" Ch. 11
Questions to Guide Your Design
- Probe Array Spacing
- Why must each entry in your "Probe Array" be 4096 bytes apart? (Hint: Page boundaries and prefetchers).
- Training vs. Attacking
- How do you "trick" the predictor? (Hint: 5 good requests, 1 bad request).
- Timer Precision
- Is clock() or gettimeofday() enough? (No, you need RDTSCP for cycle-level precision).
Thinking Exercise
The Speculative Path
Look at this logic:
if (index < 16) {
    uint8_t value = secret_data[index];
    temp &= probe_array[value * 4096];
}
Questions:
- If index is 100, the if is logically false. Does the CPU fetch secret_data[100]? (Yes, if the predictor says "True").
- If the CPU speculates, it uses the result of secret_data[100] to calculate an address in probe_array. Does that address get loaded into the cache?
- When the CPU realizes the branch was false, it throws away value. Is the entry in probe_array still in the cache?
The Interview Questions They'll Ask
- "What is the fundamental difference between Spectre and Meltdown?"
- "Why are side-channel attacks so difficult to mitigate in software?"
- "How does KPTI (Kernel Page Table Isolation) help protect against speculation attacks?"
- "What is a 'gadget' in the context of a Spectre attack?"
- "Can you perform this attack in a high-level language like JavaScript?"
Hints in Layers
Hint 1: The Setup
Allocate a probe_array of 256 * 4096 bytes. Use memset to ensure it's actually mapped in RAM.
Hint 2: Clearing the Cache
Use _mm_clflush(&probe_array[i * 4096]) for all 256 entries before you trigger the speculative read. This ensures that any "Hit" you find later was caused by the speculation.
Hint 3: The Attack Loop
Run a loop 30 times. For 29 iterations, pass a safe index. On the 30th, pass the index of the "Secret."
Hint 4: Measuring the Footprint
After the attack, time how long it takes to read probe_array[i * 4096] for i from 0 to 255. The index i that returns the fastest is the value of the secret byte!
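A sketch of just the Flush+Reload machinery from Hints 1, 2, and 4 (C for GCC/Clang on x86-64). The direct touch of probe_array below stands in for the speculative access, and the 80-cycle threshold is a placeholder you must tune per machine:

#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <x86intrin.h>   /* _mm_clflush, _mm_mfence, __rdtscp */

static uint8_t probe_array[256 * 4096];

/* Time one read of slot b; a fast read means it was cached. */
static uint64_t probe_time(int b) {
    unsigned aux;
    volatile uint8_t *p = &probe_array[b * 4096];
    _mm_mfence();
    uint64_t t0 = __rdtscp(&aux);
    (void)*p;
    uint64_t t1 = __rdtscp(&aux);
    return t1 - t0;
}

int main(void) {
    memset(probe_array, 1, sizeof probe_array);    /* fault every page in */
    for (int b = 0; b < 256; b++)
        _mm_clflush(&probe_array[b * 4096]);       /* Hint 2: evict all slots */
    _mm_mfence();

    (void)probe_array['S' * 4096];   /* stand-in for the speculative load */

    for (int b = 0; b < 256; b++) {  /* Hint 4: the fast slot is the secret */
        uint64_t t = probe_time(b);
        if (t < 80)                  /* threshold: tune with a calibration run */
            printf("byte %3d ('%c'): %llu cycles -> cached\n",
                   b, (b >= 32 && b < 127) ? b : '?', (unsigned long long)t);
    }
    return 0;
}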
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Side-Channel Attacks | "Practical Binary Analysis" | Ch. 11 |
| Cache Internals | "Computer Systems: A Programmer's Perspective" | Ch. 6.4 |
| Spectre Original Paper | "Spectre Attacks: Exploiting Speculative Execution" | Kocher et al. |
Project 4: The uOp Cache Prober
- File: MODERN_CPU_INTERNALS_2025_DEEP_DIVE.md
- Main Programming Language: Assembly (x86-64)
- Alternative Programming Languages: C (with __asm__)
- Business Potential: 1. The "Resume Gold"
- Difficulty: Level 3: Advanced
- Knowledge Area: Front-end / uOp Cache
- Software or Tool: perf, RDTSC
- Main Book: "Agner Fog's Microarchitecture Manual"
What you'll build: A tool that executes "hot loops" of varying sizes (from 10 instructions to 10,000 instructions) and measures the IPC (Instructions Per Cycle). You'll identify the exact point where the loop stops fitting in the uOp Cache (DSB) and the CPU switches to the slower legacy decoder.
Why it teaches CPU internals: You will learn that the bottleneck of your CPU is often the "Front-end" (the decoder). You'll see that Zen 5's dual uOp caches allow for much larger loops to remain fast compared to older architectures.
Core challenges you'll face:
- NOP Sleds: Using precise NOP or dummy instructions to grow the loop size without changing its behavior.
- Instruction Alignment: Aligning your loop start on a 64-byte boundary to avoid cache-line crossing penalties.
- Counter Overflow: Handling RDTSC overflows during long-running tests.
Real World Outcome
You will generate a "Performance DNA Profile" of your CPU's front-end. You'll identify the "Magic Boundary": the exact number of instructions where your code stops being "Fast" and starts being "Legacy." This knowledge allows you to size your critical hot-loops perfectly to stay within the uOp Cache (DSB).
Example Output:
$ ./uop_prober --arch zen5 --step 16
[Analyzing AMD Zen 5 Front-End]
Loop Size (uOps) | IPC | Path Taken | Efficiency
-----------------|--------|------------|-----------
64 | 6.2 | DSB | 100%
512 | 6.1 | DSB | 98%
2048 | 5.8 | DSB | 93%
4096 | 5.2 | DSB/MITE | 84% (Transition)
8192 | 2.4 | MITE | 38% (Legacy Decoder)
# RESULT: Your Zen 5 uOp cache saturates at ~4096 entries.
# Optimization: For maximum IPC, keep innermost loops under 32KB of binary code.
The Core Question You're Answering
"Why are small loops so much faster than big ones, even if they do the same work?"
The x86 instruction set is a variable-length nightmare (1-15 bytes). The "Legacy Decoder" is slow and power-hungry because it has to scan for instruction boundaries. The uOp Cache is the "cheat code" that makes modern x86 CPUs competitive with RISC by storing instructions after they've been decoded. This project answers: How big can my "Fast" path be before the CPU is forced back to its legacy roots?
Concepts You Must Understand First
Stop and research these before coding:
- MITE (Legacy Decoder)
- Why can the legacy decoder only handle ~4-6 instructions per cycle? (Serial bottleneck of length decoding).
- DSB (Decoded Stream Buffer / uOp Cache)
- Why is the uOp cache measured in "uOps" rather than "Bytes"?
- How does the CPU switch between MITE and DSB?
- Instruction Fetch Unit (IFU)
- How does the CPU pre-fetch 32-64 bytes of code per cycle into the L1i cache?
- Zen 5 Dual-Front End
- How does Zen 5 use two parallel decode paths to reach 8-wide dispatch?
Questions to Guide Your Design
- Instruction Variety
- Does a 1-byte NOP (0x90) take as much uOp cache space as a 10-byte MOV? (Hint: In the uOp cache, a uOp is a uOp, but original byte size affects the MITE path).
- Alignment Penalties
- What happens if your loop starts at the very end of a 64-byte cache line? Does it split the fetch window?
- Loop Control
- If your loop is only 10 instructions, the "Jump" at the end happens every few cycles. Does the branch predictor bottleneck the test before the uOp cache does?
Thinking Exercise
The Decoder Bottleneck
Imagine an x86 instruction that is 15 bytes long.
Questions:
- If your CPU fetches 32 bytes per cycle, how many of these can it fetch at once? (Only 2).
- If these are already in the uOp cache, does their 15-byte size still matter? (No, they are already uOps).
- Why do compilers like gcc add "junk" NOP instructions to align loop targets to 32- or 64-byte boundaries?
The Interview Questions They'll Ask
- "What is the Decoded I-Cache (uOp cache) and why is it critical for x86 performance?"
- "Why is x86 decoding harder than ARM decoding?"
- "What happens when a loop 'thrashes' the uOp cache?"
- "How does Zen 5's dual-decode pipe change loop optimization strategy?"
- "What is uOp Fusion and where does it occur?"
Hints in Layers
Hint 1: The Loop Body
Use a series of ADD RAX, 0 or PXOR XMM0, XMM0. These are simple, single-uop instructions that wonât bottleneck the execution ports.
Hint 2: Alignment
Use the .p2align 6 assembly directive to align your loop to a 64-byte boundary.
Hint 3: perf Counters
Use perf stat -e dsb_coverage.any,idq.mite_uops (Intel) or equivalent AMD counters to see exactly which path the CPU is taking.
Hint 4: Manual Unrolling
To test specific sizes, use a macro to repeat an instruction N times:
#define REP10(x) x x x x x x x x x x
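Building on that macro, here is a sketch of one measurement point (GCC/Clang, x86-64). The add $0, reg filler and the eight-register rotation are assumptions chosen so the body is single-uop instructions with no long dependency chains; since the body size is fixed at compile time, a real prober generates one variant per size:

#include <stdio.h>
#include <stdint.h>
#include <x86intrin.h>

/* 8 independent 1-uop adds; repeating the block keeps 8 parallel chains */
#define BLK8 "add $0, %%rax\n\t" "add $0, %%rbx\n\t" \
             "add $0, %%rcx\n\t" "add $0, %%rdx\n\t" \
             "add $0, %%rsi\n\t" "add $0, %%rdi\n\t" \
             "add $0, %%r8\n\t"  "add $0, %%r9\n\t"
#define REP4(x)  x x x x
#define REP12(x) REP4(x) REP4(x) REP4(x)   /* 12 x 8 = 96 instructions */

int main(void) {
    const uint64_t iters = 10000000ull;
    uint64_t t0 = __rdtsc();
    for (uint64_t i = 0; i < iters; i++)
        __asm__ volatile (REP12(BLK8)
            ::: "rax", "rbx", "rcx", "rdx", "rsi", "rdi", "r8", "r9");
    uint64_t t1 = __rdtsc();
    double cyc = (double)(t1 - t0) / iters;
    printf("96-uop body: %.1f cycles/iter, IPC ~= %.2f\n", cyc, 96.0 / cyc);
    return 0;
}

Grow the body (more REP blocks) and re-run: the IPC estimate should hold steady while the loop fits in the DSB, then sag once the legacy decoder takes over.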
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Front-end & uOp Cache | "Modern Processor Design" | Ch. 3.4 |
| Zen 5 Microarchitecture | "AMD Zen 5 Optimization Guide" | Front-end Section |
| x86 Instruction Formats | "Computer Systems: A Programmer's Perspective" | Ch. 3.1 |
Project 5: Execution Port Pressure Map
- File: MODERN_CPU_INTERNALS_2025_DEEP_DIVE.md
- Main Programming Language: C++ (with inline assembly)
- Alternative Programming Languages: Rust, Assembly
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 2. The "Micro-SaaS / Pro Tool"
- Difficulty: Level 3: Advanced
- Knowledge Area: Execution Units / Throughput
- Software or Tool: perf, pmu-tools (toplev)
- Main Book: "Intel 64 and IA-32 Architectures Optimization Reference Manual"
What you'll build: A tool that characterizes the "Execution Ports" of your CPU. You will write loops that execute mixtures of instructions (e.g., Integer ADD vs Floating Point MUL) and measure the throughput to find which instructions compete for the same hardware resources.
Why it teaches CPU internals: You will understand that "the CPU" is actually a collection of specialized "Execution Units." You'll discover why you can do 4-6 additions per cycle but only 1 division, and how mixing independent instruction types (Int + FP) can lead to higher performance.
Core challenges you'll face:
- Identifying Port Binding: Creating "Conflict Tests" (two instructions that use the same port) vs "Parallel Tests" (two instructions that use different ports).
- Managing Dependency Chains: Ensuring your test instructions don't depend on each other (RAW hazards) so you are measuring port limits, not latency limits.
- Interpreting PMU Counters: Learning to read UOPS_EXECUTED.PORT_0, PORT_1, etc. on Intel or OP_DISPATCH on AMD.
Real World Outcome
You will create a "Port Conflict Matrix" for your specific processor. This tool reveals the "floor plan" of your CPU's execution engine. You'll know exactly which instruction types (e.g., SIMD vs Integer) can run in parallel and which ones are fighting for the same specialized ports.
Example Output:
$ ./port_mapper --arch zen5
Analyzing Throughput (uOps/cycle):
- [ADD, ADD]: 4.0 uOps/c (Status: Perfect Parallelism)
- [ADD, MUL]: 4.0 uOps/c (Status: Perfect Parallelism)
- [MUL, MUL]: 1.0 uOps/c (CONFLICT: Port 1 Only)
- [FADD, FADD]: 2.0 uOps/c (Parallel Vector Pipes)
Conclusion for Zen 5:
ALUs: 4 main ports (0-3)
IMUL Bottleneck: Dedicated to Port 1.
# Strategy: Spread IMUL instructions out; they can't be issued together!
The Core Question You're Answering
"Why is my code slow even though CPU usage is only 50%?"
A CPU can be "busy" even if its transistors are idle, simply because every instruction is fighting for the same specialized "door" (port) to the ALU. This project answers: How are the execution units of my CPU physically wired to the instruction scheduler?
Concepts You Must Understand First
Stop and research these before coding:
- Superscalar Execution
- What does "8-wide issue" mean in hardware? (The ability to send 8 uOps to ports in a single cycle).
- Execution Ports vs. Functional Units
- Why do we use ports as an abstraction? (One port might lead to an Adder and a Multiplier).
- Zen 5 Integer Engine
- Why did AMD move to 6 Integer ALUs and 4 AGUs?
- Port Binding
- The "Port Conflict" test: If two instructions take 2x as long as one, they share a port.
Questions to Guide Your Design
- Dependency Chains
- If Instruction B uses the result of Instruction A, they must run one after another (latency). How do you ensure your instructions are 100% independent so you measure throughput? (Hint: Use different registers for everything).
- Instruction Mix
- If you mix ADD and SUB, do they compete? Check the "ALU port map" in your optimization manual.
- If you mix
- Register Pressure
- If you use 50 different registers to avoid dependencies, will you hit the limit of the Physical Register File (PRF)?
Thinking Exercise
The Port Bottleneck
Your hypothetical CPU has 4 ports.
- Ports 0, 1, 2, 3 can all do ADD.
- Only Port 0 can do DIV.
Questions:
- What is the max IPC if your code is 100% ADD? (4.0).
- What is the max IPC if your code is 100% DIV? (1.0).
- What happens if your code is 50% ADD and 50% DIV? (Port 0 is busy with DIV, leaving only 3 ports for the ADDs).
The Interview Questions They'll Ask
- "What is an execution port and why do modern CPUs have so many?"
- "Explain the difference between instruction latency and throughput."
- "What is Port Contention and how do you detect it in code?"
- "How does the scheduler decide which port to send a uOp to?"
- "Compare Zen 5 vs Intel Lion Cove port layouts (Hint: Both are widening significantly)."
Hints in Layers
Hint 1: Independent Accumulators
Use different registers for every instruction.
Example: ADD RAX, 1; ADD RBX, 1; ADD RCX, 1; ...
Hint 2: The Conflict Test
Measure the cycles for 1,000 instructions of type A. Then 1,000 of type B. Then 1,000 of a mixed A, B, A, B sequence. If the mixed sequence is faster than the sum, they are on different ports.
Hint 3: Use asm volatile
C++ compilers will "optimize" your test by noticing you're adding to registers you never use. Use asm volatile to force the CPU to actually perform the work.
Hint 4: Intel IACA / llvm-mca
Use the llvm-mca tool to predict what your results should be. If your real code is slower, you might have a hidden dependency.
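A sketch of Hint 2's conflict test in C with inline assembly (GCC/Clang, x86-64). IMUL stands in for a port-limited instruction and the register choices are arbitrary; verify with objdump that the asm survives optimization:

#include <stdio.h>
#include <stdint.h>
#include <x86intrin.h>

int main(void) {
    const uint64_t n = 50000000ull;
    uint64_t t0, t1;

    t0 = __rdtsc();               /* four independent adds per iteration */
    for (uint64_t i = 0; i < n; i++)
        __asm__ volatile (
            "add $1, %%rax\n\t" "add $1, %%rbx\n\t"
            "add $1, %%rcx\n\t" "add $1, %%rdx\n\t"
            ::: "rax", "rbx", "rcx", "rdx");
    t1 = __rdtsc();
    printf("4x independent ADD : %.2f cycles/iter\n", (double)(t1 - t0) / n);

    t0 = __rdtsc();               /* four independent multiplies per iteration */
    for (uint64_t i = 0; i < n; i++)
        __asm__ volatile (
            "imul $3, %%rax, %%rax\n\t" "imul $3, %%rbx, %%rbx\n\t"
            "imul $3, %%rcx, %%rcx\n\t" "imul $3, %%rdx, %%rdx\n\t"
            ::: "rax", "rbx", "rcx", "rdx");
    t1 = __rdtsc();
    printf("4x independent IMUL: %.2f cycles/iter\n", (double)(t1 - t0) / n);
    /* If the IMUL loop runs ~4x slower while the ADD loop sustains
       ~1 cycle/iter, the multiplies are serializing on a single port. */
    return 0;
}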
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Execution Units | "Inside the Machine" | Ch. 4 |
| Throughput Tables | "Agner Fog's Instruction Latencies" | PDF Tables |
| Scheduler Design | "Modern Processor Design" | Ch. 4.4 |
Project 6: Memory Disambiguation Probe
- File: MODERN_CPU_INTERNALS_2025_DEEP_DIVE.md
- Main Programming Language: C
- Alternative Programming Languages: Assembly
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 1. The "Resume Gold"
- Difficulty: Level 4: Expert
- Knowledge Area: Load/Store Buffers / Speculation
- Software or Tool: RDTSC
- Main Book: "Modern Processor Design" by Shen & Lipasti
What you'll build: A tool that tests the CPU's Memory Disambiguation engine: the logic that decides if a LOAD can safely pass a STORE in the out-of-order engine. You'll use addresses that "alias" (look the same to the CPU's quick check) to trigger performance penalties.
Why it teaches CPU internals: You will understand that memory access is speculative too! You'll see that the CPU guesses that ptrA and ptrB are different, only to "panic" and restart the instruction if they actually overlap.
Core challenges you'll face:
- 4K Aliasing: Understanding that x86 CPUs only check the bottom 12 bits of an address to decide if two pointers might overlap.
- Store-to-Load Forwarding (STLF): Measuring the speed difference when a LOAD reads directly from the STORE buffer instead of the L1 cache.
- Precise Timing: Detecting a 10-20 cycle "Store Buffer Stall."
Real World Outcome
You will generate a "Disambiguation Heatmap" for your CPU. This tool reveals the "Blind Spot" of your CPU's memory predictor. You'll identify the exact address offsets (like the infamous 4K Aliasing) that cause the CPU to stall its out-of-order engine, helping you avoid performance pitfalls in high-performance data structures.
Example Output:
$ ./alias_probe --arch zen5
[Testing Memory Aliasing Bottleneck]
Offset (Bytes) | Latency (Cycles) | Status
---------------|------------------|-------
0 | 3 | STLF (Fast Forwarding)
64 | 5 | L1 Hit
2048 | 5 | L1 Hit
4096 | 22 | 4K ALIASING STALL!
8192 | 21 | 4K ALIASING STALL!
# ANALYSIS: Your Zen 5 CPU uses a 12-bit quick check for address aliasing.
# Performance Tip: Ensure pointers in tight read-after-write loops do not have the same bottom 12 bits.
The Core Question You're Answering
"Why is writing to array[0] and then reading from array[1024] slower than it should be?"
The CPU's out-of-order engine is a "look-ahead" machine. Memory addresses are the one thing it can't always predict perfectly, because an address isn't known until the moment it is computed. This project answers: How does the CPU guess if two memory locations overlap, and what happens when it's wrong?
Concepts You Must Understand First
Stop and research these before coding:
- Load Buffer & Store Buffer
- How does the CPU track "in-flight" memory operations that haven't hit the cache yet?
- Memory Disambiguation
- How does the CPU decide if a LOAD is "ready" if there are STOREs still waiting for their addresses to be calculated?
- Store-to-Load Forwarding (STLF)
- How does data get from a STORE to a LOAD without ever touching the L1 cache? (The data is passed in a register-like way inside the store buffer).
- 4K Aliasing (Address Aliasing)
- Why do modern x86 CPUs only compare the lower 12 bits of an address for the initial "safety check"?
Questions to Guide Your Design
- Creating the Conflict
- How can you make the STORE address take a long time to calculate (e.g., a long dependency chain) while the LOAD address is known immediately? (This forces the CPU to guess).
- Aliasing the Bits
- What happens if you use ptr and ptr + 4096? Why does the hardware think they are the same? (Hint: The bottom 12 bits are identical).
- The Stall Cost
- How many cycles does a "Disambiguation Violation" cost compared to a normal L1 hit?
Thinking Exercise
The Aliasing Trap
Consider this code:
*ptrA = 100;
int x = *ptrB;
Questions:
- If ptrA == ptrB, the CPU must wait for the store to finish.
- If ptrA != ptrB, the CPU can start loading x immediately, even before *ptrA = 100 finishes.
- What if ptrB = ptrA + 4096? The bottom 12 bits are the same. Does the CPU realize they are different pages immediately? (No, it stalls while it performs the full address check).
The Interview Questions They'll Ask
- "What is Memory Disambiguation and why is it necessary for OoO execution?"
- "Explain the '4K Aliasing' problem in x86 architectures."
- "What is Store-to-Load Forwarding and when does it fail?" (Fails on unaligned access or partial overlaps).
- "How does the 'Load/Store Buffer' handle speculative memory accesses?"
- "What is a 'Memory Dependency Violation' and what is the cost of fixing it in hardware?"
Hints in Layers
Hint 1: The Address Mask
To trigger 4K aliasing, use two pointers p1 and p2 where ((uintptr_t)p1 & 0xFFF) == ((uintptr_t)p2 & 0xFFF).
Hint 2: The Slow Store
Force the store address to be slow by making it depend on a long chain of IMUL or DIV instructions. This prevents the "Store Address Generation" unit from knowing the address until the load is already ready to run.
Hint 3: Use RDTSC
Measure the time for the store+load sequence. Compare the case where the offset is 64 bytes (no alias) to 4096 bytes (alias).
Hint 4: Prevent Prefetching
Use _mm_clflush to ensure the data isn't sitting in the L1 cache in a way that masks the buffer logic you're trying to measure.
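Putting Hints 1 and 3 together, a minimal sketch (C for GCC/Clang on x86-64; the averaging loop and buffer size are assumptions, and the observed penalty varies by microarchitecture):

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <x86intrin.h>

/* Time a back-to-back store/load pair at a fixed address offset. */
static double time_pair(volatile char *store_p, volatile char *load_p) {
    const int iters = 1000000;
    uint64_t t0 = __rdtsc();
    for (int i = 0; i < iters; i++) {
        *store_p = (char)i;      /* store ... */
        (void)*load_p;           /* ... immediately followed by a load */
    }
    return (double)(__rdtsc() - t0) / iters;
}

int main(void) {
    char *buf = aligned_alloc(4096, 1 << 20);
    if (!buf) return 1;
    /* offset 64: different low bits, the quick check can prove no overlap */
    printf("offset   64: %.1f cycles/iter\n", time_pair(buf, buf + 64));
    /* offset 4096: identical bottom 12 bits, a 4K-alias candidate */
    printf("offset 4096: %.1f cycles/iter\n", time_pair(buf, buf + 4096));
    free(buf);
    return 0;
}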
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Load/Store Units | "Modern Processor Design" | Ch. 6.4 |
| Memory Speculation | "Computer Architecture" | Ch. 3.9 |
| x86 Memory Model | "Intel Optimization Manual" | Memory Subsystem section |
Project 7: The Reorder Buffer (ROB) Boundary Finder
Real World Outcome
You will determine the physical limit of your CPU's "Instruction Horizon." You'll discover exactly how many instructions (uOps) your CPU can keep "in the air" while waiting for a slow memory load. This project proves why modern CPUs like Lunar Lake (576 entries) can hide more latency than older ones.
Example Output:
$ ./rob_prober --max-window 1024
[Measuring Reorder Buffer Capacity: Intel Lion Cove]
Window Size (uOps) | Latency (Cycles) | Status
-------------------|------------------|-------
128 | 305 | Waiting for L3 Miss
256 | 308 | Waiting for L3 Miss
512 | 315 | Window still open
576 | 320 | SATURATION POINT
640 | 610 | ROB FULL (Stalled Front-end)
# RESULT: Your Lion Cove ROB size is ~576 entries.
# Insight: This is the limit of your CPU's "patience."
The Core Question You're Answering
"How far into the future can my CPU see?"
Modern CPUs like Zen 5 have ROBs with 400+ entries. This means the CPU is processing instructions that are 400 lines "ahead" of where the current result is being committed to memory. This project answers: What is the maximum number of independent operations I can pack into a loop to hide a 300-cycle RAM access?
Concepts You Must Understand First
Stop and research these before coding:
- The Reorder Buffer (ROB)
- What is the difference between "Issue," "Execute," and "Commit/Retire"?
- Why must instructions commit in-order even if they execute out-of-order?
- Instruction Latency vs. Throughput
- How does a 300-cycle load "clog" the retirement engine?
- Speculative Execution Retirement
- What happens to a ROB entry if the instruction was on a mispredicted branch?
- The Physical Register File (PRF)
- How does the PRF size limit the ROB? (If you run out of physical registers, the ROB can't grow).
Questions to Guide Your Design
- Creating the "Wall"
- How do you create an instruction that definitely takes 300 cycles? (Hint: Pointer chasing through a large linked list that misses L3 cache).
- Independent "Filler"
- How do you create uOps that have zero dependencies on the slow load and zero dependencies on each other? (Hint: Use 16+ different registers).
- Detecting the Saturation
- Why does the total time suddenly double when you exceed the ROB size? (Hint: The next slow load can't enter the ROB until the first one retires).
Thinking Exercise
The Full Buffer
Imagine a ROB with 10 slots.
- Slot 1: A LOAD that takes 100 cycles.
- Slots 2-10: ADD instructions that take 1 cycle.
Questions:
- In cycle 5, are slots 2-10 "done" executing? (Yes).
- Can slots 2-10 be removed from the ROB in cycle 6? (No, because Slot 1 is still busy and retirement is in-order).
- If the CPU encounters instruction 11 in cycle 7, can it enter the ROB? (No, the buffer is full).
The Interview Questions They'll Ask
- "What is the primary function of the Reorder Buffer?"
- "How does the ROB size affect the CPU's ability to hide memory latency?"
- "Explain the process of 'retirement' or 'commit' in an OoO processor."
- "What is the 'Instruction Window' and how is it calculated?"
- "Why can't we just make the ROB infinite?" (Power, area, and complexity of the renaming logic).
Hints in Layers
Hint 1: The Pointer Chase
Create a linked list of 100MB. Each node's next pointer should be at least 4KB away from the previous one to avoid the hardware prefetcher. This creates a "Memory Wall."
Hint 2: The Filler Block
Between two pointer-chase loads, insert N number of ADD RAX, 0. Use as many different registers as possible (RAX, RBX, RCX, RDX, RSI, RDI, R8-R15).
Hint 3: Measuring the Jump
Loop the whole test 1,000 times. Plot N (number of filler uOps) vs Cycles. You will see a flat line that suddenly jumps vertically when N > ROB_SIZE.
Hint 4: Zen 5 vs Intel
On Zen 5, look for ~448. On Intel Lion Cove, look for ~576. If you hit the jump at ~180-200, you are likely hitting the Physical Register File limit instead of the ROB limit.
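A sketch combining the hints (C for GCC/Clang on x86-64). FILLER_BLOCKS, the list size, and the Sattolo shuffle are assumptions of this sketch; rebuild with different -DFILLER_BLOCKS=... values and plot cycles per chase against filler uOps to find the jump:

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <x86intrin.h>

#ifndef FILLER_BLOCKS
#define FILLER_BLOCKS 16         /* 8 uOps per block; vary this per build */
#endif
#define NODES (1 << 22)          /* 32 MB of pointers: misses most L3 caches */

int main(void) {
    void **list = malloc(NODES * sizeof *list);
    size_t *idx = malloc(NODES * sizeof *idx);
    if (!list || !idx) return 1;
    for (size_t i = 0; i < NODES; i++) idx[i] = i;
    for (size_t i = NODES - 1; i > 0; i--) {   /* Sattolo: one random cycle */
        size_t j = rand() % i, t = idx[i];
        idx[i] = idx[j]; idx[j] = t;
    }
    for (size_t i = 0; i < NODES; i++)         /* link nodes in shuffled order */
        list[idx[i]] = &list[idx[(i + 1) % NODES]];

    void **p = list;
    const int iters = 200000;
    uint64_t t0 = __rdtsc();
    for (int i = 0; i < iters; i++) {
        p = (void **)*p;                        /* the serial, cache-missing load */
        for (int k = 0; k < FILLER_BLOCKS; k++) /* independent 1-uop filler */
            __asm__ volatile (
                "add $0, %%rax\n\t" "add $0, %%rbx\n\t"
                "add $0, %%rcx\n\t" "add $0, %%rdx\n\t"
                "add $0, %%rsi\n\t" "add $0, %%rdi\n\t"
                "add $0, %%r8\n\t"  "add $0, %%r9\n\t"
                ::: "rax", "rbx", "rcx", "rdx", "rsi", "rdi", "r8", "r9");
    }
    printf("filler=%4d uOps: %.0f cycles/chase (p=%p)\n",
           FILLER_BLOCKS * 8, (double)(__rdtsc() - t0) / iters, (void *)p);
    return 0;
}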
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| ROB Architecture | "Modern Processor Design" | Ch. 4.3 |
| Register Renaming | "Computer Architecture" | Ch. 3.2 |
| Intel Windows | "Intel Optimization Manual" | Ch. 2.2 |
Project 8: Macro-op Fusion Detector
- File: MODERN_CPU_INTERNALS_2025_DEEP_DIVE.md
- Main Programming Language: Assembly
- Alternative Programming Languages: C
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 2. The "Micro-SaaS / Pro Tool"
- Difficulty: Level 2: Intermediate
- Knowledge Area: Decoding / Instruction Fusion
- Software or Tool: perf (uops_retired.slots), RDTSC
- Main Book: "Agner Fog's Optimization Manual"
What you'll build: A benchmark that compares the retirement throughput of common instruction pairs (e.g., CMP and JZ) against non-fusable pairs. You'll measure if the CPU treats the pair as a single "Macro-op" in the pipeline.
Why it teaches CPU internals: You'll understand how the CPU "shrinks" your code in the front-end to save energy and bandwidth. You'll learn which coding patterns are "hardware-friendly" for Zen 5 vs Lunar Lake.
Core challenges you'll face:
- Precise retirement counting: Using hardware performance counters to see the number of "uops retired" vs "instructions retired."
- Isolating Fusion from Caching: Ensuring the uop cache doesn't mask the fusion effect.
- Architectural differences: Finding why certain fusions work on Intel (Lion Cove) but not on AMD (Zen 5), or vice versa.
Real World Outcome
You will generate a "Fusion Catalog" for your processor. This report identifies which instruction pairs (like TEST/JZ or CMP/JE) your CPU's decoder can physically "glue" into a single operation. Maximizing fusion effectively increases your CPU's fetch and retirement width by up to 2x without increasing power consumption.
Example Output:
$ ./fusion_detect --arch lunar_lake
[Instruction Pair Analysis: Intel Lion Cove]
Pair: TEST EAX, EAX; JZ label; -> 1 uOp retired (SUCCESS: FUSED)
Pair: CMP RAX, 100; JG label; -> 1 uOp retired (SUCCESS: FUSED)
Pair: ADD RAX, 1; JZ label; -> 1 uOp retired (SUCCESS: FUSED)
Pair: INC RAX; JZ label; -> 1 uOp retired (SUCCESS: FUSED)
Pair: MOV RAX, [RBX]; JZ label; -> 2 uOps retired (FAILED: Load cannot fuse with Jump)
# EFFICIENCY GAINED: 50% reduction in Front-end Pressure for control flow.
# Tip: Modern Intel cores fuse almost all ALU+Branch pairs.
The Core Question You're Answering
"Does the order of my CMP and JMP instructions actually matter?"
Yes. Modern CPUs have specific logic in the decoder that looks for pairs of instructions to "glue" together. This is "Macro-op Fusion." This project answers: How can I write assembly that the decoder can "shrink" to save energy and bandwidth?
Concepts You Must Understand First
Stop and research these before coding:
- Macro-op Fusion (MOP Fusion)
- How does the decoder merge two x86 instructions into one internal uop?
- Micro-op Fusion
- How does the CPU merge a memory load and an ALU operation into one uop?
- The Decoder's "Peephole"
- How many instructions "ahead" does the decoder look to find fusion candidates?
- Instruction Retirement Slots
- Why is counting "uops retired" the key to detecting fusion?
Questions to Guide Your Design
- Instruction Spacing
- If you put a NOP between CMP and JZ, does fusion still happen? (Hint: No, they must be adjacent).
- Register Usage
- Does CMP RAX, RBX; JZ label; fuse the same way as CMP RAX, 0; JZ label;?
- The Counter Signal
- Which specific perf event allows you to see the "fused" state? (Hint: uops_retired.slots vs inst_retired.any).
Thinking Exercise
The Fusion Logic
The CPU decoder sees:
CMP RAX, RCX
JE label
Questions:
- If these are fused, how many entries do they take in the ROB? (Only 1).
- Does the execution unit (ALU) treat them as a single operation (compare-and-branch)? (Yes).
- Why would fusing them save power? (Fewer bits to move through the pipeline).
The Interview Questions They'll Ask
- "What is the difference between Micro-op fusion and Macro-op fusion?"
- "Why is fusion beneficial for the front-end of the CPU?"
- "List three instruction pairs that are commonly fused in x86-64."
- "Does fusion happen in the decoder or in the uOp cache?"
- "How does the 'Instruction Decoder' width limit the number of fusions per cycle?"
Hints in Layers
Hint 1: The Measurement Loop
Execute 100 million pairs in a tight loop. Use perf to count total instructions and total uops. If uops < instructions, fusion occurred.
Hint 2: Subtraction of Loop Overhead
The loop itself has a DEC and JNZ. Measure a loop of just NOPs first to find the baseline cost of the loop control.
Hint 3: Boundary Crossings
Test if fusion fails if the first instruction is at the end of a 16-byte fetch window and the second is at the start of the next.
Hint 4: Intel vs AMD
AMD Zen 5 and Intel Lion Cove have different "Peephole" widths. Experiment with putting a non-fusing instruction between the pair to see when fusion breaks.
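A sketch of a kernel to run under perf, per Hint 1 (C with inline asm, GCC/Clang on x86-64; the exact uops event name varies by microarchitecture, and the loop's own compare-and-branch contributes counts you subtract per Hint 2):

#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint64_t x = 1;   /* nonzero, so the JZ below is never taken */
    for (uint64_t i = 0; i < 100000000ull; i++) {
        /* adjacent TEST + JZ: a classic macro-fusion candidate pair */
        __asm__ volatile (
            "test %0, %0\n\t"
            "jz 1f\n\t"
            "1:\n\t"
            :: "r"(x));
    }
    printf("done (x=%llu)\n", (unsigned long long)x);
    return 0;
}

Run it as, e.g., perf stat -e instructions,uops_retired.slots ./a.out and compare the two totals: if the pair fuses, retired uops fall well below retired instructions.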
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Fusion Rules | "Agner Fog's Optimization Manual" | Instruction Fusion section |
| Intel Front-end | "Intel Optimization Manual" | Ch. 2.1 |
| Zen 5 Front-end | "AMD Zen 5 Microarchitecture Manual" | Decoder section |
Project 9: L1 Bandwidth Stressor (Zen 5 focus)
Real World Outcome
You will attempt to saturate the massive 512-bit wide memory pipes of the Zen 5 architecture. You'll measure "Bytes per Cycle" and compare it to the theoretical peak (e.g., 512 GB/s). This project proves whether your code can actually "feed the beast" of modern AVX-512 vector units.
Example Output:
$ ./l1_stress --width 512 --unroll 8
[Zen 5 L1 Bandwidth Benchmark]
AVX-256 (2 Loads/cycle): 124.2 GB/s
AVX-512 (2 Loads/cycle): 248.8 GB/s
AVX-512 (3 Loads/cycle - Zen 5 Optimized): 480.5 GB/s
# RESULT: You are hitting 98% of the theoretical 512 GB/s bandwidth.
# Efficiency: This loop moves 64 bytes of data every 2 nanoseconds.
The Core Question You're Answering
"Why did AMD double the cache bandwidth on Zen 5, and can my code actually use it?"
Hardware features are useless if software can't feed them. To hit peak speed, you must understand alignment, vector width, and AGU (Address Generation Unit) limits. This project answers: How do I structure my memory access to match the physical "pipe" diameter of the L1 cache?
Concepts You Must Understand First
Stop and research these before coding:
- L1 Data Cache Ports
- How many loads and stores can Zen 5 handle in one cycle? (Hint: 3 Loads, 2 Stores).
- Vector Width (AVX2 vs AVX-512)
- Why does moving to 512-bit registers require doubling the bandwidth to keep the ALUs busy?
- Cache Line Alignment
- What is a "Split Load" and why is it a performance killer?
- AGU (Address Generation Units)
- Why does the CPU need a specialized unit just to calculate an address?
Questions to Guide Your Design
- Data Alignment
- Why must your buffers be aligned to 64-byte boundaries (posix_memalign)?
- Read/Write Ratio
- Is bandwidth higher for 100% Reads or a mix of Reads and Writes? (Check the port counts).
- Loop Unrolling
- If you don't unroll your loop, is the "branch" at the end the bottleneck? (Hint: Yes).
Thinking Exercise
The Pipe Diameter
Imagine the L1 cache as a water tank and the FPU as a thirsty engine.
- Zen 4 has a 32-byte "pipe."
- Zen 5 has a 64-byte "pipe."
Questions:
- If you use 32-byte (AVX2) instructions on Zen 5, do you get the benefit of the bigger pipe?
- How many instructions must you issue per cycle to "fill" the 64-byte pipe?
- What happens if the data is in the L2 cache instead of L1?
The Interview Questions They'll Ask
- "What is the difference between L1 bandwidth and L1 latency?"
- "How many Load/Store units does the Zen 5 architecture have?"
- "Explain why unaligned memory access is slower."
- "What is a Cache Bank Conflict?"
- "Does AVX-512 cause clock downclocking on Zen 5?" (Answer: No, Zen 5 is designed for it).
Hints in Layers
Hint 1: Aligned Allocation
Use aligned_alloc(64, size) to ensure your memory starts at the beginning of a cache line.
Hint 2: Intrinsics
Use _mm512_load_ps to move 64 bytes in one instruction. Ensure you use the "aligned" version of the load to avoid penalties.
Hint 3: Unrolling
Manually unroll your loop 8 times. This ensures the CPU spends 99% of its time moving data and 1% of its time checking the loop counter.
Hint 4: Parallel Pointers
Use 4 different pointers to 4 different memory areas to avoid "Store-to-Load" false dependencies in the buffer.
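A sketch of the unrolled, aligned load loop from Hints 1-3 (C intrinsics, compile with -mavx512f; the buffer size and XOR accumulators are assumptions: XOR's 1-cycle latency keeps the four accumulator chains from becoming the bottleneck that an FP add chain would be):

#include <stdio.h>
#include <stdint.h>
#include <immintrin.h>
#include <x86intrin.h>

#define BYTES 16384                 /* half of a 32 KB L1d stays resident */
static char buf[BYTES] __attribute__((aligned(64)));

int main(void) {
    __m512i a0 = _mm512_setzero_si512(), a1 = a0, a2 = a0, a3 = a0;
    const int passes = 100000;
    uint64_t t0 = __rdtsc();
    for (int p = 0; p < passes; p++) {
        for (size_t i = 0; i < BYTES; i += 256) {
            /* four independent aligned 64-byte loads per step */
            a0 = _mm512_xor_si512(a0, _mm512_load_si512(buf + i));
            a1 = _mm512_xor_si512(a1, _mm512_load_si512(buf + i + 64));
            a2 = _mm512_xor_si512(a2, _mm512_load_si512(buf + i + 128));
            a3 = _mm512_xor_si512(a3, _mm512_load_si512(buf + i + 192));
        }
    }
    uint64_t ticks = __rdtsc() - t0;
    /* RDTSC ticks at base frequency, so bytes/tick only approximates
       bytes/core-cycle; normalize with the actual core clock if needed. */
    printf("%.1f bytes/tick (checksum %lld)\n",
           (double)passes * BYTES / ticks,
           (long long)_mm512_reduce_add_epi64(
               _mm512_xor_si512(_mm512_xor_si512(a0, a1),
                                _mm512_xor_si512(a2, a3))));
    return 0;
}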
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Cache Bandwidth | "Computer Systems" | Ch. 6.4 |
| Zen 5 Design | "AMD Zen 5 Microarchitecture Whitepaper" | Cache section |
| SIMD Optimization | "The Art of Writing Efficient Programs" | Ch. 7 |
Project 10: Lunar Lake P vs E Core Profiler
Real World Outcome
You will generate an "Architectural Efficiency" report for Intel's Lunar Lake. You'll compare the performance of Lion Cove (Performance) and Skymont (Efficiency) cores on identical code. You'll discover the trade-offs in decode width, branch accuracy, and ROB size between the two designs.
Example Output:
$ ./hybrid_profile
[Intel Lunar Lake Topology]
P-Cores: 4 (Lion Cove)
E-Cores: 4 (Skymont)
Metric | Lion Cove (P) | Skymont (E) | Gap
-------------|---------------|-------------|-----
Integer IPC | 6.2 | 4.1 | 1.5x
Branch Acc | 98.2% | 95.1% | 1.03x
L2 Latency | 12 cycles | 18 cycles | 1.5x
ROB Size | 576 entries | 416 entries | 1.38x
# CONCLUSION: Skymont E-cores deliver 70% of P-core IPC at 1/4 the power.
# Optimization: Target P-cores for UI/Latency, E-cores for Background/Batch.
The Core Question You're Answering
"If my CPU has 8 cores, why is my code slower on 4 of them?"
Intel's Lunar Lake represents the future of heterogeneous computing. This project answers: How do I write code that performs well on both a 12-wide P-core and a 9-wide E-core?
Concepts You Must Understand First
Stop and research these before coding:
- Heterogeneous Computing
- Why did Intel remove SMT (Hyperthreading) from Lunar Lake? (Power and die area efficiency).
- Lion Cove vs Skymont Architecture
- Compare the decoder widths (8-wide vs 9-wide clusters).
- Intel Thread Director
- How does the hardware "nudge" the OS to move a thread to a different core type?
- Frequency Normalization
- To compare architecture, you must divide performance by Clock Speed.
Questions to Guide Your Design
- Thread Pinning
- How do you use pthread_setaffinity_np to force a thread onto Core 0 (P) vs Core 4 (E)?
- Cache Latency
- Is it faster to share data between two P-cores than between a P and an E core?
- ISA Symmetry
- Are there instructions that run on P but fail or emulate on E?
Thinking Exercise
The Efficient Worker
Imagine a P-core as a specialized engineer and an E-core as a general worker. Questions:
- If you have 100 simple math tasks, is it better to use 1 P-core or 4 E-cores?
- If the P-core has a 12-wide decoder and the E-core has a 9-wide decoder, which one is more sensitive to "messy" code?
- Why did Intel give the E-cores a massive L2 cache cluster?
The Interview Questions They'll Ask
- "What is Intel Lunar Lake's architectural strategy for mobile devices?"
- "Why is branch prediction accuracy different between P and E cores?"
- "Explain how thread affinity affects high-concurrency performance."
- "What are the advantages of removing SMT from performance cores?"
- "How does the Ring Bus connect P and E cores?"
Hints in Layers
Hint 1: CPUID Detection
Use CPUID Leaf 0x1A to identify core types: bits 31:24 of EAX give the core type, where 0x20 means E-core (Atom) and 0x40 means P-core (Core). The result describes the core the thread is currently running on, so pin the thread first.
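A minimal sketch using GCC/Clang's cpuid.h helper; it reports the type of whichever core the thread happens to be scheduled on:

```c
#include <stdio.h>
#include <cpuid.h>   // GCC/Clang wrapper around the CPUID instruction

int main(void) {
    unsigned eax, ebx, ecx, edx;
    // Leaf 0x1A, subleaf 0: core type of the current logical CPU
    // is reported in EAX bits 31:24.
    if (!__get_cpuid_count(0x1A, 0, &eax, &ebx, &ecx, &edx)) {
        puts("CPUID leaf 0x1A not supported");
        return 1;
    }
    unsigned core_type = eax >> 24;
    if (core_type == 0x40)      puts("P-core (Core)");
    else if (core_type == 0x20) puts("E-core (Atom)");
    else                        puts("Unknown / non-hybrid CPU");
    return 0;
}
```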
Hint 2: Pinning API
On Linux, use sched_setaffinity. On macOS, use thread_policy_set (note that macOS treats affinity as a hint rather than a hard pin).
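A minimal Linux pinning sketch; which CPU index maps to a P-core vs an E-core is machine-specific, so check lscpu or sysfs first:

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

// Pin the calling thread to a single logical CPU.
static int pin_to_cpu(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof(set), &set);  // 0 = this thread
}

int main(void) {
    if (pin_to_cpu(0) != 0) { perror("sched_setaffinity"); return 1; }
    printf("running on CPU %d\n", sched_getcpu());
    return 0;
}
```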
Hint 3: The Workload
Run a variety of tasks: integer math, floating point, and linked-list walks. Measure IPC for each on both core types.
Hint 4: Power Monitoring
If on Linux, read the RAPL interface (/sys/class/powercap/intel-rapl) to measure the "Energy Per Instruction" (EPI) for both cores.
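A rough RAPL sketch, assuming an Intel Linux box; the exact sysfs path varies by machine and usually requires root. energy_uj is a cumulative microjoule counter, so sample it before and after the workload:

```c
#include <stdio.h>

// Read the package-level cumulative energy counter (microjoules).
static long long read_energy_uj(void) {
    FILE *f = fopen(
        "/sys/class/powercap/intel-rapl/intel-rapl:0/energy_uj", "r");
    if (!f) return -1;
    long long uj = -1;
    fscanf(f, "%lld", &uj);
    fclose(f);
    return uj;
}

int main(void) {
    long long before = read_energy_uj();
    // ... run your pinned workload here ...
    long long after = read_energy_uj();
    if (before < 0 || after < 0) { puts("RAPL not available"); return 1; }
    // Divide by instructions retired (from perf counters) to get EPI.
    printf("package energy: %lld uJ\n", after - before);
    return 0;
}
```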
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Intel Hybrid Arch | "Intel Optimization Manual" | Ch. 18 |
| Core Differences | "AnandTech / Chips and Cheese Analysis" | Lunar Lake Deep Dive |
| Linux Scheduling | "How Linux Works" | Ch. 4 |
Project Comparison Table
| Project | Difficulty | Time | Depth of Understanding | Fun Factor |
|---|---|---|---|---|
| 1. Human Pipeline Trace | Beginner | Weekend | Medium | 3/5 |
| 2. Branch Torture Test | Intermediate | 1 Week | High | 4/5 |
| 3. Spectre-lite | Expert | 2 Weeks | Extremely High | 5/5 |
| 4. uOp Cache Prober | Advanced | 1 Week | High | 4/5 |
| 5. Port Pressure Map | Advanced | 1 Week | High | 4/5 |
| 6. Memory Disambiguation | Expert | 2 Weeks | High | 3/5 |
| 7. ROB Boundary Finder | Expert | 1 Week | High | 3/5 |
| 8. Macro-op Fusion | Intermediate | 3 Days | Medium | 4/5 |
| 9. L1 Bandwidth Stress | Advanced | 1 Week | High | 3/5 |
| 10. Lunar Lake Profiler | Advanced | 1 Week | High | 5/5 |
Recommendation
Based on your level:
- For the Absolute Beginner: Start with Project 1 (Human Pipeline Trace). You cannot understand modern out-of-order execution without first understanding the "simple" pipeline it evolved from.
- For the Aspiring Systems Engineer: Focus on Project 6 (Memory Disambiguation) and Project 7 (ROB Boundary Finder). These address the most common bottlenecks in real-world database and engine code.
- For the Security Obsessed: Go straight to Project 3 (Spectre-lite). It's the most challenging but the most rewarding, as it bridges the gap between hardware architecture and cybersecurity.
Final Overall Project: The uArch-Aware JIT Engine
What you'll build: A "Micro-JIT" (Just-In-Time) compiler for a simple math language. This JIT is unique: it is Micro-architecture Aware. It won't just generate generic machine code; it will optimize itself on-the-fly based on the CPU it detects.
- On Zen 5: It will use AVX-512 for parallel operations and unroll loops to match the 8-wide dispatch engine.
- On Lunar Lake: It will detect if itâs running on a P-core or E-core. On E-cores, it will favor instructions with lower port pressure and smaller instruction lengths.
- Across both: It will automatically align loop targets to 64-byte boundaries to maximize uOp Cache (DSB) hits and avoid cache-line split penalties.
Real World Outcome
You will have a high-performance JIT engine that outperforms generic compilers by tailoring code to the specific silicon. You'll be able to demonstrate speedups of 20% or more (the sample run below shows 1.51x) just by changing the code-generation strategy based on CPUID results.
Example Output:
$ ./uarch_jit --run my_math_script.math
[JIT: Target Detection]
Found: AMD Zen 5 (8-wide Front-end, 512-bit FPU)
[JIT: Optimization Strategy]
- Strategy: AVX-512 Vectorization
- Strategy: 8x Loop Unrolling
- Strategy: 64-byte Target Alignment
[Execution Result]
Result: 42.0000
Time: 450 cycles (Generic JIT took 680 cycles)
Performance Gain: 1.51x
The Core Question You're Answering
"How do the world's fastest runtimes (like V8 or the JVM) achieve such high performance?"
Modern runtimes aren't just interpreters; they are sophisticated compilers that know the secrets of the hardware they run on. This project answers: How do I build a system that dynamically adapts its machine-code generation to exploit the hidden strengths of 2025 CPU architectures?
Concepts You Must Understand First
- JIT Compilation Basics
- Generating machine code at runtime (mmap with PROT_EXEC).
- Instruction Selection & Scheduling
- Choosing the "cheapest" instructions for a specific core.
- Runtime CPU Dispatching
- Using the CPUID instruction to branch between different code-gen paths.
- Binary Patching
- How to align instructions and insert NOPs for padding at runtime.
Thinking Exercise
The Adaptive Compiler
Imagine your JIT sees a loop that adds 8 numbers. Questions:
- On a CPU with 256-bit SIMD, how many instructions do you generate?
- On Zen 5 with 512-bit SIMD, how many?
- If the CPU has a 576-entry ROB, how many iterations of the loop can you "look ahead" if you unroll it?
The Interview Questions They'll Ask
- "What are the security risks of generating executable memory at runtime (W^X)?"
- "How do you detect CPU features in a portable way?"
- "Why does loop alignment matter for the uOp cache?"
- "Explain the trade-off between JIT compilation time and final code execution time."
Hints in Layers
Hint 1: The Code Buffer
Use mmap to allocate a page, then use mprotect to make it executable after you've written your machine code bytes.
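A minimal W^X sketch, assuming x86-64 Linux; the emitted stub just returns a constant:

```c
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void) {
    // W^X discipline: write first, then flip the page to read+execute.
    size_t len = 4096;
    unsigned char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) { perror("mmap"); return 1; }

    // x86-64 machine code: mov eax, 42 ; ret
    unsigned char code[] = { 0xB8, 0x2A, 0x00, 0x00, 0x00, 0xC3 };
    memcpy(buf, code, sizeof code);

    if (mprotect(buf, len, PROT_READ | PROT_EXEC) != 0) {
        perror("mprotect");
        return 1;
    }

    int (*fn)(void) = (int (*)(void))buf;
    printf("JIT returned %d\n", fn());   // expect 42
    munmap(buf, len);
    return 0;
}
```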
Hint 2: Feature Flags
Create a struct CPUFeatures and fill it once at startup. Use these flags inside your emit_instruction() functions.
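A sketch of Hint 2; CPUFeatures and detect_features are hypothetical names, and a production check would also verify OS XSAVE support via XGETBV before trusting the AVX bits:

```c
#include <cpuid.h>      // GCC/Clang CPUID helpers
#include <stdbool.h>

// Hypothetical feature-flag struct, filled once at startup.
typedef struct {
    bool avx2;
    bool avx512f;
} CPUFeatures;

static CPUFeatures detect_features(void) {
    CPUFeatures f = {false, false};
    unsigned a, b, c, d;
    // CPUID leaf 7, subleaf 0: AVX2 is EBX bit 5, AVX-512F is EBX bit 16.
    if (__get_cpuid_count(7, 0, &a, &b, &c, &d)) {
        f.avx2    = (b >> 5)  & 1;
        f.avx512f = (b >> 16) & 1;
    }
    return f;
}
// Your emit_instruction() paths would then branch on these flags,
// e.g. emitting a 512-bit vector add only when f.avx512f is set.
```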
Hint 3: Alignment Logic
To align a loop, check the current write_pointer. If (ptr % 64) != 0, emit a "multi-byte NOP" until you hit the boundary.
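A sketch of that padding logic; buf and pos stand in for the JIT's write buffer and cursor, and the encodings are the recommended multi-byte NOPs from the Intel SDM:

```c
#include <stdint.h>
#include <string.h>

// Pad with long NOPs until the next emitted byte lands on a 64-byte
// boundary. Returns the new cursor position.
static size_t align_to_64(uint8_t *buf, size_t pos) {
    // Recommended multi-byte NOP encodings, 1..8 bytes long.
    static const uint8_t nops[8][8] = {
        {0x90},
        {0x66, 0x90},
        {0x0F, 0x1F, 0x00},
        {0x0F, 0x1F, 0x40, 0x00},
        {0x0F, 0x1F, 0x44, 0x00, 0x00},
        {0x66, 0x0F, 0x1F, 0x44, 0x00, 0x00},
        {0x0F, 0x1F, 0x80, 0x00, 0x00, 0x00, 0x00},
        {0x0F, 0x1F, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00},
    };
    while (pos % 64 != 0) {
        size_t gap = 64 - pos % 64;
        size_t n = gap > 8 ? 8 : gap;     // largest NOP that fits the gap
        memcpy(buf + pos, nops[n - 1], n);
        pos += n;
    }
    return pos;
}
```

Fewer long NOPs beat many single-byte 0x90s here: each NOP still occupies a decode slot, so padding with 8-byte encodings wastes less front-end bandwidth.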
Hint 4: Benchmarking
Run your math script 1 million times and use RDTSC to compare your "uArch-aware" JIT against a "naive" JIT that uses the same code for every CPU.
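A timing sketch for Hint 4; bench and dummy are hypothetical stand-ins for your compiled entry points, and the iteration count amortizes fixed overhead:

```c
#include <stdio.h>
#include <stdint.h>
#include <x86intrin.h>   // __rdtsc / __rdtscp

// Average cycles per call of a JIT-compiled function.
static uint64_t bench(int (*fn)(void), long iters) {
    unsigned aux;
    uint64_t start = __rdtsc();
    for (long i = 0; i < iters; i++) fn();
    // rdtscp waits for prior instructions to finish before reading.
    uint64_t end = __rdtscp(&aux);
    return (end - start) / (uint64_t)iters;
}

static int dummy(void) { return 42; }   // stand-in for a JIT entry point

int main(void) {
    printf("avg cycles/call: %llu\n",
           (unsigned long long)bench(dummy, 1000000));
    return 0;
}
```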
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| JIT Implementation | "Writing a C Compiler" | Ch. 12 (Codegen) |
| Runtime Code Gen | "Linkers and Loaders" | Ch. 11 |
| CPU Specifics | "Modern Processor Design" | Ch. 4-6 |
Summary
This learning path covers Modern CPU Internals through 10 hands-on projects designed for the 2025 hardware landscape.
| # | Project Name | Main Language | Difficulty | Time Estimate |
|---|---|---|---|---|
| 1 | Human Pipeline Trace | C | Beginner | Weekend |
| 2 | Branch Torture Test | C++ | Intermediate | 1 Week |
| 3 | Spectre-lite | C | Expert | 2 Weeks |
| 4 | uOp Cache Prober | Assembly | Advanced | 1 Week |
| 5 | Port Pressure Map | C++ | Advanced | 1 Week |
| 6 | Memory Disambiguation | C | Expert | 2 Weeks |
| 7 | ROB Boundary Finder | Assembly | Expert | 1 Week |
| 8 | Macro-op Fusion | Assembly | Intermediate | 3 Days |
| 9 | L1 Bandwidth Stress | C | Advanced | 1 Week |
| 10 | Lunar Lake Profiler | C++ | Advanced | 1 Week |
| F | uArch-Aware JIT | C | Expert | 1 Month |
Expected Outcomes
After completing these projects, you will:
- Understand every stage of the 2025 CPU execution pipeline.
- Be able to diagnose bottlenecks in front-end (decoding) and back-end (execution).
- Know how to write code that avoids branch mispredictions and port contention.
- Understand the architectural trade-offs between Zen 5 and Intel Lunar Lake.
- Be familiar with speculative execution security and how to measure mitigation impact.