Modern CPU Internals: 2025 Deep Dive

Goal: Build a deep, working mental model of modern CPU microarchitecture across x86 and ARM-class designs, including how instructions are fetched, decoded, renamed, scheduled, executed, and retired. You will learn to reason about performance from first principles (latency, throughput, bandwidth, speculation, and power) and translate that reasoning into real measurements. By the end, you will be able to design microbenchmarks, interpret hardware counters, and write code that consistently approaches the architectural limits of the silicon you are running on.


Introduction: What This Guide Covers

Modern CPU Internals is the study of what actually happens inside a processor after your source code becomes machine code. It is the map between your code and physical reality: pipelines, predictors, reorder buffers, execution ports, caches, and speculation. If you can see that map, you can make real performance decisions instead of guessing.

What you will build (by the end of this guide):

  • A pipeline-trace simulator that visualizes hazards and forwarding
  • A branch-predictor stress suite that reveals predictor limits
  • A Spectre-lite side-channel proof and mitigation experiments
  • A uOp cache boundary prober and a macro-op fusion detector
  • Port pressure and memory disambiguation profilers
  • A bandwidth stressor and a hybrid-core profiler
  • A final uArch-aware JIT that adapts codegen at runtime

Scope (what is included):

  • Front-end fetch/decode/uOp cache behavior
  • Branch prediction and speculative execution
  • Out-of-order execution (rename, scheduling, ROB)
  • Execution ports, throughput, and latency
  • Memory hierarchy, bandwidth, and disambiguation
  • SIMD/vectorization and heterogeneous cores
  • Measurement methodology (perf/RDTSC/PMU)

Out of scope (for this guide):

  • Full ISA design or compiler theory
  • GPU architecture and programming models
  • RTL/HDL design or physical layout
  • Formal verification or CPU microcode design

The Big Picture (Mental Model)

Source Code -> Compiler -> Machine Code -> Front-End -> Back-End -> Retire
     |            |            |             |           |         |
     |            |            |             |           |         +--> Architectural State
     |            |            |             |           +--> Out-of-order execute
     |            |            |             +--> Predict + Decode + uOps
     |            |            +--> Bytes in I-cache / uOp cache
     |            +--> Instruction selection + scheduling
     +--> Algorithm + data layout (determines cache/branch behavior)

Key Terms You Will See Everywhere

  • uOp (micro-operation): A simplified internal instruction produced by the decoder.
  • ROB (Reorder Buffer): Tracks in-flight uOps and commits in order.
  • Port pressure: When instructions contend for the same execution port.
  • Speculation: Executing ahead of known outcomes to keep the pipeline busy.
  • Disambiguation: Guessing whether a load can pass older stores.

How to Use This Guide

  1. Read the Theory Primer once, fast. The first pass is about building a mental map.
  2. Pick a starter project (Project 1 or 2) and build it end-to-end.
  3. Return to the primer and re-read the chapters used by that project.
  4. Use a Top-Down lens: classify bottlenecks into front-end, bad speculation, or back-end bound.
  5. Run experiments, not just code. Always measure and compare to expected limits.
  6. Keep a lab notebook. Log CPU model, kernel version, compiler flags, and measurements.

Prerequisites & Background Knowledge

Essential Prerequisites (Must Have)

Programming Skills:

  • Comfortable in C or C++ (pointer arithmetic, arrays, structs, function pointers)
  • Basic assembly reading (x86-64 or ARM64) and calling conventions
  • Ability to use build tools (make, clang/gcc)

Computer Architecture Fundamentals:

  • Basic CPU pipeline stages
  • Cache basics (L1/L2/L3, cache lines)
  • Virtual memory (pages, TLB)
  • Recommended Reading: “Computer Systems: A Programmer’s Perspective” (CS:APP) Ch. 4-6

Operating Systems Basics:

  • Process vs thread
  • Pinning threads to cores
  • Syscalls, page faults
  • Recommended Reading: “Operating Systems: Three Easy Pieces” Ch. 5-7

Helpful But Not Required

Advanced Performance Topics:

  • Performance counters and PMU tooling
  • Compiler optimization passes
  • JIT code generation
  • Can learn during: Projects 2, 5, 10, Final JIT

Self-Assessment Questions

  1. Can you read a short x86-64 or ARM64 function and explain its control flow?
  2. Do you know what a cache line is and why alignment matters?
  3. Can you explain the difference between latency and throughput?
  4. Have you used perf or a similar profiling tool?
  5. Can you pin a thread to a core on Linux or macOS?

If you answered “no” to 1-3, spend a week with CS:APP Ch. 4-6 and “Inside the Machine” Ch. 2-4.

Development Environment Setup

Required Tools:

  • Linux machine (Ubuntu 22.04+ recommended) or macOS with Apple Silicon
  • GCC or Clang (latest stable)
  • perf (Linux), dtrace or Instruments (macOS)
  • objdump, nm, ld, as

Recommended Tools:

  • pmu-tools (Linux) for top-down analysis
  • llvm-mca for static throughput estimates
  • hwloc for topology and NUMA layout

Testing Your Setup:

$ uname -a
$ gcc --version
$ clang --version
$ which perf objdump

Time Investment

  • Simple projects (1, 8): 4-8 hours each
  • Moderate projects (2, 4, 5, 9): 1 week each
  • Complex projects (3, 6, 7, 10): 1-2 weeks each
  • Final JIT project: 3-4 weeks

Important Reality Check

Microarchitecture is not deterministic. Your results will change across:

  • CPU models and stepping
  • BIOS settings and microcode
  • OS scheduler behavior
  • Thermal or power limits

This is normal. The goal is relative insight and repeatable methodology, not a single magical number.


Big Picture / Mental Model

            +------------------------------+
            |          FRONT-END           |
            | Fetch -> Decode -> uOps      |
            +---------------+--------------+
                            |
                            v
            +------------------------------+
            |       OUT-OF-ORDER CORE      |
            | Rename -> Schedule -> Execute|
            +---------------+--------------+
                            |
                            v
            +------------------------------+
            |         RETIRE / COMMIT      |
            | In-order architectural state |
            +---------------+--------------+
                            |
                            v
            +------------------------------+
            |   MEMORY HIERARCHY & IO      |
            | L1/L2/L3/DRAM + Prefetch     |
            +------------------------------+

Top-Down Bottleneck View (TMAM mental model):

Pipeline Slots (per cycle)
┌──────────────────────────────────────────────┐
│ Retiring (useful work)                      │
│ Bad Speculation (wasted work)               │
│ Front-End Bound (starved by fetch/decode)   │
│ Back-End Bound (core- or memory-bound)      │
└──────────────────────────────────────────────┘

Theory Primer (Read This Before Coding)

Chapter 1: Front-End, Decode, uOps, and Macro-op Fusion

Fundamentals

The front-end is the CPU’s instruction supply chain. It fetches bytes from the instruction cache, finds instruction boundaries, decodes variable-length instructions into simpler micro-operations, and feeds those uOps into the back-end. In modern x86 cores, the front-end is often the true performance ceiling because it must decode a complex ISA and deliver enough uOps per cycle to keep the back-end busy. To avoid decoding the same hot loop repeatedly, CPUs store decoded uOps in a special uOp cache (sometimes called the DSB on Intel). Macro-op fusion is a front-end optimization that combines certain instruction pairs (like CMP + Jcc) into a single uOp, effectively reducing front-end bandwidth demand. If your hot loop fits entirely in the uOp cache and fuses well, your CPU behaves like a wide-issue RISC engine. If it spills into legacy decode, performance drops sharply.

Deep Dive into the Concept

The front-end pipeline starts at the instruction fetch unit (IFU), which uses the branch predictor and the instruction TLB to assemble a stream of instruction bytes from the L1 I-cache. On x86, instruction boundaries are not fixed, so predecode logic scans bytes to find the start and end of each instruction. This is why the decode stage is expensive: the decoder must parse prefixes, opcode bytes, ModRM/SIB encodings, and immediate fields. Modern x86 cores typically have multiple decoders in parallel, but only a subset can handle complex instructions (like memory-to-memory operations or long immediates). To decouple decode complexity from back-end throughput, decoded uOps are placed in a micro-op cache indexed by instruction address. Subsequent executions can bypass the legacy decoders entirely and pull uOps directly from this cache. That is why loop alignment and loop size are critical for performance: a loop that fits in the uOp cache can sustain high IPC even when decode bandwidth is limited.

Macro-op fusion happens during decode. If the decoder sees a compare instruction immediately followed by a conditional branch, it can fuse them into a single uOp, meaning one entry in the uOp queue and one slot in the ROB. This effectively increases front-end capacity by reducing the number of uOps needed for control flow. There is also micro-op fusion, where a load and an ALU op are fused into one uOp that flows through the pipeline as a unit. However, these fusions have strict rules: the instructions must be adjacent, the pair is sensitive to code alignment, and it must not cross certain fetch boundaries (like 16-byte fetch blocks on some cores). This is why the placement of NOPs and alignment directives matters.

Front-end performance is constrained by several hard limits: fetch width (bytes per cycle), decode width (instructions per cycle), and uOp cache bandwidth (uOps per cycle). When the uOp cache misses, you pay the decode cost again. When the branch predictor is wrong, the front-end fetches from the wrong path and must restart, wasting uOp cache and decode bandwidth. When the decoder is saturated, the back-end starves even if execution ports are idle. The net effect is that front-end behavior determines the effective throughput of any tight loop. Modern optimizations often target front-end efficiency: align hot loops to cache lines, minimize instruction length, maximize fusion, and keep hot paths within uOp cache capacity.

From a measurement perspective, front-end bottlenecks show up as “front-end bound” slots in the Top-Down method: the back-end is ready, but not enough uOps arrive each cycle. Intel’s Optimization Reference Manual documents decode throughput limits and fusion rules, which you can validate by changing instruction layout and re-measuring IPC. On fixed-length ISAs like ARM, decode is simpler, but front-end limits still appear via I-cache misses, BTB pressure, and uOp delivery bandwidth.
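
To make these limits concrete, here is a minimal timing sketch (an illustrative example, not part of any project skeleton) that times a placeholder hot loop with RDTSC; it assumes an x86-64 machine with GCC or Clang, and on ARM you would substitute a different timer such as clock_gettime. Swap in the loop layout you want to study and compare relative changes rather than absolute numbers.

// cycles_per_iter.c - minimal RDTSC timing harness (sketch; x86-64, GCC/Clang)
// Build: cc -O2 cycles_per_iter.c -o cycles_per_iter
#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>   // __rdtsc

#define ITERS 100000000ULL

int main(void) {
    volatile uint64_t sink = 0;              // volatile: keeps the loop body alive
    uint64_t start = __rdtsc();
    for (uint64_t i = 0; i < ITERS; i++) {
        sink += i;                           // placeholder hot-loop body under test
    }
    uint64_t end = __rdtsc();
    // RDTSC counts reference cycles, so treat the result as a relative metric:
    // compare layouts (aligned vs unaligned, short vs long encodings) against each other.
    printf("approx reference cycles/iter: %.3f\n",
           (double)(end - start) / (double)ITERS);
    return 0;
}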

How this fits in projects

  • Project 1 builds a pipeline timeline so you can see front-end stalls.
  • Project 4 finds the uOp cache boundary.
  • Project 8 detects macro-op fusion in practice.
  • Final JIT uses alignment and code layout for front-end wins.

Definitions & key terms

  • uOp cache / DSB: Cache of decoded micro-ops.
  • MITE: Intel term for legacy decode path.
  • Macro-op fusion: Combining two instructions into one uOp.
  • Micro-op fusion: Combining a load and ALU op into one uOp.

Mental model diagram

I-cache bytes
    |
    v
[Predecode] -> [Decoders] -> [uOps] -> [uOp queue] -> back-end
                                |
                                +-> [uOp cache]   hit: later fetches bypass the decoders
                                                  miss: fall back to the legacy decode path

How it works (step-by-step)

  1. IFU fetches bytes using predicted control flow.
  2. Predecode locates instruction boundaries.
  3. Decoders translate instructions into uOps.
  4. Decoder tries macro-op fusion on eligible pairs.
  5. uOps are queued to the scheduler and cached in the uOp cache.
  6. Later iterations fetch uOps directly from the uOp cache if present.

Minimal concrete example

; Macro-op fusion candidate: cmp + je are adjacent, so many cores fuse them into one uOp
cmp rax, 0
je  label        ; test rax, rax / je is an equivalent, shorter idiom

Common misconceptions

  • “If the back-end is wide, decode does not matter.” -> Wrong: decode can starve the back-end.
  • “Any compare + branch always fuses.” -> Wrong: fusion rules are strict and core-specific.

Check-your-understanding questions

  1. Why does a uOp cache hit bypass the decoders?
  2. Why do variable-length instructions make decoding hard?
  3. What conditions typically break macro-op fusion?

Check-your-understanding answers

  1. The uOp cache stores decoded uOps indexed by address, so decode can be skipped.
  2. Without fixed instruction length, the decoder must parse the byte stream to find boundaries.
  3. Non-adjacent instructions, alignment boundaries, or unsupported instruction pairs.

Real-world applications

  • Compiler code layout and loop alignment
  • JIT engines (V8, JVM) optimizing hot loops
  • Crypto and media code tuned for decode bandwidth

Where you’ll apply it

  • Project 1, Project 4, Project 8, Final JIT

References

  • Intel 64 and IA-32 Architectures Optimization Reference Manual (v50, 2024)
    https://www.intel.com/content/www/us/en/developer/articles/technical/intel-64-and-ia-32-architectures-optimization-reference-manual.html
  • Intel® 64 and IA-32 Architectures Software Developer’s Manual (2025)
    https://www.intel.com/content/www/us/en/developer/articles/technical/intel-sdm.html

Key insight

Front-end efficiency is often the real bottleneck; decode bandwidth is a hard ceiling.

Summary

The front-end translates complex instructions into uOps and feeds the back-end. The uOp cache and fusion are critical optimizations that allow modern x86 cores to sustain high throughput. Understanding front-end limits explains why small, aligned, predictable loops run dramatically faster than large, irregular ones.

Homework/Exercises to practice the concept

  1. Assemble a loop of 16 instructions and align it to 64 bytes. Measure IPC.
  2. Insert a single 15-byte instruction into the loop and re-measure.
  3. Add or remove NOPs to test fusion boundaries.

Solutions to the homework/exercises

  1. IPC should be higher when the loop fits in the uOp cache and is aligned.
  2. The long instruction can reduce fetch/decode efficiency and lower IPC.
  3. Fusion can break if instruction pairs cross alignment boundaries.

Chapter 2: Branch Prediction and Control Speculation

Fundamentals

Branch prediction is the CPU’s guess about which path your code will take before the data is known. Because modern pipelines are deep and out-of-order, a single misprediction can waste dozens of cycles. Predictors use history (local and global) to recognize patterns in branches. TAGE (Tagged Geometric) predictors combine multiple predictor tables indexed with different history lengths, allowing them to recognize both short and long patterns. Speculation lets the CPU keep working while waiting for a branch outcome; when the guess is wrong, work is discarded but time is lost. Performance-sensitive code must avoid unpredictable branches or structure them so the predictor can learn.

Deep Dive into the Concept

Branch predictors are hierarchical. At the front, a Branch Target Buffer (BTB) predicts target addresses for taken branches. A direction predictor decides taken vs not-taken. Modern predictors combine several components: global history tables, local history tables, loop predictors, and statistical correctors. The TAGE family uses multiple tagged tables, each indexed by a different history length (the lengths typically form a geometric series). Each table provides a candidate prediction, and a chooser selects the best based on usefulness counters. This design handles both short and long correlations without exploding storage. Because tables are tagged, they reduce aliasing (two unrelated branches mapping to the same entry). In practice, this means a short repeating pattern like T,T,N can be learned quickly while longer periodic patterns can still be tracked by longer-history tables.

Control speculation is what happens after the prediction: the front-end fetches and decodes the predicted path and sends uOps into the out-of-order engine. The CPU can execute hundreds of uOps on a speculative path before the real branch condition is known. If the prediction was correct, all of that work retires as if it were executed in-order. If incorrect, the CPU squashes the speculative state and restarts from the correct target. The cost of a misprediction is roughly the depth of the pipeline plus the time to refill the front-end, which can be 10s of cycles or more. Deep pipelines and wide decode make this cost higher, which is why prediction accuracy is now one of the most critical performance factors.
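
To see that cost directly, here is a hedged sketch (plain C with standard-library timing) that runs the same branchy loop over a perfectly periodic condition array and then a random one; the gap between the two timings approximates the misprediction rate times the misprediction penalty. Check the generated assembly: if the compiler if-converts the branch into a conditional move, the contrast disappears.

// branch_pattern.c - periodic vs random branch outcomes (sketch)
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

enum { N = 1 << 20, REPS = 100 };

static uint8_t cond[N];

static uint64_t run(void) {
    uint64_t sum = 0;
    for (int r = 0; r < REPS; r++)
        for (int i = 0; i < N; i++)
            if (cond[i]) sum += i;                     // the branch under test
    return sum;
}

int main(void) {
    // Pass 1: period-2 pattern, trivially learned by the predictor.
    for (int i = 0; i < N; i++) cond[i] = (uint8_t)(i & 1);
    clock_t t0 = clock();
    volatile uint64_t s1 = run();
    clock_t t1 = clock();

    // Pass 2: random outcomes, which defeat history-based prediction.
    srand(42);
    for (int i = 0; i < N; i++) cond[i] = (uint8_t)(rand() & 1);
    clock_t t2 = clock();
    volatile uint64_t s2 = run();
    clock_t t3 = clock();

    printf("periodic: %.3fs  random: %.3fs  (sums %llu %llu)\n",
           (double)(t1 - t0) / CLOCKS_PER_SEC, (double)(t3 - t2) / CLOCKS_PER_SEC,
           (unsigned long long)s1, (unsigned long long)s2);
    return 0;
}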

Branch prediction accuracy is shaped by both hardware and code. Hardware uses multiple predictors and complex indexing; software can help by avoiding data-dependent branches in hot loops, using bitwise logic or conditional moves, or grouping correlated branches. Compilers also rewrite branches into predicated instructions when possible. However, predication can sometimes increase instruction count and pressure the front-end. The art is choosing when to keep a branch and when to eliminate it. Understanding predictor behavior lets you design microbenchmarks that reveal its limits, such as periodic patterns that overflow history or random patterns that defeat the predictor entirely.
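
As a companion sketch for the branch-elimination trade-off just described (again illustrative, not a prescribed technique), the loop below replaces a data-dependent branch with a mask; whether this wins depends on how unpredictable the branch was and on what the compiler already does to the branchy version.

// branchless.c - branchy vs mask-based accumulation (sketch)
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1 << 20)
#define REPS 100

static int32_t a[N];

int main(void) {
    srand(7);
    for (int i = 0; i < N; i++) a[i] = rand() - RAND_MAX / 2;   // random signs

    // Branchy: one hard-to-predict branch per element.
    clock_t t0 = clock();
    int64_t s1 = 0;
    for (int r = 0; r < REPS; r++)
        for (int i = 0; i < N; i++)
            if (a[i] > 0) s1 += a[i];
    clock_t t1 = clock();

    // Branchless: the condition becomes an all-zeros/all-ones mask, so there is
    // nothing to predict. Caveat: the compiler may already if-convert or
    // vectorize the branchy loop above, which hides the effect.
    int64_t s2 = 0;
    for (int r = 0; r < REPS; r++)
        for (int i = 0; i < N; i++) {
            int32_t m = -(a[i] > 0);                            // 0 or -1
            s2 += a[i] & m;
        }
    clock_t t2 = clock();

    printf("branchy: %.3fs  branchless: %.3fs  (%lld %lld)\n",
           (double)(t1 - t0) / CLOCKS_PER_SEC, (double)(t2 - t1) / CLOCKS_PER_SEC,
           (long long)s1, (long long)s2);
    return 0;
}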

In practice, the predictor is a collection of specialized structures: BTBs for target addresses, direction predictors with multiple history lengths (TAGE-style families), loop predictors for simple loops, and return stack buffers (RSB) for ret instructions. Indirect branches (virtual calls, jump tables) are especially challenging because they require predicting among many targets, which is why code layout and virtual dispatch can strongly affect IPC.

How this fits in projects

  • Project 2 stress-tests predictor history and reveals misprediction cost.
  • Project 3 shows how speculation can leak data.

Definitions & key terms

  • BTB: Branch Target Buffer for predicting target addresses.
  • BHT/PHT: Branch/Pattern History Tables.
  • TAGE: Tagged Geometric History Length predictor family.
  • Misprediction penalty: Cycles lost on wrong prediction.

Mental model diagram

Branch PC -> BTB -> Target
           |-> Direction Predictor (TAGE tables)
                    |-> predict T/NT

How it works (step-by-step)

  1. Fetch unit consults BTB and direction predictor.
  2. Front-end fetches from predicted path.
  3. Branch executes later in the pipeline.
  4. If prediction correct, speculative work retires.
  5. If incorrect, pipeline is flushed and restarted.

Minimal concrete example

// Period-2 taken/not-taken pattern: a modern predictor learns this almost immediately
for (int i = 0; i < N; i++) {
    if (i % 2 == 0) sum += a[i];
}

Common misconceptions

  • “Branch predictors only look at the last outcome.” -> Modern predictors use long histories.
  • “Misprediction cost is constant.” -> It varies by pipeline depth and front-end refill.

Check-your-understanding questions

  1. Why do predictors use multiple history lengths?
  2. What is the role of tags in TAGE?
  3. Why can a predictable branch still be expensive?

Check-your-understanding answers

  1. To capture both short and long patterns without huge tables.
  2. Tags reduce aliasing and improve accuracy.
  3. Warm-up mispredictions and table conflicts (aliasing) still cost cycles, even when the branch is predictable in steady state.

Real-world applications

  • High-frequency trading (branchless hot loops)
  • OS schedulers and interpreters (high branch density)
  • Database query engines (branch-heavy filters)

Where you’ll apply it

  • Project 2, Project 3

References

  • Seznec, Michaud, “A Case for (Partially) Tagged Geometric History Length Branch Prediction” (2006)
    https://jilp.org/vol8/v8paper1.pdf
  • Lin & Tarsa, “Branch Prediction Is Not a Solved Problem” (arXiv, 2019)
    https://arxiv.org/abs/1906.08170
  • Intel 64 and IA-32 Architectures Optimization Reference Manual (v50, 2024)
    https://www.intel.com/content/www/us/en/developer/articles/technical/intel-64-and-ia-32-architectures-optimization-reference-manual.html

Key insight

Branch prediction is a machine-learning system whose limits you can measure and exploit.

Summary

Predictors use history and tagging to guess control flow. Their accuracy dominates performance in branchy code. Speculation keeps the pipeline full, but mispredictions waste work and can even leak information.

Homework/Exercises to practice the concept

  1. Write a loop with a pattern period of 2, 4, 16, 1024 and measure cycles/iter.
  2. Force a branch to be random and compare with a predictable branch.
  3. Replace a branch with a conditional move and compare throughput.

Solutions to the homework/exercises

  1. Longer periods eventually exceed predictor history and cause mispredictions.
  2. Random branches show much higher cycles/iter.
  3. CMOV removes misprediction but may increase uOps and port pressure.

Chapter 3: Out-of-Order Core, Rename, and the ROB

Fundamentals

Out-of-order (OoO) execution is the reason modern CPUs can keep working while waiting on memory. Instead of executing instructions strictly in program order, the CPU decodes ahead, renames registers to remove false dependencies, and schedules any uOp whose operands are ready. The Reorder Buffer (ROB) tracks every in-flight uOp so results can be committed in the original order. This preserves the illusion of sequential execution even though the hardware is running far ahead. The size of the ROB and associated reservation stations define the “instruction window” and determine how much latency the CPU can hide.

Deep Dive into the Concept

When uOps enter the out-of-order engine, the rename stage maps architectural registers (like RAX) to physical registers. This breaks write-after-read (WAR) and write-after-write (WAW) hazards, leaving only true data dependencies (RAW). The renamed uOps are placed into a scheduling structure, often split by execution domain (integer, vector, memory). The scheduler watches for operand readiness and issues uOps to execution ports when their inputs are ready. This allows independent operations later in the program to run before earlier operations that are stalled on memory.

Memory operations are tracked in load and store queues so that the core can speculate on memory ordering while still enforcing correctness. Loads can execute before older stores if the hardware predicts they do not alias; if that prediction is wrong, the core rolls back and replays. This is why OoO performance is so sensitive to pointer-chasing and aliasing patterns: even with a large ROB, a congested load/store queue can become the real bottleneck.

The ROB holds metadata for every in-flight uOp: original program order, destination register mapping, exception status, and completion state. Even though uOps execute out of order, retirement must happen in order to maintain precise exceptions and architectural state. If a branch mispredicts, or a speculative load violates a dependency, the ROB allows the CPU to roll back to a known-good point by discarding younger uOps. The ROB size is therefore both a performance lever and a complexity cost: bigger windows hide more latency but require more physical registers, more wakeup logic, and more energy.

The instruction window is not just the ROB. It is limited by the smaller of ROB size, scheduler entries, physical register file capacity, and load/store queue size. A large ROB without enough physical registers or scheduler entries will still stall. Additionally, the latency of wakeup/selection logic grows superlinearly as the window grows, which is why design trade-offs exist. Modern CPUs often balance window size, scheduler partitioning, and clock speed. From a software perspective, what matters is that long-latency operations (like cache misses) can be overlapped with independent work if and only if the instruction window is large enough and dependency chains are broken.
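
The homework at the end of this chapter asks you to measure exactly this effect. The sketch below (assumed flags: -O2 without -ffast-math, so the floating-point adds are not reassociated) contrasts one serial dependency chain with four independent accumulators that the scheduler can overlap.

// accumulators.c - one dependency chain vs four independent chains (sketch)
// Build: cc -O2 accumulators.c (no -ffast-math, so the FP adds keep their order)
#include <stdio.h>
#include <time.h>

#define N 200000000

int main(void) {
    clock_t t0 = clock();
    double a = 0.0;
    for (int i = 0; i < N; i++) a += 1.0;              // serial chain: bound by add latency
    clock_t t1 = clock();

    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    for (int i = 0; i < N; i += 4) {                   // same number of adds, four chains
        s0 += 1.0; s1 += 1.0; s2 += 1.0; s3 += 1.0;    // independent: the scheduler overlaps them
    }
    clock_t t2 = clock();

    printf("1 chain: %.3fs  4 chains: %.3fs  (%.1f)\n",
           (double)(t1 - t0) / CLOCKS_PER_SEC,
           (double)(t2 - t1) / CLOCKS_PER_SEC,
           a + s0 + s1 + s2 + s3);
    return 0;
}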

How this fits in projects

  • Project 7 measures your CPU’s effective instruction window (ROB boundary).
  • Projects 5 and 6 reveal scheduler and memory dependencies that limit OoO benefit.

Definitions & key terms

  • Rename: Mapping architectural registers to physical registers.
  • ROB: Tracks in-flight uOps and retires in order.
  • Instruction window: The set of uOps the CPU can see ahead.
  • Scheduler / Reservation station: Queue where ready uOps are issued.

Mental model diagram

Decode -> Rename -> ROB -> Schedulers -> Execute -> Retire
                     ^                      |
                     +-- tracks completion -+

How it works (step-by-step)

  1. uOps are renamed to physical registers.
  2. uOps enter the ROB and scheduler.
  3. Ready uOps issue to execution ports.
  4. Results write back to physical registers.
  5. ROB retires uOps in program order.

Minimal concrete example

// Independent operations allow OoO overlap
x = a + b;
y = c + d;
z = *ptr;       // long-latency load (may miss in cache)
w = e + f;      // independent: can execute while the load is outstanding

Common misconceptions

  • “Bigger ROB always means faster.” -> It helps only if you have independent work.
  • “OoO removes all memory latency.” -> It hides latency only within window limits.

Check-your-understanding questions

  1. Why do we need register renaming?
  2. Why must retirement be in order?
  3. What typically limits the instruction window besides ROB size?

Check-your-understanding answers

  1. To eliminate false dependencies and increase parallelism.
  2. To provide precise exceptions and correct architectural state.
  3. Physical registers, scheduler entries, or load/store queue size.

Real-world applications

  • Database engines overlapping pointer chasing with independent math
  • Compilers reordering independent operations to increase ILP
  • HPC kernels designed to keep the window full

Where you’ll apply it

  • Project 7, Project 5, Project 6, Final JIT

References

  • “Computer Architecture: A Quantitative Approach” Ch. 3
  • “Computer Organization and Design” (Patterson & Hennessy) Ch. 3-4
  • Intel® 64 and IA-32 Architectures Software Developer’s Manual (2025)
    https://www.intel.com/content/www/us/en/developer/articles/technical/intel-sdm.html
  • Intel 64 and IA-32 Architectures Optimization Reference Manual (v50, 2024)
    https://www.intel.com/content/www/us/en/developer/articles/technical/intel-64-and-ia-32-architectures-optimization-reference-manual.html

Key insight

OoO execution is powerful but finite; you can only hide latency you can see ahead.

Summary

The OoO engine is a speculation machine that runs far ahead while guaranteeing correct final state. ROB size, renaming capacity, and scheduler size define the limit of hidden latency. Software can either feed the window with independent work or stall it with dependencies.

Homework/Exercises to practice the concept

  1. Write a loop with 1 independent add per iteration and measure IPC.
  2. Increase independent accumulators to 4, 8, 16 and measure IPC changes.
  3. Add a long-latency load and see at which unroll factor IPC stops improving.

Solutions to the homework/exercises

  1. IPC is low due to dependency chain.
  2. IPC increases as you add independent work.
  3. IPC flattens once the instruction window is saturated.

Chapter 4: Execution Ports, Throughput, and Port Pressure

Fundamentals

Execution ports are the physical doorways from the scheduler to functional units. A single port may connect to an integer ALU, a vector unit, or a load/store pipe. If many uOps require the same port, they queue and wait even if other ports are idle. This is why throughput is not just about total ALUs, but about port distribution and binding. Instruction latency tells you how long a single instruction takes; throughput tells you how many you can start per cycle. High-performance code spreads work across ports and avoids long dependency chains.

Deep Dive into the Concept

Each microarchitecture defines a port map: which uOps can issue on which ports, and which functional units are attached to those ports. For example, integer adds may execute on multiple ports, while divides or shuffles may only execute on one. The scheduler chooses a port each cycle based on availability and constraints, but if the port map is narrow for certain instructions, they become bottlenecks. This is the source of port pressure. Port pressure is measurable: if you run a tight loop of a single instruction type and measure uOps per cycle, you can infer which ports that instruction uses.

Instruction latency vs throughput is the key mental model. A multiply might have a latency of 3-5 cycles but a throughput of 1 per cycle if the pipeline is fully pipelined. A divide might have a latency of 20+ cycles and throughput of 1 per 4-8 cycles, meaning it is both slow and scarce. When mixing instructions, the scheduler can fill otherwise idle ports if you choose complementary operations. For example, alternating integer and vector instructions may increase total throughput. However, dependencies limit this: if instruction B depends on the result of A, you cannot overlap them, regardless of port availability. This is why out-of-order execution and port pressure are coupled.
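
A minimal way to separate those two numbers is to time a serial chain of one operation against an unrolled, independent version of the same operation. The sketch below does this for 64-bit integer multiply; the constant and iteration count are arbitrary choices, and cross-checking the two loop bodies with llvm-mca is a useful follow-up.

// mul_lat_tput.c - latency-bound vs throughput-bound integer multiply (sketch)
#include <stdint.h>
#include <stdio.h>
#include <time.h>

#define N 200000000
#define K 0x9e3779b97f4a7c15ULL   // arbitrary odd constant

int main(void) {
    volatile uint64_t seed = 3;   // volatile: stops constant folding of the chains

    clock_t t0 = clock();
    uint64_t x = seed;
    for (int i = 0; i < N; i++) x *= K;                // serial chain: bound by multiply latency
    clock_t t1 = clock();

    uint64_t a = seed, b = seed + 1, c = seed + 2, d = seed + 3;
    for (int i = 0; i < N; i += 4) {                   // same total work, four independent chains:
        a *= K; b *= K; c *= K; d *= K;                // bound by multiply throughput/port count
    }
    clock_t t2 = clock();

    printf("serial: %.3fs  independent: %.3fs  (%llu)\n",
           (double)(t1 - t0) / CLOCKS_PER_SEC, (double)(t2 - t1) / CLOCKS_PER_SEC,
           (unsigned long long)(x ^ a ^ b ^ c ^ d));
    return 0;
}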

Modern cores also have specialized execution units for address generation (AGUs), load/store, branch, and vector operations. Load/store units are often the true bottleneck in memory-heavy loops. Port pressure analysis is critical in performance tuning and explains why some compiler transformations help. For instance, unrolling increases independent work so the scheduler can saturate ports, but too much unrolling can increase register pressure and cause spills, which adds memory uOps and increases load/store pressure.

You can observe port pressure indirectly with performance counters. When “core-bound” is high in a Top-Down breakdown and memory stalls are low, the bottleneck is often a saturated execution port or scarce unit (like divides or shuffles). Combining static analysis (llvm-mca) with dynamic counters (perf stat) lets you triangulate which port or unit is limiting throughput.

How this fits in projects

  • Project 5 builds a port pressure map.
  • Project 9 stresses load/store and vector bandwidth.

Definitions & key terms

  • Latency: Cycles for an instruction to produce a result.
  • Throughput: Instructions per cycle when pipelined.
  • Port pressure: Contention on ports due to binding.
  • AGU: Address Generation Unit for memory ops.

Mental model diagram

Scheduler -> Port 0 (ALU)
          -> Port 1 (ALU/MUL)
          -> Port 2 (Load)
          -> Port 3 (Store)

How it works (step-by-step)

  1. Scheduler selects ready uOps.
  2. Each uOp is assigned to an eligible port.
  3. Port issues to its functional unit.
  4. Result is written back, freeing resources.

Minimal concrete example

// Independent accumulators reduce latency bottlenecks
sum0 += a[i];
sum1 += a[i+1];

Common misconceptions

  • “CPU usage at 50% means half the ports are idle.” -> Not necessarily; you can be port-bound.
  • “Latency is all that matters.” -> Throughput is the main limiter in steady-state loops.

Check-your-understanding questions

  1. Why can two adds run faster than two multiplies?
  2. What is the difference between port binding and dependency?
  3. How does unrolling increase throughput?

Check-your-understanding answers

  1. Adds often have multiple ports; multiplies often have fewer.
  2. Port binding limits which ports are eligible; dependency limits scheduling order.
  3. It creates independent uOps so multiple ports can be used each cycle.

Real-world applications

  • SIMD hot loops in media and crypto
  • HPC kernels tuned for port balance
  • Compiler auto-vectorization choices

Where you’ll apply it

  • Project 5, Project 9, Final JIT

References

  • Intel 64 and IA-32 Architectures Optimization Reference Manual (v50, 2024)
    https://www.intel.com/content/www/us/en/developer/articles/technical/intel-64-and-ia-32-architectures-optimization-reference-manual.html
  • Intel perfmon events (for port/throughput counter mapping)
    https://perfmon-events.intel.com/
  • “Inside the Machine” Ch. 4

Key insight

You do not have one CPU; you have a grid of ports, and your code is a traffic pattern.

Summary

Execution ports explain why throughput varies by instruction mix. To achieve peak performance, you must balance port usage, avoid long dependency chains, and feed the scheduler with independent uOps.

Homework/Exercises to practice the concept

  1. Measure throughput for ADD-only, MUL-only, and mixed loops.
  2. Use llvm-mca to predict port usage and compare with measurements.
  3. Introduce dependencies and observe throughput drop.

Solutions to the homework/exercises

  1. ADD-only should have higher IPC than MUL-only on most cores.
  2. If measurements differ, investigate hidden dependencies or front-end limits.
  3. Dependencies serialize instructions and reduce throughput.

Chapter 5: Memory Hierarchy, Caches, and Bandwidth

Fundamentals

CPU cores are fast, but memory is slow. The memory hierarchy exists to bridge that gap: L1 cache is tiny and fast, L2 is larger and slower, L3 is shared and slower still, and DRAM is orders of magnitude slower. Performance depends not only on latency but also on bandwidth: how many bytes per cycle you can move. Modern SIMD units can consume hundreds of bytes per cycle, so the cache subsystem must deliver data at extremely high bandwidth. Alignment, access patterns, and prefetching determine whether you hit L1, L2, L3, or DRAM.

Deep Dive into the Concept

Caches are organized into cache lines (typically 64 bytes). When you load a single byte, the entire line is fetched. This creates spatial locality benefits but also means poor access patterns can waste bandwidth. The L1 cache typically has a load-to-use latency of a few cycles (around 4-5 on recent cores) and can deliver multiple loads per cycle. The L2 cache is slower but larger, acting as a buffer between L1 and L3. The L3 cache is shared across cores, and its latency varies with topology (ring, mesh, or fabric). DRAM access involves the memory controller and can take hundreds of cycles.

Bandwidth is limited by the number of load/store ports, cache banks, and the width of data paths. Modern CPUs widen these paths to feed vector units, because wide SIMD (AVX-512, SVE, NEON) can consume large amounts of data per cycle. The practical result is that bandwidth is often the true bottleneck: the arithmetic units are ready, but the data path cannot keep them fed.

Cache policies also matter: some designs keep L3 inclusive of L2/L1, while others are non-inclusive or victim-style. This changes eviction behavior and how cross-core traffic is handled. On multi-core systems, shared caches and memory controllers can become the limiting factor long before per-core execution units are saturated.

Prefetchers are hardware units that detect access patterns and fetch data ahead of time. They are effective for linear or strided access, but can fail or even hurt performance for irregular patterns. TLBs (Translation Lookaside Buffers) cache page table entries so that virtual-to-physical translation is fast; TLB misses can be just as expensive as cache misses. Effective performance tuning requires understanding not just cache size, but cache line behavior, bank conflicts, and prefetch behavior.
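
One classic experiment that exposes these boundaries is a randomized pointer chase: every load depends on the previous one and the addresses are unpredictable, so prefetchers cannot hide the latency of whichever level currently holds the working set. The sketch below (buffer sizes and repetition counts are arbitrary choices) sweeps the working set from 16 KiB to 64 MiB.

// ptr_chase.c - pointer-chase latency vs working-set size (sketch)
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void) {
    srand(1);
    for (size_t ws = 16u << 10; ws <= (64u << 20); ws <<= 1) {   // 16 KiB .. 64 MiB
        size_t n = ws / sizeof(void *);
        void **chain = malloc(n * sizeof(void *));
        size_t *idx = malloc(n * sizeof(size_t));
        if (!chain || !idx) return 1;

        // Shuffle the visit order, then link each element to the next so the
        // traversal is one long serial chain over unpredictable addresses.
        for (size_t i = 0; i < n; i++) idx[i] = i;
        for (size_t i = n - 1; i > 0; i--) {
            size_t j = (size_t)rand() % (i + 1);
            size_t t = idx[i]; idx[i] = idx[j]; idx[j] = t;
        }
        for (size_t i = 0; i < n; i++) chain[idx[i]] = &chain[idx[(i + 1) % n]];

        void **p = &chain[idx[0]];
        const size_t loads = 1u << 23;
        clock_t t0 = clock();
        for (size_t i = 0; i < loads; i++) p = (void **)*p;      // dependent loads
        clock_t t1 = clock();

        printf("ws=%8zu KiB  ns/load=%6.2f  (%p)\n", ws >> 10,
               1e9 * (double)(t1 - t0) / CLOCKS_PER_SEC / (double)loads, (void *)p);
        free(chain);
        free(idx);
    }
    return 0;
}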

How this fits in projects

  • Project 9 measures L1 bandwidth limits.
  • Project 6 explores aliasing and load/store stalls.

Definitions & key terms

  • Cache line: The unit of data transfer (often 64 bytes).
  • TLB: Cache of address translations.
  • Prefetcher: Hardware that fetches data ahead of demand.
  • Bandwidth vs latency: Throughput vs access time.

Mental model diagram

L1 (fast, tiny) -> L2 (bigger) -> L3 (shared) -> DRAM (slow)

How it works (step-by-step)

  1. Load checks L1 cache.
  2. If miss, requests L2, then L3, then DRAM.
  3. Cache line is filled at each level.
  4. Prefetchers may fetch adjacent lines.

Minimal concrete example

// Large-stride access that defeats the prefetchers (and stresses the TLB):
// with 4-byte ints, stepping by 4096 elements skips 16 KB (four 4 KB pages) per access
for (int i = 0; i < N; i += 4096) sum += a[i];

Common misconceptions

  • “Caches are just about latency.” -> Bandwidth is often the limiting factor.
  • “Bigger cache always means faster.” -> Access patterns matter more than size.

Check-your-understanding questions

  1. Why is a cache line 64 bytes important for alignment?
  2. What is the difference between bandwidth and latency?
  3. How can prefetching hurt performance?

Check-your-understanding answers

  1. Misalignment causes extra cache line accesses.
  2. Latency is time per access; bandwidth is bytes per second.
  3. Wrong prefetches waste bandwidth and evict useful data.

Real-world applications

  • SIMD kernels and matrix multiplication
  • Data analytics and streaming pipelines
  • Memory-bound workloads like graph analytics

Where you’ll apply it

  • Project 9, Project 6, Final JIT

References

  • “Computer Systems: A Programmer’s Perspective” Ch. 6
  • Intel® 64 and IA-32 Architectures Software Developer’s Manual (2025)
    https://www.intel.com/content/www/us/en/developer/articles/technical/intel-sdm.html
  • Intel 64 and IA-32 Architectures Optimization Reference Manual (v50, 2024)
    https://www.intel.com/content/www/us/en/developer/articles/technical/intel-64-and-ia-32-architectures-optimization-reference-manual.html

Key insight

Memory bandwidth is now a first-class performance constraint, not just latency.

Summary

The memory hierarchy balances size and speed. Cache lines, prefetchers, and bandwidth limits determine how fast data can reach execution units. The best code maximizes locality and alignment to keep data in L1.

Homework/Exercises to practice the concept

  1. Measure bandwidth for sequential vs strided access.
  2. Align a buffer to 64 bytes and compare with unaligned.
  3. Increase working set size and identify L1, L2, L3, DRAM thresholds.

Solutions to the homework/exercises

  1. Sequential access should yield higher bandwidth.
  2. Aligned buffers reduce split loads and improve bandwidth.
  3. You will see distinct latency jumps at each cache boundary.

Chapter 6: Load/Store Subsystem and Memory Disambiguation

Fundamentals

Loads and stores are not just memory operations; they are deeply integrated into the out-of-order engine. The CPU must decide whether a load can execute before older stores have resolved their addresses. This is memory disambiguation. If the CPU guesses wrong, it must replay the load, which costs cycles. Store-to-load forwarding (STLF) is a fast path where a load can read data directly from the store buffer without going through cache. Address aliasing (notably 4K aliasing on x86) can force conservative stalls.

Deep Dive into the Concept

The load/store queue (LSQ) tracks in-flight memory operations. Stores enter a store buffer and wait until their address and data are ready; loads enter a load buffer and may execute speculatively. The critical decision is whether a load can pass older stores. Full address comparison is expensive because older store addresses may not yet be resolved, so CPUs use partial comparisons. On x86, many cores use the lower 12 bits (page offset) to determine if a load might alias a store. If the low bits match, the CPU may stall the load or issue it speculatively but be prepared to replay. This is the root of 4K aliasing: addresses that differ by multiples of 4096 bytes share the same page offset and cause false dependencies.

Store-to-load forwarding is a major optimization: if a load reads from an address that was recently written by an older store still in the store buffer, the data can be forwarded directly, avoiding cache access. However, forwarding only works under certain conditions: typically the load’s bytes must be fully covered by a single older store, and the exact alignment and size rules vary by core. Partial overlaps, misalignment, or size mismatches can break forwarding, causing expensive “store forwarding stalls”. Compilers and programmers can avoid such stalls by aligning data and ensuring consistent access sizes.
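
The forwarding rules are easy to trip over with mixed access sizes. The hedged sketch below stores and reloads the same 8-byte slot (forwarding should succeed), then stores only 1 byte before an 8-byte reload (forwarding is typically blocked, so the load waits for the store to drain); the exact penalty is core-specific.

// stlf_stall.c - successful vs blocked store-to-load forwarding (sketch)
#include <stdint.h>
#include <stdio.h>
#include <time.h>

#define ITERS 100000000

static union { uint64_t q; uint8_t b[8]; } slot;

int main(void) {
    volatile uint64_t sink = 0;

    clock_t t0 = clock();
    for (int i = 0; i < ITERS; i++) {
        *(volatile uint64_t *)&slot.q = (uint64_t)i;   // 8-byte store
        sink += *(volatile uint64_t *)&slot.q;         // 8-byte reload: forwarded from the store buffer
    }
    clock_t t1 = clock();

    for (int i = 0; i < ITERS; i++) {
        *(volatile uint8_t *)&slot.b[0] = (uint8_t)i;  // 1-byte store
        sink += *(volatile uint64_t *)&slot.q;         // 8-byte reload needs bytes the store did not
                                                       // write: forwarding is usually blocked
    }
    clock_t t2 = clock();

    printf("forwarded: %.3fs  blocked: %.3fs  (%llu)\n",
           (double)(t1 - t0) / CLOCKS_PER_SEC, (double)(t2 - t1) / CLOCKS_PER_SEC,
           (unsigned long long)sink);
    return 0;
}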

Memory disambiguation interacts with speculation and the ROB. When a load is executed speculatively and later found to have violated a dependency (an older store writes to the same address), the CPU must replay the load and potentially flush younger uOps. This is similar to a branch misprediction but often smaller in scope. The performance cost can be significant in pointer-heavy code or in tight loops that repeatedly access addresses with aliasing patterns. Understanding and measuring these effects helps you design data structures and memory layouts that avoid aliasing hotspots.
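
To provoke the 4K-aliasing case specifically, the sketch below stores to one address and immediately loads from an address exactly 4096 bytes away (same low 12 bits), then repeats with an extra 64-byte offset as a control. On many Intel cores the first variant shows up in counters such as LD_BLOCKS_PARTIAL.ADDRESS_ALIAS; the size of the timing gap, or whether one appears at all, varies by microarchitecture.

// alias4k.c - false load/store conflict from matching page offsets (sketch)
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define ITERS 100000000
#define REGION (1 << 20)

static double run(char *store_p, char *load_p) {
    volatile int64_t sink = 0;
    clock_t t0 = clock();
    for (int i = 0; i < ITERS; i++) {
        *(volatile int64_t *)store_p = i;              // older store
        sink += *(volatile int64_t *)load_p;           // younger load the core wants to issue early
    }
    clock_t t1 = clock();
    return (double)(t1 - t0) / CLOCKS_PER_SEC;
}

int main(void) {
    char *buf = aligned_alloc(4096, REGION);           // C11; region size is a multiple of 4096
    if (!buf) return 1;
    double aliased  = run(buf, buf + 4096);            // low 12 bits match: candidate false conflict
    double distinct = run(buf, buf + 4096 + 64);       // different page offset: control case
    printf("same page offset: %.3fs  different page offset: %.3fs\n", aliased, distinct);
    free(buf);
    return 0;
}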

How this fits in projects

  • Project 6 measures disambiguation penalties and 4K aliasing.
  • Project 9 relies on efficient load/store pipelines.

Definitions & key terms

  • LSQ: Load/Store Queue tracking in-flight memory ops.
  • Store buffer: Holds pending stores before commit.
  • STLF: Store-to-Load Forwarding.
  • 4K aliasing: False dependency from matching page offset.

Mental model diagram

Stores -> Store Buffer -----+
                            |--> Disambiguation logic --> Loads
Loads  -> Load Buffer ------+

How it works (step-by-step)

  1. Stores enter the store buffer; addresses may be unresolved.
  2. Loads enter the load buffer and check against older stores.
  3. If no conflict, load issues speculatively.
  4. If conflict discovered later, load is replayed.

Minimal concrete example

*ptrA = 1;
int x = *ptrB; // if ptrB == ptrA + 4096, the low 12 bits match and the load may be falsely blocked

Common misconceptions

  • “Loads always wait for stores.” -> They often bypass using speculation.
  • “Pointer aliasing is only a compiler problem.” -> Hardware aliasing matters too.

Check-your-understanding questions

  1. Why does 4K aliasing happen on x86?
  2. When does store-to-load forwarding fail?
  3. What is a memory dependency violation?

Check-your-understanding answers

  1. Partial address comparisons use the low 12 bits (page offset).
  2. When the load is misaligned, only partially overlaps the store, or the access sizes mismatch.
  3. When a speculative load was issued before an older store to the same address.

Real-world applications

  • Database buffer managers
  • Lock-free data structures
  • High-frequency trading order books

Where you’ll apply it

  • Project 6, Project 9

References

  • “Computer Architecture” Ch. 3.9
  • Intel 64 and IA-32 Architectures Optimization Reference Manual (v50, 2024)
    https://www.intel.com/content/www/us/en/developer/articles/technical/intel-64-and-ia-32-architectures-optimization-reference-manual.html
  • Intel® 64 and IA-32 Architectures Software Developer’s Manual (2025)
    https://www.intel.com/content/www/us/en/developer/articles/technical/intel-sdm.html

Key insight

Memory disambiguation is a hidden source of stalls that you can provoke and measure.

Summary

The load/store subsystem speculates to keep the pipeline busy. When it guesses wrong, it pays a penalty. 4K aliasing and store-forwarding stalls are measurable effects that influence real performance.

Homework/Exercises to practice the concept

  1. Create two pointers with the same page offset and measure latency.
  2. Compare aligned vs unaligned store-to-load forwarding.
  3. Vary store/load sizes and observe penalties.

Solutions to the homework/exercises

  1. Same page offset triggers aliasing stalls.
  2. Aligned accesses forward more reliably.
  3. Mismatched sizes often disable forwarding.

Chapter 7: Speculative Execution Security (Spectre and Friends)

Fundamentals

Speculative execution improves performance by executing instructions before it is known if they are needed. When speculation is wrong, architectural state is rolled back, but microarchitectural state (like caches) is not. This creates side channels: by measuring cache timing, an attacker can infer data that should have been inaccessible. Spectre exploits this by training the branch predictor to speculatively execute out-of-bounds memory accesses and then observing the cache footprint.

Deep Dive into the Concept

Spectre attacks exploit the gap between architectural correctness and microarchitectural side effects. In Spectre Variant 1 (bounds check bypass), an attacker repeatedly trains a branch predictor with safe inputs so that the predictor expects a bounds check to pass. Then the attacker provides an out-of-bounds index, causing the CPU to speculatively execute the code guarded by the bounds check. Even though the architectural state is rolled back when the bounds check resolves false, the speculative load brings secret-dependent data into the cache. By timing subsequent accesses, the attacker can infer which cache line was loaded and thus the secret data.

This attack relies on three elements: a mistrained predictor, speculative execution that reads secret data, and a side channel (usually cache timing). The technique is often combined with Flush+Reload or Prime+Probe to amplify timing differences. Mitigations include inserting fences (LFENCE on x86), retpolines that keep indirect branches from being steered by the predictor, compiler-based speculation barriers, and hardware changes that restrict speculation across security boundaries. However, these mitigations often reduce performance, especially in tight loops or kernel transitions.

Modern CPUs provide controls like Speculative Store Bypass Disable (SSBD) and other microcode features, but complete mitigation is challenging because speculation is deeply intertwined with performance. Many real-world mitigations rely on isolating sensitive data, avoiding shared memory between trust domains, and reducing timing resolution in sandboxed environments. Understanding Spectre at a microarchitectural level helps you reason about which code patterns are risky and how performance features can create security trade-offs.

Beyond Variant 1, real systems also mitigate indirect branch target injection (Variant 2) and related attacks with techniques like retpolines, Indirect Branch Restricted Speculation (IBRS), and core/thread isolation in sensitive contexts. These defenses reduce speculation in critical code paths but often introduce measurable performance costs, which is why microarchitecture-aware profiling is essential for security-focused tuning.
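
Homework 1 below asks for the timing side of Flush+Reload, and the hedged x86-64 sketch here covers exactly that: it compares the average measured cost of a load that hits in cache against one whose line was just flushed with CLFLUSH. Timer overhead dominates a single load, so read the two averages relative to each other, and expect to need a different timer and cache-maintenance path on ARM.

// flush_reload_threshold.c - cached vs flushed load timing (sketch; x86-64, GCC/Clang)
#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>   // __rdtscp, _mm_clflush, _mm_lfence, _mm_mfence

static uint8_t probe[4096];

static uint64_t time_load(const uint8_t *p) {
    unsigned aux;
    _mm_mfence();
    uint64_t t0 = __rdtscp(&aux);
    _mm_lfence();                              // keep the timed load from starting early
    (void)*(volatile const uint8_t *)p;        // the timed load
    uint64_t t1 = __rdtscp(&aux);              // waits for prior instructions to complete
    _mm_lfence();
    return t1 - t0;
}

int main(void) {
    uint8_t *p = &probe[64];
    const int reps = 1000;
    uint64_t hit = 0, miss = 0;
    for (int i = 0; i < reps; i++) {
        (void)*(volatile uint8_t *)p;          // warm the line
        hit += time_load(p);
        _mm_clflush(p);                        // evict the line from the cache hierarchy
        _mm_mfence();
        miss += time_load(p);
    }
    printf("avg measured cycles: cached=%.1f  flushed=%.1f\n",
           (double)hit / reps, (double)miss / reps);
    return 0;
}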

How this fits in projects

  • Project 3 implements a Spectre-lite demo and tests mitigations.

Definitions & key terms

  • Spectre: Class of speculative execution attacks.
  • Side channel: Information leak via timing or other indirect effects.
  • LFENCE: x86 fence commonly used as a speculation barrier; later instructions do not begin executing until it completes.

Mental model diagram

Bounds check -> predicted taken -> speculative load -> cache footprint
                                -> rollback arch state (cache unchanged)

How it works (step-by-step)

  1. Train predictor with in-bounds indices.
  2. Issue out-of-bounds index.
  3. Speculative load touches secret-derived cache line.
  4. Measure cache timing to recover secret.

Minimal concrete example

if (idx < size) {                      // predictor trained to assume the check passes
    temp &= probe[secret[idx] * 4096]; // speculative load pulls a secret-dependent line into the cache
}

Common misconceptions

  • “Speculation always respects bounds checks.” -> The check can be bypassed speculatively.
  • “Side channels are rare.” -> They exist in most shared microarchitectural resources.

Check-your-understanding questions

  1. Why is cache state not rolled back on mis-speculation?
  2. What is the role of branch predictor training?
  3. How does LFENCE help?

Check-your-understanding answers

  1. Rolling back cache state is complex and too costly.
  2. It biases the predictor to take the speculative path.
  3. It prevents later instructions from beginning execution until earlier ones have completed, closing the speculative window.

Real-world applications

  • Browser sandboxing and JIT hardening
  • OS kernel isolation techniques
  • Secure enclave design considerations

Where you’ll apply it

  • Project 3

References

  • Kocher et al., “Spectre Attacks: Exploiting Speculative Execution” (2018)
    https://arxiv.org/abs/1801.01203
  • Intel® 64 and IA-32 Architectures Software Developer’s Manual (2025)
    https://www.intel.com/content/www/us/en/developer/articles/technical/intel-sdm.html

Key insight

Speculation is performance magic with security consequences; you must measure both.

Summary

Speculative execution can leak secrets through cache timing. Spectre demonstrates that microarchitectural side effects can violate software isolation assumptions. Mitigations trade performance for safety.

Homework/Exercises to practice the concept

  1. Implement Flush+Reload and measure L1 vs L2 timing thresholds.
  2. Add LFENCE and measure how it changes leakage and performance.
  3. Compare speculation impact across two CPUs.

Solutions to the homework/exercises

  1. L1 hits should be significantly faster than L2/L3/DRAM.
  2. LFENCE reduces leakage but increases cycles.
  3. Different CPUs show different misprediction costs and leakage rates.

Chapter 8: Vectorization and Heterogeneous Cores

Fundamentals

Modern CPUs rely on wide vector units (SIMD) and heterogeneous core designs to deliver performance per watt. SIMD executes multiple data elements per instruction (e.g., 4, 8, or 16 lanes), but this increases pressure on memory bandwidth and load/store units. Heterogeneous cores (performance and efficiency cores) trade width and frequency for power. The OS and runtime must schedule the right workload on the right core. Intel Thread Director is one example of hardware that guides the OS scheduler using real-time telemetry. Apple and ARM designs also use big/little style approaches with different core types.

Deep Dive into the Concept

Vectorization is the primary way to scale performance without increasing clock speed. AVX2 provides 256-bit vectors, AVX-512 provides 512-bit vectors, and ARM uses NEON or SVE. Wider vectors increase throughput but also increase register file pressure and bandwidth demand. Vector pipelines often have different port bindings than scalar pipelines, so vector code can be either faster or slower depending on port pressure and memory bandwidth. For example, doubling vector width without doubling L1 bandwidth will leave the vector units starved. This is why cache bandwidth upgrades are a key part of modern microarchitectures.

Heterogeneous cores complicate performance analysis. Performance cores (P-cores) typically have wider front-ends, bigger ROBs, and more execution resources, while efficiency cores (E-cores) are narrower but more power efficient. Intel Thread Director provides hardware guidance to the OS scheduler by monitoring instruction mix and core state; this enables better placement of latency-sensitive threads on P-cores and background work on E-cores. Apple and ARM systems often use a similar concept, but with different scheduling stacks. The practical impact is that performance characteristics vary by core type, and microbenchmarks must be pinned to specific cores to be meaningful.

Vectorization and heterogeneity also affect JITs and runtime dispatch. Libraries often detect CPU capabilities (AVX-512 vs AVX2) and choose different code paths. On heterogeneous systems, the best code path for a P-core might be worse on an E-core due to narrower ports or different cache sizes. A microarchitecture-aware JIT can adapt code generation to both vector width and core type, enabling high performance across diverse hardware.

On ARM systems, big.LITTLE designs pair “big” cores with “LITTLE” cores to balance throughput and efficiency, and the OS must schedule accordingly. On Intel hybrid systems, Thread Director provides hardware hints about core suitability, but you still need to pin and compare workloads to understand actual performance. These scheduling layers explain why “same CPU” can deliver very different results depending on core placement.
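
As a concrete illustration of both points, here is a hedged Linux/x86 sketch that pins the current thread to a core chosen on the command line and then dispatches among stub code paths with the GCC/Clang __builtin_cpu_supports check; the vec_add_* functions are placeholders, and macOS or ARM systems need different affinity and feature-detection mechanisms.

// dispatch_and_pin.c - core pinning + feature dispatch (sketch; Linux, x86, GCC/Clang)
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

static void vec_add_avx512(void) { puts("AVX-512 path"); }   // placeholder kernels
static void vec_add_avx2(void)   { puts("AVX2 path"); }
static void vec_add_scalar(void) { puts("scalar path"); }

// Pin the calling thread to one core so P-core vs E-core runs are comparable.
static int pin_to_cpu(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof(set), &set);   // 0 = calling thread
}

int main(int argc, char **argv) {
    int cpu = (argc > 1) ? atoi(argv[1]) : 0;
    if (pin_to_cpu(cpu) != 0) perror("sched_setaffinity");

    // Runtime CPUID-based feature checks (GCC/Clang builtin, x86 only).
    if (__builtin_cpu_supports("avx512f"))   vec_add_avx512();
    else if (__builtin_cpu_supports("avx2")) vec_add_avx2();
    else                                     vec_add_scalar();
    return 0;
}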

How this fits in projects

  • Project 9 measures cache bandwidth to feed vector units.
  • Project 10 compares P vs E cores.
  • Final JIT selects vector code paths based on CPU and core type.

Definitions & key terms

  • SIMD: Single Instruction, Multiple Data.
  • AVX-512 / SVE / NEON: Vector instruction sets.
  • Heterogeneous cores: Mix of performance and efficiency cores.
  • Thread Director: Intel hardware guidance for scheduling.

Mental model diagram

Workload -> Scheduler -> P-core (wide) or E-core (efficient)
         -> Vector width selection (AVX2 / AVX-512 / NEON)

How it works (step-by-step)

  1. Detect CPU features and core type.
  2. Choose vector width and code path.
  3. Pin threads to specific cores for consistency.
  4. Measure bandwidth and compute efficiency.

Minimal concrete example

// Dispatch based on AVX-512 support
if (cpu.has_avx512) vec_add_512(a,b,c,n);
else vec_add_256(a,b,c,n);

Common misconceptions

  • “Wider SIMD is always faster.” -> Not if memory bandwidth is the limiter.
  • “All cores behave the same.” -> P/E cores can differ significantly.

Check-your-understanding questions

  1. Why does SIMD require more memory bandwidth?
  2. What role does Thread Director play?
  3. Why pin threads for microbenchmarks?

Check-your-understanding answers

  1. Each instruction processes more data per cycle.
  2. It guides OS scheduling decisions using hardware telemetry.
  3. To avoid cross-core variability and scheduler noise.

Real-world applications

  • Video codecs and ML inference
  • Scientific computing and analytics
  • Mobile devices balancing performance and battery

Where you’ll apply it

  • Project 9, Project 10, Final JIT

References

  • Intel Thread Director overview (Intel support documentation)
    https://www.intel.com/content/www/us/en/support/articles/000097634/processors.html
  • Intel Thread Director (Game Dev Guide)
    https://www.intel.com/content/www/us/en/developer/articles/guide/optimization-guide-intel-thread-director.html
  • ARM big.LITTLE overview
    https://www.arm.com/technologies/big-little
  • Intel 64 and IA-32 Architectures Optimization Reference Manual (v50, 2024)
    https://www.intel.com/content/www/us/en/developer/articles/technical/intel-64-and-ia-32-architectures-optimization-reference-manual.html

Key insight

Performance per watt comes from wide vectors and smart scheduling, not just clocks.

Summary

Vectorization and heterogeneity are the dominant levers in modern CPU design. SIMD increases throughput but requires bandwidth. Heterogeneous cores require intelligent scheduling and runtime adaptability.

Homework/Exercises to practice the concept

  1. Measure vectorized vs scalar throughput for a simple loop.
  2. Pin the same workload to P and E cores and compare IPC.
  3. Test AVX2 vs AVX-512 code paths with identical data sizes.

Solutions to the homework/exercises

  1. Vectorized code should be faster until bandwidth limits are hit.
  2. P-cores typically show higher IPC and lower latency.
  3. AVX-512 may not win if bandwidth or port pressure dominates.

Glossary (High-Signal)

  • AGU: Address Generation Unit used for calculating memory addresses.
  • BPU: Branch Prediction Unit, combines multiple predictors to guess control flow.
  • BTB: Branch Target Buffer, stores target addresses of branches.
  • BHT/PHT: Branch/Pattern History Table used to predict branch direction.
  • DSB/uOp cache: Cache of decoded micro-operations.
  • IPC: Instructions per cycle, a measure of throughput.
  • LSQ: Load/Store Queue tracking in-flight memory ops for ordering and forwarding.
  • MITE: Intel term for the legacy decode path.
  • PMU: Performance Monitoring Unit that exposes hardware counters.
  • PRF: Physical Register File used after register renaming.
  • RSB: Return Stack Buffer predicting returns.
  • ROB: Reorder Buffer that retires uOps in order.
  • Speculation: Executing before outcomes are known.
  • STLF: Store-to-Load Forwarding.
  • TMAM: Top-Down Microarchitecture Analysis Method (slot-based bottleneck analysis).
  • TAGE: Tagged Geometric branch predictor family.
  • TLB: Translation Lookaside Buffer for address translation.

Why Modern CPU Internals Matter

The Modern Problem It Solves

CPU performance today is limited less by clock speed and more by microarchitectural efficiency: branch prediction accuracy, front-end bandwidth, cache behavior, and the ability to overlap long-latency work inside the out-of-order window. This matters because performance-per-watt has become the dominant constraint, especially in data centers and AI-heavy workloads where small per-core inefficiencies multiply into massive energy costs.

Real-world impact (2022–2030):

  • Global electricity share: Data centers consumed ~240–340 TWh in 2022 (about 1–1.3% of global electricity). Source: IEA data center & network energy analysis (2024 update).
    https://www.iea.org/energy-system/buildings/data-centres-and-data-transmission-networks
  • Current and projected demand: Data center electricity use reached ~415 TWh in 2024 and is projected to rise to ~945 TWh by 2030. Source: IEA Energy and AI report (2025).
    https://www.iea.org/reports/energy-and-ai/executive-summary
  • AI/accelerated workloads: The IEA notes that AI workloads are a key driver of the next wave of data center electricity growth.
    https://www.iea.org/topics/artificial-intelligence

Why this matters to developers:

  • A loop that misses the uOp cache can be 2–4× slower even on “fast” CPUs.
  • Mispredicted branches waste a full pipeline refill's worth of cycles.
  • SIMD code that ignores port pressure or bandwidth can underperform scalar code.

OLD APPROACH (Frequency Scaling)       NEW APPROACH (Specialization)
┌──────────────────────────────┐       ┌──────────────────────────────┐
│ Higher GHz every generation  │       │ Wider decode, smarter uOps   │
│ Single-thread focus          │       │ SIMD + accelerators + caches │
└──────────────────────────────┘       └──────────────────────────────┘

Context & Evolution (History)

  • Dennard scaling ended, so power became the limiter.
  • ILP, OoO, and branch prediction became the main performance levers.
  • SIMD widths expanded and heterogeneous cores became mainstream.

Concept Summary Table

Concept Cluster      | What You Need to Internalize
---------------------|------------------------------------------------------------------
Front-End & uOps     | Decode is a hard bottleneck; uOp cache and fusion determine loop speed.
Branch Prediction    | TAGE-style predictors learn patterns; mispredictions dominate control-heavy code.
OoO + ROB            | The instruction window limits latency hiding and defines true parallelism.
Execution Ports      | Port binding and throughput determine real IPC ceilings.
Memory Hierarchy     | Bandwidth and cache lines are just as important as latency.
Disambiguation       | Load/store speculation and 4K aliasing create hidden stalls.
Speculation Security | Speculation leaks data via microarchitectural side effects.
Vector + Hetero      | Wide SIMD and core heterogeneity define modern performance per watt.

Project-to-Concept Map

Project                                     | What It Builds                 | Primer Chapters It Uses
--------------------------------------------|--------------------------------|-----------------------------------------
Project 1: Human Pipeline Trace             | Pipeline timeline simulator    | Front-End & uOps
Project 2: Branch Predictor Torture Test    | Predictor stress suite         | Branch Prediction
Project 3: Spectre-lite                     | Side-channel demo              | Branch Prediction, Speculation Security
Project 4: uOp Cache Prober                 | uOp cache boundary detection   | Front-End & uOps
Project 5: Port Pressure Map                | Port binding profiler          | Execution Ports
Project 6: Memory Disambiguation Probe      | Aliasing stall map             | Disambiguation
Project 7: ROB Boundary Finder              | Instruction window measurement | OoO + ROB
Project 8: Macro-op Fusion Detector         | Fusion rule catalog            | Front-End & uOps
Project 9: L1 Bandwidth Stressor            | Bandwidth saturation           | Memory Hierarchy, Vector + Hetero
Project 10: Lunar Lake P vs E Core Profiler | P/E core comparison            | Vector + Hetero
Final Project: uArch-Aware JIT              | Adaptive codegen               | All chapters

Deep Dive Reading by Concept

Fundamentals & Architecture

Concept Book & Chapter Why This Matters
Pipelines and hazards “Computer Systems: A Programmer’s Perspective” Ch. 4 Ground truth for pipeline timing.
OoO execution “Computer Architecture” (Hennessy & Patterson) Ch. 3 Core model for modern CPUs.
Register renaming “Computer Organization and Design” Ch. 3-4 Explains physical vs architectural state.

Front-End and Branch Prediction

Concept Book & Chapter Why This Matters
Instruction decoding “Inside the Machine” Ch. 4 Decode is a major performance limit.
Branch prediction “Computer Architecture” Ch. 3 Predictor accuracy dominates IPC.

Memory and Performance

Concept Book & Chapter Why This Matters
Memory hierarchy “Computer Systems” Ch. 6 Caches and bandwidth determine throughput.
Latency vs throughput “Write Great Code, Vol 1” Ch. 12 Core mental model for tuning.

Security and Speculation

Concept Book & Chapter Why This Matters
Side channels “Practical Binary Analysis” Ch. 11 Spectre-like techniques explained.

Performance Measurement & Tooling

Concept Book & Chapter Why This Matters
Timing methodology “Computer Systems” Ch. 5–6 Baseline for cycles/IPC and memory behavior.
Benchmarking mindset “Write Great Code, Vol 1” Ch. 12 Practical throughput vs latency framing.
Low-level debugging “The Art of Debugging with GDB, DDD, and Eclipse” Ch. 1–4 Tooling for inspecting machine code and timing loops.

Vectorization & SIMD

Concept Book & Chapter Why This Matters
SIMD fundamentals “Inside the Machine” Ch. 4 SIMD width and uOp throughput basics.
Cache-aware loops “Computer Systems” Ch. 6 Cache-aware data layout and bandwidth limits.

Primary Vendor Docs & Standards (Official)

  • Front-end / decode / uOps: Intel 64 and IA-32 Optimization Reference Manual (2024, v50). Official rules for decoding, fusion, and front-end throughput limits.
    https://www.intel.com/content/www/us/en/developer/articles/technical/intel-64-and-ia-32-architectures-optimization-reference-manual.html
  • Architectural state, exceptions, memory ordering: Intel® 64 and IA-32 Architectures Software Developer’s Manual (2025). Canonical ISA behavior, memory model, and low-level system details.
    https://www.intel.com/content/www/us/en/developer/articles/technical/intel-sdm.html
  • Top-down bottleneck analysis: Intel Top-Down Microarchitecture Analysis Method (TMAM). Defines “pipeline slots” and a systematic bottleneck taxonomy.
    https://www.intel.com/content/www/us/en/docs/vtune-profiler/cookbook/2024-1/top-down-microarchitecture-analysis-method.html
  • Performance counters: Intel perfmon events. Up-to-date PMU event lists used by perf/VTune.
    https://perfmon-events.intel.com/
  • Linux perf system calls: perf_event_open(2). The kernel API for configuring PMU events.
    https://man7.org/linux/man-pages/man2/perf_event_open.2.html
  • perf permissions: Linux perf security docs. Explains perf_event_paranoid and user access.
    https://www.kernel.org/doc/html/latest/admin-guide/perf-security.html
  • Speculative execution attacks: Spectre paper (2018). Original threat model and variants.
    https://arxiv.org/abs/1801.01203
  • Heterogeneous cores (ARM): ARM big.LITTLE. Official description of heterogeneous core design.
    https://www.arm.com/technologies/big-little
  • Heterogeneous scheduling (Intel): Intel Thread Director. Hardware guidance for OS scheduling on hybrid CPUs.
    https://www.intel.com/content/www/us/en/support/articles/000097634/processors.html

Quick Start: Your First 48 Hours

Day 1 (4 hours):

  1. Read Chapter 1 and Chapter 2 in the primer.
  2. Skim Project 1 and Project 2 to understand outcomes.
  3. Build Project 1 quickly (even if minimal) and generate a pipeline trace.

Day 2 (4 hours):

  1. Extend Project 1 with forwarding logic.
  2. Run Project 2 on a single pattern and measure cycles.
  3. Compare your results with the expected misprediction penalties.

End of Weekend: You will understand pipeline hazards and branch prediction basics and can explain why certain loops are slow.

Learning Paths

Path 1: Architecture Fundamentals

  1. Project 1 -> Pipeline intuition
  2. Project 2 -> Predictor intuition
  3. Project 5 -> Port pressure
  4. Project 7 -> ROB window

Path 2: Performance Engineer

  1. Project 4 -> uOp cache
  2. Project 5 -> Port pressure
  3. Project 9 -> Bandwidth
  4. Final JIT -> Adaptive codegen

Path 3: Security Focus

  1. Project 3 -> Spectre-lite
  2. Project 2 -> Predictor behavior
  3. Project 6 -> Disambiguation

Path 4: Completionist

  • Projects 1-10 in order, then Final JIT

Success Metrics

  • You can predict IPC changes when code is aligned vs unaligned.
  • You can explain a misprediction penalty in cycles on your CPU.
  • You can measure the ROB window within 10% of reported values.
  • You can increase throughput by tuning for port pressure and bandwidth.
  • You can explain and mitigate a Spectre-style side channel.
  • You can separate front-end vs back-end bottlenecks using top-down analysis.
  • You can explain why a given loop is compute-bound vs memory-bound.
  • You can choose between scalar/AVX/AVX-512 (or NEON/SVE) based on bottlenecks.
  • You can build microbenchmarks whose results are reproducible across runs.

Appendix A: Measurement & Benchmarking Toolkit

Minimal toolchain:

  • perf stat -e cycles,instructions,branches,branch-misses
  • rdtsc / rdtscp timing loops
  • taskset or sched_setaffinity for core pinning (a combined timing-and-pinning sketch follows this list)
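
A minimal sketch of the timing-loop and pinning items above, assuming x86-64 Linux with GCC or Clang; the core number, warm-up count, and region under test are placeholders (note that rdtscp counts TSC ticks, which equal core cycles only at a fixed frequency):

// timing_skeleton.c -- pin to one core, then time a region with rdtscp.
// Build e.g.: gcc -O2 timing_skeleton.c
#define _GNU_SOURCE
#include <sched.h>
#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>   // __rdtscp, _mm_lfence

static void pin_to_cpu(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    sched_setaffinity(0, sizeof set, &set);   // 0 = calling thread
}

int main(void) {
    pin_to_cpu(0);                            // arbitrary choice: core 0
    volatile uint64_t sink = 0;
    unsigned aux;

    for (int i = 0; i < 100000; i++) sink += i;   // warm up caches and predictors

    _mm_lfence();                             // keep earlier work out of the window
    uint64_t t0 = __rdtscp(&aux);
    for (int i = 0; i < 100000; i++) sink += i;   // region under test
    uint64_t t1 = __rdtscp(&aux);
    _mm_lfence();

    printf("TSC ticks: %llu\n", (unsigned long long)(t1 - t0));
    return 0;
}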

Recommended extras:

  • perf stat -e topdown.* (Intel Top-Down Microarchitecture Analysis Method)
  • perf stat -e mem_load_retired.* (cache-level hit/miss breakdown)
  • perf record + perf report for hot-path attribution
  • llvm-mca for static uOp/port estimates (sanity check vs measurements)

Counter access & permissions:

  • On Linux, perf_event_open(2) is the kernel API behind perf.
    https://man7.org/linux/man-pages/man2/perf_event_open.2.html
  • If you see “permission denied,” check perf_event_paranoid.
    https://www.kernel.org/doc/html/latest/admin-guide/perf-security.html

Top-Down quick recipe (TMAM-style):

  1. Measure total slots and split them into retiring, bad speculation, front-end bound, and back-end bound (a level-1 sketch follows these steps).
  2. If front-end bound, check uOp cache hit rate and I-cache misses.
  3. If back-end bound, split into core-bound vs memory-bound.
  4. If memory-bound, identify L1/L2/L3/DRAM bottlenecks and prefetcher behavior.
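
A sketch of the level-1 arithmetic behind step 1, following the widely used Intel formulas; the 4-slots-per-cycle width and the counter names in the comments are assumptions to check against your CPU's documentation:

// topdown_level1.c -- level-1 Top-Down fractions from raw counter values.
#include <stdio.h>

typedef struct {
    double retiring, bad_speculation, frontend_bound, backend_bound;
} TopdownL1;

static TopdownL1 topdown_level1(double cycles,
                                double slots_retired,      // e.g. UOPS_RETIRED.RETIRE_SLOTS
                                double uops_issued,        // e.g. UOPS_ISSUED.ANY
                                double recovery_cycles,    // e.g. INT_MISC.RECOVERY_CYCLES
                                double fe_undelivered) {   // e.g. IDQ_UOPS_NOT_DELIVERED.CORE
    double slots = 4.0 * cycles;                           // assumes a 4-wide machine
    TopdownL1 t;
    t.retiring        = slots_retired / slots;
    t.bad_speculation = (uops_issued - slots_retired + 4.0 * recovery_cycles) / slots;
    t.frontend_bound  = fe_undelivered / slots;
    t.backend_bound   = 1.0 - t.retiring - t.bad_speculation - t.frontend_bound;
    return t;
}

int main(void) {
    // Made-up counter values, purely to exercise the formula.
    TopdownL1 t = topdown_level1(1e9, 2.4e9, 2.6e9, 5e6, 4e8);
    printf("retiring %.2f  bad-spec %.2f  FE-bound %.2f  BE-bound %.2f\n",
           t.retiring, t.bad_speculation, t.frontend_bound, t.backend_bound);
    return 0;
}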

Best practices:

  • Warm up caches and predictors before timing.
  • Pin to a single core; disable turbo if you want stable comparisons.
  • Use large iteration counts and report median of 10 runs.
  • Record CPU model, microcode, OS, compiler flags, and kernel version.

Project Overview Table

Project                  | Difficulty   | Time    | Depth of Understanding | Fun Factor
-------------------------|--------------|---------|------------------------|-----------
1. Human Pipeline Trace  | Beginner     | Weekend | Medium                 | 3/5
2. Branch Torture Test   | Intermediate | 1 Week  | High                   | 4/5
3. Spectre-lite          | Expert       | 2 Weeks | Extremely High         | 5/5
4. uOp Cache Prober      | Advanced     | 1 Week  | High                   | 4/5
5. Port Pressure Map     | Advanced     | 1 Week  | High                   | 4/5
6. Memory Disambiguation | Expert       | 2 Weeks | High                   | 3/5
7. ROB Boundary Finder   | Expert       | 1 Week  | High                   | 3/5
8. Macro-op Fusion       | Intermediate | 3 Days  | Medium                 | 4/5
9. L1 Bandwidth Stress   | Advanced     | 1 Week  | High                   | 3/5
10. Lunar Lake Profiler  | Advanced     | 1 Week  | High                   | 5/5
Final: uArch-Aware JIT   | Expert       | 1 Month | Extreme                | 5/5

Project List

Project 1: The Human Pipeline Trace

  • File: MODERN_CPU_INTERNALS_2025_DEEP_DIVE.md
  • Main Programming Language: C
  • Alternative Programming Languages: Assembly, Rust
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 1: Beginner
  • Knowledge Area: Pipelining / Hazards
  • Software or Tool: objdump, gdb
  • Main Book: “Computer Systems: A Programmer’s Perspective” by Bryant & O’Hallaron

What you’ll build: A CLI tool that parses a sequence of assembly instructions and generates a cycle-by-cycle timeline of how they move through a 5-stage pipeline (IF/ID/EX/MEM/WB), highlighting stalls, bubbles, and forwarding events.

Why it teaches CPU internals: It turns the CPU into a visual system. You stop thinking in terms of “one instruction per line” and start seeing a factory line where data hazards and structural hazards are real physical delays.

Core challenges you’ll face:

  • Parsing dependencies -> Determine which registers each instruction reads/writes.
  • Modeling forwarding paths -> Allow bypass from EX/MEM to dependent EX stages.
  • Stall injection -> Insert bubbles when hazards cannot be resolved.

Real World Outcome

You will have a working micro-simulator that exposes the heartbeat of a pipeline. You will see exactly why a dependency creates a stall and how forwarding repairs it.

What you will see:

  1. A pipeline timeline with cycles and stage occupancy.
  2. A hazard report (RAW/WAR/WAW/control hazards).
  3. Summary statistics (total cycles, IPC, stall breakdown).

Command Line Outcome Example:

$ gcc -S -O0 my_loop.c -o my_loop.s
$ ./pipe_trace my_loop.s --forwarding=on

[Pipeline Trace]
Cycle | IF | ID | EX | MEM | WB | Notes
------+----+----+----+-----+----+------------------------------
  1   | I1 |    |    |     |    | Fetch ADD
  2   | I2 | I1 |    |     |    | Decode ADD
  3   | I3 | I2 | I1 |     |    | EX start for I1
  4   | I4 | I3 | I2 | I1  |    | Forward R1 from I1 to I2
  5   | I5 | I4 | I3 | I2  | I1 | No stall

[Hazards]
- RAW: I2 depends on I1 (R1) -> Resolved via forwarding
- Control: none

[Stats]
Total cycles: 9
Instructions: 5
IPC: 0.56
Stalls: 0

The Core Question You’re Answering

“Why does adding a single instruction sometimes make my loop twice as slow?”

This project reveals that instructions are not abstract; they are physical data movements. Hazards and stage conflicts create stalls that you can see and count.


Concepts You Must Understand First

Stop and research these before coding:

  1. 5-stage pipeline model
    • What happens in IF/ID/EX/MEM/WB?
    • When is a register value actually written?
    • Book Reference: “Computer Systems” Ch. 4
  2. Data hazards (RAW/WAR/WAW)
    • Which hazards occur in an in-order pipeline?
    • Why is RAW the most important in a 5-stage pipeline?
    • Book Reference: “Computer Organization and Design” Ch. 4
  3. Forwarding/bypassing
    • What paths can forward a result from EX/MEM to EX?
    • When does forwarding still fail?
    • Book Reference: “Digital Design and Computer Architecture” Ch. 7

Questions to Guide Your Design

  1. Pipeline representation
    • Will you model by cycle or by instruction?
    • How do you represent bubbles and stalls?
  2. Dependency tracking
    • How do you parse registers and track availability?
    • How do you handle memory operands vs register operands?
  3. Forwarding logic
    • Which stage produces results? Which stage consumes them?
    • How do you decide if a dependency is resolved by forwarding?

Thinking Exercise

The Two-Instruction Trap

Trace by hand:

ADD R1, R2, R3
SUB R4, R1, R5

Questions:

  • In which cycle is R1 written?
  • In which cycle is R1 read?
  • How many bubbles if forwarding is OFF?
  • How many if forwarding is ON?

The Interview Questions They’ll Ask

  1. “What is the difference between throughput and latency in a pipeline?”
  2. “Why does RAW hazard cause a stall but WAR does not in in-order?”
  3. “Explain forwarding and when it fails.”
  4. “How does branch prediction reduce control hazards?”
  5. “What is a pipeline bubble and why does it exist?”

Hints in Layers

Hint 1: Data structures Use a struct with src/dst regs and a current stage index.

typedef struct {
  int src1, src2, dst;
  int stage; // 0=IF,1=ID,2=EX,3=MEM,4=WB
} Instr;

Hint 2: Move backward Update pipeline stages from WB -> IF so instructions do not move twice.

Hint 3: Stall decision point Stalls happen in ID. If ID needs a register that EX/MEM will write, freeze IF/ID.

Hint 4: Verification Add a debug flag to dump register readiness each cycle.
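
Putting Hints 1-3 together, a minimal sketch of the per-cycle update (not a full simulator). It uses a trimmed variant of Hint 1's struct, assumes a write-before-read register file, and ignores load-use hazards when forwarding is on; everything else is a placeholder:

// pipe_step.c -- stages are updated from WB back to IF so an instruction moves
// at most once per cycle; the stall decision is made in ID before anything moves.
#include <stdio.h>

enum { IF, ID, EX, MEM, WB, NSTAGES };

typedef struct { int src1, src2, dst; } Instr;   // dst < 0 means "writes nothing"

static int needs_stall(const Instr *consumer, const Instr *insts,
                       const int in_stage[NSTAGES], int forwarding) {
    if (forwarding) return 0;                    // sketch: full forwarding, load-use ignored
    for (int s = EX; s <= MEM; s++) {            // producers not yet written back
        int idx = in_stage[s];
        if (idx < 0) continue;
        int dst = insts[idx].dst;
        if (dst >= 0 && (dst == consumer->src1 || dst == consumer->src2))
            return 1;                            // RAW hazard -> stall (Hint 3)
    }
    return 0;
}

static void step_cycle(const Instr *insts, int n, int in_stage[NSTAGES],
                       int *next_fetch, int forwarding) {
    int stall = in_stage[ID] >= 0 &&
                needs_stall(&insts[in_stage[ID]], insts, in_stage, forwarding);
    in_stage[WB]  = in_stage[MEM];               // move from the back first (Hint 2)
    in_stage[MEM] = in_stage[EX];
    if (!stall) {
        in_stage[EX] = in_stage[ID];
        in_stage[ID] = in_stage[IF];
        in_stage[IF] = (*next_fetch < n) ? (*next_fetch)++ : -1;
    } else {
        in_stage[EX] = -1;                       // inject a bubble; IF and ID freeze
    }
}

int main(void) {
    // The Thinking Exercise pair: ADD R1,R2,R3 then SUB R4,R1,R5.
    Instr prog[] = { { 2, 3, 1 }, { 1, 5, 4 } };
    int in_stage[NSTAGES] = { -1, -1, -1, -1, -1 };
    int next_fetch = 0, forwarding = 0;
    const char *names[NSTAGES] = { "IF", "ID", "EX", "MEM", "WB" };

    for (int cycle = 1; cycle <= 9; cycle++) {
        step_cycle(prog, 2, in_stage, &next_fetch, forwarding);
        printf("cycle %d:", cycle);
        for (int s = 0; s < NSTAGES; s++) {
            if (in_stage[s] < 0) printf("  %s=--", names[s]);
            else                 printf("  %s=I%d", names[s], in_stage[s] + 1);
        }
        printf("\n");
    }
    return 0;
}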


Books That Will Help

Topic Book Chapter
Pipeline basics “Computer Systems” Ch. 4
Hazards “Computer Organization and Design” Ch. 4
Forwarding “Digital Design and Computer Architecture” Ch. 7

Common Pitfalls & Debugging

Problem 1: “Instructions jump multiple stages in one cycle”

  • Why: You updated IF->ID before ID->EX.
  • Fix: Update from WB backward.
  • Quick test: Add a unit test with 1 instruction, confirm 5 cycles to retire.

Problem 2: “Forwarding never triggers”

  • Why: You compare architectural registers but did not normalize register names.
  • Fix: Map register names to numeric IDs.
  • Quick test: Use a test with ADD then SUB using the same reg.

Problem 3: “Stalls never happen”

  • Why: Your hazard detection ignores EX/MEM stages.
  • Fix: Check dependencies against all in-flight instructions.
  • Quick test: Use back-to-back dependent ops.

Definition of Done

  • Pipeline timeline matches hand-traced examples
  • RAW hazards create stalls when forwarding is OFF
  • Forwarding resolves expected hazards
  • IPC and total cycles are computed correctly
  • At least 3 test programs produce expected traces

Project 2: The Branch Predictor Torture Test

  • File: MODERN_CPU_INTERNALS_2025_DEEP_DIVE.md
  • Main Programming Language: C++
  • Alternative Programming Languages: C, Rust
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: Branch Prediction / TAGE
  • Software or Tool: perf, RDTSC
  • Main Book: “Computer Architecture” by Hennessy & Patterson

What you’ll build: A suite of microbenchmarks that execute branches with carefully constructed patterns (periodic, nested, and random). You will measure cycles/iteration and infer predictor capacity and misprediction penalty.

Why it teaches CPU internals: It exposes the predictor as a real system with memory and capacity limits rather than a black box. You will see where it succeeds and where it saturates.

Core challenges you’ll face:

  • Preventing compiler branch elimination
  • Measuring tiny timing differences
  • Constructing long-period patterns

Real World Outcome

You will produce a predictor profile report for your CPU, including the approximate history depth where prediction accuracy collapses.

What you will see:

  1. Cycles/iter for each branch pattern.
  2. Estimated misprediction penalty.
  3. Approximate predictor history length.

Command Line Outcome Example:

$ ./branch_torture --patterns all --warmup 10000

[Target: x86-64]
Pattern: T,N                  -> 1.02 cycles/iter (predictor stable)
Pattern: T,T,N                -> 1.04 cycles/iter
Pattern: Period 1024          -> 1.12 cycles/iter
Pattern: Period 16384         -> 12.9 cycles/iter (predictor saturated)
Random 50/50                  -> 18.3 cycles/iter

Estimated mispredict penalty: 16-20 cycles
Estimated history depth: ~8K outcomes

The Core Question You’re Answering

“How much history can my CPU’s branch predictor really learn?”

This tells you whether your branch-heavy code is predictable or doomed to mispredicts.


Concepts You Must Understand First

  1. Two-bit counters and hysteresis
    • Why do predictors require hysteresis?
    • Book Reference: “Computer Architecture” Ch. 3
  2. Global history and TAGE
    • How does long history improve accuracy?
    • Book Reference: “Computer Architecture” Ch. 3
  3. Measurement with RDTSC
    • Why use RDTSCP instead of clock()?
    • Book Reference: “Low-Level Programming” Ch. 8

Questions to Guide Your Design

  1. Pattern generation
    • How will you generate long-period patterns without huge memory?
  2. Warming the predictor
    • How many iterations to train before measurement?
  3. Baseline subtraction
    • How will you measure loop overhead without a branch?

Thinking Exercise

The Pattern Trap

for (int i = 0; i < N; i++) {
    if ((i % 5) != 0) x++;
}

Questions:

  • What is the branch pattern?
  • How long is the period?
  • Would a short-history predictor succeed?

The Interview Questions They’ll Ask

  1. “What is the difference between local and global branch history?”
  2. “Why are mispredictions so costly in deep pipelines?”
  3. “How does a TAGE predictor reduce aliasing?”
  4. “What is the role of the BTB?”
  5. “When would you prefer branchless code?”

Hints in Layers

Hint 1: Keep the branch Use volatile or asm volatile to prevent branch removal.

Hint 2: Pin the core Use sched_setaffinity to avoid core migration noise.

Hint 3: Training phase Run 10000 iterations before timing to stabilize predictor state.

Hint 4: Baseline Measure a version of the loop with no branch to subtract overhead.
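
A sketch of one measurement kernel combining the hints above, assuming x86-64 Linux with GCC or Clang (pin the process as in the appendix, and still subtract a branch-free baseline per Hint 4); the pattern lengths and iteration count are arbitrary, the periods must be powers of two for the index mask, and the volatile sink plus empty asm keep the compiler from turning the branch into a cmov:

// branch_kernel.c -- time a data-dependent branch over repeating random patterns.
// Build e.g.: gcc -O2 branch_kernel.c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <x86intrin.h>   // __rdtscp

static double ticks_per_iter(const uint8_t *pattern, size_t period, size_t iters) {
    unsigned aux;
    volatile uint64_t sink = 0;

    for (size_t i = 0; i < iters; i++)                 // training pass (Hint 3)
        if (pattern[i & (period - 1)]) sink++;

    uint64_t t0 = __rdtscp(&aux);
    for (size_t i = 0; i < iters; i++) {
        if (pattern[i & (period - 1)]) {               // the branch under test
            sink++;
            __asm__ volatile("" ::: "memory");         // keep it a real branch (Hint 1)
        }
    }
    uint64_t t1 = __rdtscp(&aux);
    return (double)(t1 - t0) / (double)iters;
}

int main(void) {
    size_t iters = 10 * 1000 * 1000;
    size_t periods[] = { 2, 8, 1024, 16384 };          // powers of two only

    for (size_t p = 0; p < sizeof periods / sizeof periods[0]; p++) {
        uint8_t *pat = malloc(periods[p]);
        for (size_t i = 0; i < periods[p]; i++) pat[i] = rand() & 1;
        printf("period %6zu: %.2f ticks/iter\n",
               periods[p], ticks_per_iter(pat, periods[p], iters));
        free(pat);
    }
    return 0;
}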


Books That Will Help

Topic Book Chapter
Branch prediction “Computer Architecture” Ch. 3
Performance timing “Low-Level Programming” Ch. 8
Pipeline effects “Inside the Machine” Ch. 4

Common Pitfalls & Debugging

Problem 1: “Compiler replaced branch with CMOV”

  • Why: Optimizer removes branches.
  • Fix: Use inline asm or volatile condition.
  • Quick test: Inspect assembly with objdump -d.

Problem 2: “Timing noise dominates results”

  • Why: Scheduler migration or turbo frequency changes.
  • Fix: Pin to a core and fix frequency.
  • Quick test: Run multiple trials and check variance.

Problem 3: “Predictor seems perfect for all patterns”

  • Why: Pattern length too short.
  • Fix: Increase period to thousands of iterations.
  • Quick test: Use a random pattern to force mispredicts.

Definition of Done

  • Can reproduce stable cycles/iter for at least 5 patterns
  • Misprediction penalty estimated from random pattern
  • Predictor saturation observed at long history
  • Results logged with CPU model and OS info

Project 3: Speculative Side-Channel Explorer (Spectre-lite)

  • File: MODERN_CPU_INTERNALS_2025_DEEP_DIVE.md
  • Main Programming Language: C
  • Alternative Programming Languages: Assembly
  • Coolness Level: Level 5: Pure Magic
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 4: Expert
  • Knowledge Area: Security / Speculative Execution
  • Software or Tool: clflush, lfence, rdtscp
  • Main Book: “Practical Binary Analysis” by Dennis Andriesse

What you’ll build: A minimal Spectre Variant 1 proof-of-concept that leaks a secret byte through cache timing. You will implement Flush+Reload, branch predictor training, and a controlled out-of-bounds access.

Why it teaches CPU internals: It proves that speculation is not just performance magic; it is a security liability. You will understand how microarchitectural state differs from architectural state.

Core challenges you’ll face:

  • Predictor training -> Making the CPU expect a valid index.
  • Timing measurement -> Distinguishing L1 hits from misses.
  • Noise control -> OS interrupts and other processes pollute caches.

Real World Outcome

You will produce a working Spectre-lite demo and a mitigation comparison showing how LFENCE or masking changes leakage and performance.

What you will see:

  1. A recovered secret string with confidence scores.
  2. Timing histograms for cache hits vs misses.
  3. A mitigation report with performance impact.

Command Line Outcome Example:

$ ./spectre_lite --trials 1000 --fence=off

[Training]
Branch predictor trained with 0..15 indices.

[Attack]
Speculative access on index 40.
Cache hits detected: 83 85 80 69 82 95 83 69 67 82 69 84
Recovered secret: "SUPER_SECRET"

[Mitigation]
LFENCE: disabled -> leakage success 92%
LFENCE: enabled  -> leakage success 5% (performance -18%)

The Core Question You’re Answering

“If my code checks bounds, why can data still leak?”

Because speculation bypasses the bounds check temporarily and leaves cache footprints behind.


Concepts You Must Understand First

  1. Speculative execution and rollback
    • What is rolled back and what is not?
    • Book Reference: “Computer Architecture” Ch. 3
  2. Flush+Reload side channel
    • How does timing reveal cache state?
    • Book Reference: “Practical Binary Analysis” Ch. 11
  3. Serialization fences
    • How does LFENCE prevent speculation?
    • Book Reference: “Low-Level Programming” Ch. 12

Questions to Guide Your Design

  1. Probe array layout
    • Why should each probe entry be 4096 bytes apart?
  2. Timing calibration
    • How will you choose your hit/miss threshold?
  3. Training ratio
    • How many “safe” iterations before one attack iteration?

Thinking Exercise

Speculative Path Walkthrough

if (idx < size) {
    tmp &= probe[secret[idx] * 4096];
}

Questions:

  • What happens if idx is out of bounds but predicted in-bounds?
  • Why does the cache line remain even after rollback?

The Interview Questions They’ll Ask

  1. “What is the difference between Spectre and Meltdown?”
  2. “Why does cache timing reveal secret data?”
  3. “What is the role of branch predictor training?”
  4. “How does LFENCE mitigate Spectre V1?”
  5. “What are the performance costs of mitigations?”

Hints in Layers

Hint 1: Cache flush Flush the entire probe array before the attack.

Hint 2: Training loop Run 30 iterations: 29 safe, 1 malicious.

Hint 3: Timing Use rdtscp and compare against a calibrated threshold.

Hint 4: Mitigation Insert lfence after bounds check and compare leakage rate.
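
A sketch of just the Flush+Reload timing primitive from Hints 1 and 3 (not the full attack), assuming x86-64 Linux with GCC or Clang; the 4096-byte stride, the probed index, and the single-shot measurement are simplifications, and real runs should build a histogram over many trials:

// probe_timing.c -- classify one probe line as hit or miss by its reload latency.
// Build e.g.: gcc -O2 probe_timing.c
#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>   // _mm_clflush, _mm_mfence, _mm_lfence, __rdtscp

static uint8_t probe[256 * 4096];                  // one line per byte value, 4 KiB apart

static uint64_t reload_ticks(volatile uint8_t *addr) {
    unsigned aux;
    _mm_mfence();
    _mm_lfence();                                  // simplified fencing for a sketch
    uint64_t t0 = __rdtscp(&aux);
    (void)*addr;                                   // the timed load
    uint64_t t1 = __rdtscp(&aux);
    _mm_lfence();
    return t1 - t0;
}

int main(void) {
    volatile uint8_t *line = &probe[42 * 4096];    // arbitrary probe slot

    _mm_clflush((const void *)line);               // flush -> expect a miss
    _mm_mfence();
    uint64_t miss = reload_ticks(line);

    (void)*line;                                   // touch -> expect a hit
    uint64_t hit = reload_ticks(line);

    printf("miss %llu ticks, hit %llu ticks -> pick a threshold in between\n",
           (unsigned long long)miss, (unsigned long long)hit);
    return 0;
}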


Books That Will Help

Topic Book Chapter
Side channels “Practical Binary Analysis” Ch. 11
Cache hierarchy “Computer Systems” Ch. 6
Speculation “Computer Architecture” Ch. 3

Common Pitfalls & Debugging

Problem 1: “No leakage observed”

  • Why: Threshold too high or predictor not trained.
  • Fix: Calibrate hit/miss timing and increase training iterations.
  • Quick test: Measure known L1 hits to set a threshold.

Problem 2: “Leakage is random”

  • Why: Noise from scheduler or frequency scaling.
  • Fix: Pin core and disable turbo if possible.
  • Quick test: Run with taskset -c 0 and compare variance.

Problem 3: “Compiler optimizes away code”

  • Why: Missing volatile or memory barriers.
  • Fix: Use volatile for critical arrays and inline asm.
  • Quick test: Inspect assembly for speculative load.

Definition of Done

  • Can recover at least 8 bytes of secret reliably
  • Cache hit/miss threshold is calibrated
  • Leakage rate drops when LFENCE is enabled
  • Report includes CPU model and mitigation status

Project 4: The uOp Cache Prober

  • File: MODERN_CPU_INTERNALS_2025_DEEP_DIVE.md
  • Main Programming Language: Assembly (x86-64)
  • Alternative Programming Languages: C (with __asm__)
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Front-end / uOp Cache
  • Software or Tool: perf, rdtsc
  • Main Book: “Inside the Machine” by Jon Stokes

What you’ll build: A microbenchmark that varies loop body size and detects the uOp cache capacity by observing IPC transitions from uOp cache hits to legacy decode.

Why it teaches CPU internals: It shows where front-end bandwidth collapses and how alignment and loop size affect performance.

Core challenges you’ll face:

  • Loop sizing -> Expand instruction count without changing semantics.
  • Alignment -> Align loops to cache line boundaries.
  • Counter interpretation -> Distinguish decode vs uOp cache behavior.

Real World Outcome

You will produce a front-end profile that shows the precise loop size boundary where uOp cache hits stop and IPC drops.

What you will see:

  1. IPC vs loop size chart.
  2. A transition point where decode becomes the bottleneck.
  3. A recommended maximum loop size for hot code.

Command Line Outcome Example:

$ ./uop_prober --step 16 --max 8192

Loop uOps | IPC | Path
----------|-----|------
64        | 6.1 | uOp cache
512       | 6.0 | uOp cache
2048      | 5.7 | uOp cache
4096      | 4.9 | mixed
8192      | 2.3 | legacy decode

Estimated uOp cache capacity: ~4K uOps

The Core Question You’re Answering

“How big can a hot loop be before decode becomes the bottleneck?”

This tells you how to size inner loops and why alignment is a first-order optimization.


Concepts You Must Understand First

  1. uOp cache vs legacy decode
    • Why is decoded uOp reuse faster?
    • Book Reference: “Inside the Machine” Ch. 4
  2. Instruction length variability
    • How do long x86 instructions reduce fetch efficiency?
    • Book Reference: “Computer Systems” Ch. 3
  3. Alignment
    • Why do 64-byte boundaries matter?
    • Book Reference: “Write Great Code, Vol 1” Ch. 12

Questions to Guide Your Design

  1. Loop body construction
    • How will you ensure instructions do not add dependencies?
  2. Measurement
    • Will you measure using perf counters or RDTSC?
  3. Boundary detection
    • How will you detect the exact transition point?

Thinking Exercise

The Alignment Puzzle

If a loop starts at byte 63 of a cache line, how many fetch blocks does it span? How might this affect decode and uOp cache indexing?


The Interview Questions They’ll Ask

  1. “What is a uOp cache and why does it exist?”
  2. “Why does decode width matter more than execution width in some cases?”
  3. “How does alignment influence front-end performance?”
  4. “Explain macro-op fusion in the context of decode throughput.”
  5. “Why are x86 instructions hard to decode?”

Hints in Layers

Hint 1: Use single-uOp instructions Use add rax, 1 or xor rax, rax to avoid port pressure.

Hint 2: Alignment directive Use .p2align 6 to align to 64 bytes.

Hint 3: Step size Increase loop size in fixed increments (e.g., 16 uOps).

Hint 4: Counters Use perf stat -e idq.mite_uops,idq.dsb_uops (Intel) to confirm path.
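
A sketch of Hints 1-3 that avoids hand-writing thousands of instructions, assuming x86-64 Linux with GCC or Clang and the GNU assembler; BODY_UOPS and the iteration count are placeholder knobs, and the idea is to rebuild with several -DBODY_UOPS values, compare ticks per uOp, then confirm the decode path with the Hint 4 counters:

// uop_body.c -- emit BODY_UOPS single-uOp adds (four independent chains) in a
// 64-byte-aligned block inside a timed loop.
// Build e.g.: gcc -O2 -DBODY_UOPS=256 uop_body.c
#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>   // __rdtscp

#ifndef BODY_UOPS
#define BODY_UOPS 256    // must be a multiple of 4
#endif
#define STR_(x) #x
#define STR(x) STR_(x)

int main(void) {
    long iters = 1000000;
    unsigned aux;
    uint64_t t0 = __rdtscp(&aux);
    for (long i = 0; i < iters; i++) {
        __asm__ volatile(
            ".p2align 6\n\t"                    // 64-byte alignment (Hint 2)
            ".rept " STR(BODY_UOPS) "/4\n\t"    // repeat a block of 4 independent adds
            "add $1, %%rax\n\t"
            "add $1, %%rcx\n\t"
            "add $1, %%rdx\n\t"
            "add $1, %%rsi\n\t"
            ".endr"
            ::: "rax", "rcx", "rdx", "rsi", "cc");
    }
    uint64_t t1 = __rdtscp(&aux);
    printf("%d body uOps: %.3f ticks per uOp\n", BODY_UOPS,
           (double)(t1 - t0) / ((double)iters * BODY_UOPS));
    return 0;
}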


Books That Will Help

Topic Book Chapter
Front-end decode “Inside the Machine” Ch. 4
Instruction length “Computer Systems” Ch. 3
Performance tuning “Write Great Code, Vol 1” Ch. 12

Common Pitfalls & Debugging

Problem 1: “IPC never drops”

  • Why: Loop body too small or uOp cache too large for your range.
  • Fix: Increase max uOp count and use longer sequences.
  • Quick test: Double loop size range.

Problem 2: “IPC is low for all sizes”

  • Why: Port pressure or dependencies dominate.
  • Fix: Use independent registers and simple instructions.
  • Quick test: Replace body with NOP-like ops.

Problem 3: “Results are inconsistent”

  • Why: Turbo or frequency scaling.
  • Fix: Fix CPU frequency and pin core.
  • Quick test: Repeat runs and compare variance.

Definition of Done

  • Clear IPC drop observed at a loop size boundary
  • Alignment effects tested (aligned vs unaligned)
  • uOp cache vs legacy decode path inferred
  • Results logged with CPU model and compiler flags

Project 5: Execution Port Pressure Map

  • File: MODERN_CPU_INTERNALS_2025_DEEP_DIVE.md
  • Main Programming Language: C++ (with inline assembly)
  • Alternative Programming Languages: Rust, Assembly
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 2. The “Micro-SaaS / Pro Tool”
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Execution Units / Throughput
  • Software or Tool: perf, pmu-tools (toplev)
  • Main Book: “Inside the Machine” by Jon Stokes

What you’ll build: A tool that runs controlled instruction mixes and measures throughput to infer execution port binding and port contention.

Why it teaches CPU internals: It reveals the physical wiring of your CPU’s execution engine and explains why some instruction mixes are faster than others.

Core challenges you’ll face:

  • Instruction independence -> Avoid dependency chains.
  • Port inference -> Distinguish port contention from front-end bottlenecks.
  • Counter mapping -> Use PMU counters to validate hypotheses.

Real World Outcome

You will create a port pressure report that tells you which instructions share ports and which can run in parallel.

What you will see:

  1. Throughput measurements for instruction mixes.
  2. An inferred port map for your CPU.
  3. Suggested scheduling strategies.

Command Line Outcome Example:

$ ./port_mapper --mix add,mul --iters 10000000

Mix: ADD+ADD -> 4.0 uOps/c
Mix: MUL+MUL -> 1.0 uOps/c
Mix: ADD+MUL -> 4.0 uOps/c
Mix: FMA+FMA -> 2.0 uOps/c

Inferred: MUL limited to single port, ADD on multiple ports

The Core Question You’re Answering

“Why is my CPU only half utilized even when it is busy?”

Because you can be bound by a single port even if others are idle.


Concepts You Must Understand First

  1. Latency vs throughput
    • Why can an instruction be slow yet still pipeline well?
    • Book Reference: “Write Great Code, Vol 1” Ch. 12
  2. Port binding
    • What does it mean when an instruction can issue on multiple ports?
    • Book Reference: “Inside the Machine” Ch. 4
  3. Instruction independence
    • Why do dependencies destroy throughput measurement?
    • Book Reference: “Computer Architecture” Ch. 3

Questions to Guide Your Design

  1. Instruction selection
    • Which instruction types will you test (int, FP, SIMD, loads)?
  2. Dependency elimination
    • How will you ensure each instruction uses a different register?
  3. Validation
    • How will you check that front-end is not the bottleneck?

Thinking Exercise

The Port Conflict

If an instruction can execute only on Port 0, and you issue two of them per cycle, what is the maximum throughput? Why?


The Interview Questions They’ll Ask

  1. “What is port pressure and how do you detect it?”
  2. “Why is throughput more important than latency in steady state?”
  3. “How do you ensure you are not measuring dependency chains?”
  4. “What is the difference between a port and a functional unit?”
  5. “How does unrolling help saturate execution ports?”

Hints in Layers

Hint 1: Independent registers Use many registers to avoid RAW dependencies.

Hint 2: Isolate front-end Use a loop small enough to fit in the uOp cache.

Hint 3: Baseline Measure single-instruction throughput before mixing.

Hint 4: PMU validation Use perf stat -e uops_executed.port_0 or similar counters where available.
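
A sketch of Hints 1 and 3 for a single instruction (imul), assuming x86-64 with GCC or Clang; varying the number of independent chains separates the latency-bound case (one chain) from the throughput/port-bound case (several chains), and the chain counts and iteration count are arbitrary:

// imul_tput.c -- measure imul latency vs throughput via independent chains.
// Build e.g.: gcc -O2 imul_tput.c
#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>   // __rdtscp

static double ticks_per_imul(int chains, long iters) {
    unsigned aux;
    uint64_t a = 1, b = 1, c = 1, d = 1;
    uint64_t t0 = __rdtscp(&aux);
    for (long i = 0; i < iters; i++) {
        // Each imul below feeds only its own register, so the chains are independent.
        __asm__ volatile("imul $3, %0" : "+r"(a) : : "cc");
        if (chains > 1) __asm__ volatile("imul $3, %0" : "+r"(b) : : "cc");
        if (chains > 2) __asm__ volatile("imul $3, %0" : "+r"(c) : : "cc");
        if (chains > 3) __asm__ volatile("imul $3, %0" : "+r"(d) : : "cc");
    }
    uint64_t t1 = __rdtscp(&aux);
    return (double)(t1 - t0) / ((double)iters * chains);
}

int main(void) {
    long iters = 50 * 1000 * 1000;
    for (int chains = 1; chains <= 4; chains++)
        printf("%d chain(s): %.2f ticks per imul\n",
               chains, ticks_per_imul(chains, iters));
    return 0;
}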


Books That Will Help

Topic Book Chapter
Execution units “Inside the Machine” Ch. 4
Performance analysis “Write Great Code, Vol 1” Ch. 12
OoO scheduling “Computer Architecture” Ch. 3

Common Pitfalls & Debugging

Problem 1: “Throughput matches latency”

  • Why: Dependencies serialize instructions.
  • Fix: Use independent registers and unroll.
  • Quick test: Increase number of accumulators.

Problem 2: “Results differ from static tools”

  • Why: Front-end or memory effects.
  • Fix: Ensure loop fits in uOp cache, pin core.
  • Quick test: Compare with llvm-mca predictions.

Problem 3: “Counter not supported”

  • Why: PMU event names differ by vendor.
  • Fix: Use generic counters and vendor docs.
  • Quick test: perf list | grep uops.

Definition of Done

  • At least 6 instruction mixes measured
  • Port binding inferred for int and FP operations
  • Front-end and memory effects ruled out
  • Report includes methodology and CPU model

Project 6: Memory Disambiguation Probe

  • File: MODERN_CPU_INTERNALS_2025_DEEP_DIVE.md
  • Main Programming Language: C
  • Alternative Programming Languages: Assembly
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 4: Expert
  • Knowledge Area: Load/Store Buffers / Speculation
  • Software or Tool: rdtsc
  • Main Book: “Computer Architecture” by Hennessy & Patterson

What you’ll build: A microbenchmark that probes memory disambiguation by forcing address aliasing and measuring load stalls and replay costs.

Why it teaches CPU internals: It reveals how the CPU guesses whether loads can bypass stores and how wrong guesses cost cycles.

Core challenges you’ll face:

  • Creating aliasing pairs
  • Timing small stall penalties
  • Avoiding prefetcher interference

Real World Outcome

You will build a disambiguation heatmap that highlights offsets that trigger 4K aliasing stalls and store-forwarding penalties.

What you will see:

  1. Latency spikes at aliasing offsets.
  2. Forwarding success/failure detection.
  3. A rule-of-thumb for safe memory offsets.

Command Line Outcome Example:

$ ./alias_probe --offsets 0,64,4096,8192

Offset | Latency (cycles) | Status
-------|------------------|--------
0      | 3                | STLF hit
64     | 5                | L1 hit
4096   | 22               | 4K alias stall
8192   | 21               | 4K alias stall

The Core Question You’re Answering

“Why can two independent arrays still block each other?”

Because the CPU uses partial address checks and can stall on false dependencies.


Concepts You Must Understand First

  1. Load/store buffers
    • Why must loads check older stores?
    • Book Reference: “Computer Architecture” Ch. 3
  2. Store-to-load forwarding
    • When does forwarding succeed or fail?
    • Book Reference: “Computer Systems” Ch. 6
  3. 4K aliasing
    • Why do matching page offsets cause stalls?
    • Book Reference: “Modern Processor Design” (memory speculation)

Questions to Guide Your Design

  1. Address selection
    • How will you ensure the low 12 bits match?
  2. Measurement noise
    • How many iterations are needed for stable timing?
  3. Prefetch control
    • How will you prevent prefetchers from hiding stalls?

Thinking Exercise

The Aliasing Trap

If p1 and p2 are 4096 bytes apart, why does the CPU suspect aliasing? How could this affect a pointer-chasing loop?


The Interview Questions They’ll Ask

  1. “What is memory disambiguation and why is it needed?”
  2. “Explain 4K aliasing in x86.”
  3. “What is store-to-load forwarding?”
  4. “How does a memory dependency violation get fixed in hardware?”
  5. “Why can misalignment prevent forwarding?”

Hints in Layers

Hint 1: Address mask Ensure ((uintptr_t)p1 & 0xFFF) == ((uintptr_t)p2 & 0xFFF).

Hint 2: Slow store Use a dependency chain to delay store address generation.

Hint 3: Flush Use clflush to remove cache effects and expose stalls.

Hint 4: Plot results Graph offset vs cycles to visualize aliasing spikes.
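
A sketch of the core store-then-load timing behind Hints 1-2, assuming x86-64 Linux with GCC or Clang; the buffer size, offsets, and iteration count are arbitrary, and real runs should randomize access and report medians before plotting offset vs latency (Hint 4):

// alias_probe_core.c -- store to one address, immediately load from another whose
// low 12 bits either match (candidate 4K alias) or do not.
// Build e.g.: gcc -O2 alias_probe_core.c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <x86intrin.h>   // __rdtscp, _mm_lfence

static uint64_t store_load_ticks(volatile char *store_to, volatile char *load_from,
                                 long iters) {
    unsigned aux;
    uint64_t t0 = __rdtscp(&aux);
    for (long i = 0; i < iters; i++) {
        *store_to = (char)i;       // older store...
        (void)*load_from;          // ...immediately followed by a younger load
    }
    uint64_t t1 = __rdtscp(&aux);
    _mm_lfence();
    return (t1 - t0) / (uint64_t)iters;
}

int main(void) {
    long iters = 10 * 1000 * 1000;
    char *buf = aligned_alloc(4096, 1 << 20);   // 1 MiB, page-aligned

    uint64_t same = store_load_ticks(buf, buf + 8192, iters);        // low 12 bits match
    uint64_t diff = store_load_ticks(buf, buf + 8192 + 320, iters);  // low 12 bits differ
    printf("same page offset:      %llu ticks/iter\n", (unsigned long long)same);
    printf("different page offset: %llu ticks/iter\n", (unsigned long long)diff);
    free(buf);
    return 0;
}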


Books That Will Help

Topic Book Chapter
Memory speculation “Computer Architecture” Ch. 3
Caches “Computer Systems” Ch. 6
CPU pipelines “Inside the Machine” Ch. 4

Common Pitfalls & Debugging

Problem 1: “No aliasing spikes”

  • Why: Offsets not aligned to 4K boundaries.
  • Fix: Enforce identical low 12 bits.
  • Quick test: Print addresses and offsets.

Problem 2: “Results are noisy”

  • Why: OS scheduling and turbo.
  • Fix: Pin core, run longer tests.
  • Quick test: Compare median of 1000 runs.

Problem 3: “Prefetch hides stalls”

  • Why: Sequential access patterns trigger prefetch.
  • Fix: Randomize access pattern.
  • Quick test: Compare sequential vs random.

Definition of Done

  • Clear latency spike at 4K alias offsets
  • STLF success/failure measured with alignment changes
  • Report includes methodology and CPU model
  • Graph of offset vs latency produced

Project 7: The Reorder Buffer (ROB) Boundary Finder

  • File: MODERN_CPU_INTERNALS_2025_DEEP_DIVE.md
  • Main Programming Language: C (with inline assembly)
  • Alternative Programming Languages: Assembly
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 4: Expert
  • Knowledge Area: OoO Engine / ROB
  • Software or Tool: rdtsc, perf
  • Main Book: “Computer Architecture” by Hennessy & Patterson

What you’ll build: A benchmark that identifies the maximum number of independent uOps your CPU can keep in flight before stalling on a long-latency memory load.

Why it teaches CPU internals: It exposes the true size of the instruction window and shows how OoO execution hides latency until the ROB fills.

Core challenges you’ll face:

  • Creating long-latency loads
  • Generating independent filler uOps
  • Detecting the stall cliff

Real World Outcome

You will determine the approximate ROB capacity of your CPU and the window size required to hide a cache miss.

What you will see:

  1. A flat latency curve that suddenly spikes when the ROB is full.
  2. An estimated ROB size in uOps.
  3. A recommended unroll factor for latency hiding.

Command Line Outcome Example:

$ ./rob_prober --max-window 1024

Window uOps | Cycles | Status
------------|--------|-----------------
128         | 305    | L3 miss hidden
256         | 308    | L3 miss hidden
512         | 315    | L3 miss hidden
640         | 620    | ROB full, stall

Estimated ROB size: ~576 uOps

The Core Question You’re Answering

“How far into the future can my CPU look while waiting for memory?”

This tells you how much independent work you must provide to hide memory latency.


Concepts You Must Understand First

  1. ROB and retirement
    • Why must retirement be in order?
    • Book Reference: “Computer Architecture” Ch. 3
  2. Register renaming
    • How does rename enable large instruction windows?
    • Book Reference: “Computer Organization and Design” Ch. 3
  3. Memory latency
    • Why does a cache miss cost hundreds of cycles?
    • Book Reference: “Computer Systems” Ch. 6

Questions to Guide Your Design

  1. Latency source
    • How will you generate consistent long-latency loads?
  2. Filler uOps
    • How will you guarantee independence among filler instructions?
  3. Measurement
    • How will you detect the point where the ROB fills?

Thinking Exercise

The Full Window

If your load miss takes 300 cycles and your CPU can issue 4 uOps/cycle, how many independent uOps would you need to hide the miss completely?


The Interview Questions They’ll Ask

  1. “What is the ROB and why does it exist?”
  2. “What is the difference between issue and retire?”
  3. “Why can’t we make the ROB infinite?”
  4. “What limits the instruction window besides ROB size?”
  5. “How do cache misses interact with OoO execution?”

Hints in Layers

Hint 1: Pointer chasing Use a linked list that exceeds LLC size to force DRAM misses.

Hint 2: Independent ALU work Use many registers with add operations between loads.

Hint 3: Unroll gradually Increase independent uOps in steps and plot cycles.

Hint 4: Verify Cross-check with perf counters like rob_full if available.
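
A sketch of Hints 1-2: a shuffled pointer-chasing ring to generate misses, plus a compile-time amount of independent ALU filler between loads; the list size, FILLER_UOPS values, and step count are arbitrary, and the experiment is to rebuild with growing -DFILLER_UOPS until ticks per load jump:

// rob_window.c -- one dependent load per step plus FILLER_UOPS independent adds.
// Build e.g.: gcc -O2 -DFILLER_UOPS=128 rob_window.c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <x86intrin.h>   // __rdtscp

#ifndef FILLER_UOPS
#define FILLER_UOPS 128   // must be a multiple of 4 (four independent chains)
#endif
#define STR_(x) #x
#define STR(x) STR_(x)

#define NODES (1 << 22)   // 4M nodes * 8 B = 32 MiB, larger than most LLCs

int main(void) {
    uint64_t *next = malloc(NODES * sizeof *next);
    uint32_t *perm = malloc(NODES * sizeof *perm);
    for (uint32_t i = 0; i < NODES; i++) perm[i] = i;
    for (uint32_t i = NODES - 1; i > 0; i--) {             // shuffle so prefetchers lose
        uint32_t j = (uint32_t)(rand() % (i + 1));
        uint32_t t = perm[i]; perm[i] = perm[j]; perm[j] = t;
    }
    for (uint32_t i = 0; i < NODES; i++)
        next[perm[i]] = perm[(i + 1) % NODES];             // one big random ring

    long steps = 1 << 20;
    unsigned aux;
    uint64_t cur = 0;
    uint64_t t0 = __rdtscp(&aux);
    for (long s = 0; s < steps; s++) {
        cur = next[cur];                                   // long-latency dependent load
        __asm__ volatile(                                  // independent filler (Hint 2)
            ".rept " STR(FILLER_UOPS) "/4\n\t"
            "add $1, %%r8\n\t"
            "add $1, %%r9\n\t"
            "add $1, %%r10\n\t"
            "add $1, %%r11\n\t"
            ".endr"
            ::: "r8", "r9", "r10", "r11", "cc");
    }
    uint64_t t1 = __rdtscp(&aux);
    printf("filler=%d: %.1f ticks per chased load (cur=%llu)\n",
           FILLER_UOPS, (double)(t1 - t0) / (double)steps, (unsigned long long)cur);
    free(next); free(perm);
    return 0;
}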


Books That Will Help

Topic Book Chapter
OoO execution “Computer Architecture” Ch. 3
Register renaming “Computer Organization and Design” Ch. 3
Caches “Computer Systems” Ch. 6

Common Pitfalls & Debugging

Problem 1: “No clear spike”

  • Why: Cache miss not consistent.
  • Fix: Use pointer chasing with large strides.
  • Quick test: Measure cache miss rate with perf.

Problem 2: “Spike too early”

  • Why: Physical register file or scheduler limit.
  • Fix: Reduce number of live registers or change pattern.
  • Quick test: Compare with different instruction mixes.

Problem 3: “Results unstable”

  • Why: Prefetchers and turbo.
  • Fix: Randomize list order and pin core.
  • Quick test: Run multiple trials and take median.

Definition of Done

  • Clear stall cliff observed as window size increases
  • Approximate ROB size estimated
  • Cache miss latency measured independently
  • Results include CPU model and OS version

Project 8: Macro-op Fusion Detector

  • File: MODERN_CPU_INTERNALS_2025_DEEP_DIVE.md
  • Main Programming Language: Assembly
  • Alternative Programming Languages: C
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 2. The “Micro-SaaS / Pro Tool”
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: Decoding / Instruction Fusion
  • Software or Tool: perf (uops_retired.slots), rdtsc
  • Main Book: “Inside the Machine” by Jon Stokes

What you’ll build: A benchmark that tests which instruction pairs fuse into single uOps on your CPU by comparing uOps retired vs instructions retired.

Why it teaches CPU internals: It exposes the real decode rules that compilers and assembly programmers exploit.

Core challenges you’ll face:

  • Identifying fusion pairs
  • Isolating uOp cache effects
  • Counting uOps accurately

Real World Outcome

You will build a fusion catalog for your CPU listing which instruction pairs are fused and which are not.

What you will see:

  1. A list of fusable pairs (CMP+Jcc, TEST+Jcc, etc.).
  2. A measurable uOp count reduction.
  3. A recommended coding style for branch-heavy code.

Command Line Outcome Example:

$ ./fusion_detect --pairs all

Pair: CMP rax,0 + JE -> uOps: 1 (fused)
Pair: TEST rbx,rbx + JZ -> uOps: 1 (fused)
Pair: ADD rax,1 + JZ -> uOps: 1 (fused)
Pair: MOV rax,[rbx] + JZ -> uOps: 2 (not fused)

The Core Question You’re Answering

“Does instruction ordering affect decode bandwidth?”

Yes. Fusion can cut uOp count in half, effectively widening the front-end.


Concepts You Must Understand First

  1. Macro-op vs micro-op fusion
    • Which stage performs fusion?
    • Book Reference: “Inside the Machine” Ch. 4
  2. uOps retired counters
    • Why do uOps retired reveal fusion?
    • Book Reference: “Computer Systems” Ch. 4
  3. Alignment effects
    • Why can alignment break fusion?
    • Book Reference: “Write Great Code, Vol 1” Ch. 12

Questions to Guide Your Design

  1. Pair selection
    • Which common instruction pairs should you test?
  2. Loop overhead
    • How will you subtract loop control uOps?
  3. Boundary testing
    • How will you test 16-byte or 32-byte fetch boundaries?

Thinking Exercise

Fusion Boundary

If you insert a NOP between CMP and JE, what do you expect to happen to uOp count and why?


The Interview Questions They’ll Ask

  1. “What is macro-op fusion and why is it useful?”
  2. “Does fusion happen in the uOp cache or the decoder?”
  3. “Name three common fused pairs in x86.”
  4. “How does fusion affect ROB usage?”
  5. “Why would fusion fail at fetch boundaries?”

Hints in Layers

Hint 1: Tight loop Execute billions of pairs in a tight loop to magnify differences.

Hint 2: Baseline Measure a loop of NOPs to subtract loop overhead.

Hint 3: Counter choice Compare uops_retired vs inst_retired.

Hint 4: Alignment Use directives to force pairs across boundaries and compare.
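
A sketch of the loop behind Hints 1-3, assuming x86-64 Linux with GCC or Clang; the binary itself only executes the candidate pair many times, and the fusion inference comes from running it under perf stat and dividing issued uOps by retired instructions (the event name shown is one common Intel spelling and may differ on your CPU):

// fusion_pair.c -- execute a CMP+JE pair in a straight-line block so uOps per
// instruction can be read from counters, e.g.:
//   perf stat -e instructions,uops_issued.any ./fusion_pair
#include <stdio.h>

int main(void) {
    long outer = 200 * 1000 * 1000;
    for (long i = 0; i < outer; i++) {
        // rax is 1, so ZF is never set and every 'je' falls through.
        __asm__ volatile(
            ".rept 16\n\t"          // 16 adjacent pairs per iteration dilute loop overhead
            "cmp $0, %%rax\n\t"
            "je 1f\n\t"
            ".endr\n\t"
            "1:\n\t"
            :
            : "a"(1L)
            : "cc");
    }
    puts("done: read uOps vs instructions from perf stat");
    return 0;
}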


Books That Will Help

Topic Book Chapter
Decode rules “Inside the Machine” Ch. 4
Pipeline basics “Computer Systems” Ch. 4
Performance tuning “Write Great Code, Vol 1” Ch. 12

Common Pitfalls & Debugging

Problem 1: “All pairs appear fused”

  • Why: You are measuring uOp cache effects or using wrong counters.
  • Fix: Confirm perf counters and use legacy decode path.
  • Quick test: Force a large loop to evict uOp cache.

Problem 2: “No pairs fused”

  • Why: Pairs are not adjacent or alignment broke fusion.
  • Fix: Ensure adjacency and test with aligned loop.
  • Quick test: Disassemble loop to verify layout.

Problem 3: “Results change between runs”

  • Why: Turbo or background activity.
  • Fix: Pin core and repeat with fixed frequency.
  • Quick test: Compare median of 10 runs.

Definition of Done

  • At least 5 fused pairs identified
  • uOp count reduction measured
  • Alignment impact tested
  • Results recorded with CPU model

Project 9: L1 Bandwidth Stressor (Zen 5 focus)

  • File: MODERN_CPU_INTERNALS_2025_DEEP_DIVE.md
  • Main Programming Language: C
  • Alternative Programming Languages: C++ with intrinsics
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 2. The “Micro-SaaS / Pro Tool”
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Cache Bandwidth / SIMD
  • Software or Tool: perf, intrinsics (immintrin.h)
  • Main Book: “Computer Systems” by Bryant & O’Hallaron

What you’ll build: A bandwidth microbenchmark that uses vector loads/stores to saturate L1 cache bandwidth and compare measured throughput to theoretical limits.

Why it teaches CPU internals: It connects vector width, load/store ports, and cache bandwidth into a single measurable limit.

Core challenges you’ll face:

  • Alignment -> Avoid split loads/stores.
  • Vector width -> Choose AVX2 vs AVX-512 or NEON.
  • Unrolling -> Hide loop overhead and saturate ports.

Real World Outcome

You will produce a bandwidth report showing GB/s for different vector widths and unroll factors.

What you will see:

  1. Achieved GB/s for AVX2 and AVX-512 (or NEON on ARM).
  2. Evidence of bandwidth saturation or port limits.
  3. Efficiency vs theoretical peak.

Command Line Outcome Example:

$ ./l1_stress --width 512 --unroll 8

Mode: AVX-512
Loads/cycle: 2
Measured: 240 GB/s
Peak (theoretical): 256 GB/s
Efficiency: 93.7%

The Core Question You’re Answering

“Can my code actually feed the vector units at full speed?”

This tells you if you are bandwidth-bound or compute-bound.


Concepts You Must Understand First

  1. Cache lines and alignment
    • Why does unaligned access reduce bandwidth?
    • Book Reference: “Computer Systems” Ch. 6
  2. SIMD width vs bandwidth
    • Why does wider SIMD require more bytes per cycle?
    • Book Reference: “Inside the Machine” Ch. 4
  3. Load/store ports
    • How many loads/stores per cycle can the core issue?
    • Book Reference: “Computer Architecture” Ch. 3

Questions to Guide Your Design

  1. Data layout
    • How will you align buffers to 64 bytes?
  2. Instruction selection
    • Which intrinsics map to aligned loads/stores?
  3. Measurement
    • Will you compute bytes/cycle or GB/s directly?

Thinking Exercise

The Pipe Diameter

If your core can issue 2 x 64-byte loads per cycle, what is the maximum theoretical bandwidth at 4 GHz?


The Interview Questions They’ll Ask

  1. “What is the difference between L1 bandwidth and L1 latency?”
  2. “Why does SIMD code sometimes run slower than scalar?”
  3. “How does alignment affect cache behavior?”
  4. “What is a split load and why is it expensive?”
  5. “How do you measure memory bandwidth correctly?”

Hints in Layers

Hint 1: Aligned allocation Use aligned_alloc(64, size) or posix_memalign.

Hint 2: Intrinsics Use _mm512_load_ps or _mm256_load_ps with aligned pointers.

Hint 3: Unroll Unroll loops 8x or more to remove branch overhead.

Hint 4: Verify Check with perf stat -e cycles,instructions,mem_load_retired.l1_hit.
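
A sketch of an AVX2 read-bandwidth kernel following Hints 1-3, assuming an x86-64 CPU with AVX2 and GCC or Clang (build with -mavx2); the 16 KiB buffer is assumed to be L1-resident and the pass count is arbitrary:

// l1_read_bw.c -- stream a small, 64-byte-aligned buffer with aligned 256-bit loads.
// Build e.g.: gcc -O2 -mavx2 l1_read_bw.c
#include <immintrin.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define BUF_BYTES (16 * 1024)   // assumption: small enough to stay L1-resident

int main(void) {
    float *buf = aligned_alloc(64, BUF_BYTES);              // Hint 1: 64-byte alignment
    size_t n = BUF_BYTES / sizeof(float);
    for (size_t i = 0; i < n; i++) buf[i] = 1.0f;

    long passes = 1000000;
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    __m256 acc0 = _mm256_setzero_ps(), acc1 = _mm256_setzero_ps();
    __m256 acc2 = _mm256_setzero_ps(), acc3 = _mm256_setzero_ps();
    for (long p = 0; p < passes; p++) {
        for (size_t i = 0; i < n; i += 32) {                // Hint 3: 128 B per iteration
            acc0 = _mm256_add_ps(acc0, _mm256_load_ps(buf + i));        // Hint 2: aligned loads
            acc1 = _mm256_add_ps(acc1, _mm256_load_ps(buf + i + 8));
            acc2 = _mm256_add_ps(acc2, _mm256_load_ps(buf + i + 16));
            acc3 = _mm256_add_ps(acc3, _mm256_load_ps(buf + i + 24));
        }
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    float sums[8];
    _mm256_storeu_ps(sums, _mm256_add_ps(_mm256_add_ps(acc0, acc1),
                                         _mm256_add_ps(acc2, acc3)));
    double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    printf("read bandwidth: %.1f GB/s (checksum %f)\n",
           (double)passes * BUF_BYTES / sec / 1e9, sums[0]);
    free(buf);
    return 0;
}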


Books That Will Help

Topic Book Chapter
Caches “Computer Systems” Ch. 6
SIMD “Inside the Machine” Ch. 4
Performance “Write Great Code, Vol 1” Ch. 12

Common Pitfalls & Debugging

Problem 1: “Bandwidth is far below expected”

  • Why: Loop not unrolled or misaligned data.
  • Fix: Align buffers and unroll aggressively.
  • Quick test: Measure aligned vs unaligned.

Problem 2: “AVX-512 slower than AVX2”

  • Why: Bandwidth-bound or downclocking on some CPUs.
  • Fix: Compare with scalar and check frequency scaling.
  • Quick test: Record cycles/iter with fixed frequency.

Problem 3: “Results vary wildly”

  • Why: Prefetchers or OS noise.
  • Fix: Pin core and repeat multiple times.
  • Quick test: Compare median of 10 runs.

Definition of Done

  • L1 bandwidth measured for at least two vector widths
  • Alignment impact tested
  • Theoretical peak vs measured efficiency reported
  • Results include CPU model and clock rate

Project 10: Lunar Lake P vs E Core Profiler

  • File: MODERN_CPU_INTERNALS_2025_DEEP_DIVE.md
  • Main Programming Language: C++
  • Alternative Programming Languages: C, Rust
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 2. The “Micro-SaaS / Pro Tool”
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Heterogeneous Cores / Scheduling
  • Software or Tool: sched_setaffinity, perf, lscpu
  • Main Book: “Operating Systems: Three Easy Pieces”

What you’ll build: A profiler that runs the same workloads on P-cores and E-cores and compares IPC, latency, bandwidth, and energy per instruction.

Why it teaches CPU internals: It demonstrates that core type matters and that scheduling decisions are a first-order performance factor.

Core challenges you’ll face:

  • Detecting core types
  • Pinning threads
  • Normalizing results for frequency

Real World Outcome

You will generate a comparative report of P-core vs E-core behavior on your system.

What you will see:

  1. IPC differences for compute-heavy loops.
  2. Latency differences for memory-heavy loops.
  3. Energy-per-instruction estimates (if RAPL available).

Command Line Outcome Example:

$ ./hybrid_profile --workloads add,mem,branch

Core Type | IPC | L1 BW (GB/s) | Branch MPKI
----------|-----|--------------|-------------
P-core    | 5.8 | 210          | 2.1
E-core    | 3.9 | 140          | 3.4

Summary: P-cores deliver ~1.5x IPC but higher power.

The Core Question You’re Answering

“If my CPU has many cores, why do some of them feel slower?”

Because heterogeneous cores have different widths, caches, and scheduling priorities.


Concepts You Must Understand First

  1. Heterogeneous scheduling
    • How does the OS decide which core to use?
    • Book Reference: “Operating Systems: Three Easy Pieces” Ch. 5
  2. Frequency normalization
    • Why must you normalize IPC by frequency?
    • Book Reference: “Write Great Code, Vol 1” Ch. 12
  3. Thread pinning
    • How does core pinning reduce noise?
    • Book Reference: “How Linux Works” Ch. 4

Questions to Guide Your Design

  1. Core detection
    • How will you identify P vs E cores on Linux/macOS?
  2. Workload mix
    • Which workloads best show front-end vs back-end differences?
  3. Energy measurement
    • Can you read RAPL or other energy counters?

Thinking Exercise

The Scheduling Choice

If a workload is latency-sensitive (UI thread), which core should it run on and why?


The Interview Questions They’ll Ask

  1. “What is the purpose of heterogeneous cores?”
  2. “How does Thread Director influence scheduling?”
  3. “Why is IPC alone not enough to compare cores?”
  4. “What is the effect of cache size differences across core types?”
  5. “How do you pin threads to a specific core in Linux?”

Hints in Layers

Hint 1: Core IDs Use lscpu -e to list cores and their types (on Linux where supported).

Hint 2: Pinning Use sched_setaffinity or taskset.

Hint 3: Normalize Compare cycles per operation, or record each core's measured frequency alongside IPC so clock differences are not mistaken for architectural ones.

Hint 4: Multiple workloads Use integer, memory, and branch-heavy microbenchmarks.
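
A sketch of the pinning and comparison plumbing from Hints 1-2, assuming Linux with GCC or Clang; core IDs come from the command line (map them to P or E cores with lscpu -e or vendor tools), and the ALU workload is just one placeholder kernel to swap for memory- or branch-heavy variants (Hint 4):

// percore_bench.c -- run the same workload pinned to each core ID given as an argument.
// Build e.g.: gcc -O2 percore_bench.c && ./a.out 0 1 8 9
#define _GNU_SOURCE
#include <sched.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static double workload_ns_per_iter(void) {
    long iters = 200 * 1000 * 1000;
    uint64_t x = 1;
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < iters; i++) {
        x = x * 6364136223846793005ULL + 1;          // serial integer-ALU chain
        __asm__ volatile("" : "+r"(x));              // keep the chain from being optimized away
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    return ns / (double)iters;
}

int main(int argc, char **argv) {
    for (int a = 1; a < argc; a++) {
        int core = atoi(argv[a]);
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);
        if (sched_setaffinity(0, sizeof set, &set) != 0) {   // Hint 2: pin the thread
            perror("sched_setaffinity");
            continue;
        }
        printf("core %2d: %.3f ns/iter\n", core, workload_ns_per_iter());
    }
    return 0;
}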


Books That Will Help

Topic Book Chapter
Scheduling “Operating Systems: Three Easy Pieces” Ch. 5
Performance “Write Great Code, Vol 1” Ch. 12
Linux tools “How Linux Works” Ch. 4

Common Pitfalls & Debugging

Problem 1: “P vs E results overlap”

  • Why: Workload is too small or dominated by memory.
  • Fix: Use compute-heavy loops and pin cores strictly.
  • Quick test: Increase loop iterations.

Problem 2: “Core types not detectable”

  • Why: OS does not expose core type info.
  • Fix: Use manual core mapping or vendor tools.
  • Quick test: Compare IPC differences to infer core type.

Problem 3: “Frequency scaling hides differences”

  • Why: CPU boosts or throttles based on load.
  • Fix: Fix frequency and compare cycles per operation.
  • Quick test: Use performance governor on Linux.

Definition of Done

  • Separate P vs E core benchmarks recorded
  • Results normalized for frequency
  • IPC and branch metrics compared
  • Report includes core mapping method

Final Overall Project: The uArch-Aware JIT Engine

  • File: MODERN_CPU_INTERNALS_2025_DEEP_DIVE.md
  • Main Programming Language: C
  • Alternative Programming Languages: C++, Rust
  • Coolness Level: Level 5: Pure Magic
  • Business Potential: 2. The “Micro-SaaS / Pro Tool”
  • Difficulty: Level 5: Expert
  • Knowledge Area: JIT / Microarchitecture-Aware Codegen
  • Software or Tool: mmap, mprotect, rdtsc
  • Main Book: “Writing a C Compiler” by Nora Sandler

What you’ll build: A minimal JIT compiler for a tiny math language that generates different machine code depending on the detected CPU (vector width, core type, front-end capacity). The JIT will benchmark itself and choose the fastest code path at runtime.

Why it teaches CPU internals: It forces you to combine everything: front-end alignment, port pressure, SIMD width, and branch prediction. You become the compiler.

Core challenges you’ll face:

  • Runtime code generation -> Emitting safe executable memory.
  • CPU feature detection -> Dispatching based on CPUID or OS APIs.
  • Self-benchmarking -> Choosing code paths based on measured performance.

Real World Outcome

You will have a working JIT that outperforms a naive interpreter and adapts to CPU differences.

What you will see:

  1. Generated machine code bytes for each target profile.
  2. Performance comparison across code paths.
  3. A runtime-selected fastest path.

Command Line Outcome Example:

$ ./uarch_jit --run expr.txt

[Detect]
CPU: x86-64, AVX-512 available, P-core

[Codegen]
Path A: scalar
Path B: AVX2
Path C: AVX-512 + unroll

[Benchmark]
scalar: 820 cycles
AVX2:   510 cycles
AVX-512: 360 cycles (selected)

Result: 42.0

The Core Question You’re Answering

“How do real-world runtimes adapt to the hardware they run on?”

This project demonstrates how VMs, JITs, and performance libraries extract maximum performance by targeting the actual microarchitecture.


Concepts You Must Understand First

  1. JIT basics and executable memory
    • How do you allocate RW then RX memory safely?
    • Book Reference: “Writing a C Compiler” Ch. 12
  2. Instruction selection and scheduling
    • How do you pick instructions to match port and bandwidth limits?
    • Book Reference: “Computer Architecture” Ch. 3
  3. CPU feature detection
    • How do you detect AVX2/AVX-512 or NEON?
    • Book Reference: “Low-Level Programming” Ch. 10

Questions to Guide Your Design

  1. IR design
    • How will you represent expressions for codegen?
  2. Code layout
    • How will you align loops to favor the uOp cache?
  3. Auto-tuning
    • How will you benchmark code paths without bias?

Thinking Exercise

The Adaptive Loop

If AVX-512 reduces instruction count but increases bandwidth demand, when will AVX2 be faster?


The Interview Questions They’ll Ask

  1. “How does a JIT allocate executable memory safely?”
  2. “Why is self-tuning important for performance libraries?”
  3. “How do you prevent JIT code from violating W^X policies?”
  4. “How do you decide between AVX2 and AVX-512 at runtime?”
  5. “What are the security risks of JIT compilation?”

Hints in Layers

Hint 1: Minimal IR Start with only add/mul and a single loop construct.

Hint 2: Code buffer Use mmap for RW, write bytes, then mprotect to RX.

Hint 3: Feature flags Cache CPUID results in a struct to select code paths.

Hint 4: Auto-tune Run each code path multiple times and select the median fastest.
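
A sketch of Hint 2's buffer handling with a trivially small function, assuming x86-64 System V Linux; the emitted bytes implement integer addition of the two arguments, and everything a real JIT needs beyond that (IR, instruction selection, auto-tuning) is omitted:

// jit_buffer.c -- mmap RW memory, copy machine code in, flip it to RX, call it.
// Assumes the x86-64 System V ABI (first two int args in edi/esi).
// Build e.g.: gcc -O2 jit_buffer.c
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

typedef int (*add_fn)(int, int);

int main(void) {
    // mov eax, edi ; add eax, esi ; ret   -> returns a + b
    static const uint8_t code[] = { 0x89, 0xF8, 0x01, 0xF0, 0xC3 };

    size_t len = 4096;                                    // one page is plenty here
    void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,   // RW first (never RWX)
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) { perror("mmap"); return 1; }

    memcpy(buf, code, sizeof code);

    if (mprotect(buf, len, PROT_READ | PROT_EXEC) != 0) { // then flip to RX
        perror("mprotect");
        return 1;
    }

    add_fn add = (add_fn)buf;
    printf("jit add(2, 40) = %d\n", add(2, 40));          // expect 42
    munmap(buf, len);
    return 0;
}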


Books That Will Help

Topic Book Chapter
JIT basics “Writing a C Compiler” Ch. 12
Runtime codegen “Linkers and Loaders” Ch. 11
Performance “Write Great Code, Vol 1” Ch. 12

Common Pitfalls & Debugging

Problem 1: “Segfault on execution”

  • Why: Memory not marked executable.
  • Fix: Use mprotect after writing code.
  • Quick test: Print page permissions in /proc/self/maps.

Problem 2: “Incorrect results”

  • Why: Calling convention mismatch or register clobbering.
  • Fix: Preserve callee-saved registers and follow the ABI.
  • Quick test: Compare against a reference interpreter.

Problem 3: “Auto-tuning unstable”

  • Why: Timing noise and cache effects.
  • Fix: Warm up and measure multiple times.
  • Quick test: Compare median vs mean.

Definition of Done

  • JIT generates correct output for 10+ expressions
  • At least two code paths (scalar + vector) implemented
  • Runtime selects fastest path reliably
  • Benchmark report includes CPU model and vector width