Project 8: Macro-op Fusion Detector

Build a benchmark that discovers which instruction pairs fuse and how fusion impacts uOp count.

Quick Reference

Attribute | Value
Difficulty | Level 2: Intermediate
Time Estimate | 1-2 weeks
Main Programming Language | Assembly
Alternative Programming Languages | C
Coolness Level | Level 3: Genuinely Clever
Business Potential | 2. The “Micro-SaaS / Pro Tool”
Prerequisites | Assembly basics, the uOp cache concept, perf counters
Key Topics | macro-op fusion, decode pipeline, uOp counting, alignment

1. Learning Objectives

By completing this project, you will:

  1. Explain what macro-op fusion is and why it exists.
  2. Detect fused instruction pairs via uOp counts.
  3. Distinguish fusion effects from uOp cache effects.
  4. Build a catalog of fusion rules for your CPU.
  5. Quantify throughput improvements due to fusion.

2. All Theory Needed (Per-Concept Breakdown)

2.1 Macro-op Fusion and Decode Rules

Fundamentals

Macro-op fusion occurs when the front-end combines two adjacent instructions into a single fused micro-op. This reduces the number of uOps entering the backend, improving throughput and reducing power. Typical fusion pairs include compare + branch or test + branch. Fusion is limited by decoder rules and alignment constraints. Understanding fusion helps you write faster low-level code and explains why some instruction pairs are faster than others.

Before the deep dive, keep the simplest mental model in view: the decoder is what changes state (it emits uOps), the PMU counters are what observe that state, and adjacency plus instruction form are the non-negotiable constraints. The unit of measurement throughout is uOps per iteration.

Deep Dive into the concept

The x86 front-end decodes complex instructions into uOps. Some instruction pairs are so common that the front-end can fuse them, treating them as one macro-op. This usually happens for compare/test followed by a conditional branch, where the comparison sets flags and the branch consumes them. By fusing, the decoder emits a single fused uOp that carries both operations to the backend. This reduces uOp count and can improve throughput, especially in branch-heavy code.

Fusion is not universal. It depends on CPU generation, alignment, and instruction form. For example, fusion may require the compare and branch to be adjacent, with no intervening instructions. It may also require certain addressing modes or immediate sizes. Some CPUs support fusion of load + op, or cmp + jcc, but not all do. Additionally, if the instruction pair crosses a fetch boundary or uOp cache line, fusion might not occur. This is why alignment and loop layout matter.

The impact of fusion is measurable using uops_retired.slots or similar counters. A fused pair should retire fewer uOps than two separate instructions. You can also observe changes in cycles per iteration in tight loops. However, you must isolate fusion from other effects like uOp cache or decode bandwidth. This means keeping the loop small, aligned, and using a consistent instruction mix. You can also compare with a non-fusible control pair (e.g., or + jcc, which does not macro-fuse on Intel cores; note that add + jcc itself fuses on many recent Intel cores, so it makes a poor control there).

Fusion also interacts with branch prediction. A fused compare+branch is a single macro-op, which can change how the predictor and frontend handle it. For the purposes of this project, you should treat fusion as a decode optimization that reduces uOp count. But be aware that in some CPUs, fusion may enable additional optimizations such as reduced front-end pressure.

Your benchmark should generate a matrix of instruction pairs and determine whether each pair fuses. The output should include uOp counts and a boolean fused/not-fused flag. This creates a practical fusion catalog, similar to the tables in vendor manuals but tailored to your CPU.
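The catalog row described above can be sketched in C. The 1.5-uOp cutoff and the function name are illustrative assumptions, not part of the tool's specified interface:

```c
#include <stdio.h>
#include <string.h>

// Format one fusion-catalog row in the CSV shape used later in this
// project. A pair retiring at most ~1.5 uOps is classified as fused
// (hypothetical midpoint between 1 fused uOp and 2 separate uOps).
static int format_catalog_row(char *buf, size_t n, const char *pair,
                              double uops_per_iter, double cycles_per_iter) {
    const char *fused = (uops_per_iter <= 1.5) ? "yes" : "no";
    return snprintf(buf, n, "%s,%.1f,%.2f,%s", pair, uops_per_iter,
                    cycles_per_iter, fused);
}
```

Emitting one such row per tested pair, plus a header line and CPU metadata, yields the catalog artifact described in §3.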

In real designs, fusion behavior is rarely isolated; it interacts with pipeline depth, power management, compiler decisions, and even microcode updates. When you study it, vary one knob at a time and hold everything else constant: pin the core, fix the frequency if possible, warm up caches and predictors, and record the exact compiler flags. Vendor manuals describe typical behavior, but the actual thresholds can shift across steppings or microcode revisions, so empirical measurement is the ground truth. If your results disagree with published numbers, investigate confounders such as alignment, instruction form, address mapping, or hidden dependencies introduced by the compiler. From a software perspective, compilers and JITs implicitly target fusion rules via instruction selection, scheduling, and unrolling, so translate your measurements into actionable rules of thumb. Finally, validate with at least two workloads: a synthetic microbenchmark and a slightly more realistic kernel. If both show the same trend, the effect is not an artifact of the test harness.

How this fits into the project

You will use this concept to design pair tests in §3.2 and to interpret uOp counters in §3.7.

Definitions & key terms

  • macro-op fusion -> combining two instructions into a single uOp
  • fused pair -> instruction pair that can be fused
  • uOp count -> number of micro-operations retired
  • decode rule -> constraint on which pairs can fuse

Mental model diagram (ASCII)

cmp rax, rbx
jne label
   => [fused macro-op] -> backend

How it works (step-by-step, with invariants and failure modes)

  1. Fetch and decode adjacent instruction pair.
  2. Decoder checks fusion rules.
  3. If eligible, emit single fused uOp.
  4. Otherwise, emit two uOps.
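The decision steps above can be modeled as a tiny C predicate. This is a deliberately simplified model with a hypothetical rule set; real CPUs fuse more first-instruction forms (e.g., add/sub/inc/dec on some Intel generations), and the eligible set varies by generation:

```c
#include <stdbool.h>

// Toy model of the decoder's fusion decision. OP_OR stands in for a
// flag-setting instruction that does not macro-fuse on Intel cores.
typedef enum { OP_CMP, OP_TEST, OP_OR, OP_JCC } Op;

static bool can_macro_fuse(Op first, Op second,
                           bool adjacent, bool crosses_window) {
    if (!adjacent || crosses_window) return false; // steps 1-2: adjacency and window rules
    if (second != OP_JCC) return false;            // second must be a conditional branch
    return first == OP_CMP || first == OP_TEST;    // simplified eligible-form check
}
```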

Invariants:

  • Fusion requires adjacency and compatible forms.
  • Fusion is front-end only; backend sees fewer uOps.

Failure modes:

  • Misalignment or boundary crossing prevents fusion.
  • Different instruction forms do not fuse.

Minimal concrete example

cmp rax, rbx
jne label

Common misconceptions

  • “All compare+branch pairs fuse” -> some forms do not.
  • “Fusion happens in the backend” -> it is a decode-stage optimization.

Check-your-understanding questions

  1. Why does fusion improve throughput?
  2. What conditions must be met for fusion?
  3. How can you detect fusion experimentally?

Check-your-understanding answers

  1. It reduces uOp count and front-end pressure.
  2. Adjacent compatible instruction forms and alignment constraints.
  3. Measure uOp count and compare with non-fused pairs.

Real-world applications

  • Hand-tuned assembly
  • Compiler backend instruction selection

References

  • Intel Optimization Manual, macro-fusion section
  • “Agner Fog’s Optimization Manuals”

Key insights

  • Fusion is a decode shortcut that reduces uOp pressure and improves throughput.

Summary

Macro-op fusion is a real, measurable front-end optimization. Your benchmark will build a fusion rule catalog.

Homework/Exercises to practice the concept

  1. Predict whether test rax, rax + je will fuse.
  2. Compare cmp + jcc to sub + jcc and predict differences.

Solutions to the homework/exercises

  1. Often yes, but depends on CPU generation and alignment.
  2. cmp + jcc fuses with the widest set of condition codes; sub + jcc also fuses on many recent Intel cores, but only with a subset of conditions, and not at all on older designs.

2.2 uOp Counting and Front-End Isolation

Fundamentals

To detect fusion, you must count uOps accurately and isolate front-end effects. If the uOp cache hits, the decoder is bypassed and fusion may not be visible in the same way. You must therefore design loops that either avoid the uOp cache or control for it. Performance counters like uops_retired.slots can approximate how many uOps were executed. Combined with timing, this gives a reliable signal of fusion.

Deep Dive into the concept

Counting uOps is not the same as counting instructions. Complex instructions can expand into multiple uOps, while fused pairs reduce uOp count. PMU counters expose uOps retired, but some counters count in slots rather than literal uOps. You must understand the counter semantics for your CPU. A common approach is to run a loop with a known instruction mix and measure the uOps retired per iteration. If a pair is fused, the uOps count will be closer to 1 per pair than 2.
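The per-pair arithmetic described above can be written down directly. This sketch assumes each iteration holds `pairs_per_iter` copies of the candidate pair plus a fixed loop overhead (counter update and loop branch); both names are illustrative:

```c
// Recover uOps per candidate pair from raw counter totals.
static double uops_per_pair(unsigned long long total_uops,
                            unsigned long long iterations,
                            unsigned pairs_per_iter, double overhead_uops) {
    double per_iter = (double)total_uops / (double)iterations;
    return (per_iter - overhead_uops) / (double)pairs_per_iter;
}
```

With 8 copies per iteration and 2 overhead uOps, a fused pair should land near 1.0 and an unfused pair near 2.0.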

To isolate front-end effects, you should design the loop to fit comfortably within the uOp cache or to deliberately miss it, then compare results. If the loop hits the uOp cache, the front-end decode may not be on the critical path, and fusion effects can be harder to observe. One approach is to make the loop large enough to exceed the uOp cache, forcing decode. Another is to use alignment tricks to force a decode boundary and see how fusion changes. The key is to hold other factors constant.

The loop should be dependency-free so that backend throughput does not dominate. If the backend is the bottleneck, uOp count may not correlate with performance. This is why you use simple register comparisons and branches that are perfectly predicted. You are measuring front-end uOp production, not branch mispredictions. You can also use a baseline loop of NOPs or adds to calibrate uOp counter behavior.

When you detect fusion, you should confirm it with both counters and timing. A fused pair should show fewer uOps and slightly lower cycles per iteration. If the uOp count decreases but timing does not, the bottleneck might be elsewhere (e.g., branch unit or front-end fetch). This dual-channel validation is essential for credible results.
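The dual-channel validation can be sketched as a small classifier. The 0.75 and 0.95 cutoffs are hypothetical and should be calibrated against your own control pairs:

```c
#include <stdbool.h>

// Require agreement between the uOp counter and timing before
// declaring fusion; a uOp drop without a timing win is reported
// separately, since the bottleneck may be elsewhere.
typedef enum { NOT_FUSED, FUSED, FRONTEND_ONLY } Verdict;

static Verdict classify_pair(double uops, double uops_ctrl,
                             double cycles, double cycles_ctrl) {
    bool fewer_uops = uops < 0.75 * uops_ctrl;
    bool faster     = cycles < 0.95 * cycles_ctrl;
    if (fewer_uops && faster) return FUSED;
    if (fewer_uops) return FRONTEND_ONLY; // front-end win, no back-end win
    return NOT_FUSED;
}
```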

Finally, you should remember that fusion rules are microarchitecture-specific. Your results may not match published tables exactly, especially if microcode updates have changed behavior. That is why the project produces a local catalog with CPU model and microcode version recorded.

How this fits into the project

You will use uOp counting in §3.7 to classify fused pairs and in §7.1 to debug anomalies.

Definitions & key terms

  • uops_retired.slots -> counter for uOps retired in backend slots
  • front-end isolation -> designing tests to avoid backend bottlenecks
  • baseline loop -> control loop to calibrate counters

Mental model diagram (ASCII)

Loop -> uOp counter -> uOps/iter
Compare pair A vs pair B to detect fusion

How it works (step-by-step, with invariants and failure modes)

  1. Build a loop with a candidate instruction pair.
  2. Measure uOps retired per iteration.
  3. Compare against a known non-fused pair.
  4. Classify as fused if uOps drop.

Invariants:

  • Branch is perfectly predicted.
  • Loop size and alignment are constant.

Failure modes:

  • uOp cache hit hides decode differences.
  • Backend port pressure masks uOp changes.

Minimal concrete example

perf stat -e uops_retired.slots ./fusion_test

Common misconceptions

  • “Counters give exact uOp counts” -> they are sometimes scaled.
  • “Timing alone proves fusion” -> timing can be dominated by other factors.

Check-your-understanding questions

  1. Why use both counters and timing?
  2. How can uOp cache hits confuse fusion measurement?
  3. What is a good baseline loop?

Check-your-understanding answers

  1. Counters show uOp count, timing shows performance impact.
  2. The decoder is bypassed, so fusion effects may not appear.
  3. A loop of independent NOPs or adds with known uOp counts.

Real-world applications

  • Compiler tuning for branch-heavy code
  • Microarchitecture reverse engineering

References

  • Intel/AMD PMU documentation
  • uops.info

Key insights

  • Fusion is best detected by combining uOp counters with controlled timing.

Summary

Accurate uOp counting is the foundation of fusion detection. Control front-end effects to make the signal clear.

Homework/Exercises to practice the concept

  1. Compare uOp counts for cmp+jcc vs add+jcc.
  2. Make a loop that misses the uOp cache and repeat the test.

Solutions to the homework/exercises

  1. The fused pair should show fewer uOps.
  2. The difference should become clearer when decode is active.

2.3 Decode Boundary Effects and Fusion Inhibition

Fundamentals

Macro-op fusion is a front-end optimization with strict boundary rules. A candidate pair can fuse only if the decoder sees them adjacent and within certain fetch or decode windows. If a pair straddles a fetch boundary or crosses a 16-byte or 32-byte limit (depending on microarchitecture), fusion may be blocked. This means that alignment, instruction length, and padding can turn a fusible pair into a non-fusible one. A fusion detector must therefore test not just which pairs fuse, but also where they stop fusing. Otherwise you risk assuming fusion that only happens in ideal alignment cases.

Deep Dive into the concept

The front-end fetches instruction bytes in fixed-size blocks and then decodes them. Macro-fusion usually happens in the decoder, before uOps enter the scheduler or uOp cache. The decoder has limited visibility: it processes a small window of bytes each cycle. If two instructions are adjacent but split across decoder windows, fusion can fail. On some Intel cores, a 32-byte boundary can inhibit fusion. On others, a 16-byte decode block is the limiting factor. These details are not always documented, but you can infer them experimentally by varying alignment and measuring uOp counts.

Instruction length is a common way fusion is accidentally broken. For example, a long immediate or a complex addressing mode can make an instruction large enough that the paired branch starts in a new decode block. In that case, the decoder may treat them as separate and no fusion occurs. This is why assembly-level tuning sometimes includes rewriting instructions to shorter encodings or inserting alignment directives to keep pairs inside a single decode block.
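A minimal sketch of the boundary check, assuming a fixed decode-window size and byte offsets measured from the window base (the function name and signature are illustrative):

```c
#include <stdbool.h>

// Does a pair starting `offset` bytes into the fetch region straddle
// a decode-window boundary? `window` is the hypothesized block size,
// e.g. 16 or 32 bytes depending on microarchitecture.
static bool pair_crosses_boundary(unsigned offset, unsigned len_first,
                                  unsigned len_second, unsigned window) {
    unsigned last_byte = offset + len_first + len_second - 1;
    return (offset / window) != (last_byte / window);
}
```

A 3-byte cmp at offset 30 followed by a 2-byte jcc spills into the next 32-byte block, which is exactly the case where fusion may be inhibited.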

The uOp cache complicates measurement. When uOps are delivered from the uOp cache, the decoder does not run, and fusion behavior might appear fixed even if the decoded stream would not fuse. To isolate decoder behavior, you can force the legacy decode path by using a loop that exceeds uOp cache capacity or by alternating with other code to evict the cache. A robust fusion detector should report both “uOp cache path” and “legacy decode path” results.
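Sizing the loop to exceed the uOp cache reduces to simple arithmetic. The ~1536-uOp capacity used in the test below is typical of Skylake-era Intel cores, but look up the figure for your CPU:

```c
// How many copies of the candidate pair must the loop body hold so
// that its decoded uOps exceed a hypothesized uOp cache capacity,
// forcing delivery from the legacy decode path?
static unsigned copies_to_evict_uop_cache(unsigned cache_capacity_uops,
                                          unsigned uops_per_copy) {
    return cache_capacity_uops / uops_per_copy + 1; // one past capacity
}
```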

Another subtlety is micro-op fusion (e.g., load+ALU) versus macro-op fusion (e.g., CMP+Jcc). These are distinct optimizations with different rules. A pair may be macro-fused into one uOp, but the resulting uOp may still split into micro-ops in the back-end for port scheduling. This is why you should compare uOps retired (macro view) with port pressure or throughput (micro view). If uOps retired drop but throughput does not improve, you may have discovered a front-end win without a back-end win.

Finally, to build a real fusion catalog, you should measure multiple alignments for each pair and record the conditions under which fusion holds. Your output can be a table with columns: pair, alignment, uOps retired, instructions retired, and a boolean for fusion. This is the practical artifact a compiler engineer or assembly programmer can use.

How this fits into the project

You will use this in §3.5 Data Formats to include alignment fields, in §5.10 Phase 2 to run boundary sweeps, and in §7.1 to interpret false negatives.

Definitions & key terms

  • decode window -> limited byte region processed by decoders per cycle
  • boundary inhibition -> loss of fusion when a pair crosses a boundary
  • legacy decode path -> decoder-based uOp generation (not uOp cache)
  • macro-op fusion -> combining two instructions into one uOp at decode
  • micro-op fusion -> combining load and ALU into a single uOp later in pipeline

Mental model diagram (ASCII)

[Bytes 0..31] [Bytes 32..63]
   CMP  Jcc|  <- boundary split => no fusion

How it works (step-by-step, with invariants and failure modes)

  1. Generate a candidate instruction pair in a tight loop.
  2. Sweep alignment so the pair crosses decode boundaries.
  3. Measure uOps retired and instructions retired.
  4. Force legacy decode to isolate decoder behavior.
  5. Record fusion outcomes by alignment.
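The sweep in steps 2-5 can be driven by a small harness. Here `model_uops_at` is a stand-in measurement model for testing; in the real tool the callback would emit and measure the loop at the given alignment:

```c
// Hypothetical measurement: uOps per pair at a given byte offset.
// This stand-in model breaks fusion when the pair straddles a
// 32-byte window (offsets 30 and 31 for a 5-byte pair).
static double model_uops_at(unsigned offset) {
    return (offset % 32 >= 30) ? 2.0 : 1.0;
}

// Alignment sweep: record uOps/pair for each offset 0..max_offset.
static void sweep_alignments(double (*measure)(unsigned),
                             unsigned max_offset, double results[]) {
    for (unsigned off = 0; off <= max_offset; ++off)
        results[off] = measure(off); // step 3: measure at this alignment
}
```

Plotting the resulting array against offset reveals exactly where fusion breaks, which is the boundary map this section is after.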

Invariants:

  • Pair must be adjacent with no intervening uOps.
  • Measurements compare identical loop bodies except alignment.

Failure modes:

  • uOp cache hides decode boundary effects.
  • Loop overhead dominates and masks uOp counts.

Minimal concrete example

.align 32
cmp rax, 0
je  target

Common misconceptions

  • “If a pair is fusible, it always fuses” -> boundaries can block fusion.
  • “Fusion is a back-end feature” -> macro-fusion happens in the decoder.
  • “uOp cache always preserves fusion” -> it depends on what was decoded.

Check-your-understanding questions

  1. Why does alignment change whether a pair fuses?
  2. How can you force the legacy decode path?
  3. Why compare uOps retired with port pressure data?

Check-your-understanding answers

  1. Because the decoder only sees a limited window and boundaries split pairs.
  2. Use a loop larger than uOp cache or evict with other code.
  3. It distinguishes front-end fusion from back-end execution cost.

Real-world applications

  • Hand-written assembly for low-latency trading or crypto
  • Compiler back-end code layout and branch scheduling
  • Reverse-engineering microarchitecture behavior

References

  • Intel Optimization Reference Manual, macro-fusion tables
  • Agner Fog instruction tables (fusion notes)

Key insights

  • Fusion is conditional on alignment; a pair is not always a pair.

Summary

Macro-op fusion depends on decoder window boundaries. By sweeping alignment and forcing legacy decode, you can map the real fusion rules for your CPU.

Homework/Exercises to practice the concept

  1. Test CMP+Jcc with alignments from 0 to 63 bytes and plot uOp count.
  2. Compare the same test with a larger loop that evicts the uOp cache.

Solutions to the homework/exercises

  1. You should see fusion breaks at specific boundary offsets.
  2. The legacy decode path shows stronger boundary sensitivity than the uOp cache path.

3. Project Specification

3.1 What You Will Build

A benchmarking tool that tests a list of instruction pairs for macro-op fusion. It measures uOp counts and cycles per iteration, then reports which pairs fuse and by how much they reduce uOp count.

3.2 Functional Requirements

  1. Pair Generator: Emit loops for candidate instruction pairs.
  2. Timing + Counter Harness: Measure cycles and uOp counts.
  3. Fusion Classifier: Decide fused/not fused based on thresholds.
  4. Report Generator: Produce a fusion catalog with CPU metadata.

3.3 Non-Functional Requirements

  • Performance: Test at least 20 pairs in under 2 seconds.
  • Reliability: Consistent classification across trials.
  • Usability: CLI flags for pair lists and output format.

3.4 Example Usage / Output

$ ./fusion_detector --pairs cmp+jcc,test+jcc,add+jcc
pair,uops_per_iter,cycles_per_iter,fused
cmp+jcc,1.0,0.50,yes
test+jcc,1.0,0.50,yes
add+jcc,2.0,1.00,no

3.5 Data Formats / Schemas / Protocols

CSV output:

pair,uops_per_iter,cycles_per_iter,fused
cmp+jcc,1.0,0.50,yes

3.6 Edge Cases

  • Pairs that are not adjacent due to assembler reordering
  • Branch prediction failures skewing timing
  • uOp cache hits hiding decode differences

3.7 Real World Outcome

You will produce a fusion catalog table that lists fused pairs for your CPU.

3.7.1 How to Run (Copy/Paste)

gcc -O2 -c src/timing.c src/main.c
as --64 -o fusion.o src/fusion.s
gcc -o fusion_detector main.o timing.o fusion.o
./fusion_detector --pairs cmp+jcc,test+jcc

3.7.2 Golden Path Demo (Deterministic)

  • Use cmp+jcc and add+jcc as baseline pairs.
  • Expect cmp+jcc to show fewer uOps.

3.7.3 If CLI: Exact Terminal Transcript

$ ./fusion_detector --pairs cmp+jcc,add+jcc
pair,uops_per_iter,cycles_per_iter,fused
cmp+jcc,1.0,0.50,yes
add+jcc,2.0,1.00,no

$ echo $?
0

Failure demo (bad pair):

$ ./fusion_detector --pairs foo+bar
Error: unknown pair 'foo+bar'

$ echo $?
3

Exit codes:

  • 0 success
  • 2 PMU init error
  • 3 invalid argument

4. Solution Architecture

4.1 High-Level Design

+-------------+   +------------------+   +-----------------+
| Pair Builder|-> | Timing + Counters|-> | Fusion Catalog  |
+-------------+   +------------------+   +-----------------+

4.2 Key Components

Component | Responsibility | Key Decisions
Pair Builder | Emit loops for candidate pairs | Adjacency guaranteed
Counter Harness | Read uOp counts | Minimal event set
Classifier | Decide fused vs. not fused | Threshold on uOps

4.3 Data Structures (No Full Code)

struct Result { const char* pair; double uops; double cycles; bool fused; };

4.4 Algorithm Overview

Key Algorithm: Fusion Classification

  1. Measure uOps per iter for each pair.
  2. Compare against baseline non-fused pair.
  3. Classify as fused if uOps <= threshold.
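A minimal sketch of step 3, assuming a fused pair retires ~1 uOp and an unfused pair ~2; the 0.75 factor is a hypothetical midpoint scaled by the measured baseline to absorb counter calibration error:

```c
#include <stdbool.h>

// A pair is "fused" if it retires noticeably fewer uOps per pair
// than the non-fused baseline control.
static bool classify_fused(double uops_per_pair,
                           double baseline_uops_per_pair) {
    return uops_per_pair <= 0.75 * baseline_uops_per_pair;
}
```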

Complexity Analysis:

  • Time: O(P) for P pairs
  • Space: O(P)

5. Implementation Guide

5.1 Development Environment Setup

as --version
perf --version

5.2 Project Structure

fusion-detector/
├── src/
│   ├── fusion.s
│   ├── timing.c
│   └── main.c
└── README.md

5.3 The Core Question You’re Answering

“Which instruction pairs fuse on my CPU, and how much does it help?”

5.4 Concepts You Must Understand First

  1. Macro-op fusion rules
  2. uOp counting and PMU counters
  3. Alignment and uOp cache effects

5.5 Questions to Guide Your Design

  1. How will you ensure instruction adjacency?
  2. Which baseline pairs will you use?
  3. How will you handle CPUs without certain counters?

5.6 Thinking Exercise

Choose three candidate pairs and predict which will fuse. Explain why.

5.7 The Interview Questions They’ll Ask

  1. What is macro-op fusion?
  2. Why does fusion reduce uOp count?
  3. How do you detect fusion experimentally?

5.8 Hints in Layers

Hint 1: Keep loops tiny and aligned to avoid uOp cache noise.

Hint 2: Use perf to read uops_retired events.

Hint 3: Compare against a known non-fused pair.

5.9 Books That Will Help

Topic | Book | Chapter
Decode pipeline | “Inside the Machine” (Jon Stokes) | Ch. 4
Optimization manuals | Agner Fog’s microarchitecture manual | Macro-fusion section

5.10 Implementation Phases

Phase 1: Foundation (2-3 days)

  • Build pair templates and compile them.
  • Checkpoint: loops run and produce stable timing.

Phase 2: Core Functionality (4-6 days)

  • Add timing and uOp counters.
  • Checkpoint: fused pairs show lower uOp counts.

Phase 3: Catalog (2-3 days)

  • Build report and export CSV.
  • Checkpoint: fusion table generated.

5.11 Key Implementation Decisions

Decision | Options | Recommendation | Rationale
Pair encoding | asm macros vs. codegen | asm macros | Precise adjacency control
Threshold | Fixed vs. adaptive | Adaptive | Accounts for CPU differences

6. Testing Strategy

6.1 Test Categories

Category | Purpose | Examples
Unit Tests | Verify pair adjacency | Disassembly check
Integration Tests | End-to-end runs | cmp+jcc
Edge Tests | Missing counters | Fallback mode

6.2 Critical Test Cases

  1. cmp+jcc should fuse on many CPUs.
  2. add+jcc should not fuse on older cores; on many recent Intel cores it does fuse, so or+jcc is a safer non-fused control.
  3. Misaligned loops should reduce fusion frequency.

6.3 Test Data

pairs: cmp+jcc, add+jcc

7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

Pitfall | Symptom | Solution
Non-adjacent pairs | No fusion observed | Enforce adjacency
uOp cache effects | Inconsistent uOp counts | Control loop size
Branch mispredicts | Noisy timing | Use predictable branches

7.2 Debugging Strategies

  • Disassemble loop to confirm instruction order.
  • Compare with uops.info expected fusion.

7.3 Performance Traps

  • Large loop bodies reduce decode control and hide fusion.

8. Extensions & Challenges

8.1 Beginner Extensions

  • Add a JSON report.

8.2 Intermediate Extensions

  • Test fusion across multiple compiler outputs.

8.3 Advanced Extensions

  • Compare fusion behavior across CPU generations.

9. Real-World Connections

9.1 Industry Applications

  • Hand-tuned assembly in performance libraries
  • Compiler instruction selection for branch-heavy code

9.2 Related Tools

  • uops.info: reference microbenchmarks
  • llvm-mca: static throughput modeling

9.3 Interview Relevance

  • Demonstrates deep understanding of front-end optimizations.

10. Resources

10.1 Essential Reading

  • Intel Optimization Manual, macro-fusion section
  • “Agner Fog’s Optimization Manuals”

10.2 Video Resources

  • “Instruction Decoding and Fusion” lecture

10.3 Tools & Documentation

  • perf: uOps counters
  • objdump: verify adjacency

11. Self-Assessment Checklist

11.1 Understanding

  • I can explain macro-op fusion.
  • I can detect fusion using counters.
  • I can separate fusion from uOp cache effects.

11.2 Implementation

  • Fusion catalog generated.
  • Results stable across trials.
  • CPU metadata recorded.

11.3 Growth

  • I can discuss fusion in an interview.

12. Submission / Completion Criteria

Minimum Viable Completion:

  • Test at least 10 instruction pairs and classify fusion.

Full Completion:

  • Include uOp counts and timing in report.

Excellence (Going Above & Beyond):

  • Compare fusion results against published tables and discuss differences.