Project 5: Execution Port Pressure Map
Build a microbenchmark suite that maps instruction throughput to backend execution ports.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Level 3: Advanced |
| Time Estimate | 1-2 weeks |
| Main Programming Language | C++ (with inline assembly) |
| Alternative Programming Languages | Rust, Assembly |
| Coolness Level | Level 4: Hardcore Tech Flex |
| Business Potential | 2. The “Micro-SaaS / Pro Tool” |
| Prerequisites | Basic assembly, CPU pipeline basics, perf tooling |
| Key Topics | execution ports, throughput vs latency, port contention, PMU counters |
1. Learning Objectives
By completing this project, you will:
- Explain how execution ports and functional units limit throughput.
- Design instruction mixes that stress specific ports.
- Use PMU counters to validate port pressure hypotheses.
- Build a port pressure heat map for your CPU.
- Interpret results to guide performance tuning.
2. All Theory Needed (Per-Concept Breakdown)
2.1 Execution Ports, Functional Units, and Throughput
Fundamentals
Execution ports are groups of functional units that execute micro-operations. Each cycle, the scheduler issues ready uOps to available ports. The number and type of ports determine the maximum throughput for different instruction classes. For example, a CPU might have two load ports, one store port, and multiple ALU ports. If your loop uses instructions bound to a single port, it will bottleneck even if everything else is idle. Throughput describes how many uOps can be completed per cycle, and it is distinct from latency. Understanding the port map explains why some instruction sequences are faster than others.
Additional fundamentals for Execution Ports, Functional Units, and Throughput: pin down the simplest mental model and the unit of measurement (uOps per cycle) before anything else. Identify what changes state (uOps executing on ports), what observes that state (timers and PMU counters), and which constraints are non-negotiable (each port issues at most a fixed number of uOps per cycle). This keeps the concept grounded before moving to deeper microarchitectural details.
Deep Dive into the concept
A modern out-of-order core has a scheduler that holds uOps and dispatches them to ports. Each port connects to one or more execution units (ALU, FPU, load/store, branch). An instruction type maps to one or more ports. For example, integer add might be able to execute on ports 0 and 1, while integer multiply might only execute on port 1. If an instruction can use multiple ports, it is flexible; if it can only use one, it is a port bottleneck. When you run a tight loop of a single instruction, the steady-state throughput is limited by the number of ports that instruction can use and the port’s issue capacity.
Port pressure refers to how heavily a loop stresses each port. A balanced loop spreads uOps across ports and reaches higher throughput. An unbalanced loop saturates one port and leaves others idle. This is why optimizing throughput often means mixing instructions or reordering operations to use different ports. The classic example is combining integer adds and multiplies to utilize separate units.
Port usage is not always obvious from the ISA because complex instructions can decode into multiple uOps. Some instructions also have micro-fused uOps that can execute together. Therefore, empirical measurement is valuable. By generating a loop with one instruction type and measuring cycles per iteration, you can infer the throughput. If you also measure the number of uOps retired and cycles, you can estimate uOps per cycle. By combining instruction mixes (e.g., two types in the same loop), you can test if they contend for the same port or use different ports. If the combined throughput equals the maximum of the individual throughputs, they likely contend. If it approaches the sum, they use different ports.
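The contention test described above can be turned into a simple decision rule. Here is a minimal sketch; the 15% tolerance is an illustrative assumption, not a fixed constant, and real measurements will need calibration against your noise floor:

```cpp
#include <algorithm>
#include <cassert>
#include <string>

// Classify whether two instruction classes share ports, given their
// measured steady-state throughputs (uOps/cycle) in isolation and mixed.
// Heuristic: mixed throughput near max(a, b) suggests the classes contend
// for the same port(s); near a + b suggests disjoint ports; anything in
// between suggests partial overlap of their allowed port sets.
std::string classify_contention(double iso_a, double iso_b, double mixed,
                                double tol = 0.15) {  // tolerance: assumption
    double same = std::max(iso_a, iso_b);
    double disjoint = iso_a + iso_b;
    if (mixed <= same * (1.0 + tol)) return "contend";
    if (mixed >= disjoint * (1.0 - tol)) return "disjoint";
    return "partial-overlap";
}
```

For example, if adds alone reach 4 uOps/cycle and multiplies alone reach 1, a mixed loop that still tops out near 4 indicates the multiply port is one of the add ports; a mixed loop approaching 5 indicates a dedicated multiply port.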
The scheduler and reorder buffer allow multiple in-flight uOps, but for a tight, dependency-free loop, the bottleneck is the port throughput. You should structure your loop to avoid dependency chains so that the scheduler can issue uOps freely. This usually means using multiple independent registers and unrolling the loop. If you inadvertently create dependencies, latency will dominate and your measurements will not reflect port pressure. Therefore, constructing independent instruction sequences is part of the experimental design.
Execution port models are often summarized in tables (e.g., from uops.info). But CPUs can differ by generation. Your project builds a local map for your CPU, which is valuable because it captures the actual hardware and microcode. This is also the foundation for performance analysis tools like Intel’s IACA or LLVM-MCA.
Additional deep dive considerations for Execution Ports, Functional Units, and Throughput: in real designs, port behavior is never isolated; it interacts with pipeline depth, power management, compiler decisions, and even microcode updates. When you study it, vary one knob at a time and hold everything else constant: pin the core, fix the frequency if possible, warm up caches and predictors, and record the exact compiler flags. Vendor manuals describe typical behavior, but actual thresholds can shift across steppings or microcode revisions, so empirical measurement is the ground truth. If your results disagree with published numbers, investigate confounders such as alignment, instruction form, address mapping, or hidden dependencies introduced by the compiler. Compilers and JITs implicitly target the port map via instruction selection, scheduling, and unrolling, so translate your measurements into actionable rules of thumb. Finally, validate with at least two workloads: a synthetic microbenchmark and a slightly more realistic kernel. If both show the same trend, the effect is not an artifact of the test harness.
How this fits into the project
You will use this concept to design independent instruction sequences in §3.2 and to interpret throughput limits in §3.7.
Definitions & key terms
- execution port -> pipeline endpoint with specific functional units
- throughput -> uOps per cycle sustainable in steady state
- latency -> cycles from issue to result
- port contention -> multiple uOps competing for the same port
Mental model diagram (ASCII)
Scheduler -> Port 0 (ALU) -> Exec
-> Port 1 (ALU) -> Exec
-> Port 2 (Load) -> L1
How it works (step-by-step, with invariants and failure modes)
- Decode loop into uOps and place in scheduler.
- Each cycle, issue ready uOps to free ports.
- If a port is saturated, uOps wait.
- Throughput equals issue rate of the busiest port.
Invariants:
- A uOp can only execute on its allowed port set.
- Issue rate per port is limited per cycle.
Failure modes:
- Dependency chains hide port pressure by introducing latency stalls.
- Front-end bottlenecks mask backend port limits.
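The issue model above can be sketched as a tiny calculator. This is an idealized sketch: it assumes each port issues at most one uOp per cycle and that each class's demand splits evenly across its allowed ports, which real schedulers only approximate:

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <vector>

struct UopClass {
    int count;               // uOps of this class per loop iteration
    std::vector<int> ports;  // ports this class is allowed to use
};

// Idealized steady-state model: throughput is capped by the busiest port.
// Returns loop iterations completed per cycle under even load splitting.
double iterations_per_cycle(const std::vector<UopClass>& body) {
    std::map<int, double> load;  // port -> uOps demanded per iteration
    for (const auto& c : body)
        for (int p : c.ports)
            load[p] += static_cast<double>(c.count) / c.ports.size();
    double busiest = 0.0;
    for (const auto& kv : load) busiest = std::max(busiest, kv.second);
    return busiest > 0.0 ? 1.0 / busiest : 0.0;
}
```

With four adds bound to ports {0, 1}, each port carries two uOps per iteration, so the loop completes one iteration every two cycles, i.e. two adds per cycle, matching the invariant that the busiest port sets the rate.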
Minimal concrete example
; Independent adds to avoid dependencies
add r8, r9
add r10, r11
add r12, r13
Common misconceptions
- “Latency and throughput are the same” -> throughput is steady-state rate.
- “More ports always means faster” -> only if the instruction can use them.
Check-your-understanding questions
- Why must you avoid dependencies when measuring port pressure?
- What does it mean if two instructions run faster together than apart?
- How can you infer port binding from timing?
Check-your-understanding answers
- Dependencies introduce latency stalls that hide port limits.
- They likely use different ports and do not contend.
- Compare throughput of isolated vs mixed instruction loops.
Real-world applications
- Compiler instruction scheduling
- Performance tuning for HPC kernels
Where you’ll apply it
- In this project: see §3.2 Functional Requirements and §5.10 Phase 2.
- Also used in: P09-l1-bandwidth-stressor-zen-5-focus.md.
References
- “Agner Fog’s Optimization Manuals”
- uops.info (port mapping data)
Key insights
- Port pressure is the hidden limiter once dependencies are removed.
Summary
Execution ports define the backend throughput ceiling. Your benchmark reveals which ports dominate for specific instruction mixes.
Homework/Exercises to practice the concept
- Predict whether a loop of only integer multiplies will be port-limited.
- Design a loop that uses both ALU and load ports.
Solutions to the homework/exercises
- Yes, multiplies typically map to fewer ports and will bottleneck.
- Combine add instructions with independent loads from aligned arrays.
2.2 PMU Counters and Port-Binding Measurement
Fundamentals
Performance Monitoring Units (PMUs) provide hardware counters for events like cycles, instructions retired, and port usage. By reading these counters around a benchmark, you can validate hypotheses about port pressure. Counters are not perfect and can be noisy, but they are invaluable for confirmation. Tools like perf and pmu-tools simplify access. The core idea is to use counters as a second measurement channel alongside timing.
Deep Dive into the concept
PMU counters count microarchitectural events. For port pressure, some CPUs provide events such as uops_executed.port_0 or uops_retired.slots. These events can indicate how many uOps executed on each port. However, counter availability and meaning vary by CPU model. Some counters are derived rather than direct, and their accuracy can be limited. Therefore, you should use them for validation rather than as the sole measurement.
The typical workflow is: run a microbenchmark, collect counters, compute derived metrics (uOps per cycle, port distribution), and compare to expected values. For example, if a loop of independent adds yields 4 uOps per cycle and counters show high activity on ports 0 and 1, that matches expectations. If counters show high activity on a different port, your assumption about port binding may be wrong. Similarly, if timing shows slower throughput than expected, counters can indicate whether the backend is saturated or the front-end is the bottleneck.
Counter measurement requires careful setup. You should pin the process to a core, disable frequency scaling, and run for enough iterations to avoid sampling error. You should also avoid multiplexing counters by measuring only a few at a time. perf stat can multiplex if too many events are requested, which reduces accuracy. For this project, choose a small set of counters: cycles, instructions, and a couple of port-specific events if available.
Interpreting counters also requires understanding that some events count uOps, not instructions. If your instruction decodes into multiple uOps, the uOp count will exceed the instruction count. Therefore, you should report both. The ratio of uOps to instructions is itself a useful metric because it indicates how heavy the instruction mix is. Additionally, some events include speculative uOps that were later squashed; this matters if your loop includes branches or mispredictions. For steady-state loops with predictable branches, the speculative overhead should be minimal.
Finally, be aware of measurement overhead. Reading counters can perturb timing slightly. If your benchmark runs for many iterations, this overhead is amortized, but you should still avoid reading counters inside the tight loop. Instead, start counters, run the loop, stop counters. Record the loop size and iteration count so results are comparable across runs.
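The derived-metric step of this workflow is simple arithmetic once the raw counts are in hand. A sketch follows; the struct layout is an assumption for illustration, and the raw values would come from perf_event_open or a `perf stat` run rather than being hard-coded:

```cpp
#include <cassert>

// Raw counts read once, after the counters are stopped (never inside
// the hot loop). Field names here are illustrative, not a real API.
struct Counters {
    unsigned long long cycles;
    unsigned long long instructions;
    unsigned long long uops_retired;
};

struct Derived {
    double ipc;                  // instructions per cycle
    double uops_per_cycle;       // backend pressure indicator
    double uops_per_instruction; // how heavily the mix decodes
};

// Compute the derived metrics discussed above. Reporting both uOps/cycle
// and uOps/instruction makes uOp expansion visible in the results.
Derived derive(const Counters& c) {
    return {
        static_cast<double>(c.instructions) / c.cycles,
        static_cast<double>(c.uops_retired) / c.cycles,
        static_cast<double>(c.uops_retired) / c.instructions,
    };
}
```

A loop of simple adds should show uOps/instruction near 1.0; a ratio well above 1.0 signals instructions that decode into multiple uOps, which must be factored into any port inference.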
How this fits into the project
You will use counters to validate your port pressure map in §3.7 and to debug anomalies in §7.1.
Definitions & key terms
- PMU -> Performance Monitoring Unit
- event -> specific hardware counter (e.g., cycles)
- multiplexing -> sharing counters over time when too many are requested
- uops_retired -> count of micro-operations retired
Mental model diagram (ASCII)
[Benchmark] -> [PMU counters] -> [Derived metrics] -> [Port map]
How it works (step-by-step, with invariants and failure modes)
- Select a small set of counters.
- Start counters, run loop, stop counters.
- Compute uOps/cycle and port distribution.
- Compare with timing-based throughput.
Invariants:
- Counters must be read outside the hot loop.
- Compare like-for-like configurations.
Failure modes:
- Counter multiplexing skews results.
- CPU model mismatch causes invalid counter semantics.
Minimal concrete example
perf stat -e cycles,instructions,uops_retired.slots ./port_pressure
Common misconceptions
- “Counters are always exact” -> they can be sampled or derived.
- “More counters is better” -> multiplexing reduces accuracy.
Check-your-understanding questions
- Why is multiplexing bad for microbenchmarks?
- What does uOps_retired measure compared to instructions?
- Why measure both timing and counters?
Check-your-understanding answers
- It reduces counter precision and adds sampling error.
- It counts decoded micro-operations, which can be more than instructions.
- Timing shows performance; counters explain why.
Real-world applications
- Performance analysis in compiler teams
- CPU verification and microarchitecture tuning
Where you’ll apply it
- In this project: see §3.7 Real World Outcome and §6.2 Critical Test Cases.
- Also used in: P04-the-uop-cache-prober.md.
References
- Intel and AMD PMU documentation
- “pmu-tools” by Andi Kleen
Key insights
- PMU counters are evidence, not truth; use them to validate timing-based conclusions.
Summary
Counters let you see port usage directly. Combined with timing, they produce a reliable port pressure map.
Homework/Exercises to practice the concept
- Measure cycles and instructions for a NOP loop and compute IPC.
- Compare uOps_retired for a mix of adds and multiplies.
Solutions to the homework/exercises
- IPC should be high; cycles per instruction should be low.
- The mix should show more uOps retired than instructions retired if any instruction in the mix decodes into multiple uOps.
2.3 Dependency Chains and Port Isolation Techniques
Fundamentals
Execution ports are shared resources. To map port pressure, you need to isolate which ports an instruction uses and ensure no other bottleneck dominates. The most reliable technique is to construct dependency chains that force an instruction to execute serially, then remove dependencies to expose throughput limits. A dependency chain creates a single stream of uOps that must execute one after another, revealing latency and port choice. Independent chains let you fill all available ports and measure throughput. This difference between latency-limited and throughput-limited regimes is the heart of port mapping. If you do not control dependencies, you will misinterpret front-end stalls or cache effects as port pressure.
Deep Dive into the concept
An instruction can often execute on multiple ports. For example, an ADD might use port 0 or port 1 on a given microarchitecture. If you create a dependency chain (e.g., add rax, 1; add rax, 1; …), each instruction depends on the previous result, so the chain exposes the instruction’s latency and the port chosen for that dependency. If the chain runs at one per cycle, you have evidence that the instruction’s latency is 1 cycle and that the chosen port can issue each cycle. If the chain runs at one every two cycles, you have found a port or pipeline limitation.
To measure throughput, you break dependencies. Use multiple registers with independent adds, or use a vector instruction that has no dependencies between iterations. Now the scheduler can issue multiple uOps per cycle, limited by available ports. If you see, for example, 2 adds per cycle, you have evidence of two suitable ports. The port map emerges by combining multiple tests: a dependent chain gives you latency and a hint of preferred ports; independent chains reveal the maximum issue rate.
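The two regimes reduce to two small formulas. A sketch, under the assumption (stated above) that each suitable port issues one such uOp per cycle, which holds for simple ALU ops but not for partially pipelined units:

```cpp
#include <cassert>

// Dependent chain: each op waits for the previous result, so cycles/op
// approximates the instruction's latency.
double inferred_latency(double chain_cycles, double chain_ops) {
    return chain_cycles / chain_ops;
}

// Independent chains: ops/cycle approximates total issue capacity.
// Rounding to the nearest integer gives an estimated port count,
// assuming one uOp per port per cycle.
int inferred_port_count(double indep_cycles, double indep_ops) {
    return static_cast<int>(indep_ops / indep_cycles + 0.5);
}
```

For instance, 1000 dependent MULs taking 3000 cycles suggests a 3-cycle latency, while 1000 independent adds taking 250 cycles suggests four add-capable ports.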
Port isolation is about reducing confounders. You must ensure the front-end is not the bottleneck by keeping the loop short and in the uOp cache. You must avoid memory operations unless your goal is to map load/store ports, because cache misses will dominate. You must also avoid mixing instructions that compete for the same ports unless you are explicitly testing contention. A clean experiment runs only one instruction class at a time, then uses mixed workloads to confirm port sharing. For example, a loop of only MUL can reveal the throughput of the multiply pipeline; a mixed loop of ADD and MUL can reveal whether they contend for the same port or use different ports.
Another advanced technique is to use “port pressure signatures” from performance counters. On some Intel CPUs, you can read uops_executed.port or similar events to see how many uOps issued on each port. These counters are not always precise, but they provide an independent check on your microbenchmark inferences. Combine this with static tools like llvm-mca or uops.info to build a triangulated port map.
Be aware of scheduling artifacts. The scheduler may distribute uOps across ports to balance load, so your measured throughput may look better than a naive single-port model. However, certain instructions have fixed port usage. By testing known fixed-port instructions, you can calibrate your environment and then infer ports for more flexible ones. The end goal is not a perfect map but a practical understanding: which instruction mixes saturate which ports, and how that affects throughput.
How this fits into the project
You will use this in §3.2 to define the microbenchmark loops, in §5.10 Phase 2 to construct dependent and independent chains, and in §6.2 to define tests for contention.
Definitions & key terms
- dependency chain -> a sequence where each instruction depends on the previous result
- throughput -> maximum number of operations per cycle in steady state
- latency -> cycles from issue to result availability
- port map -> mapping of instructions to execution ports
- contention -> multiple uOps competing for the same port
Mental model diagram (ASCII)
Chain: RAX -> ADD -> RAX -> ADD -> RAX (latency)
Independent: R1/R2/R3 -> ADDs -> ports 0+1 (throughput)
How it works (step-by-step, with invariants and failure modes)
- Build a single-register dependency chain and measure cycles per op.
- Build multiple independent chains and measure ops per cycle.
- Compare results to infer port availability and sharing.
- Validate with mixed-instruction loops.
- Cross-check with perf counters or llvm-mca.
Invariants:
- Dependency chains must enforce true RAW dependencies.
- Independent chains must not share registers.
Failure modes:
- Front-end limits cap throughput and hide port limits.
- Unintended memory ops create cache bottlenecks.
Minimal concrete example
; Latency chain
add rax, 1
add rax, 1
add rax, 1
; Throughput test
add rax, 1
add rbx, 1
add rcx, 1
add rdx, 1
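The same two regimes can be reproduced in portable C++ rather than raw assembly. This sketch assumes GCC or Clang for the empty asm barrier, which pins each accumulator in a register and stops the compiler from collapsing or vectorizing the loops:

```cpp
#include <cassert>
#include <cstdint>

// Optimization barrier: forces x to live in a register across the
// statement without emitting any instructions (GCC/Clang extension).
static inline void keep(uint64_t& x) { asm volatile("" : "+r"(x)); }

// Latency regime: one accumulator, every add depends on the previous
// one, so the loop runs at the add instruction's latency.
uint64_t dependent_chain(int iters) {
    uint64_t a = 0;
    for (int i = 0; i < iters; ++i) { a += 1; keep(a); }
    return a;
}

// Throughput regime: four independent accumulators; the scheduler can
// issue these adds in parallel, limited only by available ALU ports.
uint64_t independent_chains(int iters) {
    uint64_t a = 0, b = 0, c = 0, d = 0;
    for (int i = 0; i < iters; ++i) {
        a += 1; b += 1; c += 1; d += 1;
        keep(a); keep(b); keep(c); keep(d);
    }
    return a + b + c + d;
}
```

Timing both variants over the same total add count (with RDTSCP or a steady clock) should show the independent version approaching the port-limited rate while the dependent one runs at the add latency.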
Common misconceptions
- “One benchmark is enough” -> you need both latency and throughput tests.
- “Ports are fixed for all instructions” -> many instructions are flexible.
- “Throughput equals latency” -> only if there is a single port and no pipelining.
Check-your-understanding questions
- Why do dependency chains reveal latency rather than throughput?
- How can you tell if two instructions contend for the same port?
- What front-end effect can make a throughput test misleading?
Check-your-understanding answers
- Dependencies serialize execution, so throughput collapses to latency.
- Mixed loops show reduced throughput compared to separate loops.
- Decode or uOp cache limits can cap issue rate regardless of port capacity.
Real-world applications
- Hand-optimizing hot loops in compilers or JITs
- Understanding why vector code stalls despite high clock rates
- Performance engineering for low-latency systems
Where you’ll apply it
- In this project: see §5.4 Concepts You Must Understand First and §6.2 Critical Test Cases.
- Also used in: P09-l1-bandwidth-stressor-zen-5-focus.md, P11-the-uarch-aware-jit-engine.md.
References
- “uops.info” instruction tables (public database)
- “Computer Architecture: A Quantitative Approach” by Hennessy and Patterson, Ch. 3
Key insights
- Port mapping is an experimental science built on controlled dependencies.
Summary
Dependency chains isolate latency; independent chains reveal throughput. Together, they let you map execution ports and explain why certain instruction mixes saturate the core.
Homework/Exercises to practice the concept
- Build a latency chain for MUL and measure cycles per op.
- Build an independent chain and compute maximum ops per cycle.
Solutions to the homework/exercises
- The chain exposes MUL latency, often several cycles on modern cores.
- The independent chain should approach the documented throughput (for example, one per cycle).
3. Project Specification
3.1 What You Will Build
A benchmark suite that generates independent instruction loops for different instruction classes (integer add, multiply, loads, stores, SIMD) and measures throughput. The output is a port pressure heat map showing cycles per instruction and inferred port bindings.
3.2 Functional Requirements
- Instruction Loop Generator: Create loops for each instruction class.
- Dependency Avoidance: Use multiple registers and unrolling to remove dependencies.
- Timing Harness: Measure cycles per iteration and compute throughput.
- Counter Integration: Optionally collect PMU counters per loop.
- Report Generator: Produce a map of instruction class -> throughput.
3.3 Non-Functional Requirements
- Performance: Each loop runs under 0.5 seconds.
- Reliability: Stable results across trials.
- Usability: CLI to select instruction classes and output formats.
3.4 Example Usage / Output
$ ./port_pressure --ops add,mul,load
op,cycles_per_iter,uops_per_cycle
add,0.25,4.0
mul,1.00,1.0
load,0.50,2.0
3.5 Data Formats / Schemas / Protocols
CSV output:
op,cycles_per_iter,uops_per_cycle
add,0.25,4.0
mul,1.00,1.0
3.6 Edge Cases
- Instruction sequences accidentally dependent
- Front-end bottleneck when the unrolled loop body is too large for the decoders or uOp cache
- PMU events not available on target CPU
3.7 Real World Outcome
You will produce a port pressure report that helps decide which instruction mixes are optimal for throughput.
3.7.1 How to Run (Copy/Paste)
c++ -O2 -Wall -o port_pressure src/main.cpp src/timing.cpp src/loops.S
sudo taskset -c 2 ./port_pressure --ops add,mul,load --trials 5
3.7.2 Golden Path Demo (Deterministic)
- Use --ops add and expect high throughput (close to 4 uOps/cycle on many cores).
3.7.3 If CLI: Exact Terminal Transcript
$ taskset -c 2 ./port_pressure --ops add,mul
add,0.25,4.0
mul,1.00,1.0
$ echo $?
0
Failure demo (unknown op):
$ ./port_pressure --ops foo
Error: unknown op 'foo'
$ echo $?
3
Exit codes:
- 0: success
- 2: PMU init error
- 3: invalid argument
4. Solution Architecture
4.1 High-Level Design
+----------------+ +------------------+ +-----------------+
| Loop Generator |-> | Timing + Counters|-> | Report Builder |
+----------------+ +------------------+ +-----------------+
4.2 Key Components
| Component | Responsibility | Key Decisions |
|---|---|---|
| Loop Generator | Emit independent ops | Use unrolled templates |
| Timing Harness | Measure cycles | RDTSCP + fences |
| Counter Module | Collect PMU events | Small set of events |
4.3 Data Structures (No Full Code)
struct Result { const char* op; double cycles; double uops_per_cycle; };
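The Result records feed directly into the Report Builder. A sketch of how they might be serialized into the CSV schema from §3.5; the fixed-precision formatting choices are assumptions:

```cpp
#include <cassert>
#include <cstdio>
#include <string>
#include <vector>

struct Result { const char* op; double cycles; double uops_per_cycle; };

// Render results in the CSV shape from §3.5: a header row, then one row
// per instruction class (cycles to two decimals, uOps/cycle to one).
std::string to_csv(const std::vector<Result>& rows) {
    std::string out = "op,cycles_per_iter,uops_per_cycle\n";
    char line[64];
    for (const auto& r : rows) {
        std::snprintf(line, sizeof line, "%s,%.2f,%.1f\n",
                      r.op, r.cycles, r.uops_per_cycle);
        out += line;
    }
    return out;
}
```

Keeping serialization in one pure function makes the Report Builder trivially unit-testable, independent of the timing harness.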
4.4 Algorithm Overview
Key Algorithm: Throughput Measurement
- Warm up loop to steady state.
- Time N iterations of unrolled loop.
- Compute cycles per instruction and uOps per cycle.
Complexity Analysis:
- Time: O(N) per op class
- Space: O(1)
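The final step of the algorithm is a unit conversion from raw timing to the reported metrics. A minimal sketch; the default of one uOp per instruction is an assumption that holds for simple ALU ops and must be overridden for instructions that decode into multiple uOps:

```cpp
#include <cassert>

struct Throughput { double cycles_per_instr; double uops_per_cycle; };

// Convert a raw cycle count into the reported metrics. total_cycles
// covers `iters` passes over an unrolled body of `unroll` instructions,
// each decoding to `uops_per_instr` uOps.
Throughput compute(double total_cycles, long iters, int unroll,
                   double uops_per_instr = 1.0) {
    double instrs = static_cast<double>(iters) * unroll;
    double cpi = total_cycles / instrs;        // cycles per instruction
    return { cpi, uops_per_instr / cpi };      // and uOps per cycle
}
```

For example, 1000 iterations of a 10-instruction add loop finishing in 2500 cycles yields 0.25 cycles/instruction, i.e. 4 uOps/cycle, matching the example output in §3.4.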
5. Implementation Guide
5.1 Development Environment Setup
c++ --version
perf --version
5.2 Project Structure
port-pressure/
├── src/
│ ├── main.cpp
│ ├── loops.S
│ └── timing.cpp
└── README.md
5.3 The Core Question You’re Answering
“Which execution port is my bottleneck for this instruction mix?”
The answer shows how to design faster loops.
5.4 Concepts You Must Understand First
- Port throughput vs latency
- Dependency chains and unrolling
- PMU counters and measurement
5.5 Questions to Guide Your Design
- How will you ensure no dependencies?
- How will you count uOps per iteration?
- Which PMU counters are reliable on your CPU?
5.6 Thinking Exercise
Design a loop that mixes integer adds and loads. Predict whether throughput improves compared to pure adds.
5.7 The Interview Questions They’ll Ask
- What is the difference between latency and throughput?
- How do you detect port contention?
- Why are PMU counters useful but imperfect?
5.8 Hints in Layers
Hint 1: Use multiple registers to avoid dependencies.
Hint 2: Unroll by 8 or 16 to reduce branch overhead.
Hint 3: Measure with and without PMU counters to validate.
5.9 Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Execution units | “Inside the Machine” | Ch. 4 |
| Performance analysis | “Computer Architecture” | Ch. 3 |
5.10 Implementation Phases
Phase 1: Foundation (2-3 days)
- Implement loop templates for add and mul.
- Checkpoint: loops compile and run.
Phase 2: Core Functionality (4-6 days)
- Add timing harness and compute throughput.
- Checkpoint: stable cycles per iter.
Phase 3: Validation (2-3 days)
- Integrate PMU counters and compare.
- Checkpoint: port pressure map generated.
5.11 Key Implementation Decisions
| Decision | Options | Recommendation | Rationale |
|---|---|---|---|
| Loop structure | inline asm vs intrinsics | inline asm | precise control |
| Reporting | CSV vs table | CSV | easy plotting |
6. Testing Strategy
6.1 Test Categories
| Category | Purpose | Examples |
|---|---|---|
| Unit Tests | Loop correctness | no dependencies |
| Integration Tests | End-to-end timing | add/mul loops |
| Edge Tests | Missing PMU event | fallback mode |
6.2 Critical Test Cases
- Pure add loop should hit maximum throughput.
- Add+mul mix should show different port usage.
- Missing PMU events should not crash.
6.3 Test Data
ops: add, mul, load
7. Common Pitfalls & Debugging
7.1 Frequent Mistakes
| Pitfall | Symptom | Solution |
|---|---|---|
| Dependency chain | low throughput | use more registers |
| Front-end bottleneck | cycles too high | reduce instruction bytes |
| Counter mismatch | weird results | verify CPU model |
7.2 Debugging Strategies
- Compare with uops.info expected throughput.
- Use perf to verify instructions retired.
7.3 Performance Traps
- Excessive unrolling can overflow the uOp cache and skew results.
8. Extensions & Challenges
8.1 Beginner Extensions
- Add a visualization script for CSV output.
8.2 Intermediate Extensions
- Include SIMD instruction classes.
8.3 Advanced Extensions
- Auto-detect port bindings via linear programming.
9. Real-World Connections
9.1 Industry Applications
- HPC kernel tuning
- Compiler backend scheduling
9.2 Related Open Source Projects
- uops.info: public port data
- llvm-mca: throughput modeling
9.3 Interview Relevance
- Port pressure and throughput are frequent topics in systems interviews.
10. Resources
10.1 Essential Reading
- “Agner Fog’s Optimization Manuals”
- “Inside the Machine” by Jon Stokes
10.2 Video Resources
- “CPU Backend and Ports” lecture
10.3 Tools & Documentation
- perf: PMU counter collection
- pmu-tools: high-level counter analysis
10.4 Related Projects in This Series
11. Self-Assessment Checklist
11.1 Understanding
- I can explain how ports limit throughput.
- I can design dependency-free loops.
- I can interpret PMU counters.
11.2 Implementation
- Port pressure map is produced.
- Results are stable across trials.
- PMU counters align with timing.
11.3 Growth
- I can explain a port bottleneck in an interview.
12. Submission / Completion Criteria
Minimum Viable Completion:
- Measure throughput for at least 3 instruction classes.
- Produce a CSV report.
Full Completion:
- Include PMU counter validation.
Excellence (Going Above & Beyond):
- Build an automated port inference model.