Project 5: Execution Port Pressure Map
Build a microbenchmark suite that maps instruction throughput to backend execution ports.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Level 3: Advanced |
| Time Estimate | 1-2 weeks |
| Main Programming Language | C++ (with inline assembly) |
| Alternative Programming Languages | Rust, Assembly |
| Coolness Level | Level 4: Hardcore Tech Flex |
| Business Potential | 2. The “Micro-SaaS / Pro Tool” |
| Prerequisites | Basic assembly, CPU pipeline basics, perf tooling |
| Key Topics | execution ports, throughput vs latency, port contention, PMU counters |
1. Learning Objectives
By completing this project, you will:
- Explain how execution ports and functional units limit throughput.
- Design instruction mixes that stress specific ports.
- Use PMU counters to validate port pressure hypotheses.
- Build a port pressure heat map for your CPU.
- Interpret results to guide performance tuning.
2. All Theory Needed (Per-Concept Breakdown)
2.1 Execution Ports, Functional Units, and Throughput
Fundamentals
Execution ports are groups of functional units that execute micro-operations. Each cycle, the scheduler issues ready uOps to available ports. The number and type of ports determine the maximum throughput for different instruction classes. For example, a CPU might have two load ports, one store port, and multiple ALU ports. If your loop uses instructions bound to a single port, it will bottleneck even if everything else is idle. Throughput describes how many uOps can be completed per cycle, and it is distinct from latency. Understanding the port map explains why some instruction sequences are faster than others.
Additional fundamentals for Execution Ports, Functional Units, and Throughput: pin down the simplest mental model and the unit of measurement (uOps per cycle) before anything else. Identify what changes state (uOps executing on ports), what observes that state (timers and PMU counters), and which constraints are non-negotiable (each port issues at most a fixed number of uOps per cycle). This keeps the concept grounded before moving to deeper microarchitectural details.
Deep Dive into the concept
A modern out-of-order core has a scheduler that holds uOps and dispatches them to ports. Each port connects to one or more execution units (ALU, FPU, load/store, branch). An instruction type maps to one or more ports. For example, integer add might be able to execute on ports 0 and 1, while integer multiply might only execute on port 1. If an instruction can use multiple ports, it is flexible; if it can only use one, it is a port bottleneck. When you run a tight loop of a single instruction, the steady-state throughput is limited by the number of ports that instruction can use and the port’s issue capacity.
Port pressure refers to how heavily a loop stresses each port. A balanced loop spreads uOps across ports and reaches higher throughput. An unbalanced loop saturates one port and leaves others idle. This is why optimizing throughput often means mixing instructions or reordering operations to use different ports. The classic example is combining integer adds and multiplies to utilize separate units.
Port usage is not always obvious from the ISA because complex instructions can decode into multiple uOps. Some instructions also have micro-fused uOps that can execute together. Therefore, empirical measurement is valuable. By generating a loop with one instruction type and measuring cycles per iteration, you can infer the throughput. If you also measure the number of uOps retired and cycles, you can estimate uOps per cycle. By combining instruction mixes (e.g., two types in the same loop), you can test if they contend for the same port or use different ports. If the combined throughput equals the maximum of the individual throughputs, they likely contend. If it approaches the sum, they use different ports.
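The contention test described above can be turned into a simple decision rule. Here is a minimal sketch; the 15% tolerance is an illustrative assumption, not a fixed constant, and real measurements will need calibration against your noise floor:

```cpp
#include <algorithm>
#include <cassert>
#include <string>

// Classify whether two instruction classes share ports, given their
// measured steady-state throughputs (uOps/cycle) in isolation and mixed.
// Heuristic: mixed throughput near max(a, b) suggests the classes contend
// for the same port(s); near a + b suggests disjoint ports; anything in
// between suggests partial overlap of their allowed port sets.
std::string classify_contention(double iso_a, double iso_b, double mixed,
                                double tol = 0.15) {  // tolerance: assumption
    double same = std::max(iso_a, iso_b);
    double disjoint = iso_a + iso_b;
    if (mixed <= same * (1.0 + tol)) return "contend";
    if (mixed >= disjoint * (1.0 - tol)) return "disjoint";
    return "partial-overlap";
}
```

For example, if adds alone reach 4 uOps/cycle and multiplies alone reach 1, a mixed loop that still tops out near 4 indicates the multiply port is one of the add ports; a mixed loop approaching 5 indicates a dedicated multiply port.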
The scheduler and reorder buffer allow multiple in-flight uOps, but for a tight, dependency-free loop, the bottleneck is the port throughput. You should structure your loop to avoid dependency chains so that the scheduler can issue uOps freely. This usually means using multiple independent registers and unrolling the loop. If you inadvertently create dependencies, latency will dominate and your measurements will not reflect port pressure. Therefore, constructing independent instruction sequences is part of the experimental design.
Execution port models are often summarized in tables (e.g., from uops.info). But CPUs can differ by generation. Your project builds a local map for your CPU, which is valuable because it captures the actual hardware and microcode. This is also the foundation for performance analysis tools like Intel’s IACA or LLVM-MCA.
Additional deep dive considerations for Execution Ports, Functional Units, and Throughput: in real designs, port behavior is never isolated; it interacts with pipeline depth, power management, compiler decisions, and even microcode updates. When you study it, vary one knob at a time and hold everything else constant: pin the core, fix the frequency if possible, warm up caches and predictors, and record the exact compiler flags. Vendor manuals describe typical behavior, but actual thresholds can shift across steppings or microcode revisions, so empirical measurement is the ground truth. If your results disagree with published numbers, investigate confounders such as alignment, instruction form, address mapping, or hidden dependencies introduced by the compiler. Compilers and JITs implicitly target the port map via instruction selection, scheduling, and unrolling, so translate your measurements into actionable rules of thumb. Finally, validate with at least two workloads: a synthetic microbenchmark and a slightly more realistic kernel. If both show the same trend, the effect is not an artifact of the test harness.
How this fits into the project
You will use this concept to design independent instruction sequences in §3.2 and to interpret throughput limits in §3.7.
Definitions & key terms
- execution port -> pipeline endpoint with specific functional units
- throughput -> uOps per cycle sustainable in steady state
- latency -> cycles from issue to result
- port contention -> multiple uOps competing for the same port
Mental model diagram (ASCII)
Scheduler -> Port 0 (ALU) -> Exec
-> Port 1 (ALU) -> Exec
-> Port 2 (Load) -> L1
How it works (step-by-step, with invariants and failure modes)
- Decode loop into uOps and place in scheduler.
- Each cycle, issue ready uOps to free ports.
- If a port is saturated, uOps wait.
- Throughput equals issue rate of the busiest port.
Invariants:
- A uOp can only execute on its allowed port set.
- Issue rate per port is limited per cycle.
Failure modes:
- Dependency chains hide port pressure by introducing latency stalls.
- Front-end bottlenecks mask backend port limits.
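The issue model above can be sketched as a tiny calculator. This is an idealized sketch: it assumes each port issues at most one uOp per cycle and that each class's demand splits evenly across its allowed ports, which real schedulers only approximate:

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <vector>

struct UopClass {
    int count;               // uOps of this class per loop iteration
    std::vector<int> ports;  // ports this class is allowed to use
};

// Idealized steady-state model: throughput is capped by the busiest port.
// Returns loop iterations completed per cycle under even load splitting.
double iterations_per_cycle(const std::vector<UopClass>& body) {
    std::map<int, double> load;  // port -> uOps demanded per iteration
    for (const auto& c : body)
        for (int p : c.ports)
            load[p] += static_cast<double>(c.count) / c.ports.size();
    double busiest = 0.0;
    for (const auto& kv : load) busiest = std::max(busiest, kv.second);
    return busiest > 0.0 ? 1.0 / busiest : 0.0;
}
```

With four adds bound to ports {0, 1}, each port carries two uOps per iteration, so the loop completes one iteration every two cycles, i.e. two adds per cycle, matching the invariant that the busiest port sets the rate.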
Minimal concrete example
; Independent adds to avoid dependencies
add r8, r9
add r10, r11
add r12, r13
Common misconceptions
- “Latency and throughput are the same” -> throughput is steady-state rate.
- “More ports always means faster” -> only if the instruction can use them.
Check-your-understanding questions
- Why must you avoid dependencies when measuring port pressure?
- What does it mean if two instructions run faster together than apart?
- How can you infer port binding from timing?
Check-your-understanding answers
- Dependencies introduce latency stalls that hide port limits.
- They likely use different ports and do not contend.
- Compare throughput of isolated vs mixed instruction loops.
Real-world applications
- Compiler instruction scheduling
- Performance tuning for HPC kernels
Where you’ll apply it
- In this project: see §3.2 Functional Requirements and §5.10 Phase 2.
- Also used in: P09-l1-bandwidth-stressor-zen-5-focus.md.
References
- “Agner Fog’s Optimization Manuals”
- uops.info (port mapping data)
Key insights
- Port pressure is the hidden limiter once dependencies are removed.
Summary
Execution ports define the backend throughput ceiling. Your benchmark reveals which ports dominate for specific instruction mixes.
Homework/Exercises to practice the concept
- Predict whether a loop of only integer multiplies will be port-limited.
- Design a loop that uses both ALU and load ports.
Solutions to the homework/exercises
- Yes, multiplies typically map to fewer ports and will bottleneck.
- Combine add instructions with independent loads from aligned arrays.
2.2 PMU Counters and Port-Binding Measurement
Fundamentals
Performance Monitoring Units (PMUs) provide hardware counters for events like cycles, instructions retired, and port usage. By reading these counters around a benchmark, you can validate hypotheses about port pressure. Counters are not perfect and can be noisy, but they are invaluable for confirmation. Tools like perf and pmu-tools simplify access. The core idea is to use counters as a second measurement channel alongside timing.
Deep Dive into the concept
PMU counters count microarchitectural events. For port pressure, some CPUs provide events such as uops_executed.port_0 or uops_retired.slots. These events can indicate how many uOps executed on each port. However, counter availability and meaning vary by CPU model. Some counters are derived rather than direct, and their accuracy can be limited. Therefore, you should use them for validation rather than as the sole measurement.
The typical workflow is: run a microbenchmark, collect counters, compute derived metrics (uOps per cycle, port distribution), and compare to expected values. For example, if a loop of independent adds yields 4 uOps per cycle and counters show high activity on ports 0 and 1, that matches expectations. If counters show high activity on a different port, your assumption about port binding may be wrong. Similarly, if timing shows slower throughput than expected, counters can indicate whether the backend is saturated or the front-end is the bottleneck.
Counter measurement requires careful setup. You should pin the process to a core, disable frequency scaling, and run for enough iterations to avoid sampling error. You should also avoid multiplexing counters by measuring only a few at a time. perf stat can multiplex if too many events are requested, which reduces accuracy. For this project, choose a small set of counters: cycles, instructions, and a couple of port-specific events if available.
Interpreting counters also requires understanding that some events count uOps, not instructions. If your instruction decodes into multiple uOps, the uOp count will exceed the instruction count. Therefore, you should report both. The ratio of uOps to instructions is itself a useful metric because it indicates how heavy the instruction mix is. Additionally, some events include speculative uOps that were later squashed; this matters if your loop includes branches or mispredictions. For steady-state loops with predictable branches, the speculative overhead should be minimal.
Finally, be aware of measurement overhead. Reading counters can perturb timing slightly. If your benchmark runs for many iterations, this overhead is amortized, but you should still avoid reading counters inside the tight loop. Instead, start counters, run the loop, stop counters. Record the loop size and iteration count so results are comparable across runs.
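The derived-metric step of this workflow is simple arithmetic once the raw counts are in hand. A sketch follows; the struct layout is an assumption for illustration, and the raw values would come from perf_event_open or a `perf stat` run rather than being hard-coded:

```cpp
#include <cassert>

// Raw counts read once, after the counters are stopped (never inside
// the hot loop). Field names here are illustrative, not a real API.
struct Counters {
    unsigned long long cycles;
    unsigned long long instructions;
    unsigned long long uops_retired;
};

struct Derived {
    double ipc;                  // instructions per cycle
    double uops_per_cycle;       // backend pressure indicator
    double uops_per_instruction; // how heavily the mix decodes
};

// Compute the derived metrics discussed above. Reporting both uOps/cycle
// and uOps/instruction makes uOp expansion visible in the results.
Derived derive(const Counters& c) {
    return {
        static_cast<double>(c.instructions) / c.cycles,
        static_cast<double>(c.uops_retired) / c.cycles,
        static_cast<double>(c.uops_retired) / c.instructions,
    };
}
```

A loop of simple adds should show uOps/instruction near 1.0; a ratio well above 1.0 signals instructions that decode into multiple uOps, which must be factored into any port inference.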
How this fits into the project
You will use counters to validate your port pressure map in §3.7 and to debug anomalies in §7.1.
Definitions & key terms
- PMU -> Performance Monitoring Unit
- event -> specific hardware counter (e.g., cycles)
- multiplexing -> sharing counters over time when too many are requested
- uops_retired -> count of micro-operations retired
Mental model diagram (ASCII)
[Benchmark] -> [PMU counters] -> [Derived metrics] -> [Port map]
How it works (step-by-step, with invariants and failure modes)
- Select a small set of counters.
- Start counters, run loop, stop counters.
- Compute uOps/cycle and port distribution.
- Compare with timing-based throughput.
Invariants:
- Counters must be read outside the hot loop.
- Compare like-for-like configurations.
Failure modes:
- Counter multiplexing skews results.
- CPU model mismatch causes invalid counter semantics.
Minimal concrete example
perf stat -e cycles,instructions,uops_retired.slots ./port_pressure
Common misconceptions
- “Counters are always exact” -> they can be sampled or derived.
- “More counters is better” -> multiplexing reduces accuracy.
Check-your-understanding questions
- Why is multiplexing bad for microbenchmarks?
- What does uOps_retired measure compared to instructions?
- Why measure both timing and counters?
Check-your-understanding answers
- It reduces counter precision and adds sampling error.
- It counts decoded micro-operations, which can be more than instructions.
- Timing shows performance; counters explain why.
Real-world applications
- Performance analysis in compiler teams
- CPU verification and microarchitecture tuning
Where you’ll apply it
- In this project: see §3.7 Real World Outcome and §6.2 Critical Test Cases.
- Also used in: P04-the-uop-cache-prober.md.
References
- Intel and AMD PMU documentation
- “pmu-tools” by Andi Kleen
Key insights
- PMU counters are evidence, not truth; use them to validate timing-based conclusions.
Summary
Counters let you see port usage directly. Combined with timing, they produce a reliable port pressure map.
Homework/Exercises to practice the concept
- Measure cycles and instructions for a NOP loop and compute IPC.
- Compare uOps_retired for a mix of adds and multiplies.
Solutions to the homework/exercises
- IPC should be high; cycles per instruction should be low.
- The mix should show more uOps retired than instructions retired if any instruction in the mix decodes into multiple uOps.
2.3 Dependency Chains and Port Isolation Techniques
Fundamentals
Execution ports are shared resources. To map port pressure, you need to isolate which ports an instruction uses and ensure no other bottleneck dominates. The most reliable technique is to construct dependency chains that force an instruction to execute serially, then remove dependencies to expose throughput limits. A dependency chain creates a single stream of uOps that must execute one after another, revealing latency and port choice. Independent chains let you fill all available ports and measure throughput. This difference between latency-limited and throughput-limited regimes is the heart of port mapping. If you do not control dependencies, you will misinterpret front-end stalls or cache effects as port pressure.
Deep Dive into the concept
An instruction can often execute on multiple ports. For example, an ADD might use port 0 or port 1 on a given microarchitecture. If you create a dependency chain (e.g., add rax, 1; add rax, 1; …), each instruction depends on the previous result, so the chain exposes the instruction’s latency and the port chosen for that dependency. If the chain runs at one per cycle, you have evidence that the instruction’s latency is 1 cycle and that the chosen port can issue each cycle. If the chain runs at one every two cycles, you have found a port or pipeline limitation.
To measure throughput, you break dependencies. Use multiple registers with independent adds, or use a vector instruction that has no dependencies between iterations. Now the scheduler can issue multiple uOps per cycle, limited by available ports. If you see, for example, 2 adds per cycle, you have evidence of two suitable ports. The port map emerges by combining multiple tests: a dependent chain gives you latency and a hint of preferred ports; independent chains reveal the maximum issue rate.
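The two regimes reduce to two small formulas. A sketch, under the assumption (stated above) that each suitable port issues one such uOp per cycle, which holds for simple ALU ops but not for partially pipelined units:

```cpp
#include <cassert>

// Dependent chain: each op waits for the previous result, so cycles/op
// approximates the instruction's latency.
double inferred_latency(double chain_cycles, double chain_ops) {
    return chain_cycles / chain_ops;
}

// Independent chains: ops/cycle approximates total issue capacity.
// Rounding to the nearest integer gives an estimated port count,
// assuming one uOp per port per cycle.
int inferred_port_count(double indep_cycles, double indep_ops) {
    return static_cast<int>(indep_ops / indep_cycles + 0.5);
}
```

For instance, 1000 dependent MULs taking 3000 cycles suggests a 3-cycle latency, while 1000 independent adds taking 250 cycles suggests four add-capable ports.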
Port isolation is about reducing confounders. You must ensure the front-end is not the bottleneck by keeping the loop short and in the uOp cache. You must avoid memory operations unless your goal is to map load/store ports, because cache misses will dominate. You must also avoid mixing instructions that compete for the same ports unless you are explicitly testing contention. A clean experiment runs only one instruction class at a time, then uses mixed workloads to confirm port sharing. For example, a loop of only MUL can reveal the throughput of the multiply pipeline; a mixed loop of ADD and MUL can reveal whether they contend for the same port or use different ports.
Another advanced technique is to use “port pressure signatures” from performance counters. On some Intel CPUs, you can read uops_executed.port or similar events to see how many uOps issued on each port. These counters are not always precise, but they provide an independent check on your microbenchmark inferences. Combine this with static tools like llvm-mca or uops.info to build a triangulated port map.
Be aware of scheduling artifacts. The scheduler may distribute uOps across ports to balance load, so your measured throughput may look better than a naive single-port model. However, certain instructions have fixed port usage. By testing known fixed-port instructions, you can calibrate your environment and then infer ports for more flexible ones. The end goal is not a perfect map but a practical understanding: which instruction mixes saturate which ports, and how that affects throughput.
How this fits into the project
You will use this in §3.2 to define the microbenchmark loops, in §5.10 Phase 2 to construct dependent and independent chains, and in §6.2 to define tests for contention.
Definitions & key terms
- dependency chain -> a sequence where each instruction depends on the previous result
- throughput -> maximum number of operations per cycle in steady state
- latency -> cycles from issue to result availability
- port map -> mapping of instructions to execution ports
- contention -> multiple uOps competing for the same port
Mental model diagram (ASCII)
Chain: RAX -> ADD -> RAX -> ADD -> RAX (latency)
Independent: R1/R2/R3 -> ADDs -> ports 0+1 (throughput)
How it works (step-by-step, with invariants and failure modes)
- Build a single-register dependency chain and measure cycles per op.
- Build multiple independent chains and measure ops per cycle.
- Compare results to infer port availability and sharing.
- Validate with mixed-instruction loops.
- Cross-check with perf counters or llvm-mca.
Invariants:
- Dependency chains must enforce true RAW dependencies.
- Independent chains must not share registers.
Failure modes:
- Front-end limits cap throughput and hide port limits.
- Unintended memory ops create cache bottlenecks.
Minimal concrete example
; Latency chain
add rax, 1
add rax, 1
add rax, 1
; Throughput test
add rax, 1
add rbx, 1
add rcx, 1
add rdx, 1
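The same two regimes can be reproduced in portable C++ rather than raw assembly. This sketch assumes GCC or Clang for the empty asm barrier, which pins each accumulator in a register and stops the compiler from collapsing or vectorizing the loops:

```cpp
#include <cassert>
#include <cstdint>

// Optimization barrier: forces x to live in a register across the
// statement without emitting any instructions (GCC/Clang extension).
static inline void keep(uint64_t& x) { asm volatile("" : "+r"(x)); }

// Latency regime: one accumulator, every add depends on the previous
// one, so the loop runs at the add instruction's latency.
uint64_t dependent_chain(int iters) {
    uint64_t a = 0;
    for (int i = 0; i < iters; ++i) { a += 1; keep(a); }
    return a;
}

// Throughput regime: four independent accumulators; the scheduler can
// issue these adds in parallel, limited only by available ALU ports.
uint64_t independent_chains(int iters) {
    uint64_t a = 0, b = 0, c = 0, d = 0;
    for (int i = 0; i < iters; ++i) {
        a += 1; b += 1; c += 1; d += 1;
        keep(a); keep(b); keep(c); keep(d);
    }
    return a + b + c + d;
}
```

Timing both variants over the same total add count (with RDTSCP or a steady clock) should show the independent version approaching the port-limited rate while the dependent one runs at the add latency.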
Common misconceptions
- “One benchmark is enough” -> you need both latency and throughput tests.
- “Ports are fixed for all instructions” -> many instructions are flexible.
- “Throughput equals latency” -> only if there is a single port and no pipelining.
Check-your-understanding questions
- Why do dependency chains reveal latency rather than throughput?
- How can you tell if two instructions contend for the same port?
- What front-end effect can make a throughput test misleading?
Check-your-understanding answers
- Dependencies serialize execution, so throughput collapses to latency.
- Mixed loops show reduced throughput compared to separate loops.
- Decode or uOp cache limits can cap issue rate regardless of port capacity.
Real-world applications
- Hand-optimizing hot loops in compilers or JITs
- Understanding why vector code stalls despite high clock rates
- Performance engineering for low-latency systems
Where you’ll apply it
- In this project: see §5.4 Concepts You Must Understand First and §6.2 Critical Test Cases.
- Also used in: P09-l1-bandwidth-stressor-zen-5-focus.md, P11-the-uarch-aware-jit-engine.md.
References
- “uops.info” instruction tables (public database)
- “Computer Architecture: A Quantitative Approach” by Hennessy and Patterson, Ch. 3
Key insights
- Port mapping is an experimental science built on controlled dependencies.
Summary
Dependency chains isolate latency; independent chains reveal throughput. Together, they let you map execution ports and explain why certain instruction mixes saturate the core.
Homework/Exercises to practice the concept
- Build a latency chain for MUL and measure cycles per op.
- Build an independent chain and compute maximum ops per cycle.
Solutions to the homework/exercises
- The chain exposes MUL latency, often several cycles on modern cores.
- The independent chain should approach the documented throughput (for example, one per cycle).
3. Project Specification
3.1 What You Will Build
A benchmark suite that generates independent instruction loops for different instruction classes (integer add, multiply, loads, stores, SIMD) and measures throughput. The output is a port pressure heat map showing cycles per instruction and inferred port bindings.
3.2 Functional Requirements
- Instruction Loop Generator: Create loops for each instruction class.
- Dependency Avoidance: Use multiple registers and unrolling to remove dependencies.
- Timing Harness: Measure cycles per iteration and compute throughput.
- Counter Integration: Optionally collect PMU counters per loop.
- Report Generator: Produce a map of instruction class -> throughput.
3.3 Non-Functional Requirements
- Performance: Each loop runs under 0.5 seconds.
- Reliability: Stable results across trials.
- Usability: CLI to select instruction classes and output formats.
3.4 Example Usage / Output
$ ./port_pressure --ops add,mul,load
op,cycles_per_iter,uops_per_cycle
add,0.25,4.0
mul,1.00,1.0
load,0.50,2.0
3.5 Data Formats / Schemas / Protocols
CSV output:
op,cycles_per_iter,uops_per_cycle
add,0.25,4.0
mul,1.00,1.0
3.6 Edge Cases
- Instruction sequences accidentally dependent
- Front-end bottleneck when the unrolled loop body is too large for the decoders or uOp cache
- PMU events not available on target CPU
3.7 Real World Outcome
You will produce a port pressure report that helps decide which instruction mixes are optimal for throughput.
3.7.1 How to Run (Copy/Paste)
c++ -O2 -Wall -o port_pressure src/main.cpp src/timing.cpp src/loops.S
sudo taskset -c 2 ./port_pressure --ops add,mul,load --trials 5
3.7.2 Golden Path Demo (Deterministic)
- Use --ops add and expect high throughput (close to 4 uOps/cycle on many cores).
3.7.3 If CLI: Exact Terminal Transcript
$ taskset -c 2 ./port_pressure --ops add,mul
add,0.25,4.0
mul,1.00,1.0
$ echo $?
0
Failure demo (unknown op):
$ ./port_pressure --ops foo
Error: unknown op 'foo'
$ echo $?
3
Exit codes:
- 0: success
- 2: PMU init error
- 3: invalid argument
4. Solution Architecture
4.1 High-Level Design
+----------------+ +------------------+ +-----------------+
| Loop Generator |-> | Timing + Counters|-> | Report Builder |
+----------------+ +------------------+ +-----------------+
4.2 Key Components
| Component | Responsibility | Key Decisions |
|---|---|---|
| Loop Generator | Emit independent ops | Use unrolled templates |
| Timing Harness | Measure cycles | RDTSCP + fences |
| Counter Module | Collect PMU events | Small set of events |
4.3 Data Structures (No Full Code)
struct Result { const char* op; double cycles; double uops_per_cycle; };
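The Result records feed directly into the Report Builder. A sketch of how they might be serialized into the CSV schema from §3.5; the fixed-precision formatting choices are assumptions:

```cpp
#include <cassert>
#include <cstdio>
#include <string>
#include <vector>

struct Result { const char* op; double cycles; double uops_per_cycle; };

// Render results in the CSV shape from §3.5: a header row, then one row
// per instruction class (cycles to two decimals, uOps/cycle to one).
std::string to_csv(const std::vector<Result>& rows) {
    std::string out = "op,cycles_per_iter,uops_per_cycle\n";
    char line[64];
    for (const auto& r : rows) {
        std::snprintf(line, sizeof line, "%s,%.2f,%.1f\n",
                      r.op, r.cycles, r.uops_per_cycle);
        out += line;
    }
    return out;
}
```

Keeping serialization in one pure function makes the Report Builder trivially unit-testable, independent of the timing harness.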
4.4 Algorithm Overview
Key Algorithm: Throughput Measurement
- Warm up loop to steady state.
- Time N iterations of unrolled loop.
- Compute cycles per instruction and uOps per cycle.
Complexity Analysis:
- Time: O(N) per op class
- Space: O(1)
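The final step of the algorithm is a unit conversion from raw timing to the reported metrics. A minimal sketch; the default of one uOp per instruction is an assumption that holds for simple ALU ops and must be overridden for instructions that decode into multiple uOps:

```cpp
#include <cassert>

struct Throughput { double cycles_per_instr; double uops_per_cycle; };

// Convert a raw cycle count into the reported metrics. total_cycles
// covers `iters` passes over an unrolled body of `unroll` instructions,
// each decoding to `uops_per_instr` uOps.
Throughput compute(double total_cycles, long iters, int unroll,
                   double uops_per_instr = 1.0) {
    double instrs = static_cast<double>(iters) * unroll;
    double cpi = total_cycles / instrs;        // cycles per instruction
    return { cpi, uops_per_instr / cpi };      // and uOps per cycle
}
```

For example, 1000 iterations of a 10-instruction add loop finishing in 2500 cycles yields 0.25 cycles/instruction, i.e. 4 uOps/cycle, matching the example output in §3.4.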
5. Implementation Guide
5.1 Development Environment Setup
c++ --version
perf --version
5.2 Project Structure
port-pressure/
├── src/
│ ├── main.cpp
│ ├── loops.S
│ └── timing.cpp
└── README.md
5.3 The Core Question You’re Answering
“Which execution port is my bottleneck for this instruction mix?”
The answer shows how to design faster loops.
5.4 Concepts You Must Understand First
- Port throughput vs latency
- Dependency chains and unrolling
- PMU counters and measurement
5.5 Questions to Guide Your Design
- How will you ensure no dependencies?
- How will you count uOps per iteration?
- Which PMU counters are reliable on your CPU?
5.6 Thinking Exercise
Design a loop that mixes integer adds and loads. Predict whether throughput improves compared to pure adds.
5.7 The Interview Questions They’ll Ask
- What is the difference between latency and throughput?
- How do you detect port contention?
- Why are PMU counters useful but imperfect?
5.8 Hints in Layers
Hint 1: Use multiple registers to avoid dependencies.
Hint 2: Unroll by 8 or 16 to reduce branch overhead.
Hint 3: Measure with and without PMU counters to validate.
5.9 Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Execution units | “Inside the Machine” | Ch. 4 |
| Performance analysis | “Computer Architecture” | Ch. 3 |
5.10 Implementation Phases
Phase 1: Foundation (2-3 days)
- Implement loop templates for add and mul.
- Checkpoint: loops compile and run.
Phase 2: Core Functionality (4-6 days)
- Add timing harness and compute throughput.
- Checkpoint: stable cycles per iter.
Phase 3: Validation (2-3 days)
- Integrate PMU counters and compare.
- Checkpoint: port pressure map generated.
5.11 Key Implementation Decisions
| Decision | Options | Recommendation | Rationale |
|---|---|---|---|
| Loop structure | inline asm vs intrinsics | inline asm | precise control |
| Reporting | CSV vs table | CSV | easy plotting |
6. Testing Strategy
6.1 Test Categories
| Category | Purpose | Examples |
|---|---|---|
| Unit Tests | Loop correctness | no dependencies |
| Integration Tests | End-to-end timing | add/mul loops |
| Edge Tests | Missing PMU event | fallback mode |
6.2 Critical Test Cases
- Pure add loop should hit maximum throughput.
- Add+mul mix should show different port usage.
- Missing PMU events should not crash.
6.3 Test Data
ops: add, mul, load
7. Common Pitfalls & Debugging
7.1 Frequent Mistakes
| Pitfall | Symptom | Solution |
|---|---|---|
| Dependency chain | low throughput | use more registers |
| Front-end bottleneck | cycles too high | reduce instruction bytes |
| Counter mismatch | weird results | verify CPU model |
7.2 Debugging Strategies
- Compare with uops.info expected throughput.
- Use perf to verify instructions retired.
7.3 Performance Traps
- Excessive unrolling can overflow the uOp cache and skew results.
8. Extensions & Challenges
8.1 Beginner Extensions
- Add a visualization script for CSV output.
8.2 Intermediate Extensions
- Include SIMD instruction classes.
8.3 Advanced Extensions
- Auto-detect port bindings via linear programming.
9. Real-World Connections
9.1 Industry Applications
- HPC kernel tuning
- Compiler backend scheduling
9.2 Related Open Source Projects
- uops.info: public port data
- llvm-mca: throughput modeling
9.3 Interview Relevance
- Port pressure and throughput are frequent topics in systems interviews.
10. Resources
10.1 Essential Reading
- “Agner Fog’s Optimization Manuals”
- “Inside the Machine” by Jon Stokes
10.2 Video Resources
- “CPU Backend and Ports” lecture
10.3 Tools & Documentation
- perf: PMU counter collection
- pmu-tools: high-level counter analysis
10.4 Related Projects in This Series
11. Self-Assessment Checklist
11.1 Understanding
- I can explain how ports limit throughput.
- I can design dependency-free loops.
- I can interpret PMU counters.
11.2 Implementation
- Port pressure map is produced.
- Results are stable across trials.
- PMU counters align with timing.
11.3 Growth
- I can explain a port bottleneck in an interview.
12. Submission / Completion Criteria
Minimum Viable Completion:
- Measure throughput for at least 3 instruction classes.
- Produce a CSV report.
Full Completion:
- Include PMU counter validation.
Excellence (Going Above & Beyond):
- Build an automated port inference model.