Project 10: Lunar Lake P vs E Core Profiler

Build a profiler that compares performance characteristics of P-cores and E-cores on a heterogeneous CPU.

Quick Reference

Attribute                         | Value
----------------------------------|------
Difficulty                        | Level 3: Advanced
Time Estimate                     | 1-2 weeks
Main Programming Language         | C++
Alternative Programming Languages | C, Rust
Coolness Level                    | Level 4: Hardcore Tech Flex
Business Potential                | 2. The “Micro-SaaS / Pro Tool”
Prerequisites                     | C/C++, OS basics, perf tools, CPU affinity
Key Topics                        | heterogeneous cores, scheduling, frequency normalization, perf counters

1. Learning Objectives

By completing this project, you will:

  1. Explain why heterogeneous cores exist and how they differ.
  2. Pin benchmarks to specific core types and measure performance.
  3. Normalize results for frequency and power states.
  4. Compare IPC, branch, and cache metrics across cores.
  5. Produce a structured report of P vs E core behavior.

2. All Theory Needed (Per-Concept Breakdown)

2.1 Heterogeneous Core Design and Scheduling

Fundamentals

Heterogeneous CPUs combine performance cores (P-cores) and efficiency cores (E-cores). P-cores are optimized for single-thread performance and high frequency; E-cores are optimized for efficiency and throughput per watt. The OS scheduler must decide which threads run on which cores. This affects performance dramatically. Understanding the design differences helps you interpret benchmark results and choose where to run workloads.

Before the deep dive, fix the simplest mental model and the unit you will measure in: the scheduler changes state (where threads run), your pinned benchmark and its counters observe that state, and the non-negotiable constraints are an identical workload and a single core type per run. Keeping the concept grounded this way makes the microarchitectural details below easier to interpret.

Deep Dive into the concept

P-cores typically have wider front-ends, larger reorder buffers, and more execution resources. They can execute more instructions per cycle and reach higher frequencies. E-cores, by contrast, are smaller and more energy-efficient, often with narrower pipelines and fewer execution units. They may not support the same ISA extensions (e.g., AVX-512) or may handle them differently. As a result, the same code can have different throughput, latency, and power characteristics on P vs E cores.

The OS scheduler tries to place threads based on priority, load, and energy policies. Modern operating systems may expose APIs to request certain core types, but the scheduler still has the final say. To measure differences accurately, you must pin your process to a specific core using sched_setaffinity and confirm the core type via lscpu or sysfs. Without pinning, your benchmark can migrate, mixing results.

Heterogeneous systems also have differences in cache hierarchy. P and E cores might share some cache levels but not others. This can influence memory latency and bandwidth. Additionally, frequency scaling can differ between core types; P-cores often have higher turbo frequencies, while E-cores may run lower but more efficiently. Therefore, comparisons must normalize for frequency or use metrics like cycles per operation instead of raw time.

In your profiler, you will run the same benchmark loop on a P-core and an E-core, collecting metrics like cycles, instructions retired, branch misses, and cache misses. The relative differences give insight into microarchitectural strengths. For example, P-cores may show higher IPC but also higher power usage. E-cores may have lower IPC but better efficiency for background tasks.

Additional deep dive considerations: in real designs, core heterogeneity is rarely an isolated effect; it interacts with pipeline depth, power management, compiler decisions, and even microcode updates. When you study this behavior, vary one knob at a time and hold everything else constant: pin the core, fix the frequency if possible, warm up caches and predictors, and record the exact compiler flags. Vendor manuals describe typical behavior, but actual thresholds can shift across steppings or microcode revisions, so empirical measurement is the ground truth. If your results disagree with published numbers, investigate confounders such as alignment, instruction form, address mapping, or hidden dependencies introduced by the compiler. Compilers and JITs also implicitly target particular core designs through instruction selection, scheduling, and unrolling, so translate your measurements into actionable rules of thumb. Finally, validate with at least two workloads: a synthetic microbenchmark and a slightly more realistic kernel. If both show the same trend, you can trust that the effect is not an artifact of the test harness.

Supplemental note: a practical way to validate your mental model is to construct a tiny A/B experiment that changes only one variable related to core placement and keeps all others fixed. Run it several times, record the median, and look for monotonic trends rather than a single magic number. If the trend is unstable, check for hidden dependencies, compiler reordering, or OS activity. Also consider how this concept influences API and library design: low-level details like core heterogeneity often shape high-level performance guidelines such as alignment requirements, preferred loop forms, or safe fallback paths. Documenting these findings in your report turns raw measurements into reusable engineering rules.

How this fits into the project

You will apply this concept to design the core selection and pinning logic in §3.2 and to interpret results in §3.7.

Definitions & key terms

  • P-core -> performance-focused core
  • E-core -> efficiency-focused core
  • affinity -> binding a thread to a specific CPU core
  • heterogeneity -> mixing cores with different capabilities

Mental model diagram (ASCII)

[Scheduler] -> P-core (wide, fast)
           -> E-core (narrow, efficient)

How it works (step-by-step, with invariants and failure modes)

  1. Identify P and E core IDs.
  2. Pin benchmark to a core type.
  3. Run identical workload and collect metrics.
  4. Compare normalized results.

Invariants:

  • Benchmarks must run on a single core type.
  • Workload must be identical across runs.

Failure modes:

  • Thread migration mixes results.
  • OS power policy changes frequency mid-run.

Minimal concrete example

#define _GNU_SOURCE
#include <sched.h>

cpu_set_t set;
CPU_ZERO(&set);
CPU_SET(core_id, &set);
if (sched_setaffinity(0, sizeof(set), &set) != 0)
    perror("sched_setaffinity");  // pinning failed; results would be invalid

Common misconceptions

  • “P-cores are always better” -> E-cores can be better for throughput per watt.
  • “Scheduler placement doesn’t matter” -> core type can change results drastically.

Check-your-understanding questions

  1. Why do CPUs use heterogeneous cores?
  2. Why is pinning necessary for profiling?
  3. How might cache hierarchy differ between core types?

Check-your-understanding answers

  1. To balance performance and energy efficiency.
  2. To avoid thread migration and mixed results.
  3. P and E cores may have different private caches or shared levels.

Real-world applications

  • Laptop power management
  • Server consolidation of background tasks

References

  • “Operating Systems: Three Easy Pieces” scheduling chapters
  • Intel hybrid architecture whitepapers

Key insights

  • Core heterogeneity shifts performance trade-offs; measurement is the only way to know.

Summary

Heterogeneous cores provide flexibility. Your profiler reveals how they differ on real workloads.

Homework/Exercises to practice the concept

  1. Identify P and E core IDs on your machine.
  2. Pin a simple loop to each and compare cycles per iteration.

Solutions to the homework/exercises

  1. Use lscpu and sysfs to map core IDs.
  2. The P-core should show higher IPC and lower cycles per iteration.

2.2 Frequency Normalization and Performance Counters

Fundamentals

Raw timing comparisons between cores can be misleading because cores may run at different frequencies. Frequency normalization converts time into cycles or uses fixed-frequency counters to compare fairly. Performance counters such as cycles, instructions retired, and branch misses provide normalized metrics like IPC. By comparing IPC, you can see microarchitectural efficiency independent of frequency.


Deep Dive into the concept

Frequency scaling is dynamic. A P-core might turbo to a high frequency for a short time, while an E-core might remain at a lower steady frequency. If you measure wall-clock time, the P-core will appear faster partly due to frequency. To isolate microarchitectural efficiency, you measure cycles and instructions, then compute IPC (instructions per cycle). IPC tells you how effectively the core executes instructions, independent of frequency. However, IPC alone does not capture all differences because instruction mix and vector width can vary.

You can normalize results by using a fixed-frequency performance counter if available, or by reading the invariant TSC and computing cycles per operation. Another method is to lock CPU frequency via OS controls, but this is not always possible. When comparing P vs E cores, it is often best to report both cycles per iteration and wall-clock time, along with the observed frequency. This gives a complete picture: raw speed and efficiency.

Performance counters can also reveal why IPC differs. For example, if the E-core shows lower IPC, it might be due to fewer execution ports or smaller caches leading to more misses. Counters like branch-misses, cache-misses, and uops_retired provide additional signals. Using perf stat, you can capture these counters with minimal overhead. You should keep the number of counters small to avoid multiplexing, and you should repeat measurements to reduce noise.

Normalization also matters for power and energy analysis. E-cores may complete tasks slower but use less energy, which can be better for background workloads. While this project focuses on performance, you can optionally record power metrics if available (e.g., RAPL). This can extend the analysis into energy efficiency.


How this fits into the project

You will apply this concept when computing IPC in §3.7 and when presenting normalized comparisons in §5.10 Phase 3.

Definitions & key terms

  • IPC -> instructions per cycle
  • invariant TSC -> timestamp counter that increments at constant rate
  • frequency normalization -> adjusting results to compare across frequencies
  • multiplexing -> counter sharing that reduces accuracy

Mental model diagram (ASCII)

cycles + instructions -> IPC
wall time + frequency -> normalized time

How it works (step-by-step, with invariants and failure modes)

  1. Collect cycles and instructions for each core type.
  2. Compute IPC and cycles per iteration.
  3. Record wall-clock time and frequency.
  4. Compare results with normalization.

Invariants:

  • Use the same workload and iteration count.
  • Avoid counter multiplexing.

Failure modes:

  • Frequency changes mid-run distort results.
  • Counter events unavailable on some cores.

Minimal concrete example

taskset -c 2 perf stat -e cycles,instructions,branch-misses ./core_bench

Common misconceptions

  • “Wall time is enough” -> frequency differences skew results.
  • “IPC is the only metric” -> cache and branch behavior matter too.

Check-your-understanding questions

  1. Why is IPC a better comparison metric than raw time?
  2. What does counter multiplexing do?
  3. Why might IPC be similar but wall time differ?

Check-your-understanding answers

  1. IPC normalizes for frequency.
  2. It reduces counter accuracy by time-sharing.
  3. Different frequencies can produce different wall times despite similar IPC.

Real-world applications

  • Scheduler policies for hybrid CPUs
  • Performance tuning across core types

References

  • Linux perf documentation
  • “Operating Systems: Three Easy Pieces”

Key insights

  • Normalization turns noisy timing into comparable microarchitectural metrics.

Summary

To compare P and E cores fairly, normalize by cycles and IPC. Counters provide the evidence.

Homework/Exercises to practice the concept

  1. Measure cycles and instructions for a simple loop on both core types.
  2. Compute IPC and compare with wall-clock time.

Solutions to the homework/exercises

  1. The P-core usually shows higher IPC; the E-core typically uses less energy for the same work (better performance per watt).
  2. Wall time differences may be larger than IPC differences due to frequency.

2.3 Topology Discovery, Affinity, and Normalization

Fundamentals

A hybrid-core profiler is only meaningful if you can reliably identify core types and run workloads on the intended cores. That requires topology discovery (mapping logical CPUs to physical cores and clusters) and thread affinity (pinning). Without pinning, the OS may migrate your thread across core types, blending measurements. Normalization is the other half: P-cores and E-cores often run at different frequencies and have different turbo behavior. You need to report both raw IPC and normalized metrics such as cycles per operation or IPC per GHz. Otherwise, you may attribute frequency differences to microarchitecture differences.

Deep Dive into the concept

Topology discovery is platform-specific. On Linux, you can inspect /sys/devices/system/cpu and use lscpu -e to list cores, sockets, and (on recent kernels) core types. On Intel hybrid systems, CPUID leaf 0x1A provides core type information, but you must read it per logical CPU. On Windows, the GetLogicalProcessorInformationEx API exposes efficiency classes. On macOS, sysctl and performance counters can hint at core type, but explicit P/E labeling is limited. A robust profiler should support a manual mapping file so users can label cores if the OS does not expose it.

Affinity ensures measurement stability. On Linux, sched_setaffinity or pthread_setaffinity_np pins a thread to a specific CPU. You should also consider SMT: a logical CPU may share execution resources with a sibling thread. For clean measurements, run on a single physical core with its sibling idle. This can be enforced by pinning your benchmark thread and optionally placing a low-priority “spinner” on the sibling to keep it from being scheduled by the OS.

Normalization requires measuring actual frequency during the workload. Turbo and thermal throttling can change frequency mid-run. You can read the APERF/MPERF counters on x86 to estimate effective frequency. On Linux, perf can report cycles and instructions, but cycles alone do not reveal frequency changes if the clock changes. Therefore, report cycles per iteration and, if possible, average frequency to compute IPC per GHz. This gives you a fair comparison across core types.

Workload selection also interacts with normalization. A memory-bound workload will show similar IPC across core types because memory stalls dominate; a compute-bound workload will highlight width and port differences. To make the comparison meaningful, include at least one workload in each category and report them separately. Otherwise, you might draw the wrong conclusion about which core is “better.”

Finally, energy measurement is increasingly important. If your platform exposes RAPL or equivalent energy counters, you can report energy per instruction. This can show that E-cores are more efficient even when slower. The point of hybrid cores is to trade performance for efficiency; your profiler should capture both dimensions.

How this fits into the project

You will use this in §3.2 Functional Requirements to define core selection, in §5.10 Phase 1 for affinity setup, and in §6.2 to validate stable measurements.

Definitions & key terms

  • topology discovery -> identifying the mapping of logical CPUs to physical cores and types
  • affinity -> pinning a thread to a specific CPU
  • normalization -> adjusting metrics for frequency and other confounders
  • efficiency class -> OS label for core performance tier
  • APERF/MPERF -> counters used to estimate effective frequency

Mental model diagram (ASCII)

Logical CPU -> Physical Core -> Core Type (P/E) -> Frequency
           \-> Affinity pinning -> Stable measurement

How it works (step-by-step, with invariants and failure modes)

  1. Enumerate logical CPUs and detect core types if possible.
  2. Pin the benchmark thread to a chosen CPU.
  3. Measure cycles, instructions, and frequency.
  4. Compute IPC, cycles/op, and IPC/GHz.
  5. Repeat for each core type and compare.

Invariants:

  • The workload must run on the intended core for the whole run.
  • Frequency must be measured or controlled.

Failure modes:

  • Thread migration mixes P and E results.
  • Turbo shifts frequency mid-run and skews IPC.

Minimal concrete example

// Linux: pin the calling thread (pid 0) to CPU 4
cpu_set_t set; CPU_ZERO(&set); CPU_SET(4, &set);
if (sched_setaffinity(0, sizeof(set), &set) != 0)
    perror("sched_setaffinity");

Common misconceptions

  • “IPC is enough” -> without frequency normalization, IPC comparisons can mislead.
  • “OS labels are always correct” -> core type reporting can be missing or wrong.
  • “One workload is representative” -> compute and memory workloads behave differently.

Check-your-understanding questions

  1. Why does thread migration invalidate hybrid-core measurements?
  2. How do APERF/MPERF help normalize IPC?
  3. Why include both compute-bound and memory-bound workloads?

Check-your-understanding answers

  1. Because you no longer know which core type produced the metrics.
  2. They let you estimate effective frequency during the run.
  3. They reveal different bottlenecks and prevent overgeneralization.

Real-world applications

  • OS scheduler tuning for hybrid-core systems
  • Performance/efficiency profiling for laptop workloads
  • Benchmarking mobile or edge CPUs with heterogeneous cores

References

  • Intel Software Developer’s Manual, CPUID leaf 0x1A (hybrid core info)
  • “Operating Systems: Three Easy Pieces” by Arpaci-Dusseau, scheduling chapters

Key insights

  • Without affinity and normalization, hybrid-core comparisons are not trustworthy.

Summary

Hybrid-core profiling is as much about measurement hygiene as it is about microarchitecture. Detect core types, pin threads, and normalize for frequency to produce meaningful results.

Homework/Exercises to practice the concept

  1. Pin a workload to two different logical CPUs and confirm no migration occurs.
  2. Compute IPC and IPC/GHz for the same workload and compare conclusions.

Solutions to the homework/exercises

  1. The run should report consistent CPU affinity and stable timing.
  2. IPC/GHz often reduces the apparent gap between P and E cores.

3. Project Specification

3.1 What You Will Build

A profiling tool that runs a set of microbenchmarks on P-cores and E-cores, collects timing and PMU counters, and produces a normalized comparison report. The report includes IPC, branch misses, cache misses, and frequency information.

3.2 Functional Requirements

  1. Core Detection: Identify P and E core IDs.
  2. Affinity Pinning: Pin each benchmark run to a chosen core.
  3. Benchmark Suite: Run a standard set of loops (ALU, branch, memory).
  4. Counter Collection: Collect cycles, instructions, branch misses.
  5. Report Generation: Output a comparison table.

3.3 Non-Functional Requirements

  • Performance: Complete full suite in under 5 seconds.
  • Reliability: Results stable across 3 runs.
  • Usability: CLI flags for core selection and output format.

3.4 Example Usage / Output

$ ./core_profiler --p-core 2 --e-core 8
metric,P-core,E-core
IPC,3.2,2.1
branch_misses,0.1%,0.3%
L1_miss_rate,2%,4%

3.5 Data Formats / Schemas / Protocols

CSV output:

metric,P-core,E-core
IPC,3.2,2.1

3.6 Edge Cases

  • Core ID does not exist
  • E-cores disabled in BIOS
  • Counters unavailable in user mode

3.7 Real World Outcome

You will produce a P vs E core comparison report that normalizes for frequency and highlights microarchitectural differences.

3.7.1 How to Run (Copy/Paste)

c++ -O2 -Wall -o core_profiler src/*.cpp
sudo ./core_profiler --p-core 2 --e-core 8 --trials 3

3.7.2 Golden Path Demo (Deterministic)

  • Use fixed core IDs and a fixed iteration count.
  • Expect consistent IPC differences across runs.

3.7.3 If CLI: Exact Terminal Transcript

$ sudo ./core_profiler --p-core 2 --e-core 8
metric,P-core,E-core
IPC,3.2,2.1
branch_misses,0.1%,0.3%

$ echo $?
0

Failure demo (bad core id):

$ ./core_profiler --p-core 99 --e-core 8
Error: core id 99 not available

$ echo $?
3

Exit codes:

  • 0 success
  • 2 perf init error
  • 3 invalid argument

4. Solution Architecture

4.1 High-Level Design

+---------------+   +------------------+   +-----------------+
| Core Selector |-> | Benchmark Runner |-> | Report Builder  |
+---------------+   +------------------+   +-----------------+

4.2 Key Components

Component     | Responsibility     | Key Decisions
--------------|--------------------|--------------
Core Selector | Identify P/E cores | use sysfs or lscpu
Runner        | Execute benchmarks | pin each run
Reporter      | Normalize metrics  | IPC-based comparison

4.3 Data Structures (No Full Code)

struct Metric { const char* name; double p; double e; };

4.4 Algorithm Overview

Key Algorithm: Core Comparison

  1. Detect core IDs.
  2. Run benchmark suite on P-core and collect metrics.
  3. Run benchmark suite on E-core and collect metrics.
  4. Normalize and report differences.

Complexity Analysis:

  • Time: O(B * T) where B is benchmarks, T trials
  • Space: O(B)

5. Implementation Guide

5.1 Development Environment Setup

c++ --version
perf --version

5.2 Project Structure

core-profiler/
├── src/
│   ├── core_profiler.cpp
│   ├── benchmarks.cpp
│   └── perf_wrap.cpp
└── README.md

5.3 The Core Question You’re Answering

“How do P-cores and E-cores differ on real microbenchmarks?”

5.4 Concepts You Must Understand First

  1. Heterogeneous core design
  2. Affinity pinning
  3. IPC and normalization

5.5 Questions to Guide Your Design

  1. How will you detect core types reliably?
  2. How will you normalize results across frequencies?
  3. Which metrics best highlight differences?

5.6 Thinking Exercise

If your E-core shows lower IPC but similar wall time on a memory-bound test, what does that imply?

5.7 The Interview Questions They’ll Ask

  1. Why do CPUs use P and E cores?
  2. How do you pin threads to specific cores?
  3. Why is IPC a useful metric?

5.8 Hints in Layers

Hint 1: Use /sys/devices/system/cpu to map core IDs.

Hint 2: Use perf to collect cycles and instructions.

Hint 3: Report both IPC and wall time.

5.9 Books That Will Help

Topic                | Book                                    | Chapter
---------------------|-----------------------------------------|--------
Scheduling           | “Operating Systems: Three Easy Pieces”  | scheduling
Performance counters | “Computer Architecture”                 | Ch. 3

5.10 Implementation Phases

Phase 1: Foundation (2-3 days)

  • Detect core types and implement pinning.
  • Checkpoint: benchmark runs pinned to selected core.

Phase 2: Core Functionality (4-6 days)

  • Implement benchmark suite and counter collection.
  • Checkpoint: IPC metrics generated.

Phase 3: Analysis (2-3 days)

  • Normalize and report results.
  • Checkpoint: comparison table produced.

5.11 Key Implementation Decisions

Decision       | Options          | Recommendation | Rationale
---------------|------------------|----------------|----------
Core detection | lscpu vs sysfs   | sysfs          | machine-readable
Metrics        | time only vs IPC | IPC + time     | complete picture

6. Testing Strategy

6.1 Test Categories

Category          | Purpose         | Examples
------------------|-----------------|---------
Unit Tests        | Core detection  | parse sysfs
Integration Tests | End-to-end run  | P-core vs E-core
Edge Tests        | Invalid core id | error path

6.2 Critical Test Cases

  1. Pinning should keep the thread on the selected core.
  2. IPC should differ between core types.
  3. Results should be stable across trials.

6.3 Test Data

metrics: IPC, branch_misses

7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

Pitfall              | Symptom              | Solution
---------------------|----------------------|---------
Thread migration     | inconsistent results | strict affinity
Frequency scaling    | unstable IPC         | set governor
Counter multiplexing | noisy stats          | limit counters

7.2 Debugging Strategies

  • Log the CPU core ID at runtime.
  • Use perf stat to validate counters.

7.3 Performance Traps

  • Running in background with other tasks can skew results.

8. Extensions & Challenges

8.1 Beginner Extensions

  • Add a JSON output format.

8.2 Intermediate Extensions

  • Include memory bandwidth tests.

8.3 Advanced Extensions

  • Add power/energy measurements if supported.

9. Real-World Connections

9.1 Industry Applications

  • OS scheduling policy tuning
  • Laptop performance profiling

9.2 Industry Tools

  • schedtool: CPU affinity utilities
  • perf: counter collection

9.3 Interview Relevance

  • Heterogeneous scheduling is a modern systems topic.

10. Resources

10.1 Essential Reading

  • “Operating Systems: Three Easy Pieces”
  • Intel hybrid architecture docs

10.2 Video Resources

  • “Hybrid CPU Scheduling” lecture

10.3 Tools & Documentation

  • lscpu: core topology
  • perf: counters

11. Self-Assessment Checklist

11.1 Understanding

  • I can explain P vs E core trade-offs.
  • I can normalize results by frequency.
  • I can interpret IPC differences.

11.2 Implementation

  • Results are stable across trials.
  • Report includes normalized metrics.
  • Core mapping is documented.

11.3 Growth

  • I can discuss heterogeneity in interviews.

12. Submission / Completion Criteria

Minimum Viable Completion:

  • Run at least 2 benchmarks on P and E cores with normalized IPC.

Full Completion:

  • Include branch and cache metrics in report.

Excellence (Going Above & Beyond):

  • Add energy metrics and discuss efficiency trade-offs.