Final Overall Project: The uArch-Aware JIT Engine
Build a tiny JIT compiler that generates different machine code paths based on detected microarchitectural features and benchmarks itself to choose the fastest.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Level 5: Expert |
| Time Estimate | 2-4 weeks |
| Main Programming Language | C |
| Alternative Programming Languages | C++, Rust |
| Coolness Level | Level 5: Pure Magic |
| Business Potential | Level 2: Micro-SaaS / Pro Tool |
| Prerequisites | C, assembly basics, memory protection, CPU feature flags |
| Key Topics | JIT codegen, executable memory, feature detection, auto-tuning |
1. Learning Objectives
By completing this project, you will:
- Implement a minimal JIT pipeline from AST to machine code.
- Safely allocate and execute RWX or RW->RX memory.
- Detect CPU features (SIMD width, core type) at runtime.
- Benchmark multiple code paths and choose the fastest.
- Build an adaptive runtime that reacts to microarchitecture differences.
2. All Theory Needed (Per-Concept Breakdown)
2.1 JIT Compilation Pipeline and Executable Memory
Fundamentals
A JIT (Just-In-Time) compiler generates machine code at runtime instead of ahead of time. The pipeline usually includes parsing, building an abstract syntax tree (AST), lowering to an intermediate representation (IR), and then emitting machine code into memory that is later executed. Because modern OSes enforce W^X (write xor execute), you must manage memory protections carefully, usually allocating RW memory, writing code, then switching to RX before execution. Understanding the JIT pipeline is essential for building a correct and safe runtime.
Before going deeper, keep the simplest mental model in mind: the emitter changes state (bytes written into a buffer), the CPU observes that state (by fetching and executing those bytes), and the non-negotiable constraints are memory permissions and the calling convention. This keeps the concept grounded before moving to deeper microarchitectural details.
Deep Dive into the concept
A minimal JIT starts with a small language, often arithmetic expressions. Parsing turns input text into an AST. The AST is then lowered into a linear IR or directly to machine code. For a tiny math language, you can generate code that loads constants, performs arithmetic, and returns a value. This is enough to demonstrate the full pipeline.
Executable memory is the major systems challenge. On Linux, you can use mmap to allocate memory pages with read/write permissions, emit code into the buffer, then call mprotect to mark it executable. Some systems allow RWX pages, but this is unsafe and may be blocked by policy. Therefore, the safest pattern is RW -> RX. You must also ensure instruction cache coherence. On x86, instruction cache is coherent with data cache, so after writing code, you can execute it directly. On some architectures, you may need an explicit cache flush.
Machine code generation must respect calling conventions. If your JIT function is called from C, you must place the result in the correct return register and preserve callee-saved registers. You must also align the stack if you call out to helper functions. For a simple expression evaluator, you can avoid calls and keep everything in registers, simplifying the design.
Correctness requires deterministic code emission. You should define a fixed mapping from AST nodes to instruction sequences and include a verification mode that disassembles or prints emitted bytes. This helps debug errors, as JIT bugs can crash the process. You should also include a safe interpreter for comparison; if the JIT result differs, you know the codegen is wrong.
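The interpreter-as-oracle idea can be sketched as a differential check. In this minimal sketch, `eval_interp` and `eval_jit` are hypothetical stand-ins: in the real project the first walks the AST and the second calls the generated machine code.

```c
/* Hypothetical stand-ins: in the real project, eval_interp walks the AST
   and eval_jit calls the JIT-compiled function. */
static int eval_interp(int x) { return (x + 3) * 4; }
static int eval_jit(int x)    { return (x + 3) * 4; }

/* Differential check: the interpreter is the correctness oracle.
   Returns 1 if all inputs in [lo, hi] agree, 0 on the first mismatch. */
static int differential_check(int lo, int hi) {
    for (int x = lo; x <= hi; x++)
        if (eval_interp(x) != eval_jit(x))
            return 0;
    return 1;
}
```

If the two ever diverge, the codegen, not the interpreter, is the first suspect.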
Security is a key concern. A JIT that writes executable memory can be abused if input is not controlled. In this project, you are running local expressions, so the risk is lower, but you should still follow safe practices: validate input, avoid RWX pages, and keep the JIT code buffer separate from data.
Additional deep dive considerations: in real designs, the JIT pipeline and its executable memory are rarely isolated; they interact with OS memory policy, instruction-cache behavior, compiler decisions, and even microcode updates. When you study this behavior, vary one knob at a time and hold everything else constant: pin the core, fix the frequency if possible, warm up caches and predictors, and record the exact compiler flags. Vendor manuals describe typical behavior, but actual thresholds can shift across steppings or microcode revisions, so empirical measurement is the ground truth. If your results disagree with published numbers, investigate confounders such as alignment, instruction form, address mapping, or hidden dependencies introduced by the compiler. Finally, validate with at least two workloads, a synthetic microbenchmark and a slightly more realistic kernel; if both show the same trend, you can trust that the effect is not an artifact of the test harness.
How this fits into the project
You will use this concept to build the codegen pipeline in §3.1 and to implement memory allocation in §5.10 Phase 1.
Definitions & key terms
- JIT -> just-in-time compiler
- AST -> abstract syntax tree
- IR -> intermediate representation
- W^X -> write xor execute memory protection policy
Mental model diagram (ASCII)
Source -> AST -> IR -> Machine Code -> Execute
(emit into RW, then RX)
How it works (step-by-step, with invariants and failure modes)
- Parse source into AST.
- Lower AST into IR or direct codegen.
- Allocate RW memory and emit bytes.
- Switch memory to RX and execute.
Invariants:
- Emitted code must follow calling convention.
- Memory must not remain RWX.
Failure modes:
- Incorrect codegen yields crashes or wrong results.
- Missing mprotect results in SIGSEGV.
Minimal concrete example
#include <sys/mman.h>

void* buf = mmap(NULL, 4096, PROT_READ|PROT_WRITE,
                 MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
if (buf == MAP_FAILED) { /* handle allocation failure */ }
// emit machine code into buf...
mprotect(buf, 4096, PROT_READ|PROT_EXEC);  /* RW -> RX (W^X) */
int (*fn)(void) = (int(*)(void))buf;
int r = fn();
Common misconceptions
- “JIT is just code generation” -> it also requires parsing and runtime safety.
- “RWX pages are fine” -> they are a security risk and often blocked.
Check-your-understanding questions
- Why is W^X important for JITs?
- What role does the calling convention play?
- Why keep an interpreter for comparison?
Check-your-understanding answers
- It prevents writable executable memory, reducing exploitability.
- It ensures the caller and callee agree on register and stack usage.
- It provides a correctness baseline for JIT output.
Real-world applications
- JavaScript engines
- Dynamic language runtimes
Where you’ll apply it
- In this project: see §3.2 Functional Requirements and §5.10 Phase 1.
- Also used in: P01-the-human-pipeline-trace.md for pipeline insight.
References
- “Writing a C Compiler” by Nora Sandler
- “Engineering a Compiler” by Cooper and Torczon
Key insights
- A JIT is a full compiler plus a runtime memory system; both must be correct.
Summary
The JIT pipeline converts code to machine instructions at runtime, requiring careful memory protection and calling convention compliance.
Homework/Exercises to practice the concept
- Write a tiny parser that builds an AST for “1+2*3”.
- Emit a hardcoded machine code function that returns 42.
Solutions to the homework/exercises
- The AST should encode multiplication precedence.
- The machine code should move 42 into the return register (eax on x86-64) and then ret.
2.2 CPU Feature Detection and Dispatch
Fundamentals
To generate optimal code, a JIT must detect available CPU features at runtime. This includes SIMD width (AVX2, AVX-512), cache line size, and possibly core type. Feature detection is done via CPUID on x86 or OS APIs. Once features are known, the JIT can select the best code path or emit different variants. This is called dispatch.
Deep Dive into the concept
CPUID is the standard mechanism for querying CPU capabilities on x86. By invoking CPUID with different leaf values, you can detect supported instruction sets (SSE, AVX, AVX2, AVX-512), cache information, and vendor identifiers. For vector instructions, you must also check OS support via XGETBV to ensure the OS has enabled the necessary register state. Otherwise, executing an AVX instruction can crash. This is a common pitfall.
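The CPUID-plus-XGETBV gating described above can be sketched with GCC/Clang's `<cpuid.h>` helpers; the function checks OSXSAVE before executing XGETBV (executing it without OS support is itself illegal) and conservatively returns 0 on non-x86-64 builds:

```c
#if defined(__x86_64__) && defined(__GNUC__)
#include <cpuid.h>
#endif

/* Returns 1 only if the CPU reports AVX2 *and* the OS has enabled
   XMM/YMM state via XSAVE; 0 otherwise (including non-x86-64 builds). */
static int has_avx2_usable(void) {
#if defined(__x86_64__) && defined(__GNUC__)
    unsigned a, b, c, d;
    if (!__get_cpuid(1, &a, &b, &c, &d)) return 0;
    int osxsave = (c >> 27) & 1;   /* OS uses XSAVE, so XGETBV is legal */
    int avx     = (c >> 28) & 1;
    if (!osxsave || !avx) return 0;
    unsigned lo, hi;
    __asm__ volatile("xgetbv" : "=a"(lo), "=d"(hi) : "c"(0U));
    (void)hi;
    if ((lo & 0x6) != 0x6) return 0;          /* XMM and YMM state enabled */
    if (!__get_cpuid_count(7, 0, &a, &b, &c, &d)) return 0;
    return (b >> 5) & 1;                      /* CPUID.(7,0):EBX bit 5 = AVX2 */
#else
    return 0;
#endif
}
```

The result feeds directly into the dispatch logic: if this returns 0, the AVX2 variant must never be emitted or executed.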
Feature detection should be abstracted into a small module that exposes capabilities like has_avx2, has_avx512, and cache_line_size. You can also detect core topology via lscpu or sysfs, but for a portable project, you can keep core type detection optional. On hybrid CPUs, you may need OS-specific APIs to detect whether you are running on a P-core or E-core.
Dispatch can be static (choose one code path at startup) or dynamic (select per call). For a JIT, a typical approach is to generate multiple code variants and choose the fastest based on a short benchmark. For example, generate scalar, AVX2, and AVX-512 versions of an expression kernel, then measure each and select the fastest. If the CPU lacks AVX-512, skip that variant. This is exactly the behavior you will implement.
Dispatch logic must be correct and deterministic. You should store the selected variant in a function pointer and reuse it. For more advanced systems, you could re-run benchmarks if the core changes (e.g., if the OS migrates the thread to an E-core), but for this project, a single selection is fine as long as you pin the thread.
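Caching the selection in a function pointer can be sketched as follows. The two kernels here are ordinary C stand-ins for generated variants, since the point is the dispatch pattern, not the codegen:

```c
typedef long (*kernel_fn)(long);

/* Ordinary C stand-ins for JIT-generated variants. */
static long k_scalar(long n) {
    long s = 0;
    for (long i = 0; i < n; i++) s += i;
    return s;
}
static long k_unrolled(long n) {
    long s0 = 0, s1 = 0, i = 0;
    for (; i + 1 < n; i += 2) { s0 += i; s1 += i + 1; }
    for (; i < n; i++) s0 += i;
    return s0 + s1;
}

static kernel_fn selected = 0;   /* chosen once, reused on every call */

static long run_kernel(long n, int prefer_unrolled) {
    if (!selected)               /* select lazily on first use */
        selected = prefer_unrolled ? k_unrolled : k_scalar;
    return selected(n);
}
```

After the first call, every subsequent call goes straight through the cached pointer, so the dispatch cost is a single indirect call.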
Supplemental note: a practical way to validate your mental model is a tiny A/B experiment that changes only one variable related to feature detection and dispatch while keeping all others fixed. Run it several times, record the median, and look for monotonic trends rather than a single magic number. If the trend is unstable, check for hidden dependencies, compiler reordering, or OS activity. Low-level dispatch details often shape high-level performance guidelines such as alignment requirements, preferred loop forms, or safe fallback paths; documenting these findings in your report turns raw measurements into reusable engineering rules.
How this fits into the project
You will implement feature detection in §5.10 Phase 2 and use it in dispatch logic in §3.7.
Definitions & key terms
- CPUID -> x86 instruction for querying CPU features
- XGETBV -> instruction to check OS support for extended registers
- dispatch -> selecting a code path based on features
- ISA extension -> additional instruction set (AVX2, AVX-512)
Mental model diagram (ASCII)
CPUID -> features -> generate variants -> benchmark -> select best
How it works (step-by-step, with invariants and failure modes)
- Query CPUID and XGETBV.
- Build a feature set structure.
- Generate code variants for supported ISAs.
- Benchmark variants and select the fastest.
Invariants:
- Do not execute unsupported instructions.
- Ensure OS supports the register state.
Failure modes:
- Missing XGETBV check causes illegal instruction.
- Incorrect feature detection selects wrong path.
Minimal concrete example
if (has_avx512 && os_supports_zmm) emit_avx512();
else if (has_avx2) emit_avx2();
else emit_scalar();
Common misconceptions
- “CPUID alone is enough” -> OS support matters too.
- “Wider SIMD is always best” -> frequency and port limits can make it slower.
Check-your-understanding questions
- Why do you need XGETBV in addition to CPUID?
- What happens if you run AVX-512 on a core that doesn’t support it?
- Why might AVX-512 be slower than AVX2 on some cores?
Check-your-understanding answers
- CPUID reports hardware support, XGETBV reports OS enablement.
- You get an illegal instruction exception.
- It can reduce frequency or saturate bandwidth.
Real-world applications
- JITs in JavaScript and JVMs
- SIMD dispatch in math libraries
Where you’ll apply it
- In this project: see §3.2 Functional Requirements and §5.10 Phase 2.
- Also used in: P10-lunar-lake-p-vs-e-core-profiler.md.
References
- Intel CPUID documentation
- “Agner Fog’s Optimization Manuals”
Key insights
- Feature detection is a correctness requirement, not just an optimization.
Summary
A JIT must detect CPU and OS features before emitting specialized code. Dispatch then chooses the best variant safely.
Homework/Exercises to practice the concept
- Write a small CPUID wrapper that detects AVX2.
- Add an XGETBV check and confirm it gates AVX usage.
Solutions to the homework/exercises
- The wrapper should parse leaf 7 for AVX2.
- The check ensures OS has enabled YMM/ZMM state.
2.3 Auto-Tuning and Self-Benchmarking
Fundamentals
Auto-tuning is the process of generating multiple implementations and benchmarking them to select the fastest. In a JIT, this allows the runtime to adapt to the actual microarchitecture rather than relying on theoretical models. Self-benchmarking runs small, deterministic microbenchmarks at startup or on first use, then selects the code path with the lowest cycles per operation. This is the heart of a uArch-aware JIT.
Deep Dive into the concept
Auto-tuning is necessary because microarchitectural details are complex and often undocumented. A code path that is optimal on one CPU may be slower on another due to differences in port mappings, cache bandwidth, or branch prediction. By measuring directly, the runtime can choose the best variant. The key is to design benchmarks that are short, deterministic, and representative of the workload.
A good self-benchmark warms caches and predictors, runs a fixed number of iterations, and uses a stable timing method such as RDTSCP. The measurements should be repeated and a median or trimmed mean should be used to reduce noise. Because auto-tuning runs during program startup, it must be fast; you should limit the number of variants and iterations. For example, a single expression kernel might have three variants (scalar, AVX2, AVX-512). Benchmark each for 1e6 iterations and choose the fastest.
Auto-tuning also requires careful handling of cold-start effects. The first run may be slower because caches are cold. You should include a warmup phase and discard the first timing. Another issue is frequency scaling. If the CPU boosts temporarily, the first measurement may be misleading. Running multiple trials and choosing the median helps mitigate this.
Determinism is critical. If the benchmark is noisy, you might select a suboptimal variant. Use fixed seeds and fixed inputs. The benchmark should also be isolated to a pinned core to avoid migration across core types. For hybrid CPUs, you may want to detect core type and run separate tuning for each, but for this project, a single pin and selection is sufficient.
Finally, the runtime should store the selected variant and reuse it. You can also provide a debug mode that reports all variant timings, which helps users understand the decision. This is a valuable feature for performance engineers.
How this fits into the project
You will implement the auto-tuner in §3.7 and integrate it into the JIT runtime in §5.10 Phase 3.
Definitions & key terms
- auto-tuning -> selecting the fastest implementation by measurement
- warmup -> initial run to stabilize caches/predictors
- median timing -> robust statistic against noise
- variant -> alternative code path
Mental model diagram (ASCII)
Variants: scalar | AVX2 | AVX-512
Benchmark each -> choose min cycles -> cache selection
How it works (step-by-step, with invariants and failure modes)
- Generate code variants based on features.
- Warm up each variant.
- Measure cycles per iteration for each variant.
- Select the fastest and store function pointer.
Invariants:
- Use identical input for all variants.
- Pin to a core for consistency.
Failure modes:
- Noisy measurements select wrong variant.
- Overhead of benchmarking dominates startup time.
Minimal concrete example
fn = scalar; best = bench(scalar);
if (has_avx2)   { t = bench(avx2);   if (t < best) { fn = avx2;   best = t; } }
if (has_avx512) { t = bench(avx512); if (t < best) { fn = avx512; best = t; } }
Common misconceptions
- “Auto-tuning always finds best” -> noise can mislead.
- “Benchmarks can be huge” -> startup cost must be low.
Check-your-understanding questions
- Why use median instead of mean for timing?
- How does pinning improve auto-tuning accuracy?
- What trade-off does auto-tuning introduce at startup?
Check-your-understanding answers
- Median reduces outlier impact from interrupts.
- It prevents migration and frequency differences.
- It adds startup time in exchange for better steady-state performance.
Real-world applications
- BLAS libraries with multiple kernels
- JITs in JS engines and VMs
Where you’ll apply it
- In this project: see §3.7 Real World Outcome and §5.10 Phase 3.
- Also used in: P09-l1-bandwidth-stressor-zen-5-focus.md.
References
- “ATLAS” auto-tuning methodology papers
- Agner Fog's optimization manuals, benchmarking guidelines
Key insights
- Auto-tuning turns microarchitectural uncertainty into measured reality.
Summary
Self-benchmarking is the core of a uArch-aware JIT. It adapts code generation to the actual CPU rather than assumptions.
Homework/Exercises to practice the concept
- Benchmark two variants of a simple loop and choose the faster.
- Add a warmup phase and compare results with and without it.
Solutions to the homework/exercises
- The chosen variant should have consistently lower cycles.
- Warmup reduces variance and stabilizes selection.
2.4 Auto-Tuning Harness and Code Layout Discipline
Fundamentals
A uArch-aware JIT is only as good as its auto-tuning harness. The harness decides which code path is fastest on a given CPU, and it must do so reliably. That means controlling noise, warming caches, and repeating measurements. It also means being disciplined about code layout: alignment, loop body size, and branch structure can change the front-end behavior of your generated code. If the harness compares a well-aligned AVX2 loop against a misaligned AVX-512 loop, the results are meaningless. Good auto-tuning treats code layout as a first-class parameter and produces deterministic, reproducible timing results.
Deep Dive into the concept
Auto-tuning is an experiment. You have hypotheses (code path A is faster than B) and you need evidence. The evidence is timing data, but timing on modern CPUs is noisy. To reduce noise, you should pin to a core, disable turbo if possible, and run multiple trials. Use a warm-up phase to fill instruction caches and branch predictors, then measure multiple iterations and take the median. The median is robust against occasional OS interrupts. If you report only a single timing, you will sometimes select the wrong code path due to noise.
Code layout is the other crucial element. JITs allocate memory dynamically, which means code alignment is not guaranteed. But alignment affects the front-end: a loop that fits in the uOp cache with good alignment may run 10-20 percent faster than the same loop misaligned across a decode boundary. Your JIT should therefore align hot loops and, if necessary, pad with NOPs to enforce boundaries. This is not premature optimization; it is required for fair auto-tuning. The harness should know the alignment of each candidate and, ideally, test a few alignment variants for each strategy.
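One way to enforce the alignment discipline described above, assuming an x86 target where `0x90` is a one-byte NOP: round the emit cursor up to a 64-byte boundary and pad the gap.

```c
#include <stdint.h>
#include <string.h>

/* Round the emit cursor `off` up to the next 64-byte boundary, filling
   the gap with one-byte NOPs (0x90; this assumes an x86 target). The
   caller must ensure the buffer has room for the padding. */
static size_t align_up_64(uint8_t *buf, size_t off) {
    size_t aligned = (off + 63) & ~(size_t)63;
    memset(buf + off, 0x90, aligned - off);
    return aligned;
}
```

Calling this before emitting each hot loop body guarantees that every candidate in the auto-tuning harness starts at the same alignment, so timing differences reflect the code, not its placement.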
When generating multiple code paths, keep the dispatch cost low. A classic approach is: detect CPU features once, build a table of function pointers (scalar, AVX2, AVX-512), then run the auto-tuning benchmark to choose the best. After selection, avoid re-running the benchmark unless the environment changes. If you re-tune too often, you will pay a runtime cost that could offset the gain. The JIT should cache the selected path and reuse it.
The harness must also be deterministic. Fix random seeds, use fixed input data, and avoid allocations during measurement. Record the CPU model, microcode version, and key environment variables (like frequency governor) because those factors can change the outcome. In performance engineering, reproducibility is as important as absolute speed. By logging these details, you make the tuning decision auditable.
Finally, remember that a “fast” code path is not always the best. If two paths are within 1-2 percent, you may choose the simpler or safer one. For example, AVX-512 might be slightly faster but could downclock the CPU or increase power. A responsible JIT can include policy: prefer energy efficiency, avoid AVX-512 on thermally constrained systems, or select a path based on a user flag. This moves your project from a toy benchmark to a realistic runtime system.
How this fits into the project
You will use this in §3.1 What You Will Build to define the tuning harness, in §5.10 Phase 2 to implement timing loops, and in §7.3 to debug unstable tuning results.
Definitions & key terms
- auto-tuning -> benchmarking multiple code paths to select the fastest
- warm-up -> initial iterations to stabilize caches and predictors
- alignment discipline -> ensuring hot loops start at known boundaries
- median-of-N -> robust statistic for noisy timing data
- dispatch cost -> overhead of selecting a code path at runtime
Mental model diagram (ASCII)
Detect -> Generate A/B/C -> Warm up -> Benchmark -> Select -> Cache choice
How it works (step-by-step, with invariants and failure modes)
- Detect CPU features and generate candidate code paths.
- Align code buffers and record offsets.
- Warm up each candidate to prime caches.
- Run multiple timed trials and compute median.
- Select fastest path, cache result, and execute.
Invariants:
- Timing loop must be identical across candidates except for the code path.
- Warm-up iterations are excluded from measurements.
Failure modes:
- Noise causes wrong selection.
- Alignment differences skew results.
- Dispatch overhead outweighs gains for small workloads.
Minimal concrete example
// Pseudocode
tA = median(bench(A)); tB = median(bench(B)); tC = median(bench(C));
best = path_with_min(tA, tB, tC);  // select the code path, cache the choice
Common misconceptions
- “One run is enough” -> timing noise makes single runs unreliable.
- “Alignment does not matter for JIT” -> it can change front-end throughput.
- “Fastest is always best” -> power and stability can matter more than 1 percent speed.
Check-your-understanding questions
- Why is the median preferred over the mean for timing results?
- How can alignment skew auto-tuning decisions?
- When should you avoid re-tuning a code path?
Check-your-understanding answers
- The median is robust to outliers caused by interrupts or noise.
- Misaligned loops may look slower even if the algorithm is better.
- When the workload is stable and the tuning cost is significant.
Real-world applications
- JITs in JavaScript, JVM, and database engines
- BLAS libraries that auto-tune kernels at install time
- SIMD math libraries selecting best code paths
Where you’ll apply it
- In this project: see §5.10 Phase 2 and §7.3 Performance Traps.
- Also used in: P04-the-uop-cache-prober.md, P05-execution-port-pressure-map.md.
References
- “Optimizing Software in C++” by Agner Fog, benchmarking chapters
- “Systems Performance” by Brendan Gregg, measurement methodology
Key insights
- Auto-tuning is only as good as the rigor of its measurements.
Summary
A uArch-aware JIT must tune itself using careful, reproducible timing. Alignment, warm-up, and robust statistics turn auto-tuning from guesswork into engineering.
Homework/Exercises to practice the concept
- Run a microbenchmark 20 times and compare mean vs median selection.
- Align a code buffer to 64 bytes and compare to a random alignment.
Solutions to the homework/exercises
- The median produces more stable selections across runs.
- The aligned buffer shows lower variance and often higher throughput.
3. Project Specification
3.1 What You Will Build
A minimal JIT compiler for a tiny math language (expressions with +, -, *, /). The JIT emits different machine code variants (scalar, AVX2, AVX-512), benchmarks them at runtime, and selects the fastest. The output includes generated code bytes and performance metrics.
3.2 Functional Requirements
- Parser and AST: Parse expressions into an AST.
- Codegen: Emit machine code for each variant.
- Executable Memory: Allocate RW memory, emit code, switch to RX.
- Feature Detection: Detect CPU ISA support and core type.
- Auto-Tuner: Benchmark variants and select fastest.
- Interpreter Baseline: Provide a safe reference evaluator.
3.3 Non-Functional Requirements
- Performance: JIT variant must outperform interpreter on medium-size inputs.
- Reliability: Deterministic results for fixed inputs and seed.
- Usability: CLI to input expression and view report.
3.4 Example Usage / Output
$ ./uarch_jit --run expr.txt
[Detect]
CPU: x86-64, AVX-512 available
[Benchmark]
scalar: 820 cycles
AVX2: 510 cycles
AVX-512: 360 cycles (selected)
Result: 42
3.5 Data Formats / Schemas / Protocols
Input format:
(3 + 4) * (5 + 2)
Output JSON (optional):
{ "selected": "avx512", "cycles": {"scalar":820,"avx2":510,"avx512":360} }
3.6 Edge Cases
- Expression too large for code buffer
- Unsupported ISA on target CPU
- Division by zero in expression
3.7 Real World Outcome
You will have a JIT that adapts to the CPU and outperforms a simple interpreter.
3.7.1 How to Run (Copy/Paste)
cc -O2 -Wall -o uarch_jit src/*.c
./uarch_jit --run samples/expr.txt
3.7.2 Golden Path Demo (Deterministic)
- Use fixed input file and seed.
- Expect the same selected variant and result across runs.
3.7.3 If CLI: Exact Terminal Transcript
$ ./uarch_jit --run samples/expr.txt
[Detect]
CPU: x86-64, AVX2 available
[Benchmark]
scalar: 820 cycles
AVX2: 510 cycles (selected)
Result: 42
$ echo $?
0
Failure demo (division by zero):
$ ./uarch_jit --run samples/div_zero.txt
Error: division by zero at node 4
$ echo $?
2
Exit codes:
- 0 -> success
- 2 -> runtime evaluation error
- 3 -> invalid input or unsupported ISA
4. Solution Architecture
4.1 High-Level Design
+--------+    +---------+    +--------+    +---------+
| Parser | -> | Codegen | -> | Memory | -> | Execute |
+--------+    +---------+    +--------+    +---------+
      \-> [Interpreter baseline]
4.2 Key Components
| Component | Responsibility | Key Decisions |
|---|---|---|
| Parser | Build AST | simple expression grammar |
| Codegen | Emit machine code | scalar + SIMD variants |
| Memory Manager | RW->RX transitions | W^X compliance |
| Auto-Tuner | Benchmark variants | median selection |
4.3 Data Structures (No Full Code)
struct Node { enum Op op; struct Node* left; struct Node* right; double value; };
4.4 Algorithm Overview
Key Algorithm: JIT Execute
- Parse input into AST.
- Emit code variants into executable buffers.
- Benchmark each variant and select fastest.
- Execute selected code and return result.
Complexity Analysis:
- Time: O(N) for parse + codegen, plus benchmark time
- Space: O(N) for AST and code buffers
5. Implementation Guide
5.1 Development Environment Setup
cc --version
5.2 Project Structure
uarch-jit/
├── src/
│ ├── parser.c
│ ├── codegen.c
│ ├── features.c
│ ├── autotune.c
│ └── main.c
└── README.md
5.3 The Core Question You’re Answering
“How do real-world runtimes adapt to the hardware they run on?”
5.4 Concepts You Must Understand First
- JIT codegen and executable memory
- CPU feature detection
- Auto-tuning and benchmarking
5.5 Questions to Guide Your Design
- What is your minimal expression grammar?
- How will you ensure safe RW->RX transitions?
- How will you keep benchmarks deterministic?
5.6 Thinking Exercise
Design two code variants for a*b + c and predict which will be faster on a wide SIMD core.
5.7 The Interview Questions They’ll Ask
- How does a JIT differ from an interpreter?
- Why is W^X important?
- How do you choose the fastest variant at runtime?
5.8 Hints in Layers
Hint 1: Start with scalar codegen and verify correctness with an interpreter.
Hint 2: Add AVX2 codegen and use CPUID checks.
Hint 3: Benchmark with a fixed input and median timing.
5.9 Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| JIT basics | “Writing a C Compiler” | codegen chapters |
| Compiler design | “Engineering a Compiler” | IR and codegen |
5.10 Implementation Phases
Phase 1: Foundation (4-6 days)
- Implement parser and interpreter.
- Checkpoint: interpreter returns correct results.
Phase 2: Core Functionality (6-10 days)
- Add scalar codegen and executable memory.
- Checkpoint: JIT result matches interpreter.
Phase 3: Adaptation (4-6 days)
- Add feature detection and auto-tuning.
- Checkpoint: JIT selects best variant deterministically.
5.11 Key Implementation Decisions
| Decision | Options | Recommendation | Rationale |
|---|---|---|---|
| Code buffer | mmap vs malloc | mmap | control permissions |
| Variant selection | fastest single run vs median | median | stability |
6. Testing Strategy
6.1 Test Categories
| Category | Purpose | Examples |
|---|---|---|
| Unit Tests | Parser correctness | operator precedence |
| Integration Tests | JIT vs interpreter | random expressions |
| Edge Tests | Division by zero | error path |
6.2 Critical Test Cases
- JIT and interpreter outputs must match.
- Unsupported ISA must fall back to scalar.
- Auto-tuner must pick deterministic variant.
6.3 Test Data
expr: (3 + 4) * (5 + 2)
7. Common Pitfalls & Debugging
7.1 Frequent Mistakes
| Pitfall | Symptom | Solution |
|---|---|---|
| Bad calling convention | crashes | follow ABI and preserve regs |
| RWX pages | security risk | use RW->RX transitions |
| Noisy tuning | inconsistent selection | use median and pinning |
7.2 Debugging Strategies
- Print emitted bytes and disassemble.
- Compare JIT output to interpreter for random inputs.
7.3 Performance Traps
- Overly large code buffers reduce locality and increase I-cache pressure.
8. Extensions & Challenges
8.1 Beginner Extensions
- Add subtraction and division operators.
8.2 Intermediate Extensions
- Add vectorized constant folding.
8.3 Advanced Extensions
- Add a mini register allocator for fewer spills.
9. Real-World Connections
9.1 Industry Applications
- Dynamic language runtimes
- JIT-based ML frameworks
9.2 Related Open Source Projects
- LuaJIT: JIT compiler for Lua
- asmjit: JIT assembler library
9.3 Interview Relevance
- JITs and codegen are advanced systems topics.
10. Resources
10.1 Essential Reading
- “Writing a C Compiler” by Nora Sandler
- “Engineering a Compiler” by Cooper and Torczon
10.2 Video Resources
- “JIT Compilation” lecture series
10.3 Tools & Documentation
- objdump: inspect machine code
- mmap/mprotect: executable memory
10.4 Related Projects in This Series
11. Self-Assessment Checklist
11.1 Understanding
- I can explain the JIT pipeline end-to-end.
- I can describe safe executable memory handling.
- I can justify the auto-tuning selection.
11.2 Implementation
- JIT output matches interpreter.
- Auto-tuning chooses deterministic fastest path.
- Feature detection is correct.
11.3 Growth
- I can explain JIT trade-offs in an interview.
12. Submission / Completion Criteria
Minimum Viable Completion:
- Parse expressions and execute scalar JIT code.
- Match interpreter output.
Full Completion:
- Add feature detection and auto-tuning across two variants.
Excellence (Going Above & Beyond):
- Support AVX-512 and include performance report across variants.