Project 1: The Human Pipeline Trace

Build a cycle-accurate, human-readable pipeline timeline that makes hazards, stalls, and forwarding visible.

Quick Reference

Attribute	Value
Difficulty	Level 1: Beginner
Time Estimate	1-2 weeks
Main Programming Language	C (Alternatives: Assembly, Rust)
Alternative Programming Languages	Assembly, Rust
Coolness Level	Level 3: Genuinely Clever
Business Potential	1. The “Resume Gold”
Prerequisites	C basics, reading simple assembly, basic CPU pipeline ideas
Key Topics	5-stage pipeline, hazards, forwarding, bubble insertion, dependency analysis

1. Learning Objectives

By completing this project, you will:

Explain how a classic 5-stage pipeline schedules instructions across IF/ID/EX/MEM/WB.
Implement RAW hazard detection and forwarding for a small instruction subset.
Insert bubbles deterministically and explain why each stall exists.
Compute IPC, total cycles, and stall breakdown from a trace.
Validate your simulator with hand-traced examples and unit tests.

2. All Theory Needed (Per-Concept Breakdown)

2.1 Classic 5-Stage Pipeline and Hazards

Fundamentals

A classic 5-stage pipeline splits instruction processing into five stages: Instruction Fetch (IF), Instruction Decode/Register Read (ID), Execute (EX), Memory Access (MEM), and Write Back (WB). The pipeline overlaps these stages for consecutive instructions, ideally completing one instruction per cycle. Hazards arise when this ideal overlap conflicts with real data or resource constraints. Data hazards happen when one instruction needs a value that another has not written yet. Control hazards occur when the next instruction depends on a branch outcome. Structural hazards occur when two stages need the same hardware in the same cycle. The pipeline is a scheduling system, so understanding it means tracking when values become available, not just which instruction comes next.

Deep Dive into the concept

The five stages are a teaching abstraction, but they encode real timing rules. IF pulls bytes from instruction memory and updates the program counter. ID decodes the opcode and reads the register file, but the value read here is not necessarily the final value that should be used if a previous instruction writes the same register later in the pipeline. EX performs ALU operations and computes effective addresses for memory ops. MEM accesses the data cache or memory, and WB commits results to the register file. The key is that write-back is late in the pipeline, so a following instruction might read a stale value in ID if you do not bypass. A RAW (read-after-write) hazard is the most common; it means instruction i+1 needs a value produced by instruction i that is still in EX or MEM. A WAR hazard (write-after-read) does not occur in a simple in-order pipeline because writes happen after reads and are ordered. A WAW hazard does not occur because in-order pipelines write in order, but it matters in out-of-order systems.

Forwarding (also called bypassing) is the mechanism that short-circuits this delay. Instead of waiting for the register file to be updated in WB, the pipeline can take the result from EX or MEM and feed it directly into the EX stage of a dependent instruction. This requires a small network that compares destination registers from in-flight instructions with source registers of younger instructions. The network controls multiplexers that choose between register file outputs and forwarded values. A load-use hazard is a special case: the data from a load is not available until after MEM, so an instruction that uses that data in the next cycle cannot be satisfied by forwarding from EX. The pipeline must insert a bubble (a stall) to delay the dependent instruction by one or more cycles. The stall itself is implemented by holding IF and ID steady, inserting a NOP into EX, and letting the older instructions continue. This is a scheduling decision, not a functional change, but it has a direct performance cost.

Control hazards are handled with simple schemes in the 5-stage model: assume not-taken, then flush if wrong; or stall until the branch resolves. Branches typically resolve in EX, so a taken branch means the instruction fetched after it is wrong and must be squashed. In a pedagogical simulator, you can model a branch as creating a one or two-cycle bubble and updating the PC accordingly. The key is deterministic timing: the same instruction stream should always produce the same timeline given the same hazard rules.

When you build a timeline simulator, you are effectively running a scoreboard. Each instruction occupies a stage per cycle, and constraints must be checked: each stage can hold at most one instruction, data dependencies must be respected, and bubbles must be inserted when a dependency cannot be satisfied by forwarding. This is a constrained scheduling problem. You must model when each instruction produces its result (EX for ALU, MEM for loads) and when a consumer needs it (EX). This timing is the foundation of hazard detection. If you mis-model even one cycle, your simulator will either under-stall (incorrectly optimistic) or over-stall (pessimistic). The discipline here is to define a small set of rules and apply them consistently, then confirm the output with hand-traced examples.

How this fits on projects

You will use this concept to define the stage occupancy table in §3.1 and to implement the hazard rules that drive bubble insertion in §5.10 Phase 2. It also informs your validation tests in §6.2.

Definitions & key terms

pipeline stage -> a time slice where a specific part of instruction processing occurs
RAW hazard -> a read after write dependency that can cause a stall
forwarding/bypassing -> taking a result from an internal pipeline stage instead of the register file
bubble -> an inserted NOP that occupies a stage to delay a younger instruction
control hazard -> a dependency on branch outcome that may require a flush or stall

Mental model diagram (ASCII)

Cycle ->   1   2   3   4   5   6
I1:        IF  ID  EX  MEM WB
I2:            IF  ID  EX  MEM WB
I3:                IF  ID  EX  MEM WB
Stall:                 --  --

How it works (step-by-step, with invariants and failure modes)

Parse instruction stream into a list of ops with source and destination registers.
For each cycle, attempt to advance each instruction one stage forward.
Check for RAW hazards: if a consumer needs a value not yet produced, decide if forwarding can satisfy it.
If forwarding cannot satisfy, insert a bubble and hold younger stages.
For branches, resolve at EX and flush IF/ID if taken.

Invariants:

At most one instruction per stage per cycle.
Results are only available after their producing stage completes.
The timeline must be deterministic for a fixed input and hazard policy.

Failure modes:

Under-stalling produces impossible timelines.
Over-stalling hides achievable IPC.
Wrong branch handling creates negative or duplicated instructions.

Minimal concrete example

// Simple dependent sequence: R1 = R2 + R3; R4 = R1 + R5
ADD R1, R2, R3
ADD R4, R1, R5
// Expect a forwarding path from EX/MEM of I1 to EX of I2

Common misconceptions

“All hazards are fixed by forwarding” -> Load-use hazards still need a stall.
“Branches always stall” -> Simple prediction can avoid full stalls.
“ID reads are final” -> Forwarded values can override ID reads.

Check-your-understanding questions

Why does a load-use hazard typically require a bubble in a 5-stage pipeline?
At which stage is an ALU result available for forwarding?
What changes in the timeline when a branch is taken and resolved in EX?

Check-your-understanding answers

The load data is not ready until MEM completes, so the dependent EX stage is too early; a bubble delays it.
The ALU result is produced at the end of EX and can be forwarded to the next cycle’s EX.
The instruction fetched after the branch is squashed, creating a bubble and a PC redirect.

Real-world applications

Pipeline design and verification in CPU microarchitecture teams
Compiler scheduling to avoid hazards
Performance analysis for tight loops on embedded cores

Where you’ll apply it

In this project: see §3.2 Functional Requirements and §5.10 Phase 2.
Also used in: P04-the-uop-cache-prober.md, P08-macro-op-fusion-detector.md.

References

“Computer Organization and Design” by Patterson and Hennessy, Ch. 4
“Computer Systems: A Programmer’s Perspective” by Bryant and O’Hallaron, Ch. 4

Key insights

Pipeline performance is a schedule governed by data availability, not just instruction order.

Summary

The 5-stage pipeline is a constrained scheduling system. Hazards are timing conflicts, and forwarding is a precise fix that only works when data becomes available early enough.

Homework/Exercises to practice the concept

Hand-trace a 6-instruction sequence with one load-use hazard and one taken branch.
Identify which hazards could be resolved with forwarding and which require stalls.

Solutions to the homework/exercises

The load-use pair requires one bubble; the taken branch squashes the next fetched instruction.
ALU-to-ALU RAW hazards can be forwarded; load-use hazards need a stall.

2.2 Dependency Analysis and Forwarding Logic

Fundamentals

Dependency analysis is the act of determining which registers or memory locations each instruction reads and writes. In a pipeline simulator, this is the information that drives hazard detection. Forwarding logic then uses this dependency data to decide if a consumer can use a producer’s value early. For an in-order 5-stage pipeline, the forwarding paths are limited: EX/MEM and MEM/WB can forward to EX, and sometimes MEM/WB can forward to MEM for stores. You do not need to simulate actual wires, but you must implement the same decision rules. The core idea is that dependencies are static, but availability is dynamic and depends on the stage.

Deep Dive into the concept

A practical dependency analyzer begins with instruction decoding. You parse each instruction into an opcode, destination register (if any), and source registers. For a simplified ISA, you can assume that arithmetic ops read two registers and write one. Loads read a base register and write a destination; stores read a base register and a value register but write no register. Branches read registers for comparison and write no register. Once you have this metadata, you can treat the pipeline as a timeline of register versions. A RAW hazard exists if an instruction reads register R that is written by an older instruction that has not yet produced its result. The timing of that production depends on instruction type: ALU results are ready after EX, loads after MEM, and stores typically do not produce a register value. This gives you a simple model: each instruction has a “ready stage” for each destination.

Forwarding logic is a set of priority rules. If two older instructions write the same register (which can happen if you have a sequence like ADD R1, …, then SUB R1, …), the younger consumer should receive the most recent older value, not the older one. In an in-order pipeline, that usually means the instruction immediately before the consumer if it writes the same register. Your simulator can implement this by scanning backward from the consumer to find the nearest writer and then determining whether its value is ready by the time the consumer reaches EX. If the value is ready in EX/MEM, you forward from that stage. If it is only ready in MEM/WB and the consumer is already in EX, you must stall. For loads, even MEM/WB might be one cycle too late for the immediate next instruction, which is why load-use hazards create a bubble even with forwarding.

Forwarding is also a performance optimization. Without forwarding, every RAW dependency would force a stall until WB. That would collapse IPC for most code. Forwarding allows the pipeline to approach ideal throughput for arithmetic sequences. Your simulator should be able to toggle forwarding on and off to show the difference in timeline and IPC. This is a powerful teaching tool because it makes the microarchitectural benefit visible.

Implementing forwarding correctly requires consistent timing. A common error is to allow a value to be forwarded in the same cycle it is produced, which is not always valid. For example, if an ALU result is produced at the end of EX in cycle N, the consumer in EX during cycle N cannot see it; it can see it in cycle N+1. This is a subtle but critical detail. When you write your scheduler, explicitly model when values become usable. The easiest way is to track a “ready cycle” for each register value and compare it to the consumer’s EX cycle. If ready_cycle <= ex_cycle, the value can be used without stalling; otherwise insert a bubble.

Memory dependencies are trickier and usually beyond the scope of a beginner pipeline simulator. For this project, you can treat memory as serialized or ignore aliasing. If you include simple load/store hazards, define clear rules. For example: a load that follows a store to the same address stalls until the store reaches MEM. That can be a stretch goal. The goal is a consistent, deterministic model that can be hand-validated.

How this fits on projects

This concept drives your dependency parsing in §5.2 and your hazard rules in §3.2. You will also use it in §6.2 to design test cases that target specific hazards.

Definitions & key terms

dependency -> a relationship where one instruction needs data produced by another
ready cycle -> the earliest cycle when a value can be consumed
forwarding priority -> rule that chooses which producer to forward from when multiple match
load-use hazard -> dependency on a load whose data is not ready for the next instruction

Mental model diagram (ASCII)

Producer I1: EX (result) -> MEM -> WB
Consumer I2:       EX (needs result)
Forwarding path: EX/MEM -> EX

How it works (step-by-step, with invariants and failure modes)

Decode each instruction into sources and destinations.
For each instruction, compute the earliest stage when its destination is ready.
For each consumer, find the nearest older writer of each source register.
If the writer’s ready stage occurs after the consumer’s EX stage, stall.
If ready, mark the source as forwarded and continue.

Invariants:

A consumer must see the latest older writer, not an older one.
No value can be consumed before it is produced.

Failure modes:

Forwarding from the wrong producer yields incorrect results.
Allowing same-cycle forwarding hides necessary stalls.

Minimal concrete example

// EX-ready ALU result forwarded next cycle
ADD R1, R2, R3
ADD R4, R1, R5  // should be 1-cycle apart with forwarding

Common misconceptions

“Forwarding eliminates all stalls” -> Load-use still stalls.
“Scanning all previous instructions is enough” -> You must pick the nearest older writer.

Check-your-understanding questions

Why must you pick the nearest older writer when multiple instructions write the same register?
How do you model the ready cycle for a load instruction?
What happens if you allow forwarding in the same cycle a value is produced?

Check-your-understanding answers

Because the nearest older writer is the value that would architecturally be seen.
The value is ready after MEM completes, usually one cycle later than an ALU result.
You allow impossible schedules that undercount stalls and inflate IPC.

Real-world applications

Hardware hazard detection units in CPUs
Compiler scheduling and instruction reordering
Pipeline verification test benches

Where you’ll apply it

In this project: see §5.4 Concepts You Must Understand First and §5.10 Phase 2.
Also used in: P07-the-reorder-buffer-rob-boundary-finder.md.

References

“Computer Architecture: A Quantitative Approach” by Hennessy and Patterson, Ch. 3
“Digital Design and Computer Architecture” by Harris and Harris, Ch. 7

Key insights

Forwarding is a precise timing rule, not a guess; it must obey ready cycles.

Summary

Dependency analysis turns assembly into dataflow, and forwarding logic schedules that dataflow through the pipeline while honoring timing constraints.

Homework/Exercises to practice the concept

Create a sequence with two writes to the same register and a later read; identify the correct producer.
Add a load-use pair and decide how many bubbles are needed.

Solutions to the homework/exercises

The read should use the most recent older write, not the earliest.
A single bubble is required in a 5-stage pipeline with MEM-stage load data.

2.3 Pipeline Timing Diagrams and IPC Accounting

Fundamentals

A pipeline timing diagram is a concrete schedule, not an illustration. It answers two questions: when does each instruction occupy each stage, and how much useful work did you actually retire per cycle. IPC (instructions per cycle) is the headline number, but it is meaningless unless you also classify why cycles were lost. In a simple single-issue pipeline, an ideal loop gets IPC ~1 once the pipeline is full. Any RAW hazard, control hazard, or structural conflict inserts bubbles that reduce IPC. A good simulator does not just draw the chart; it computes a cycle-by-cycle accounting that attributes each bubble to a specific reason. This accounting is what lets you test hypotheses like “forwarding removed ALU stalls but load-use still costs one cycle” or “branch penalties dominate in short loops.” In other words, IPC is a summary; the timing diagram is the evidence.

Deep Dive into the concept

A correct timing diagram must capture three phases: the prologue (fill), steady state (overlap), and epilogue (drain). The prologue is where the pipeline is filling and IPC is temporarily below the steady-state average because the first instruction spends cycles alone. The epilogue is similar; once the last instruction is fetched, the remaining stages still execute without new fetches. If you do not model these phases, you will miscompute total cycles and will report an IPC that is too optimistic for short kernels. The standard formula for total cycles is: cycles = N + (pipeline_depth - 1) + bubbles, where N is the number of instructions and bubbles is the sum of all inserted stalls. That formula assumes a single-issue pipeline, and it is a great sanity check for your simulator output.

Beyond totals, the diagram should preserve causality. A RAW hazard should manifest as a stall that delays the dependent instruction’s EX stage until the producer result is available. That implies the following invariant: the consumer’s EX cycle must be greater than or equal to the producer’s ready cycle. The ready cycle depends on instruction type; for a load it is after MEM, for an ALU op it is after EX. When you apply forwarding, you are effectively reducing the ready cycle by one stage. The diagram is the proof that you applied that timing rule consistently. If you see a dependent instruction advancing into EX before the ready cycle, your simulator is wrong regardless of IPC.

Stall attribution is the hidden power of a timing diagram. You can group bubbles into categories: load-use stalls, branch penalty stalls, structural stalls (for example, if you choose to model a single memory port), and “other.” The category is not just for reporting; it guides what you optimize. If your pipeline spends 40 percent of cycles stalled on load-use hazards, you can add forwarding from MEM or reorder independent instructions. If stalls are dominated by branches, then branch prediction or delay slots (in other architectures) are the lever. This is the same logic used in modern Top-Down Microarchitecture Analysis (TMAM), just applied to a small teaching model.

Timing diagrams also force you to be explicit about resource constraints. If you decide to model a single MEM stage that can only service one instruction per cycle, then two back-to-back memory operations create a structural hazard. Your diagram should show that only one of those instructions occupies MEM each cycle, and the other stalls in EX. This is not merely a detail; it demonstrates how functional unit availability shapes throughput even in an in-order pipeline. In later projects you will simulate front-end and back-end limits with perf counters, but the mental model starts here.

Finally, a timing diagram is a debugging tool. You can create “golden” diagrams for a few hand-traced instruction sequences and compare your simulator output line by line. A single off-by-one error in the EX-to-MEM transition will create a cascade of incorrect bubbles. By writing a diagram generator that emits both a table and a short stall report, you make those errors obvious. A strong approach is to write a verifier that recomputes the constraints for each cycle and flags any violations (two instructions in the same stage, or a consumer using a value before it is ready). That verifier becomes your test oracle for all later changes.

How this fits on projects

You will use this in §3.2 to define the output format, in §5.10 Phase 2 to compute IPC and stall breakdowns, and in §6.2 to create test cases with known bubble counts.

Definitions & key terms

timing diagram -> a cycle-by-cycle schedule that shows stage occupancy
IPC -> instructions per cycle, a throughput metric
bubble -> an inserted idle slot in a pipeline stage
prologue/epilogue -> fill and drain phases around steady state
stall attribution -> classification of lost cycles by cause

Mental model diagram (ASCII)

Cycle:   1   2   3   4   5   6   7
I1:      IF  ID  EX  MEM WB
I2:          IF  ID  --  EX  MEM WB
I3:              IF  ID  EX  MEM WB
Bubbles:             ^ load-use stall

How it works (step-by-step, with invariants and failure modes)

Compute an earliest-stage schedule assuming no hazards.
For each cycle, verify data readiness for each instruction entering EX.
If not ready, insert a bubble and shift younger instructions right.
After scheduling, compute totals: cycles, IPC, and stall categories.
Emit a diagram plus a stall report.

Invariants:

Only one instruction per stage per cycle.
A consumer’s EX cycle >= producer ready cycle.
Total cycles = N + (pipeline_depth - 1) + bubbles.

Failure modes:

Off-by-one timing produces impossible forwardings.
Ignoring prologue/epilogue inflates IPC for short loops.
Double-occupancy of a stage breaks the model.

Minimal concrete example

I1: LOAD R1, [R2]
I2: ADD  R3, R1, R4

Expected: I2 stalls one cycle because R1 is ready after MEM.

Common misconceptions

“IPC is enough to validate the simulator” -> IPC can match even with wrong timing.
“Bubbles are just NOPs” -> Bubbles are schedule delays with a specific cause.
“The first instruction completes in 5 cycles” -> Only if no stalls and the pipeline depth is 5.

Check-your-understanding questions

Why does a short loop often have lower IPC than a long loop with the same body?
How do you prove that a given schedule is valid for a load-use pair?
Why must you account for prologue and epilogue cycles when reporting IPC?

Check-your-understanding answers

The pipeline fill and drain overhead is a larger fraction of total cycles.
Show that the consumer’s EX cycle is at or after the load’s MEM completion.
Because IPC is total instructions divided by total cycles, and those phases add cycles.

Real-world applications

Teaching pipelines in computer architecture courses
Verifying simple CPU simulators and FPGA soft cores
Explaining why small kernels underperform on real hardware

Where you’ll apply it

In this project: see §3.4 Example Usage and §5.10 Phase 3.
Also used in: P05-execution-port-pressure-map.md, P07-the-reorder-buffer-rob-boundary-finder.md.

References

“Computer Architecture: A Quantitative Approach” by Hennessy and Patterson, Ch. 2-3
“Structured Computer Organization” by Tanenbaum, Ch. 4

Key insights

A timing diagram is a correctness proof, not just a picture.

Summary

IPC without timing is a guess. A cycle-accurate diagram forces you to model when results are truly available and makes every stall accountable, which is the foundation of trustworthy performance reasoning.

Homework/Exercises to practice the concept

Build a 6-instruction sequence with two independent ALU ops and one load-use pair, then compute total cycles by hand.
Draw the timing diagram and compute IPC for both with and without forwarding.

Solutions to the homework/exercises

The load-use pair adds one bubble; total cycles = 6 + 4 + 1 = 11 for a 5-stage pipeline.
With forwarding, only the load-use bubble remains; without forwarding, each RAW hazard adds stalls.

3. Project Specification

3.1 What You Will Build

A CLI tool named pipe_trace that reads a small assembly file, simulates a 5-stage in-order pipeline, and prints a cycle-by-cycle timeline. It highlights stalls, bubbles, and forwarding decisions, and outputs summary statistics (total cycles, IPC, hazard counts). The tool supports a small instruction subset (ADD, SUB, MUL, LOAD, STORE, BRANCH, NOP) and a consistent timing model. It is intentionally not a full CPU simulator; no real memory is emulated beyond dependency timing.

3.2 Functional Requirements

Instruction Parsing: Parse a simplified assembly format into an internal instruction list with opcode, sources, and destination.
Pipeline Scheduling: Simulate IF/ID/EX/MEM/WB stages and advance each instruction per cycle.
Hazard Detection: Detect RAW hazards and insert bubbles when forwarding is disabled or insufficient.
Forwarding Support: When enabled, allow EX/MEM and MEM/WB forwarding to satisfy dependencies.
Control Hazards: Model branch resolution in EX and flush misfetched instructions.
Reporting: Output a timeline table and a summary of hazards and IPC.

3.3 Non-Functional Requirements

Performance: Handle at least 5,000 instructions in under 1 second on a laptop.
Reliability: Deterministic output for the same input and options.
Usability: Clear error messages for unknown instructions or malformed lines.

3.4 Example Usage / Output

$ ./pipe_trace loops.s --forwarding=on --format=table

Cycle | IF | ID | EX | MEM | WB | Notes
------+----+----+----+-----+----+---------------------------
1     | I1 |    |    |     |    | Fetch ADD
2     | I2 | I1 |    |     |    | Decode ADD
3     | I3 | I2 | I1 |     |    | EX for I1
4     | I4 | I3 | I2 | I1  |    | Forward R1 to I2
5     | I5 | I4 | I3 | I2  | I1 | No stall

[Stats]
Instructions: 5
Cycles: 9
IPC: 0.56
RAW hazards: 1 (forwarded)
Stalls: 0

3.5 Data Formats / Schemas / Protocols

Input assembly format (one instruction per line):

ADD R1, R2, R3
LOAD R4, 0(R1)
STORE R4, 8(R1)
BRANCH R4, R0, label

Internal instruction shape:

struct Instr {
  opcode: enum {ADD,SUB,MUL,LOAD,STORE,BRANCH,NOP}
  src1: Reg
  src2: Reg
  dst: Reg|None
  imm: int
}

3.6 Edge Cases

Unknown opcode or malformed register name
Load-use dependency immediately after load
Back-to-back branches
NOPs and empty lines
A branch target that does not exist

3.7 Real World Outcome

You will have a tool that makes pipeline timing visible. You will be able to point to a specific stall in the timeline and explain the dependency that caused it.

3.7.1 How to Run (Copy/Paste)

# Build
cc -O2 -Wall -o pipe_trace src/pipe_trace.c

# Run on a sample file
./pipe_trace samples/loop.s --forwarding=on --format=table

3.7.2 Golden Path Demo (Deterministic)

Use the provided samples/loop.s and run with --forwarding=on.
Expect exactly 9 cycles for 5 instructions and zero stalls.

3.7.3 If CLI: Exact Terminal Transcript

$ ./pipe_trace samples/loop.s --forwarding=on --format=table

[Pipeline Trace]
Cycle | IF | ID | EX | MEM | WB | Notes
------+----+----+----+-----+----+---------------------------
1     | I1 |    |    |     |    | Fetch ADD
2     | I2 | I1 |    |     |    | Decode ADD
3     | I3 | I2 | I1 |     |    | EX for I1
4     | I4 | I3 | I2 | I1  |    | Forward R1 to I2
5     | I5 | I4 | I3 | I2  | I1 | No stall

[Stats]
Instructions: 5
Cycles: 9
IPC: 0.56
RAW hazards: 1 (forwarded)
Stalls: 0

$ echo $?
0

Failure demo (bad opcode):

$ ./pipe_trace samples/bad.s
Error: unknown opcode 'ADDD' on line 3

$ echo $?
2

Exit codes:

0 success
2 parse error
3 invalid argument

4. Solution Architecture

4.1 High-Level Design

+-----------------+      +-------------------+      +------------------+
| Assembly Parser | ---> | Pipeline Scheduler| ---> | Trace Formatter  |
+-----------------+      +-------------------+      +------------------+
            |                       |
            v                       v
     Dependency Map         Hazard/Forwarding

4.2 Key Components

Component	Responsibility	Key Decisions
Parser	Tokenize and decode instructions	Keep the ISA small and deterministic
Scheduler	Advance instructions per cycle	Use ready-cycle rules for hazards
Hazard Unit	Detect RAW and decide stalls	Forwarding on/off switch
Reporter	Render table and stats	Stable table width for diffing

4.3 Data Structures (No Full Code)

typedef struct {
  int id;
  enum Opcode op;
  int src1, src2, dst;
  int ready_cycle;
} Instr;

typedef struct {
  int if_id, id_ex, ex_mem, mem_wb, wb;
} PipelineState;

4.4 Algorithm Overview

Key Algorithm: Cycle Scheduler

Start with all stages empty and cycle = 1.
Attempt to advance WB, MEM, EX, ID, IF in order.
If ID->EX would violate a hazard, insert bubble and hold IF/ID.
Record stage occupancy for the cycle.

Complexity Analysis:

Time: O(N * C) where C is cycles, typically O(N)
Space: O(N) for instruction list and timeline

5. Implementation Guide

5.1 Development Environment Setup

# Build tools
cc --version

# Optional: run with sanitizer for easier debugging
CFLAGS="-O0 -g -fsanitize=address,undefined"

5.2 Project Structure

pipe-trace/
├── src/
│   ├── main.c
│   ├── parser.c
│   ├── scheduler.c
│   └── report.c
├── samples/
│   ├── loop.s
│   └── hazards.s
├── tests/
│   └── test_scheduler.c
└── README.md

5.3 The Core Question You’re Answering

“Why does adding a single instruction sometimes make my loop twice as slow?”

The answer is timing. The CPU must wait for values to become available. Your timeline makes that waiting visible and countable.

5.4 Concepts You Must Understand First

Stop and research these before coding:

Pipeline stage timing
- When is a value produced vs consumed?
- How does stage latency map to cycles?
- Book: “Computer Systems” Ch. 4
RAW hazards and forwarding
- Why is load-use special?
- How do forward paths reduce stalls?
Branch resolution timing
- When is a branch decision known?
- Why does a taken branch flush IF/ID?

5.5 Questions to Guide Your Design

How will you represent a bubble in the timeline?
What constitutes a “ready” value for each opcode?
How will you detect and report hazards without slowing the simulator?

5.6 Thinking Exercise

Hand-Trace a Hazard

Trace this sequence by hand and mark stalls:

LOAD R1, 0(R2)
ADD  R3, R1, R4
ADD  R5, R3, R6

Questions:

Where does the first bubble appear?
How many total cycles for 3 instructions?

5.7 The Interview Questions They’ll Ask

What is a load-use hazard and how is it resolved?
Why is forwarding critical for IPC?
What would change in a 7-stage pipeline?

5.8 Hints in Layers

Hint 1: Track ready cycles Store a ready_cycle per destination register and compare against the consumer EX cycle.

Hint 2: Bubble is just a NOP Insert a NOP into EX and hold IF/ID for one cycle.

Hint 3: Validate with tiny programs Start with 2-instruction sequences where you can predict the timeline exactly.

5.9 Books That Will Help

Topic	Book	Chapter
Basic pipeline timing	“Computer Systems”	Ch. 4
Hazard detection	“Computer Organization and Design”	Ch. 4
Pipeline examples	“Digital Design and Computer Architecture”	Ch. 7

5.10 Implementation Phases

Phase 1: Foundation (2-3 days)

Goals: Parse assembly and represent instructions. Tasks:

Build a line parser for opcodes and registers.
Create a vector of instructions with src/dst metadata. Checkpoint: Parsing samples/loop.s yields correct instruction list.

Phase 2: Core Functionality (4-6 days)

Goals: Schedule pipeline stages and hazards. Tasks:

Implement stage advancement and bubble insertion.
Add forwarding rules and a toggle to disable forwarding. Checkpoint: Hand-traced examples match the timeline.

Phase 3: Polish & Edge Cases (2-3 days)

Goals: Reporting, stats, and error handling. Tasks:

Add hazard summary and IPC calculation.
Implement clear error messages for bad inputs. Checkpoint: Tool produces deterministic output for all samples.

5.11 Key Implementation Decisions

Decision	Options	Recommendation	Rationale
Hazard model	Simple RAW only vs RAW+control	RAW + control	Enough for learning with manageable complexity
Output format	Table vs JSON	Table with optional JSON	Table is intuitive; JSON supports tooling

6. Testing Strategy

6.1 Test Categories

Category	Purpose	Examples
Unit Tests	Parser and scheduler correctness	Parse registers, detect hazards
Integration Tests	Full pipeline on sample programs	loop.s, hazards.s
Edge Case Tests	Invalid opcodes, empty file	bad.s, empty.s

6.2 Critical Test Cases

Load-use hazard: Expect exactly one bubble.
ALU forwarding: No stall when forwarding is on.
Taken branch: Squash next instruction and count bubble.

6.3 Test Data

LOAD R1, 0(R2)
ADD  R3, R1, R4

7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

Pitfall	Symptom	Solution
Same-cycle forwarding	Too few stalls	Enforce ready-cycle rule
Wrong producer	Incorrect results	Pick nearest older writer
Branch flush not modeled	Timeline skips	Insert bubble and redirect PC

7.2 Debugging Strategies

Trace small sequences: use 2-3 instruction inputs.
Compare with hand-traced tables: print cycle by cycle.

7.3 Performance Traps

Overly complex dependency scanning; use small fixed look-back window or per-register tracking.

8. Extensions & Challenges

8.1 Beginner Extensions

Add support for immediate operands.
Output JSON timeline.

8.2 Intermediate Extensions

Add branch prediction (always-taken vs not-taken).
Add simple cache miss penalties.

8.3 Advanced Extensions

Extend to a 7-stage pipeline.
Add memory dependency checks for loads/stores.

9. Real-World Connections

9.1 Industry Applications

Compiler optimization: instruction scheduling for pipeline efficiency.
Embedded systems: cycle-accurate performance estimation on microcontrollers.

gem5: full-system simulator; your project is a minimal conceptual slice.
llvm-mca: microarchitectural analysis tool that uses similar dependency modeling.

9.3 Interview Relevance

Pipeline hazards, forwarding, and stalls are common interview topics for systems and CPU roles.

10. Resources

10.1 Essential Reading

“Computer Systems: A Programmer’s Perspective” by Bryant and O’Hallaron - Ch. 4
“Computer Architecture: A Quantitative Approach” by Hennessy and Patterson - Ch. 3

10.2 Video Resources

“Pipelining Basics” - university architecture lecture series

10.3 Tools & Documentation

objdump: inspect compiler output
gdb: step through assembly for validation

Next: P02-the-branch-predictor-torture-test.md - learn control hazards
Later: P05-execution-port-pressure-map.md - execution backend

11. Self-Assessment Checklist

11.1 Understanding

I can explain why load-use hazards require a bubble.
I can describe when forwarding is possible.
I can trace a pipeline timeline by hand.

11.2 Implementation

All functional requirements are met.
All test cases pass.
Output is deterministic and readable.

11.3 Growth

I can explain pipeline timing in an interview.
I documented at least one design trade-off.

12. Submission / Completion Criteria

Minimum Viable Completion:

Parses a sample assembly file and prints a timeline.
Correctly inserts a bubble for a load-use hazard.
Outputs IPC and total cycles.

Full Completion:

All requirements plus: forwarding toggle and branch flushing.

Excellence (Going Above & Beyond):

7-stage pipeline option and JSON output with diff-friendly formatting.