Project 7: ARM vs RISC-V Benchmark Suite
Run identical rendering kernels in both ISA modes and quantify the real-world performance differences.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Level 4: Expert |
| Time Estimate | 2-3 weeks |
| Main Programming Language | C |
| Alternative Programming Languages | Rust |
| Coolness Level | Level 5: Pure Magic |
| Business Potential | Level 1: “Resume Gold” |
| Prerequisites | Projects 1-3, toolchain setup for ARM and RISC-V |
| Key Topics | Benchmark methodology, cycle counters, ISA differences |
1. Learning Objectives
By completing this project, you will:
- Build and run firmware for both ARM Cortex-M33 and Hazard3 RISC-V.
- Measure cycle counts for rendering kernels with consistent methodology.
- Identify how instruction density and FPU support affect performance.
- Present benchmark results on the LCD in a clear dashboard.
- Explain fairness pitfalls in embedded benchmarking.
2. All Theory Needed (Per-Concept Breakdown)
2.1 Benchmark Methodology and Fairness
Fundamentals
A benchmark is only useful if it is fair. Fairness means both systems run the same algorithm under the same conditions: same clock, same compiler optimizations, same memory layout, and no extra debugging overhead. For microcontrollers, small changes like interrupts or clock scaling can skew results. A good benchmark fixes these variables and repeats measurements to reduce noise. You must also separate CPU-bound operations from memory-bound ones to understand where each ISA excels.
Deep Dive into the concept
Benchmarking on embedded systems is difficult because the environment is noisy. Interrupts, DMA activity, and peripheral latency can affect timing. To create a fair benchmark, you must disable interrupts or at least ensure they run equally in both measurements. You must fix the system clock and verify it with a known timer. Compiler settings also matter: -O0 vs -O2 changes performance dramatically. Use the same optimization level and equivalent toolchain flags. If one ISA has hardware FPU and the other doesn’t, your benchmark should be either integer-only or include two versions (with/without float). Otherwise, you’re comparing apples to oranges.
Another issue is warm-up effects. The first run may include cache fills (if caches exist) or instruction fetch overhead. For consistency, run the benchmark multiple times and discard the first run. Also, choose representative workloads: pixel fills, memcpy, and sprite blits are typical for graphics. Don’t optimize one path more than the other; if you use inline assembly, you must do so for both. Record not just cycles but also derived metrics like cycles per pixel and bandwidth in MB/s. Finally, report variance: if results fluctuate by more than a few percent, your benchmark is unstable.
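A minimal sketch of this discipline in C. The helpers get_cycle_count(), run_fill_kernel(), and irq_disable()/irq_enable() are assumed placeholders for whatever your harness provides (see 2.2 for the counter itself):

#include <stdint.h>

#define WARMUP_RUNS   1
#define MEASURED_RUNS 10

extern uint32_t get_cycle_count(void);   /* DWT_CYCCNT on ARM, mcycle on RISC-V */
extern uint32_t run_fill_kernel(void);   /* kernel under test, returns a buffer checksum */
extern void irq_disable(void);           /* e.g. __disable_irq() on ARM, clear mstatus.MIE on RISC-V */
extern void irq_enable(void);

volatile uint32_t g_sink;                /* consuming the checksum defeats dead-code elimination */

uint32_t bench_average_cycles(void) {
    uint32_t total = 0;
    irq_disable();                                   /* no interrupt noise during timing */
    for (int i = 0; i < WARMUP_RUNS + MEASURED_RUNS; i++) {
        uint32_t t0 = get_cycle_count();
        g_sink += run_fill_kernel();
        uint32_t t1 = get_cycle_count();
        if (i >= WARMUP_RUNS)                        /* discard the warm-up run */
            total += t1 - t0;
    }
    irq_enable();
    return total / MEASURED_RUNS;                    /* mean cycles per measured run */
}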
How this fits into the projects
This concept is central to Section 3.2 and Section 5.10 Phase 2. It also informs Project 5 (multicore) and Project 10 (bare-metal boot), where timing is sensitive.
Definitions & key terms
- Fairness -> Identical conditions for comparison.
- Warm-up -> Initial runs that may be slower or inconsistent.
- CPU-bound -> Limited by compute, not memory.
- Memory-bound -> Limited by data movement, not compute.
Mental model diagram (ASCII)
[ISA A] --same clock--> [Kernel] --measure--> cycles
[ISA B] --same clock--> [Kernel] --measure--> cycles
How it works (step-by-step)
- Fix system clocks and disable interrupts.
- Compile with identical optimization flags.
- Run kernel N times, discard first run.
- Record cycles and compute averages/variance.
Failure modes:
- Interrupts active -> noise.
- Different optimization -> unfair results.
- Different memory alignment -> bias.
Minimal concrete example
uint32_t t0 = get_cycle_count();   /* read the counter right before the kernel */
run_kernel();
uint32_t t1 = get_cycle_count();   /* read it again right after */
uint32_t cycles = t1 - t0;         /* unsigned subtraction survives one counter wrap */
Common misconceptions
- “One run is enough.” -> Always run multiple iterations.
- “Higher clock = higher performance.” -> Memory bottlenecks can dominate.
Check-your-understanding questions
- Why disable interrupts during benchmarks?
- Why discard the first run?
- What is the difference between CPU-bound and memory-bound?
Check-your-understanding answers
- Interrupts add unpredictable delays.
- Warm-up effects distort results.
- CPU-bound is limited by compute; memory-bound by data movement.
Real-world applications
- Comparing MCU families for product selection
- Performance tuning for embedded graphics
Where you’ll apply it
- This project: Section 3.2, Section 5.10 Phase 2
- Also used in: Project 5
References
- “Computer Architecture” (Hennessy/Patterson) Ch. 1
- RP2350 datasheet cycle counters
Key insights
Benchmarks are measurements, not truth; control your variables.
Summary
Fair benchmarking requires consistent settings, repeated runs, and careful reporting.
Homework/Exercises to practice the concept
- Measure a memcpy kernel 10 times and compute variance.
- Run the same kernel with interrupts on vs off.
- Compare results at two different clock speeds.
Solutions to the homework/exercises
- Report mean and standard deviation.
- Interrupts increase variance and average time.
- CPU-bound kernels scale with clock; memory-bound may not.
2.2 Cycle Counters and Performance Metrics
Fundamentals
Cycle counters are hardware registers that increment every CPU clock. They let you measure execution time precisely. On ARM, the DWT cycle counter is commonly used; on RISC-V, the mcycle CSR provides similar data. By reading the counter before and after a kernel, you get precise cycles. From cycles, you can compute throughput: cycles per pixel, pixels per second, or MB/s.
Deep Dive into the concept
Cycle counters must be enabled and read correctly. On ARM Cortex-M33, the DWT cycle counter may require enabling in the debug module. On RISC-V, mcycle is always present but may need privilege access. You must ensure that the counter doesn’t overflow during your measurement; if it can, you should use 64-bit or handle wraparound. The key is to isolate the kernel: disable interrupts, avoid system calls, and ensure the compiler doesn’t optimize away your test. Use volatile pointers or checksum outputs to keep the compiler honest.
Once you have cycles, you can compute metrics that are easy to compare. For example, cycles per pixel = cycles / pixels. If you know the clock rate, you can compute seconds = cycles / clock_hz. If you render a 172x320 fill, you can compute bytes transferred and estimate bandwidth. These derived metrics are often more informative than raw cycle counts. Finally, always report both cycles and derived metrics to make comparisons meaningful.
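A sketch of both counters follows. The register addresses use the standard ARMv8-M DWT layout and should be verified against the RP2350 datasheet; some cores also require unlocking the DWT before writes take effect:

#include <stdint.h>

#if defined(__arm__)
/* ARM Cortex-M33: DWT cycle counter. DEMCR.TRCENA must be set before DWT is usable. */
#define DEMCR      (*(volatile uint32_t *)0xE000EDFC)
#define DWT_CTRL   (*(volatile uint32_t *)0xE0001000)
#define DWT_CYCCNT (*(volatile uint32_t *)0xE0001004)

static inline void cycle_counter_init(void) {
    DEMCR |= (1u << 24);          /* TRCENA: enable trace/debug blocks */
    DWT_CYCCNT = 0;
    DWT_CTRL |= 1u;               /* CYCCNTENA: start counting */
}
static inline uint32_t get_cycle_count(void) { return DWT_CYCCNT; }

#elif defined(__riscv)
/* Hazard3: mcycle CSR, readable in machine mode. If it reads 0, check mcountinhibit. */
static inline void cycle_counter_init(void) { /* typically counting out of reset */ }
static inline uint32_t get_cycle_count(void) {
    uint32_t c;
    __asm volatile ("csrr %0, mcycle" : "=r"(c));   /* low 32 bits; mcycleh holds the rest */
    return c;
}
#endif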
How this fits into the projects
Cycle counters are used in Section 3.2 and Section 5.10 Phase 1. They are also used in Project 9 (system monitor) and Project 5 (dual-core metrics).
Definitions & key terms
- DWT -> Data Watchpoint and Trace unit (ARM cycle counter).
- mcycle -> RISC-V cycle count register.
- Throughput -> Work per unit time (pixels/sec).
- Variance -> Measure of spread in timing results.
Mental model diagram (ASCII)
cycles = end_count - start_count
seconds = cycles / clock_hz
How it works (step-by-step)
- Enable the cycle counter.
- Read start value.
- Execute kernel.
- Read end value.
- Compute cycles and derived metrics.
Failure modes:
- Counter disabled -> always 0.
- Optimized-away kernel -> false low numbers.
Minimal concrete example
uint32_t start = dwt_get_cycle_count();   /* DWT_CYCCNT on ARM; use an mcycle read on RISC-V */
run_kernel();
uint32_t end = dwt_get_cycle_count();
uint32_t cycles = end - start;            /* 32-bit wrap-safe for short kernels */
Common misconceptions
- “Cycle counts equal real time.” -> Only if you know clock speed.
- “One measurement is enough.” -> Use multiple samples.
Check-your-understanding questions
- What does DWT do on ARM?
- Why use volatile or checksums?
- How do you compute pixels/sec from cycles?
Check-your-understanding answers
- It provides cycle counter access.
- To prevent compiler removing the kernel.
- pixels/sec = pixels * clock_hz / cycles.
Real-world applications
- MCU performance tuning
- Algorithm selection for embedded graphics
Where you’ll apply it
- This project: Section 3.2, Section 5.10 Phase 1
- Also used in: Project 9
References
- ARM Cortex-M33 DWT documentation
- RISC-V privileged spec (mcycle)
Key insights
Cycle counters are the fastest path to truth in embedded performance.
Summary
Use hardware counters for precise, repeatable timing.
Homework/Exercises to practice the concept
- Measure a loop with 1000 iterations and verify cycles.
- Force an overflow and detect wraparound.
- Compute MB/s for a buffer copy.
Solutions to the homework/exercises
- Cycles should scale linearly with iteration count.
- Use 64-bit accumulation or handle wrap.
- MB/s = bytes * clock_hz / (cycles * 1e6).
3. Project Specification
3.1 What You Will Build
A benchmark harness that runs identical rendering kernels on ARM and RISC-V modes, then displays cycle counts, FPS, and throughput comparisons on the LCD.
3.2 Functional Requirements
- Dual build system for ARM and RISC-V.
- Benchmark kernels: fill, memcpy, sprite blit.
- Cycle counter readings on both ISAs.
- Result dashboard on LCD.
3.3 Non-Functional Requirements
- Performance: measurement overhead <5%.
- Reliability: results stable within +/-2% across runs.
- Usability: simple CLI or menu to select benchmarks.
3.4 Example Usage / Output
Kernel: Fill 172x320
ARM: 1.2M cycles (92 fps)
RISC-V: 1.6M cycles (70 fps)
3.5 Data Formats / Schemas / Protocols
- Results stored as struct {kernel_id, cycles, fps}
3.6 Edge Cases
- Different clock rates between ISAs
- Counter overflow on long runs
- Compiler optimizing away kernels
3.7 Real World Outcome
The LCD shows a side-by-side comparison chart with bars for ARM and RISC-V cycles per pixel. Results are stable across runs.
3.7.1 How to Run (Copy/Paste)
# ARM build
cd LEARN_RP2350_LCD_DEEP_DIVE/benchmark
mkdir -p build_arm
cd build_arm
cmake -DCPU=ARM ..
make -j4
cp bench_arm.uf2 /Volumes/RP2350
# RISC-V build
cd ..
mkdir -p build_rv
cd build_rv
cmake -DCPU=RISCV ..
make -j4
cp bench_rv.uf2 /Volumes/RP2350
3.7.2 Golden Path Demo (Deterministic)
- Run fill benchmark 10 times; average cycles displayed.
- Bars show ARM faster for float-heavy kernel, RISC-V close for integer-only.
3.7.3 Failure Demo (Deterministic)
- Build ARM with -O0 and RISC-V with -O2.
- Results show exaggerated ARM slowness.
- Fix: use identical optimization flags.
4. Solution Architecture
4.1 High-Level Design
[Kernel Runner] -> [Cycle Counter] -> [Results Store] -> [LCD Dashboard]
4.2 Key Components
| Component | Responsibility | Key Decisions |
|---|---|---|
| Kernel suite | Workloads | Fill, blit, memcpy |
| Timer | Cycle measurement | DWT or mcycle |
| Renderer | Display results | Bar charts + numbers |
4.3 Data Structures (No Full Code)
typedef struct { uint8_t id; uint32_t cycles; float fps; } bench_result_t;
4.4 Algorithm Overview
Key Algorithm: Benchmark Loop
- Warm-up run.
- Execute kernel N times.
- Record cycles and compute the average (see the metrics sketch below).
- Display results.
Complexity Analysis:
- Time: O(N * kernel)
- Space: O(num_kernels)
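A sketch of turning the averaged cycles into the displayed metrics. The struct matches Section 4.3; FRAME_PIXELS and CLK_HZ are assumed values for a full-screen fill at a fixed clock:

#include <stdint.h>

/* Result record from Section 4.3 */
typedef struct { uint8_t id; uint32_t cycles; float fps; } bench_result_t;

/* Assumed constants: full-screen fill size and the fixed benchmark clock */
#define FRAME_PIXELS (172u * 320u)
#define CLK_HZ       150000000u          /* replace with the clock you actually lock to */

bench_result_t make_result(uint8_t id, uint32_t avg_cycles) {
    bench_result_t r = { id, avg_cycles, 0.0f };
    if (avg_cycles > 0)
        r.fps = (float)CLK_HZ / (float)avg_cycles;          /* one kernel run = one frame */
    return r;
}

static inline float cycles_per_pixel(uint32_t cycles) {
    return (float)cycles / (float)FRAME_PIXELS;
}

static inline float mbytes_per_sec(uint32_t cycles) {
    /* RGB565 = 2 bytes/pixel; MB/s = bytes * clk_hz / cycles / 1e6 */
    return (FRAME_PIXELS * 2.0f) * (float)CLK_HZ / (float)cycles / 1e6f;
}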
5. Implementation Guide
5.1 Development Environment Setup
# Install ARM + RISC-V toolchains
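One possible starting point on a Debian/Ubuntu host; the package names and SDK path below are assumptions, so confirm which RISC-V toolchain the pico-sdk documentation actually recommends for RP2350:
# ARM bare-metal GCC (Debian/Ubuntu package name)
sudo apt install cmake gcc-arm-none-eabi
# RISC-V GCC: a distro cross-compiler may work if it ships rv32 multilibs;
# otherwise use the prebuilt toolchain linked from the pico-sdk documentation
sudo apt install gcc-riscv64-unknown-elf
# Tell the build where the SDK lives (assumed checkout location)
export PICO_SDK_PATH=~/pico-sdk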
5.2 Project Structure
benchmark/
- src/
- kernels.c
- bench.c
- main.c
5.3 The Core Question You’re Answering
“How does ISA choice change real-world performance on the same silicon?”
5.4 Concepts You Must Understand First
- Fair benchmark methodology
- Cycle counters and timing
- ISA differences (FPU, instruction density)
5.5 Questions to Guide Your Design
- Which kernels represent real graphics workloads?
- How will you ensure identical conditions?
- How many runs are needed for stable averages?
5.6 Thinking Exercise
List all variables that must be fixed for a fair benchmark.
5.7 The Interview Questions They’ll Ask
- Why is benchmarking hard?
- What does instruction density affect?
- Why disable interrupts during measurement?
5.8 Hints in Layers
- Hint 1: Start with memcpy and fill.
- Hint 2: Use a fixed clock and disable interrupts.
- Hint 3: Display results on LCD to avoid serial overhead.
5.9 Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| ISA basics | “Computer Organization and Design” | Ch. 1-3 |
| Performance | “Computer Architecture” | Ch. 1 |
5.10 Implementation Phases
Phase 1: Dual Toolchain (3-4 days)
Goals: Build for ARM and RISC-V. Tasks: Setup CMake toolchain options. Checkpoint: Both binaries boot.
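One way to wire the -DCPU switch from Section 3.7.1 onto the pico-sdk's platform selection (a sketch; confirm the platform names against your SDK version):
# CMakeLists.txt fragment
cmake_minimum_required(VERSION 3.13)
if(CPU STREQUAL "RISCV")
    set(PICO_PLATFORM rp2350-riscv)
else()
    set(PICO_PLATFORM rp2350-arm-s)
endif()
include(pico_sdk_import.cmake)
project(benchmark C CXX ASM)
pico_sdk_init()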
Phase 2: Kernel Suite (4-5 days)
Goals: Implement fill, blit, memcpy. Tasks: Add timing hooks. Checkpoint: Stable cycle counts.
Phase 3: Dashboard (4-5 days)
Goals: Visualize results. Tasks: Render bar chart and numbers. Checkpoint: Results readable on LCD.
5.11 Key Implementation Decisions
| Decision | Options | Recommendation | Rationale |
|---|---|---|---|
| Kernel types | Synthetic vs real | Mix | Balanced insight |
| Output | Serial vs LCD | LCD | Avoid overhead |
6. Testing Strategy
6.1 Test Categories
| Category | Purpose | Examples |
|---|---|---|
| Unit Tests | Kernel correctness | Verify fill pattern |
| Integration Tests | Timing | Repeatability checks |
| Regression Tests | Toolchain | ARM vs RISC-V build checks |
6.2 Critical Test Cases
- Repeatability: 10 runs within +/-2%.
- Correctness: kernel output matches expected buffer.
- Fairness: identical clock + flags.
6.3 Test Data
Kernel: fill 172x320 with 0xF800
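A correctness check to pair with that test data (sketch; the framebuffer pointer and pixel count are whatever your harness exposes):

#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

/* Did the fill kernel write 0xF800 (red in RGB565) to every pixel? */
bool test_fill_pattern(const uint16_t *fb, size_t pixels) {
    for (size_t i = 0; i < pixels; i++)
        if (fb[i] != 0xF800u)
            return false;
    return true;
}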
7. Common Pitfalls & Debugging
7.1 Frequent Mistakes
| Pitfall | Symptom | Solution |
|---|---|---|
| Different flags | Skewed results | Match build flags |
| Interrupt noise | High variance | Disable interrupts |
| Optimized-away kernel | Near-zero cycles | Use volatile checksum |
7.2 Debugging Strategies
- Display a checksum of rendered buffers (see the sketch after this list).
- Repeat tests and log variance.
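A minimal checksum sketch (framebuffer pointer and size are assumptions):

#include <stdint.h>
#include <stddef.h>

/* Fold the rendered RGB565 buffer into one word: identical kernels must produce the
   same value on both ISAs, and touching every pixel keeps the compiler from dropping work. */
uint32_t framebuffer_checksum(const uint16_t *fb, size_t pixels) {
    uint32_t sum = 0;
    for (size_t i = 0; i < pixels; i++)
        sum = sum * 31u + fb[i];
    return sum;     /* show this next to the cycle count on the LCD */
}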
7.3 Performance Traps
- Printing over serial inside the timed region; UART output dominates the runtime.
8. Extensions & Challenges
8.1 Beginner Extensions
- Add a simple integer-only kernel.
8.2 Intermediate Extensions
- Add float-heavy kernel to compare FPU impact.
8.3 Advanced Extensions
- Add cache/memory bandwidth tests if supported.
9. Real-World Connections
9.1 Industry Applications
- MCU selection for graphics products
9.2 Related Open Source Projects
- Benchmark suites for microcontrollers
9.3 Interview Relevance
- Benchmark methodology and ISA trade-offs are advanced interview topics.
10. Resources
10.1 Essential Reading
- RP2350 datasheet (cycle counters)
- RISC-V privileged spec
10.2 Video Resources
- Performance measurement talks
10.3 Tools & Documentation
- ARM and RISC-V toolchains
10.4 Related Projects in This Series
- Project 10 for toolchain details.
11. Self-Assessment Checklist
11.1 Understanding
- I can explain fairness in benchmarks.
- I can read cycle counters on both ISAs.
11.2 Implementation
- Benchmarks run on ARM and RISC-V.
- Results are stable and displayed on LCD.
11.3 Growth
- I can explain ISA trade-offs in an interview.
12. Submission / Completion Criteria
Minimum Viable Completion:
- One kernel benchmark with cycles reported on LCD.
Full Completion:
- Multiple kernels and side-by-side comparison.
Excellence (Going Above & Beyond):
- Detailed report with variance and analysis.