Project 7: ARM vs RISC-V Benchmark Suite
Run identical rendering kernels in both ISA modes and quantify the real-world performance differences.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Level 4: Expert |
| Time Estimate | 2-3 weeks |
| Main Programming Language | C |
| Alternative Programming Languages | Rust |
| Coolness Level | Level 5: Pure Magic |
| Business Potential | Level 1: “Resume Gold” |
| Prerequisites | Projects 1-3, toolchain setup for ARM and RISC-V |
| Key Topics | Benchmark methodology, cycle counters, ISA differences |
1. Learning Objectives
By completing this project, you will:
- Build and run firmware for both ARM Cortex-M33 and Hazard3 RISC-V.
- Measure cycle counts for rendering kernels with consistent methodology.
- Identify how instruction density and FPU support affect performance.
- Present benchmark results on the LCD in a clear dashboard.
- Explain fairness pitfalls in embedded benchmarking.
2. All Theory Needed (Per-Concept Breakdown)
2.1 Benchmark Methodology and Fairness
Fundamentals
A benchmark is only useful if it is fair. Fairness means both systems run the same algorithm under the same conditions: same clock, same compiler optimizations, same memory layout, and no extra debugging overhead. For microcontrollers, small changes like interrupts or clock scaling can skew results. A good benchmark fixes these variables and repeats measurements to reduce noise. You must also separate CPU-bound operations from memory-bound ones to understand where each ISA excels.
Deep Dive into the concept
Benchmarking on embedded systems is difficult because the environment is noisy. Interrupts, DMA activity, and peripheral latency can affect timing. To create a fair benchmark, you must disable interrupts or at least ensure they run equally in both measurements. You must fix the system clock and verify it with a known timer. Compiler settings also matter: -O0 vs -O2 changes performance dramatically. Use the same optimization level and equivalent toolchain flags. If one ISA has hardware FPU and the other doesn’t, your benchmark should be either integer-only or include two versions (with/without float). Otherwise, you’re comparing apples to oranges.
Another issue is warm-up effects. The first run may include cache fills (if caches exist) or instruction fetch overhead. For consistency, run the benchmark multiple times and discard the first run. Also, choose representative workloads: pixel fills, memcpy, and sprite blits are typical for graphics. Don’t optimize one path more than the other; if you use inline assembly, you must do so for both. Record not just cycles but also derived metrics like cycles per pixel and bandwidth in MB/s. Finally, report variance: if results fluctuate by more than a few percent, your benchmark is unstable.
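A minimal sketch of this discipline in C. The helpers get_cycle_count(), run_fill_kernel(), and irq_disable()/irq_enable() are assumed placeholders for whatever your harness provides (see 2.2 for the counter itself):

#include <stdint.h>

#define WARMUP_RUNS   1
#define MEASURED_RUNS 10

extern uint32_t get_cycle_count(void);   /* DWT_CYCCNT on ARM, mcycle on RISC-V */
extern uint32_t run_fill_kernel(void);   /* kernel under test, returns a buffer checksum */
extern void irq_disable(void);           /* e.g. __disable_irq() on ARM, clear mstatus.MIE on RISC-V */
extern void irq_enable(void);

volatile uint32_t g_sink;                /* consuming the checksum defeats dead-code elimination */

uint32_t bench_average_cycles(void) {
    uint32_t total = 0;
    irq_disable();                                   /* no interrupt noise during timing */
    for (int i = 0; i < WARMUP_RUNS + MEASURED_RUNS; i++) {
        uint32_t t0 = get_cycle_count();
        g_sink += run_fill_kernel();
        uint32_t t1 = get_cycle_count();
        if (i >= WARMUP_RUNS)                        /* discard the warm-up run */
            total += t1 - t0;
    }
    irq_enable();
    return total / MEASURED_RUNS;                    /* mean cycles per measured run */
}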
How this fits into the projects
This concept is central to Section 3.2 and Section 5.10 Phase 2. It also informs Project 5 (multicore) and Project 10 (bare-metal boot), where timing is sensitive.
Definitions & key terms
- Fairness -> Identical conditions for comparison.
- Warm-up -> Initial runs that may be slower or inconsistent.
- CPU-bound -> Limited by compute, not memory.
- Memory-bound -> Limited by data movement, not compute.
Mental model diagram (ASCII)
[ISA A] --same clock--> [Kernel] --measure--> cycles
[ISA B] --same clock--> [Kernel] --measure--> cycles
How it works (step-by-step)
- Fix system clocks and disable interrupts.
- Compile with identical optimization flags.
- Run kernel N times, discard first run.
- Record cycles and compute averages/variance.
Failure modes:
- Interrupts active -> noise.
- Different optimization -> unfair results.
- Different memory alignment -> bias.
Minimal concrete example
uint32_t t0 = get_cycle_count();   /* read the counter right before the kernel */
run_kernel();
uint32_t t1 = get_cycle_count();   /* read it again right after */
uint32_t cycles = t1 - t0;         /* unsigned subtraction survives one counter wrap */
Common misconceptions
- “One run is enough.” -> Always run multiple iterations.
- “Higher clock = higher performance.” -> Memory bottlenecks can dominate.
Check-your-understanding questions
- Why disable interrupts during benchmarks?
- Why discard the first run?
- What is the difference between CPU-bound and memory-bound?
Check-your-understanding answers
- Interrupts add unpredictable delays.
- Warm-up effects distort results.
- CPU-bound is limited by compute; memory-bound by data movement.
Real-world applications
- Comparing MCU families for product selection
- Performance tuning for embedded graphics
Where you’ll apply it
- This project: Section 3.2, Section 5.10 Phase 2
- Also used in: Project 5
References
- “Computer Architecture” (Hennessy/Patterson) Ch. 1
- RP2350 datasheet cycle counters
Key insights
Benchmarks are measurements, not truth; control your variables.
Summary
Fair benchmarking requires consistent settings, repeated runs, and careful reporting.
Homework/Exercises to practice the concept
- Measure a memcpy kernel 10 times and compute variance.
- Run the same kernel with interrupts on vs off.
- Compare results at two different clock speeds.
Solutions to the homework/exercises
- Report mean and standard deviation.
- Interrupts increase variance and average time.
- CPU-bound kernels scale with clock; memory-bound may not.
2.2 Cycle Counters and Performance Metrics
Fundamentals
Cycle counters are hardware registers that increment every CPU clock. They let you measure execution time precisely. On ARM, the DWT cycle counter is commonly used; on RISC-V, the mcycle CSR provides similar data. By reading the counter before and after a kernel, you get precise cycles. From cycles, you can compute throughput: cycles per pixel, pixels per second, or MB/s.
Deep Dive into the concept
Cycle counters must be enabled and read correctly. On ARM Cortex-M33, the DWT cycle counter may require enabling in the debug module. On RISC-V, mcycle is always present but may need privilege access. You must ensure that the counter doesn’t overflow during your measurement; if it can, you should use 64-bit or handle wraparound. The key is to isolate the kernel: disable interrupts, avoid system calls, and ensure the compiler doesn’t optimize away your test. Use volatile pointers or checksum outputs to keep the compiler honest.
Once you have cycles, you can compute metrics that are easy to compare. For example, cycles per pixel = cycles / pixels. If you know the clock rate, you can compute seconds = cycles / clock_hz. If you render a 172x320 fill, you can compute bytes transferred and estimate bandwidth. These derived metrics are often more informative than raw cycle counts. Finally, always report both cycles and derived metrics to make comparisons meaningful.
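A sketch of both counters follows. The register addresses use the standard ARMv8-M DWT layout and should be verified against the RP2350 datasheet; some cores also require unlocking the DWT before writes take effect:

#include <stdint.h>

#if defined(__arm__)
/* ARM Cortex-M33: DWT cycle counter. DEMCR.TRCENA must be set before DWT is usable. */
#define DEMCR      (*(volatile uint32_t *)0xE000EDFC)
#define DWT_CTRL   (*(volatile uint32_t *)0xE0001000)
#define DWT_CYCCNT (*(volatile uint32_t *)0xE0001004)

static inline void cycle_counter_init(void) {
    DEMCR |= (1u << 24);          /* TRCENA: enable trace/debug blocks */
    DWT_CYCCNT = 0;
    DWT_CTRL |= 1u;               /* CYCCNTENA: start counting */
}
static inline uint32_t get_cycle_count(void) { return DWT_CYCCNT; }

#elif defined(__riscv)
/* Hazard3: mcycle CSR, readable in machine mode. If it reads 0, check mcountinhibit. */
static inline void cycle_counter_init(void) { /* typically counting out of reset */ }
static inline uint32_t get_cycle_count(void) {
    uint32_t c;
    __asm volatile ("csrr %0, mcycle" : "=r"(c));   /* low 32 bits; mcycleh holds the rest */
    return c;
}
#endif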
How this fits into the projects
Cycle counters are used in Section 3.2 and Section 5.10 Phase 1. They are also used in Project 9 (system monitor) and Project 5 (dual-core metrics).
Definitions & key terms
- DWT -> Data Watchpoint and Trace unit (ARM cycle counter).
- mcycle -> RISC-V cycle count register.
- Throughput -> Work per unit time (pixels/sec).
- Variance -> Measure of spread in timing results.
Mental model diagram (ASCII)
cycles = end_count - start_count
seconds = cycles / clock_hz
How it works (step-by-step)
- Enable the cycle counter.
- Read start value.
- Execute kernel.
- Read end value.
- Compute cycles and derived metrics.
Failure modes:
- Counter disabled -> always 0.
- Optimized-away kernel -> false low numbers.
Minimal concrete example
uint32_t start = dwt_get_cycle_count();   /* DWT_CYCCNT on ARM; use an mcycle read on RISC-V */
run_kernel();
uint32_t end = dwt_get_cycle_count();
uint32_t cycles = end - start;            /* 32-bit wrap-safe for short kernels */
Common misconceptions
- “Cycle counts equal real time.” -> Only if you know clock speed.
- “One measurement is enough.” -> Use multiple samples.
Check-your-understanding questions
- What does DWT do on ARM?
- Why use volatile or checksums?
- How do you compute pixels/sec from cycles?
Check-your-understanding answers
- It provides cycle counter access.
- To prevent compiler removing the kernel.
- pixels/sec = pixels * clock_hz / cycles.
Real-world applications
- MCU performance tuning
- Algorithm selection for embedded graphics
Where you’ll apply it
- This project: Section 3.2, Section 5.10 Phase 1
- Also used in: Project 9
References
- ARM Cortex-M33 DWT documentation
- RISC-V privileged spec (mcycle)
Key insights
Cycle counters are the fastest path to truth in embedded performance.
Summary
Use hardware counters for precise, repeatable timing.
Homework/Exercises to practice the concept
- Measure a loop with 1000 iterations and verify cycles.
- Force an overflow and detect wraparound.
- Compute MB/s for a buffer copy.
Solutions to the homework/exercises
- Cycles should scale linearly with iteration count.
- Use 64-bit accumulation or handle wrap.
- MB/s = bytes * clock_hz / (cycles * 1e6).
3. Project Specification
3.1 What You Will Build
A benchmark harness that runs identical rendering kernels on ARM and RISC-V modes, then displays cycle counts, FPS, and throughput comparisons on the LCD.
3.2 Functional Requirements
- Dual build system for ARM and RISC-V.
- Benchmark kernels: fill, memcpy, sprite blit.
- Cycle counter readings on both ISAs.
- Result dashboard on LCD.
3.3 Non-Functional Requirements
- Performance: measurement overhead <5%.
- Reliability: results stable within +/-2% across runs.
- Usability: simple CLI or menu to select benchmarks.
3.4 Example Usage / Output
Kernel: Fill 172x320
ARM: 1.2M cycles (92 fps)
RISC-V: 1.6M cycles (70 fps)
3.5 Data Formats / Schemas / Protocols
- Results stored as struct {kernel_id, cycles, fps}
3.6 Edge Cases
- Different clock rates between ISAs
- Counter overflow on long runs
- Compiler optimizing away kernels
3.7 Real World Outcome
The LCD shows a side-by-side comparison chart with bars for ARM and RISC-V cycles per pixel. Results are stable across runs.
3.7.1 How to Run (Copy/Paste)
# ARM build
cd LEARN_RP2350_LCD_DEEP_DIVE/benchmark
mkdir -p build_arm
cd build_arm
cmake -DCPU=ARM ..
make -j4
cp bench_arm.uf2 /Volumes/RP2350
# RISC-V build
cd ..
mkdir -p build_rv
cd build_rv
cmake -DCPU=RISCV ..
make -j4
cp bench_rv.uf2 /Volumes/RP2350
3.7.2 Golden Path Demo (Deterministic)
- Run fill benchmark 10 times; average cycles displayed.
- Bars show ARM faster for float-heavy kernel, RISC-V close for integer-only.
3.7.3 Failure Demo (Deterministic)
- Build ARM with -O0 and RISC-V with -O2.
- Results show exaggerated ARM slowness.
- Fix: use identical optimization flags.
4. Solution Architecture
4.1 High-Level Design
[Kernel Runner] -> [Cycle Counter] -> [Results Store] -> [LCD Dashboard]
4.2 Key Components
| Component | Responsibility | Key Decisions |
|---|---|---|
| Kernel suite | Workloads | Fill, blit, memcpy |
| Timer | Cycle measurement | DWT or mcycle |
| Renderer | Display results | Bar charts + numbers |
4.3 Data Structures (No Full Code)
typedef struct { uint8_t id; uint32_t cycles; float fps; } bench_result_t;
4.4 Algorithm Overview
Key Algorithm: Benchmark Loop
- Warm-up run.
- Execute kernel N times.
- Record cycles and compute the average (see the metrics sketch below).
- Display results.
Complexity Analysis:
- Time: O(N * kernel)
- Space: O(num_kernels)
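A sketch of turning the averaged cycles into the displayed metrics. The struct matches Section 4.3; FRAME_PIXELS and CLK_HZ are assumed values for a full-screen fill at a fixed clock:

#include <stdint.h>

/* Result record from Section 4.3 */
typedef struct { uint8_t id; uint32_t cycles; float fps; } bench_result_t;

/* Assumed constants: full-screen fill size and the fixed benchmark clock */
#define FRAME_PIXELS (172u * 320u)
#define CLK_HZ       150000000u          /* replace with the clock you actually lock to */

bench_result_t make_result(uint8_t id, uint32_t avg_cycles) {
    bench_result_t r = { id, avg_cycles, 0.0f };
    if (avg_cycles > 0)
        r.fps = (float)CLK_HZ / (float)avg_cycles;          /* one kernel run = one frame */
    return r;
}

static inline float cycles_per_pixel(uint32_t cycles) {
    return (float)cycles / (float)FRAME_PIXELS;
}

static inline float mbytes_per_sec(uint32_t cycles) {
    /* RGB565 = 2 bytes/pixel; MB/s = bytes * clk_hz / cycles / 1e6 */
    return (FRAME_PIXELS * 2.0f) * (float)CLK_HZ / (float)cycles / 1e6f;
}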
5. Implementation Guide
5.1 Development Environment Setup
# Install ARM + RISC-V toolchains
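One possible starting point on a Debian/Ubuntu host; the package names and SDK path below are assumptions, so confirm which RISC-V toolchain the pico-sdk documentation actually recommends for RP2350:
# ARM bare-metal GCC (Debian/Ubuntu package name)
sudo apt install cmake gcc-arm-none-eabi
# RISC-V GCC: a distro cross-compiler may work if it ships rv32 multilibs;
# otherwise use the prebuilt toolchain linked from the pico-sdk documentation
sudo apt install gcc-riscv64-unknown-elf
# Tell the build where the SDK lives (assumed checkout location)
export PICO_SDK_PATH=~/pico-sdk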
5.2 Project Structure
benchmark/
- src/
- kernels.c
- bench.c
- main.c
5.3 The Core Question You’re Answering
“How does ISA choice change real-world performance on the same silicon?”
5.4 Concepts You Must Understand First
- Fair benchmark methodology
- Cycle counters and timing
- ISA differences (FPU, instruction density)
5.5 Questions to Guide Your Design
- Which kernels represent real graphics workloads?
- How will you ensure identical conditions?
- How many runs are needed for stable averages?
5.6 Thinking Exercise
List all variables that must be fixed for a fair benchmark.
5.7 The Interview Questions They’ll Ask
- Why is benchmarking hard?
- What does instruction density affect?
- Why disable interrupts during measurement?
5.8 Hints in Layers
- Hint 1: Start with memcpy and fill.
- Hint 2: Use a fixed clock and disable interrupts.
- Hint 3: Display results on LCD to avoid serial overhead.
5.9 Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| ISA basics | “Computer Organization and Design” | Ch. 1-3 |
| Performance | “Computer Architecture” | Ch. 1 |
5.10 Implementation Phases
Phase 1: Dual Toolchain (3-4 days)
Goals: Build for ARM and RISC-V. Tasks: Setup CMake toolchain options. Checkpoint: Both binaries boot.
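One way to wire the -DCPU switch from Section 3.7.1 onto the pico-sdk's platform selection (a sketch; confirm the platform names against your SDK version):
# CMakeLists.txt fragment
cmake_minimum_required(VERSION 3.13)
if(CPU STREQUAL "RISCV")
    set(PICO_PLATFORM rp2350-riscv)
else()
    set(PICO_PLATFORM rp2350-arm-s)
endif()
include(pico_sdk_import.cmake)
project(benchmark C CXX ASM)
pico_sdk_init()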
Phase 2: Kernel Suite (4-5 days)
Goals: Implement fill, blit, memcpy. Tasks: Add timing hooks. Checkpoint: Stable cycle counts.
Phase 3: Dashboard (4-5 days)
Goals: Visualize results. Tasks: Render bar chart and numbers. Checkpoint: Results readable on LCD.
5.11 Key Implementation Decisions
| Decision | Options | Recommendation | Rationale |
|---|---|---|---|
| Kernel types | Synthetic vs real | Mix | Balanced insight |
| Output | Serial vs LCD | LCD | Avoid overhead |
6. Testing Strategy
6.1 Test Categories
| Category | Purpose | Examples |
|---|---|---|
| Unit Tests | Kernel correctness | Verify fill pattern |
| Integration Tests | Timing | Repeatability checks |
| Regression Tests | Toolchain | ARM vs RISC-V build checks |
6.2 Critical Test Cases
- Repeatability: 10 runs within +/-2%.
- Correctness: kernel output matches expected buffer.
- Fairness: identical clock + flags.
6.3 Test Data
Kernel: fill 172x320 with 0xF800
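A correctness check to pair with that test data (sketch; the framebuffer pointer and pixel count are whatever your harness exposes):

#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

/* Did the fill kernel write 0xF800 (red in RGB565) to every pixel? */
bool test_fill_pattern(const uint16_t *fb, size_t pixels) {
    for (size_t i = 0; i < pixels; i++)
        if (fb[i] != 0xF800u)
            return false;
    return true;
}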
7. Common Pitfalls & Debugging
7.1 Frequent Mistakes
| Pitfall | Symptom | Solution |
|---|---|---|
| Different flags | Skewed results | Match build flags |
| Interrupt noise | High variance | Disable interrupts |
| Optimized-away kernel | Near-zero cycles | Use volatile checksum |
7.2 Debugging Strategies
- Display a checksum of rendered buffers (see the sketch after this list).
- Repeat tests and log variance.
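A minimal checksum sketch (framebuffer pointer and size are assumptions):

#include <stdint.h>
#include <stddef.h>

/* Fold the rendered RGB565 buffer into one word: identical kernels must produce the
   same value on both ISAs, and touching every pixel keeps the compiler from dropping work. */
uint32_t framebuffer_checksum(const uint16_t *fb, size_t pixels) {
    uint32_t sum = 0;
    for (size_t i = 0; i < pixels; i++)
        sum = sum * 31u + fb[i];
    return sum;     /* show this next to the cycle count on the LCD */
}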
7.3 Performance Traps
- Printing over serial inside the timed region; UART output dominates the runtime.
8. Extensions & Challenges
8.1 Beginner Extensions
- Add a simple integer-only kernel.
8.2 Intermediate Extensions
- Add float-heavy kernel to compare FPU impact.
8.3 Advanced Extensions
- Add cache/memory bandwidth tests if supported.
9. Real-World Connections
9.1 Industry Applications
- MCU selection for graphics products
9.2 Related Open Source Projects
- Benchmark suites for microcontrollers
9.3 Interview Relevance
- Benchmark methodology and ISA trade-offs are advanced interview topics.
10. Resources
10.1 Essential Reading
- RP2350 datasheet (cycle counters)
- RISC-V privileged spec
10.2 Video Resources
- Performance measurement talks
10.3 Tools & Documentation
- ARM and RISC-V toolchains
10.4 Related Projects in This Series
- Project 10 for toolchain details.
11. Self-Assessment Checklist
11.1 Understanding
- I can explain fairness in benchmarks.
- I can read cycle counters on both ISAs.
11.2 Implementation
- Benchmarks run on ARM and RISC-V.
- Results are stable and displayed on LCD.
11.3 Growth
- I can explain ISA trade-offs in an interview.
12. Submission / Completion Criteria
Minimum Viable Completion:
- One kernel benchmark with cycles reported on LCD.
Full Completion:
- Multiple kernels and side-by-side comparison.
Excellence (Going Above & Beyond):
- Detailed report with variance and analysis.