Project 5: Dual-Core Rendering Engine

Use both cores to render faster: one core composes graphics while the other streams frames and handles input.

Quick Reference

| Attribute | Value |
|-----------|-------|
| Difficulty | Level 4: Expert |
| Time Estimate | 2-4 weeks |
| Main Programming Language | C |
| Alternative Programming Languages | Rust |
| Coolness Level | Level 5: Pure Magic |
| Business Potential | 2. The “Diagnostics” Level |
| Prerequisites | Projects 1-3, DMA display pipeline, basic multicore concepts |
| Key Topics | Multicore synchronization, work queues, memory barriers |

1. Learning Objectives

By completing this project, you will:

  1. Split rendering and display tasks across two cores.
  2. Implement a lock-free or low-lock work queue for rendering tasks.
  3. Use memory barriers to ensure safe buffer swaps.
  4. Measure scaling benefits and identify multicore bottlenecks.
  5. Build a deterministic rendering pipeline that avoids races.

2. All Theory Needed (Per-Concept Breakdown)

2.1 Multicore Synchronization and Work Queues

Fundamentals

When two cores operate on shared data, they must coordinate to prevent races. A work queue is a data structure that allows one core to produce tasks and another to consume them. On microcontrollers, you often use ring buffers with head/tail indices. The challenge is updating shared indices safely without heavy locks. If both cores write to the same buffer without coordination, you will corrupt memory or lose tasks. The simplest solution uses a single producer/single consumer ring with atomic index updates and memory barriers.

Deep Dive into the concept

A single-producer single-consumer (SPSC) ring buffer is ideal for a dual-core rendering pipeline: Core 0 produces render commands (draw line, blit sprite) and Core 1 consumes them to build the frame or push to DMA. The ring buffer uses a fixed-size array, with head (write index) and tail (read index). The producer advances head after writing a task; the consumer advances tail after reading. The critical detail is memory ordering: you must ensure that the task data is written before the head index is updated, and the consumer must read the task after observing the head update. This is done with memory barriers or atomic operations.

You also need a backpressure mechanism. If the queue is full, the producer must wait or drop tasks. For rendering, waiting is safer. Another consideration is task granularity. Too fine-grained (one task per pixel) creates overhead; too coarse-grained (one task per frame) underutilizes the consumer. A balanced approach is to batch tasks per sprite or per rectangle. You can also use double-buffered queues: while one queue is being consumed, the producer fills the other. This yields deterministic frame boundaries.
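
For instance, the frame-boundary flip in the double-buffered variant might look like the sketch below, assuming the pico-sdk inter-core FIFO; task_t and QSIZE match the minimal example later in this section:

#include "pico/multicore.h"

static task_t q[2][QSIZE];
static uint32_t count[2];   // tasks written into each queue (Core 1 reads count[done])
static uint32_t fill_idx;   // queue Core 0 is currently filling

void end_of_frame(void) {   // Core 0, at a frame boundary
  uint32_t done = fill_idx;
  __dmb();                  // publish all queued tasks first
  fill_idx ^= 1;            // producer switches to the other queue
  multicore_fifo_push_blocking(done);  // tell Core 1 which queue to drain
}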

How this fits into the projects

Work queues are used in Section 4.2 and Section 5.10 Phase 2. They are required for Project 13 (task scheduling) and helpful for Project 9 (system monitor).

Definitions & key terms

  • Race condition -> Outcome depends on timing of concurrent access.
  • SPSC queue -> Single producer, single consumer queue.
  • Memory barrier -> Prevents compiler/CPU reordering.
  • Backpressure -> Producer slows when consumer lags.

Mental model diagram (ASCII)

Core0 (producer) -> [Ring Buffer] -> Core1 (consumer)
     write tasks                     read tasks

How it works (step-by-step)

  1. Producer writes task into buffer at head index.
  2. Producer issues memory barrier.
  3. Producer increments head.
  4. Consumer checks head != tail.
  5. Consumer reads task, then increments tail.

Failure modes:

  • No barriers -> consumer reads stale or partial task.
  • Full queue -> overwrite unread tasks.
  • Poor granularity -> overhead or underutilization.

Minimal concrete example

// SPSC enqueue (Core 0 is the only producer)
#define QSIZE 64
static task_t queue[QSIZE];
static volatile uint32_t head, tail;  // write / read indices

bool enqueue(task_t t) {
  uint32_t next = (head + 1) % QSIZE;
  if (next == tail) return false;     // full: caller applies backpressure
  queue[head] = t;
  __dmb();                            // publish the task before moving head
  head = next;
  return true;
}
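
The consumer side is symmetric; a minimal sketch, assuming the same queue, head, and tail as above:

// SPSC dequeue (Core 1 is the only consumer)
bool dequeue(task_t *out) {
  if (tail == head) return false;     // empty
  __dmb();                            // observe head before reading the slot
  *out = queue[tail];
  __dmb();                            // finish the copy before freeing the slot
  tail = (tail + 1) % QSIZE;
  return true;
}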

Common misconceptions

  • “Two cores automatically speed everything up.” -> Not if synchronization is poor.
  • “Locks are always bad.” -> Sometimes a small lock is fine.

Check-your-understanding questions

  1. Why is an SPSC queue simpler than a general queue?
  2. What does a memory barrier guarantee?
  3. What happens if the queue is full?

Check-your-understanding answers

  1. Only one writer and one reader simplifies concurrency.
  2. It prevents reordering of writes/reads across the barrier.
  3. Tasks are dropped or producer must wait.

Real-world applications

  • Audio pipelines with producer/consumer cores
  • Networking stacks on dual-core MCUs

Where you’ll apply it

  • This project: Section 4.2, Section 5.10 Phase 2
  • Also used in: Project 13, Project 9

References

  • “Making Embedded Systems” concurrency sections
  • RP2350 multicore SDK docs

Key insights

A good queue design determines whether multicore is a win or a mess.

Summary

Use simple SPSC queues with barriers to safely coordinate between cores.

Homework/Exercises to practice the concept

  1. Implement a ring buffer and test with two cores.
  2. Add counters for dropped tasks and log them.
  3. Experiment with task granularity (per sprite vs per frame).

Solutions to the homework/exercises

  1. Use head/tail indices and barriers.
  2. Increment a counter when enqueue fails.
  3. Per sprite is usually a good balance.

2.2 Memory Barriers and Cache/Bus Ordering

Fundamentals

Even on microcontrollers without caches, the CPU and compiler can reorder memory operations. A memory barrier ensures that writes complete before later operations are visible to other cores. Without barriers, one core may see stale data or partially written structures. This is critical when swapping framebuffers or signaling DMA completion.

Deep Dive into the concept

Memory ordering is subtle. Compilers can reorder instructions for optimization, and the CPU may buffer writes. On a single core, this rarely matters; on two cores, it can cause invisible bugs. For example, Core 0 might write a new framebuffer pointer, then set a “ready” flag. If the writes are reordered, Core 1 may see the flag before the pointer update, causing it to read the old buffer. The solution is to use memory barriers (__dmb() on ARM) or atomic operations that enforce ordering. These are small but critical details in multicore firmware.

In addition to barriers, you should consider bus arbitration. Two cores and DMA may contend for memory bandwidth. This is not just a performance issue; it can cause jitter. You can measure this by toggling GPIOs around render loops and DMA operations, then observing timing. If you see jitter, you may need to reduce concurrent memory access or stagger operations.
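
One way to make that jitter visible on a scope; a sketch assuming the pico-sdk, with DEBUG_PIN and render_frame() as placeholder names:

#include "pico/stdlib.h"
#define DEBUG_PIN 15  // any free GPIO

void setup_debug_pin(void) {
  gpio_init(DEBUG_PIN);
  gpio_set_dir(DEBUG_PIN, GPIO_OUT);
}

void render_frame_instrumented(void) {
  gpio_put(DEBUG_PIN, 1);  // rising edge: render start
  render_frame();          // placeholder: your render loop
  gpio_put(DEBUG_PIN, 0);  // pulse width on the scope = render time
}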

How this fits into the projects

Barriers are required in Section 5.10 and Section 4.4. They also apply to Project 3 (DMA) and Project 13 (context switching).

Definitions & key terms

  • Memory barrier (DMB) -> Ensures ordering of memory operations.
  • Atomic -> Operation that is indivisible and ordered.
  • Bus contention -> Multiple masters accessing memory.

Mental model diagram (ASCII)

Core0: write data -> [barrier] -> set flag
Core1: wait flag -> [barrier] -> read data

How it works (step-by-step)

  1. Core0 writes buffer data.
  2. Core0 issues barrier.
  3. Core0 updates shared flag.
  4. Core1 sees flag, issues barrier.
  5. Core1 reads buffer safely.

Failure modes:

  • No barrier -> stale data.
  • Excess barriers -> performance loss.

Minimal concrete example

// Shared: volatile uint16_t *front, *back; volatile bool frame_ready;
// Core 0: publish the new buffer, then signal
front = back;        // point the consumer at the freshly rendered buffer
__dmb();             // make the pointer write visible before the flag
frame_ready = true;  // only now signal Core 1
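
The consumer mirrors this ordering; a sketch assuming the same shared variables (start_dma() is a hypothetical helper):

// Core 1: wait for the signal, then read
while (!frame_ready) { tight_loop_contents(); }  // spin until signaled
__dmb();                  // order the flag read before the pointer read
start_dma(front);         // hypothetical: stream the new front buffer
frame_ready = false;      // acknowledge for the next frame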

Common misconceptions

  • “No cache means no ordering issues.” -> Reordering still exists.
  • “Barriers are only for OS kernels.” -> They matter in firmware too.

Check-your-understanding questions

  1. Why do you need a barrier after writing shared data?
  2. What is bus contention?
  3. When can too many barriers hurt performance?

Check-your-understanding answers

  1. To ensure all writes are visible before signaling.
  2. Multiple masters accessing memory simultaneously.
  3. Barriers serialize operations and slow throughput.

Real-world applications

  • Shared-memory multicore systems
  • DMA synchronization in embedded systems

Where you’ll apply it

  • This project: Section 4.4, Section 5.10
  • Also used in: Project 3, Project 13

References

  • ARM memory barrier documentation
  • RP2350 multicore docs

Key insights

Correct ordering is the hidden foundation of reliable multicore code.

Summary

Use barriers at buffer swaps and shared flags to prevent races.

Homework/Exercises to practice the concept

  1. Remove barriers and observe occasional rendering glitches.
  2. Add barriers and confirm stability.
  3. Measure timing jitter with GPIO toggles.

Solutions to the homework/exercises

  1. You may see intermittent wrong frames.
  2. Barriers restore consistent behavior.
  3. Jitter indicates contention; adjust workload.

3. Project Specification

3.1 What You Will Build

A dual-core rendering pipeline where Core 0 composes scene updates into render tasks and Core 1 executes them into a back buffer, streaming the front buffer to the LCD via DMA and coordinating buffer swaps safely.

3.2 Functional Requirements

  1. Core partitioning: Core 0 composes render tasks; Core 1 rasterizes them and handles DMA + display.
  2. Work queue: SPSC ring buffer for render tasks.
  3. Buffer swap: atomic, tear-free swaps on DMA completion.
  4. Metrics: FPS and CPU usage per core displayed.

3.3 Non-Functional Requirements

  • Performance: 40+ FPS with simple scenes.
  • Reliability: No races or tearing over 1-hour run.
  • Usability: Clear logging of dropped tasks.

3.4 Example Usage / Output

Core0: Render 5.2 ms
Core1: DMA 22.1 ms
FPS: 45
Dropped tasks: 0

3.5 Data Formats / Schemas / Protocols

  • Render task struct with opcode and parameters
  • Shared ring buffer indexes

3.6 Edge Cases

  • Queue full under heavy load
  • Core 1 stalls DMA due to SPI errors
  • Buffer swap during partial render

3.7 Real World Outcome

The LCD shows a smooth animation while a status overlay displays per-core metrics. CPU load is balanced and no tearing is visible.

3.7.1 How to Run (Copy/Paste)

cd LEARN_RP2350_LCD_DEEP_DIVE/dual_core_render
mkdir -p build
cd build
cmake ..
make -j4
cp dual_core_render.uf2 /Volumes/RP2350

3.7.2 Golden Path Demo (Deterministic)

  • Scene: two sprites moving with fixed velocities.
  • Core0 render time stable within +/-0.5 ms.
  • Core1 DMA time stable, no tearing.

3.7.3 Failure Demo (Deterministic)

  • Remove memory barrier around buffer swap.
  • Expected: occasional corrupted frames.
  • Fix: restore barrier.

4. Solution Architecture

4.1 High-Level Design

Core0: Compose scene -> Render Tasks -> [Work Queue]
Core1: Consume Queue -> Back Buffer -> Swap -> Front Buffer -> DMA -> LCD

4.2 Key Components

| Component | Responsibility | Key Decisions |
|-----------|----------------|---------------|
| Work queue | Task passing | SPSC ring buffer |
| Sync primitives | Prevent races | Barriers + flags |
| DMA controller | Stream frames | DREQ paced |

4.3 Data Structures (No Full Code)

typedef struct {
  uint8_t  op;          // opcode: fill rect, blit sprite, draw text, ...
  int16_t  x, y, w, h;  // target rectangle
  uint16_t color;       // RGB565 color
} task_t;
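
Filling and submitting one task might then look like this (OP_FILL_RECT and dropped_tasks are assumed names):

task_t t = { .op = OP_FILL_RECT, .x = 10, .y = 20,
             .w = 64, .h = 32, .color = 0xF800 };  // 0xF800 = red in RGB565
if (!enqueue(t)) dropped_tasks++;                  // backpressure: count drops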

4.4 Algorithm Overview

Key Algorithm: Task Consumption

  1. Core1 checks queue for tasks.
  2. Executes render tasks on back buffer.
  3. When frame done, swaps buffers and triggers DMA.
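
Put together, Core 1's loop might look like the sketch below, assuming the queue from Section 2.1; OP_FRAME_END, execute_task(), and swap_and_start_dma() are assumed names:

void core1_main(void) {
  for (;;) {
    task_t t;
    while (dequeue(&t)) {             // drain all pending render tasks
      if (t.op == OP_FRAME_END) {     // frame boundary marker
        swap_and_start_dma();         // swap buffers, kick DMA to the LCD
      } else {
        execute_task(&t);             // draw into the back buffer
      }
    }
    __wfe();  // optional: sleep until Core 0 signals new work with __sev()
  }
}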

Complexity Analysis:

  • Time: O(tasks per frame)
  • Space: O(queue size)

5. Implementation Guide

5.1 Development Environment Setup

# Use pico-sdk multicore examples
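
If you start from a fresh pico-sdk project, the build wiring is roughly the following CMake fragment (target and file names match Section 5.2; adjust to your layout):

# CMakeLists.txt fragment
add_executable(dual_core_render
  src/main.c src/core0_render.c src/core1_display.c src/queue.c)
target_link_libraries(dual_core_render
  pico_stdlib pico_multicore hardware_dma hardware_spi)
pico_add_extra_outputs(dual_core_render)  # emits the .uf2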

5.2 Project Structure

dual_core_render/
- src/
  - core0_render.c
  - core1_display.c
  - queue.c
  - main.c

5.3 The Core Question You’re Answering

“How can I use both cores without corrupting shared graphics data?”

5.4 Concepts You Must Understand First

  1. SPSC queue design
  2. Memory barriers and atomic flags
  3. DMA buffer swap timing

5.5 Questions to Guide Your Design

  1. What tasks should be in the queue?
  2. How will you detect queue overflow?
  3. When do you swap buffers?

5.6 Thinking Exercise

Draw a timeline of Core0 render vs Core1 DMA over 3 frames.

5.7 The Interview Questions They’ll Ask

  1. Why use an SPSC queue?
  2. What is a memory barrier?
  3. How do you prevent tearing in multicore systems?

5.8 Hints in Layers

  • Hint 1: Start with one task type (fill rectangle).
  • Hint 2: Add a render complete flag.
  • Hint 3: Swap buffers only after DMA done.

5.9 Books That Will Help

| Topic | Book | Chapter |
|-------|------|---------|
| Concurrency | “Making Embedded Systems” | Ch. 10 |

5.10 Implementation Phases

Phase 1: Multicore Bring-up (1 week)

Goals: Run code on both cores.
Tasks: Start Core 1 and blink an LED on it.
Checkpoint: Both cores run independently.
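
A minimal bring-up sketch with the pico-sdk (core1_entry is your choice of entry function):

#include "pico/stdlib.h"
#include "pico/multicore.h"

static void core1_entry(void) {
  for (;;) {
    // Core 1 work goes here (e.g. toggle an LED)
  }
}

int main(void) {
  stdio_init_all();
  multicore_launch_core1(core1_entry);  // start Core 1 at its entry point
  for (;;) {
    // Core 0 work goes here
  }
}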

Phase 2: Work Queue (1 week)

Goals: Pass render tasks safely.
Tasks: Implement the SPSC queue with barriers.
Checkpoint: Tasks are consumed correctly.
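
A checkpoint sketch: Core 0 pushes a numbered stream and Core 1 verifies order, assuming the queue from Section 2.1:

// Core 0: push 1000 numbered tasks
for (uint32_t i = 0; i < 1000; i++) {
  task_t t = { .op = 0, .x = (int16_t)i };
  while (!enqueue(t)) tight_loop_contents();  // wait while full
}

// Core 1: verify count and order
for (uint32_t expected = 0; expected < 1000; ) {
  task_t t;
  if (dequeue(&t)) {
    if (t.x != (int16_t)expected)
      printf("out of order at %u\n", (unsigned)expected);
    expected++;
  }
}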

Phase 3: Full Pipeline (1-2 weeks)

Goals: Render and display concurrently.
Tasks: Integrate DMA and the buffer swap.
Checkpoint: Smooth animations, stable FPS.
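
A sketch of the swap-on-DMA-complete pattern with the pico-sdk (channel setup omitted; swap_buffers() and restart_frame_dma() are placeholders):

#include "hardware/dma.h"
#include "hardware/irq.h"

static int dma_chan;                  // claimed elsewhere via dma_claim_unused_channel()
extern void swap_buffers(void);       // placeholder: flip front/back pointers
extern void restart_frame_dma(void);  // placeholder: stream the new front buffer

static void dma_irq_handler(void) {
  dma_hw->ints0 = 1u << dma_chan;  // acknowledge this channel's interrupt
  swap_buffers();
  restart_frame_dma();
}

void install_dma_irq(void) {
  dma_channel_set_irq0_enabled(dma_chan, true);
  irq_set_exclusive_handler(DMA_IRQ_0, dma_irq_handler);
  irq_set_enabled(DMA_IRQ_0, true);
}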

5.11 Key Implementation Decisions

| Decision | Options | Recommendation | Rationale |
|----------|---------|----------------|-----------|
| Queue type | Lock-free vs mutex | Lock-free SPSC | Simpler, faster |
| Swap timing | Timer vs DMA IRQ | DMA IRQ | Safe boundary |


6. Testing Strategy

6.1 Test Categories

| Category | Purpose | Examples |
|----------|---------|----------|
| Unit Tests | Queue logic | enqueue/dequeue tests |
| Integration Tests | Multicore sync | buffer swap tests |
| Stress Tests | Long run | 1-hour animation |

6.2 Critical Test Cases

  1. Queue Full: producer blocks or drops with log.
  2. Buffer Swap: no tearing under heavy load.
  3. Race Injection: disable barrier and see failure.

6.3 Test Data

Task stream: 100 rectangles, 10 sprites, 1 text update per frame

7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

| Pitfall | Symptom | Solution |
|---------|---------|----------|
| Missing barrier | Corrupted frames | Add DMB before flags |
| Queue overflow | Dropped tasks | Increase queue size |
| DMA + render conflict | Tearing | Swap only on DMA done |

7.2 Debugging Strategies

  • Toggle GPIOs around queue operations to measure timing.
  • Log queue depth and dropped tasks.

7.3 Performance Traps

  • Too many fine-grained tasks create overhead.

8. Extensions & Challenges

8.1 Beginner Extensions

  • Add a second queue for input events.

8.2 Intermediate Extensions

  • Use DMA chaining with multicore swaps.

8.3 Advanced Extensions

  • Implement a render command compiler.

9. Real-World Connections

9.1 Industry Applications

  • Dual-core microcontroller UI pipelines
  • Real-time dashboards with sensor data
  • LVGL multicore discussions

9.2 Interview Relevance

  • Concurrency, memory ordering, and data races are key topics.

10. Resources

10.1 Essential Reading

  • RP2350 multicore documentation
  • ARM memory barrier docs

10.2 Video Resources

  • Concurrency basics for embedded systems

10.3 Tools & Documentation

  • Logic analyzer or scope for timing

11. Self-Assessment Checklist

11.1 Understanding

  • I can explain SPSC queue behavior.
  • I can place memory barriers correctly.

11.2 Implementation

  • Dual-core rendering runs for 1 hour without errors.
  • FPS and per-core metrics display correctly.

11.3 Growth

  • I can explain multicore design choices in an interview.

12. Submission / Completion Criteria

Minimum Viable Completion:

  • Two cores run distinct tasks without conflicts.

Full Completion:

  • Render pipeline stable with DMA and no tearing.

Excellence (Going Above & Beyond):

  • Dynamic load balancing between cores.