ARM ASSEMBLY LEARNING PROJECTS
ARM Assembly Programming: Deep Understanding Through Projects
Goal: Master ARM assembly language from bare-metal to operating system level—understanding not just how to write assembly, but why processors work the way they do, how hardware communicates with software, and why this knowledge is the foundation of all systems programming.
Why ARM Assembly Matters
ARM processors power over 99% of smartphones, run in billions of embedded devices, and are now challenging x86 in laptops and servers (Apple M-series, AWS Graviton, Microsoft Surface). Understanding ARM assembly isn’t just academic—it’s understanding the dominant computing architecture of the 21st century.
But more importantly: ARM assembly teaches you how computers really work.
When you write ARM assembly, you’re not hiding behind abstractions. You’re directly:
- Loading values from memory addresses
- Manipulating CPU registers
- Controlling peripheral hardware through memory-mapped I/O
- Managing interrupts and exceptions
- Understanding every single clock cycle your code consumes
This knowledge transforms how you write all code, in any language.
What High-Level Languages Hide From You
High-level code: What actually happens at the CPU:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
let x = a + b;        →   LDR r0, [r7, #8]      // Load 'a' from stack
                          LDR r1, [r7, #12]     // Load 'b' from stack
                          ADD r0, r0, r1        // Add them in register
                          STR r0, [r7, #4]      // Store result as 'x'
gpio.set_high(25);    →   LDR r0, =0xD0000000   // Load SIO base address
                          MOV r1, #1            // Prepare bit value
                          LSL r1, r1, #25       // Shift to bit 25
                          STR r1, [r0, #0x14]   // Write to GPIO_OUT_SET
print("Hello");       →   (Hundreds of instructions for UART init,
                          character transmission, buffer management,
                          interrupt handling, etc.)
Every line of Python, Rust, or C++ ultimately executes as instruction sequences like these, whether it is compiled ahead of time or run by an interpreter. When you understand assembly, you understand what your code actually does.
The ARM Landscape: Understanding the Family Tree
ARM isn’t one architecture—it’s a family of related architectures optimized for different use cases:
ARM Architecture Evolution:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
┌─────────────────────────────────────────────┐
│ ARM Holdings (IP owner) │
│ Designs architectures, licenses to others │
└─────────────────────┬───────────────────────┘
│
┌────────────────────────────────────────┼────────────────────────────┐
│ │ │
▼ ▼ ▼
┌──────────────┐ ┌────────────────┐ ┌────────────────┐
│ M-Profile │ │ A-Profile │ │ R-Profile │
│ Microcontrollers │ Applications │ │ Real-Time │
│ (Embedded) │ │ (Phones, PCs) │ │ (Automotive) │
└──────────────┘ └────────────────┘ └────────────────┘
│ │
┌──────┴──────┐ ┌────────────┼────────────┐
│ │ │ │ │
▼ ▼ ▼ ▼ ▼
┌───────┐ ┌────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐
│Cortex │ │Cortex │ │Cortex-A7│ │Cortex-A │ │Cortex-A │
│ -M0+ │ │-M4/M7 │ │Cortex-A9│ │53/55/72 │ │76/78/X │
│ │ │ │ │(32-bit) │ │(64-bit) │ │(64-bit) │
└───────┘ └────────┘ └─────────┘ └─────────┘ └─────────┘
│ │ │ │ │
│ │ │ │ │
Thumb Thumb-2 ARM32 AArch64 AArch64
only + DSP + Thumb + NEON + NEON
+ FPU
┌──────────────────────────────────────────────────────────────────────────────┐
│ YOUR TARGETS: │
│ │
│ Raspberry Pi Pico (RP2040) Raspberry Pi 3/4/5 │
│ ├─ Dual Cortex-M0+ cores ├─ Cortex-A53/A72/A76 cores │
│ ├─ ARMv6-M architecture ├─ ARMv8-A architecture (AArch64) │
│ ├─ Thumb instruction set ├─ A64 instruction set │
│ ├─ 16 registers (r0-r15) ├─ 31 registers (x0-x30) │
│ ├─ 133 MHz max clock ├─ 1.2-2.4 GHz clock │
│ └─ 264 KB RAM, no OS └─ 1-8 GB RAM, Linux capable │
└──────────────────────────────────────────────────────────────────────────────┘
Key insight: The Pico and the Raspberry Pi 4 speak different assembly languages. Thumb code written for the Cortex-M0+ is not valid A64 code for a Cortex-A72 running in AArch64 state, and A64 code will not run on the Cortex-M0+. You’re learning two dialects of the same language family.
ARM Cortex-M Architecture Deep Dive (Raspberry Pi Pico)
The Cortex-M series is designed for microcontrollers: simple, low-power, deterministic. The M0+ in the Pico is the simplest member of this family.
The Register Model: Your Working Memory
Cortex-M0+ Register File:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
32 bits wide
◀──────────────────▶
┌───────────────────────────────┐
r0 │ General Purpose (argument 1) │ ─┐
├───────────────────────────────┤ │
r1 │ General Purpose (argument 2) │ │ Low registers:
├───────────────────────────────┤ │ - All Thumb instructions work
r2 │ General Purpose (argument 3) │ │ - Used for function arguments
├───────────────────────────────┤ │ and return values
r3 │ General Purpose (argument 4) │ │ - Caller-saved (scratch)
├───────────────────────────────┤ │
r4 │ General Purpose (preserved) │ │
├───────────────────────────────┤ │
r5 │ General Purpose (preserved) │ │
├───────────────────────────────┤ │
r6 │ General Purpose (preserved) │ │
├───────────────────────────────┤ │
r7 │ General Purpose (frame ptr) │ ─┘
├───────────────────────────────┤
r8 │ General Purpose (preserved) │ ─┐ High registers:
├───────────────────────────────┤ │ - Only some instructions work
r9 │ General Purpose (preserved) │ │ - Must move to low reg for
├───────────────────────────────┤ │ most operations
r10 │ General Purpose (preserved) │ │ - Callee-saved
├───────────────────────────────┤ │
r11 │ General Purpose (preserved) │ │
├───────────────────────────────┤ │
r12 │ Intra-Procedure Call scratch │ ─┘
├═══════════════════════════════┤ ──── SPECIAL REGISTERS ────
r13 │ Stack Pointer (SP) │ Points to top of stack
├───────────────────────────────┤ (actually 2 SPs: MSP and PSP)
r14 │ Link Register (LR) │ Return address for functions
├───────────────────────────────┤
r15 │ Program Counter (PC) │ Address of next instruction
└───────────────────────────────┘
┌───────────────────────────────┐
xPSR│ N│Z│C│V│ ... │ Exception # │ Program Status Register:
└─┬─┴─┴─┴─────────────────────┴─┘ N = Negative, Z = Zero
│ C = Carry, V = Overflow
└─ Condition flags set by
arithmetic operations
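IMPORTANT: Cortex-M0+ cannot load an arbitrary 32-bit constant or address
with a single immediate instruction (there is no MOVW/MOVT). Use a literal
pool, which is read with a PC-relative load (LDR Rt, =value), or build the
value manually!
Why do most instructions only reach r0-r7? Because 16-bit Thumb encodings use 3-bit register fields (2³ = 8 values), which can name only the low registers. The architecture still has 16 registers, but r8-r15 are reachable only through a few special instruction forms (such as MOV, ADD, CMP, and BX). This constraint forces efficient register usage.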
The Memory Map: Where Everything Lives
Cortex-M Memory Map (4GB address space):
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
0xFFFFFFFF ┌─────────────────────────────────────────┐
│ Vendor-Specific │
0xE0100000 ├─────────────────────────────────────────┤
│ Private Peripheral Bus │ ← NVIC lives here
│ (Internal peripherals) │ at 0xE000E000
0xE0000000 ├─────────────────────────────────────────┤
│ │
│ External Device │ ← Memory-mapped
│ (Peripherals, etc.) │ devices
│ │
0xA0000000 ├─────────────────────────────────────────┤
│ │
│ External RAM │
│ │
0x60000000 ├─────────────────────────────────────────┤
│ │
│ Peripheral │ ← GPIO, UART, SPI,
│ (On-chip I/O) │ I2C, PWM, etc.
│ │
0x40000000 ├─────────────────────────────────────────┤
│ │
│ SRAM │ ← Variables, stack,
│ (On-chip RAM) │ heap
│ │
0x20000000 ├─────────────────────────────────────────┤
│ │
│ Code │ ← Flash/ROM with
│ (Flash/ROM) │ your program
│ │
0x00000000 └─────────────────────────────────────────┘
RP2040-Specific Memory Map:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Address │ Size │ Contents
────────────────┼────────────┼─────────────────────────────────────
0x10000000 │ 2 MB │ External Flash (XIP)
│ │ ↳ Your code runs from here
────────────────┼────────────┼─────────────────────────────────────
0x20000000 │ 256 KB │ Main SRAM (4 banks × 64KB)
│ │ ↳ Variables, stack, heap
0x20040000      │ 4 KB       │ SRAM4 (small bank, e.g. a core's stack)
0x20041000      │ 4 KB       │ SRAM5 (small bank, e.g. a core's stack)
────────────────┼────────────┼─────────────────────────────────────
0x40000000 │ - │ APB Peripherals
│ │ ↳ UART, SPI, I2C, PWM...
────────────────┼────────────┼─────────────────────────────────────
0x50000000 │ - │ AHB-Lite Peripherals
│ │ ↳ DMA, USB, PIO...
────────────────┼────────────┼─────────────────────────────────────
0xD0000000 │ - │ SIO (Single-cycle I/O)
│ │ ↳ GPIO (fast access!)
────────────────┼────────────┼─────────────────────────────────────
0xE0000000 │ - │ Cortex-M0+ internal
│ │ ↳ NVIC, SysTick, Debug
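Critical insight: GPIO input and output values on the RP2040 are driven through the SIO block at 0xD0000000, which the cores reach in a single cycle, NOT through the peripheral region at 0x40000000. This is why you can toggle GPIOs so fast! Pin configuration still lives in the peripheral region: IO_BANK0 (0x40014000) must route a pin to the SIO function before SIO writes reach it.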
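To get a feel for the map, it can help to classify a few addresses by hand and then check yourself. The following is a small Python model (not target code) of the generic Cortex-M regions from the diagram above; the table name CORTEX_M_REGIONS and the function region_of are made up for this sketch.

```python
# Architectural Cortex-M regions, copied from the generic memory map above.
# Each entry is (first address, last address, region name).
CORTEX_M_REGIONS = [
    (0x00000000, 0x1FFFFFFF, "Code"),
    (0x20000000, 0x3FFFFFFF, "SRAM"),
    (0x40000000, 0x5FFFFFFF, "Peripheral"),
    (0x60000000, 0x9FFFFFFF, "External RAM"),
    (0xA0000000, 0xDFFFFFFF, "External Device"),
    (0xE0000000, 0xE00FFFFF, "Private Peripheral Bus"),
    (0xE0100000, 0xFFFFFFFF, "Vendor-Specific"),
]


def region_of(address: int) -> str:
    """Return the architectural region that contains a 32-bit address."""
    if not 0 <= address <= 0xFFFFFFFF:
        raise ValueError("address must fit in 32 bits")
    for first, last, name in CORTEX_M_REGIONS:
        if first <= address <= last:
            return name
    raise AssertionError("unreachable: the regions cover the whole 4 GB space")


if __name__ == "__main__":
    # RP2040 flash, SRAM, UART0, SIO GPIO_OUT_SET, and the NVIC area.
    for addr in (0x10000000, 0x20000000, 0x40034000, 0xD0000014, 0xE000E000):
        print(f"0x{addr:08X} -> {region_of(addr)}")
```

Note that the RP2040's SIO block at 0xD0000000 falls into the architectural "External Device" region even though it sits right next to the cores; the generic map describes address ranges, not physical distance.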
The Thumb Instruction Set: Compact and Constrained
Cortex-M0+ uses the Thumb instruction set, which encodes most instructions in 16 bits:
Thumb Instruction Encoding Examples:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
16-bit Thumb instruction: MOV r0, #42
┌──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┐
│ 0│ 0│ 1│ 0│ 0│ Rd │ imm8 (immediate) │
│ 0│ 0│ 1│ 0│ 0│0│0│0│ 0│ 0│ 1│ 0│ 1│ 0│ 1│ 0│
└──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┘
│r0 │ 42 = 0x2A
│ │
▼ ▼
Encodes to: 0x202A (little-endian: 2A 20)
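32-bit Thumb-2 instruction (Cortex-M3/M4/M7): LDR.W r0, [r1, #offset] (when the offset does not fit the 16-bit form)
(Cortex-M0+ itself has only a handful of 32-bit instructions: BL, DMB, DSB, ISB, MRS, MSR)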
┌──────────────────────────────────────────────────────────────────┐
│ First halfword (16 bits) │ Second halfword (16 bits) │
│ encoding prefix + Rn │ Rt + imm12 offset │
└──────────────────────────────────────────────────────────────────┘
Common Thumb Instructions You'll Use:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Data Movement:
MOVS Rd, #imm8        Move 8-bit immediate to register (updates flags)
MOV Rd, Rm            Move register to register
LDR Rt, [Rn, #off]    Load word from memory
STR Rt, [Rn, #off]    Store word to memory
PUSH {reglist}        Push registers to stack
POP {reglist}         Pop registers from stack
Arithmetic:
ADDS Rd, Rn, #imm3    Add 3-bit immediate
ADDS Rd, #imm8        Add 8-bit immediate to Rd
SUBS Rd, Rn, #imm3    Subtract 3-bit immediate
SUBS Rd, Rn, Rm       Subtract with flags update
Logic:
ANDS Rd, Rm           Bitwise AND
ORRS Rd, Rm           Bitwise OR
EORS Rd, Rm           Bitwise XOR (exclusive OR)
LSLS Rd, Rm, #imm5    Logical shift left
LSRS Rd, Rm, #imm5    Logical shift right
NOTE: On Cortex-M0+ these arithmetic and logic forms always update the
flags. With GNU as in .syntax unified mode they must be written with
the S suffix; the unsuffixed spellings are accepted only in the older
divided syntax.
Control Flow:
B label Unconditional branch
BEQ label Branch if equal (Z=1)
BNE label Branch if not equal (Z=0)
BL function Branch with link (function call)
BX Rm Branch to address in register
BLX Rm Branch with link to address in register
LIMITATION: Cortex-M0+ is MISSING many instructions!
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
✗ No hardware divide (UDIV, SDIV) → Must use software division
✗ No bit-field instructions (BFI) → Must use shift/mask sequences
✗ No conditional execution (IT block) → Must use branches
✗ Limited addressing modes → Can't do [Rn, Rm, LSL #2]
✗ No saturation arithmetic → Must check overflow manually
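The MOV r0, #42 encoding shown earlier can be reproduced by packing the three fields by hand. This is a minimal Python sketch of that one 16-bit encoding (opcode bits 00100, then Rd, then imm8); the function names are invented for illustration and nothing here runs on the Pico.

```python
def encode_movs_imm8(rd: int, imm8: int) -> int:
    """Encode the 16-bit Thumb 'MOVS Rd, #imm8' instruction: 00100 Rd imm8."""
    if not 0 <= rd <= 7:
        raise ValueError("Rd must be a low register (r0-r7): the field is 3 bits")
    if not 0 <= imm8 <= 0xFF:
        raise ValueError("immediate must fit in 8 bits")
    return (0b00100 << 11) | (rd << 8) | imm8


def to_memory_bytes(halfword: int) -> bytes:
    """Return the instruction as it is stored in little-endian memory."""
    return halfword.to_bytes(2, "little")


if __name__ == "__main__":
    encoding = encode_movs_imm8(0, 42)
    print(f"MOVS r0, #42 -> 0x{encoding:04X}, bytes {to_memory_bytes(encoding).hex(' ')}")
```

The two range checks are the interesting part: they are exactly the constraints that make Thumb code feel cramped. A register above r7 or a constant above 255 simply has no bit pattern in this form.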
The Boot Process: From Power-On to Your Code
Cortex-M Boot Sequence:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Power Applied
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ 1. CPU comes out of reset │
│ - All registers undefined (except SP and PC) │
│ - Processor in Thread mode, privileged │
│ - Using Main Stack Pointer (MSP) │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ 2. CPU reads address 0x00000000 (or VTOR) │
│ - Loads INITIAL STACK POINTER value │
│ - This value goes into SP/r13 │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ 3. CPU reads address 0x00000004 │
│ - Loads RESET HANDLER address │
│ - This value goes into PC/r15 │
│ - Bit 0 MUST be 1 (Thumb mode indicator) │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ 4. Execution begins at Reset_Handler │
│ - Your code starts running! │
│ - Stack is ready to use │
│ - All peripherals need initialization │
└─────────────────────────────────────────────────────────────────┘
Vector Table Structure (first 16 entries are standard Cortex-M):
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Offset │ Exception # │ Contents
─────────┼───────────────┼────────────────────────────────────────
0x0000 │ - │ Initial Stack Pointer value
0x0004 │ 1 (Reset) │ Reset_Handler address (| 1 for Thumb)
0x0008 │ 2 (NMI) │ NMI_Handler address
0x000C │ 3 (HardFault)│ HardFault_Handler address
0x0010 │ 4 │ Reserved (M0+ doesn't use)
... │ ... │ ...
0x003C │ 15 (SysTick) │ SysTick_Handler address
0x0040 │ 16 (IRQ0) │ First peripheral interrupt
0x0044 │ 17 (IRQ1) │ Second peripheral interrupt
... │ ... │ (RP2040 has 26 IRQs)
Example minimal vector table in assembly:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
.section .vectors, "a"
.align 2
.word _stack_top // 0x00: Initial SP
.word Reset_Handler + 1 // 0x04: Reset (bit 0 = Thumb)
.word NMI_Handler + 1 // 0x08: NMI
.word HardFault_Handler+1 // 0x0C: HardFault
.word 0 // 0x10: Reserved
// ... more entries ...
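NOTE: On RP2040, the boot ROM at 0x00000000 runs first. It loads the
256-byte second-stage bootloader (boot2) from the start of flash at
0x10000000; boot2 then reads the SP and reset vector from your vector
table, which conventionally follows it at 0x10000100.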
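A quick way to sanity-check a built image is to decode the first two vector table words yourself, the same two words the hardware reads at reset. The sketch below does that in Python on the host; the function name and the sample values (a stack top at the end of the RP2040's 264 KB of SRAM and a made-up handler address) are assumptions for illustration.

```python
import struct


def parse_vector_table_head(image: bytes) -> dict:
    """Decode the first two vector table words: initial SP and reset vector."""
    if len(image) < 8:
        raise ValueError("need at least 8 bytes of vector table")
    # Both words are 32-bit little-endian values.
    initial_sp, reset_vector = struct.unpack_from("<II", image, 0)
    return {
        "initial_sp": initial_sp,
        "reset_vector": reset_vector,
        "thumb_bit_set": bool(reset_vector & 1),     # bit 0 must be 1
        "reset_handler_address": reset_vector & ~1,  # real code address
    }


if __name__ == "__main__":
    # Hypothetical image head: SP = 0x20042000, Reset_Handler at 0x100001C0.
    sample = struct.pack("<II", 0x20042000, 0x100001C1)
    for key, value in parse_vector_table_head(sample).items():
        print(key, hex(value) if isinstance(value, int) and not isinstance(value, bool) else value)
```

If thumb_bit_set comes back False for your own image, the core will fault on its very first instruction fetch, which is one of the most common reasons a bare-metal image appears to do nothing.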
AArch64 Architecture Deep Dive (Raspberry Pi 3/4/5)
The A-profile ARM processors use the full 64-bit AArch64 instruction set, which is fundamentally different from Thumb.
The 64-bit Register Model: Abundance of Registers
AArch64 Register File:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
64 bits wide (X registers)
◀──────────────────────────────────────────▶
│ 32 bits (W register alias) │
│ ◀─────────────────────┤
┌───────────────────────────────────────────────────────────────┐
x0 │ Argument 1 / Return value │ w0
├───────────────────────────────────────────────────────────────┤
x1 │ Argument 2 / Return value (for 128-bit returns) │ w1
├───────────────────────────────────────────────────────────────┤
x2 │ Argument 3 │ w2
├───────────────────────────────────────────────────────────────┤
... ...
├───────────────────────────────────────────────────────────────┤
x7 │ Argument 8 │ w7
├───────────────────────────────────────────────────────────────┤
x8 │ Indirect result location (for large struct returns) │ w8
├───────────────────────────────────────────────────────────────┤
x9 │ Temporary / Caller-saved │ w9
├───────────────────────────────────────────────────────────────┤
... │ x9-x15: Temporaries (caller-saved) │
├───────────────────────────────────────────────────────────────┤
x16 │ IP0 - Intra-procedure-call scratch (PLT, veneers) │ w16
├───────────────────────────────────────────────────────────────┤
x17 │ IP1 - Intra-procedure-call scratch │ w17
├───────────────────────────────────────────────────────────────┤
x18 │ Platform register (reserved on some OSes) │ w18
├───────────────────────────────────────────────────────────────┤
x19 │ Callee-saved (must preserve across calls) │ w19
├───────────────────────────────────────────────────────────────┤
... │ x19-x28: Callee-saved │
├═══════════════════════════════════════════════════════════════┤
x29 │ Frame Pointer (FP) │ w29
├───────────────────────────────────────────────────────────────┤
x30 │ Link Register (LR) - return address │ w30
├───────────────────────────────────────────────────────────────┤
SP │ Stack Pointer (not a GPR, dedicated register) │ wsp
├───────────────────────────────────────────────────────────────┤
PC │ Program Counter (not directly accessible like ARM32!) │
└───────────────────────────────────────────────────────────────┘
┌───────────────────────────────────────────────────────────────┐
XZR │ Zero Register (reads as 0, writes discarded) │ wzr
└───────────────────────────────────────────────────────────────┘
^ This is REVOLUTIONARY - no wasted instruction to clear!
SIMD/Floating-Point Registers (32 × 128-bit):
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
┌───────────────────────────────────────────────────────────────┐
v0 │ B0│B1│B2│...│B15│ ← 16 bytes = 128 bits (Q0/V0) │
│ H0│H1│...│H7 │ ← 8 halfwords │
│ S0│S1│S2│S3 │ ← 4 singles (float) │
│ D0│D1 │ ← 2 doubles │
└───────────────────────────────────────────────────────────────┘
Used for: floating-point, SIMD (NEON), and crypto operations
Key Differences from ARM32:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
• 31 GPRs (vs 16) → Less register pressure, fewer spills
• PC not directly readable/writable → Use ADR/ADRP for addresses
• Zero register (xzr/wzr) → MOV x0, xzr instead of MOV r0, #0
• No conditional execution → Use CSEL, CSINC instead
• 64-bit addresses → Can address all of RAM directly
• All instructions 32-bit → No 16-bit Thumb encoding
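One detail of the W/X aliasing shown in the register diagram is easy to miss: writing a W register does not leave the upper half of the X register alone, it zeroes it. Here is a tiny Python model of that rule, with invented helper names, purely to make the behavior concrete.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def write_x(value: int) -> int:
    """Model a write to Xn: all 64 bits are replaced."""
    return value & MASK64


def write_w(value: int) -> int:
    """Model a write to Wn: low 32 bits are stored, upper 32 bits become zero."""
    return value & MASK32


def read_w(x_value: int) -> int:
    """Model a read of Wn: only the low 32 bits of Xn are visible."""
    return x_value & MASK32


if __name__ == "__main__":
    x0 = write_x(0xFFFFFFFFFFFFFFFF)   # like: mov x0, #-1
    x0 = write_w(1)                    # like: mov w0, #1
    print(hex(x0))                     # the old upper bits are gone
```

This is why 32-bit arithmetic on AArch64 needs no explicit masking afterwards: every W-register result arrives already zero-extended.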
Exception Levels: Privilege and Security
AArch64 Exception Levels:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
┌─────────────────────────────────────────────────────────────────┐
│ EL3: Secure Monitor │
│ - Highest privilege, manages secure/non-secure worlds │
│ - TrustZone firmware lives here │
├─────────────────────────────────────────────────────────────────┤
│ EL2: Hypervisor │
│ - Virtualization support │
│ - Controls virtual machines │
├─────────────────────────────────────────────────────────────────┤
│ EL1: OS Kernel │
│ - Where Linux kernel runs │
│ - Your bare-metal code runs here! │
├─────────────────────────────────────────────────────────────────┤
│ EL0: User Applications │
│ - Lowest privilege │
│ - Normal programs run here under Linux │
└─────────────────────────────────────────────────────────────────┘
On Raspberry Pi boot:
┌──────────────────────────────────────────────────────────────┐
│ The VideoCore GPU firmware loads kernel8.img at 0x80000.     │
│ The ARM cores start at EL3 in a small stub, drop to EL2,     │
│ and jump to 0x80000. Your bare-metal code then typically     │
│ drops itself to EL1 during setup.                            │
└──────────────────────────────────────────────────────────────┘
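Bare-metal startup code usually begins by asking which exception level it is running at, by reading the CurrentEL system register (mrs x0, CurrentEL) and extracting bits [3:2]. The Python sketch below models only that bit extraction; the function name is made up, and the register value has to come from real hardware or an emulator.

```python
def exception_level(current_el: int) -> int:
    """Extract the EL field (bits [3:2]) from a CurrentEL register value."""
    return (current_el >> 2) & 0b11


if __name__ == "__main__":
    # Raw CurrentEL values as they would appear in x0 after the MRS.
    for raw in (0x4, 0x8, 0xC):
        print(f"CurrentEL = 0x{raw:X} -> EL{exception_level(raw)}")
```

A raw value of 0x8 therefore means EL2, which is what you would expect to see at 0x80000 on a Raspberry Pi before your own code drops to EL1. CurrentEL cannot be read from EL0, so user-mode code never sees a value of 0 here.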
Memory Barriers: Ordering Matters
Why Memory Barriers Are Needed:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Modern CPUs reorder memory accesses for performance. This is usually
invisible to single-threaded code, but becomes critical when:
1. Communicating with peripherals (they have side effects!)
2. Multi-core systems (other cores see different ordering)
3. DMA operations (hardware sees memory, not caches)
Example WITHOUT barrier (BROKEN):
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
You write: CPU might execute as:
────────────────────── ────────────────────────────────
mailbox_buffer[0] = cmd mailbox_write = buffer_addr ← FIRST!
mailbox_buffer[1] = arg mailbox_buffer[0] = cmd ← TOO LATE
mailbox_write = buffer_addr mailbox_buffer[1] = arg
The peripheral reads garbage because the buffer wasn't filled yet!
ARM Memory Barrier Instructions:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
DMB (Data Memory Barrier)
├── Ensures all previous memory accesses complete before
│ subsequent memory accesses begin
├── Does NOT affect instruction execution order
└── Use between: data writes and peripheral write
DSB (Data Synchronization Barrier)
├── Like DMB, but also waits for all previous instructions
│ to complete (stronger than DMB)
└── Use before: peripheral access that must be visible
ISB (Instruction Synchronization Barrier)
├── Flushes the instruction pipeline
├── Ensures previous context changes take effect
└── Use after: changing system registers, enabling MMU
Correct Pattern for Peripheral Access:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
// Fill mailbox buffer
str w1, [x0] // Write data to buffer
str w2, [x0, #4] // Write more data
dsb sy // ← BARRIER: Complete all writes
str w3, [x4] // Now write to mailbox register
// Hardware now sees complete buffer
Memory-Mapped I/O: Talking to Hardware
Both Cortex-M and AArch64 use memory-mapped I/O. Peripherals appear at specific addresses, and you control them by reading/writing to those addresses.
Memory-Mapped I/O Concept:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Normal Memory: Peripheral Register:
────────────── ────────────────────
LDR r0, [addr] LDR r0, [UART_DATA]
│ │
▼ ▼
Read from RAM Read TRIGGERS HARDWARE!
Data was sitting there Byte removed from RX FIFO
Memory unchanged Status flags updated
STR r0, [addr] STR r0, [GPIO_OUT]
│ │
▼ ▼
Write to RAM Write CAUSES ACTION!
Data now stored there Pin voltage changes
Can read it back May not read same value back
Example: GPIO Control on RP2040:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
SIO Base: 0xD0000000
Offset │ Register │ Purpose
────────┼────────────────┼──────────────────────────────────
0x000 │ CPUID │ Processor ID (read-only)
0x004 │ GPIO_IN │ Read current GPIO input state
0x010 │ GPIO_OUT │ Read/write GPIO output state
0x014 │ GPIO_OUT_SET │ Set bits in GPIO_OUT (write-only)
0x018 │ GPIO_OUT_CLR │ Clear bits in GPIO_OUT (write-only)
0x01C │ GPIO_OUT_XOR │ Toggle bits in GPIO_OUT (write-only)
0x020 │ GPIO_OE │ Output enable (1=output, 0=input)
0x024 │ GPIO_OE_SET │ Set bits in GPIO_OE
0x028 │ GPIO_OE_CLR │ Clear bits in GPIO_OE
To turn ON GPIO25 (Pico's LED):
─────────────────────────────────────────────────────────────────
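LDR r0, =0xD0000000       // SIO base address
MOVS r1, #1               // M0+ immediate move (updates flags)
LSLS r1, r1, #25          // r1 = 0x02000000 (bit 25)
STR r1, [r0, #0x024]      // GPIO_OE_SET: enable output
STR r1, [r0, #0x014]      // GPIO_OUT_SET: set high → LED ON!
// Assumes IO_BANK0 is out of reset and GPIO25 is routed to the SIO function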
Why SET/CLR registers instead of just GPIO_OUT?
─────────────────────────────────────────────────────────────────
Without SET/CLR (DANGEROUS):
┌────────────────────────────────────────────────────────────────┐
│ LDR r1, [r0, #GPIO_OUT]   // Read current value                │
│ LDR r2, =(1<<25)          // Build mask for bit 25             │
│ ORRS r1, r2               // Set bit 25                        │
│ STR r1, [r0, #GPIO_OUT]   // Write back                        │
│ │
│ PROBLEM: If another core or interrupt modifies GPIO_OUT │
│ between the LDR and STR, those changes are LOST! │
│ This is a classic "read-modify-write race condition." │
└────────────────────────────────────────────────────────────────┘
With SET/CLR (ATOMIC and SAFE):
┌────────────────────────────────────────────────────────────────┐
│ LDR r1, =(1<<25)            // Mask from the literal pool      │
│ STR r1, [r0, #GPIO_OUT_SET] // Hardware atomically sets bit │
│ │
│ Other bits are UNAFFECTED - hardware handles it! │
└────────────────────────────────────────────────────────────────┘
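The lost update is easier to believe once you step through one bad interleaving. The following Python sketch replays the two sequences above with an interrupt landing in the worst possible place; the second pin (bit 2) and the function names are hypothetical, and the "register" is just an integer.

```python
SIO_BASE = 0xD0000000
GPIO_OUT_SET_OFFSET = 0x014
LED_MASK = 1 << 25    # GPIO25, the Pico's LED
OTHER_MASK = 1 << 2   # hypothetical pin driven by an interrupt handler


def read_modify_write_interrupted(gpio_out: int) -> int:
    """Main code does LDR / ORRS / STR; an ISR sets another pin in between."""
    snapshot = gpio_out               # main: LDR r1, [r0, #GPIO_OUT]
    gpio_out |= OTHER_MASK            # ISR runs here and sets its pin
    gpio_out = snapshot | LED_MASK    # main: ORRS + STR with the stale value
    return gpio_out


def set_register_interrupted(gpio_out: int) -> int:
    """Both writers use GPIO_OUT_SET, so each write only touches its own bits."""
    gpio_out |= OTHER_MASK            # ISR: STR mask, [r0, #GPIO_OUT_SET]
    gpio_out |= LED_MASK              # main: STR mask, [r0, #GPIO_OUT_SET]
    return gpio_out


if __name__ == "__main__":
    print(f"GPIO_OUT_SET lives at 0x{SIO_BASE + GPIO_OUT_SET_OFFSET:08X}")
    print(f"read-modify-write: 0x{read_modify_write_interrupted(0):08X}")
    print(f"SET register:      0x{set_register_interrupted(0):08X}")
```

In the first function the handler's bit silently disappears because the main code writes back a value it read before the interrupt. In the second, no writer ever needs to know the other bits, so there is nothing stale to write back.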
Interrupts and Exceptions: Responding to the World
Interrupt Flow on Cortex-M:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Main Code Running
│
│ ← UART receives byte
│ Hardware sets interrupt flag
│ NVIC sees enabled interrupt
▼
┌──────────────────────────────────────────────────────────────────┐
│ AUTOMATIC HARDWARE ACTIONS (you don't write code for this): │
│ 1. Finish current instruction │
│ 2. Push 8 registers to stack: r0-r3, r12, LR, PC, xPSR │
│ 3. Load new PC from vector table (exception #) │
│ 4. Load 0xFFFFFFF9 into LR (EXC_RETURN) │
│ 5. Enter Handler mode (privileged) │
└──────────────────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────────┐
│ YOUR ISR EXECUTES: │
│ - Must save r4-r11 if you use them (push {r4-r7}) │
│ - Read UART data register (clears interrupt flag) │
│ - Process byte (store in buffer, set flag, etc.) │
│ - Restore r4-r11 if saved │
│ - Return with: BX LR (the magic EXC_RETURN value) │
└──────────────────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────────┐
│ AUTOMATIC HARDWARE ACTIONS: │
│ 1. Hardware detects EXC_RETURN in LR │
│ 2. Pop 8 registers from stack │
│ 3. Resume execution exactly where interrupted │
│ 4. Return to Thread mode │
└──────────────────────────────────────────────────────────────────┘
│
▼
Main Code Continues (unaware anything happened!)
Stack During Interrupt:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
BEFORE interrupt: AFTER entry, BEFORE ISR code:
┌──────────────┐ ┌──────────────┐
│ (old data) │ │ (old data) │
│ │ ├──────────────┤
│ │ │ xPSR │ ← +0x1C from new SP
│ │ ├──────────────┤
│ │ │ PC (return) │ ← +0x18
│ │ ├──────────────┤
│ │ │ LR │ ← +0x14
│ │ ├──────────────┤
│ │ │ r12 │ ← +0x10
│ │ ├──────────────┤
│ │ │ r3 │ ← +0x0C
│ │ ├──────────────┤
│ │ │ r2 │ ← +0x08
│ │ ├──────────────┤
│ │ │ r1 │ ← +0x04
│ │ ├──────────────┤
SP →│ │ SP→│ r0 │ ← +0x00 (new SP)
└──────────────┘ └──────────────┘
The 32 bytes (8 × 4) are pushed automatically by hardware!
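When you debug a HardFault, you read this frame straight out of memory at the stacked SP. As a host-side illustration, the Python sketch below names the eight stacked words in the order drawn above; the helper names and the sample values are assumptions, not output from a real target.

```python
import struct

# Order of the hardware-stacked words, lowest address (new SP) first.
FRAME_FIELDS = ("r0", "r1", "r2", "r3", "r12", "lr", "pc", "xpsr")


def field_offset(name: str) -> int:
    """Byte offset of a stacked register from the new SP."""
    return FRAME_FIELDS.index(name) * 4


def decode_exception_frame(stack_bytes: bytes) -> dict:
    """Decode the 32-byte exception frame found at the new SP."""
    if len(stack_bytes) < 32:
        raise ValueError("an exception frame is 8 words (32 bytes)")
    values = struct.unpack_from("<8I", stack_bytes, 0)
    return dict(zip(FRAME_FIELDS, values))


if __name__ == "__main__":
    # Made-up memory dump: r0-r3 = 1..4, then r12, LR, return PC, xPSR.
    dump = struct.pack("<8I", 1, 2, 3, 4, 12, 0x10000301, 0x10000234, 0x01000000)
    frame = decode_exception_frame(dump)
    print(f"interrupted at PC = 0x{frame['pc']:08X} (stored at SP+0x{field_offset('pc'):02X})")
```

The stacked PC at offset 0x18 is the value you usually want first: it points at the instruction that was interrupted, or at the one that faulted.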
Concept Summary Table
| Concept Cluster | What You Must Internalize |
|---|---|
| Registers as fast storage | Registers are inside the CPU, memory is external. Every memory access costs ~2-100+ cycles. Keep hot data in registers. |
| Thumb vs AArch64 | Thumb uses 16-bit instructions for code density (Cortex-M). AArch64 uses 32-bit instructions with 31 GPRs (Cortex-A). Different assembly syntax, different constraints. |
| Memory-mapped I/O | Peripherals appear as memory addresses. Writing to 0xD0000014 doesn’t store data—it changes a GPIO pin’s voltage. Reading 0x40034000 doesn’t return stored data—it pulls a byte from a UART FIFO. |
| Vector tables | The CPU needs to know where your code is. On reset, it reads the initial SP and PC from fixed addresses. For interrupts, it looks up handler addresses in a table. |
| Calling conventions | Functions need agreements: which registers hold arguments, which must be preserved, where the return address lives. Break these and your stack corrupts. |
| Memory barriers | CPUs reorder memory accesses for speed. When talking to peripherals or other cores, you need explicit barriers to ensure order. |
| Interrupt latency | Time from interrupt signal to ISR executing. Cortex-M is designed for low latency (about 15 cycles on the Cortex-M0+, 12 on the Cortex-M3/M4). Keep ISRs short to maintain responsiveness. |
| Bit manipulation | Setting bit 25? That’s 1 << 25 = 0x02000000. Testing bit 5? Use TST reg, #0x20. Mastering hex and binary is mandatory. |
Deep Dive Reading By Concept
ARM Architecture Fundamentals
| Concept | Book & Chapter |
|---|---|
| ARM history and processor families | Introduction to Computer Organization: ARM Edition by Robert G. Plantz — Ch. 1 |
| Register model and data types | The Art of ARM Assembly, Volume 1 by Randall Hyde — Ch. 3 |
| Instruction encoding and formats | The Art of ARM Assembly, Volume 1 by Randall Hyde — Ch. 4-6 |
| Memory hierarchy concepts | Computer Systems: A Programmer’s Perspective by Bryant & O’Hallaron — Ch. 6 |
Cortex-M Specifics (Pico)
| Concept | Book & Chapter |
|---|---|
| Thumb instruction set | The Art of ARM Assembly, Volume 1 by Randall Hyde — Ch. 4-6 |
| Cortex-M architecture overview | ARM Cortex-M0+ Devices Generic User Guide — Ch. 2-3 |
| Boot sequence and vectors | Making Embedded Systems, 2nd Edition by Elecia White — Ch. 2 |
| NVIC and interrupt handling | Making Embedded Systems, 2nd Edition by Elecia White — Ch. 4 |
| Memory-mapped peripherals | RP2040 Datasheet — Section 2 (SIO, GPIO) |
| Linker scripts and memory layout | Bare Metal Programming Guide |
AArch64 Specifics (Raspberry Pi 3/4/5)
| Concept | Book & Chapter |
|---|---|
| AArch64 register model | Modern Arm Assembly Language Programming by Daniel Kusswurm — Ch. 1-2 |
| A64 instruction set | The Art of ARM Assembly, Volume 1 by Randall Hyde — Ch. 16 |
| Exception levels (EL0-EL3) | ARM Architecture Reference Manual — Section D1 |
| Memory barriers | ARM Architecture Reference Manual — Section B2.3 |
| NEON/SIMD programming | Modern Arm Assembly Language Programming by Daniel Kusswurm — Ch. 8-11 |
| Raspberry Pi boot process | rpi4os.com — Part 1-2 |
Embedded Systems Context
| Concept | Book & Chapter |
|---|---|
| Hardware/software interface | Making Embedded Systems, 2nd Edition by Elecia White — Ch. 5 (I/O) |
| Serial communication (UART) | Making Embedded Systems, 2nd Edition by Elecia White — Ch. 8 |
| Real-time constraints | Making Embedded Systems, 2nd Edition by Elecia White — Ch. 10 |
| Debugging embedded systems | Making Embedded Systems, 2nd Edition by Elecia White — Ch. 3 |
| Low-level C and assembly | Low-Level Programming by Igor Zhirkov — Ch. 4-6 |
Tool-Specific References
| Tool | Resource |
|---|---|
| GNU Assembler (GAS) | GNU AS Manual — online |
| GNU Linker (LD) | GNU LD Manual — Linker Scripts chapter |
| GDB/LLDB debugging | The Art of Debugging with GDB, DDD, and Eclipse by Matloff & Salzman — Ch. 1-5 |
| OpenOCD | OpenOCD User’s Guide — online |
| ARM development | ARM System Developer’s Guide by Sloss, Symes, Wright — Ch. 1-6 |
Essential Reading Order
For mastering ARM assembly, read in this sequence:
- Foundation (Week 1-2):
- The Art of ARM Assembly, Volume 1 Ch. 1-3 (overview, registers)
- Making Embedded Systems Ch. 2 (embedded architecture)
- RP2040 Datasheet sections 1-2 (memory map, SIO)
- Instruction Set (Week 3-4):
- The Art of ARM Assembly, Volume 1 Ch. 4-6 (Thumb instructions)
- Introduction to Computer Organization: ARM Edition Ch. 9-11 (assembly programming)
- Hardware Interaction (Week 5-6):
- Making Embedded Systems Ch. 4-5 (interrupts, I/O)
- RP2040 Datasheet sections 4-5 (UART, GPIO details)
- 64-bit Transition (Week 7-8):
- Modern Arm Assembly Language Programming Ch. 1-4 (AArch64)
- BCM2711 Peripherals Manual (Raspberry Pi 4)
- Advanced Topics (Week 9+):
- Computer Systems: A Programmer’s Perspective Ch. 6-9 (memory, linking)
- ARM Architecture Reference Manual (definitive reference)
Project 1: Bare-Metal LED Blinker (Raspberry Pi Pico)
- File: ARM_ASSEMBLY_LEARNING_PROJECTS.md
- Programming Language: ARM Assembly
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 4: Expert
- Knowledge Area: Embedded Systems
- Software or Tool: Raspberry Pi Pico
- Main Book: “The Art of ARM Assembly, Volume 1” by Randall Hyde
What you’ll build: A program that blinks the onboard LED without any SDK, OS, or C runtime—pure assembly from reset vector to GPIO toggle.
Why it teaches ARM assembly: This is your “Hello World” that forces you to understand everything the SDK hides from you: how the processor boots, how to configure clocks, and how memory-mapped peripherals work.
Core challenges you’ll face:
- Writing a proper vector table (maps to understanding Cortex-M boot sequence)
- Initializing clocks and PLLs without SDK (maps to understanding peripheral registers)
- Manipulating GPIO registers with load/store instructions (maps to memory-mapped I/O)
- Creating precise delays without a timer library (maps to instruction timing)
- Writing a working linker script (maps to understanding memory layout)
Key Concepts:
- Cortex-M Boot Sequence: “Making Embedded Systems, 2nd Edition” Ch. 2 - Elecia White
- Thumb Instruction Encoding: “The Art of ARM Assembly, Volume 1” Ch. 4-6 - Randall Hyde
- Memory-Mapped I/O: “Introduction to Computer Organization: ARM Edition” Ch. 14 - Robert G. Plantz
- Linker Scripts: Bare Metal Programming Guide - cpq
Difficulty: Beginner
Time estimate: Weekend
Prerequisites: Basic understanding of binary/hex, any programming experience
Real world outcome:
- You will see the green LED on your Pico blinking at a rate you control
- You’ll have a blink.s file under 100 lines that you understand completely
- You can modify the delay and see the blink rate change immediately
Learning milestones:
- LED blinks → You understand vector tables and the boot process
- Blink rate is predictable → You understand instruction cycles and timing
- You can add a second GPIO output → You’ve internalized register manipulation
Real World Outcome
Exact physical behavior:
- The green LED (GPIO 25 on the Pico) will turn ON for about 500ms, then OFF for about 500ms, repeating forever (a busy-wait loop is only as exact as your cycle count and clock setup)
- LED brightness: full intensity (3.3V output driving through a resistor)
- Power consumption: ~20mA during ON state, ~2mA during OFF state
- The LED will start blinking approximately 1-3ms after power-on (depends on your boot code efficiency)
Terminal output when connected via SWD debugger:
OpenOCD:
Info : RP2040 Core 0 halted
Info : Loaded 256 bytes from blink.elf
Info : Starting execution at 0x10000000
Binary characteristics:
$ arm-none-eabi-size blink.elf
text data bss dec hex filename
256 0 0 256 100 blink.elf
Your entire program fits in 256 bytes—no bloated runtime, no hidden libraries.
Memory map (from your linker script):
FLASH (rx) : ORIGIN = 0x10000000, LENGTH = 2048K
RAM (rwx) : ORIGIN = 0x20000000, LENGTH = 264K
Sections:
.vectors : 0x10000000 - 0x100000C0 (vector table, 48 entries)
.text : 0x100000C0 - 0x100001C0 (your actual code)
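What changes when you modify the delay:
- Halve the delay loop count (for example 0x100000 to 0x80000) → LED blinks twice as fast
- Double it (to 0x200000) → LED blinks half as fast
- Each loop iteration takes ~3 CPU cycles on the Cortex-M0+ (1 for subs, 2 for the taken bne), which is 24ns at 125MHz, so a 500ms delay at that clock needs about 20.8 million iterations (0x100000 iterations is only about 25ms at 125MHz)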
The Core Question You’re Answering
“What actually happens between pressing ‘reset’ and executing my first instruction?”
This project answers the fundamental mystery that all high-level languages hide from you: the boot sequence. You’ll understand:
- How does the ARM processor know where to find your code?
- What is the very first thing that executes when power is applied?
- Why does your code need to be at a specific memory address?
- How does the processor transition from reset state to running your application?
Concepts You Must Understand First
- Binary Number Representation
- Can you convert 0b1010_1100 to hex? (Answer: 0xAC)
- Why do we use hex for memory addresses?
- Book reference: “Introduction to Computer Organization: ARM Edition” Ch. 1-2 (Plantz)
- Memory-Mapped I/O Concept
- What does it mean for a register to be “at address 0x40014000”?
- How is writing to a memory address different from writing to a hardware register?
- If GPIO is at 0xD0000000, what happens when you write 0x0200_0000 to that address?
- Book reference: “Making Embedded Systems” Ch. 5 (White)
- ARM Register Model
- Can you name the general-purpose registers? (r0-r12)
- What makes r13 (SP), r14 (LR), and r15 (PC) special?
- Why can’t you use r13 as a scratch register?
- Book reference: “The Art of ARM Assembly, Volume 1” Ch. 3 (Hyde)
- Thumb Instruction Set Basics
- What is the difference between ARM and Thumb mode?
- Why does Cortex-M0+ only support Thumb?
- What is the encoding size of a Thumb instruction? (16-bit or 32-bit)
- Book reference: “The Art of ARM Assembly, Volume 1” Ch. 4 (Hyde)
- Linker and Loader Concepts
- What is the difference between a .s file, .o file, and .elf file?
- What does a linker script do?
- Why does code have both a “load address” and a “run address”?
- Book reference: Bare Metal Programming Guide Section 2
Questions to Guide Your Design
- Where does execution start after reset?
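- On a generic Cortex-M part, the core loads the initial stack pointer from vector table offset 0 and the reset handler address from offset 4. On the RP2040, the boot ROM runs first: it copies the first 256 bytes of flash (0x10000000-0x100000FF) into SRAM as the second-stage bootloader (boot2), checks its CRC32, and runs it; boot2 then hands over to your vector table. How will you ensure the stack pointer and reset handler values in that table are correct?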
- How do you create a delay without a timer?
- If your clock is 125MHz, how many instructions execute per second? How many loop iterations for a 500ms delay?
- What registers control GPIO 25?
- You need to find GPIO25’s control registers in the RP2040 datasheet. What is the base address for GPIO? What offset is the output enable register?
- How do you toggle a bit without affecting other bits?
- If register GPIO_OUT contains 0x0000_0000 and you want to set bit 25 to 1, what value do you write? What about toggling bit 25?
- What happens if you forget the vector table?
- Without a proper vector table at 0x10000000, what will the processor do when it boots?
- How do you know your program size fits in flash?
- The RP2040 has 2MB of flash. How do you verify your binary doesn’t overflow?
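The bit-manipulation question above can be checked on a host machine before touching hardware. The following is a minimal Python sketch, not target code; the helper names (`set_bit`, `clear_bit`, `toggle_bit`) are made up for illustration, and the 32-bit mask simply models a 32-bit register.

```python
# Host-side model of 32-bit register bit operations (illustrative helper names).
WORD_MASK = 0xFFFFFFFF
GPIO25_MASK = 1 << 25  # bit 25 -> 0x02000000


def set_bit(value, mask):
    """OR sets the masked bits and leaves every other bit unchanged."""
    return (value | mask) & WORD_MASK


def clear_bit(value, mask):
    """AND with the inverted mask clears only the masked bits."""
    return value & ~mask & WORD_MASK


def toggle_bit(value, mask):
    """XOR flips only the masked bits."""
    return (value ^ mask) & WORD_MASK


gpio_out = 0x00000000
print(hex(set_bit(gpio_out, GPIO25_MASK)))       # 0x2000000
print(hex(toggle_bit(0x02000001, GPIO25_MASK)))  # 0x1
```

On the RP2040 the GPIO_OUT_SET and GPIO_OUT_CLR registers perform the OR and AND-NOT steps in hardware, so the assembly version only needs a single store of the mask.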
Thinking Exercise
Before writing any code, trace this on paper:
- Draw the memory map (addresses 0x0000_0000 to 0x2004_0000):
0x10000000: [FLASH START - draw a box]
0x10000000: [Vector table entry 0: initial SP value]
0x10000004: [Vector table entry 1: Reset_Handler address]
0x10000008-0x100000C0: [remaining vector entries]
0x100000C0: [Your code starts here]
0x20000000: [RAM START - draw another box]
- Write out the boot sequence in pseudocode:
POWER ON
→ Processor reads address 0x10000000, loads value into SP
→ Processor reads address 0x10000004, loads value into PC
→ PC now points to Reset_Handler
→ Reset_Handler executes: what do you need to do here?
- Calculate delay loop iterations:
Target delay: 500ms = 0.5 seconds
CPU frequency: 125 MHz = 125,000,000 Hz
Cycles per second: 125,000,000
Cycles for 500ms: 125,000,000 × 0.5 = 62,500,000 cycles
Your loop body:
subs r0, r0, #1 ; 1 cycle
bne delay_loop ; 2 cycles if taken on Cortex-M0+, 1 if not
Total per iteration: ~3 cycles
Loop iterations needed: 62,500,000 / 3 ≈ 20,833,333 = 0x13DE435
- Trace register values through GPIO initialization:
; Initial state: What's in r0, r1?
ldr r0, =GPIO_BASE ; r0 = ?
ldr r1, =GPIO25_MASK ; r1 = ?
str r1, [r0, #GPIO_OE] ; What value writes to what address?
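The delay-loop arithmetic in this exercise is easy to get wrong by hand, so here is a small Python sketch you can use to check your paper result. The function name `delay_iterations` is hypothetical, and the cycles-per-iteration figure is the key assumption: 3 corresponds to a 1-cycle subs plus a 2-cycle taken branch on the Cortex-M0+, while 4 is a more conservative estimate. Measure on real hardware before trusting either.

```python
def delay_iterations(clock_hz, delay_ms, cycles_per_iteration):
    """Number of loop iterations needed to burn delay_ms at clock_hz."""
    total_cycles = clock_hz * delay_ms // 1000
    return total_cycles // cycles_per_iteration


# Compare two assumed loop costs at a 125MHz system clock.
for cycles in (3, 4):
    n = delay_iterations(125_000_000, 500, cycles)
    print(f"{cycles} cycles/iteration -> {n} iterations ({n:#x})")
```

The result is too large for an 8-bit immediate, which is why the count is loaded with a literal-pool load (ldr r2, =value) rather than a plain move.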
The Interview Questions They’ll Ask
- “Explain the ARM Cortex-M boot sequence from power-on.”
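- Expected answer: The processor loads the initial SP from vector table offset 0 (address 0x00000000 at reset) and the Reset_Handler address from offset 4, then begins execution there. On the RP2040, address 0x00000000 is the boot ROM, which validates and runs the 256-byte second-stage bootloader from the start of flash (0x10000000); that stage (in the pico-sdk's version) then points VTOR at the application's vector table in flash and jumps to its reset handler.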
- “What is the difference between Thumb and Thumb-2 instruction sets?”
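- Expected answer: Thumb is 16-bit encoding for reduced code size. Thumb-2 extends it with optional 32-bit encodings for more complex operations. The Cortex-M0+ (ARMv6-M) executes the 16-bit Thumb instructions plus a small handful of 32-bit encodings (BL, DMB, DSB, ISB, MRS, MSR).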
- “How would you toggle a GPIO pin without modifying other pins in the same port?”
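- Expected answer: Use the dedicated bit-set and bit-clear registers (GPIO_OUT_SET/GPIO_OUT_CLR on RP2040), which change only the bits you write as 1, or do a read-modify-write with XOR. Avoid writing a whole new value to the output register, and remember that a read-modify-write sequence can race with an interrupt or the other core.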
- “Why can’t you use a simple software delay in production code?”
- Expected answer: CPU frequency changes affect timing, compiler optimization might remove the loop, doesn’t account for interrupts, blocks other tasks, not accurate. Should use hardware timers.
- “What is a linker script and why do you need one?”
- Expected answer: Defines memory layout, section placement (.text, .data, .bss), and entry point. Without it, the linker doesn’t know where to place code in flash vs RAM or how to arrange the vector table.
- “Explain what happens during a ‘str’ instruction at the hardware level.”
- Expected answer: Generates address from base register + offset, puts data on bus, peripheral decodes address, if it’s a GPIO register the hardware updates the pin state, otherwise writes to RAM.
Hints in Layers
Hint 1: Vector Table Structure Your vector table needs at minimum:
.section .vectors
.word _stack_top // Initial stack pointer
.word Reset_Handler // Reset vector
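A generic Cortex-M part expects this table at the address the core boots from. On the RP2040 the boot ROM first loads and CRC-checks a 256-byte second-stage bootloader (boot2) from 0x10000000, and the pico-sdk places the application vector table immediately after it, at 0x10000100. If you put your vector table at 0x10000000, as the layout in this project does, you need to account for that boot2 stage yourself.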
Hint 2: GPIO Register Addresses From the RP2040 datasheet:
.equ SIO_BASE, 0xd0000000
.equ GPIO_OUT_SET, 0x014
.equ GPIO_OUT_CLR, 0x018
.equ GPIO_OE_SET, 0x024
.equ GPIO25_BIT, 25
Hint 3: Setting up GPIO as Output
ldr r0, =SIO_BASE
movs r1, #1 // Cortex-M0+ (Thumb-1) only has the flag-setting MOVS/LSLS immediate forms
lsls r1, r1, #GPIO25_BIT // r1 = 0x02000000 (bit 25 set)
str r1, [r0, #GPIO_OE_SET] // Enable output on GPIO25
Hint 4: Complete Toggle Loop
main_loop:
ldr r0, =SIO_BASE
movs r1, #1 // flag-setting forms are required on Cortex-M0+
lsls r1, r1, #GPIO25_BIT
// Turn ON
str r1, [r0, #GPIO_OUT_SET]
ldr r2, =0x100000
bl delay
// Turn OFF
str r1, [r0, #GPIO_OUT_CLR]
ldr r2, =0x100000
bl delay
b main_loop
delay:
subs r2, r2, #1
bne delay
bx lr
Hint 5: Complete Linker Script
MEMORY {
FLASH (rx) : ORIGIN = 0x10000000, LENGTH = 2048K
RAM (rwx) : ORIGIN = 0x20000000, LENGTH = 264K
}
SECTIONS {
.text : {
KEEP(*(.vectors))
*(.text*)
} > FLASH
}
_stack_top = ORIGIN(RAM) + LENGTH(RAM);
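To sanity-check the numbers the linker script produces, the same arithmetic can be reproduced on the host. This Python sketch is only an illustration; the constants mirror the MEMORY block above, and the function names are made up.

```python
# Values copied from the MEMORY block of the linker script above.
FLASH_ORIGIN = 0x10000000
FLASH_LENGTH = 2048 * 1024
RAM_ORIGIN = 0x20000000
RAM_LENGTH = 264 * 1024


def stack_top(ram_origin, ram_length):
    """Mirrors `_stack_top = ORIGIN(RAM) + LENGTH(RAM)`."""
    return ram_origin + ram_length


def fits_in_flash(image_bytes, flash_length=FLASH_LENGTH):
    """True when the image (e.g. the size reported by arm-none-eabi-size) fits."""
    return image_bytes <= flash_length


print(hex(stack_top(RAM_ORIGIN, RAM_LENGTH)))  # 0x20042000
print(fits_in_flash(256))                      # True
```

The stack pointer starts one byte past the end of RAM because the Cortex-M stack is full-descending: the first push decrements SP before storing.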
Hint 6: Build Commands
arm-none-eabi-as -mcpu=cortex-m0plus -mthumb blink.s -o blink.o
arm-none-eabi-ld -T linker.ld blink.o -o blink.elf
arm-none-eabi-objcopy -O binary blink.elf blink.bin
picotool load blink.bin
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Cortex-M Boot Process | “Making Embedded Systems” by Elecia White | Ch. 2 - Anatomy of an Embedded System |
| Vector Tables & Exceptions | “The Art of ARM Assembly, Volume 1” by Randall Hyde | Ch. 14 - Exception Handling |
| Thumb Instruction Set | “The Art of ARM Assembly, Volume 1” by Randall Hyde | Ch. 4-6 - Instruction Set |
| Memory-Mapped I/O | “Introduction to Computer Organization: ARM Edition” by Robert G. Plantz | Ch. 14 - I/O Interfacing |
| GPIO Register Details | RP2040 Datasheet | Section 2.3.1 - GPIO Functions |
| Linker Scripts | “Bare Metal Programming Guide” (GitHub) | Section 2 - Linker Scripts |
| RP2040 Boot Sequence | “RP2040 Assembly Language Programming” by Stephen Smith | Ch. 3 - Getting Started |
| SIO Peripheral | RP2040 Datasheet | Section 2.3.1.7 - SIO |
Project 2: UART Driver from Scratch (Raspberry Pi Pico)
- File: ARM_ASSEMBLY_LEARNING_PROJECTS.md
- Programming Language: ARM Assembly
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 4: Expert
- Knowledge Area: Embedded Systems / Drivers
- Software or Tool: Raspberry Pi Pico
- Main Book: “Making Embedded Systems” by Elecia White
What you’ll build: A serial driver in pure assembly that lets you send “Hello, World!” to your computer over USB-serial, with both polling and interrupt-driven modes.
Why it teaches ARM assembly: UART forces you to deal with multiple peripheral registers, bit manipulation for configuration, and eventually interrupts—all while debugging blind (until it works!).
Core challenges you’ll face:
- Configuring UART peripheral registers (baud rate divisors, word length, parity)
- Implementing polling-based TX/RX by checking status flags
- Setting up the NVIC (Nested Vectored Interrupt Controller) for UART interrupts
- Writing an interrupt service routine (ISR) that preserves registers correctly
- Handling the RP2040’s dual-core memory considerations
Key Concepts:
- UART Protocol & Registers: RP2040 Datasheet Chapter 4 (UART) - Raspberry Pi Foundation
- ARM Procedure Call Standard: “Modern Arm Assembly Language Programming” Ch. 2 - Daniel Kusswurm
- NVIC & Interrupt Handling: “Making Embedded Systems, 2nd Edition” Ch. 4 - Elecia White
- Status Register Polling: STM32 Cortex M0 Bare Metal Tutorial - Martin Hubacek
Difficulty: Intermediate Time estimate: 1-2 weeks Prerequisites: Project 1 completed, basic serial terminal usage
Real world outcome:
- Open a serial terminal (minicom, screen, PuTTY) and see “Hello from bare-metal assembly!” appear
- Type characters and see them echo back
- Watch interrupt-driven reception handle characters while your main loop does other work
Learning milestones:
- Single character TX works → You understand basic peripheral register access
- Full string TX works → You understand loops and memory access in assembly
- Interrupt-driven RX works → You’ve mastered the Cortex-M exception model
Real World Outcome
Terminal output (exact bytes you’ll see):
$ screen /dev/ttyACM0 115200
Hello from bare-metal assembly!
Character count: 35
Serial protocol on the wire (logic analyzer view):
Baud rate: 115200 bps
Data format: 8N1 (8 data bits, No parity, 1 stop bit)
Voltage levels: 0V (logic 0), 3.3V (logic 1)
Sending 'H' (0x48):
Start bit: 0
Data bits: 0-0-0-1-0-0-1-0 (LSB first; 0x48 = 0b01001000)
Stop bit: 1
Duration per bit: 8.68 μs (1/115200)
Total time per byte: ~87 μs
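The wire-level view above can be reproduced with a few lines of Python, which is handy when comparing against a logic analyzer capture. This is a host-side sketch; `uart_frame_bits` is an illustrative name, and it models only 8N1 framing.

```python
BAUD = 115200


def uart_frame_bits(byte):
    """Return the 10 line levels of an 8N1 frame: start, 8 data bits LSB first, stop."""
    data_bits = [(byte >> i) & 1 for i in range(8)]
    return [0] + data_bits + [1]


bit_time_us = 1_000_000 / BAUD
frame = uart_frame_bits(ord("H"))
print(frame)  # [0, 0, 0, 0, 1, 0, 0, 1, 0, 1]
print(f"bit time: {bit_time_us:.2f} us, frame time: {len(frame) * bit_time_us:.1f} us")
```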
Polling mode behavior:
; Your code blocks here until TX FIFO has space
poll_loop:
ldr r1, [r0, #UART_FR] ; Read UART flags register
movs r2, #TXFF ; Cortex-M0+ has no TST with an immediate, so load the mask
tst r1, r2 ; Test TX FIFO full flag
bne poll_loop ; Loop if full
; Continue only when space available
While transmitting, your CPU is 100% busy waiting. LED blinking stops during transmission.
Interrupt mode behavior:
Terminal output:
H.e.l.l.o. .f.r.o.m. .I.S.R.!.
(dots represent LED blinks between characters)
With interrupts, your main loop continues running (blinking LED) while UART ISR handles transmission in the background.
Binary size comparison:
$ arm-none-eabi-size uart.elf
text data bss dec hex filename
512 35 8 555 22b uart.elf (polling mode)
768 35 64 867 363 uart.elf (interrupt mode)
Interrupt handling adds ~256 bytes for ISR and ~56 bytes for RX buffer.
Power consumption difference:
- Polling mode: 20-25mA (CPU always running)
- Interrupt mode: 5-10mA average (CPU can enter WFI between characters)
The Core Question You’re Answering
“How does a processor communicate with the outside world, and what’s the trade-off between polling and interrupts?”
This project reveals the fundamental I/O patterns all drivers use:
- How do you configure a peripheral to speak a protocol (UART)?
- How does the CPU know when a peripheral is ready for more data?
- What’s the difference between blocking I/O and asynchronous I/O?
- How do interrupts let you do two things “at once” on a single-core processor?
- Why do embedded systems need interrupt service routines (ISRs)?
Concepts You Must Understand First
- UART Protocol Fundamentals
- What does “8N1” mean in serial communication?
- Why does UART need both transmitter and receiver to agree on baud rate?
- How is data transmitted: LSB first or MSB first?
- What is the purpose of start and stop bits?
- Book reference: “Making Embedded Systems” Ch. 8 (White)
- Status Flags and Polling
- What is a “flag register”?
- How do you check if bit 5 is set in a register without modifying other bits?
- What’s the difference between the TST and AND instructions?
- Why is polling considered “blocking”?
- Book reference: “The Art of ARM Assembly, Volume 1” Ch. 7 (Hyde)
- Baud Rate Calculation
- If system clock is 125MHz and you want 115200 baud, what divisor do you need?
- Formula: Divisor = ClockFreq / (16 × BaudRate)
- Can you calculate the actual error percentage?
- Book reference: RP2040 Datasheet Section 4.2.7
- ARM Exception Model (Cortex-M)
- What is an exception vs an interrupt?
- How does the processor know which ISR to call?
- What registers are automatically saved during exception entry?
- What does the NVIC do?
- Book reference: “Making Embedded Systems” Ch. 4 (White)
- Stack Frames and ISR Calling Convention
- Which registers must an ISR preserve (r4-r11)?
- Which registers are caller-saved (r0-r3)?
- What is the “EXC_RETURN” value in LR?
- Why does bx lr behave differently in an ISR than in a normal function? (Hint: LR holds EXC_RETURN, not a return address)
- Book reference: “Modern Arm Assembly Language Programming” Ch. 2 (Kusswurm)
Questions to Guide Your Design
- How do you calculate the baud rate divisor?
- RP2040’s UART clock is derived from the system clock. At 125MHz, what values go into UARTIBRD and UARTFBRD registers for 115200 baud?
- Which GPIO pins are UART0 TX and RX?
- The RP2040 has flexible GPIO function select. How do you configure GPIO0 and GPIO1 for UART0? What register controls this?
- How do you know if the TX FIFO is full?
- The UART_FR (flags register) has a TXFF bit. At what offset? How do you test just that bit without affecting others?
- What’s the difference between UART_DR and UART_FR?
- UART_DR is the data register (offset 0x000). UART_FR is flags register (offset 0x018). Why do you write to DR but read from FR?
- How do you enable UART interrupts?
- Three steps: (1) Set interrupt mask in UART_IMSC, (2) Enable UART IRQ in NVIC, (3) Write ISR and add to vector table. What are the exact register addresses?
- What happens if you don’t preserve registers in your ISR?
- If your ISR modifies r4-r11 without saving them, what will happen to the interrupted code when it resumes?
- How do you clear a UART interrupt?
- UART interrupts are level-triggered. You must clear the condition (e.g., read from RX FIFO) or write to UART_ICR. What happens if you forget?
Thinking Exercise
Before writing code, trace this on paper:
- UART Transmission Timing Diagram:
Draw the voltage levels for sending ‘A’ (0x41):
Idle: ‾‾‾‾‾‾‾‾‾‾‾ (the line idles high)
Start bit: |___
Bit 0 (1): |‾‾
Bit 1 (0): |__
Bit 2 (0): |__
Bit 3 (0): |__
Bit 4 (0): |__
Bit 5 (0): |__
Bit 6 (1): |‾‾
Bit 7 (0): |__
Stop bit: |‾‾‾
Total time at 115200 baud: 10 bits × 8.68μs = 86.8μs
- Register Configuration Sequence:
Step 1: Disable UART (clear UARTEN bit in UARTCR)
Step 2: Wait for current transmission to complete (check BUSY flag)
Step 3: Configure baud rate:
UARTIBRD = floor(125000000 / (16 × 115200)) = ?
UARTFBRD = round(fraction × 64) = ?
Step 4: Set format (8N1) in UARTLCR_H
Step 5: Enable UART and TX/RX (set UARTEN, TXE, RXE in UARTCR)
- Interrupt Flow Diagram:
Main code running
→ UART RX FIFO gets byte
→ Hardware sets RX interrupt flag
→ NVIC sees enabled interrupt
→ CPU finishes current instruction
→ CPU automatically:
  - Saves r0-r3, r12, LR, PC, xPSR to stack
  - Loads ISR address from vector table
  - Sets LR to EXC_RETURN (0xFFFFFFF9)
→ ISR executes:
  - Must save r4-r11 if used
  - Read UART_DR (clears interrupt)
  - Restore r4-r11 if saved
  - Return with bx lr (EXC_RETURN)
→ CPU automatically:
  - Restores r0-r3, r12, LR, PC, xPSR from stack
→ Main code resumes exactly where it left off
- Calculate actual baud rate error:
Target: 115200 baud
Clock: 125 MHz
Divisor = 125,000,000 / (16 × 115,200) = 67.8168...
Integer part: 67 → UARTIBRD
Fractional: 0.8168 × 64 = 52.27 → UARTFBRD = 52
Actual divisor: 67 + 52/64 = 67.8125
Actual baud: 125,000,000 / (16 × 67.8125) = 115,207.37
Error: (115,207.37 - 115,200) / 115,200 = 0.0064% ✓ (< 2% acceptable)
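The baud-rate calculation above generalizes to any clock and baud rate. Below is a minimal Python sketch for checking your hand calculation; the function names are hypothetical, and it assumes the 16× oversampling and 6-bit fractional divisor described in the formula above.

```python
def uart_divisors(clock_hz, baud):
    """Return (UARTIBRD, UARTFBRD) for a 16x-oversampling UART with a 6-bit fraction."""
    # clock / (16 * baud) scaled by 64 equals 4 * clock / baud; round to nearest.
    scaled = (4 * clock_hz + baud // 2) // baud
    return scaled >> 6, scaled & 0x3F


def actual_baud(clock_hz, ibrd, fbrd):
    """Baud rate the hardware actually produces for the given divisor pair."""
    return clock_hz / (16 * (ibrd + fbrd / 64))


ibrd, fbrd = uart_divisors(125_000_000, 115200)
real = actual_baud(125_000_000, ibrd, fbrd)
error_pct = (real - 115200) / 115200 * 100
print(ibrd, fbrd)                          # 67 52
print(f"{real:.2f} baud, error {error_pct:.4f}%")
```

If you run the UART from a different clock than 125MHz (for example before configuring the PLL), pass that frequency instead; the divisors change but the method does not.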
The Interview Questions They’ll Ask
- “Explain the UART frame format and why we need start/stop bits.”
- Expected answer: UART is asynchronous (no clock line). Start bit (0) signals beginning of byte, synchronizes receiver. 8 data bits sent LSB first. Stop bit (1) returns line to idle state and provides gap between frames. Receiver samples in middle of each bit period.
- “What’s the difference between polling and interrupt-driven I/O? When would you use each?”
- Expected answer: Polling continuously checks a status flag, blocking CPU. Simple but wastes power and CPU time. Interrupts let hardware notify CPU when ready, allowing CPU to do other work. Use polling for simple, fast operations; interrupts for slow peripherals or multi-tasking.
- “Walk me through what happens during ARM Cortex-M exception entry.”
- Expected answer: CPU finishes current instruction, stacks r0-r3, r12, LR, PC, xPSR (8 registers), loads ISR address from vector table, sets LR to EXC_RETURN magic value, jumps to ISR. Return: bx lr with EXC_RETURN triggers automatic unstacking.
- “Why must ISRs be fast, and what happens if they’re not?”
- Expected answer: ISRs block other interrupts of same/lower priority. Long ISR causes interrupt latency, potentially missing events (e.g., UART RX overrun). Best practice: ISR sets flag/queues data, main loop processes. Keep ISRs under 10-20μs.
- “How do you prevent race conditions between ISR and main code?”
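- Expected answer: Shared data needs protection. Options: (1) Disable interrupts during critical section (CPSID I / CPSIE I), (2) Use atomic operations, (3) Use volatile keyword (in C) to prevent compiler optimization. LDREX/STREX exclusive accesses provide atomic read-modify-write on ARMv7-M parts (Cortex-M3 and up), but they are not available on the Cortex-M0+ in the RP2040, where you disable interrupts instead (or use the RP2040’s hardware spinlocks between cores).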
- “What is the NVIC and how do you configure it?”
- Expected answer: Nested Vectored Interrupt Controller. Manages interrupts for Cortex-M. Configure by: (1) Set priority in NVIC_IPRx, (2) Enable specific interrupt in NVIC_ISER. Each interrupt has an IRQ number that maps to vector table position.
Hints in Layers
Hint 1: UART Register Base Addresses
.equ UART0_BASE, 0x40034000
.equ UARTDR, 0x000 // Data register
.equ UARTFR, 0x018 // Flag register
.equ UARTIBRD, 0x024 // Integer baud rate
.equ UARTFBRD, 0x028 // Fractional baud rate
.equ UARTLCR_H, 0x02C // Line control
.equ UARTCR, 0x030 // Control register
.equ UARTIMSC, 0x038 // Interrupt mask
.equ UARTICR, 0x044 // Interrupt clear
Hint 2: Baud Rate Configuration
// For 115200 baud at 125MHz system clock:
// Divisor = 125000000 / (16 * 115200) = 67.8168
ldr r0, =UART0_BASE
movs r1, #67 // Integer part (Cortex-M0+ needs the flag-setting MOVS form)
str r1, [r0, #UARTIBRD]
movs r1, #52 // Fractional: 0.8168 * 64 ≈ 52
str r1, [r0, #UARTFBRD]
Hint 3: Polling TX Function
// Send character in r1
uart_putc:
ldr r0, =UART0_BASE
movs r3, #0x20 // TXFF mask (bit 5); Thumb-1 TST only takes registers
poll_tx:
ldr r2, [r0, #UARTFR]
tst r2, r3 // Test TXFF
bne poll_tx // Wait while full
str r1, [r0, #UARTDR] // Write character
bx lr
Hint 4: NVIC Configuration for Interrupts
.equ NVIC_ISER, 0xE000E100 // Interrupt Set Enable Register
.equ UART0_IRQ, 20 // UART0 IRQ number
// Enable UART0 interrupt
ldr r0, =NVIC_ISER
movs r1, #1
lsls r1, r1, #UART0_IRQ // r1 = 1 << 20
str r1, [r0]
Hint 5: Complete ISR Structure
// In vector table (position 20 + 16 = 36):
.word uart0_isr
// ISR implementation:
uart0_isr:
push {r4-r7, lr} // Save registers ISR will use
ldr r4, =UART0_BASE
ldr r5, [r4, #UARTDR] // Read character (clears interrupt)
// Store character in buffer (your code here)
ldr r6, =rx_buffer
ldr r7, =rx_write_ptr
// ... buffer management ...
pop {r4-r7, pc} // Restore and return
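The index arithmetic behind "position 20 + 16 = 36" and the NVIC enable bit can be double-checked on the host. The Python sketch below uses made-up helper names and assumes the standard Cortex-M layout of 16 system exception slots ahead of the external interrupts, with 4-byte vector entries.

```python
SYSTEM_EXCEPTION_SLOTS = 16  # initial SP, Reset, NMI, HardFault, ... SysTick
UART0_IRQ = 20


def vector_index(irq):
    """Index of an external interrupt's entry in the vector table."""
    return SYSTEM_EXCEPTION_SLOTS + irq


def vector_offset(irq):
    """Byte offset of that entry from the start of the table (4 bytes per entry)."""
    return 4 * vector_index(irq)


def nvic_iser_mask(irq):
    """Bit to write to NVIC_ISER to enable the interrupt."""
    return 1 << irq


print(vector_index(UART0_IRQ))         # 36
print(hex(vector_offset(UART0_IRQ)))   # 0x90
print(hex(nvic_iser_mask(UART0_IRQ)))  # 0x100000
```

The offset is useful when inspecting the built image: the word at that offset from the table start should hold the address of uart0_isr with bit 0 set, since Cortex-M handlers are Thumb code.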
Hint 6: String Transmission Loop
// Send null-terminated string pointed to by r0
uart_puts:
push {r4, lr}
mov r4, r0 // Save string pointer
loop:
ldrb r1, [r4] // Load byte
cmp r1, #0 // Check for null
beq done
bl uart_putc // Send character
adds r4, r4, #1 // Next character (flag-setting form required on Cortex-M0+)
b loop
done:
pop {r4, pc}
Hint 7: Complete UART Initialization
uart_init:
ldr r0, =UART0_BASE
// Disable UART
movs r1, #0
str r1, [r0, #UARTCR]
// Set baud rate (115200)
movs r1, #67
str r1, [r0, #UARTIBRD]
movs r1, #52
str r1, [r0, #UARTFBRD]
// 8N1 format: 8 bits, no parity, 1 stop, FIFOs enabled
movs r1, #0x70 // WLEN=11 (8 bits), FEN=1 (FIFO enable)
str r1, [r0, #UARTLCR_H]
// Enable UART, TX, and RX
ldr r1, =0x301 // UARTEN=1, TXE=1, RXE=1 (too wide for an 8-bit MOVS immediate)
str r1, [r0, #UARTCR]
bx lr
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| UART Protocol Overview | “Making Embedded Systems” by Elecia White | Ch. 8 - Communication Interfaces |
| Serial Communication Timing | “RP2040 Assembly Language Programming” by Stephen Smith | Ch. 10 - UART Communication |
| UART Register Details | RP2040 Datasheet | Section 4.2 - UART |
| Baud Rate Calculation | RP2040 Datasheet | Section 4.2.7 - Baud Rate Generation |
| Polling vs Interrupts | “Making Embedded Systems” by Elecia White | Ch. 4 - Interrupts |
| Cortex-M Exception Model | “The Art of ARM Assembly, Volume 1” by Randall Hyde | Ch. 14 - Exception Handling |
| NVIC Programming | ARM Cortex-M0+ Devices Generic User Guide | Section 4.2 - NVIC |
| ISR Best Practices | “Making Embedded Systems” by Elecia White | Ch. 4.4 - Interrupt Service Routines |
| AAPCS Register Usage | “Modern Arm Assembly Language Programming” by Daniel Kusswurm | Ch. 2 - ARM Calling Convention |
| Bit Manipulation | “The Art of ARM Assembly, Volume 1” by Randall Hyde | Ch. 7 - Bit Operations |
| GPIO Function Select | RP2040 Datasheet | Section 2.19.2 - Function Select |
Project 3: Bare-Metal Framebuffer Graphics (Raspberry Pi 4/5)
- File: ARM_ASSEMBLY_LEARNING_PROJECTS.md
- Programming Language: ARM Assembly
- Coolness Level: Level 5: Pure Magic (Super Cool)
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 4: Expert
- Knowledge Area: Graphics / Embedded Systems
- Software or Tool: Raspberry Pi 4 / VideoCore
- Main Book: “Modern Arm Assembly Language Programming” by Daniel Kusswurm
What you’ll build: A bare-metal program in AArch64 assembly that initializes the GPU mailbox, requests a framebuffer, and draws graphics directly to the screen—no Linux, no libraries.
Why it teaches ARM assembly: This forces you into the full 64-bit AArch64 instruction set with its different register conventions, while dealing with real hardware initialization that requires understanding mailbox protocols and memory barriers.
Core challenges you’ll face:
- Writing AArch64 boot code that sets up the stack and clears BSS
- Understanding the VideoCore mailbox interface for framebuffer allocation
- Implementing memory barriers (dmb, dsb) for peripheral access
- Drawing pixels by calculating memory offsets from (x, y) coordinates
- Optimizing drawing loops with AArch64’s larger register file
Resources for key challenges:
- Writing a “bare metal” OS for Raspberry Pi 4 - Complete walkthrough for RPi4 bare metal
- dwelch67/raspberrypi - Extensive bare metal examples including framebuffer
Key Concepts:
- AArch64 Architecture: “Modern Arm Assembly Language Programming” Ch. 1, 17 - Daniel Kusswurm
- 64-bit Calling Convention: “The Art of ARM Assembly, Volume 1” Ch. 16 - Randall Hyde
- Memory Barriers: ARM Architecture Reference Manual (ARM DDI 0487) - Section B2.3
- Raspberry Pi Mailbox: OSDev Wiki - Raspberry Pi Bare Bones
Difficulty: Intermediate Time estimate: 1-2 weeks Prerequisites: Comfort with 32-bit ARM concepts, understanding of pointers and memory
Real world outcome:
- Your Raspberry Pi boots directly into your code (no Linux!) and displays colored rectangles or patterns on the monitor
- You can draw a simple animation (bouncing box, scrolling colors)
- You have a foundation for building a graphics demo or simple game
Learning milestones:
- Screen shows a solid color → You understand mailbox protocol and framebuffer allocation
- You can draw arbitrary pixels → You understand memory addressing in AArch64
- Simple animation runs smoothly → You understand efficient loops and timing
Real World Outcome
When you successfully complete this project, here’s exactly what you’ll experience:
The Boot Sequence: Insert your SD card into the Raspberry Pi 4/5 and power it on. Instead of the usual Linux boot sequence, within 500ms your screen will light up directly from your assembly code. No bootloader messages, no kernel log—just your code talking directly to the VideoCore GPU.
Visual Output: Your HDMI monitor displays a framebuffer at 1920x1080 resolution (or 1280x720 if you configure it for lower resolution). The default will be 32-bit RGBA format, meaning each pixel takes 4 bytes: Red, Green, Blue, and Alpha channels. You’ll see:
- Solid color fill: The entire screen filled with RGB(255, 0, 0) - pure red, or RGB(0, 128, 255) - azure blue
- Gradient patterns: Smooth color transitions from top to bottom, created by calculating RGB values based on Y coordinate
- Geometric shapes: Rectangles, horizontal/vertical lines with precise pixel-perfect positioning
- Animated content: A colored box (say, 50x50 pixels) bouncing around the screen at 60 FPS, reflecting off edges
Technical Achievement: You’ve created a roughly 8MB framebuffer in GPU memory (1920×1080×4 bytes = 8,294,400 bytes), received its address via the mailbox, and your ARM core is writing directly to it. The GPU’s display controller continuously reads this memory region and outputs it to HDMI at 60Hz refresh rate. Every pixel you write appears on screen within one refresh period (about 16.7ms).
Debug Experience: Unlike the Pico projects where you had UART for debugging, here you’re debugging blind until that first pixel appears. Your debugging tools are:
- LED blinks on GPIO pins to indicate code progress
- Using a logic analyzer on UART if you implement basic serial output
- The moment of triumph when the screen changes from black to your chosen color
Performance Characteristics: Your drawing routines will achieve:
- Full screen clear: ~2ms (writing 8MB of data)
- Single pixel plot: ~50 nanoseconds (one store instruction)
- 100×100 filled rectangle: ~200 microseconds
- 60 FPS animation with double buffering without visible tearing
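The framebuffer sizes quoted in this section follow directly from width × height × bytes per pixel. A small Python sketch (illustrative function name, assuming 4 bytes per pixel and no row padding) makes the numbers easy to recompute for other resolutions:

```python
def framebuffer_bytes(width, height, bytes_per_pixel=4):
    """Size of one frame, assuming rows are packed with no padding."""
    return width * height * bytes_per_pixel


full_hd = framebuffer_bytes(1920, 1080)
hd = framebuffer_bytes(1280, 720)
frame_period_ms = 1000 / 60

print(full_hd, round(full_hd / 2**20, 2))  # 8294400 bytes, 7.91 MiB
print(hd)                                  # 3686400
print(f"{frame_period_ms:.2f} ms per frame at 60Hz")
```

Double buffering doubles these figures, since the virtual height is twice the physical height.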
The Core Question You’re Answering
“How does software actually control what appears on a modern display without an operating system?”
This fundamental question branches into several sub-questions:
- Hardware initialization: How does an ARM CPU communicate with a completely different processor (the VideoCore GPU) to request resources?
- Memory architecture: Where does the framebuffer actually live in physical memory, and how do you get a pointer to it?
- Pixel representation: How do you convert a 2D coordinate (x, y) and a color (R, G, B) into a memory write operation?
- AArch64 differences: Why do 64-bit ARM systems require different approaches than 32-bit systems for peripheral access?
- Memory ordering: Why do you need memory barriers when writing to peripherals, and what happens if you omit them?
The deeper insight: You’re learning how heterogeneous computing actually works. The Raspberry Pi has multiple processors (ARM cores, GPU, USB controller, etc.), and this project reveals the communication protocols between them. This is the foundation of modern embedded systems where specialized processors handle different tasks.
Concepts You Must Understand First
- AArch64 Register Model and Calling Convention
- Why does AArch64 have 31 general-purpose registers (x0-x30) instead of ARM32’s 16?
- What’s the difference between W registers (32-bit) and X registers (64-bit)?
- Which registers must be preserved across function calls? (x19-x28, x29/FP, x30/LR)
- Why is x8 special for return values larger than 128 bits?
- Book references: “Modern Arm Assembly Language Programming” Ch. 1 & 2 - Daniel Kusswurm; “ARM Cortex-A Series Programmer’s Guide” Ch. 3
- Boot Process and Exception Levels
- What are EL0, EL1, EL2, EL3 (Exception Levels) and which one does your bare-metal code run at?
- How does the Raspberry Pi boot ROM load your kernel8.img file?
- Why does the GPU start first, not the ARM cores?
- What state are the CPU caches in at boot time?
- Book references: “Raspberry Pi Assembly Language Programming” Ch. 2 - Stephen Smith; Raspberry Pi 4 Boot Flow Documentation
- Memory-Mapped I/O and Bus Architecture
- What physical address does the mailbox peripheral live at? (0xFE00B880 on RPi 4)
- Why is it different from the address in the ARM peripherals manual? (Bus address vs physical address)
- What’s the difference between cached and uncached memory regions?
- Why can’t you use normal DRAM load/store instructions for peripherals without barriers?
- Book references: “Making Embedded Systems, 2nd Edition” Ch. 9 - Elecia White; BCM2711 ARM Peripherals Manual - Section 1.2
- Mailbox Protocol and Property Interface
- What is the VideoCore mailbox and why does it exist?
- What’s the structure of a mailbox message? (Buffer, tags, end marker)
- How do you request a framebuffer? (Tag 0x00048003: Set Physical Display)
- What happens if the GPU rejects your framebuffer request?
- Book references: Raspberry Pi Firmware Wiki - Mailbox Property Interface
- Memory Barriers and Ordering
- What’s the difference between DMB, DSB, and ISB instructions?
- Why do you need DSB SY before writing to the mailbox?
- What could go wrong if you omit barriers? (Writes reordered, stuck in write buffer)
- When do you need barriers for regular framebuffer writes? (Usually not needed!)
- Book references: ARM Architecture Reference Manual Section B2.3 “Memory Ordering”; “Modern Arm Assembly Language Programming” Ch. 17 - Daniel Kusswurm
- Framebuffer Pixel Formats
- What does RGB888, RGB565, RGBA32 mean?
- How do you calculate the byte offset for pixel (x, y)? (offset = (y * pitch) + (x * bytes_per_pixel))
- What is “pitch” and why might it not equal width * bytes_per_pixel? (Alignment requirements)
- What’s the difference between physical and virtual dimensions?
- Book references: “Computer Graphics from Scratch” Ch. 2 - Gabriel Gambetta
- Linker Scripts for AArch64
- Where should your code load in physical memory? (0x80000 for kernel8.img)
- How do you create separate sections for .text, .data, .bss?
- Why must you align the mailbox buffer to 16 bytes?
- Book references: Bare Metal Programming Guide - Linker Scripts
Questions to Guide Your Design
- Initialization sequence: In what order must you initialize components? (Stack → BSS clear → Mailbox request → Framebuffer received → Drawing)
- Mailbox buffer structure: How do you construct the property buffer with tags for Set Physical Width/Height (0x00048003), Set Virtual Width/Height (0x00048004), Set Depth (0x00048005), and Allocate Buffer (0x00040001)?
- Address translation: The mailbox returns a GPU bus address (0xC0000000 + offset). How do you convert it to an ARM physical address? (Mask off top bits: AND 0x3FFFFFFF)
- Pixel plotting: Given (x, y, r, g, b), how do you calculate the memory address and construct the pixel value?
- Drawing optimization: How do you fill a rectangle efficiently? (Inner loop: use STP to write 2 pixels at once, unroll loop)
- Error handling: What do you do if the mailbox never responds? (Timeout counter, blink LED in error pattern)
- Double buffering: How would you implement page flipping? (Allocate 2× virtual height, change Y offset via mailbox tag 0x00048009)
Thinking Exercise
Exercise 1: Framebuffer Memory Layout (15 minutes, paper and pencil)
Draw the memory layout for a 4×3 pixel framebuffer in RGBA32 format:
Pixel grid (x increases right, y increases down):
(0,0) (1,0) (2,0) (3,0)
(0,1) (1,1) (2,1) (3,1)
(0,2) (1,2) (2,2) (3,2)
Assume framebuffer starts at address 0x0C000000, pitch = 16 bytes.
Tasks:
- Draw the memory layout showing byte addresses and which pixel each RGBA group belongs to
- Calculate the address of pixel (2, 1) - show your work
- If you want to draw a red pixel (255, 0, 0) at (2, 1), what value (in hex) do you write?
- Why is the pitch 16 bytes here? What would it be for RGB888, where 4 pixels need only 12 bytes? (Hint: alignment)
Exercise 2: Mailbox Protocol Trace (10 minutes)
Trace through a mailbox buffer requesting a 640×480 framebuffer at 32-bit depth. What values go at each address? What values does the GPU write back after processing?
The Interview Questions They’ll Ask
- “Explain the difference between the ARM CPU and GPU on the Raspberry Pi, and how they communicate.”
- What they’re really asking: Do you understand heterogeneous computing?
- Strong answer: “The Raspberry Pi has a VideoCore GPU that actually boots first and loads the ARM kernel. They communicate via the mailbox interface—a shared memory protocol where each processor has MMIO registers to send/receive messages. The mailbox uses tagged property buffers where the ARM requests resources (like framebuffer allocation) and the GPU responds. I implemented this in assembly using memory barriers to ensure write ordering, since the two processors don’t share cache coherency.”
- “You write a pixel to the framebuffer but nothing appears on screen. Walk me through your debugging process.”
- What they’re really asking: Can you systematically debug hardware issues?
- Strong answer: “First, I’d verify the mailbox request succeeded—check that the response code is 0x80000000 and the framebuffer address is non-zero. Second, I’d check address translation—the GPU returns a bus address that needs masking. Third, I’d verify my pixel offset calculation using a simple case like (0,0). Fourth, I’d check if I used the wrong byte order—maybe I wrote ABGR instead of RGBA. Finally, I’d verify memory barriers were placed correctly before the mailbox write. I’d use GPIO pins to blink at each stage to narrow down where it fails.”
- “What’s the difference between pitch and width in a framebuffer?”
- What they’re really asking: Do you understand memory alignment?
- Strong answer: “Width is the visible pixel width, but pitch is the number of bytes per scanline in memory. They differ due to alignment: GPUs often require scanlines aligned to cache line boundaries (64 or 256 bytes). For example, a 1366-pixel width at 4 bytes/pixel is 5464 bytes, which a 64-byte alignment rounds up to a pitch of 5504 bytes. If you use width instead of pitch in your offset calculation, your image will appear sheared and corrupted.”
- “Why do you need a memory barrier before writing to the mailbox, but not when drawing pixels?”
- What they’re really asking: Do you understand memory ordering and when it matters?
- Strong answer: “The mailbox is a peripheral with side effects: once you write the buffer address to it, the GPU acts on whatever is in the buffer. Without a barrier, the CPU could reorder or buffer writes, so the mailbox write might become visible before the buffer contents do. Framebuffer writes have no such handshake. The GPU scans the buffer out continuously, so a pixel that lands slightly late simply appears on the next refresh. The one caveat is caching: the GPU is not coherent with the ARM data cache, so if you enable the cache you must map the framebuffer uncached or clean the cache after drawing.”
- “How would you implement double buffering for smooth animation?”
- What they’re really asking: Do you understand the GPU’s display pipeline?
- Strong answer: “Allocate a virtual framebuffer twice the physical height—e.g., 1920×2160 for a 1920×1080 display. Draw to the off-screen half while GPU displays the other half. When drawing completes, use mailbox tag 0x00048009 (Set Virtual Offset) to instantly flip which half is visible. This is page flipping—true double buffering. Alternatively, you could allocate two separate buffers and change the framebuffer base address, but changing the offset is faster since it doesn’t reallocate.”
- “You see screen tearing during animation. What’s causing it and how do you fix it?”
- What they’re really asking: Do you understand frame synchronization?
- Strong answer: “Tearing occurs when you update the framebuffer mid-refresh—the GPU is halfway through drawing the old frame when you write new data. Fix it by synchronizing with vertical blank (VSYNC). On bare metal, you can use a timer to wait ~16ms between frames, or ideally, use mailbox properties to detect VSYNC events if the VideoCore exposes them. Only flip buffers or update pixels during the VSYNC interval.”
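The pitch rounding discussed in the pitch-versus-width answer above can be checked with a few lines of Python. This is only a sketch: the 64-byte alignment and the helper names `align_up` and `padded_pitch` are assumptions for illustration, and on real hardware the pitch should be read back from the firmware rather than computed.

```python
def align_up(value, alignment):
    """Round value up to the next multiple of alignment (a power of two)."""
    return (value + alignment - 1) & ~(alignment - 1)


def padded_pitch(width, bytes_per_pixel=4, alignment=64):
    """Bytes per scanline once the row is padded to the assumed alignment."""
    return align_up(width * bytes_per_pixel, alignment)


print(padded_pitch(1366))  # 5504: 5464 bytes of pixels plus 40 bytes of padding
print(padded_pitch(1920))  # 7680: already a multiple of 64, so no padding
```

A width whose row size is already aligned gets no padding, which is why many common resolutions appear to work even when code wrongly uses width × 4 as the pitch.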
Hints in Layers
Hint Layer 1: Getting Started
Your first goal is to see anything on screen. Start with the mailbox.
.section .data
.align 4
mailbox_buffer:
.word 22*4 // Buffer size in bytes (this buffer holds 22 words)
.word 0 // Request code
// Set physical size
.word 0x48003 // Tag
.word 8 // Value size
.word 8 // Request size
.word 1280 // Width
.word 720 // Height
// Set virtual size (same as physical)
.word 0x48004
.word 8
.word 8
.word 1280
.word 720
// Set depth
.word 0x48005
.word 4
.word 4
.word 32 // 32 bits per pixel
// Allocate buffer
.word 0x40001
.word 8
.word 8
.word 16 // Alignment
.word 0 // Size (response)
.word 0 // End tag
Don’t move on until this mailbox request returns successfully.
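Counting words by hand is error-prone, so it can help to model the property buffer in Python first. The sketch below is an illustration only: the helper names `tag` and `build_framebuffer_request` are hypothetical, and each tag's request word is written as 0 here, which is one accepted convention for a request.

```python
def tag(tag_id, *values):
    """One property tag: id, value-buffer size in bytes, request code, values."""
    return [tag_id, 4 * len(values), 0, *values]


def build_framebuffer_request(width, height, depth):
    body = (
        tag(0x00048003, width, height)    # Set physical width/height
        + tag(0x00048004, width, height)  # Set virtual width/height
        + tag(0x00048005, depth)          # Set depth
        + tag(0x00040001, 16, 0)          # Allocate buffer: alignment in, address/size out
        + [0]                             # End tag
    )
    total_bytes = 4 * (2 + len(body))     # +2 for the size and request-code words
    return [total_bytes, 0, *body]


buf = build_framebuffer_request(1280, 720, 32)
print(len(buf), buf[0])  # 22 88
```

The first word must equal the byte length of the whole buffer, so deriving it from the list length avoids a mismatch when tags are added or removed.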
Hint Layer 2: Mailbox Communication
The mailbox peripheral on RPi 4 is at physical address 0xFE00B880 (note: this is different from RPi 3!).
.equ MAILBOX_BASE, 0xFE00B880
.equ MAILBOX_READ, 0x00
.equ MAILBOX_STATUS, 0x18
.equ MAILBOX_WRITE, 0x20
.equ MAILBOX_FULL, 0x80000000
.equ MAILBOX_EMPTY, 0x40000000
mailbox_write:
// x0 = buffer address, x1 = channel (8 for property tags)
ldr x2, =MAILBOX_BASE
orr x0, x0, x1 // Combine address and channel
// Wait until mailbox not full
1: ldr w3, [x2, MAILBOX_STATUS]
tst w3, MAILBOX_FULL
b.ne 1b
// Write to mailbox
dsb sy // Barrier before peripheral write!
str w0, [x2, MAILBOX_WRITE]
ret
Key insight: The buffer address must be physical (not virtual) and bits [3:0] encode the channel number.
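The address-plus-channel packing is easy to get wrong, so here is a small Python model of it. The function names are made up for this sketch, and the example buffer address 0x81000 is an arbitrary 16-byte-aligned value.

```python
def encode_message(buffer_addr, channel):
    """Combine a 16-byte-aligned buffer address with a 4-bit channel number."""
    if buffer_addr & 0xF:
        raise ValueError("buffer address must be 16-byte aligned")
    if not 0 <= channel <= 0xF:
        raise ValueError("channel must fit in 4 bits")
    return (buffer_addr | channel) & 0xFFFFFFFF


def decode_message(word):
    """Split a mailbox word back into (buffer address, channel)."""
    return word & 0xFFFFFFF0, word & 0xF


msg = encode_message(0x00081000, 8)
print(hex(msg))             # 0x81008
print(decode_message(msg))  # (528384, 8), i.e. (0x81000, 8)
```

The alignment check mirrors why the buffer needs `.align 4`: the low four address bits are not transmitted, so they must already be zero.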
Hint Layer 3: Parsing the Response
After sending the mailbox request, you need to read it back to get the framebuffer address:
mailbox_read:
// x0 = channel to read from
ldr x2, =MAILBOX_BASE
mov x1, x0 // Save channel
1: // Wait for response
ldr w3, [x2, MAILBOX_STATUS]
tst w3, MAILBOX_EMPTY
b.ne 1b
// Read response
ldr w0, [x2, MAILBOX_READ]
and w3, w0, #0xF // Extract channel
cmp w3, w1 // Is it our channel?
b.ne 1b // No, keep waiting
and x0, x0, #0xFFFFFFF0 // Mask off channel bits
ret
After this returns, check the response buffer for the framebuffer address.
Hint Layer 4: Address Translation
The GPU returns a bus address like 0xC1234000. Convert it to ARM physical address:
// x0 = GPU bus address
and x0, x0, #0x3FFFFFFF // Strip bus identifier
// Now x0 is the physical address ARM can use
Why? The VideoCore uses a different address space. Bus addresses starting at 0xC0000000 are the VideoCore’s uncached alias of SDRAM; clearing the top two bits gives the address as the ARM sees it.
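The same masking, written as a Python one-liner so the result can be checked against the example address. The names `VC_ALIAS_MASK` and `bus_to_arm_physical` are illustrative, not taken from any library.

```python
VC_ALIAS_MASK = 0xC0000000  # top two bits select the VideoCore alias


def bus_to_arm_physical(bus_addr):
    """Clear the alias bits of a 32-bit VideoCore bus address."""
    return bus_addr & ~VC_ALIAS_MASK & 0xFFFFFFFF


print(hex(bus_to_arm_physical(0xC1234000)))  # 0x1234000
```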
Hint Layer 5: Drawing Your First Pixel
// x0 = framebuffer base
// w1 = x coordinate
// w2 = y coordinate
// w3 = color (0x00RRGGBB format)
draw_pixel:
ldr x4, =framebuffer_pitch // e.g., 1280 * 4 = 5120
ldr w4, [x4]
// offset = y * pitch + x * 4
mul w5, w2, w4 // y * pitch
add w5, w5, w1, lsl #2 // + (x * 4)
// Write pixel
str w3, [x0, w5, uxtw] // W-register offsets need an explicit extend
ret
Test with: draw_pixel(framebuffer, 100, 100, 0x00FF0000) - should see a red pixel near top-left.
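Before running the test on hardware, the offset arithmetic can be checked in Python. This sketch assumes a pitch of 5120 bytes (1280 × 4, no padding) and the 0x00RRGGBB layout from the comment above; `pixel_offset` and `pack_xrgb` are names invented for the example.

```python
def pixel_offset(x, y, pitch, bytes_per_pixel=4):
    """Byte offset of pixel (x, y) from the framebuffer base."""
    return y * pitch + x * bytes_per_pixel


def pack_xrgb(r, g, b):
    """Build a 32-bit 0x00RRGGBB pixel value from 8-bit channels."""
    return (r << 16) | (g << 8) | b


print(pixel_offset(100, 100, 5120))  # 512400
print(hex(pack_xrgb(255, 0, 0)))     # 0xff0000
```

If the firmware reports a pitch larger than width × 4, only the `pitch` argument changes; the formula stays the same.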
Hint Layer 6: Fill Screen Optimization
To fill the entire screen efficiently:
fill_screen:
// x0 = framebuffer base
// w1 = color
ldr x2, =1280*720 // Pixel count
1: str w1, [x0], #4 // Write and post-increment
sub x2, x2, #1
cbnz x2, 1b
ret
Better optimization using SIMD:
dup v0.4s, w1 // Broadcast color to all 4 lanes
1: st1 {v0.4s}, [x0], #16 // Write 4 pixels at once
sub x2, x2, #4
cbnz x2, 1b
This issues a quarter as many store instructions; how much faster it runs in practice depends on memory bandwidth.
Books That Will Help
| Topic | Book | Chapter/Section |
|---|---|---|
| AArch64 register model | Modern Arm Assembly Language Programming (Kusswurm) | Ch. 1-2 |
| AArch64 instruction set | The Art of ARM Assembly, Volume 1 (Hyde) | Ch. 16 |
| Memory barriers | ARM Architecture Reference Manual | Section B2.3 |
| Raspberry Pi boot process | Raspberry Pi Assembly Language Programming (Smith) | Ch. 2, 11 |
| Mailbox protocol | RPi Firmware Wiki | Mailbox Property Interface |
| Peripheral addressing | BCM2711 ARM Peripherals | Section 1.2 |
| Framebuffer concepts | Computer Graphics from Scratch (Gambetta) | Ch. 2 |
| Bare-metal examples | rpi4os.com | Part 4 - Framebuffer |
| Optimization techniques | Write Great Code, Volume 2 (Hyde) | Ch. 12-13 |
| SIMD programming | Modern Arm Assembly Language Programming (Kusswurm) | Ch. 8-11 |
Project 4: PIO Assembly for Custom Protocol (Raspberry Pi Pico)
- File: ARM_ASSEMBLY_LEARNING_PROJECTS.md
- Programming Language: ARM Assembly & PIO Assembly
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 4: Expert
- Knowledge Area: Embedded Systems / Timing
- Software or Tool: Raspberry Pi Pico (PIO)
- Main Book: “RP2040 Assembly Language Programming” by Stephen Smith
What you’ll build: Use the RP2040’s unique PIO (Programmable I/O) state machines to implement the WS2812B (NeoPixel) LED protocol in PIO assembly—a completely different assembly language than ARM!
Why it teaches ARM assembly: The PIO has its own 9-instruction assembly language that runs in parallel with the Cortex-M0+ cores. Understanding both systems shows you how specialized vs general-purpose processors differ, and you’ll write ARM assembly to control the PIO.
Core challenges you’ll face:
- Learning PIO assembly’s unique instruction set (just 9 instructions!)
- Understanding side-set and delay for precise timing
- Writing ARM assembly to load PIO programs and configure state machines
- Generating precisely-timed signals (WS2812B needs 800kHz timing)
- Coordinating between ARM code and PIO execution
Key Concepts:
- PIO Architecture: RP2040 Datasheet Chapter 3 (PIO) - Raspberry Pi Foundation
- WS2812B Protocol Timing: “RP2040 Assembly Language Programming” Ch. 13-15 - Stephen Smith
- State Machine Concepts: DigiKey - How to Use PIO
- ARM/PIO Integration: Raspberry Pi Pico Assembly Programming - Codalogic
Difficulty: Intermediate-Advanced Time estimate: 1-2 weeks Prerequisites: Projects 1-2 completed, basic understanding of timing diagrams
Real world outcome:
- A strip of WS2812B LEDs displaying patterns you control
- Rainbow effects, chasing lights, or reactive displays
- Understanding of how to implement any bit-banged protocol with precise timing
Learning milestones:
- Single LED lights up with correct color → You understand PIO basics and WS2812B protocol
- Full strip shows patterns → You understand FIFO communication between ARM and PIO
- Smooth animations run → You’ve mastered both ARM and PIO assembly working together
Real World Outcome
When you successfully complete this project, here’s exactly what you’ll see:
Physical Setup: A strip of WS2812B LEDs (NeoPixels) connected to GPIO 0 on your Raspberry Pi Pico. The strip has individually addressable LEDs, each containing red, green, and blue LEDs. Common strips have 30, 60, or 144 LEDs per meter.
Visual Output: Your LED strip will display:
- Single color test: The first LED lights up bright red (255, 0, 0), confirming the PIO timing is correct
- Full strip solid color: All LEDs show the same color (e.g., cyan: RGB(0, 255, 255)), proving the data chain works
- Rainbow pattern: Each LED shows a different hue, creating a smooth rainbow gradient across the strip
- Chasing animation: A colored “dot” moves smoothly along the strip at your chosen speed
- Color fade effects: LEDs smoothly transition between colors, updating at 60 FPS
Technical Achievement: The PIO state machine is generating a precisely-timed 800kHz signal on GPIO 0. Each bit is encoded as:
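- 0 bit: 350ns high, 800ns low (1.15μs nominal)
- 1 bit: 700ns high, 600ns low (1.3μs nominal)
- Each phase has a ±150ns tolerance, so both encodings fit the 1.25μs bit period of an 800kHz stream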
Your ARM code pushes 24-bit RGB values (GRB format) into the PIO FIFO, and the PIO automatically serializes them into this protocol. The entire operation happens in parallel with your ARM core—you can update LED colors while the PIO independently handles the bit-level timing.
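To make the serialization concrete, the following Python sketch expands one LED's color into the 24 pulses that appear on the wire, most significant bit first in G, R, B order. It uses the nominal timings quoted above; `grb_to_pulses` is a name made up for this illustration.

```python
T0H, T0L = 350, 800  # nominal 0-bit high/low times in ns
T1H, T1L = 700, 600  # nominal 1-bit high/low times in ns


def grb_to_pulses(r, g, b):
    """Return 24 (high_ns, low_ns) pairs for one LED, MSB first, GRB order."""
    word = (g << 16) | (r << 8) | b
    pulses = []
    for bit in range(23, -1, -1):
        if (word >> bit) & 1:
            pulses.append((T1H, T1L))
        else:
            pulses.append((T0H, T0L))
    return pulses


red = grb_to_pulses(255, 0, 0)
print(red[0], red[8], red[16])         # (350, 800) (700, 600) (350, 800)
print(sum(h + l for h, l in red))      # 28800 ns for this LED
```

For pure red the first eight pulses (green) are short, the next eight (red) are long, and the last eight (blue) are short again, which is a handy pattern to look for on a logic analyzer.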
Debugging Experience: Unlike GPIO or UART, PIO errors are silent. Your debugging process will be:
- Logic analyzer on GPIO 0 showing the exact timing waveform
- First LED not lighting → timing is wrong or data format incorrect
- First LED correct but others wrong → data chain issue or reset timing
- Random colors → bit order wrong (RGB vs GRB) or endianness issue
- Flickering → not sending data fast enough or FIFO underrun
Performance: With PIO handling the timing, your ARM core is free to:
- Calculate next frame’s colors (HSV to RGB conversion)
- Read sensor data for reactive effects
- Handle button input
- Run at low clock speed to save power
Typical performance:
- 60 LED strip update: 1.8ms (30μs per LED × 60)
- ARM CPU usage during update: <5% (just FIFO writes)
- Power: up to ~18W for 60 LEDs at full white brightness (roughly 60mA per LED at 5V)
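These figures follow from simple arithmetic, sketched here in Python. The 60mA-per-LED draw at full white and the 50μs reset gap are assumptions taken from typical WS2812B figures; measure your own strip before sizing a power supply.

```python
def frame_time_us(num_leds, bit_period_us=1.25, reset_us=50):
    """Time to clock out one frame: 24 bits per LED plus the reset gap."""
    return num_leds * 24 * bit_period_us + reset_us


def max_power_w(num_leds, ma_per_led=60, volts=5.0):
    """Worst-case power with every LED at full white (assumed current draw)."""
    return num_leds * ma_per_led / 1000 * volts


print(frame_time_us(60))   # 1850.0 (1800 us of data + 50 us reset)
print(max_power_w(60))
```

At roughly 1.85ms per frame, a 60-LED strip can be refreshed several hundred times per second, so a 60 FPS animation leaves most of each frame free for color calculation.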
The Core Question You’re Answering
“How do you generate microsecond-precise signals on a microcontroller while the CPU does other work?”
This question reveals a fundamental tradeoff in embedded systems:
- Software bit-banging (toggling GPIO in a loop): Tight timing, but it blocks the CPU and breaks if an interrupt fires mid-bit
- Hardware peripherals (UART, SPI): Non-blocking but fixed protocols
- PIO (Programmable I/O): Non-blocking AND custom protocols!
You’re learning how specialized, parallel processors handle timing-critical tasks that would be impossible or wasteful on a general-purpose CPU. This is the same principle behind GPUs, DSPs, and DMA controllers—offload specialized work to dedicated hardware.
Concepts You Must Understand First
- What is the WS2812B protocol and why is it challenging?
- Why can’t you use SPI for WS2812B? (Timing doesn’t match)
- What does “800kHz” mean for this protocol? (Each bit is 1.25μs)
- Why does WS2812B use PWM encoding instead of levels? (Single wire, self-clocking)
- What is the GRB byte order? (Green, Red, Blue per pixel)
- Book references: “RP2040 Assembly Language Programming” Ch. 13 - Stephen Smith
- How does a PIO state machine work?
- What are the 4 PIO state machines and can they run simultaneously?
- What is the instruction memory (32 instructions shared across all SMs)?
- What are the TX and RX FIFOs? (One pair per SM, each four 32-bit words deep)
- How fast can a PIO state machine run? (Up to system clock speed: 125MHz)
- Book references: RP2040 Datasheet Ch. 3.1-3.2; “RP2040 Assembly Language Programming” Ch. 14-15
- What is PIO assembly’s instruction set?
- JMP, WAIT, IN, OUT, PUSH, PULL, MOV, IRQ, SET (just 9 instructions!)
- What is “side-set” and how does it differ from SET? (Parallel GPIO control)
- What are delay cycles and how do you encode them? (A 5-bit field shared between delay and side-set)
- Why is there no ADD or SUB instruction? (PIO is not a CPU, it’s a sequencer)
- Book references: RP2040 Datasheet Ch. 3.4 “PIO Instruction Set”
- How do ARM and PIO coordinate?
- How does ARM load a PIO program into instruction memory? (Write to the INSTR_MEM0-INSTR_MEM31 registers)
- What is the PIO FIFO and how does ARM write to it? (Memory-mapped registers)
- How do you start a state machine? (Set SM_ENABLE bit in PIO_CTRL)
- What happens if the FIFO is empty when PIO pulls? (Stalls until data available)
- Book references: RP2040 Datasheet Ch. 3.5 “PIO Control and Status Registers”
- Clock division and timing
- The system clock is 125MHz. How do you get 800kHz for WS2812B?
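- Clock divider formula: actual_freq = sys_freq / (divider + frac/256)
- For 800kHz: 125MHz / 800kHz = 156.25, but that would leave only one PIO cycle per bit. Each bit needs several PIO cycles (10 in the four-instruction program used later), so the divider is 125MHz / (800kHz × 10) = 15.625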
- Why does the WS2812B program use side-set instead of separate SET instructions? (Saves cycles)
- Book references: “RP2040 Assembly Language Programming” Ch. 15
- GPIO muxing and PIO pins
- How do you assign a GPIO pin to PIO control? (GPIO_CTRL register, FUNCSEL=PIO0/PIO1)
- What is the difference between base pins, sideset pins, and SET pins?
- Can multiple PIO state machines share the same GPIO pin? (Any of them can read it, but only one should drive it)
- Book references: RP2040 Datasheet Ch. 2.19 “GPIO Functions”
- Autopull and shift direction
- What is “autopull” and why use it? (Automatically pulls from FIFO when OSR empty)
- What is the OSR (Output Shift Register)?
- Shift direction: MSB-first or LSB-first? (Depends on protocol)
- Pull threshold: how many bits must be shifted before pull? (Any value from 1 to 32; 24 for one GRB pixel)
- Book references: RP2040 Datasheet Ch. 3.5.4 “Autopush and Autopull”
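The clock-divider question above can be worked through numerically. The sketch below assumes the SMx_CLKDIV layout from the RP2040 datasheet (integer part in bits 31:16, fractional part in bits 15:8) and a program that spends 10 PIO cycles per bit; `pio_clkdiv` is a hypothetical helper name.

```python
def pio_clkdiv(sys_hz, bit_hz, cycles_per_bit):
    """Return (INT, FRAC, register value) for the state machine clock divider."""
    div = sys_hz / (bit_hz * cycles_per_bit)
    integer = int(div)
    frac = round((div - integer) * 256)
    if frac == 256:          # rounding carried into the integer part
        integer, frac = integer + 1, 0
    return integer, frac, (integer << 16) | (frac << 8)


integer, frac, reg = pio_clkdiv(125_000_000, 800_000, 10)
print(integer, frac, hex(reg))  # 15 160 0xfa000
```

Because 15.625 is exactly representable with an 8-bit fraction (160/256), the state machine runs at exactly 8MHz and each bit takes exactly 1.25μs.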
Questions to Guide Your Design
- PIO program structure: What is the minimal PIO program to send one bit to a WS2812B LED?
; Simplified (not complete):
.wrap_target
bitloop:
out x, 1 ; Get one bit from OSR into x
jmp !x send_0 ; If bit is 0, jump to send_0
send_1:
set pins, 1 [T1-1] ; High for T1 cycles
set pins, 0 [T2-1] ; Low for T2 cycles
jmp bitloop
send_0:
set pins, 1 [T3-1] ; High for T3 cycles
set pins, 0 [T4-1] ; Low for T4 cycles
.wrap
- Timing calculations: Given 125MHz system clock, what clock divider gets you close to 800kHz?
- Target: 1.25μs per bit = 800kHz
- PIO instructions with timing: need to fit 0.35μs, 0.7μs, 0.6μs, 0.8μs pulses
- Solution: Use a clock divider to make each PIO cycle equal a fraction of 1.25μs
- ARM-side setup: What ARM assembly code loads the PIO program?
@ Load PIO program into instruction memory (Cortex-M0+ Thumb code)
ldr r0, =(PIO0_BASE + PIO_INSTR_MEM) @ Address of INSTR_MEM0
ldr r1, =pio_program @ Your PIO instructions, one per word
movs r2, #program_len @ Instruction count
loop:
ldmia r1!, {r3} @ Load instruction, advance source
stmia r0!, {r3} @ Store to INSTR_MEMn, advance destination
subs r2, #1
bne loop
- FIFO writing: How do you send an RGB value (24 bits) to the PIO?
@ Convert RGB (r, g, b) to GRB format (WS2812B requirement)
@ r4 = R, r5 = G, r6 = B
lsls r0, r5, #24 @ G in bits 24-31
lsls r1, r4, #16 @ R in bits 16-23
orrs r0, r1
lsls r1, r6, #8 @ B in bits 8-15
orrs r0, r1
@ Now r0 = 0xGGRRBB00 (left-justified, because the OSR shifts out MSB first)
ldr r1, =PIO0_TXF0 @ State machine 0 TX FIFO
str r0, [r1] @ Write to FIFO (PIO will autopull)
- Error handling: What if the FIFO is full? How do you check?
ldr r1, =PIO0_FSTAT
ldr r2, [r1]
ldr r3, =(1 << 16) @ TXFULL bit for SM0 (FSTAT bits 19:16)
tst r2, r3
bne fifo_full @ If full, wait or skip
- Reset timing: WS2812B needs >50μs low to latch colors. How do you ensure this?
@ After writing all LEDs, wait for latch time
ldr r0, =(50 * 125) @ 6250 iterations; at ~3 cycles each this is ~150μs, above the 50μs minimum
delay:
subs r0, #1
bne delay
- Multiple LEDs: How do you send data for a 60-LED strip?
movs r7, #60 @ LED count
led_loop:
@ Calculate next LED color (e.g., rainbow)
bl calculate_rgb
@ Send to FIFO
bl write_pio_fifo
subs r7, #1
bne led_loop
@ Wait for latch
bl wait_latch
Thinking Exercise
Exercise 1: WS2812B Timing Diagram (15 minutes, paper and pencil)
Draw the voltage waveform for sending two bits: 1 then 0
Bit 1 encoding:
______ (T1H = 700ns high, T1L = 600ns low)
| |________|
Bit 0 encoding:
___ (T0H = 350ns high, T0L = 800ns low)
| |___________|
Tasks:
- Draw the combined waveform for sending 1 then 0 (about 2.5μs total)
- Mark the time axis in nanoseconds
- If system clock is 125MHz, how many CPU cycles is 700ns? (125 × 0.7 = 87.5 cycles)
- Why is the PIO better than bit-banging this in ARM assembly? (Precise timing without blocking CPU)
Exercise 2: PIO Instruction Trace (20 minutes)
Trace through this PIO program for input data 0b10:
.program ws2812
.side_set 1
.wrap_target
bitloop:
out x, 1 side 0 ; Get one bit, drive pin low
jmp !x do_zero side 1 [T1-1] ; Pin high, test bit
do_one:
jmp bitloop side 1 [T2-1] ; Bit was 1: keep pin high (long pulse)
do_zero:
nop side 0 [T3-1] ; Bit was 0: pin low (short pulse)
.wrap
Tasks:
- Trace register X and pin state after each instruction for input 0b10
- How many cycles does it take to send one bit?
- What happens when OSR is empty? (Autopull from FIFO)
The Interview Questions They’ll Ask
- “What makes the PIO unique compared to traditional peripherals like UART or SPI?”
- What they’re really asking: Do you understand programmable hardware?
- Strong answer: “PIO is a programmable state machine with its own instruction memory, not a fixed-function peripheral. You write a tiny assembly program (up to 32 instructions) that runs independently at up to 125MHz. This lets you implement custom protocols—like WS2812B’s precise PWM timing—that don’t match standard peripherals. It’s like having a tiny, specialized coprocessor for I/O.”
- “Why can’t you use SPI to drive WS2812B LEDs?”
- What they’re really asking: Do you understand protocol timing requirements?
- Strong answer: “WS2812B uses PWM encoding where a ‘1’ is 700ns high/600ns low and a ‘0’ is 350ns high/800ns low, both roughly 1.25μs in total. SPI uses clock + data lines with fixed bit times. You’d need to run SPI at exactly 2.4MHz and carefully craft bytes to produce the right pulse widths, which is fragile and wastes bandwidth. PIO lets you directly encode the protocol timing.”
- “Walk me through what happens when you write to the PIO FIFO.”
- What they’re really asking: Do you understand the ARM-PIO data flow?
- Strong answer: “When I write a 32-bit value to PIO_TXF0 using ARM assembly, it goes into a 4-deep hardware FIFO. The PIO state machine, running in parallel, has autopull configured. When its OSR (Output Shift Register) empties after shifting out 24 bits, the PIO automatically pulls the next value from the FIFO into the OSR. If the FIFO is empty, the PIO stalls until ARM writes more data. This decouples ARM timing from PIO timing.”
- “How do you debug PIO timing issues?”
- What they’re really asking: Can you debug hardware timing?
- Strong answer: “First, use a logic analyzer on the GPIO pin to capture the actual waveform and measure pulse widths. Compare against spec: WS2812B needs 0.35/0.7μs ±150ns. Second, check the clock divider calculation—off-by-one errors are common. Third, verify the PIO program’s delay cycles in brackets [N]. Fourth, check if autopull threshold matches your data size (24 bits for RGB). Finally, verify GPIO is actually assigned to PIO, not ARM.”
- “What happens if your ARM code can’t keep up with the PIO FIFO?”
- What they’re really asking: Do you understand real-time constraints?
- Strong answer: “If ARM doesn’t write to the FIFO fast enough, the FIFO empties and the PIO stalls waiting for autopull. For WS2812B this breaks the data chain: if the line stays low longer than the reset time, the strip latches early and the following data restarts at the first LED. Solutions: (1) Use DMA to feed the FIFO automatically, (2) Pre-compute the entire frame into a buffer, (3) Double the TX FIFO depth to 8 words by joining the unused RX FIFO (FJOIN_TX in SHIFTCTRL), (4) Simplify color calculations to meet the real-time deadline of ~30μs per LED.”
- “How would you implement a rainbow effect in ARM assembly?”
- What they’re really asking: Can you combine algorithms with hardware control?
- Strong answer: “I’d use HSV color space—vary hue from 0-360° across the strip, keep saturation and value at 100%. For each LED, calculate hue = (led_index × 360 / strip_length) + animation_offset. Convert HSV to RGB using piecewise functions in assembly. Increment animation_offset each frame for motion. The math involves multiply/divide, so I’d use lookup tables or integer approximations to stay under the 30μs per-LED budget. Write final RGB values to PIO FIFO in GRB order.”
Hints in Layers
Hint Layer 1: Getting Started with PIO
Before writing any code, understand the PIO architecture. The RP2040 has 2 PIO blocks (PIO0, PIO1), each with:
- 4 state machines (SM0-SM3)
- 32 instruction memory slots (shared across all 4 SMs)
- 4 TX and 4 RX FIFOs
Start with the WS2812B example from the Pico SDK, but in assembly:
.equ PIO0_BASE, 0x50200000
.equ PIO_CTRL, 0x000
.equ PIO_FSTAT, 0x004
.equ PIO_INSTR_MEM, 0x048
.equ PIO_SM0_CLKDIV, 0x0C8
.equ PIO_SM0_EXECCTRL, 0x0CC
.equ PIO_SM0_SHIFTCTRL, 0x0D0
.equ PIO_SM0_PINCTRL, 0x0DC
.equ PIO_TXF0, 0x010
Hint Layer 2: The WS2812B PIO Program
Here’s a complete PIO program for WS2812B:
.program ws2812
.side_set 1
.define public T1 2
.define public T2 5
.define public T3 3
.wrap_target
bitloop:
out x, 1 side 0 [T3 - 1] ; Side-set still takes place when instruction stalls
jmp !x do_zero side 1 [T1 - 1] ; Branch on bit from OSR
do_one:
jmp bitloop side 1 [T2 - 1]
do_zero:
nop side 0 [T2 - 1]
.wrap
This is the actual program from the Pico SDK. It’s only 4 instructions!
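To see what waveform these four instructions produce, the Python sketch below lists the side-set pin level for every PIO cycle of one loop iteration. It assumes the T1/T2/T3 values defined in the program and a state machine clock of 8MHz (125ns per cycle); the function names are invented for this illustration.

```python
T1, T2, T3 = 2, 5, 3  # cycle counts from the .define lines


def cycles_for_bit(bit):
    """Pin level on each PIO cycle of one pass through the loop."""
    levels = [0] * T3                  # out x, 1        side 0 [T3-1]
    levels += [1] * T1                 # jmp !x do_zero  side 1 [T1-1]
    levels += [1 if bit else 0] * T2   # do_one: side 1 / do_zero: side 0, both [T2-1]
    return levels


def high_time_ns(bit, sm_hz=8_000_000):
    """How long the pin stays high for this bit at the given SM clock."""
    return sum(cycles_for_bit(bit)) * 1e9 / sm_hz


print(len(cycles_for_bit(1)))               # 10 cycles per bit
print(high_time_ns(1), high_time_ns(0))     # 875.0 250.0
```

Both bit values take the same 10 cycles, so the bit period is constant and only the high time changes, which is exactly what a self-clocking one-wire protocol needs.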
Hint Layer 3: Loading the PIO Program in ARM Assembly
@ PIO program, one 16-bit instruction per 32-bit word
pio_ws2812_program:
.word 0x6221 @ out x, 1 side 0 [2]
.word 0x1123 @ jmp !x do_zero side 1 [1] (do_zero is at address 3)
.word 0x1400 @ jmp bitloop side 1 [4] (bitloop is at address 0)
.word 0xA442 @ nop side 0 [4]
load_pio_program:
ldr r0, =PIO0_BASE
ldr r1, =pio_ws2812_program
@ Load 4 instructions into PIO instruction memory
ldr r2, [r1, #0]
str r2, [r0, #PIO_INSTR_MEM + 0]
ldr r2, [r1, #4]
str r2, [r0, #PIO_INSTR_MEM + 4]
ldr r2, [r1, #8]
str r2, [r0, #PIO_INSTR_MEM + 8]
ldr r2, [r1, #12]
str r2, [r0, #PIO_INSTR_MEM + 12]
bx lr
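Hand-assembled PIO opcodes are a common source of bugs, so it is worth knowing how the 16-bit words are built. The sketch below follows the RP2040 datasheet layout (opcode in bits 15:13, the delay/side-set field in bits 12:8, operands in bits 7:0) and assumes `.side_set 1`, which gives one side-set bit and four delay bits. The constant and function names are made up for this example.

```python
OUT, JMP, MOV = 0b011, 0b000, 0b101   # opcodes (bits 15:13)
DEST_X, COND_NOT_X, REG_Y = 0b001, 0b001, 0b010


def encode(opcode, side, delay, operand):
    """Pack one PIO instruction, assuming 1 side-set bit and 4 delay bits."""
    return (opcode << 13) | (side << 12) | (delay << 8) | operand


program = [
    encode(OUT, 0, 2, (DEST_X << 5) | 1),       # out x, 1        side 0 [2]
    encode(JMP, 1, 1, (COND_NOT_X << 5) | 3),   # jmp !x, 3       side 1 [1]
    encode(JMP, 1, 4, 0),                       # jmp 0           side 1 [4]
    encode(MOV, 0, 4, (REG_Y << 5) | REG_Y),    # nop (mov y, y)  side 0 [4]
]
print([hex(word) for word in program])  # ['0x6221', '0x1123', '0x1400', '0xa442']
```

The jump targets are absolute instruction-memory addresses, so these words are only valid when the program is loaded at offset 0; loading it elsewhere means adding the offset to both jump operands.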
Hint Layer 4: Configuring the State Machine
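configure_sm:
ldr r0, =PIO0_BASE
@ Each bit takes T1+T2+T3 = 10 PIO cycles, so the state machine must run at 8MHz
@ 125MHz / 8MHz = 15.625 → INT = 15, FRAC = 0.625 * 256 = 160
@ CLKDIV layout: INT in bits 31:16, FRAC in bits 15:8
ldr r1, =((15 << 16) | (160 << 8))
ldr r2, =PIO_SM0_CLKDIV @ Offsets above 124 need a register on Cortex-M0+
str r1, [r0, r2]
@ Wrap from instruction 3 back to instruction 0 (WRAP_TOP bits 16:12, WRAP_BOTTOM bits 11:7)
ldr r1, =(3 << 12)
ldr r2, =PIO_SM0_EXECCTRL
str r1, [r0, r2]
@ Configure autopull: threshold 24 bits, OUT_SHIFTDIR = 0 (shift left, MSB first)
ldr r1, =((1 << 17) | (24 << 25)) @ AUTOPULL | PULL_THRESH=24
ldr r2, =PIO_SM0_SHIFTCTRL
str r1, [r0, r2]
@ Configure pins: side-set base = 0, count = 1
ldr r1, =(1 << 29) @ SIDESET_COUNT = 1
ldr r2, =PIO_SM0_PINCTRL
str r1, [r0, r2]
@ Enable state machine 0 (GPIO 0 must already be muxed to PIO0 and set as an output)
movs r1, #1
str r1, [r0, #PIO_CTRL]
bx lr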
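The magic numbers written to the state machine registers come from packing bit fields. The Python sketch below rebuilds them from the field positions given in the RP2040 datasheet; the function names are hypothetical and only the fields used in this project are modeled.

```python
def shiftctrl(autopull, pull_thresh, out_shift_right):
    """SMx_SHIFTCTRL: AUTOPULL bit 17, OUT_SHIFTDIR bit 19, PULL_THRESH bits 29:25."""
    return (
        (int(autopull) << 17)
        | (int(out_shift_right) << 19)
        | ((pull_thresh & 0x1F) << 25)   # a threshold of 32 is encoded as 0
    )


def pinctrl(sideset_count, sideset_base):
    """SMx_PINCTRL: SIDESET_COUNT bits 31:29, SIDESET_BASE bits 14:10."""
    return (sideset_count << 29) | (sideset_base << 10)


def execctrl(wrap_bottom, wrap_top):
    """SMx_EXECCTRL: WRAP_TOP bits 16:12, WRAP_BOTTOM bits 11:7."""
    return (wrap_top << 12) | (wrap_bottom << 7)


print(hex(shiftctrl(True, 24, False)))  # 0x30020000
print(hex(pinctrl(1, 0)))               # 0x20000000
print(hex(execctrl(0, 3)))              # 0x3000
```

Writing whole registers like this also clears every field not mentioned, so any setting you rely on from the reset value has to be included explicitly.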
Hint Layer 5: Sending One LED Color
@ Input: r0 = red (0-255), r1 = green (0-255), r2 = blue (0-255)
send_led_color:
push {r4, lr}
@ Pack as GRB in the top 24 bits: the OSR shifts left, so bit 31 goes out first
lsls r3, r1, #24 @ Green in bits 24-31
lsls r0, r0, #16 @ Red in bits 16-23
orrs r3, r0
lsls r2, r2, #8 @ Blue in bits 8-15
orrs r3, r2
@ Wait for FIFO to have space
ldr r4, =PIO0_BASE
ldr r1, =(1 << 16) @ FSTAT.TXFULL for SM0 (bits 19:16 cover SM0-SM3)
1: ldr r0, [r4, #PIO_FSTAT]
tst r0, r1
bne 1b @ Wait if full
@ Write to FIFO
str r3, [r4, #PIO_TXF0]
pop {r4, pc}
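With the output shift register shifting left and a 24-bit pull threshold, the PIO sends bits 31 down to 8 of each FIFO word, so the GRB value has to sit in the top 24 bits (the pico-examples C code does this with `pixel_grb << 8`). The Python sketch below models that layout; the function names are invented for the example.

```python
def pack_grb_left_justified(r, g, b):
    """FIFO word for one LED: 0xGGRRBB00, ready for a left-shifting OSR."""
    return ((g << 24) | (r << 16) | (b << 8)) & 0xFFFFFFFF


def bits_shifted_out(word, count=24):
    """The first `count` bits a left-shifting OSR produces, in wire order."""
    return [(word >> (31 - i)) & 1 for i in range(count)]


word = pack_grb_left_justified(255, 0, 0)   # pure red
print(hex(word))                            # 0xff0000
print(bits_shifted_out(word)[8:16])         # [1, 1, 1, 1, 1, 1, 1, 1]
```

If the value were left right-justified as 0x00GGRRBB, the first eight bits on the wire would be the zero padding, and every LED would show the wrong color.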
Hint Layer 6: Animating a Rainbow
@ Simple rainbow: cycle through hues
rainbow_pattern:
push {r4-r7, lr}
movs r4, #60 @ Number of LEDs
movs r5, #0 @ Hue offset (0-359)
ldr r6, =360 @ 360 does not fit an 8-bit immediate
led_loop:
@ Calculate hue for this LED
mov r0, r5
adds r5, #6 @ Next LED's hue (360/60 = 6 degrees apart)
cmp r5, r6
blt 1f
subs r5, r5, r6 @ Wrap around
1:
@ Convert hue to RGB (simplified: only red/green transition)
cmp r0, #120
blt red_green
@ Full implementation would handle all 6 hue sectors...
red_green:
mov r1, r0 @ Green increases
movs r0, #255
subs r0, r0, r1 @ Red decreases
movs r2, #0 @ Blue = 0
bl send_led_color
subs r4, #1
bne led_loop
@ Wait for latch: 6250 iterations at ~3 cycles each is ~150μs at 125MHz, above the 50μs minimum
@ (the PIO may still be draining the FIFO when this loop starts)
ldr r0, =6250
2: subs r0, #1
bne 2b
pop {r4-r7, pc}
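The assembly above only sketches one hue sector. As a reference for the full conversion, here is an integer-only Python version covering all six 60-degree sectors at full saturation and value. It is one possible formulation, not the only one, and `hue_to_rgb` is a name chosen for this sketch; the same table-driven structure maps naturally onto a jump table in assembly.

```python
def hue_to_rgb(hue):
    """Convert a hue in degrees to (r, g, b) using only integer math."""
    hue %= 360
    sector, pos = divmod(hue, 60)   # which sixth of the wheel, and how far in
    rise = pos * 255 // 60          # channel ramping up within this sector
    fall = 255 - rise               # channel ramping down within this sector
    return [
        (255, rise, 0),   # red -> yellow
        (fall, 255, 0),   # yellow -> green
        (0, 255, rise),   # green -> cyan
        (0, fall, 255),   # cyan -> blue
        (rise, 0, 255),   # blue -> magenta
        (255, 0, fall),   # magenta -> red
    ][sector]


print(hue_to_rgb(0), hue_to_rgb(120), hue_to_rgb(240))
# (255, 0, 0) (0, 255, 0) (0, 0, 255)
```

The only division is by the constant 60, which on a Cortex-M0+ (no hardware divide instruction in the core) can be replaced by a small lookup table or a multiply-and-shift.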
Books That Will Help
| Topic | Book | Chapter/Section |
|---|---|---|
| PIO architecture | RP2040 Datasheet | Ch. 3 - PIO |
| PIO assembly language | RP2040 Assembly Language Programming (Smith) | Ch. 13-15 |
| WS2812B protocol | RP2040 Assembly Language Programming (Smith) | Ch. 13 |
| State machine concepts | RP2040 Datasheet | Section 3.2 - Programmer’s Model |
| GPIO configuration | RP2040 Datasheet | Ch. 2.19 - GPIO Functions |
| Clock dividers | RP2040 Datasheet | Section 3.5.2 - Clock Divider |
| FIFO usage | RP2040 Datasheet | Section 3.5.3 - FIFOs |
| Autopull/Autopush | RP2040 Datasheet | Section 3.5.4 - Autopush and Autopull |
| Side-set pins | RP2040 Datasheet | Section 3.5.1 - Pin Mapping |
| Timing-critical code | Making Embedded Systems (White) | Ch. 6 - Doing More with Less |
| Color space conversion | Computer Graphics from Scratch (Gambetta) | Ch. 14 - Color |
Project 5: Context-Switching Mini-Kernel (Raspberry Pi Pico)
- File: ARM_ASSEMBLY_LEARNING_PROJECTS.md
- Programming Language: ARM Assembly
- Coolness Level: Level 5: Pure Magic (Super Cool)
- Business Potential: 5. The “Industry Disruptor”
- Difficulty: Level 5: Master
- Knowledge Area: Operating Systems / Embedded
- Software or Tool: Raspberry Pi Pico
- Main Book: “Operating Systems: Three Easy Pieces” by Arpaci-Dusseau
What you’ll build: A cooperative multi-tasking kernel in assembly that can switch between multiple “tasks,” each with its own stack, giving you a foundation for understanding how RTOSes work.
Why it teaches ARM assembly: Context switching is the heart of any OS, and implementing it yourself forces you to understand exactly what “context” means: registers, stack pointer, program counter, and processor state.
Core challenges you’ll face:
- Understanding which registers must be saved/restored (caller-saved vs callee-saved)
- Implementing stack switching between tasks
- Using PendSV exception for safe context switches
- Setting up SysTick for preemptive scheduling (advanced)
- Managing task control blocks (TCBs) in assembly
Key Concepts:
- Context Switching Mechanics: “Operating Systems: Three Easy Pieces” Ch. 6-7 - Arpaci-Dusseau
- Cortex-M Exception Model: “The Art of ARM Assembly, Volume 1” Ch. 14 - Randall Hyde
- AAPCS Register Usage: “Modern Arm Assembly Language Programming” Ch. 2 - Daniel Kusswurm
- PendSV for Context Switch: ARM Cortex-M Baremetal Assembly Programming - Engineers in Pyjama
Difficulty: Advanced Time estimate: 2-4 weeks Prerequisites: Projects 1-2 completed, understanding of stacks and function calls
Real world outcome:
- Two or more independent “tasks” running concurrently, each blinking an LED at different rates
- A serial console showing task switches happening
- Foundation code you could extend into a real RTOS
Learning milestones:
- Manual task switch works → You understand context save/restore
- Cooperative scheduling works → You understand how yield() triggers PendSV
- Preemptive scheduling works → You’ve mastered SysTick and nested exceptions
Real World Outcome
When your mini-kernel boots, you’ll see this in your serial terminal:
[BOOT] Mini-Kernel v1.0 starting...
[INIT] Task 1: LED Blinker (500ms) - Stack @ 0x20003000
[INIT] Task 2: UART Logger (1000ms) - Stack @ 0x20003800
[INIT] Task 3: Counter (250ms) - Stack @ 0x20004000
[SCHED] Starting cooperative scheduler...
[T1] Blink! LED ON
[T3] Counter: 0
[T3] Counter: 1
[T2] Logger tick: 1
[T1] Blink! LED OFF
[T3] Counter: 2
[T3] Counter: 3
[T2] Logger tick: 2
[T1] Blink! LED ON
...
Physical observations:
- The onboard LED blinks smoothly at 500ms intervals (Task 1)
- Serial output appears regularly from Tasks 2 and 3 at their scheduled intervals
- Pressing a button triggers an interrupt that creates a new task dynamically
- All tasks appear to run “simultaneously” even on a single core
What you can observe in code/debugger:
- Each task has its own Task Control Block (TCB) with saved register state
- Stack memory is cleanly separated (inspect memory at 0x20003000, 0x20003800, etc.)
- The PendSV handler executes every context switch, visible with a debugger breakpoint
- SysTick interrupt fires every 10ms when you enable preemptive mode
- Set a breakpoint in PendSV_Handler and watch the stack pointer change between 0x20003000 → 0x20003800 → 0x20004000 in round-robin fashion
Advanced milestone: Enable SysTick preemption and watch a long-running task (busy loop) get interrupted mid-execution while other tasks continue responsive—this is the “magic moment” where you see true preemptive multitasking emerge from your assembly code.
The Core Question You’re Answering
“How does a processor create the illusion of running multiple programs simultaneously when it can only execute one instruction at a time?”
This question is fundamental to all operating systems. Your mini-kernel answers it by demonstrating that “running a task” is just the state of registers and memory—save that state, load different state, and you’re now running a different task. The “illusion” is just rapidly switching between saved states.
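The "save state, load different state" idea can be modeled in a few lines of Python before touching any registers. This is a deliberately tiny sketch: the `Task` class, the three-register context, and the addresses are all made up for illustration, and a real switch saves the full register set.

```python
class Task:
    def __init__(self, name, entry, stack_top):
        self.name = name
        # Saved context: what the CPU registers should hold when this task runs.
        self.context = {"pc": entry, "sp": stack_top, "r4": 0}


def context_switch(cpu, outgoing, incoming):
    """Save the live registers into `outgoing`, then load `incoming`'s copy."""
    outgoing.context = dict(cpu)
    cpu.clear()
    cpu.update(incoming.context)
    return incoming


blink = Task("blink", entry=0x10000200, stack_top=0x20003000)
logger = Task("logger", entry=0x10000300, stack_top=0x20003800)

cpu = dict(blink.context)                 # "boot" into the first task
cpu["r4"] = 42                            # blink computes something in r4
current = context_switch(cpu, blink, logger)
print(current.name, hex(cpu["sp"]))       # logger 0x20003800
current = context_switch(cpu, logger, blink)
print(current.name, cpu["r4"])            # blink 42
```

The value 42 survives the round trip only because it was copied out before the other task's state was loaded, which is the whole job of a context switch.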
Concepts You Must Understand First
- What is “context” and why must it be saved?
- Which registers hold the current task’s execution state?
- What happens if you switch tasks without saving R0-R3?
- Why does the stack pointer (SP) need special handling?
- Book: “Operating Systems: Three Easy Pieces” Ch. 6 (Limited Direct Execution)
- How does the Cortex-M exception model enable safe context switches?
- What is the difference between Thread mode and Handler mode?
- Why use PendSV instead of calling a regular function?
- How does the processor automatically save/restore some registers?
- Book: “The Art of ARM Assembly, Volume 1” Ch. 14 (Exceptions and Interrupts)
- What is the ARM Procedure Call Standard (AAPCS)?
- Which registers are caller-saved (R0-R3, R12)?
- Which registers are callee-saved (R4-R11)?
- Why does this matter for context switching?
- Book: “Modern Arm Assembly Language Programming” Ch. 2 (Armv8-32 Architecture)
- How do you manipulate the stack pointer safely?
- What is the difference between MSP and PSP?
- How do you switch from one task’s stack to another?
- Why must stacks be 8-byte aligned?
- Book: “Making Embedded Systems, 2nd Edition” Ch. 3 (Getting Your Hands Dirty)
- What is the difference between cooperative and preemptive scheduling?
- When does a task give up control in cooperative mode?
- How does SysTick force a context switch in preemptive mode?
- What are the tradeoffs of each approach?
- Book: “Operating Systems: Three Easy Pieces” Ch. 7 (CPU Scheduling)
- How do you represent a task in memory?
- What data structure holds task state (TCB)?
- How do you initialize a task’s stack to make it “runnable”?
- What does the stack look like after a context switch?
- Book: RP2040 Datasheet Section 2.3.1 (Cortex-M0+ Processor)
Questions to Guide Your Design
- Task Control Block (TCB) structure: What minimum information do you need to represent a task? (Hint: at least stack pointer, task function pointer, state)
- Initial stack setup: How do you “fake” the stack so a brand new task looks like it was previously interrupted? (Hint: manually push register values including PC and xPSR)
- Context save order: In what order should you push R4-R11 to the stack during PendSV? (Hint: consistency matters more than the specific order)
- Task switching logic: How do you choose the next task to run? (Hint: start with simple round-robin)
- PendSV trigger: How do you signal that a context switch is needed? (Hint: set the PendSV pending bit in ICSR register)
- Critical sections: Which parts of your code must disable interrupts to prevent corruption? (Hint: modifying the task list)
- Stack size calculation: How much stack does each task need? (Hint: 256-512 bytes is typical for simple tasks, but calculate based on deepest call chain + local variables)
- SysTick period: What timer period gives good responsiveness without excessive overhead? (Hint: 1-10ms is typical; 1ms = 1000 context switches/second)
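The SysTick trade-off in the last question can be worked through numerically. The sketch below is a host-side C model, not target code: `systick_reload` and `switch_overhead_ppm` are hypothetical helper names, and the 100-cycle switch cost used in the usage note is an assumed figure you would replace with your own measurement.

```c
#include <assert.h>
#include <stdint.h>

/* SysTick counts down from RELOAD to 0, so one period lasts RELOAD + 1 cycles. */
#define SYSTICK_MAX_RELOAD 0x00FFFFFFu /* the reload register is 24 bits wide */

/* Returns the reload value for a tick every `period_us` microseconds,
 * or 0 if the period is too short or does not fit in the 24-bit counter. */
uint32_t systick_reload(uint32_t clock_hz, uint32_t period_us)
{
    uint64_t cycles = (uint64_t)clock_hz * period_us / 1000000u;
    if (cycles < 2 || cycles - 1 > SYSTICK_MAX_RELOAD) {
        return 0;
    }
    return (uint32_t)(cycles - 1);
}

/* Share of CPU time spent switching, in parts per million, if every tick
 * costs `switch_cycles` cycles of handler work (an assumed, measured value). */
uint32_t switch_overhead_ppm(uint32_t period_cycles, uint32_t switch_cycles)
{
    assert(period_cycles > 0);
    return (uint32_t)((uint64_t)switch_cycles * 1000000u / period_cycles);
}
```

For a 125MHz clock, `systick_reload(125000000, 10000)` gives 1249999 for a 10ms tick. With a 1ms tick (125,000 cycles) and an assumed 100-cycle switch, `switch_overhead_ppm(125000, 100)` gives 800 ppm, i.e. 0.08% of CPU time.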
Thinking Exercise
Before writing code, draw this on paper:
- Draw three task stacks side-by-side:
- Show Task A’s stack with R0-R3, R12, LR, PC, xPSR at top (auto-saved by hardware)
- Show where R4-R11 are manually saved by your PendSV handler
- Show the SP pointing to the bottom of saved context
- Repeat for Tasks B and C
- Draw the context switch sequence:
- Label “Task A running” → “PendSV triggered” → “Save R4-R11 to Task A stack” → “Switch SP to Task B stack” → “Restore R4-R11 from Task B stack” → “Return from exception” → “Hardware restores R0-R3, R12, LR, PC, xPSR” → “Task B running”
- Draw the TCB linked list:
- Each TCB contains: task_id, stack_pointer, state (READY/RUNNING/BLOCKED), next_tcb
- Show how current_task pointer moves during round-robin scheduling
- Calculate stack memory layout:
- If RAM starts at 0x20000000 and you have 4KB available for task stacks
- Task 1 stack: 0x20000400 (top) - 0x20000000 (bottom) = 1KB
- Task 2 stack: 0x20000800 (top) - 0x20000400 (bottom) = 1KB
- Task 3 stack: 0x20000C00 (top) - 0x20000800 (bottom) = 1KB
- (Stacks grow downward!)
Key insight from this exercise: Context switching is just bookkeeping—save numbers to memory, load different numbers from memory. The “magic” is that PC (program counter) is just another register.
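The TCB list and stack layout from this exercise can be checked with a small host-side model. This is a sketch under assumptions: the struct and function names are hypothetical, and on a 64-bit host the `next` pointer makes the struct larger than it would be on the Pico.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum task_state { TASK_READY = 0, TASK_RUNNING = 1, TASK_BLOCKED = 2 };

struct tcb {
    uint32_t stack_pointer; /* saved SP while the task is not running */
    uint32_t state;
    struct tcb *next;       /* circular list */
    uint32_t task_id;
};

/* Top-of-stack address for task `index` (0-based) when equal-sized stacks
 * are packed upward from `base`. Each stack grows downward from its top. */
uint32_t stack_top(uint32_t base, uint32_t stack_size, uint32_t index)
{
    return base + (index + 1u) * stack_size;
}

/* Links n TCBs into a ring and assigns each one its stack top. */
void tcb_ring_init(struct tcb *tasks, size_t n, uint32_t base, uint32_t stack_size)
{
    assert(n > 0);
    for (size_t i = 0; i < n; i++) {
        tasks[i].stack_pointer = stack_top(base, stack_size, (uint32_t)i);
        tasks[i].state = TASK_READY;
        tasks[i].next = &tasks[(i + 1) % n];
        tasks[i].task_id = (uint32_t)(i + 1);
    }
}

/* Round-robin: advance to the next task that is not BLOCKED.
 * Assumes at least one task is runnable; returns `current` if it is the only one. */
struct tcb *schedule_next(struct tcb *current)
{
    struct tcb *candidate = current->next;
    while (candidate != current && candidate->state == TASK_BLOCKED) {
        candidate = candidate->next;
    }
    if (current->state == TASK_RUNNING) {
        current->state = TASK_READY;
    }
    candidate->state = TASK_RUNNING;
    return candidate;
}
```

With `base = 0x20000000` and 1KB stacks, the three stack tops come out as 0x20000400, 0x20000800, and 0x20000C00, matching the layout above.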
The Interview Questions They’ll Ask
- “Walk me through what happens during a context switch on ARM Cortex-M. Which registers are saved by hardware and which must you save manually?”
- Expected answer: Hardware auto-saves R0-R3, R12, LR, PC, xPSR when entering exception. PendSV handler must manually save/restore R4-R11 (callee-saved). Explain why: AAPCS defines which registers functions must preserve.
- “Why do we use PendSV for context switching instead of SysTick or another interrupt?”
- Expected answer: PendSV is configured at the lowest priority, so it runs only after all other interrupt handlers have completed. This ensures the context switch happens at a “safe” time and never in the middle of another handler. SysTick triggers PendSV by setting its pending bit, but the actual switch is deferred.
- “How would you debug a stack overflow in one of your tasks?”
- Expected answer: Fill stack memory with a known pattern (0xDEADBEEF) at initialization. Periodically check bottom of stack for pattern corruption. Or use MPU (Memory Protection Unit) to trigger fault on stack overflow. Discuss stack canaries.
- “What’s the difference between MSP and PSP, and why does it matter?”
- Expected answer: MSP = Main Stack Pointer (used in Handler mode/interrupts), PSP = Process Stack Pointer (used for tasks in Thread mode). Separating them prevents task stack overflow from corrupting interrupt stack. CONTROL register selects which SP is active.
- “Your scheduler is supposed to be preemptive, but tasks are blocking each other. What could cause this?”
- Expected answer: Common issues: (1) SysTick not enabled/configured, (2) Task holding a lock with interrupts disabled, (3) Priority inversion (low-priority task blocks high-priority), (4) PendSV priority set too high, (5) Infinite loop in interrupt handler.
- “How would you implement task priorities in your scheduler?”
- Expected answer: Modify scheduler to maintain multiple ready queues (one per priority level). Always run highest-priority ready task. Discuss priority inversion problem and solutions (priority inheritance, priority ceiling). May need aging to prevent starvation.
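The stack-painting technique from the overflow question is easy to prototype on a host before porting it to assembly. The sketch below assumes the stack is an array of words with index 0 at the lowest address; the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define STACK_FILL 0xDEADBEEFu

/* Fill an unused stack with the pattern. `stack` points at the lowest address. */
void stack_paint(uint32_t *stack, size_t words)
{
    for (size_t i = 0; i < words; i++) {
        stack[i] = STACK_FILL;
    }
}

/* Stacks grow downward, so untouched words sit at the low end.
 * Returns how many words still hold the pattern (the remaining headroom). */
size_t stack_unused_words(const uint32_t *stack, size_t words)
{
    size_t n = 0;
    while (n < words && stack[n] == STACK_FILL) {
        n++;
    }
    return n;
}

/* The task has very likely overflowed once the lowest word is overwritten. */
int stack_overflowed(const uint32_t *stack, size_t words)
{
    return words == 0 || stack[0] != STACK_FILL;
}
```

A scheduler could call `stack_unused_words` on each task's stack from an idle task and print the minimum seen. This is a heuristic: a task that legitimately stores the fill value on its stack makes the headroom look larger than it is.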
Hints in Layers
Level 1 - Starting Point:
; Task Control Block structure (16 bytes)
; Offset 0: Stack pointer (4 bytes)
; Offset 4: Task state (4 bytes) - 0=READY, 1=RUNNING, 2=BLOCKED
; Offset 8: Next TCB pointer (4 bytes)
; Offset 12: Task ID (4 bytes)
Level 2 - TCB Management:
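.data
.align 2 ; 4-byte alignment (on ARM, GNU as treats the operand as a power of two)
task1_tcb:
.word 0x20003000 ; Initial stack pointer (top of stack)
.word 0 ; State: READY
.word task2_tcb ; Next task
.word 1 ; Task ID
task2_tcb:
.word 0x20003800
.word 0
.word task3_tcb
.word 2
task3_tcb:
.word 0x20004000
.word 0
.word task1_tcb ; Last task links back to the first (circular list)
.word 3
current_task:
.word task1_tcb ; Pointer to current running task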
Level 3 - Initializing a Task Stack:
; Initialize task stack to look like it was interrupted
; Stack must contain (from top to bottom):
; xPSR, PC, LR, R12, R3, R2, R1, R0, R11, R10, R9, R8, R7, R6, R5, R4
; Uses only Thumb-1 instructions (the Cortex-M0+ has no pre-indexed stores)
init_task_stack:
; R0 = task function pointer
; R1 = stack top address (must be 8-byte aligned)
; Returns R0 = initial stack pointer to store in the TCB
subs r1, #64 ; Reserve 16 words: 8 hardware-saved + 8 software-saved
; Zero R4-R11, R0-R3, R12 and LR (byte offsets 0-52)
movs r2, #0 ; Fill value
movs r3, #0 ; Byte offset into the frame
.Lzero_frame:
str r2, [r1, r3]
adds r3, #4
cmp r3, #56
blo .Lzero_frame
; The stacked LR is the task's return address, not an EXC_RETURN value.
; It is left as 0 here, so task functions must never return.
; PC = task entry point with bit 0 cleared (Thumb state comes from xPSR)
lsrs r0, r0, #1
lsls r0, r0, #1
str r0, [r1, #56] ; Stacked PC
ldr r2, =0x01000000 ; xPSR with Thumb bit set
str r2, [r1, #60] ; Stacked xPSR
; Return final stack pointer in R0
mov r0, r1
bx lr
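The same 16-word frame can be built and inspected on a host, which makes the layout easy to verify before debugging it on hardware. This C sketch is a model under assumptions: the index names are hypothetical, the entry address is passed as a plain 32-bit value, and the stacked LR is left at 0 on the assumption that tasks never return.

```c
#include <assert.h>
#include <stdint.h>

#define FRAME_WORDS 16u         /* 8 software-saved + 8 hardware-saved */
#define XPSR_THUMB  0x01000000u /* T bit must be set on Cortex-M */

/* Word indices counted from the returned stack pointer (lowest address first). */
enum {
    IDX_R4 = 0, IDX_R8 = 4,      /* saved by the PendSV handler */
    IDX_R0 = 8, IDX_R12 = 12,    /* saved by hardware on exception entry */
    IDX_LR = 13, IDX_PC = 14, IDX_XPSR = 15
};

/* Builds the fake "interrupted" frame below `stack_top` and returns the
 * stack pointer that the TCB should store. */
uint32_t *init_task_frame(uint32_t *stack_top, uint32_t entry)
{
    uint32_t *sp = stack_top - FRAME_WORDS;
    for (unsigned i = 0; i < FRAME_WORDS; i++) {
        sp[i] = 0; /* dummy register contents, LR = 0 */
    }
    sp[IDX_PC] = entry & ~1u;   /* bit 0 cleared; Thumb state comes from xPSR */
    sp[IDX_XPSR] = XPSR_THUMB;
    return sp;
}
```

The first exception return into the task pops the eight words starting at `IDX_R0`, so execution begins at `entry` with all general registers zeroed.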
Level 4 - PendSV Handler (The Heart of Context Switching):
.thumb_func
PendSV_Handler:
; Disable interrupts to make this atomic
cpsid i
; Get current task TCB
ldr r0, =current_task
ldr r1, [r0] ; R1 = current TCB pointer
; Save R4-R11 to current task's stack (R0-R3, R12, LR, PC, xPSR already saved by hardware)
mrs r2, psp ; Get current Process Stack Pointer
subs r2, #32 ; Make room for 8 registers (R4-R11)
stmia r2!, {r4-r7} ; Save R4-R7
mov r4, r8
mov r5, r9
mov r6, r10
mov r7, r11
stmia r2!, {r4-r7} ; Save R8-R11
subs r2, #32 ; Restore R2 to bottom of saved context
; Save stack pointer to current TCB
str r2, [r1, #0] ; Store stack pointer at offset 0 of TCB
; Get next task (round-robin)
ldr r1, [r1, #8] ; R1 = next_tcb (offset 8)
str r1, [r0] ; Update current_task pointer
; Restore stack pointer from new task's TCB
ldr r2, [r1, #0] ; Load new task's stack pointer
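; Restore R4-R11 from new task's stack (R8-R11 first, staged through R4-R7)
adds r2, #16 ; Skip saved R4-R7; R2 now points at saved R8-R11
ldmia r2!, {r4-r7} ; Load saved R8-R11 into R4-R7 temporarily
mov r8, r4
mov r9, r5
mov r10, r6
mov r11, r7
; R2 now points at the hardware-saved frame, which is the new PSP
msr psp, r2
subs r2, #32 ; Back to the bottom of the saved context
ldmia r2!, {r4-r7} ; Restore the real R4-R7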
; Re-enable interrupts
cpsie i
; Return from exception (hardware will restore R0-R3, R12, LR, PC, xPSR)
ldr r0, =0xFFFFFFFD ; EXC_RETURN: return to Thread mode, use PSP
bx r0
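The software half of the handler can be modeled on a host to check the pointer arithmetic. This is a sketch with a hypothetical `fake_cpu` struct standing in for the real registers; the hardware-stacked frame (R0-R3, R12, LR, PC, xPSR) is assumed to sit at `psp` already, as it would on exception entry.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

struct fake_cpu {
    uint32_t r4_r11[8]; /* callee-saved registers R4..R11 */
    uint32_t *psp;      /* process stack pointer */
};

struct task {
    uint32_t *saved_sp; /* offset 0 of the TCB in the assembly version */
};

/* Save R4-R11 below the outgoing task's PSP, then reload them from the
 * incoming task's stack and point PSP at its hardware frame. */
void context_switch(struct fake_cpu *cpu, struct task *from, struct task *to)
{
    /* Save: make room for 8 words, R4 at the lowest address. */
    uint32_t *sp = cpu->psp - 8;
    memcpy(sp, cpu->r4_r11, sizeof cpu->r4_r11);
    from->saved_sp = sp;

    /* Restore: the hardware frame starts 8 words above the saved SP. */
    sp = to->saved_sp;
    memcpy(cpu->r4_r11, sp, sizeof cpu->r4_r11);
    cpu->psp = sp + 8;
}
```

Switching from A to B and back again should leave A's registers and PSP exactly as they were, which is the property worth asserting when you single-step the real handler.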
Level 5 - Triggering a Context Switch:
; Yield function - voluntarily give up CPU
.global yield
.thumb_func
yield:
; Set PendSV pending bit in ICSR register
ldr r0, =0xE000ED04 ; ICSR address
ldr r1, =0x10000000 ; PendSV set-pending bit (bit 28)
str r1, [r0]
; PendSV is taken as soon as no higher-priority exception is active, typically within a few cycles of the store above
bx lr
; SysTick handler for preemptive scheduling
.thumb_func
SysTick_Handler:
; Just trigger PendSV - let it do the actual switch
ldr r0, =0xE000ED04 ; ICSR
ldr r1, =0x10000000 ; PendSV set-pending bit
str r1, [r0]
bx lr
Level 6 - Complete Initialization:
.global kernel_start
.thumb_func
kernel_start:
; Initialize SysTick for 10ms tick (assuming 125MHz clock)
ldr r0, =0xE000E010 ; SysTick base
ldr r1, =1249999 ; 125MHz / 100 = 1,250,000 cycles per tick; reload value is cycles - 1
str r1, [r0, #4] ; SYST_RVR (reload value)
movs r1, #0
str r1, [r0, #8] ; SYST_CVR (clear current value)
movs r1, #7 ; Enable | TickInt | ClkSource
str r1, [r0, #0] ; SYST_CSR
; Set PendSV to lowest priority
ldr r0, =0xE000ED20 ; SHPR3 (System Handler Priority 3)
ldr r1, =0x00FF0000 ; PendSV priority 0xFF (lowest)
str r1, [r0]
; Initialize task stacks
ldr r0, =task1_func
ldr r1, =0x20003000
bl init_task_stack
ldr r1, =task1_tcb
str r0, [r1, #0] ; Save initialized SP to TCB
; ... repeat for other tasks ...
; Set PSP to first task's stack
ldr r0, =current_task
ldr r0, [r0]
ldr r1, [r0, #0]
msr psp, r1
; Switch to use PSP in Thread mode
movs r0, #2
msr control, r0
isb ; Instruction Synchronization Barrier
; Jump to first task (will never return)
ldr r0, =task1_func
bx r0
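The SHPR3 value written above packs two priority fields into one word, and the Cortex-M0+ keeps only the top two bits of each field. The host-side C sketch below makes both facts explicit; the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* The Cortex-M0+ implements only bits [7:6] of each 8-bit priority field,
 * so any written value reads back as 0x00, 0x40, 0x80 or 0xC0. */
uint8_t m0plus_effective_priority(uint8_t written)
{
    return (uint8_t)(written & 0xC0u);
}

/* SHPR3 layout: SysTick priority in bits [31:24], PendSV in bits [23:16]. */
uint32_t shpr3_value(uint8_t systick_prio, uint8_t pendsv_prio)
{
    return ((uint32_t)systick_prio << 24) | ((uint32_t)pendsv_prio << 16);
}
```

`shpr3_value(0x00, 0xFF)` reproduces the 0x00FF0000 constant used in `kernel_start`: PendSV at the lowest level, SysTick at the highest. Since 0xFF is stored as 0xC0, there are only four distinct levels to share among all handlers.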
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Context switching concepts | “Operating Systems: Three Easy Pieces” - Arpaci-Dusseau | Ch. 6 (Limited Direct Execution) |
| CPU scheduling algorithms | “Operating Systems: Three Easy Pieces” - Arpaci-Dusseau | Ch. 7-9 (Scheduling) |
| Cortex-M exception model | “The Art of ARM Assembly, Volume 1” - Randall Hyde | Ch. 14 (Exceptions and Interrupts) |
| ARM calling convention (AAPCS) | “Modern Arm Assembly Language Programming” - Daniel Kusswurm | Ch. 2 (Armv8-32 Architecture) |
| Stack frame structure | “Introduction to Computer Organization: ARM Edition” - Robert G. Plantz | Ch. 11 (Functions and Stack) |
| Cortex-M0+ specifics | RP2040 Datasheet - Raspberry Pi Foundation | Section 2.3 (Processor) |
| RTOS fundamentals | “Making Embedded Systems, 2nd Edition” - Elecia White | Ch. 8 (Embedded Operating Systems) |
| PendSV technique | ARM Cortex-M Programming Guide - ARM Ltd | Section 4.5 (Exception Usage Hints) |
| Stack overflow detection | “Better Embedded System Software” - Philip Koopman | Ch. 8 (Stack Management) |
| Real RTOS comparison | “The Definitive Guide to ARM Cortex-M3/M4” - Joseph Yiu | Ch. 10 (OS Support Features) |
Project Comparison Table
| Project | Difficulty | Time | Platform | Depth of Understanding | Fun Factor |
|---|---|---|---|---|---|
| Bare-Metal LED Blinker | Beginner | Weekend | Pico | ⭐⭐⭐ Boot process, registers | ⭐⭐⭐ Instant gratification |
| UART Driver | Intermediate | 1-2 weeks | Pico | ⭐⭐⭐⭐ Peripherals, interrupts | ⭐⭐⭐⭐ Actual communication |
| Framebuffer Graphics | Intermediate | 1-2 weeks | RPi 4/5 | ⭐⭐⭐⭐ AArch64, mailbox | ⭐⭐⭐⭐⭐ Visual results |
| PIO Protocol | Int-Advanced | 1-2 weeks | Pico | ⭐⭐⭐⭐ Two architectures | ⭐⭐⭐⭐⭐ Pretty LEDs |
| Mini-Kernel | Advanced | 2-4 weeks | Pico | ⭐⭐⭐⭐⭐ OS internals | ⭐⭐⭐⭐ Deep satisfaction |
Recommendation
Based on your targets (Cortex-M0/M1, Raspberry Pi, Pico), here is the recommended progression:
- Start with Project 1 (LED Blinker on Pico) - This is your foundation. Don’t skip it. The Pico is ideal because:
- Very cheap (~$4)
- Excellent documentation (636-page datasheet)
- Simple boot process (no Linux to bypass)
- Stephen Smith’s book “RP2040 Assembly Language Programming” covers exactly this
- Then Project 2 (UART) - Being able to print debug output changes everything
- Branch based on interest:
- Want visual results? → Project 3 (Framebuffer) or Project 4 (PIO/NeoPixels)
- Want deep systems understanding? → Project 5 (Mini-Kernel)
Final Comprehensive Project: Bare-Metal Game Console
What you’ll build: A complete handheld game console on Raspberry Pi Pico with:
- Custom bootloader in assembly
- Framebuffer-based graphics using PIO-driven display (SPI LCD like ST7789)
- Button input handling with interrupts
- Sound generation using PWM
- A playable game (Pong, Snake, or simple platformer)
- All critical routines in hand-optimized assembly
Why this is the ultimate ARM assembly project: This integrates everything: boot code, multiple peripherals, interrupt handling, timing-critical code, and optimization. You’ll write assembly because you need the performance and control, not just as an exercise.
Core challenges you’ll face:
- Writing a SPI driver in assembly for the LCD (timing-critical)
- Double-buffering the framebuffer to prevent tearing
- Implementing sprite rendering with clipping
- Handling multiple interrupt sources (buttons, timers, DMA)
- Optimizing the render loop to hit 60fps
Key Concepts:
- SPI Protocol & DMA: RP2040 Datasheet Ch. 4 (SPI) - Raspberry Pi Foundation
- Game Loop Architecture: “Making Embedded Systems, 2nd Edition” Ch. 10 - Elecia White
- Graphics Algorithms: “Computer Graphics from Scratch” Ch. 1-6 - Gabriel Gambetta
- Performance Optimization: “Write Great Code, Volume 2, 2nd Edition” Ch. 12-13 - Randall Hyde
- PWM Audio: “RP2040 Assembly Language Programming” Ch. 12 - Stephen Smith
Difficulty: Advanced
Time estimate: 1-2 months
Prerequisites: All previous projects, basic game development concepts
Real world outcome:
- A physical device you can hold that runs your game
- Boot time under 100ms (no OS!)
- Smooth 60fps gameplay
- Understanding deep enough to write drivers for any ARM peripheral
Learning milestones:
- Static image on LCD → You understand SPI and display initialization
- Moving sprite with no tearing → You understand double-buffering and timing
- Playable game with sound → You’ve integrated multiple subsystems
- Game runs at 60fps → You’ve mastered optimization and DMA
Real World Outcome
What you’ll physically experience when you power it on:
Power on → 85ms boot sequence → Splash screen with “PICO CONSOLE v1.0” → Main menu with beep sound → Press START button → Game loads instantly
The complete gaming experience:
- Display: 240x240 pixel ST7789 LCD showing vibrant 16-bit color (RGB565)
- Visuals: Smooth scrolling background, animated player sprite (8x8 pixels), enemy sprites moving independently
- Controls: D-pad + A/B buttons, tactile feedback, ~5ms input latency
- Audio: PWM-generated square wave music (440Hz melody) + sound effects (jump: rising pitch, hit: descending pitch)
- Performance: Solid 60fps gameplay, no screen tearing, no frame drops
- Frame timing: Each frame takes exactly 16.67ms (measured with logic analyzer on debug pin)
Serial debug output while playing:
[BOOT] Game Console Init - 85ms
[LCD] ST7789 240x240 @ 40MHz SPI
[DMA] Ch0: Framebuffer → SPI TX
[AUDIO] PWM @ 44.1kHz, Ch1 (GPIO 15)
[INPUT] 4 buttons on GPIO 2-5 (interrupt mode)
[MEM] FB1: 0x20002000, FB2: 0x2001E200 (115,200 bytes each)
[GAME] Snake v1.0 starting...
[FPS] 60.1 fps (avg over 100 frames)
[GAME] Score: 5, Length: 8
[GAME] Score: 10, Length: 13
[GAME] Game Over! Final Score: 15
Memory usage breakdown:
Text (code): 24,576 bytes (Flash @ 0x10000000)
Framebuffer 1: 115,200 bytes (RAM @ 0x20002000)
Framebuffer 2: 115,200 bytes (RAM @ 0x2001E200)
Sprite data: 2,048 bytes (Flash @ 0x10006000)
Stack (3 tasks): 1,536 bytes (RAM @ 0x20000000)
Total RAM used: 231,936 bytes / 270,336 bytes (about 86%)
Advanced observations:
- Set logic analyzer on SPI CLK line: perfect 40MHz square wave during display updates
- DMA controller automatically transfers framebuffer without CPU intervention
- CPU usage: ~40% for game logic, ~0% for display transfer (DMA handles it)
- Power consumption: ~120mA @ 3.3V during active gameplay
The Core Question You’re Answering
“How do you build a complete, responsive, real-time system on bare metal that coordinates graphics, input, audio, and timing—all without an operating system?”
This is the ultimate embedded systems question. You’re answering:
- How do you achieve 60fps when every pixel must be calculated and transferred?
- How do you handle multiple interrupt sources without missing events?
- How do you manage limited RAM when your two framebuffers alone take about 85% of available memory?
- How do you optimize assembly code to squeeze maximum performance from a 125MHz Cortex-M0+?
Concepts You Must Understand First
- SPI Protocol & Timing
- What is MOSI, MISO, SCK, and CS?
- How fast can you clock the SPI peripheral on RP2040?
- Why does display update require both commands and data?
- Book: RP2040 Datasheet Ch. 4.4 (SPI)
- DMA (Direct Memory Access)
- How does DMA transfer data without CPU involvement?
- What are DMA channels, transfer counts, and triggers?
- How do you chain DMA operations for continuous display refresh?
- Book: RP2040 Datasheet Ch. 2.5 (DMA)
- Double Buffering
- Why does drawing directly to the display cause tearing?
- How do you swap buffers atomically?
- What is VSync and why does it matter?
- Book: “Computer Graphics from Scratch” Ch. 14 (Rasterization)
- Sprite Rendering
- How do you calculate pixel positions from sprite coordinates?
- What is clipping and why is it necessary?
- How do you handle transparency (color key)?
- Book: “Write Great Code, Volume 2” Ch. 12 (Graphics)
- PWM for Audio
- How does PWM approximate an analog waveform?
- What sample rate do you need for reasonable audio quality?
- How do you generate square waves at specific frequencies?
- Book: “RP2040 Assembly Language Programming” Ch. 12 (PWM)
- Interrupt Priority & Handling
- How do you prioritize button input vs timer vs DMA complete?
- What happens when interrupts overlap?
- How do you prevent race conditions on shared data?
- Book: “Making Embedded Systems, 2nd Edition” Ch. 4 (Interrupts)
- Game Loop Architecture
- What is the fixed timestep pattern?
- How do you separate update rate from render rate?
- How do you measure and maintain 60fps?
- Book: “Making Embedded Systems, 2nd Edition” Ch. 10 (State Machines)
Questions to Guide Your Design
- Memory management: With only 264KB of RAM and about 225KB (230,400 bytes) needed for two framebuffers, how do you allocate the remaining 39KB for RAM-resident code, stacks, and game state?
- Display initialization: The ST7789 requires a specific command sequence. How do you send commands vs data over SPI?
- Framebuffer format: Do you use RGB565 (16-bit, 2 bytes per pixel) or RGB888 (24-bit, 3 bytes per pixel)? What are the tradeoffs?
- DMA chaining: How do you set up DMA to continuously transfer the framebuffer to SPI without CPU intervention?
- Sprite storage: Do you store sprites in flash or RAM? How do you balance access speed vs memory usage?
- Input debouncing: Button bouncing can cause multiple triggers. Do you debounce in hardware or software?
- Audio generation: At 44.1kHz sample rate, you need a new sample every 22.68μs. How do you generate samples without blocking the game loop?
- Performance profiling: How do you measure frame time and identify bottlenecks without a profiler?
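Several of these questions come down to simple arithmetic that is worth checking before committing to a design. The C sketch below is a host-side calculator with hypothetical function names; it counts 264KB as 264 × 1024 = 270,336 bytes.

```c
#include <assert.h>
#include <stdint.h>

#define RP2040_SRAM_BYTES (264u * 1024u) /* 270,336 bytes */

uint32_t framebuffer_bytes(uint32_t width, uint32_t height, uint32_t bytes_per_pixel)
{
    return width * height * bytes_per_pixel;
}

/* Bytes of SRAM left after `count` framebuffers, or -1 if they do not fit. */
int32_t ram_left_after(uint32_t fb_bytes, uint32_t count)
{
    uint64_t total = (uint64_t)fb_bytes * count;
    if (total > RP2040_SRAM_BYTES) {
        return -1;
    }
    return (int32_t)(RP2040_SRAM_BYTES - total);
}

/* Nanoseconds between audio samples, rounded down. */
uint32_t sample_period_ns(uint32_t sample_rate_hz)
{
    assert(sample_rate_hz > 0);
    return 1000000000u / sample_rate_hz;
}
```

Two 240x240 RGB565 buffers leave 39,936 bytes, while two RGB888 buffers (172,800 bytes each) do not fit at all, which settles the format question on this chip. At 44.1kHz the sample period is 22,675ns.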
Thinking Exercise
Before writing code, design the complete system on paper:
- Memory Map Diagram:
Flash (2MB):
0x10000000: Vector table (192 bytes)
0x100000C0: Boot code (3,904 bytes, up to 0x10001000)
0x10001000: Game code (20KB)
0x10006000: Sprite data (8x8 RGB565 sprites, 128 bytes each, 16 sprites = 2KB)
RAM (264KB):
0x20000000: Stack for main task (512 bytes)
0x20000200: Stack for audio task (512 bytes)
0x20000400: Stack for input task (512 bytes)
0x20001000: Game state (player pos, score, etc., 1KB)
0x20002000: Framebuffer 1 (240*240*2 = 115,200 bytes, ends at 0x2001E200)
0x2001E200: Framebuffer 2 (240*240*2 = 115,200 bytes, ends at 0x2003A400)
0x2003A400: DMA descriptors (256 bytes)
- Interrupt Priority Scheme (the Cortex-M0+ implements only four priority levels, 0-3):
Priority 0 (Highest): DMA complete (must swap buffers immediately)
Priority 1: SysTick (game loop timing, 60Hz)
Priority 2: Button GPIO (input handling) and PWM wrap (audio sample generation)
Priority 3 (Lowest): PendSV (context switching)
- Game Loop Timing:
Frame target: 16.67ms (60fps)
Timeline for one frame:
T=0ms: SysTick fires → Start frame
T=0-5ms: Update game logic (player movement, collision detection)
T=5-10ms: Render to back buffer (clear + draw sprites)
T=10ms: Trigger DMA transfer of back buffer to display
T=10-16ms: CPU idle (DMA working) or audio processing
T=16ms: DMA complete interrupt → Swap buffers
T=16.67ms: Next SysTick → Repeat
If frame takes > 16.67ms → Frame drop → Detect and skip render
- Sprite Rendering Algorithm:
```
For each sprite:
- Calculate screen position: screen_x = sprite.x - camera.x
- Check if on screen: if (screen_x + sprite.width <= 0 || screen_x >= 240) skip
- Calculate clipping: start_x = max(0, screen_x), end_x = min(240, screen_x + sprite.width)
- For each pixel in the clipped range: if (sprite.pixel != TRANSPARENT_COLOR): framebuffer[pixel_y * 240 + pixel_x] = sprite.pixel
```
- DMA Transfer Calculation:
Framebuffer size: 240 * 240 * 2 = 115,200 bytes
SPI clock: 40 MHz = 40,000,000 bits/sec
Transfer time: (115,200 * 8) / 40,000,000 = 23.04ms
Problem: 23ms > 16.67ms frame budget!
Solution: Only update dirty regions or reduce resolution to 160x160
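The transfer-time calculation generalizes to any resolution and SPI clock, which makes it easy to compare the candidate fixes. This is a small host-side C sketch with hypothetical function names; it ignores command overhead and any gaps between SPI frames, so real transfers take slightly longer.

```c
#include <assert.h>
#include <stdint.h>

#define FRAME_BUDGET_US 16666u /* 1 s / 60, rounded down */

/* Microseconds needed to clock `bytes` out over SPI at `spi_hz`, rounded down. */
uint32_t spi_transfer_us(uint32_t bytes, uint32_t spi_hz)
{
    assert(spi_hz > 0);
    return (uint32_t)((uint64_t)bytes * 8u * 1000000u / spi_hz);
}

/* Nonzero if a transfer of `bytes` fits inside one 60fps frame. */
int fits_in_frame(uint32_t bytes, uint32_t spi_hz)
{
    return spi_transfer_us(bytes, spi_hz) <= FRAME_BUDGET_US;
}
```

A full 240x240 RGB565 frame at 40MHz takes 23,040μs and misses the budget. A 160x160 frame (51,200 bytes) takes 10,240μs, and the full frame at 62.5MHz takes 14,745μs, so both alternatives fit.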
The Interview Questions They’ll Ask
- “You said you’re hitting 60fps with a 240x240 display. Walk me through the math—how is that even possible when the DMA transfer alone takes 23ms?”
- Expected answer: Either (1) reduce resolution to 160x160 (requires only 10.24ms transfer), (2) only update dirty rectangles, not full screen, (3) use a smaller pixel format (the ST7789 also accepts 12-bit RGB444, which cuts the transfer size by 25%), or (4) run SPI faster than 40MHz (the RP2040 can do 62.5MHz, which brings a full frame down to about 14.7ms). Explain the tradeoff chosen.
- “How do you prevent screen tearing with double buffering?”
- Expected answer: Maintain front buffer (currently displayed) and back buffer (being drawn). Only swap buffers during VBlank or after complete DMA transfer. Use a flag set by DMA complete interrupt to signal safe swap. Without VSync, rely on DMA completion timing.
- “Your game uses PWM for audio at 44.1kHz. That’s an interrupt every 22.68μs. Won’t that kill your frame rate?”
- Expected answer: No, if ISR is fast (<2μs). Pre-generate waveform samples into a buffer, ISR just reads next sample and writes to PWM register. Or use DMA to transfer audio samples from buffer to PWM, avoiding interrupts entirely. Discuss PCM vs square wave tradeoffs.
- “How do you handle button debouncing in assembly?”
- Expected answer: Hardware solution: RC filter (10kΩ + 0.1μF = 1ms time constant). Software: Wait 20ms after first press, re-read button, only register if still pressed. Or use timer to sample button state every 10ms and require 2-3 consecutive identical reads. Show assembly code for software approach.
- “You’re using about 85% of RAM for framebuffers. What happens if you need more memory for game state?”
- Expected answer: Options: (1) Single buffer + tearing (saves 115KB), (2) Reduce resolution (160x160 = 51KB per buffer), (3) Use 8-bit color (saves 50%), (4) Stream tiles from flash instead of pre-loading, (5) Compress sprite data. Discuss chosen approach and why.
- “Walk me through your DMA setup for transferring the framebuffer to SPI.”
- Expected answer: Configure DMA channel with: source = framebuffer address, dest = SPI TX FIFO address, transfer count = 57,600 16-bit transfers (115,200 bytes; the count is in transfers, not bytes), increment source (not dest), pace with the SPI TX DREQ, enable completion interrupt. Show register configuration in assembly.
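The software debounce described in the button question (sample periodically, require several identical reads) is easier to get right in C first and then translate to assembly. The sketch below is one possible design, not the only one: the struct, the function name, and the choice of three samples are assumptions.

```c
#include <assert.h>
#include <stdint.h>

#define DEBOUNCE_SAMPLES 3u /* consecutive identical reads required */

struct debouncer {
    uint8_t stable;    /* last accepted state (0 = released, 1 = pressed) */
    uint8_t candidate; /* state currently being confirmed */
    uint8_t count;     /* consecutive samples seen for `candidate` */
};

/* Call from a periodic timer (for example every 10ms) with the raw pin level.
 * Returns 1 exactly once per confirmed press, otherwise 0. */
int debounce_sample(struct debouncer *d, uint8_t raw)
{
    raw = raw ? 1u : 0u;
    if (raw != d->candidate) {
        /* Level changed: restart the confirmation count. */
        d->candidate = raw;
        d->count = 1;
        return 0;
    }
    if (d->count < DEBOUNCE_SAMPLES) {
        d->count++;
    }
    if (d->count == DEBOUNCE_SAMPLES && d->stable != raw) {
        d->stable = raw;
        return raw; /* report only the released-to-pressed edge */
    }
    return 0;
}
```

Feeding it the sequence 1, 0, 1, 1, 1, 1, 0, 0, 0 reports a single press: the first 1 and the 0 that follows are treated as bounce, and the release is accepted silently.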
Hints in Layers
Level 1 - ST7789 Display Initialization:
; ST7789 Commands
.equ ST7789_SWRESET, 0x01 ; Software reset
.equ ST7789_SLPOUT, 0x11 ; Sleep out
.equ ST7789_COLMOD, 0x3A ; Color mode
.equ ST7789_MADCTL, 0x36 ; Memory data access control
.equ ST7789_DISPON, 0x29 ; Display on
.equ ST7789_CASET, 0x2A ; Column address set
.equ ST7789_RASET, 0x2B ; Row address set
.equ ST7789_RAMWR, 0x2C ; RAM write
; GPIO pins
.equ LCD_DC_PIN, 8 ; Data/Command select
.equ LCD_CS_PIN, 9 ; Chip select
.equ LCD_RST_PIN, 10 ; Reset
Level 2 - SPI Communication:
; Send command to ST7789
; R0 = command byte
; SIO_BASE is kept in R4 because spi_write_byte may clobber R0-R3
lcd_cmd:
push {r4, lr}
ldr r4, =SIO_BASE
; Set DC low (command mode)
movs r2, #1
lsls r2, r2, #LCD_DC_PIN
str r2, [r4, #GPIO_OUT_CLR]
; Set CS low (select display)
movs r2, #1
lsls r2, r2, #LCD_CS_PIN
str r2, [r4, #GPIO_OUT_CLR]
; Send byte over SPI (assumed to return only after the byte has been shifted out)
bl spi_write_byte
; Set CS high (deselect)
movs r2, #1
lsls r2, r2, #LCD_CS_PIN
str r2, [r4, #GPIO_OUT_SET]
pop {r4, pc}
; Send data to ST7789
; R0 = data byte
lcd_data:
push {r4, lr}
ldr r4, =SIO_BASE
; Set DC high (data mode)
movs r2, #1
lsls r2, r2, #LCD_DC_PIN
str r2, [r4, #GPIO_OUT_SET]
; Set CS low
movs r2, #1
lsls r2, r2, #LCD_CS_PIN
str r2, [r4, #GPIO_OUT_CLR]
; Send byte over SPI
bl spi_write_byte
; Set CS high
movs r2, #1
lsls r2, r2, #LCD_CS_PIN
str r2, [r4, #GPIO_OUT_SET]
pop {r4, pc}
Level 3 - Display Initialization Sequence:
lcd_init:
push {r4-r6, lr} ; R6 is saved only to keep the stack 8-byte aligned
; Hardware reset
; R4/R5 hold the SIO base and reset-pin mask, since delay_us may clobber R0-R3
ldr r4, =SIO_BASE
movs r5, #1
lsls r5, r5, #LCD_RST_PIN
str r5, [r4, #GPIO_OUT_CLR] ; Reset low
ldr r0, =10000
bl delay_us ; 10ms delay (R0 = microseconds)
str r5, [r4, #GPIO_OUT_SET] ; Reset high
ldr r0, =120000
bl delay_us ; 120ms delay
; Software reset
ldr r0, =ST7789_SWRESET
bl lcd_cmd
ldr r0, =120000
bl delay_us
; Sleep out
ldr r0, =ST7789_SLPOUT
bl lcd_cmd
ldr r0, =120000
bl delay_us
; Color mode: 16-bit RGB565
ldr r0, =ST7789_COLMOD
bl lcd_cmd
ldr r0, =0x55 ; 16-bit color
bl lcd_data
; Memory access control: 0x00 keeps the default orientation and RGB order
ldr r0, =ST7789_MADCTL
bl lcd_cmd
ldr r0, =0x00
bl lcd_data
; Display on
ldr r0, =ST7789_DISPON
bl lcd_cmd
pop {r4-r6, pc}
Level 4 - DMA Setup for Framebuffer Transfer:
; Setup DMA to transfer framebuffer to SPI
dma_setup_framebuffer:
push {r4-r6, lr}
.equ DMA_BASE, 0x50000000
.equ DMA_CH0_READ_ADDR, 0x000
.equ DMA_CH0_WRITE_ADDR, 0x004
.equ DMA_CH0_TRANS_COUNT, 0x008
.equ DMA_CH0_CTRL_TRIG, 0x00C
; DMA Channel 0 for framebuffer → SPI
ldr r0, =DMA_BASE
; Read address = framebuffer
ldr r1, =framebuffer
str r1, [r0, #DMA_CH0_READ_ADDR]
; Write address = SPI TX FIFO
ldr r1, =SPI0_BASE
adds r1, #SPI_DR ; Data register
str r1, [r0, #DMA_CH0_WRITE_ADDR]
; Transfer count = framebuffer size in 16-bit words
ldr r1, =(240 * 240) ; 57,600 words
str r1, [r0, #DMA_CH0_TRANS_COUNT]
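; Control: 16-bit transfers, increment read, DREQ=SPI0_TX, enable IRQ
; Bits: [21]=IRQ_QUIET(0), [20:15]=TREQ_SEL(16=SPI0_TX), [14:11]=CHAIN_TO(0=this channel, no chain),
; [5]=INCR_WRITE(0), [4]=INCR_READ(1), [3:2]=DATA_SIZE(1=16bit), [0]=EN(1)
; SPI0 must be configured for 16-bit frames for halfword transfers to work
ldr r1, =0x00080015 ; Enable, 16-bit, incr read, SPI0 TX DREQ
str r1, [r0, #DMA_CH0_CTRL_TRIG]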
pop {r4-r6, pc}
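Hand-assembling a control word from bit fields is error-prone, so it helps to derive the constant from named fields and compare it with the value in the assembly. The C sketch below uses the CTRL_TRIG field positions as I understand them from the RP2040 datasheet; the macro and function names are hypothetical, so check the positions against the datasheet's register table before relying on them.

```c
#include <assert.h>
#include <stdint.h>

/* Field positions in the DMA channel CTRL_TRIG register. */
#define DMA_CTRL_EN           (1u << 0)
#define DMA_CTRL_DATA_SIZE(n) ((uint32_t)(n) << 2)   /* 0=byte, 1=halfword, 2=word */
#define DMA_CTRL_INCR_READ    (1u << 4)
#define DMA_CTRL_INCR_WRITE   (1u << 5)
#define DMA_CTRL_CHAIN_TO(ch) ((uint32_t)(ch) << 11) /* own channel number = no chaining */
#define DMA_CTRL_TREQ_SEL(n)  ((uint32_t)(n) << 15)
#define DMA_CTRL_IRQ_QUIET    (1u << 21)

#define DREQ_SPI0_TX 16u

/* Control word for "framebuffer to SPI0 TX": halfword transfers, incrementing
 * read address, fixed write address, paced by the SPI0 TX data request. */
uint32_t dma_ctrl_fb_to_spi(uint32_t channel)
{
    return DMA_CTRL_EN
         | DMA_CTRL_DATA_SIZE(1)
         | DMA_CTRL_INCR_READ
         | DMA_CTRL_CHAIN_TO(channel)
         | DMA_CTRL_TREQ_SEL(DREQ_SPI0_TX);
}
```

For channel 0 this evaluates to 0x00080015. `INCR_WRITE` and `IRQ_QUIET` are deliberately left clear: the SPI data register address must stay fixed, and the completion interrupt is what signals that the buffers can be swapped.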
Level 5 - Sprite Rendering with Transparency:
; Draw 8x8 sprite to framebuffer
; R0 = sprite data pointer
; R1 = x position
; R2 = y position
; Transparent color = 0xF81F (magenta in RGB565)
draw_sprite:
push {r4-r7, lr}
mov r4, r0 ; Sprite data
mov r5, r1 ; X pos
mov r6, r2 ; Y pos
movs r7, #0 ; Row counter
.sprite_row_loop:
cmp r7, #8
bge .sprite_done
movs r3, #0 ; Column counter
.sprite_col_loop:
cmp r3, #8
bge .sprite_next_row
; Load pixel from sprite
ldrh r0, [r4] ; Load 16-bit pixel
adds r4, #2
; Check if transparent
ldr r1, =0xF81F
cmp r0, r1
beq .sprite_skip_pixel
; Calculate framebuffer position
; fb_offset = (y + row) * 240 + (x + col)
mov r1, r6
add r1, r7 ; y + row
ldr r2, =240
muls r1, r2 ; (y + row) * 240
add r1, r5 ; + x
add r1, r3 ; + col
lsls r1, r1, #1 ; * 2 (bytes per pixel)
; Write to framebuffer
ldr r2, =framebuffer
strh r0, [r2, r1]
.sprite_skip_pixel:
adds r3, #1
b .sprite_col_loop
.sprite_next_row:
adds r7, #1
b .sprite_row_loop
.sprite_done:
pop {r4-r7, pc}
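The assembly routine above writes every non-transparent pixel without bounds checks, so a sprite near the screen edge would write outside the framebuffer. A C reference version with per-pixel clipping is a useful oracle to test the assembly against. This is a sketch with hypothetical names; it clips pixel by pixel for clarity, where an optimized version would clip the loop bounds once.

```c
#include <assert.h>
#include <stdint.h>

#define SCREEN_W    240
#define SCREEN_H    240
#define SPRITE_SIZE 8
#define TRANSPARENT 0xF81Fu /* magenta in RGB565 */

/* Pack 8-bit channels into RGB565 (5 bits red, 6 green, 5 blue). */
uint16_t rgb565(uint8_t r, uint8_t g, uint8_t b)
{
    return (uint16_t)(((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3));
}

/* Draws an 8x8 sprite at (x, y), which may lie partly or wholly off-screen.
 * Returns the number of pixels actually written. */
int draw_sprite_clipped(uint16_t *fb, const uint16_t *sprite, int x, int y)
{
    int written = 0;
    for (int row = 0; row < SPRITE_SIZE; row++) {
        int sy = y + row;
        if (sy < 0 || sy >= SCREEN_H) {
            continue; /* row is off-screen */
        }
        for (int col = 0; col < SPRITE_SIZE; col++) {
            int sx = x + col;
            if (sx < 0 || sx >= SCREEN_W) {
                continue; /* column is off-screen */
            }
            uint16_t pixel = sprite[row * SPRITE_SIZE + col];
            if (pixel == TRANSPARENT) {
                continue; /* color key: leave the background untouched */
            }
            fb[sy * SCREEN_W + sx] = pixel;
            written++;
        }
    }
    return written;
}
```

`rgb565(255, 0, 255)` yields 0xF81F, the color key used by the assembly routine. A sprite drawn at (-4, 236) has only its 4x4 corner on screen, so 16 pixels are written.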
Level 6 - Game Loop with Fixed Timestep:
; Main game loop
; Target: 60fps = 16.67ms per frame
game_loop:
push {r4-r7, lr}
ldr r4, =frame_counter
ldr r5, =game_state
.frame_loop:
; Wait for SysTick (60Hz)
bl wait_for_frame_tick
; Read input
bl read_buttons
; Update game logic
bl update_player
bl update_enemies
bl check_collisions
; Get back buffer
bl get_back_buffer
mov r6, r0 ; R6 = back buffer address
; Clear back buffer
mov r0, r6
ldr r1, =0x0000 ; Black
bl clear_framebuffer
; Render sprites
ldr r0, =player_sprite
ldr r1, [r5, #PLAYER_X]
ldr r2, [r5, #PLAYER_Y]
bl draw_sprite
; Trigger DMA transfer
mov r0, r6
bl dma_start_transfer
; Increment frame counter
ldr r0, [r4]
adds r0, #1
str r0, [r4]
; Check for quit condition
bl check_quit
cmp r0, #0
beq .frame_loop
pop {r4-r7, pc}
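; Wait for next frame tick (SysTick)
wait_for_frame_tick:
ldr r0, =frame_ready_flag
.wait_loop:
ldr r1, [r0]
cmp r1, #0
beq .wait_loop ; Spin until the SysTick handler sets the flag
movs r1, #0
str r1, [r0] ; Clear the flag only after consuming the tick, so a tick that fired early is not lost
bx lr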
; SysTick ISR - fires at 60Hz
SysTick_Handler:
push {r4, lr}
; Set frame ready flag
ldr r0, =frame_ready_flag
movs r1, #1
str r1, [r0]
; Toggle debug pin for timing measurement
ldr r0, =SIO_BASE
ldr r1, =0x01000000 ; Bit 24
str r1, [r0, #GPIO_OUT_XOR]
pop {r4, pc}
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| SPI Protocol | RP2040 Datasheet - Raspberry Pi Foundation | Ch. 4.4 (SPI) |
| DMA Architecture | RP2040 Datasheet - Raspberry Pi Foundation | Ch. 2.5 (DMA Controller) |
| ST7789 Display | ST7789 Datasheet - Sitronix | Full document |
| Graphics Algorithms | “Computer Graphics from Scratch” - Gabriel Gambetta | Ch. 1-6 (Rasterization) |
| Sprite Rendering | “Write Great Code, Volume 2” - Randall Hyde | Ch. 12 (Graphics) |
| Double Buffering | “Computer Graphics from Scratch” - Gabriel Gambetta | Ch. 14 (Shading) |
| PWM Audio | “RP2040 Assembly Language Programming” - Stephen Smith | Ch. 12 (PWM) |
| Game Loop Patterns | “Making Embedded Systems, 2nd Edition” - Elecia White | Ch. 10 (State Machines) |
| Fixed Timestep | “Game Programming Patterns” - Robert Nystrom | Ch. 12 (Game Loop) |
| Performance Optimization | “Write Great Code, Volume 2” - Randall Hyde | Ch. 13 (Optimization) |
| Interrupt Handling | “Making Embedded Systems, 2nd Edition” - Elecia White | Ch. 4 (Interrupts) |
| ARM Assembly | “The Art of ARM Assembly, Volume 1” - Randall Hyde | All chapters |
Essential Resources Summary
Books directly relevant:
- “The Art of ARM Assembly, Volume 1” - Randall Hyde (primary reference for ARM/Thumb)
- “Modern Arm Assembly Language Programming” - Daniel Kusswurm (excellent for AArch64)
- “Making Embedded Systems, 2nd Edition” - Elecia White (the “why” behind bare metal)
- “Introduction to Computer Organization: ARM Edition” - Robert G. Plantz (CPU architecture context)
Book to consider adding:
- “RP2040 Assembly Language Programming” by Stephen Smith - Specifically targets your Pico with practical projects
Online Resources:
- Bare Metal Programming Guide (GitHub) - Practical starting point
- rpi4os.com - RPi 4 bare metal OS tutorial
- dwelch67/raspberrypi (GitHub) - Extensive bare metal examples
- RP2040 Datasheet - Your bible for Pico projects