Sprint: Learn ARM Assembly & Architecture Mastery - Real World Projects

Goal: Deeply understand ARM architecture from first principles—from registers and instruction encoding to bare-metal systems programming, bootloaders, and building your own ARM emulator. You’ll master how billions of devices work, understand the design philosophy that made ARM dominant, and be able to write high-performance, low-level code for any ARM target.


Introduction

What is ARM?

ARM (originally Acorn RISC Machine, later Advanced RISC Machines) is a family of RISC instruction set architectures with 32-bit (ARMv7, AArch32) and 64-bit (AArch64) variants. It has become the most widely deployed processor architecture in the world: more than 200 billion ARM-based chips have shipped, in smartphones, tablets, embedded systems, data centers, and increasingly laptops and servers.

Unlike x86’s philosophy of “complex instructions for complex operations,” ARM embraces a Load/Store architecture: simple, uniform instructions that operate on registers, with explicit load/store operations for memory access. This design choice makes ARM:

  • Energy efficient (critical for mobile/IoT)
  • Simpler to pipeline (faster execution per clock cycle)
  • Easier to optimize (compiler-friendly)
  • Scalable (same ISA from tiny microcontrollers to high-end servers)

The Problem ARM Solves

Before ARM, most processor architectures were either:

  • CISC-based (like x86): Complex, power-hungry, hard to scale to tiny devices
  • Proprietary: MIPS, PowerPC, SPARC (fragmented ecosystems)

ARM solved the fragmentation problem by becoming a licensed architecture. Companies license ARM designs and adapt them for their specific needs—Apple makes Apple Silicon, Qualcomm makes Snapdragon, Samsung makes Exynos. This licensing model created an ecosystem where a single instruction set could power everything from $5 microcontrollers to $3,000 servers.

What You’ll Build

Across these projects, you’ll build:

  1. An ARM instruction decoder - understand how 32 bits encode complex operations
  2. ARM assembly programs - write low-level code without libraries
  3. Bare-metal systems - control hardware directly (LED blinkers, UART drivers)
  4. Device drivers - GPIO, UART, I2C, SPI, DMA
  5. A bootloader - how systems actually start up
  6. A scheduler - task switching and context management
  7. An ARM emulator - simulate the entire CPU pipeline
  8. Exception handlers - interrupt/fault management
  9. Peripheral controllers - Motor PWM, audio playback, display drivers
  10. A complete tiny OS - bring it all together

Scope & What’s Not Included

Included:

  • ARMv7 (32-bit) and ARMv8/AArch64 (64-bit) instruction sets
  • Bare-metal programming (no OS)
  • Peripheral drivers and memory-mapped I/O
  • Exception handling and interrupts
  • Assembly language fundamentals

Not included:

  • High-level OS kernel design (beyond a tiny scheduler)
  • Virtualization (hypervisors, nested VM)
  • NEON/SIMD instructions (though referenced)
  • TrustZone security extensions
  • Detailed microarchitecture (pipeline stalls, cache coherency)

How to Use This Guide

Reading Strategy

  1. Start here: Read “Introduction” (this section) + “Why ARM Matters” (for context)
  2. Learn the fundamentals: Read the “Theory Primer” chapters before starting projects
  3. Pick your path: Follow “Recommended Learning Paths” based on your goals
  4. Deep dive as needed: Use “Deep Dive Reading by Concept” to supplement with books
  5. Work through projects: Complete projects in recommended order; don’t skip ahead

Working Through Projects

Each project has:

  • Real World Outcome - See exactly what you’re building and what success looks like
  • Core Question - The conceptual question the project answers
  • Thinking Exercise - Mental prep before coding
  • Hints in Layers - Progressively detailed guidance if stuck
  • Definition of Done - Clear completion criteria

Golden rule: Read through the entire project spec before writing any code. Understand the mental model first.

Time Investment

  • Simple projects (Decoder, Calculator, LED Blinker): 4-8 hours each
  • Moderate projects (Drivers, Bootloader): 10-20 hours each
  • Complex projects (Emulator, Scheduler, Tiny OS): 20-40+ hours each
  • Total sprint: 3-6 months part-time, 4-8 weeks full-time

Environment Setup

You’ll need:

Essential:

  • ARM GNU Toolchain (arm-none-eabi-gcc) for cross-compilation
  • QEMU (ARM emulator) or physical hardware (Raspberry Pi, STM32 board)
  • A text editor or IDE (VS Code, Sublime)

Recommended:

  • Debugger: gdb-multiarch (with OpenOCD as the debug server when targeting physical hardware)
  • Utilities: objdump, readelf, nm
  • Hardware: STM32F4 Nucleo board (~$15-30)

Installation:

# macOS
brew install arm-none-eabi-gcc qemu

# Ubuntu/Debian
sudo apt install gcc-arm-none-eabi qemu-system-arm gdb-multiarch

# Verify
arm-none-eabi-gcc --version
qemu-system-arm -version

Big Picture / Mental Model

ARM is a layered system. Understanding how each layer works and how they interact is the key to mastery:

┌──────────────────────────────────────────────────────────────┐
│                    Your ARM Program                          │
│                  (Assembly or C code)                        │
└────────────────────────────┬─────────────────────────────────┘
                             │
┌────────────────────────────▼─────────────────────────────────┐
│              Instruction Set Architecture (ISA)              │
│  ┌─────────────────────────────────────────────────────┐    │
│  │  Registers  │  Instructions  │  Memory Model  │     │    │
│  │  (16 / 31)  │  (Data/Control)│  (Load/Store)  │     │    │
│  └─────────────────────────────────────────────────────┘    │
└────────────────────────────┬─────────────────────────────────┘
                             │
┌────────────────────────────▼─────────────────────────────────┐
│                  Hardware Implementation                      │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐       │
│  │   Fetch      │→ │   Decode     │→ │   Execute    │       │
│  │  (Pipeline)  │  │  (Control)   │  │   (ALU)      │       │
│  └──────────────┘  └──────────────┘  └──────────────┘       │
└────────────────────────────┬─────────────────────────────────┘
                             │
┌────────────────────────────▼─────────────────────────────────┐
│         System Abstractions (Memory, Interrupts)             │
│  ┌─────────────┐  ┌──────────────┐  ┌──────────────┐        │
│  │     MMU     │  │   Exception  │  │   Cache &    │        │
│  │             │  │   Vectors    │  │   Memory     │        │
│  └─────────────┘  └──────────────┘  └──────────────┘        │
└────────────────────────────┬─────────────────────────────────┘
                             │
┌────────────────────────────▼─────────────────────────────────┐
│         Peripherals (GPIO, UART, Timers, DMA)               │
│  ┌───────────┐  ┌───────────┐  ┌───────────┐  ┌──────────┐ │
│  │ GPIO/LEDs │  │  Serial   │  │  Timers   │  │   DMA    │ │
│  └───────────┘  └───────────┘  └───────────┘  └──────────┘ │
└───────────────────────────────────────────────────────────────┘

Key insight: You start at the TOP (your code) and work DOWN to understand the hardware. By Project 8 (emulator), you’ll rebuild this entire stack.


Theory Primer: Core Concepts Deep Dive

1. ARM Registers & The Register File

Fundamentals

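ARMv7 exposes 16 registers (R0-R15), of which R13 (SP), R14 (LR), and R15 (PC) have dedicated roles. AArch64 has 31 general-purpose registers (X0-X30) plus a separate SP and a PC that is not a general-purpose register. Unlike x86, which historically had a small register set with specialized roles, ARM treats its general-purpose registers uniformly. This is a core RISC principle: keep working data in registers to minimize memory access.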

Deep Dive

ARMv7 (32-bit) Register Set:

Registers R0-R15 (each 32 bits):

R0-R3      | Argument/Return Register (ABI Convention)
           | When you call a function, first 4 arguments go here
           | Return value comes back in R0 (R0-R1 for 64-bit values)

R4-R11     | Callee-Saved Registers
           | If a function uses these, it MUST save/restore them
           | Safe to rely on these across function calls

R12 (IP)   | Intra-Procedure Call Register
           | Temporary; used by linker for code generation
           | Function can use it without saving

R13 (SP)   | Stack Pointer
           | Points to top of the stack
           | Usually grows downward (high → low addresses)

R14 (LR)   | Link Register
           | BL (Branch with Link) stores the return address here:
           | the address of the instruction right after the BL
           | Functions return with BX LR (or the older MOV PC, LR)

R15 (PC)   | Program Counter
           | Reads as the current instruction's address + 8 in ARM state
           | (+4 in Thumb state), a legacy of the early 3-stage pipeline
           | Writing to PC performs a branch

CPSR       | Current Program Status Register (32-bit flags)
           | N (Negative): result was negative
           | Z (Zero): result was zero
           | C (Carry): unsigned overflow or shift-out
           | V (Overflow): signed overflow
           | I (IRQ disable), F (FIQ disable) flags
           | M0-M4 (Processor mode bits)
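
When you build the emulator or an exception handler, you will need to pull these fields out of a raw CPSR value. Here is a minimal Python sketch, assuming the usual ARMv7-A bit positions (N=31, Z=30, C=29, V=28, I=7, F=6, mode in bits 4-0). The T (Thumb state) bit at position 5 is not listed above and is included only for completeness. The name decode_cpsr and the dictionary layout are illustrative choices, not part of any toolchain.

```python
# Mode field encodings for ARMv7-A (Monitor and Hyp modes omitted for brevity).
MODES = {
    0b10000: "User",
    0b10001: "FIQ",
    0b10010: "IRQ",
    0b10011: "Supervisor",
    0b10111: "Abort",
    0b11011: "Undefined",
    0b11111: "System",
}

def decode_cpsr(cpsr):
    """Split a 32-bit CPSR value into its flag and mode fields."""
    def bit(n):
        return (cpsr >> n) & 1

    return {
        "N": bit(31), "Z": bit(30), "C": bit(29), "V": bit(28),
        "I": bit(7), "F": bit(6), "T": bit(5),
        "mode": MODES.get(cpsr & 0x1F, "Unknown"),
    }

print(decode_cpsr(0x600000D3))
```

For the sample value 0x600000D3, the sketch reports Z=1 and C=1, both interrupt masks set, and Supervisor mode.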

AArch64 (64-bit) Register Set:

Registers X0-X30 (each 64 bits), SP, PC (implicit):

X0-X7      | Argument/Return Register
           | First 8 integer arguments
           | Return values in X0-X1

X8         | Indirect Result Register
           | Used for large struct returns

X9-X15     | Caller-Saved Temporary
           | Can be freely used; caller must save if needed

X16-X17    | Intra-Procedure Call Register
           | Scratch registers the linker may use for veneers/trampolines

X18        | Platform Register
           | Reserved (some platforms use for TLS - Thread Local Storage)

X19-X28    | Callee-Saved Register
           | Must save/restore if function uses them

X29 (FP)   | Frame Pointer
           | Points to current stack frame
           | Convention: helps debuggers understand stack

X30 (LR)   | Link Register
           | Return address; BL stores PC+4 here

SP         | Stack Pointer
           | Must be 16-byte aligned at function boundaries

PC         | Program Counter
           | Not directly accessible in AArch64 (unlike ARMv7)
           | ADR/ADRP instructions reference it

PSTATE     | Processor State (a set of fields, not a single register)
           | N, Z, C, V flags (same meaning as ARMv7)
           | Plus D, A, I, F interrupt masks, current exception level, etc.

W0-W30     | 32-bit Views of X0-X30
           | Writing to W0 clears upper 32 bits of X0
           | Useful for 32-bit operations in 64-bit mode
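
The zero-extension rule for W registers is easy to get wrong in an emulator. The following Python sketch models it with a toy register file. The class RegisterFile and its method names are hypothetical helpers invented for this illustration.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF
MASK32 = 0xFFFFFFFF

class RegisterFile:
    """Toy model of X0-X30 with their 32-bit W views."""

    def __init__(self):
        self.x = [0] * 31

    def write_x(self, n, value):
        self.x[n] = value & MASK64

    def write_w(self, n, value):
        # A W write zero-extends: the upper 32 bits of Xn become 0.
        self.x[n] = value & MASK32

    def read_w(self, n):
        # A W read simply returns the low 32 bits of Xn.
        return self.x[n] & MASK32

regs = RegisterFile()
regs.write_x(0, 0x1122334455667788)
regs.write_w(0, 0xDEADBEEF)
print(hex(regs.x[0]))  # 0xdeadbeef: the old upper half is gone
```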

How It Works: The Call Stack

When function A calls function B:

Before call (in A):
  X0 = arg1, X1 = arg2, ... (arguments)
  BL function_b              (PC+4 stored in X30/LR)

In function B:
  Stack frame setup (32-byte frame; offsets relative to SP on entry):
    [SP - 32] = saved X29 (old frame pointer)
    [SP - 24] = saved X30 (return address)
    [SP - 16] = local_var1
    [SP - 8]  = local_var2
    SP = SP - 32, then X29 = SP  (new frame pointer)

  Do work with arguments in X0-X7
  If calling C: prepare C's args in X0-X7, BL function_c

  Return:
    X0 = return value
    Restore X29, X30 from stack; release the frame (SP = SP + 32)
    RET (branches to the address in X30)

Back in A:
  Return value in X0
  Continue execution
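
The frame setup and teardown can be traced with a small simulation. This Python sketch assumes a common AArch64 prologue/epilogue pair (STP X29, X30, [SP, #-32]! with MOV X29, SP, and LDP X29, X30, [SP], #32 with RET). Compilers may emit different sequences, and the function names, the dictionary-based state, and the sample addresses are all made up for the example.

```python
def prologue(state, mem, frame_size=32):
    """Model: STP X29, X30, [SP, #-32]!  then  MOV X29, SP."""
    state["sp"] -= frame_size
    mem[state["sp"]] = state["x29"]        # saved frame pointer
    mem[state["sp"] + 8] = state["x30"]    # saved return address
    state["x29"] = state["sp"]             # new frame pointer

def epilogue(state, mem, frame_size=32):
    """Model: LDP X29, X30, [SP], #32  then  RET."""
    state["x29"] = mem[state["sp"]]
    state["x30"] = mem[state["sp"] + 8]
    state["sp"] += frame_size
    return state["x30"]                    # address that RET branches to

# Function B is entered with LR pointing back into A.
state = {"sp": 0x8000, "x29": 0x8040, "x30": 0x1004}
mem = {}

prologue(state, mem)
assert state["sp"] % 16 == 0               # AArch64 stack alignment rule
state["x30"] = 0x2008                      # a nested BL to C overwrites LR
return_to = epilogue(state, mem)

print(hex(return_to), hex(state["sp"]), hex(state["x29"]))
```

Even though the nested call overwrote X30, the saved copy on the stack lets B return to A, with SP and X29 back at their entry values.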

Common Misconceptions

Myth: “All registers are the same; just use any one.”

  • Reality: Calling conventions enforce which registers hold arguments, returns, and must be saved. Violating this breaks interoperability.

Myth: “The stack is just memory; I can put anything there.”

  • Reality: The stack must be aligned (16-byte in AArch64), and exception/interrupt handlers expect a valid stack to save state.

Myth: “PC is just another register; I can read it like R0.”

  • Reality: In ARMv7, reading PC gives PC+8 (pipeline effect). In AArch64, you can’t read PC directly; use ADR or ADRP instructions.

Check Your Understanding

  1. If I call a function with 10 arguments, how many fit in registers? Where do the rest go?
  2. Why is R13 special? What breaks if you use it like a normal register?
  3. In ARMv7, if I read R15 (PC) in the middle of a function, what value do I get and why?
  4. What happens if a function overwrites the callee-saved register R4 and doesn’t restore it before returning?

Answers

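  1. In ARMv7, the first 4 go in R0-R3 and the remaining 6 go on the stack. In AArch64, the first 8 go in X0-X7 and the remaining 2 go on the stack. The caller writes the extra argument values to the stack before the call.
  2. R13 is the stack pointer. If you overwrite it, the CPU loses track of the stack and exceptions/returns fail catastrophically.
  3. You get the address of the current instruction + 8 (in ARM state). Early ARM cores had a 3-stage pipeline, so the PC was already two instructions (8 bytes) ahead by the time an instruction executed. The architecture has kept that behavior ever since.
  4. The caller’s code that relied on R4’s value will see garbage or crash. Compilers follow the ABI automatically, but hand-written assembly must save and restore R4 itself.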
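
Answer 1 can be turned into a small lookup. This Python sketch covers only word-sized integer arguments. Floating-point values, 64-bit values on ARMv7, and structs follow extra rules in the real procedure call standards. The function name place_args and the "[SP+n]" notation are illustrative assumptions.

```python
def place_args(count, abi):
    """Return where each of `count` word-sized integer arguments lives."""
    if abi == "armv7":
        prefix, num_regs, slot = "R", 4, 4     # R0-R3, then 4-byte stack slots
    elif abi == "aarch64":
        prefix, num_regs, slot = "X", 8, 8     # X0-X7, then 8-byte stack slots
    else:
        raise ValueError(f"unknown ABI: {abi}")

    places = []
    for i in range(count):
        if i < num_regs:
            places.append(f"{prefix}{i}")
        else:
            # Stack arguments sit at increasing offsets from SP at the call.
            places.append(f"[SP+{(i - num_regs) * slot}]")
    return places

print(place_args(10, "armv7"))
print(place_args(10, "aarch64"))
```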

Real-World Applications

  • Debuggers: Need to understand register conventions to extract arguments from a crashed function
  • JIT Compilers: Allocate which variables go in which registers to minimize memory access
  • OS Kernels: Save/restore all registers on context switch; manage LR for interrupt returns
  • Reverse Engineering: Identify function arguments and return values by tracking register flow

Where You’ll Apply It

  • Project 2 (Calculator): Work with R0-R7, understand SVC calling convention
  • Project 4 (UART Driver): Save/restore registers in interrupt handlers
  • Project 7 (Context Switcher): Save all registers on task switch
  • Project 15 (Tiny OS): Schedule register allocation across tasks

References

  • ARM Cortex-M3/M4 Technical Reference Manual - ARM Ltd
  • “The Art of ARM Assembly, Vol 1” - Randall Hyde, Chapters 1-3
  • Azeria Labs: “Writing ARM Assembly” series

Key Insight

Registers are the fastest memory on the CPU. Every memory access is slower than a register operation. The ISA design centers on maximizing register usage to minimize memory traffic. This is why ARM processors can be fast despite being “simpler” than x86—they move work from hardware to smart compilers that allocate registers efficiently.

Summary

ARM’s register model is uniform and generous. Unlike x86’s historically small, specialized register set, ARMv7 gives you 16 registers (R0-R15, including SP, LR, and PC) and AArch64 gives you 31 general-purpose registers plus SP. The calling convention (ABI) dictates which registers hold what, making function calls predictable and interoperable.

Homework & Exercises

  1. Write pseudocode for a function that takes 8 arguments. Where does each argument live in ARMv7 vs AArch64?
  2. Trace through a stack frame: show how SP and FP change as you enter/exit nested functions
  3. Read an ARM assembly disassembly and identify all register saves/restores

2. Load/Store Architecture

Fundamentals

The defining characteristic of RISC (Reduced Instruction Set Computer) is Load/Store separation:

  • Data processing instructions (ADD, SUB, MUL, etc.) work only on registers
  • Load instructions move data FROM memory TO registers
  • Store instructions move data FROM registers TO memory
  • No instruction can operate directly on memory (unlike x86: ADD [RAX], RBX)

This design choice has profound consequences for everything from compiler design to pipeline performance.

Deep Dive

x86 Example (CISC):

ADD [memory_address], 5     ; Add 5 to value in memory (no register intermediate)
                            ; This instruction:
                            ;  1. Fetch from memory
                            ;  2. Add 5
                            ;  3. Write back
                            ;  All in one instruction!

ARM Example (RISC):

LDR R0, [R1]                ; Load: memory → register
ADD R0, R0, #5              ; Add: register ← register + 5
STR R0, [R1]                ; Store: register → memory
                            ; Three instructions, but each does one thing

Why Load/Store?

  1. Simpler CPU design: Each instruction stage (fetch, decode, execute, memory, writeback) is straightforward
  2. Better pipelining: No instruction competes for memory and execute in the same cycle
  3. Compiler-friendly: Compiler can allocate variables to registers; only use memory for spills
  4. Cache efficiency: Predictable memory patterns help cache prefetchers

Load/Store Instruction Formats

LDR Rd, [Rn {, #offset}]    ; Load word (32-bit)
LDRB Rd, [Rn {, #offset}]   ; Load byte (8-bit, zero-extend)
LDRH Rd, [Rn {, #offset}]   ; Load halfword (16-bit, zero-extend)
LDRSB Rd, [Rn {, #offset}]  ; Load byte (sign-extend)

STR Rd, [Rn {, #offset}]    ; Store word
STRB Rd, [Rn {, #offset}]   ; Store byte
STRH Rd, [Rn {, #offset}]   ; Store halfword

LDM Rn, {Rx, Ry, ...}       ; Load Multiple (faster bulk load)
STM Rn, {Rx, Ry, ...}       ; Store Multiple

Examples

; Load a 32-bit value from memory address in R1
LDR R0, [R1]                ; R0 = *R1

; Load with offset
LDR R0, [R1, #4]            ; R0 = *(R1 + 4)

; Load with register offset
LDR R0, [R1, R2]            ; R0 = *(R1 + R2)

; Load with shift
LDR R0, [R1, R2, LSL #2]    ; R0 = *(R1 + R2*4)

; Store
STR R0, [R1]                ; *R1 = R0

; Store multiple (e.g., function prologue)
STMDB SP!, {R4-R7, LR}      ; Push 5 registers; SP is decremented first, then written back
                            ; Same effect as PUSH {R4-R7, LR}

; Load multiple (e.g., function epilogue)
LDMIA SP!, {R4-R7, PC}      ; Pop 5 values from stack, then auto-increment SP
                            ; The saved LR is loaded into PC = return to caller
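
To see why the prologue/epilogue pair returns correctly, it helps to model the two instructions. The Python sketch below assumes the standard LDM/STM rule that the lowest-numbered register goes to the lowest address. The function names, the dictionary-based register file, and the sample values are invented for illustration.

```python
# Register numbering order used by LDM/STM: the lowest-numbered register
# always goes to the lowest memory address.
REG_ORDER = [f"R{i}" for i in range(13)] + ["SP", "LR", "PC"]

def stmdb_sp(regs, mem, reg_list):
    """Model STMDB SP!, {reg_list} (same effect as PUSH)."""
    names = sorted(reg_list, key=REG_ORDER.index)
    regs["SP"] -= 4 * len(names)              # decrement before storing
    for i, name in enumerate(names):
        mem[regs["SP"] + 4 * i] = regs[name]

def ldmia_sp(regs, mem, reg_list):
    """Model LDMIA SP!, {reg_list} (same effect as POP)."""
    names = sorted(reg_list, key=REG_ORDER.index)
    for i, name in enumerate(names):
        regs[name] = mem[regs["SP"] + 4 * i]
    regs["SP"] += 4 * len(names)              # increment after loading

regs = {"R4": 4, "R5": 5, "R6": 6, "R7": 7,
        "LR": 0x1000, "PC": 0x2000, "SP": 0x8000}
mem = {}

stmdb_sp(regs, mem, ["R4", "R5", "R6", "R7", "LR"])   # prologue
regs.update(R4=0, R5=0, R6=0, R7=0)                   # function body clobbers them
ldmia_sp(regs, mem, ["R4", "R5", "R6", "R7", "PC"])   # epilogue

print(regs["R4"], hex(regs["PC"]), hex(regs["SP"]))
```

The slot that LR was stored into is the slot PC is loaded from, which is why popping into PC returns to the caller.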

How It Works: Memory Addressing Modes

ARM supports several addressing modes:

1. Immediate Offset:
   LDR R0, [R1, #8]         ; R1 + 8 (12-bit offset for word LDR/STR in ARM state)

2. Register Offset:
   LDR R0, [R1, R2]         ; R1 + R2

3. Scaled Register Offset:
   LDR R0, [R1, R2, LSL #2] ; R1 + (R2 << 2) = R1 + R2*4

4. Pre-indexed (write-back):
   LDR R0, [R1, #4]!        ; R0 = *(R1+4); then R1 = R1+4

5. Post-indexed (write-back):
   LDR R0, [R1], #4         ; R0 = *R1; then R1 = R1+4

6. PC-relative (literal load):
   LDR R0, label            ; R0 = *(PC + offset_to_label)
   ; Assembler computes the offset (AArch64 equivalent: LDR X0, label)
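
The difference between plain offset, pre-indexed, and post-indexed addressing comes down to which address is accessed and what is written back to the base register. A minimal Python sketch of that rule follows. The function name effective_address and the mode strings are assumptions made for this example.

```python
MASK32 = 0xFFFFFFFF

def effective_address(base, offset, mode):
    """Return (address_accessed, new_base_value) for one load/store.

    mode is "offset" for [Rn, #off], "pre" for [Rn, #off]!,
    or "post" for [Rn], #off.
    """
    target = (base + offset) & MASK32
    if mode == "offset":
        return target, base        # access base+offset, base unchanged
    if mode == "pre":
        return target, target      # access base+offset, then write it back
    if mode == "post":
        return base, target        # access base, then write back base+offset
    raise ValueError(f"unknown mode: {mode}")

r1, r2 = 0x1000, 3
print(effective_address(r1, 8, "offset"))        # LDR R0, [R1, #8]
print(effective_address(r1, r2 << 2, "offset"))  # LDR R0, [R1, R2, LSL #2]
print(effective_address(r1, 4, "pre"))           # LDR R0, [R1, #4]!
print(effective_address(r1, 4, "post"))          # LDR R0, [R1], #4
```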

Common Misconceptions

Myth: “ARM is slower because it takes 3 instructions instead of 1.”

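  • Reality: A pipelined processor overlaps instructions (fetching instruction 2 while decoding instruction 1 and executing instruction 0), so simple instructions can complete at a rate of about one per cycle. The load, add, and store in this sequence still depend on each other, but the pipeline keeps making progress on the surrounding instructions, and modern superscalar ARM cores issue several instructions per cycle.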

Myth: “Just use one big instruction; it’s easier.”

  • Reality: Pipelined CPUs benefit from uniform, short instructions. A complex memory-modifying instruction creates a pipeline bubble (where nothing else can execute in parallel).

Myth: “I can access memory from any register to any register.”

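  • Reality: Addressing modes have restrictions. For example, in ARM state LDRH and LDRSB accept only an 8-bit immediate offset and cannot use a shifted register offset. Check the ARM manual for the instruction you are using.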

Check Your Understanding

  1. Write ARM assembly to implement: array[i] = array[i] + 10 where array base is in R0, i is in R1
  2. Why does ADD R0, [memory], R1 not exist in ARM?
  3. What’s the difference between LDR R0, [R1]! and LDR R0, [R1], #0?
  4. Can you load directly from memory into memory? Why or why not?

Answers

1.

LDR R2, [R0, R1, LSL #2]    ; R2 = array[i] (R1 << 2 = R1*4 for 32-bit elements)
ADD R2, R2, #10
STR R2, [R0, R1, LSL #2]    ; array[i] = R2
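  2. Because ARM data processing instructions take only register (or immediate) operands; memory is reached only through loads and stores. Keeping the two roles separate keeps each instruction simple to decode and pipeline.

  3. Both load from the address in R1 and leave R1 unchanged: LDR R0, [R1]! (where the assembler accepts it) is pre-indexed with an implied zero offset, and LDR R0, [R1], #0 is post-indexed with a zero offset. There is no implicit “advance by word size.” The two forms differ only when the offset is non-zero: LDR R0, [R1, #4]! loads from R1+4, while LDR R0, [R1], #4 loads from R1, and both leave R1 = R1+4.

  4. No. Only load/store instructions access memory. To do memory-to-memory, you need an intermediate register. This is Load/Store architecture’s core rule.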

Real-World Applications

  • Cache optimization: Knowing memory access patterns helps predict/prefetch
  • SIMD vectorization: SIMD instructions process multiple values; Load/Store separation enables efficient data movement
  • Embedded systems: Minimal registers → careful memory access → predictable power consumption
  • Compiler backends: Register allocators minimize memory traffic by keeping hot variables in registers

Where You’ll Apply It

  • Project 1 (Decoder): Understand LDR/STR encoding
  • Project 2 (Calculator): Use LDR/STR for input/output buffers
  • Project 3 (LED Blinker): Access GPIO registers via memory-mapped I/O
  • Project 5 (Allocator): Walk heap pointers using LDR with offsets

References

  • ARM Architecture Reference Manual, Section A5 (Load/Store Instructions)
  • “Computer Systems: A Programmer’s Perspective”, Chapter 3 - Bryant & O’Hallaron
  • Azeria Labs: “ARM Shellcode” series (shows memory access patterns)

Key Insight

Load/Store separation is the foundation of modern CPU design. It enables pipelining, cache optimization, and compiler-friendly code generation. By forcing memory operations to be explicit, the ISA makes performance bottlenecks visible.

Summary

ARM’s Load/Store architecture means data processing and memory access are separate. Every memory access is explicit and predictable, which benefits pipelined execution and cache design. This is unlike CISC architectures (x86) where one instruction can access memory and compute simultaneously, creating hidden complexity.

Homework & Exercises

  1. Convert x86 code using memory operands to Load/Store form
  2. Optimize a loop that repeatedly loads/stores to the same array by keeping the base pointer in a register
  3. Calculate the encoding size and execution time for both approaches

3. Conditional Execution & The Barrel Shifter

Fundamentals

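Conditional Execution: In ARM state (A32), most instructions can be executed conditionally based on the CPU flags (N, Z, C, V) set by a previous instruction. Instead of branching around a short sequence, you let each instruction either take effect or act as a no-op. Thumb-2 offers the same through IT blocks, while AArch64 drops general predication in favor of conditional-select instructions such as CSEL.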

Barrel Shifter: A hardware component that shifts the second operand as part of a data processing instruction, without needing a separate shift instruction.

Together, these features minimize branching (expensive in pipelined CPUs) and enable compact, efficient code.

Deep Dive

Condition Codes (bits 31-28 of instruction):

0000 = EQ    (Equal, Z=1)
0001 = NE    (Not Equal, Z=0)
0010 = CS/HS (Carry Set, unsigned higher or same)
0011 = CC/LO (Carry Clear, unsigned lower)
0100 = MI    (Minus, negative, N=1)
0101 = PL    (Plus, non-negative, N=0)
0110 = VS    (Overflow Set, V=1)
0111 = VC    (Overflow Clear, V=0)
1000 = HI    (Unsigned higher)
1001 = LS    (Unsigned lower or same)
1010 = GE    (Signed ≥)
1011 = LT    (Signed <)
1100 = GT    (Signed >)
1101 = LE    (Signed ≤)
1110 = AL    (Always - default)
1111 = NV    (Never; deprecated, reused for unconditional instructions since ARMv5)
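
For the decoder and emulator projects, this table becomes a single function. The Python sketch below follows the structure of the table: the upper three bits select a base test and the lowest bit inverts it. One deliberate simplification: the encoding 1111 is treated as "always execute", as the unconditional encodings on newer cores are, rather than as "never". The function name condition_passed is a hypothetical choice.

```python
def condition_passed(cond, n, z, c, v):
    """Evaluate a 4-bit ARM condition field against the N, Z, C, V flags."""
    base = cond >> 1
    if base == 0b000:
        result = z == 1                  # EQ / NE
    elif base == 0b001:
        result = c == 1                  # CS / CC
    elif base == 0b010:
        result = n == 1                  # MI / PL
    elif base == 0b011:
        result = v == 1                  # VS / VC
    elif base == 0b100:
        result = c == 1 and z == 0       # HI / LS
    elif base == 0b101:
        result = n == v                  # GE / LT
    elif base == 0b110:
        result = n == v and z == 0       # GT / LE
    else:
        result = True                    # AL (and 1111 in this sketch)

    # Odd encodings are the logical inverse of the even one before them,
    # except 1111, which is not "never" here.
    if cond & 1 and cond != 0b1111:
        result = not result
    return result

# Flags after CMP 5, 10: N=1, Z=0, C=0, V=0
print(condition_passed(0b1011, 1, 0, 0, 0))  # LT
print(condition_passed(0b1010, 1, 0, 0, 0))  # GE
```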

Without conditional execution (branching around the ADD):

CMP R0, #10              ; Compare R0 with 10
BLT skip                 ; Skip the ADD if R0 < 10
ADD R1, R1, #1           ; R1++ (runs only if R0 ≥ 10)
skip:

With conditional execution (ARM way):

CMP R0, #10              ; Set flags
ADDGE R1, R1, #1         ; R1++ only if ≥ (no branch!)

Benefit: The second version has no branch, so there is nothing to mispredict and no pipeline flush. Both instructions flow through the pipeline in order, and the ADDGE’s condition code is checked against the flags at execution time.

The Barrel Shifter:

The barrel shifter is a combinational circuit that shifts operands in one cycle:

LSL (Logical Shift Left):   x << n     (fills with 0s)
LSR (Logical Shift Right):  x >> n     (fills with 0s)
ASR (Arithmetic Shift Right): x >> n   (fills with sign bit)
ROR (Rotate Right):         x >> n | x << (32-n)
RRX (Rotate Right eXtended): (x >> 1) | (C << 31)
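
These five operations are straightforward to model on 32-bit values, which is useful when writing the emulator. The Python sketch below assumes shift amounts in the range 0-31 and ignores the carry-out that the real shifter also produces. The function names simply mirror the mnemonics.

```python
MASK32 = 0xFFFFFFFF

def lsl(x, n):
    """Logical shift left: vacated low bits are filled with 0."""
    return (x << n) & MASK32

def lsr(x, n):
    """Logical shift right: vacated high bits are filled with 0."""
    return (x & MASK32) >> n

def asr(x, n):
    """Arithmetic shift right: vacated high bits copy the sign bit."""
    x &= MASK32
    if x & 0x80000000:
        return ((x >> n) | (MASK32 << (32 - n))) & MASK32
    return x >> n

def ror(x, n):
    """Rotate right: bits shifted out on the right re-enter on the left."""
    n %= 32
    x &= MASK32
    return ((x >> n) | (x << (32 - n))) & MASK32

def rrx(x, carry):
    """Rotate right by one through the carry flag."""
    return ((x & MASK32) >> 1) | ((carry & 1) << 31)

print(hex(lsl(1, 4)), hex(asr(0x80000000, 4)), hex(ror(1, 8)), hex(rrx(2, 1)))
```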

Without barrel shifter (multiple instructions):

LSL R3, R2, #3           ; R3 = R2 << 3 (standalone shift, needs a scratch register)
ADD R0, R1, R3           ; R0 = R1 + R3 (a second instruction for the add)

With barrel shifter (one instruction):

MOV R1, R1, LSL #2       ; R1 = R1 << 2 (in the MOV itself!)

Or in a data processing instruction:

ADD R0, R1, R2, LSL #3   ; R0 = R1 + (R2 << 3) = R1 + R2*8
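
The shift is encoded in the same 32-bit word as the ADD, which is what the instruction decoder project has to unpack. The following Python sketch decodes only the A32 data-processing form whose second operand is a register shifted by an immediate. It does not screen out the miscellaneous instructions that share this encoding space. The function name, the field names in the returned dictionary, and the sample word 0xE0810182 (hand-assembled here for ADD R0, R1, R2, LSL #3) are assumptions for illustration.

```python
SHIFT_NAMES = ["LSL", "LSR", "ASR", "ROR"]

def decode_dp_register(instr):
    """Decode an A32 data-processing instruction with an
    immediate-shifted register as operand 2."""
    is_data_processing = (instr >> 26) & 0b11 == 0b00
    immediate_operand = (instr >> 25) & 1      # I bit: 1 means imm8 + rotate
    register_shift = (instr >> 4) & 1          # 1 means shift amount in Rs
    if not is_data_processing or immediate_operand or register_shift:
        raise ValueError("not an immediate-shifted register form")

    return {
        "cond": (instr >> 28) & 0xF,           # 0xE = AL (always)
        "opcode": (instr >> 21) & 0xF,         # 0b0100 = ADD
        "S": (instr >> 20) & 1,                # 1 = update flags
        "Rn": (instr >> 16) & 0xF,             # first operand
        "Rd": (instr >> 12) & 0xF,             # destination
        "shift_imm": (instr >> 7) & 0x1F,      # shift amount
        "shift": SHIFT_NAMES[(instr >> 5) & 0b11],
        "Rm": instr & 0xF,                     # register that gets shifted
    }

print(decode_dp_register(0xE0810182))
```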

Examples:

; Conditional execution examples
CMP R0, R1
ADDLT R0, R0, #1         ; If less than, R0++
MOVEQ R2, #0             ; If equal, R2 = 0
STRNE R0, [R1]           ; If not equal, store R0

; Barrel shifter examples
MOV R0, R1, ROR #8       ; R0 = R1 rotated right by 8 bits
MOV R0, #0x01000000      ; Immediate is encoded as 0x01 rotated right by 8
ADD R0, R1, R2, LSL #2   ; R0 = R1 + R2*4
SUB R0, R1, R2, ASR #1   ; R0 = R1 - (R2 / 2)

; Conditional + Shifter
ADDSLT R0, R1, R2, LSL #3 ; If less: R0 = R1 + R2*8, update flags
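
The rotate hardware also explains which constants fit in an instruction: an A32 data-processing immediate is stored as an 8-bit value plus a 4-bit rotation, and the operand is the 8-bit value rotated right by twice the rotation field. A minimal Python sketch of both directions follows. The helper names decode_imm12 and encode_imm12 are made up for this example.

```python
MASK32 = 0xFFFFFFFF

def ror32(x, n):
    """Rotate a 32-bit value right by n bits."""
    n %= 32
    return ((x >> n) | (x << (32 - n))) & MASK32

def decode_imm12(imm12):
    """Expand a 12-bit immediate field (rot:imm8) to its 32-bit value."""
    rot = (imm12 >> 8) & 0xF
    imm8 = imm12 & 0xFF
    return ror32(imm8, 2 * rot)

def encode_imm12(value):
    """Return a 12-bit field that produces `value`, or None if the
    constant cannot be expressed as a rotated 8-bit value."""
    for rot in range(16):
        # Rotating left by 2*rot undoes the decoder's rotate right.
        imm8 = ror32(value, 32 - 2 * rot)
        if imm8 <= 0xFF:
            return (rot << 8) | imm8
    return None

print(hex(encode_imm12(0x01000000)))   # 0x401: imm8 = 1, rot = 4
print(encode_imm12(0x101))             # None: needs 9 significant bits
```

Constants that return None here must be loaded some other way, for example from a literal pool with a PC-relative LDR.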

How It Works: Conditional Execution Pipeline

Cycle 1: CMP R0, R1
         - Execute compares R0 and R1
         - Update flags (N, Z, C, V)

Cycle 2: ADDGE R1, R1, #1  ; Condition code is GE
         - Fetch/Decode continues in parallel
         - At execute time: Check flags against GE condition
         - If condition true: Execute ADD
         - If condition false: Discard result (write nothing)

The key insight: The instruction is always fetched and decoded; the condition is checked at execution time.
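
The two cycles above can be modelled directly: CMP computes a subtraction and keeps only the flags, and the conditional instruction consults them. In the Python sketch below, cmp_flags follows the usual ARM convention that C=1 means "no borrow", and addge hard-codes the GE test (N equals V). Both function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF

def cmp_flags(a, b):
    """Return (N, Z, C, V) as CMP a, b sets them (a - b, result discarded)."""
    a &= MASK32
    b &= MASK32
    result = (a - b) & MASK32
    n = result >> 31                          # sign bit of the result
    z = int(result == 0)
    c = int(a >= b)                           # C=1 means no borrow occurred
    v = ((a ^ b) & (a ^ result)) >> 31        # signed overflow
    return n, z, c, v

def addge(r1, flags):
    """Model ADDGE R1, R1, #1: the add takes effect only if N == V."""
    n, z, c, v = flags
    if n == v:
        return (r1 + 1) & MASK32
    return r1                                 # condition failed: no write

print(cmp_flags(5, 10), addge(7, cmp_flags(5, 10)))
print(cmp_flags(12, 10), addge(7, cmp_flags(12, 10)))
```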

Common Misconceptions

Myth: “Conditional execution means the instruction doesn’t execute.”

  • Reality: The instruction is fetched, decoded, and executed. The result is discarded if the condition is false. This avoids a pipeline flush, but it’s not “zero cost.”

Myth: “Barrel shifter is ‘free’; always use it.”

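  • Reality: The barrel shifter sits on the second operand’s path into the ALU. Shifts by a constant typically cost no extra cycles, but on many cores a register-specified shift (e.g., LSL R3) takes an extra cycle, and the extra logic still draws power in energy-constrained embedded systems.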

Myth: “Use conditional execution everywhere to avoid branches.”

  • Reality: Conditional execution only works for short sequences (usually 1-3 instructions). Longer sequences should use branches. Too much conditional execution confuses compilers and readers.

Check Your Understanding

  1. Write conditional execution to implement: if (x > y) z = a * b; else z = c / d; (assuming x, y, a, b, c, d in R0-R5 and z in R6)
  2. Why is ADD R0, R1, R2, LSL #2 faster than separate MOV + ADD instructions?
  3. Can you conditionally execute a branch instruction? (Hint: BEQ is B with a condition code suffix)
  4. How many shifts can you apply in a single instruction?

Answers

1.

; Assumes x=R0, y=R1, a=R2, b=R3, c=R4, d=R5, and z is produced in R6
CMP R0, R1           ; x vs y
MULGT R6, R2, R3     ; If x > y: z = a*b
SDIVLE R6, R4, R5    ; If x ≤ y: z = c/d (SDIV needs a core with hardware divide)
; MUL and SDIV leave the flags alone, so both conditions see the CMP result.
; Note: For longer bodies you'd usually use branches for control flow logic.

  2. Because the shifter sits on the operand path into the ALU. The instruction decodes “ADD with left-shifted operand” and completes in one execute stage, instead of: MOV (shift), ADD (wait for shift result).

  3. Yes. BEQ label is a branch with the EQ condition. It is always fetched and decoded, but the branch is taken only if Z=1. Conditional branches are the normal way to implement loops and if/else; BLEQ is the conditional form of branch-with-link.

  4. Only one barrel shifter. You can apply one shift to one operand per instruction: ADD Rd, Rn, Rm, LSL #3 shifts Rm; Rn is not shifted.

Real-World Applications

  • Loop unrolling with conditional execution: Reduce branches in tight loops
  • Inline comparisons: Compare while loading: LDR R0, [R1]; CMPNE R0, #0
  • Bit manipulation: MOV R0, R1, LSR #8 extracts a byte; ORR R0, R0, R1, LSL #8 packs bytes
  • Address calculations: LDR R0, [R1, R2, LSL #2] loads from array without intermediate MOV

Where You’ll Apply It

  • Project 1 (Decoder): Understand condition code encoding
  • Project 2 (Calculator): Conditional execution for operator branching
  • Project 7 (Scheduler): Conditional execution in context switch code
  • Project 15 (OS): Tight loops using conditional execution and shifts

References

  • ARM Architecture Reference Manual, Section A8 (Conditional Execution)
  • “The Art of ARM Assembly, Vol 1”, Chapters 5-6 - Randall Hyde
  • Azeria Labs: “ARM Assembly Basics Part 4” (conditionals)

Key Insight

Conditional execution and barrel shifter reduce pipeline stalls. By avoiding branches and multi-cycle operations, ARM code can execute at one instruction per cycle. This is a major reason ARM is efficient—not because it’s simpler, but because it’s optimized for pipelined execution.

Summary

ARM’s conditional execution (based on flags) and barrel shifter (shifts in one cycle) minimize branching and optimize tight loops. Together, they enable compact, high-performance code that leans less heavily on branch prediction.

Homework & Exercises

  1. Write conditional execution for absolute value: z = abs(x)
  2. Implement array element access with shifting: array[i] = value where array is 64-bit elements
  3. Convert a cascade of if-else statements to ARM with minimal branches

Glossary

  • ABI: Application Binary Interface; defines how functions call each other (register conventions, stack layout)
  • AArch32: 32-bit execution state of ARM (ARMv8+)
  • AArch64: 64-bit execution state of ARM (ARMv8+)
  • ALU: Arithmetic Logic Unit; the part of CPU that executes ADD, SUB, etc.
  • ARMv7: 32-bit ARM architecture (Cortex-A9, A8)
  • ARMv8: 64-bit ARM architecture (Cortex-A57, Apple A7+)
  • Bare-metal: Running code without an OS (no libc, no system calls except maybe SVC)
  • Barrel Shifter: Hardware that shifts operands in one cycle as part of ALU operations
  • Calling Convention: ABI specification for which registers pass arguments/returns
  • Conditional Execution: Executing or skipping an instruction based on CPU flags
  • CPSR: Current Program Status Register; holds flags and mode bits (ARMv7)
  • EABI: Embedded ABI; calling convention for embedded ARM systems
  • Exception: Interrupt, fault, or trap (hardware event requiring CPU attention)
  • FIQ: Fast Interrupt; high-priority interrupt with dedicated banked registers
  • ISA: Instruction Set Architecture; the programmer-visible interface of CPU
  • IRQ: Interrupt Request; standard interrupt
  • Load/Store Architecture: ISA where memory access is separate from computation
  • LR: Link Register (R14/X30); holds return address for function calls
  • MMIO: Memory-Mapped I/O; accessing peripherals via memory addresses
  • MMU: Memory Management Unit; translates virtual to physical addresses
  • NEON: SIMD (Single Instruction Multiple Data) extension for ARM
  • PSTATE: Processor State; AArch64’s collection of flags and state fields (equivalent of CPSR)
  • RISC: Reduced Instruction Set Computer; simple, uniform instructions
  • ROR: Rotate Right; shifts bits with wraparound
  • Supervisor Mode: Privileged mode where OS kernel runs
  • SVC: Supervisor Call (formerly SWI); software interrupt to invoke OS
  • Thumb: Compact instruction encoding (16-bit instructions originally; Thumb-2 mixes 16-bit and 32-bit instructions)
  • TLB: Translation Lookaside Buffer; cache of MMU translations
  • User Mode: Unprivileged mode for applications
  • Vector Table: Array of exception handler addresses at a fixed location
  • ZCR: SVE Control Register (ZCR_ELx); controls the effective SVE vector length

Why ARM Matters: Context & Statistics

Modern Adoption (2024-2025)

According to recent market analysis:

  • Mobile market: ARM has 90%+ market share in smartphones and tablets
  • IoT & Embedded: ARM covers 65% of IoT and embedded computing
  • Servers: ARM-based servers growing at 14.3% CAGR; estimated $14.51B by 2030 (vs. $5.84B in 2023)
  • PCs/Laptops: ARM-based notebooks expected to reach 25% market share by 2027 (Apple captured 90% of ARM notebook market with M-series)
  • Market size: Entire ARM processor market expected to grow from $37.41B (2024) to $170.4B (2032)—20.87% CAGR

Why ARM Dominates

1. Energy Efficiency

ARM’s design prioritizes power per instruction. A smartphone running an ARM chip drains the battery less than equivalent x86 would. This made ARM the natural choice for mobile, which then scaled to IoT and wearables.

2. Scalability

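The same architecture family scales from tiny 32-bit microcontrollers (Cortex-M0) to 64-bit servers (Cortex-A72+). The instruction sets differ in detail (Thumb on Cortex-M, A64 on servers), but the concepts you learn on one carry over to the others.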

3. Licensing Model

Unlike x86 (effectively an Intel/AMD duopoly) or MIPS (once fragmented), ARM licenses the ISA to dozens of companies. Apple, Qualcomm, Samsung, MediaTek, Broadcom all design ARM chips. This ecosystem diversity has been crucial to ARM’s success.

4. Simplicity for Compilers

Load/Store architecture, uniform registers, and conditional execution make ARM code straightforward for compilers to optimize. LLVM, GCC, and Clang all have excellent ARM backends.

The Shift to 64-bit: ARMv8/AArch64

In 2011, ARM introduced ARMv8 with a 64-bit execution state (AArch64). This was crucial because:

  • Servers and high-end devices need 64-bit address space and 64-bit registers
  • Apple, AWS, and hyperscalers built their silicon around AArch64
  • ARMv7 is being phased out in favor of ARMv8+ (ARMv9-A restricts AArch32 to optional user-space support, and many recent cores are 64-bit only)

Today’s landscape:

  • ARMv7-A (32-bit): Legacy; still common in older IoT/embedded
  • ARMv8-A (64-bit): Current standard; all modern phones, servers, Macs
  • ARMv9 (2021+): Enhanced security, SIMD, and scalability

Comparison: ARM vs x86 vs RISC-V

Feature | ARM | x86 | RISC-V
Market Share | 90%+ mobile, 65% IoT | PCs, servers | Emerging (5%?)
Power Efficiency | Excellent | Moderate | Theoretical
Ecosystem | Mature, massive | Mature, Intel/AMD | Early
Development | Standardized | Proprietary | Open ISA
Cost | Licensing model | High entry barrier | Free to implement
Performance | Competitive | Leading (briefly) | TBD
Learning Curve | Moderate | Steep (x86 complex) | Easy (RISC-V simple)

Takeaway: ARM is the clear winner in market share and momentum. RISC-V is promising but not yet production-ready in consumer products. x86 remains entrenched in PCs/servers but losing mobile/embedded to ARM.

Real-World Examples

  • Apple M1/M2/M3: Apple has moved its Mac lineup from Intel to its own ARM-based chips
  • Raspberry Pi: $35 computer that powers an entire education/hobbyist ecosystem
  • AWS Graviton: Amazon’s ARM chips for data centers; claimed better price/performance
  • 5G & AI: ARM chips chosen for edge computing (power-efficient, low latency)
  • Automotive: Electric vehicles heavily invest in ARM; traditional automotive still x86/PowerPC

Career Impact

  • ARM knowledge is in-demand: Embedded systems, automotive, IoT, mobile development
  • Competitive advantage: Most developers only know one architecture; understanding ARM sets you apart
  • Future-proof: ARM is growing; x86 is legacy in many markets
  • Interviewing: System-level interviews often assume x86; ARM knowledge impresses

Why Learn ARM from First Principles?

  1. You’ll understand processor design - ISA → microarchitecture → optimization
  2. Embedded systems become your playground - STM32, Raspberry Pi, and many Arduino boards are ARM
  3. You’ll reverse-engineer binaries - iOS, Android, firmware all ARM assembly
  4. Performance optimization becomes tangible - See how pipeline, cache, and branching affect real code
  5. You’ll be hireable - Positions in automotive, IoT, mobile development heavily seek this

Concept Summary Table

Concept Cluster | What You Need to Internalize
ARM Registers & Calling Conventions | AArch32 has 16 registers, R0-R15 (AArch64 has 31 general-purpose registers; x86 has 8-16). Registers R0-R3 pass arguments, R4-R11 must be saved by callees. R13=SP, R14=LR, R15=PC. This is fundamental to all code.
Load/Store Architecture | Data processing works only on registers. Memory is accessed via explicit LDR/STR instructions. No “ADD [memory], value” like x86. This separates concerns and enables pipelining.
Conditional Execution | In 32-bit ARM (A32), most instructions can be conditional (ADDEQ, STRNE, etc.) based on flags. This avoids branches and pipeline flushes in short sequences. AArch64 keeps only conditional branches and a few conditional select/compare instructions.
Barrel Shifter | The shifter is part of the ALU; apply shifts in one cycle: ADD R0, R1, R2, LSL #3. Enables efficient scaling and bit manipulation.
Instruction Pipelining | Classic ARM cores use a 3-stage pipeline; modern cores use many more stages. Instructions are fetched and decoded while previous ones execute. Reading PC in ARM state yields the current instruction’s address + 8 (+ 4 in Thumb), a legacy of the 3-stage pipeline. Mispredicted branches are expensive.
Exception Model | Exceptions (interrupts, faults) trigger handlers via vector table. Different processor modes (User, Supervisor, FIQ, IRQ) with banked registers for context.
Memory Management | MMU translates virtual addresses to physical; TLB caches translations. Cache hierarchies (L1, L2) improve memory bandwidth. Understanding cache is key to performance.
ARMv7 vs ARMv8/AArch64 | ARMv7 is 32-bit (legacy); ARMv8+ adds the 64-bit AArch64 state (modern). AArch64 has 31 GPRs (vs 16 in ARMv7), simpler exception model. Future is AArch64.
Thumb Instruction Set | Compressed encodings: 16-bit in original Thumb, mixed 16/32-bit in Thumb-2 (T32), versus fixed 32-bit ARM (A32). Reduces code size, sometimes at the cost of more instructions. AArch32 supports both A32 and T32; Cortex-M cores run Thumb only; AArch64 has no Thumb and uses fixed 32-bit A64 encodings.
NEON/SIMD | Optional SIMD extension (like SSE on x86); process multiple data in parallel. Not required for basics, but used in multimedia/DSP.
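
Conditional execution comes down to a 4-bit condition field tested against the N, Z, C, and V flags. The following is a minimal C sketch of that test, roughly as a decoder or emulator might model it. The names flags_t and cond_passes are illustrative and not taken from any ARM header, and treating 0xF as "always" is a simplification made for this sketch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag container: one bool per CPSR condition flag. */
typedef struct {
    bool n, z, c, v;
} flags_t;

/* Returns true if an instruction with 4-bit condition `cond` would execute. */
bool cond_passes(uint32_t cond, flags_t f)
{
    switch (cond & 0xFu) {
    case 0x0: return f.z;                 /* EQ: equal */
    case 0x1: return !f.z;                /* NE: not equal */
    case 0x2: return f.c;                 /* CS/HS: unsigned >= */
    case 0x3: return !f.c;                /* CC/LO: unsigned < */
    case 0x4: return f.n;                 /* MI: negative */
    case 0x5: return !f.n;                /* PL: positive or zero */
    case 0x6: return f.v;                 /* VS: overflow */
    case 0x7: return !f.v;                /* VC: no overflow */
    case 0x8: return f.c && !f.z;         /* HI: unsigned > */
    case 0x9: return !f.c || f.z;         /* LS: unsigned <= */
    case 0xA: return f.n == f.v;          /* GE: signed >= */
    case 0xB: return f.n != f.v;          /* LT: signed < */
    case 0xC: return !f.z && f.n == f.v;  /* GT: signed > */
    case 0xD: return f.z || f.n != f.v;   /* LE: signed <= */
    default:  return true;                /* AL (0xE); 0xF simplified to "always" */
    }
}
```

For example, after CMP R0, R0 the flags are Z=1, C=1, N=0, V=0, so an ADDEQ executes while an ADDGT is skipped.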

Project-to-Concept Map

Project | Primary Concepts | Secondary Concepts
1. Instruction Decoder | Instruction Pipelining, ARMv7 Encoding | Calling Conventions, Exception Model
2. Assembly Calculator | Registers & Conventions, Load/Store, Conditional Execution | Instruction Pipelining, Barrel Shifter
3. LED Blinker | Exception Model, Memory Management, MMIO | Registers, Load/Store, Conditional Execution
4. UART Driver | Exception Model (interrupts), MMIO, Calling Conventions | Memory Management, Barrel Shifter
5. Memory Allocator | Memory Management, Registers, Load/Store | Barrel Shifter, Instruction Pipelining
6. Bootloader | Exception Model (reset vector), Memory Management | Instruction Pipelining, Load/Store
7. Context Switcher | Registers & Conventions, Exception Model | Memory Management, Instruction Pipelining
8. Emulator | All concepts (simulation of CPU) | All
9. Exception Handler | Exception Model, Registers | Memory Management, Instruction Pipelining
10-14. Drivers | MMIO, Exception Model, Registers | Memory Management, Conditional Execution
15. Tiny OS | All concepts (integration) | All

Deep Dive Reading by Concept

Concept | Book & Chapter | Why This Matters
ARM Registers & Conventions | “The Art of ARM Assembly, Vol 1” by Randall Hyde - Chapters 1-3 | Foundational for every function call
 | “ARM Architecture Reference Manual” - Section A4 (Registers) | Official specification
Load/Store Architecture | “Computer Systems: A Programmer’s Perspective” by Bryant & O’Hallaron - Chapter 3 | Explains CPU design principles
 | ARM Architecture Reference Manual - Section A5 (Load/Store) | Official instruction specs
Conditional Execution | “The Art of ARM Assembly, Vol 1” - Chapters 5-6 | Performance optimization technique
 | Azeria Labs blog: “ARM Assembly Basics Part 4” | Practical examples
Barrel Shifter | “The Art of ARM Assembly, Vol 1” - Chapter 4 | Essential for bit manipulation
 | ARM Architecture Reference Manual - Section A8.4 | Encoding details
Instruction Pipelining | “Computer Systems: A Programmer’s Perspective” - Chapter 4 | Explains CPU execution model
 | “The Art of Computer Programming, Vol 1” by Knuth - Section 1.4.1 | Theoretical foundations
Exception Handling | “The Definitive Guide to ARM Cortex-M3 and Cortex-M4” by Joseph Yiu - Chapter 8 | NVIC, interrupt priorities
 | ARM Cortex-M Technical Reference Manual - Exception Model | Official specs
Memory Management | “Computer Systems: A Programmer’s Perspective” - Chapter 9 | Virtual memory, caching
 | “Understanding the Linux Kernel” by Bovet & Cesati - Chapter 2 | Practical OS perspective
ARMv8/AArch64 | “ARM Cortex-A Series Programmer’s Guide for ARMv8-A” by ARM Ltd | Modern 64-bit architecture
 | “Programming with ARM Neon” (ARM Ltd) | SIMD basics
Thumb Instruction Set | ARM Architecture Reference Manual - Section A6 (Thumb) | Code size optimization
 | “Writing Compact ARM Thumb Code” (ARM white paper) | Best practices

Quick Start: Your First 48 Hours

Day 1 (Friday Evening)

  1. Read this guide’s introduction (1 hour)
    • Understand why ARM matters
    • Skim the register diagrams
  2. Set up environment (30 min)
    • Install arm-none-eabi-gcc, QEMU
    • Verify installation
  3. Start Project 1: Instruction Decoder (2-3 hours)
    • Read the “Real World Outcome” section
    • Read the “Core Question You’re Answering” section
    • Understand the mental model (instruction encoding)
    • Do not code yet—just understand

Day 2 (Saturday)

  1. Read Theory Primer sections 1-3 (2 hours)
    • Registers, Load/Store, Conditional Execution
    • Understand how they fit together
  2. Continue Project 1 (4-6 hours)
    • Follow Hints in Layers progressively
    • Code the instruction decoder
    • Test against provided examples
  3. Quick validation (30 min)
    • Does your decoder match the reference output?

By Sunday Morning

You should understand:

  • How ARM instructions are encoded (32-bit format)
  • What registers do and how they’re used
  • How the CPU pipeline affects code

Path 1: The Embedded Systems Engineer

Goal: Build real systems on hardware (STM32, Raspberry Pi)

  1. Project 1 (Decoder) - understand the ISA
  2. Project 2 (Calculator) - write real ARM code
  3. Project 3 (LED Blinker) - first bare-metal code
  4. Project 4 (UART) - debug via serial
  5. Project 10 (I2C/OLED) - add sensors
  6. Project 12 (PWM Motor) - control motors
  7. Project 15 (Tiny OS) - capstone

Time: 8-12 weeks part-time
Focus: Peripheral programming, real hardware
End result: You can build IoT/embedded products

Path 2: The Systems Programmer

Goal: Understand CPU architecture deeply; build OS components

  1. Project 1 (Decoder) - understand encoding
  2. Project 2 (Calculator) - assembly basics
  3. Project 6 (Bootloader) - CPU startup sequence
  4. Project 7 (Context Switcher) - task management
  5. Project 8 (Emulator) - simulate the CPU
  6. Project 9 (Exception Handler) - interrupts
  7. Project 15 (Tiny OS) - put it together

Time: 10-14 weeks part-time
Focus: Architecture, abstraction layers
End result: You understand how CPUs and OSes work

Path 3: The Reverse Engineer / Security Researcher

Goal: Analyze ARM binaries, find vulnerabilities

  1. Project 1 (Decoder) - read ARM assembly
  2. Project 2 (Calculator) - write small programs
  3. Project 8 (Emulator) - trace execution
  4. Project 9 (Exception Handler) - understand fault handling
  5. Project 14 (GDB Stub) - debug techniques
  6. Project 15 (Tiny OS) - understand runtime environment

Time: 6-10 weeks part-time
Focus: Analysis, debugging, security
End result: You can reverse-engineer ARM binaries professionally

Path 4: The Accelerated Learner

Goal: Cover all concepts in 3-4 weeks

  1. Projects 1-2 (week 1): ISA and basics
  2. Projects 3-5 (week 2): Bare-metal systems
  3. Projects 6-9 (week 3): Advanced systems
  4. Projects 10-15 (week 4): Integration and capstone

Requires: Prior OS/system knowledge
Intensity: Full-time, ~30 hours/week
Result: Comprehensive understanding, projects less polished


Success Metrics

After completing this learning path, you should be able to:

Conceptual Mastery:

  • Explain why AArch64 has 31 general-purpose registers (and AArch32 has 16) while x86 has 8-16
  • Draw the instruction pipeline and explain why branches are expensive
  • Describe Load/Store architecture and why it’s better for pipelining
  • Explain conditional execution vs branching

Practical Skills:

  • Read ARM assembly disassembly and understand what code does
  • Write ARM assembly programs from scratch (at least 200 lines)
  • Cross-compile C code to ARM and optimize the assembly output
  • Set up and debug ARM code on QEMU or physical hardware
  • Understand and write device drivers (GPIO, UART, I2C, SPI)

Advanced Projects:

  • Decode ARM instruction binaries correctly
  • Build a bootloader that starts a CPU from reset
  • Implement context switching and a mini-scheduler
  • Write a CPU emulator that executes ARM instructions
  • Handle exceptions and interrupts correctly
  • Build a complete tiny OS with task management

Interview-Ready:

  • Explain the ARM architecture to a hiring manager
  • Solve ARM assembly problems on whiteboard
  • Debug a production embedded issue end-to-end
  • Discuss tradeoffs between CISC (x86) and RISC (ARM)
  • Optimize code for pipeline efficiency

Optional Domain Appendices

Appendix A: ARM Development Toolchain Reference

Compilers:

  • arm-none-eabi-gcc - GCC for bare-metal ARM
  • arm-linux-gnueabihf-gcc - GCC for Linux on ARM
  • armv7l-rpi2-linux-gnueabihf - Raspberry Pi specific

Assemblers:

  • arm-none-eabi-as - GNU assembler
  • armasm - Arm’s proprietary assembler (ships with the Arm Compiler toolchain)

Debuggers:

  • gdb-multiarch - GDB for ARM targets
  • arm-none-eabi-gdb - ARM-specific GDB
  • OpenOCD - On-Chip Debugger (hardware probe interface)

Emulators:

  • QEMU (qemu-system-arm, qemu-system-aarch64)
  • Unicorn Engine (QEMU-derived CPU emulation library written in C, with Python and other language bindings)

Binary Analysis:

  • objdump - Disassemble binaries
  • readelf - Read ELF headers and sections
  • nm - List symbols
  • file - Identify binary format
  • strip - Remove debug symbols
  • size - Show code/data/bss sizes

Useful Scripts:

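# Compile and link
arm-none-eabi-gcc -mcpu=cortex-m4 -mthumb -c program.c -o program.o
arm-none-eabi-ld -T linker.ld program.o -o program.elf

# Disassemble
arm-none-eabi-objdump -d program.elf | head -50

# Run in QEMU, halted at reset (-S) with a GDB server on port 1234 (-s).
# The machine must match the target CPU: "virt" models a Cortex-A board,
# so Cortex-M4 code needs an M-profile machine such as mps2-an386.
qemu-system-arm -machine mps2-an386 -kernel program.elf -nographic -S -s

# Debug (in a second terminal)
arm-none-eabi-gdb program.elf
(gdb) target remote localhost:1234
(gdb) break main
(gdb) continue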

Appendix B: Common Pitfalls & Quick Fixes

Symptom | Likely Cause | Fix
“Undefined reference to _start” | No entry point defined | Add a _start: label to assembly or use C main() with startup code
“illegal operand” | Instruction format wrong | Check bit widths (immediates must fit in allowed bits)
Segfault or wrong address when using the PC value | Pipeline effect; PC reads ahead of the current instruction (+8 in ARM state, +4 in Thumb) | Account for the offset, or let the assembler compute PC-relative addresses (ADR, LDR Rd, =label)
Breakpoint not working | QEMU running wrong target | Ensure matching QEMU machine and CPU type
Data corruption in arrays | Off-by-one in shift calculation | Check: [base, index, LSL #2] for 32-bit elements (not #3!)
Stack overflow | Infinite recursion or stack too small | Increase stack size in linker script; trace call depth
Interrupt never fires | NVIC not configured | Enable interrupt in NVIC_ISER register; set priority

Appendix C: ARM ISA Quick Reference

Data Processing (ALU Operations):

MOV Rd, Rm              Register move
MVN Rd, Rm              Move NOT
ADD Rd, Rn, Rm          Add
SUB Rd, Rn, Rm          Subtract
MUL Rd, Rn, Rm          Multiply (32x32→32)
SMULL RdLo, RdHi, Rn, Rm  Signed multiply long (32x32→64)
UMULL RdLo, RdHi, Rn, Rm  Unsigned multiply long (32x32→64)
UDIV Rd, Rn, Rm         Unsigned divide (ARMv7-M/R; optional on ARMv7-A)
SDIV Rd, Rn, Rm         Signed divide (ARMv7-M/R; optional on ARMv7-A)
AND Rd, Rn, Rm          Bitwise AND
ORR Rd, Rn, Rm          Bitwise OR
EOR Rd, Rn, Rm          Bitwise XOR
BIC Rd, Rn, Rm          Bit clear (AND NOT)
CMP Rn, Rm              Compare (flags only)
CMN Rn, Rm              Compare Negate (flags only)

Memory Access (Load/Store):

LDR Rd, [Rn]            Load word (32-bit)
LDRH Rd, [Rn]           Load halfword (16-bit)
LDRB Rd, [Rn]           Load byte (8-bit)
STR Rd, [Rn]            Store word
STRH Rd, [Rn]           Store halfword
STRB Rd, [Rn]           Store byte
LDM Rn, {reglist}       Load multiple
STM Rn, {reglist}       Store multiple
PUSH {reglist}          Push to stack
POP {reglist}           Pop from stack

Control Flow:

B label                 Branch (unconditional)
BL label                Branch with link (call function)
BX Rn                   Branch exchange (goto address in Rn, switch modes)
RET {Xn}                Return (AArch64 only; defaults to X30. On AArch32 use BX LR)
SVC #imm                Supervisor call (system call)
BEQ, BNE, BGT, etc      Conditional branches
CALL (not ARM; use BL)

Project Overview Table

The following table summarizes all 15 projects in this learning path:

# | Project Name | Language | Difficulty | Time | Coolness | Prerequisites
1 | Instruction Decoder | C | Intermediate | 1-2w | ⭐⭐⭐ | C basics
2 | Assembly Calculator | Assembly | Beginner | 1w | ⭐⭐⭐ | Proj 1 helpful
3 | LED Blinker | C+ASM | Advanced | 1-2w | ⭐⭐⭐⭐ | Proj 2
4 | UART Driver | C | Advanced | 1-2w | ⭐⭐⭐⭐ | Proj 3
5 | Memory Allocator | C | Advanced | 1-2w | ⭐⭐⭐ | Proj 2
6 | Bootloader | C+ASM | Expert | 2-3w | ⭐⭐⭐⭐⭐ | Proj 3
7 | Context Switcher | C+ASM | Advanced | 2w | ⭐⭐⭐⭐ | Proj 6
8 | ARM Emulator | C/Rust | Advanced | 3-4w | ⭐⭐⭐⭐ | Proj 1
9 | Exception Handler | C+ASM | Advanced | 1-2w | ⭐⭐⭐ | Proj 3
10 | I2C/OLED Driver | C | Advanced | 1-2w | ⭐⭐⭐ | Proj 4
11 | SPI/SD Card Driver | C | Advanced | 2w | ⭐⭐⭐ | Proj 4
12 | PWM Motor Control | C | Advanced | 1-2w | ⭐⭐⭐ | Proj 4
13 | DMA Audio Player | C | Expert | 2-3w | ⭐⭐⭐⭐ | Proj 4
14 | GDB Stub Debugger | C | Expert | 2-3w | ⭐⭐⭐⭐ | Proj 9
15 | Tiny Operating System | C+ASM | Expert | 4-6w | ⭐⭐⭐⭐⭐ | Proj 6,7

Project List

The following projects guide you from foundational understanding to building complete systems. Projects are roughly ordered by dependency, though you can choose your own path based on your goals (see “Recommended Learning Paths”).

Project 1: ARM Instruction Decoder & Disassembler

View Detailed Guide

  • File: P01-arm-instruction-decoder.md
  • Main Programming Language: C
  • Alternative Programming Languages: Rust, Python, Go
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: Binary Parsing / Instruction Sets
  • Software or Tool: Custom Disassembler (like objdump)
  • Main Book: “The Art of ARM Assembly, Volume 1” by Randall Hyde

What you’ll build: A command-line tool that takes raw ARM binary code (or ELF files) and decodes each instruction into human-readable assembly, showing the opcode breakdown, registers used, and instruction effects.

Why it teaches ARM: Before you can write ARM assembly, you need to understand how instructions are encoded. Every ARM instruction fits into 32 bits with a specific structure. This project forces you to internalize the encoding scheme—condition codes, opcodes, register fields, immediate values, and shift operations.

Real World Outcome

$ ./arm-decode firmware.bin

0x00000000: E3A00001  MOV   R0, #1         ; R0 = 1
0x00000004: E3A01002  MOV   R1, #2         ; R1 = 2
0x00000008: E0802001  ADD   R2, R0, R1     ; R2 = R0 + R1
0x0000000C: E1A0F00E  MOV   PC, LR         ; Return (PC = LR)
0x00000010: 0A000003  BEQ   0x00000024     ; Branch if Z=1
0x00000014: E59F0010  LDR   R0, [PC, #16]  ; Load from 0x2C (this address + 8 + 16)
0x00000018: E1520000  CMP   R2, R0         ; Compare R2 with R0
0x0000001C: C2833005  ADDGT R3, R3, #5    ; If greater, R3 += 5

Instruction breakdown for 0xE0802001:
  Cond: 1110 (AL - Always)
  Type: 000 (Data Processing, register operand)
  OpCode: 0100 (ADD)
  S-bit: 0 (Don't update flags)
  Rn: 0000 (R0)
  Rd: 0010 (R2)
  Operand2: 000000000001 (R1, no shift)
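
The BEQ line in the listing shows a target address, not the raw offset stored in the instruction. One way to compute it is sketched below in C; branch_target is a hypothetical helper name. It relies on two facts about A32 branches: the PC reads as the instruction address plus 8, and the 24-bit offset is sign-extended and multiplied by 4.

```c
#include <stdint.h>

/* Target of an A32 B/BL instruction `insn` located at address `addr`. */
uint32_t branch_target(uint32_t addr, uint32_t insn)
{
    uint32_t imm24 = insn & 0x00FFFFFFu;
    /* Sign-extend from bit 23 using only unsigned (wrapping) arithmetic. */
    uint32_t offset = (imm24 ^ 0x00800000u) - 0x00800000u;
    /* PC is 8 bytes ahead; the offset counts 4-byte words. */
    return addr + 8u + (offset << 2);
}
```

With the values from the listing, branch_target(0x10, 0x0A000003) gives 0x24. A word such as 0xEAFFFFFE (offset -2) branches back to its own address.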

The Core Question You’re Answering

“How is a 32-bit ARM instruction decoded into human-readable assembly?”

Before you code, understand that an ARM instruction is just 32 bits with a rigid structure. Bits 31-28 are condition, bits 27-25 determine type, and so on. Your job is to extract these bit fields and map them to mnemonics and operand formats. This is exactly what objdump does internally.

Concepts You Must Understand First

  1. Binary bit extraction and masking
    • How to extract bits 31-28: (instruction >> 28) & 0xF
    • Understand masks: 0xF = 0b1111 (4 bits)
    • Book Reference: “Computer Systems: A Programmer’s Perspective” Ch. 2 - Bryant & O’Hallaron
  2. ARM Instruction Encoding
    • Condition codes (bits 31-28)
    • Instruction type (bits 27-25)
    • Opcode and operand encoding
    • Book Reference: ARM Architecture Reference Manual - Section A5
  3. Immediate value rotation
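    • 8-bit value rotated right by twice the 4-bit rotation field
    • How to decode: with rot = 2 * field, the value is (imm8 >> rot) | (imm8 << (32 - rot)) when rot is nonzero, and imm8 unchanged when rot is 0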
    • Book Reference: “The Art of ARM Assembly, Vol 1” Ch. 4 - Randall Hyde
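
The rotated-immediate rule is easy to get subtly wrong, so here is a minimal C sketch of it. The helper names ror32 and decode_imm12 are illustrative. The guard for a rotation of 0 matters because shifting a 32-bit value by 32 is undefined behavior in C.

```c
#include <stdint.h>

/* Rotate a 32-bit value right by n bits (n is taken modulo 32). */
uint32_t ror32(uint32_t value, unsigned n)
{
    n &= 31u;
    return n ? (value >> n) | (value << (32u - n)) : value;
}

/* Decode the immediate form of Operand2:
 * bits 11-8 hold the rotation field, bits 7-0 the 8-bit value. */
uint32_t decode_imm12(uint32_t insn)
{
    uint32_t rot  = (insn >> 8) & 0xFu;
    uint32_t imm8 = insn & 0xFFu;
    return ror32(imm8, 2u * rot);   /* rotate right by twice the field */
}
```

For instance, 0xE3A00001 (MOV R0, #1) has a rotation field of 0 and decodes to 1, while 0xE3A004FF has a rotation field of 4, so 0xFF is rotated right by 8 bits to give 0xFF000000.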

Questions to Guide Your Design

  1. Architecture:
    • How will you organize your decoder? (Single big switch? Helper functions? Table-driven?)
    • Will you support ELF files or just raw binaries?
    • How will you handle instruction variants (ARM vs Thumb)?
  2. Implementation:
    • How do you build a lookup table for mnemonics vs opcodes?
    • How do you format output to match objdump or other tools?
    • Can you validate instruction encoding (e.g., detect invalid opcodes)?
  3. Optimization:
    • Can you minimize code size?
    • Can you add statistics (opcode frequency, etc.)?

Thinking Exercise: Manual Decode

Given 0xE3A00001, decode it manually:

Bits 31-28 (Cond): 1110 = AL (Always)
Bits 27-25 (Type): 001 = Data Processing Immediate
Bits 24-21 (OpCode): 1101 = MOV (move to Rd, source is Operand2)
Bit 20 (S-flag): 0 = Don't update CPSR
Bits 19-16 (Rn): 0000 = unused by MOV (encoded as zero)
Bits 15-12 (Rd): 0000 = R0
Bits 11-0 (Operand2): 0x001 = Immediate #1 (no rotation)

Result: MOV R0, #1

Now try: 0xE59F0010 and 0xC2833005 (harder!)

The Interview Questions They’ll Ask

  1. “How does ARM encode an instruction in 32 bits? Walk me through an example.”
  2. “What’s the difference between immediate and register operands in ARM?”
  3. “How does the barrel shifter affect instruction encoding?”
  4. “Can you decode this binary instruction? What does it do?”
  5. “Why does the rotation apply to immediates? What problem does it solve?”

Hints in Layers

Hint 1: Start Simple Begin by just reading a 32-bit value and extracting condition code. Print “AL” if condition is 0xE. Build from there: add instruction type detection, then opcodes, then operands.

Hint 2: Pseudo-structure Think of the instruction as having clear sections:

Bits 31-28: Condition  
Bits 27-25: Type
Bits 24-21: Opcode (depends on type)
...

Write helper functions to extract each range.
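
As a rough illustration of such helpers, the C sketch below extracts an arbitrary bit range and splits a data-processing instruction into its fields. The names bits, dp_fields_t, and split_dp are hypothetical; a real decoder would first check the type bits before interpreting the rest this way.

```c
#include <stdint.h>

/* Extract bits hi..lo (inclusive) of a 32-bit instruction word. */
uint32_t bits(uint32_t insn, unsigned hi, unsigned lo)
{
    unsigned width = hi - lo + 1u;
    uint32_t mask = (width >= 32u) ? 0xFFFFFFFFu : ((1u << width) - 1u);
    return (insn >> lo) & mask;
}

/* Hypothetical decoded view of an A32 data-processing instruction. */
typedef struct {
    uint32_t cond;      /* bits 31-28 */
    uint32_t type;      /* bits 27-25 */
    uint32_t opcode;    /* bits 24-21 */
    uint32_t s;         /* bit 20 */
    uint32_t rn;        /* bits 19-16 */
    uint32_t rd;        /* bits 15-12 */
    uint32_t operand2;  /* bits 11-0 */
} dp_fields_t;

dp_fields_t split_dp(uint32_t insn)
{
    dp_fields_t f;
    f.cond     = bits(insn, 31, 28);
    f.type     = bits(insn, 27, 25);
    f.opcode   = bits(insn, 24, 21);
    f.s        = bits(insn, 20, 20);
    f.rn       = bits(insn, 19, 16);
    f.rd       = bits(insn, 15, 12);
    f.operand2 = bits(insn, 11, 0);
    return f;
}
```

Applied to 0xE0802001 this yields cond 0xE, opcode 0x4 (ADD), Rn 0, Rd 2, and operand2 0x001, matching the breakdown shown in the Real World Outcome.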

Hint 3: Look-Up Tables Build tables mapping values to mnemonics:

const char *cond_names[] = {
    "EQ", "NE", "CS", "CC", "MI", "PL", "VS", "VC",
    "HI", "LS", "GE", "LT", "GT", "LE", "AL", "NV"
};

Hint 4: Test with Known Good Use arm-none-eabi-objdump on real binaries to compare your output. The reference output is your test oracle.

Books That Will Help

Topic | Book | Chapter
Binary bit manipulation | “Computer Systems” - Bryant & O’Hallaron | Ch. 2.1
ARM instruction encoding | ARM Architecture Reference Manual | A5.1-A5.3
Opcode decoding | “The Art of ARM Assembly, Vol 1” - Randall Hyde | Ch. 4-5
ELF format (if reading files) | “Practical Binary Analysis” - Dennis Andriesse | Ch. 2

Common Pitfalls & Debugging

Problem 1: “All my decoded instructions say AL (Always)”

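  • Why: In compiler-generated ARM code nearly every instruction really is AL (0xE), so this is often correct. It is a bug only if a word you know to be conditional also decodes as AL.
  • Fix: Decode a known conditional word such as 0x0A000003 (BEQ) and confirm you get EQ. ARM code is almost always stored little-endian, so the bytes 01 00 a0 e3 on disk form the word 0xE3A00001; check the raw bytes with hexdump -C.
  • Quick test:
    echo -ne '\x01\x00\xa0\xe3' | xxd  # Shows bytes 01 00 a0 e3, i.e. the little-endian word 0xE3A00001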
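
Byte order is easiest to get right by assembling each word explicitly instead of casting a byte pointer, which also makes the decoder behave the same on any host. A minimal sketch, assuming a little-endian firmware image (read_le32 is an illustrative name):

```c
#include <stdint.h>

/* Assemble one little-endian 32-bit instruction word from 4 raw bytes. */
uint32_t read_le32(const uint8_t *p)
{
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}
```

Given the bytes 01 00 a0 e3, this returns 0xE3A00001, whose top nibble is the condition code 0xE.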
    

Problem 2: “Immediate values are wrong; I get huge numbers”

  • Why: You forgot to apply rotation. An immediate is an 8-bit value rotated right by 2x the 4-bit rotation field.
  • Fix: Apply: value = imm8 ror (rot*2)
  • Quick test: Check that #1 decodes from 0x001, not something huge.

Problem 3: “My output doesn’t match objdump”

  • Why: Operand format, capitalization, or spacing differences
  • Fix: Look at objdump output format exactly; replicate spacing and case
  • Quick test:
    arm-none-eabi-objdump -d program.elf | head -20
    

Definition of Done

  • Decodes ARM data processing instructions correctly (ADD, SUB, MOV, etc.)
  • Decodes load/store instructions (LDR, STR, LDRH, etc.)
  • Decodes branch instructions (B, BEQ, BL, etc.)
  • Correctly handles condition codes for all 16 conditions
  • Correctly decodes barrel shifter operands (LSL, LSR, ASR, ROR)
  • Handles immediate values with rotation correctly
  • Output matches arm-none-eabi-objdump -d for at least 50 test instructions
  • Processes a real firmware binary and produces reasonable output
  • Error handling for invalid/undefined instructions

Project 2: ARM Assembly Calculator

View Detailed Guide

  • File: P02-arm-assembly-calculator.md
  • Main Programming Language: ARM Assembly
  • Alternative Programming Languages: N/A (this must be pure assembly)
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 1: Beginner
  • Knowledge Area: ARM Assembly / Basic Arithmetic
  • Software or Tool: QEMU (ARM emulator) or Raspberry Pi
  • Main Book: “ARM Assembly By Example” (armasm.com)

What you’ll build: A four-function calculator entirely in ARM assembly that reads two numbers and an operator from stdin, performs the calculation (add, subtract, multiply, divide), and outputs the result.

Why it teaches ARM: This is your “Hello World” for ARM assembly. You’ll learn registers, arithmetic instructions, system calls, branching, and basic program structure. By handling I/O without libc, you understand how programs interact with the operating system at the lowest level.

Real World Outcome

$ ./armcalc
Enter first number: 42
Enter operator (+, -, *, /): *
Enter second number: 13
Result: 546

$ ./armcalc
Enter first number: 100
Enter operator (+, -, *, /): /
Enter second number: 7
Result: 14 remainder 2

$ ./armcalc
Enter first number: 255
Enter operator (+, -, *, /): +
Enter second number: 256
Result: 511

The Core Question You’re Answering

“How do I write a complete program in pure ARM assembly without any C or libraries?”

This forces you to understand: system calls (read, write, exit), register usage, stack management, and basic control flow. You’ll see how the CPU actually executes code, without any abstractions.

Concepts You Must Understand First

  1. ARM Calling Conventions (Linux EABI)
    • R0-R3: First 4 arguments, also scratch
    • R4-R11: Must be saved/restored by function
    • R12, R13 (SP), R14 (LR), R15 (PC): Special
    • Book Reference: Azeria Labs blog: “ARM Assembly Basics” series
  2. Linux System Calls on ARM
    • Move syscall number into R7
    • Arguments in R0-R3
    • SVC #0 triggers the kernel
    • Result in R0
    • Book Reference: ARM Assembly By Example - armasm.com
  3. ASCII to Integer Conversion
    • ‘0’ = 0x30, ‘9’ = 0x39
    • To convert char to digit: subtract 0x30 (‘0’)
    • Multi-digit: accumulator = accumulator * 10 + digit
    • Book Reference: “Introduction to Computer Organization: ARM Edition” Ch. 7 - Robert Plantz
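
Before writing the conversion loops in assembly, it can help to pin down the logic in C and then translate it instruction by instruction. The sketch below shows both directions; parse_uint and format_uint are illustrative names, and the code deliberately ignores overflow and signs.

```c
#include <stddef.h>
#include <stdint.h>

/* Parse leading decimal digits: acc = acc * 10 + (c - '0').
 * Stops at the first non-digit (for example the trailing newline). */
uint32_t parse_uint(const char *s)
{
    uint32_t acc = 0;
    while (*s >= '0' && *s <= '9') {
        acc = acc * 10u + (uint32_t)(*s - '0');
        s++;
    }
    return acc;
}

/* Write `value` as decimal ASCII into buf (at least 11 bytes).
 * Returns the number of characters written, excluding the NUL. */
size_t format_uint(uint32_t value, char *buf)
{
    char tmp[10];               /* a 32-bit value has at most 10 digits */
    size_t n = 0;
    do {
        tmp[n++] = (char)('0' + value % 10u);  /* least significant first */
        value /= 10u;
    } while (value != 0);
    for (size_t i = 0; i < n; i++)
        buf[i] = tmp[n - 1 - i];               /* reverse into the output */
    buf[n] = '\0';
    return n;
}
```

The do/while in format_uint guarantees that 0 is printed as "0" instead of an empty string, a case that is easy to miss in the assembly version.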

Questions to Guide Your Design

  1. Architecture:
    • How will you organize your code? (One big main or multiple functions?)
    • Will you use the stack for local variables or keep everything in registers?
    • How will you handle input/output?
  2. Implementation:
    • How do you convert ASCII string to integer?
    • How do you convert integer back to ASCII string?
    • How do you implement division (does ARMv7 have UDIV? What if not?)
  3. Error Handling:
    • What if the user types non-numeric input?
    • What if they divide by zero?

Thinking Exercise: System Call Trace

Trace through reading a single character from stdin:

@ GNU as uses '@' for comments on ARM; ';' separates statements
MOV R7, #3          @ Syscall 3 = read
MOV R0, #0          @ fd = 0 (stdin)
MOV R1, R13         @ buffer address (sp)
MOV R2, #1          @ read 1 byte
SVC #0              @ Call kernel
                    @ R0 now contains number of bytes read (usually 1)

What’s in memory at SP after this? What’s in R0?

The Interview Questions They’ll Ask

  1. “Write the assembly to read a number from stdin and store it in R0.”
  2. “How do you implement division if the CPU doesn’t have a UDIV instruction?”
  3. “What’s the purpose of the LR (Link Register)? How do you use it?”
  4. “Walk me through a system call. What happens when you execute SVC #0?”
  5. “How do you know if a system call failed?”
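
For the last question, the relevant Linux convention is that a raw system call returns a negative errno value in R0 on failure, with the range -4095 to -1 reserved for errors. The C sketch below models that check; syscall_failed and syscall_errno are hypothetical names, and r0 stands for the value left in R0 after SVC #0.

```c
#include <stdbool.h>

/* True if a raw Linux syscall return value indicates an error. */
bool syscall_failed(long r0)
{
    return r0 < 0 && r0 >= -4095;
}

/* Positive errno for a failed call, or 0 if the call succeeded. */
int syscall_errno(long r0)
{
    return syscall_failed(r0) ? (int)-r0 : 0;
}
```

A read that returns 1 succeeded and read one byte; a return of -9 means the call failed with errno 9 (EBADF on Linux).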

Hints in Layers

Hint 1: Just Print Start by printing “Hello, ARM” to verify system calls work. Use syscall 4 (write). Once this works, you know your toolchain is set up correctly.

Hint 2: Read One Number Implement reading one number from stdin. Get the user input, convert ASCII to integer, store in R0. Test with simple numbers like “42”.

Hint 3: Add Two Numbers Implement the add operation: read two numbers, add them, print result. Once this works, you understand the full loop.

Hint 4: Add the Other Operations Branch based on operator. For multiply, use MUL. For divide, use UDIV if available, or implement software division.
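
If UDIV is not available, one common fallback is shift-subtract (restoring) division. The C sketch below is one possible model to translate into assembly, not the only approach; soft_udiv and udiv_result_t are illustrative names, and the divide-by-zero result is an arbitrary choice made for this sketch.

```c
#include <stdint.h>

typedef struct {
    uint32_t quotient;
    uint32_t remainder;
} udiv_result_t;

/* Unsigned 32-bit division, one quotient bit per iteration. */
udiv_result_t soft_udiv(uint32_t n, uint32_t d)
{
    udiv_result_t r = { 0, 0 };
    if (d == 0) {
        /* Sketch-specific convention: all-ones quotient, dividend as remainder. */
        r.quotient = 0xFFFFFFFFu;
        r.remainder = n;
        return r;
    }
    for (int i = 31; i >= 0; i--) {
        /* Remember the bit that the shift is about to push out. */
        uint32_t carry = r.remainder >> 31;
        /* Bring down the next dividend bit. */
        r.remainder = (r.remainder << 1) | ((n >> i) & 1u);
        if (carry || r.remainder >= d) {
            r.remainder -= d;          /* wraps correctly when carry is set */
            r.quotient |= 1u << i;
        }
    }
    return r;
}
```

For 100 / 7 this produces quotient 14 and remainder 2, the same values shown in the sample calculator session. The loop always runs 32 times; skipping leading zero bits first (for example with CLZ) is a possible optimization.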

Books That Will Help

Topic | Book | Chapter
ARM assembly basics | Azeria Labs Blog Series | “Part 1-4”
System calls on ARM | ARM Assembly By Example |
ASCII conversion | “Introduction to Computer Organization: ARM Edition” | Ch. 7
Multiply/Divide | “The Art of ARM Assembly, Vol 1” | Ch. 7

Common Pitfalls & Debugging

Problem 1: “SVC #0 crashes or does nothing”

  • Why: The syscall number is wrong, or you’re running bare-metal (or under full-system emulation) with no Linux kernel to handle the call
  • Fix: Test under Linux first, either on real hardware or with QEMU user-mode emulation. Verify syscall numbers: read=3, write=4, exit=1
  • Quick test:
    qemu-arm ./armcalc  # user-mode emulation; qemu-system-arm -kernel would run the program bare-metal
    

Problem 2: “Characters print as garbage”

  • Why: ASCII conversion is wrong, or endianness issue
  • Fix: Check: ‘0’ = 0x30, ‘A’ = 0x41. Use printf in C to verify conversion
  • Quick test:
    printf '48\n+\n0\n' | ./armcalc  # Should print Result: 48, not garbage
    

Problem 3: “Division gives wrong answer”

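  • Why: UDIV/SDIV are optional on ARMv7-A (absent on Cortex-A8 and Cortex-A9, for example), or remainder handling is wrong
  • Fix: Check whether your target core has hardware divide. If not, implement shift-subtract division or call libgcc’s __aeabi_uidiv. UDIV returns only the quotient; compute the remainder as dividend - quotient * divisor (MLS does this in one instruction)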
  • Quick test:
    ./armcalc  # 100 / 7 should give 14 remainder 2
    

Definition of Done

  • Program compiles with arm-none-eabi-as and arm-none-eabi-ld
  • Program runs on QEMU ARM or Raspberry Pi
  • Can read two numbers from stdin
  • Can add two numbers correctly
  • Can subtract, multiply, and divide
  • Outputs results in correct ASCII format
  • Handles multi-digit numbers (not just single digits)
  • Returns exit code 0 on success
  • Total code size < 2KB

Project 3: Bare-Metal LED Blinker (STM32 Discovery Board)

View Detailed Guide

  • File: P03-led-blinker.md
  • Main Programming Language: ARM Assembly + C
  • Alternative Programming Languages: C (with inline assembly), Rust
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: Embedded Systems / Hardware Interaction
  • Software or Tool: STM32F407 Discovery Board, OpenOCD, ARM GCC
  • Main Book: “Making Embedded Systems, 2nd Edition” by Elecia White

What you’ll build: A bare-metal LED blinker that runs directly on ARM hardware without an OS, controlling GPIO pins through memory-mapped registers.

Why it teaches ARM: This forces you to understand how ARM processors interact with actual hardware through memory-mapped I/O. You’ll see registers, clock configuration, GPIO setup, and the complete boot sequence in practice.

Core challenges you’ll face:

  • Memory-mapped I/O and register access → Maps to understanding how peripherals connect to ARM
  • Clock tree configuration and RCC registers → Maps to microcontroller power and timing systems
  • GPIO mode configuration (input/output/alternate) → Maps to peripheral initialization
  • Timing and delay loops → Maps to understanding ARM execution model and instruction timing
  • Debugging without a working terminal → Maps to using hardware debuggers (JTAG/SWD)

Key Concepts:

  • Memory-Mapped I/O: “Computer Systems: A Programmer’s Perspective” by Bryant & O’Hallaron - Ch. 6
  • ARM Cortex-M GPIO: STM32F4 Reference Manual - Section 8-10
  • Clock Configuration (RCC): STM32F4 Reference Manual - Section 5
  • JTAG/SWD Debugging: “The Art of ARM Assembly” by Randall Hyde - Ch. 15

Difficulty: Intermediate
Time estimate: 1-2 weeks
Prerequisites:

  • Completed Project 1 (understand ARM instruction format)
  • Completed Project 2 (write ARM assembly)
  • Basic knowledge of electronics (GPIO, voltage, current)
  • Access to STM32F407 Discovery Board or similar ARM Cortex-M4 board

Real World Outcome

When you run this program on real hardware, you will see an LED on the discovery board blink at a regular interval. The beauty: no operating system, no runtime, no libraries—just your code running directly on the processor.

Console Output:

$ arm-none-eabi-gcc -mcpu=cortex-m4 -mthumb -O0 -nostdlib -T linker.ld main.c startup.s -o firmware.elf
$ openocd -f board/stm32f4discovery.cfg
Open On-Chip Debugger 0.11.0 [2024-01-15] (http://openocd.org)
Licensed under GNU GPL v2
For bug reports, read http://openocd.org/doc/doxygen/html/bugs.html
Info : The selected transport took over low-level target control.
       This is deprecated. Please report this as a bug.
adapter speed: 2000 kHz
adapter_nsrst_delay: 100
jtag_ntrst_delay: 100
target halted due to debug-request

$ arm-none-eabi-gdb firmware.elf
(gdb) target extended-remote localhost:3333
(gdb) load
Loading section .text, size 0x4d4 lma 0x8000000
Loading section .data, size 0x4 lma 0x80004d4
Start address 0x8000000, load size 1240
Transfer rate: 9 KB/sec, 1240 bytes/write

(gdb) continue

# Now on the physical board: LED LD4 on PD12 (green LED) blinks ON/OFF every ~1 second
# No terminal output - the LED IS your output!

What you’ll observe:

  • Green LED (LD4, PD12) on the discovery board blinks continuously
  • If you adjust the delay value in code and rebuild, the blink rate changes
  • If you use the debugger to set breakpoints at GPIO register writes, you can watch the exact moment the LED turns on/off

The Core Question You’re Answering

“How does an ARM processor talk to physical hardware without an operating system?”

Before you write any code, sit with this question. Software developers often treat hardware as abstract. But on bare-metal systems, YOU are responsible for configuring every detail—clock speeds, register addresses, timing. This project makes that concrete.

Concepts You Must Understand First

Stop and research these before coding:

  1. Memory-Mapped I/O Registers
    • How does a processor write to a GPIO pin if there’s no function call? (Answer: specific memory addresses)
    • What’s the difference between RAM and a peripheral register?
    • Book Reference: “Computer Systems: A Programmer’s Perspective” - Ch. 6 (Memory Hierarchy)
  2. Microcontroller Reset and Boot
    • What code runs before main() on bare-metal systems?
    • How does the processor know where to start executing code?
    • Book Reference: STM32F4 Reference Manual - Section 1 (Introduction)
  3. Clock Trees and RCC (Reset and Clock Control)
    • Why do you need to enable clocks for GPIO?
    • What happens if a peripheral’s clock isn’t enabled?
    • Book Reference: STM32F4 Reference Manual - Section 5 (RCC)

Questions to Guide Your Design

Before implementing, think through these:

  1. Boot Sequence
    • Where does execution start after power-on reset?
    • What must happen before main() executes?
    • Do you need a linker script? Why?
  2. Register Configuration
    • Which register controls whether GPIO pin is output vs input?
    • How do you set a GPIO pin HIGH vs LOW?
    • What does “alternate function” mean and when do you use it?
  3. Timing and Delays
    • How do you create a delay without a delay function?
    • How accurate must your delays be?
    • How would you measure if your delays are correct?
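
One way to approach the delay questions above is to derive the loop count from the core clock. The following is a minimal sketch, assuming the 16 MHz internal oscillator that the STM32F407 runs from after reset and a guessed cost of 8 cycles per loop iteration; the real cost depends on the compiler and optimization level, so treat it as a starting point to measure against. The names delay_iterations and busy_wait are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: cycles_per_iter is a rough guess. Check the disassembly
 * or measure with a timer before trusting it. */
uint32_t delay_iterations(uint32_t cpu_hz, uint32_t cycles_per_iter,
                          uint32_t delay_ms)
{
    assert(cycles_per_iter > 0);
    /* 64-bit intermediate avoids overflow of cpu_hz * delay_ms. */
    uint64_t cycles = (uint64_t)cpu_hz * delay_ms / 1000u;
    return (uint32_t)(cycles / cycles_per_iter);
}

/* The volatile counter keeps the compiler from deleting the empty loop. */
void busy_wait(uint32_t iterations)
{
    for (volatile uint32_t i = 0; i < iterations; i++) {
        /* burn cycles */
    }
}
```

Under these assumptions a one-second delay works out to 2,000,000 iterations. If you need accurate timing rather than a visible blink, SysTick or a hardware timer is the better tool.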

Thinking Exercise

Tracing GPIO Register Writes

Before coding, trace what happens when you want to turn on an LED:

Physical Goal: Turn on LED connected to GPIO pin PD12

Step 1: Enable clock to GPIOD
   - GPIOD lives in the AHB1 peripheral
   - Find the RCC_AHB1ENR register address
   - Set bit for GPIOD

Step 2: Configure PD12 as output
   - Find GPIOD_MODER register
   - PD12 is pin 12, so bits [25:24] = 0b01 (output)

Step 3: Set the pin HIGH
   - Find GPIOD_BSRR (Bit Set/Reset Register)
   - Write 1 to bit 12 to SET the pin
   - Write 1 to bit (12+16) to RESET the pin

Step 4: Create a delay loop
   - Count ARM instructions to approximate delay
   - Or use the SysTick timer
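
Step 3 can be explored on a host machine before touching hardware. The sketch below is a software model of a single BSRR write, not driver code: it assumes the behavior described in the reference manual, where the low 16 bits set pins, the high 16 bits reset pins, and a set request wins if both are written for the same pin. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Host-side model of one write to GPIOx_BSRR: returns the new output
 * state given the old one and the value written. */
uint16_t bsrr_apply(uint16_t odr, uint32_t bsrr_write)
{
    uint16_t set_bits   = (uint16_t)(bsrr_write & 0xFFFFu);
    uint16_t reset_bits = (uint16_t)(bsrr_write >> 16);
    /* Reset first, then set, so that set has priority. */
    return (uint16_t)((odr & (uint16_t)~reset_bits) | set_bits);
}

/* Value to write to BSRR to drive one pin HIGH. */
uint32_t bsrr_set(unsigned pin)
{
    assert(pin < 16);
    return 1u << pin;
}

/* Value to write to BSRR to drive one pin LOW. */
uint32_t bsrr_reset(unsigned pin)
{
    assert(pin < 16);
    return 1u << (pin + 16u);
}
```

Because a BSRR write only affects pins whose bits are 1, no read-modify-write of the output register is needed, which is worth keeping in mind for the questions that follow.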

Questions while tracing:

  • When your code writes to a control register, where does the register’s address come from, and which CPU register holds it during the store?
  • What happens if you forget to enable the clock in Step 1?
  • Why does BSRR provide separate set bits and reset bits instead of a single 0/1 bit per pin?

The Interview Questions They’ll Ask

  1. “You’re writing code that will run on millions of devices in the field. How do you debug a bare-metal program if it crashes at runtime?”
  2. “Explain the memory map of an ARM Cortex-M microcontroller. Where does code live? Where is RAM?”
  3. “Why can’t you use printf() in bare-metal code without additional setup?”
  4. “Draw the sequence of register writes needed to configure and use a GPIO pin.”
  5. “You want to set a GPIO pin HIGH, then wait 1 second, then set it LOW. Write pseudocode for the bare-metal steps.”
  6. “What’s the difference between a linker script, a startup file, and main() code?”

Hints in Layers

Hint 1: Understand the Hardware Block Diagram Your first task: Find the STM32F407 block diagram in the datasheet. Understand that GPIO peripherals connect via AHB1 bus, and the RCC (Reset & Clock Control) is the gatekeeper. You can’t touch GPIO until RCC gives permission.

Hint 2: Locate Register Addresses Each peripheral has a base address, listed in the reference manual’s memory map. For example, GPIOD starts at 0x40020C00. Each GPIOD register sits at 0x40020C00 + offset, so a store to that address is a write to that register. Find the offsets for MODER and BSRR in the reference manual.

Hint 3: Implement Clock Enable and GPIO Config Pseudocode approach:

1. Read RCC_AHB1ENR from memory
2. Set bit 3 (GPIOD enable)
3. Write back to RCC_AHB1ENR
4. Read GPIOD_MODER
5. Clear bits [25:24] (PD12)
6. Set bits [25:24] = 0b01 (output mode)
7. Write back to GPIOD_MODER
8. Create a loop that runs ~2M times to delay ~1 second
9. Toggle GPIOD_BSRR to set/reset pin 12
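
Steps 1 to 7 above can be sketched in C as follows. This is an illustration under stated assumptions, not a finished driver: the addresses come from the text, the MODER and BSRR offsets are taken from the reference manual’s GPIO register map, and the functions take a register pointer so the same code can be tried on a host with an ordinary variable. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define RCC_AHB1ENR_ADDR  0x40023830u  /* from the text */
#define GPIOD_BASE_ADDR   0x40020C00u  /* from the text */
#define GPIO_MODER_OFFSET 0x00u        /* reference manual, GPIO register map */
#define GPIO_BSRR_OFFSET  0x18u        /* reference manual, GPIO register map */
#define GPIODEN_BIT       3u           /* GPIOD clock enable in RCC_AHB1ENR */

/* On the target, pass (volatile uint32_t *)RCC_AHB1ENR_ADDR.
 * On a host, pass a pointer to a plain variable. */
void rcc_enable_bit(volatile uint32_t *enr, unsigned bit)
{
    assert(bit < 32);
    *enr |= (1u << bit);            /* steps 1-3: read, set, write back */
}

/* Puts one pin into general-purpose output mode. */
void gpio_make_output(volatile uint32_t *moder, unsigned pin)
{
    assert(pin < 16);
    uint32_t v = *moder;            /* step 4: read */
    v &= ~(3u << (pin * 2u));       /* step 5: clear the 2-bit mode field */
    v |=  (1u << (pin * 2u));       /* step 6: 0b01 = output */
    *moder = v;                     /* step 7: write back */
}
```

On the board, the MODER pointer would be (volatile uint32_t *)(GPIOD_BASE_ADDR + GPIO_MODER_OFFSET). Clearing the field before setting it matters because the pin may not be in its reset state when your code runs.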

Hint 4: Debug Using OpenOCD You can’t see console output, so use a debugger:

  • Set breakpoint before GPIO write
  • Step through register writes
  • Check memory at each step to verify changes
  • Watch the LED turn on when bit is set

Books That Will Help

Topic | Book | Chapter
Memory-Mapped I/O | “Computer Systems: A Programmer’s Perspective” | Ch. 6
Microcontroller Basics | “Making Embedded Systems, 2nd Edition” | Ch. 1-3
STM32 Specifics | STM32F4 Reference Manual | Sections 1, 5, 8
Debugging | “The Art of ARM Assembly, Volume 1” | Ch. 15

Common Pitfalls and Debugging

Problem 1: “LED doesn’t turn on at all”

  • Why: Probably forgot to enable clock. RCC_AHB1ENR bit for GPIOD is still 0, so the peripheral ignores all writes.
  • Fix: Check that you wrote to RCC_AHB1ENR correctly. Set bit 3 (GPIOD). Verify in debugger by reading RCC_AHB1ENR at 0x40023830.
  • Quick test: (gdb) x/1xw 0x40023830 — should see bit 3 set to 1

Problem 2: “LED is always on, won’t blink”

  • Why: Your delay loop isn’t working, or you’re not toggling the pin correctly.
  • Fix: Make the delay longer and obvious. Or use SysTick timer instead of empty loop.
  • Quick test: Rebuild with a 5-second delay. If still doesn’t blink, LED isn’t toggling at all.

Problem 3: “Program runs once then hangs”

  • Why: The delay count is so large that the loop only appears to hang, or the loop condition never becomes false.
  • Fix: Reduce delay count or add safety exit condition.
  • Quick test: Halve the delay value and rebuild. Does it blink faster?

Problem 4: “Different LED blinks instead of the one I wanted”

  • Why: Wrong GPIO bank or pin number in configuration.
  • Fix: Double-check which LED is which on the discovery board. PD12, PD13, PD14, PD15 are the four LEDs. Make sure you’re writing to the right bit in GPIOD_BSRR.
  • Quick test: Look at the board schematic. Verify PD12 connects to the green LED (PD13 is orange, PD14 red, PD15 blue).

Definition of Done

  • Code compiles without warnings
  • Linker script correctly maps code to 0x8000000
  • Program flashes to board without errors
  • LED blinks visibly (on/off at ~1 Hz)
  • Debugger can set breakpoints and step through GPIO writes
  • You can modify delay value and see blink rate change
  • You can identify each GPIO register address in the code and explain its purpose

Learning Milestones

  1. You understand the boot sequence and linker script → You grasp how code gets from flash to execution
  2. You can read and modify a register correctly → You’ve done hardware I/O without libraries
  3. You see the LED blink in response to your code → You’ve written code that controls physical hardware

Project 4: UART Serial Driver (Hello from Hardware)


  • File: P04-uart-driver.md
  • Main Programming Language: ARM Assembly + C
  • Alternative Programming Languages: C with minimal inline assembly, Rust
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: Serial Communication / Hardware Drivers
  • Software or Tool: STM32F407, UART, OpenOCD, Serial terminal
  • Main Book: “The Linux Programming Interface” by Michael Kerrisk

What you’ll build: A bare-metal UART driver that allows the microcontroller to send text messages to a computer over a serial connection, and optionally receive commands.

Why it teaches ARM: UART is your first real communication protocol. You’ll learn about baud rates, FIFOs, interrupts/polling, and how software talks to hardware in a structured way. This is foundation for any embedded system that needs debugging or user interaction.

Core challenges you’ll face:

  • Understanding UART peripheral and its registers → Maps to hardware protocol implementation
  • Baud rate calculation and clock dividers → Maps to precision timing on ARM
  • Interrupt handling vs polling → Maps to real-time systems trade-offs
  • Circular buffers for data handling → Maps to embedded systems data structures
  • Debugging communication failures → Maps to hardware protocol debugging

Key Concepts:

  • Serial Communication (UART): “The Joys of Hashing” by Thomas Mailund - Concepts of buffering
  • Interrupt Handling: STM32F4 Reference Manual - Section 6 (NVIC)
  • FIFO and Circular Buffers: “Mastering Algorithms with C” by Kyle Loudon - Ch. 4
  • Baud Rate Generation: STM32F4 Reference Manual - USART section

Difficulty: Intermediate Time estimate: 1-2 weeks Prerequisites:

  • Completed Project 3 (understand bare-metal GPIO and register access)
  • Understanding of serial communication basics (baud rate, stop bits, parity)
  • A USB-to-UART adapter to connect board to computer

Real World Outcome

When complete, you’ll be able to run a serial terminal and see messages from your microcontroller:

$ screen /dev/ttyUSB0 115200
# or
$ minicom -D /dev/ttyUSB0 -b 115200
# or
$ picocom /dev/ttyUSB0 -b 115200

Connected to STM32F407!
System initialized successfully.
Uptime: 1 seconds
Uptime: 2 seconds
Uptime: 3 seconds
Blinker test: LED ON
Blinker test: LED OFF
Temperature sensor: 47.3°C

Your code is printing real-time data from the hardware. This is the foundation for debugging, logging, and user interaction on embedded systems.

The Core Question You’re Answering

“How does hardware with no operating system send text data to a computer?”

Concepts You Must Understand First

  1. UART Protocol Fundamentals
    • What does each bit in a UART frame mean?
    • Why is baud rate important?
    • Book Reference: “Computer Systems: A Programmer’s Perspective” - Networking sections
  2. Interrupt-Driven vs Polling I/O
    • When should you use interrupts vs polling?
    • What happens if UART receives data but you don’t read it in time?
    • Book Reference: “Making Embedded Systems” - Ch. 6 (Interrupt Management)
  3. Circular Buffers
    • How do you handle variable-length data from a hardware peripheral?
    • Book Reference: “Mastering Algorithms with C” - Ch. 4 (Queues)

Questions to Guide Your Design

  1. Implementation Strategy
    • Will you use polling (check if data ready) or interrupts (let hardware signal)?
    • How large should your buffer be?
    • Will you support both TX and RX, or just one?
  2. Baud Rate Configuration
    • Given a system clock and desired baud rate, how do you set the divider?
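    • USART_BRR holds a 16-bit fixed-point divider (12-bit mantissa, 4-bit fraction). USART2 is clocked from APB1 (at most 42 MHz), not directly from the 168 MHz core clock. How do you reach 115200 baud?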
  3. Error Handling
    • What UART errors can occur? (framing, parity, overrun)
    • How will your driver report them?
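
The baud rate question above comes down to one division. The sketch below assumes 16x oversampling (the OVER8 bit left at 0), in which case the 16-bit BRR value equals the peripheral clock divided by the baud rate: the upper 12 bits are the mantissa and the low 4 bits the fraction in sixteenths. The function names are hypothetical, and the clock values are examples rather than something your board is guaranteed to use.

```c
#include <assert.h>
#include <stdint.h>

/* BRR for 16x oversampling: f_PCLK / baud, rounded to nearest. */
uint32_t usart_brr(uint32_t pclk_hz, uint32_t baud)
{
    assert(baud > 0);
    uint32_t brr = (pclk_hz + baud / 2u) / baud;
    assert(brr > 0 && brr <= 0xFFFFu);   /* must fit the 16-bit register */
    return brr;
}

/* Baud rate the hardware will actually produce for a given BRR. */
uint32_t usart_actual_baud(uint32_t pclk_hz, uint32_t brr)
{
    assert(brr > 0);
    return pclk_hz / brr;
}
```

For a 42 MHz APB1 clock this gives 0x16D, and for the 16 MHz reset default it gives 0x8B. The actual rate at 42 MHz is about 115068 baud, an error of roughly 0.1%, which is well within what a UART receiver tolerates.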

Thinking Exercise

Tracing UART TX

Goal: Send the character 'A' (0x41) to the serial port

Step 1: Wait for USART transmit ready
   - Check USART_SR register, bit TXE (transmit data register empty)

Step 2: Write data to USART_DR
   - Write 0x41 to USART_DR

Step 3: Wait for transmission complete (optional but safer)
   - Check USART_SR bit TC (transmission complete)

Step 4: Repeat for next character

Questions while tracing:

  • What happens if you write to USART_DR before TXE is set?
  • How many CPU instructions does it take to shift out each bit? (Trick question: the UART hardware serializes the frame on its own.)
  • If baud rate is 115200 baud and 8 data bits, what’s the time per byte?
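
The time-per-byte question can be checked with a few lines of arithmetic. This sketch assumes the usual frame of one start bit, the data bits, an optional parity bit, and the stop bits; the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Duration of one UART frame in nanoseconds.
 * Frame = 1 start bit + data bits + parity bits + stop bits. */
uint32_t uart_frame_ns(uint32_t baud, unsigned data_bits,
                       unsigned parity_bits, unsigned stop_bits)
{
    assert(baud > 0);
    uint64_t bits = 1u + data_bits + parity_bits + stop_bits;
    return (uint32_t)(bits * 1000000000ull / baud);
}

/* Number of complete frames that can arrive while the CPU is busy. */
uint32_t frames_during(uint32_t busy_ns, uint32_t frame_ns)
{
    assert(frame_ns > 0);
    return busy_ns / frame_ns;
}
```

At 115200 baud with 8 data bits, no parity, and 1 stop bit, a frame is 10 bits and lasts about 86.8 µs. A handler that blocks for 200 µs therefore spans two full frames, which is a useful number to have in hand when you reach the interview question about lost data.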

The Interview Questions They’ll Ask

  1. “Explain UART frame format. What’s inside a UART byte?”
  2. “If your baud rate is set to 115200 but the receiver is at 9600, what happens?”
  3. “Write pseudocode for a circular buffer that handles UART RX interrupts.”
  4. “How would you implement printf() to use your UART driver?”
  5. “What UART errors can occur, and how would you detect them?”
  6. “You’re receiving data from UART at 115200 baud. If your interrupt handler takes 200 µs to execute, will you lose data?”

Hints in Layers

Hint 1: Find the USART Peripheral The STM32F407 has multiple USART/UART peripherals. The discovery board has USART2 on PA2/PA3. Find the USART2 base address (0x40004400) and the key registers: USART_SR, USART_DR, USART_BRR, USART_CR1.

Hint 2: Configure USART2 Before you send anything, you need:

  • Enable clock to USART2 (via RCC)
  • Configure PA2 as TX (alternate function AF7)
  • Configure PA3 as RX (alternate function AF7)
  • Set baud rate in USART_BRR
  • Set control bits in USART_CR1 (TE for TX, RE for RX, UE for enable)

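Hint 3: Implement TX Function A polling sketch in C (register offsets and bit positions are from the reference manual’s USART section):

#include <stdint.h>

#define USART2_SR (*(volatile uint32_t *)0x40004400u)  // status register, base + 0x00
#define USART2_DR (*(volatile uint32_t *)0x40004404u)  // data register, base + 0x04
#define TXE_BIT   (1u << 7)  // transmit data register empty
#define TC_BIT    (1u << 6)  // transmission complete

void transmit_char(char c) {
    while ((USART2_SR & TXE_BIT) == 0) {
        // Wait for transmit data register empty
    }
    USART2_DR = (uint8_t)c;
    while ((USART2_SR & TC_BIT) == 0) {
        // Wait for transmission complete (optional)
    }
}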
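
Once a single character goes out, sending a string is a loop around it. The sketch below is one possible shape, assuming you pass in the character routine as a function pointer so the same code can be tried on a host; putc_fn, uart_puts, and the capture sink are hypothetical names that exist only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical hook: on the target this would be your transmit routine. */
typedef void (*putc_fn)(char c);

/* Sends a NUL-terminated string, inserting '\r' before a bare '\n' so
 * serial terminals return to column 0. Returns the bytes emitted. */
size_t uart_puts(putc_fn put, const char *s)
{
    assert(put != NULL && s != NULL);
    size_t sent = 0;
    char prev = '\0';
    for (; *s != '\0'; s++) {
        if (*s == '\n' && prev != '\r') {
            put('\r');
            sent++;
        }
        put(*s);
        sent++;
        prev = *s;
    }
    return sent;
}

/* Host-side sink for trying the routine without hardware. */
char capture_buf[64];
size_t capture_len;

void capture_putc(char c)
{
    if (capture_len < sizeof capture_buf) {
        capture_buf[capture_len++] = c;
    }
}
```

Calling uart_puts(capture_putc, "Hi\n") leaves the four bytes 'H', 'i', '\r', '\n' in capture_buf. A printf-style layer can be built the same way, by formatting into a buffer and handing it to this routine.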

Hint 4: Implement RX with Circular Buffer (Advanced) Use a fixed-size buffer and head/tail pointers. When RX interrupt fires, add received byte to buffer. Reading code pulls from buffer.
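
A circular buffer of the kind described in Hint 4 might look like the following. It is a sketch under stated assumptions: one producer (the RX interrupt) and one consumer (your main loop), a power-of-two size so the wrap is a mask, and one slot left empty to tell full from empty. The type and function names and the 64-byte size are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define RX_BUF_SIZE 64u   /* assumed size; must be a power of two */

typedef struct {
    volatile uint32_t head;       /* next slot to write (producer/ISR) */
    volatile uint32_t tail;       /* next slot to read (consumer) */
    uint8_t data[RX_BUF_SIZE];
} rx_ring_t;

/* Called from the RX interrupt. Returns false if the buffer is full,
 * in which case the byte is dropped. */
bool rx_ring_push(rx_ring_t *rb, uint8_t byte)
{
    assert(rb != 0);
    uint32_t next = (rb->head + 1u) & (RX_BUF_SIZE - 1u);
    if (next == rb->tail) {
        return false;
    }
    rb->data[rb->head] = byte;
    rb->head = next;              /* publish only after the data is stored */
    return true;
}

/* Called from the main loop. Returns false if nothing is available. */
bool rx_ring_pop(rx_ring_t *rb, uint8_t *out)
{
    assert(rb != 0 && out != 0);
    if (rb->tail == rb->head) {
        return false;
    }
    *out = rb->data[rb->tail];
    rb->tail = (rb->tail + 1u) & (RX_BUF_SIZE - 1u);
    return true;
}
```

With this layout the usable capacity is RX_BUF_SIZE - 1 bytes. Each index is written by only one side, which is why no lock is needed in the single-producer, single-consumer case.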

Books That Will Help

Topic | Book | Chapter
UART Protocol | “Computer Systems: A Programmer’s Perspective” | Networking Ch.
Embedded Interrupts | “Making Embedded Systems, 2nd Edition” | Ch. 6
Buffer Design | “Mastering Algorithms with C” | Ch. 4
STM32 USART Details | STM32F4 Reference Manual | USART section

Common Pitfalls and Debugging

Problem 1: “Nothing appears in serial terminal”

  • Why: USART not configured, wrong baud rate, or TX pin not in alternate function mode.
  • Fix: Verify USART clock is enabled, PA2 is configured as AF7, and USART_CR1 UE bit is set.
  • Quick test: Use debugger to read USART2 base+0x00 (USART_SR). Does it show TXE bit set?

Problem 2: “Garbage characters appear”

  • Why: Baud rate mismatch. Terminal expecting 115200 but microcontroller sending at different rate.
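  • Fix: Verify the BRR calculation. With 16x oversampling, BRR = f_PCLK / baud, where f_PCLK is the bus clock feeding the USART, not the core clock. USART2 sits on APB1: at 42 MHz and 115200 baud, BRR ≈ 365 (0x16D); at the 16 MHz reset default, BRR ≈ 139 (0x8B).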
  • Quick test: Try 9600 baud on both ends. If the text is still garbled, the clock feeding the USART is probably not the one you assumed in the BRR calculation.

Problem 3: “Data appears but only first few bytes”

  • Why: RX buffer overflow or interrupt not configured.
  • Fix: If you are receiving, make sure the USART RX interrupt is enabled (USART_CR1 RXNEIE) and the matching interrupt is unmasked in the NVIC.
  • Quick test: Slow down the sender (add delays). Does more data come through?
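
When bytes go missing, the error flags in USART_SR usually say why. The sketch below decodes a status snapshot into counters; the bit positions are the ones given in the STM32F4 reference manual’s USART section, while the struct and function names are hypothetical. It operates on a value you have already read, so it can be tried on a host.

```c
#include <assert.h>
#include <stdint.h>

/* USART_SR bit positions (STM32F4 reference manual, USART section). */
#define USART_SR_PE   (1u << 0)  /* parity error */
#define USART_SR_FE   (1u << 1)  /* framing error */
#define USART_SR_NF   (1u << 2)  /* noise detected */
#define USART_SR_ORE  (1u << 3)  /* overrun: a byte arrived before DR was read */
#define USART_SR_RXNE (1u << 5)  /* received data ready */

typedef struct {
    uint32_t parity;
    uint32_t framing;
    uint32_t noise;
    uint32_t overrun;
} uart_err_counts_t;

/* Updates the counters from one status snapshot.
 * Returns 1 if any error flag was set, 0 otherwise. */
int uart_count_errors(uint32_t sr, uart_err_counts_t *c)
{
    assert(c != 0);
    int any = 0;
    if (sr & USART_SR_PE)  { c->parity++;  any = 1; }
    if (sr & USART_SR_FE)  { c->framing++; any = 1; }
    if (sr & USART_SR_NF)  { c->noise++;   any = 1; }
    if (sr & USART_SR_ORE) { c->overrun++; any = 1; }
    return any;
}
```

A rising overrun count points at a handler that is too slow or an interrupt that is not firing, while framing errors more often point at a baud rate mismatch.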

Definition of Done

  • Code compiles, no warnings
  • UART initializes correctly (no hung register writes)
  • At least one character transmits successfully to serial terminal
  • Full string transmit works (test with “Hello World\r\n”)
  • Baud rate matches terminal expectations
  • No data corruption in strings
  • Can explain each USART register configuration and why

Learning Milestones

  1. You configure a hardware peripheral from scratch → You understand registers, alternate functions, clock enables
  2. You send your first message from hardware to the computer → You’ve crossed the hardware/software bridge
  3. You debug communication failures → You understand protocol correctness and timing

Project 5: Dynamic Memory Allocator (malloc/free for Embedded)


  • File: P05-memory-allocator.md
  • Main Programming Language: C with ARM Assembly
  • Alternative Programming Languages: C, Rust (with unsafe for raw pointers)
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 2. The “Micro-SaaS / Pro Tool”
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Memory Management / Data Structures
  • Software or Tool: ARM GCC, GDB, Valgrind (for testing)
  • Main Book: “Computer Systems: A Programmer’s Perspective” by Bryant & O’Hallaron

What you’ll build: A custom malloc() and free() implementation that manages a heap on bare-metal ARM, handling memory fragmentation, allocation requests, and freeing with proper coalescing.

Why it teaches ARM: Allocators are the bridge between bare-metal memory and high-level abstractions. You’ll understand pointer arithmetic at the ARM level, how the heap grows, why fragmentation matters, and how real systems solve these problems. You’ll also learn to think about memory safety—something ARM systems firmware must care deeply about.

Core challenges you’ll face:

  • Pointer manipulation and arithmetic → Maps to ARM load/store operations and addressing modes
  • Metadata management (block sizes and free lists) → Maps to data structure layout in memory
  • Memory fragmentation and coalescing → Maps to algorithmic complexity and trade-offs
  • Thread-safety (if applicable) → Maps to atomic operations and synchronization
  • Testing for memory leaks and corruption → Maps to debugging and validation

Key Concepts:

  • Dynamic Memory Allocation: “Computer Systems: A Programmer’s Perspective” - Ch. 9
  • Free List Algorithms: “Mastering Algorithms with C” - Ch. 1-2
  • Memory Safety: “Rust in Action” - Ch. 4 (ownership/borrowing concepts apply here)
  • ARM Pointer Arithmetic: “The Art of ARM Assembly, Volume 1” - Ch. 8

Difficulty: Advanced Time estimate: 2-3 weeks Prerequisites:

  • Completed Project 2 (comfortable with ARM assembly)
  • Strong understanding of C pointers
  • Knowledge of linked lists and basic data structures
  • Familiarity with GDB for debugging

Real World Outcome

After this project, you’ll have a working malloc/free library that can be linked into bare-metal programs. You can test it like this:

#include <stddef.h>   // size_t
#include <string.h>   // strcpy

// Your custom malloc/free
void* my_malloc(size_t size);
void my_free(void* ptr);

int main() {
    // Allocate 100 bytes
    int* array = (int*)my_malloc(100);
    array[0] = 42;
    array[1] = 100;

    // Allocate another block
    char* buffer = (char*)my_malloc(256);
    strcpy(buffer, "Hello from malloc!");

    // Free in different order than allocation
    my_free(buffer);
    my_free(array);

    // Allocate again - should reuse freed space
    int* array2 = (int*)my_malloc(150);
    array2[0] = 999;

    my_free(array2);

    return 0;
}

Test output (with your own test framework):

Test 1: Single allocation
  malloc(100) returned 0x20000100
  Freed successfully
  ✓ PASS

Test 2: Multiple allocations
  malloc(100) = 0x20000100
  malloc(256) = 0x20000170
  malloc(200) = 0x20000278
  ✓ PASS

Test 3: Fragmentation and coalescing
  Allocated 3 blocks
  Freed middle block (0x20000170)
  Allocated 200 bytes - should reuse freed block
  Address returned: 0x20000170
  ✓ PASS

Test 4: Out of memory
  malloc(1000000) returned NULL
  ✓ PASS

The Core Question You’re Answering

“How does a program allocate and free memory without an operating system?”

Concepts You Must Understand First

  1. Heap Layout and Free Lists
    • Where does the heap start and end in memory?
    • How do you track which blocks are allocated vs free?
    • Book Reference: “Computer Systems: A Programmer’s Perspective” - Ch. 9 (Implicit Free Lists)
  2. Pointer Arithmetic and Memory Safety
    • How do you navigate metadata stored alongside data?
    • What metadata must you store to be able to free a block?
    • Book Reference: “The Art of ARM Assembly” - Ch. 8 (Addressing Modes)
  3. Fragmentation and Strategies
    • What’s internal vs external fragmentation?
    • How do splitting and coalescing reduce fragmentation?
    • Book Reference: “Computer Systems: A Programmer’s Perspective” - Ch. 9

Questions to Guide Your Design

  1. Metadata Approach
    • How will you know the size of a block when free() is called?
    • Where will you store this size? (Often before the data block)
    • How much overhead per allocation is acceptable?
  2. Free List Strategy
    • Implicit (scan everything) vs Explicit (linked list)?
    • Best-fit vs first-fit allocation?
    • When and how do you coalesce freed blocks?
  3. Testing Strategy
    • How will you test without standard malloc to compare?
    • How will you detect leaks, fragmentation, or corrupted metadata?

Thinking Exercise

Designing Block Layout

Before coding, sketch the memory layout:

Heap:  | Block Header | Data | Block Header | Data | Block Header | Free |
       |  size=32     | .... |  size=100    | .... |  size=50     | ... |

Header contains:
  - Size of this block
  - Is it allocated or free?
  - Pointer to next block (if using explicit free list)

When you call free(ptr), you have the data pointer.
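How do you find the header? It usually sits immediately before the data, so step back by the header size, e.g. ((header_t*)ptr) - 1 for a hypothetical header_t type.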
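
The header lookup described above can be sketched as follows. The layout is one possible choice, not the only one: it assumes block sizes are multiples of 8 so the low bits of the size field are free to hold an "allocated" flag. All type and function names here are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical header stored immediately before each payload. */
typedef struct block_header {
    size_t size_and_flags;            /* payload size; bit 0 = allocated */
    struct block_header *next_free;   /* used only while the block is free */
} block_header_t;

#define ALLOC_FLAG ((size_t)1)

/* The payload starts right after the header. */
void *payload_of(block_header_t *h)
{
    assert(h != NULL);
    return (void *)(h + 1);
}

/* Given the pointer handed to free(), step back one header. */
block_header_t *header_of(void *payload)
{
    assert(payload != NULL);
    return (block_header_t *)payload - 1;
}

/* Payload size with the flag bits masked off. */
size_t block_size(const block_header_t *h)
{
    return h->size_and_flags & ~(size_t)7;
}

int block_is_allocated(const block_header_t *h)
{
    return (h->size_and_flags & ALLOC_FLAG) != 0;
}
```

On a 32-bit ARM target this header is 8 bytes per allocation, which gives you a concrete number for the overhead question below. Dropping next_free and walking the heap implicitly would halve it.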

Questions while designing:

  • If you store the size in a header before each block, how many bytes overhead?
  • If you split a large free block for a smaller allocation, what happens to the leftover space?
  • When you coalesce two free blocks, what happens to their headers?

The Interview Questions They’ll Ask

  1. “Design a simple malloc that works with fixed-size blocks. What are the trade-offs?”
  2. “Explain the difference between implicit and explicit free lists.”
  3. “You have a 1KB heap and make these calls: malloc(100), malloc(200), free(first), malloc(150). Walk through the allocations.”
  4. “What’s the difference between internal and external fragmentation? How would you measure each?”
  5. “Write pseudocode for malloc(size) that uses best-fit strategy.”
  6. “How would you detect a memory leak in a bare-metal system without standard tools?”

Hints in Layers

Hint 1: Start with a Bump Allocator The simplest working allocator just tracks a “bump pointer”—every allocation increments it, and you can never free. This tests your infrastructure without complexity.
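
A bump allocator along the lines of Hint 1 fits in a few lines. This sketch assumes a 1 KB heap held in a static array and 8-byte alignment; on a real target the heap bounds would more likely come from linker symbols. The names bump_alloc and bump_reset are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define HEAP_SIZE 1024u   /* assumed size for illustration */
#define ALIGNMENT 8u

static _Alignas(8) uint8_t heap[HEAP_SIZE];
static size_t bump_offset;   /* bytes handed out so far */

/* Returns an 8-byte-aligned block, or NULL if the heap is exhausted. */
void *bump_alloc(size_t size)
{
    assert(bump_offset <= HEAP_SIZE);
    if (size == 0 || size > HEAP_SIZE) {
        return NULL;          /* also guards the rounding below from overflow */
    }
    size_t rounded = (size + (ALIGNMENT - 1u)) & ~(size_t)(ALIGNMENT - 1u);
    if (rounded > HEAP_SIZE - bump_offset) {
        return NULL;
    }
    void *p = &heap[bump_offset];
    bump_offset += rounded;
    return p;
}

/* The only way to "free": throw everything away at once. */
void bump_reset(void)
{
    bump_offset = 0;
}
```

A request for 100 bytes consumes 104 because of the rounding, so two back-to-back allocations of 100 and 256 bytes land 104 bytes apart. That rounding is a first taste of internal fragmentation.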

Hint 2: Add a Free List Once bump works, convert to explicit free list:

  • Store metadata (size + allocated flag) before each block
  • Maintain linked list of free blocks
  • Search free list on malloc
  • Add to free list on free

Hint 3: Add Coalescing When a block is freed, check if adjacent blocks are also free. If so, merge them into one larger block. This reduces fragmentation.
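
To see splitting and coalescing working together, here is a compact sketch. For brevity it uses an implicit list (every block is scanned, free or not) rather than the explicit free list from Hint 2, with first-fit placement and a full merge pass on every free. The 1 KB heap, the 8-byte header, and the helper names are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define HEAP_WORDS 128u   /* assumed: 128 * 8 bytes = 1 KB heap */

/* One 8-byte header per block: payload size in bytes (multiple of 8),
 * with bit 0 used as the "allocated" flag. */
typedef uint64_t header_t;
static header_t heap[HEAP_WORDS];
static int heap_ready;

#define HDR_SIZE(h) ((size_t)((h) & ~(header_t)7))
#define HDR_USED(h) (((h) & 1u) != 0)

static header_t *next_block(header_t *b) { return b + 1 + HDR_SIZE(*b) / 8u; }
static header_t *heap_end(void) { return heap + HEAP_WORDS; }

void *my_malloc(size_t size)
{
    if (!heap_ready) {                        /* start as one big free block */
        heap[0] = (HEAP_WORDS - 1u) * 8u;
        heap_ready = 1;
    }
    if (size == 0 || size > (HEAP_WORDS - 1u) * 8u) return NULL;
    size_t need = (size + 7u) & ~(size_t)7;   /* round up to 8 bytes */
    for (header_t *b = heap; b < heap_end(); b = next_block(b)) {
        size_t have = HDR_SIZE(*b);
        if (HDR_USED(*b) || have < need) continue;   /* first fit */
        if (have - need >= 16u) {             /* room for header + 8 bytes */
            b[1 + need / 8u] = have - need - 8u;     /* leftover free block */
            have = need;
        }
        *b = have | 1u;
        return b + 1;
    }
    return NULL;
}

void my_free(void *ptr)
{
    if (ptr == NULL) return;
    header_t *b = (header_t *)ptr - 1;
    assert(HDR_USED(*b));                     /* catches a double free */
    *b &= ~(header_t)1;
    /* Coalesce: absorb every free neighbour that follows a free block. */
    for (header_t *p = heap; p < heap_end(); p = next_block(p)) {
        while (!HDR_USED(*p) && next_block(p) < heap_end()
               && !HDR_USED(*next_block(p))) {
            *p += 8u + HDR_SIZE(*next_block(p));
        }
    }
}
```

Scanning the whole heap on every free is simple but linear in the number of blocks. Boundary tags (a footer copy of the size) are the usual way to merge with the previous block in constant time once this version works.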

Hint 4: Test Thoroughly Write a test suite that exercises allocation patterns:

  • Many small allocations
  • Few large allocations
  • Fragmentation patterns (alternating small/large, then freeing small)
  • Out of memory

Books That Will Help

Topic | Book | Chapter
Memory Allocation Algorithms | “Computer Systems: A Programmer’s Perspective” | Ch. 9
Free Lists and Data Structures | “Mastering Algorithms with C” | Ch. 1-2
ARM Pointer Arithmetic | “The Art of ARM Assembly, Volume 1” | Ch. 8
Testing and Debugging | “The Pragmatic Programmer” | Ch. 8

Common Pitfalls and Debugging

Problem 1: “Allocations return overlapping addresses”

  • Why: Didn’t update bump pointer correctly, or metadata corruption.
  • Fix: Check that each allocation advances the pointer by size + metadata.
  • Quick test: Allocate three blocks, print their addresses and sizes. Should not overlap.

Problem 2: “Free corrupts subsequent allocations”

  • Why: Metadata stored wrong, or coalescing removed important metadata.
  • Fix: Be very careful about where you store size. It must be retrievable when free() is called.
  • Quick test: After free(), allocate again. Does the new block work? If not, old metadata is corrupted.

Problem 3: “Fragmentation causes malloc to fail”

  • Why: Free list strategy allows fragmentation. Maybe no single contiguous block large enough.
  • Fix: Implement coalescing, or switch to different strategy (segregated lists, buddy allocator).
  • Quick test: Allocate, free in specific order to create fragmentation. Then allocate medium block. Success?

Definition of Done

  • Bump allocator works (simple malloc with no free)
  • Basic free list implementation compiles
  • malloc returns unique non-NULL addresses for small allocations
  • free recycles freed blocks
  • No corruption of metadata after allocation/free cycles
  • Can explain metadata layout and how you find it during free()
  • Test suite demonstrates expected behavior

Learning Milestones

  1. You implement a working allocator → You understand heap layout and lifetime management
  2. You track and reuse freed memory → You grasp the free list concept
  3. You handle fragmentation → You understand trade-offs in allocation strategies

Project 6: ARM Bootloader (Bare-Metal Boot Sequence)


  • File: P06-bootloader.md
  • Main Programming Language: ARM Assembly + minimal C
  • Alternative Programming Languages: ARM Assembly only, C
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Boot Sequences / Firmware
  • Software or Tool: ARM GCC, Linker scripts, OpenOCD
  • Main Book: “The Art of ARM Assembly, Volume 1” by Randall Hyde

What you’ll build: A minimal bootloader that initializes ARM hardware (stack, heap, runtime), sets up the C environment, and jumps to main(). No OS, no runtime—you’re writing the code that makes the CPU ready to run C code.

Why it teaches ARM: Bootloaders are the most fundamental code. They run before everything else. Understanding how a bootloader works teaches you CPU initialization, memory layout, control flow at the lowest level, and why C’s runtime assumptions exist. You’ll see raw hardware waking up and being prepared.

Core challenges you’ll face:

  • CPU-specific initialization sequences → Maps to understanding processor state on reset
  • Memory layout (code, data, BSS, stack, heap) → Maps to linker script design
  • Exception vectors and interrupt setup → Maps to real-time responsiveness
  • Handoff to C runtime (frame pointer, stack alignment) → Maps to calling conventions
  • Flash vs RAM code execution → Maps to understanding execution context

Key Concepts:

  • ARM Reset Behavior: ARM Architecture Reference Manual - ARMv7 section
  • Linker Scripts: “The GNU Make Book” by John Graham-Cumming + linker documentation
  • Exception Vectors: ARM Cortex-M Technical Reference Manual
  • Memory Layout: “Low-Level Programming” by Igor Zhirkov - Ch. 3

Difficulty: Advanced Time estimate: 2-3 weeks Prerequisites:

  • Completed Project 1 (understand ARM instruction format)
  • Completed Project 3 (bare-metal GPIO, register access)
  • Strong ARM assembly knowledge
  • Comfortable reading datasheets

Real World Outcome

After this project, you’ll have a bootloader that:

  1. Initializes the stack
  2. Copies initialized data from flash to RAM
  3. Zeros out BSS section (uninitialized data)
  4. Sets up exception vectors
  5. Jumps to main() with the C runtime ready

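Test output (illustrative Thumb-2 listing; raw encodings hidden and literal-pool loads shown symbolically):

$ arm-none-eabi-objdump -d --no-show-raw-insn bootloader.elf | head -50

08000008 <_start>:
 8000008:   ldr     r0, =_sidata        ; r0 = .data initial values in flash
 800000a:   ldr     r1, =_sdata         ; r1 = start of .data in RAM
 800000c:   ldr     r2, =_edata         ; r2 = end of .data in RAM
 800000e:   b.n     8000018 <copy_check>
08000010 <copy_loop>:
 8000010:   ldr.w   r3, [r0], #4        ; load word from flash, advance
 8000014:   str.w   r3, [r1], #4        ; store word to RAM, advance
08000018 <copy_check>:
 8000018:   cmp     r1, r2
 800001a:   bcc.n   8000010 <copy_loop>
 800001c:   ldr     r1, =_sbss          ; r1 = start of .bss
 800001e:   ldr     r2, =_ebss          ; r2 = end of .bss
 8000020:   movs    r0, #0
 8000022:   b.n     8000028 <bss_check>
08000024 <bss_zero>:
 8000024:   str.w   r0, [r1], #4        ; write zero, advance
08000028 <bss_check>:
 8000028:   cmp     r1, r2
 800002a:   bcc.n   8000024 <bss_zero>
 800002c:   bl      <main>
 8000030:   b.n     8000030             ; hang if main() returns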

When you flash this to a board, it initializes everything and runs your main() with a proper C environment ready.

The Core Question You’re Answering

“What code runs BEFORE main(), and why is it necessary?”

Concepts You Must Understand First

  1. ARM Reset Behavior
    • What’s the state of the CPU when power turns on?
    • Where does the first instruction come from?
    • What’s at memory address 0x00000000?
    • Book Reference: ARM Cortex-M Technical Reference Manual - Section 2
  2. Memory Sections in Compiled Code
    • What’s the difference between .text, .data, and .bss?
    • Why is .data in flash but needs to be in RAM?
    • Book Reference: “Low-Level Programming” - Ch. 3
  3. Linker Scripts
    • How does the linker script define memory layout?
    • Why must stack and heap not collide?
    • Book Reference: GNU LD documentation

Questions to Guide Your Design

  1. Memory Initialization
    • How do you copy code from flash to RAM? (If needed for your platform)
    • Which register holds the stack pointer after reset?
    • How do you zero the BSS section in assembly?
  2. Exception Vectors
    • Where does the processor look for interrupt handlers?
    • What’s in the vector table at address 0x00000000?
    • How do you ensure vectors are at the right address?
  3. Handoff to C
    • What must be true about the stack before calling a C function?
    • Does the frame pointer matter for your bootloader?
    • How do you guarantee alignment (Cortex-M requires 8-byte stack alignment)?

Thinking Exercise

Tracing Boot Sequence

Before coding, trace what happens at power-on:

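1. Power-on Reset
   - CPU reads the vector table at 0x00000000 (on STM32, flash at 0x08000000 is aliased here when booting from flash)
   - SP (stack pointer) is loaded from the word at [0x00000000]
   - PC (program counter) is loaded from the word at [0x00000004]

2. First Instruction (at the address stored in [0x00000004], not at 0x00000004 itself)
   - This is your _start label in bootloader
   - Bit 0 of the stored address must be 1 (Thumb state); the CPU clears it to form the PC
   - General-purpose registers hold unpredictable values; do not assume 0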

3. Bootloader Tasks
   - Load SP from vector table or hardcoded address
   - Copy .data section from flash to RAM
   - Zero out .bss section in RAM
   - Call main()

4. If main() returns
   - Bootloader should loop forever (for example B . or a WFI loop); there is nothing to return to
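
Step 1 of the trace can be checked against a raw binary image on your development machine. The sketch below reads the first two vector-table words and applies a few sanity checks; it assumes an STM32F407-style map with 1 MB of flash at 0x08000000 and 128 KB of contiguous main SRAM at 0x20000000, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define FLASH_START 0x08000000u
#define FLASH_END   0x08100000u   /* assumed: 1 MB of flash */
#define SRAM_START  0x20000000u
#define SRAM_END    0x20020000u   /* assumed: 128 KB of main SRAM */

/* Little-endian 32-bit read from a raw image. */
uint32_t read_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Returns 1 if the initial SP and reset vector look plausible. */
int vectors_look_valid(const uint8_t *image, uint32_t len)
{
    assert(image != 0);
    if (len < 8u) {
        return 0;
    }
    uint32_t initial_sp = read_le32(image);        /* word at offset 0 */
    uint32_t reset      = read_le32(image + 4);    /* word at offset 4 */

    /* The stack grows down, so SP may equal the end of SRAM. */
    int sp_ok = initial_sp > SRAM_START && initial_sp <= SRAM_END
                && (initial_sp % 8u) == 0;
    /* Cortex-M runs Thumb only, so bit 0 of the vector must be set. */
    int thumb_ok = (reset & 1u) == 1u;
    uint32_t target = reset & ~1u;
    int pc_ok = target >= FLASH_START && target < FLASH_END;

    return sp_ok && thumb_ok && pc_ok;
}
```

An image that begins with the bytes 00 00 02 20 09 00 00 08 passes: the initial SP is 0x20020000 and the reset handler is at 0x08000008 with the Thumb bit set. An even reset vector is a common cause of a fault on the very first instruction.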

Questions while tracing:

  • Where does the first PC value come from? (Memory location 0x00000004 in the binary)
  • If you don’t set up the stack before calling main(), what happens?
  • Why is zeroing .bss important even if your code never explicitly initializes those variables?

The Interview Questions They’ll Ask

  1. “Draw the memory layout of an ARM program showing .text, .data, .bss, stack, and heap.”
  2. “Walk through the first 10 instructions executed when an ARM microcontroller powers on.”
  3. “Why must .data be in flash initially, then copied to RAM? Why not just stay in flash?”
  4. “How does a linker script ensure the vector table is at address 0x00000000?”
  5. “Write pseudocode for a bootloader that copies .data from flash to RAM.”
  6. “What happens if your bootloader sets the stack pointer to the middle of your .data section?”

Hints in Layers

Hint 1: Understand Your Platform’s Memory Map Get the STM32F407 datasheet. Find:

  • Flash starts at 0x08000000 (1 MB)
  • SRAM starts at 0x20000000 (128 KB contiguous; the remaining 64 KB of the 192 KB total is CCM RAM at 0x10000000)
  • Reset vector table location
  • After reset, SP comes from memory at 0x00000000
  • After reset, PC comes from memory at 0x00000004

Hint 2: Create a Minimal Linker Script Linker script maps your sections:

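ENTRY(_start)

MEMORY {
  FLASH (rx)  : ORIGIN = 0x08000000, LENGTH = 1M
  RAM   (rwx) : ORIGIN = 0x20000000, LENGTH = 128K  /* main SRAM; 64K CCM is at 0x10000000 */
}

/* Symbol names below follow a common convention; rename freely. */
__stack_top = ORIGIN(RAM) + LENGTH(RAM);  /* initial SP: top of RAM */

SECTIONS {
  .text : {
    KEEP(*(.vectors))   /* vector table first; KEEP survives --gc-sections */
    *(.text*)
    *(.rodata*)
    . = ALIGN(4);
  } > FLASH

  _sidata = LOADADDR(.data);  /* flash address of .data initial values */

  .data : {
    _sdata = .;
    *(.data*)
    . = ALIGN(4);
    _edata = .;
  } > RAM AT > FLASH

  .bss : {
    _sbss = .;
    *(.bss*)
    *(COMMON)
    . = ALIGN(4);
    _ebss = .;
  } > RAM
}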

Hint 3: Write _start in Assembly Pseudocode:

_start:
  LDR sp, =__stack_top       ; Load stack pointer
  BL  copy_data               ; Copy .data from flash to RAM
  BL  zero_bss                ; Zero .bss
  BL  main                    ; Jump to main
  B   .                       ; Loop forever
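
The copy_data and zero_bss routines called above are short enough to prototype in C before writing them in assembly. This is a sketch, assuming word-aligned section boundaries: on a target the pointers would come from symbols exported by your linker script (hypothetical names such as _sidata, _sdata, _edata, _sbss, _ebss), while on a host plain arrays stand in for flash and RAM.

```c
#include <assert.h>
#include <stdint.h>

/* Copies words from src into [dst, dst_end), as a reset handler does
 * when it moves the initial values of .data from flash to RAM. */
void copy_data(const uint32_t *src, uint32_t *dst, const uint32_t *dst_end)
{
    assert(src != 0 && dst <= dst_end);
    while (dst < dst_end) {
        *dst++ = *src++;
    }
}

/* Fills [start, end) with zero words, as required for .bss so that
 * uninitialized globals read as 0 when main() starts. */
void zero_bss(uint32_t *start, const uint32_t *end)
{
    assert(start <= end);
    while (start < end) {
        *start++ = 0u;
    }
}
```

Both loops test the bounds before the first access, so an empty section is handled without touching memory. The assembly version follows the same shape: compare, branch, load or store with post-increment.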

Hint 4: Build and Test Use arm-none-eabi-objdump to verify section layout. Use GDB to step through the bootloader and watch registers change.

Books That Will Help

Topic | Book | Chapter
ARM Reset Behavior | ARM Cortex-M Technical Reference | Section 2
Memory Layout | “Low-Level Programming” by Igor Zhirkov | Ch. 3
Linker Scripts | GNU LD Documentation | Scripts section
ARM Assembly | “The Art of ARM Assembly, Volume 1” | All

Common Pitfalls and Debugging

Problem 1: “Program doesn’t run at all”

  • Why: Vector table not at the start of flash (0x08000000, aliased to 0x00000000 at boot), reset vector missing the Thumb bit, or first instruction is invalid.
  • Fix: Check linker script. .vectors section must be first in .text section in flash.
  • Quick test: Use objdump to check first bytes of binary. Should match your vector table.

Problem 2: “Program jumps to random address”

  • Why: .data section not properly copied. Vector table contents are corrupted.
  • Fix: Verify copy_data loop is correct. Check that destination is RAM, not flash.
  • Quick test: Halt debugger after copy_data. Check RAM location that should contain .data.

Problem 3: “main() crashes immediately”

  • Why: The stack is not initialized, or the .bss zeroing loop ran past its bounds and overwrote memory that main() uses.
  • Fix: Verify SP is set to valid SRAM address before calling main().
  • Quick test: Single-step through main() entry. Watch for invalid memory access.

Definition of Done

  • Linker script correctly maps memory sections
  • Bootloader compiles and assembles without errors
  • Vector table is at the start of flash (0x08000000), visible at address 0x00000000 after reset
  • First instruction after reset is from _start
  • Stack pointer is set to valid SRAM address before calling main()
  • .data section is copied from flash to RAM
  • .bss section is zeroed in RAM
  • main() can execute and access global variables
  • Can explain each step of the boot sequence

Learning Milestones

  1. You understand what happens before main() → You grasp program initialization at the CPU level
  2. You manage memory layout with linker scripts → You control exactly where code and data live
  3. You see the CPU wake up and run your code → You’ve written the most fundamental code layer

Project 7: Context Switcher (Multitasking on Bare Metal)

  • File: P07-context-switcher.md
  • Main Programming Language: ARM Assembly (mostly) + C
  • Alternative Programming Languages: ARM Assembly, C with inline assembly
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Operating Systems / Scheduling
  • Software or Tool: ARM GCC, GDB, SysTick timer
  • Main Book: “Operating Systems: Three Easy Pieces” by Remzi H. Arpaci-Dusseau and Andrea C. Arpaci-Dusseau

What you’ll build: A simple context switcher that allows two or more “tasks” to run on a single core by sharing CPU time. When a timer interrupt fires, save one task’s state, load another’s, and jump back to execute it. This is the core of any multitasking OS.

Why it teaches ARM: Context switching is where abstraction meets reality. The CPU has only one program counter. How does it run multiple programs? By rapidly switching between them. You’ll learn about the stack, registers as state, interrupts for preemption, and how OSes make multitasking possible.

Core challenges you’ll face:

  • Saving and restoring CPU state → Maps to ARM register file and stack
  • Interrupt handlers and preemption → Maps to exception handling
  • Task state representation → Maps to data structures for OS abstractions
  • Scheduler logic → Maps to algorithms for fairness and responsiveness
  • Debugging concurrent execution → Maps to testing and verification

Key Concepts:

  • Process State and Stacks: “Operating Systems: Three Easy Pieces” - Ch. 4
  • CPU State and Interrupts: “Computer Systems: A Programmer’s Perspective” - Ch. 8
  • ARM Exception Model: ARM Cortex-M Technical Reference Manual
  • Task Switching: “Operating Systems: Three Easy Pieces” - Ch. 6

Difficulty: Advanced Time estimate: 2-3 weeks

Prerequisites:

  • Completed Project 3 (understand GPIO and hardware)
  • Completed Project 4 (understand interrupts)
  • Completed Project 6 (understand bootloader and memory layout)
  • Strong ARM assembly knowledge
  • Understanding of function calls and stack frames

Real World Outcome

When complete, you’ll have two “tasks” running concurrently on one ARM core. Each task might blink a different LED, and both run independently without explicit cooperation. The output:

# Two tasks running, switching every ~100ms due to SysTick interrupt

Task 0: Iteration 1, LED_A on  (preempted by SysTick after its ~100ms slice)
Task 1: Iteration 1, LED_B on
Task 0: Iteration 2, LED_A blink
Task 1: Iteration 2, LED_B blink
Task 0: Iteration 3, LED_A on
Task 1: Iteration 3, LED_B on

# In reality: both LEDs blink independently because the CPU switches
# between tasks often enough (about 10 times per second with a 100ms
# slice; many RTOSes use a 1ms tick) that it LOOKS like they run
# simultaneously.

The Core Question You’re Answering

“How does one CPU run multiple programs at the same time?”

Concepts You Must Understand First

  1. Process vs Thread (Conceptual)
    • What is a “task” in a multitasking system?
    • What state must be saved to switch between tasks?
    • Book Reference: “Operating Systems: Three Easy Pieces” - Ch. 4
  2. Interrupts and Preemption
    • What forces a task to stop and switch?
    • How do you guarantee a context switch happens?
    • Book Reference: “Computer Systems: A Programmer’s Perspective” - Ch. 8
  3. CPU State: The Stack and Registers
    • What’s in R0-R15 that matters for a task?
    • How do you save and restore them?
    • Book Reference: ARM Cortex-M Technical Reference Manual

Questions to Guide Your Design

  1. Task Representation
    • How will you store a task’s state (registers, stack)?
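    • How many registers must you save? (hint: all of them; on Cortex-M the hardware stacks R0-R3, R12, LR, PC, and xPSR on exception entry, and your code must save R4-R11)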
    • Where will you store the saved stack pointer for each task?
  2. Scheduler
    • How will you decide which task runs next?
    • Round-robin (each task gets equal time)?
    • Priority-based?
  3. Interrupt Handling
    • Which interrupt will trigger context switches? (SysTick is common)
    • How long can an interrupt handler take?
    • What happens if an interrupt fires during an interrupt?
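
For the scheduler question, round-robin is the simplest policy to start with. The sketch below is one possible shape, assuming a hypothetical array of per-task states; the enum values and the function name are illustrative and not part of any existing API.

```c
#include <stddef.h>
#include <stdint.h>

enum task_state { TASK_READY, TASK_BLOCKED };

/* Return the index of the next READY task after `current`, wrapping around.
   If no task (including the current one) is ready, `current` is returned. */
size_t pick_next_task(const uint8_t *states, size_t count, size_t current)
{
    for (size_t step = 1; step <= count; step++) {
        size_t candidate = (current + step) % count;
        if (states[candidate] == TASK_READY) {
            return candidate;
        }
    }
    return current;
}
```

Because the search starts one slot past the current task, every ready task gets a turn before any task runs twice, which is the fairness property round-robin is meant to provide.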

Thinking Exercise

Tracing a Context Switch

Before coding, understand what happens:

Scenario: Two tasks, Task A and Task B.
Timer interrupt fires every 100ms.

Timeline:
0ms:     Task A runs (normal execution)
50ms:    Task A still running
100ms:   SysTick interrupt fires
         - CPU stops Task A
         - Push current registers onto Task A's stack
         - Load Task B's stack pointer
         - Pop Task B's registers
         - Return from interrupt → Task B runs
150ms:   Task B running
200ms:   SysTick fires again
         - Switch from Task B back to Task A
         - Task A resumes from where it was stopped

Questions while tracing:

  • What if Task A and Task B both want to use R0? (Each has its own copy)
  • If an interrupt handler overwrites registers, do we need to save them? (Yes!)
  • What would happen if you forgot to save the stack pointer?

The Interview Questions They’ll Ask

  1. “Explain what happens in a context switch from a CPU’s perspective.”
  2. “You have two tasks with separate stacks. When you switch contexts, what register changes and how?”
  3. “Draw the stack layout for a task that’s been interrupted mid-execution.”
  4. “Why must you save ALL registers, not just some?”
  5. “Write pseudocode for a SysTick interrupt handler that switches tasks.”
  6. “If you have 10 tasks and 100 interrupts per second, how much CPU time does each task get?”

Hints in Layers

Hint 1: Understand Task State Each task needs its own stack. The stack pointer (SP) is the task’s identity. When you change SP, you switch tasks. The hard part: you must save/restore ALL registers.

Minimal task structure:

struct task {
    uint32_t* stack_pointer;
    uint32_t stack[STACK_SIZE];
    uint8_t state;  // READY, RUNNING, etc.
};

Hint 2: Design the Context Switch Pseudocode:

// Called from SysTick interrupt handler

save_context(current_task) {
    // Hardware exception entry has already pushed R0-R3, R12, LR, PC
    // and xPSR onto current_task's stack. Software must push R4-R11.
    push R4-R11
    current_task->stack_pointer = SP
}

load_context(next_task) {
    SP = next_task->stack_pointer
    pop R4-R11
    // Exception return will pop R0-R3, R12, LR, PC and xPSR
}

// Choose next task (round-robin for simplicity)
next_task = get_next_task()
save_context(current_task)
load_context(next_task)
// Return from interrupt - CPU restores state

Hint 3: Implement Task Init Each task’s stack must be set up so that when switched to, it can start executing its function:

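void task_init(struct task* t, void (*func)(void)) {
    // Build a fake exception frame as if the task had been interrupted.
    // The Cortex-M stack is full-descending: decrement first, then store.
    uint32_t* sp = &t->stack[STACK_SIZE];
    *(--sp) = 0x01000000u;             // xPSR: Thumb bit (bit 24) must be set
    *(--sp) = (uint32_t)func & ~1u;    // PC: task entry point
    *(--sp) = 0;                       // LR: task functions must never return
    *(--sp) = 0;                       // R12
    *(--sp) = 0; *(--sp) = 0;          // R3, R2
    *(--sp) = 0; *(--sp) = 0;          // R1, R0
    // Software-saved registers, popped by the switch code before exception return
    for (int i = 0; i < 8; i++) *(--sp) = 0;   // R11 down to R4
    t->stack_pointer = sp;
}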
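
To check your frame layout on a host before flashing, it can help to build the same initial context into a plain array and inspect the offsets. This is a sketch under the assumption of a Cortex-M style frame (R4-R11 saved by software below the eight-word hardware frame); the function name and the exit_addr parameter are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define XPSR_THUMB_BIT 0x01000000u  /* bit 24: must be set on Cortex-M */

/* Build an initial context at the top of `stack` and return the saved SP.
   Layout from low to high addresses:
   R4..R11 (software-saved), then R0-R3, R12, LR, PC, xPSR (hardware frame). */
uint32_t *task_stack_init(uint32_t *stack, size_t words,
                          uint32_t entry, uint32_t exit_addr)
{
    uint32_t *sp = stack + words;       /* one past the highest word */
    *(--sp) = XPSR_THUMB_BIT;           /* xPSR */
    *(--sp) = entry & ~1u;              /* PC */
    *(--sp) = exit_addr;                /* LR: where the task lands if it returns */
    *(--sp) = 0;                        /* R12 */
    for (int i = 0; i < 4; i++) {
        *(--sp) = 0;                    /* R3, R2, R1, R0 */
    }
    for (int i = 0; i < 8; i++) {
        *(--sp) = 0;                    /* R11 down to R4 */
    }
    return sp;
}
```

With this layout the saved stack pointer sits 16 words below the top of the stack, R0 is at sp[8], the entry point at sp[14], and xPSR at sp[15]. Printing those slots is a quick sanity check when the second task refuses to start.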

Hint 4: Test with LEDs Two tasks that blink different LEDs. If both tasks run, you see two independent blink patterns. Use LED timing to prove switching is happening.

Books That Will Help

| Topic | Book | Chapter |
|---|---|---|
| Processes and Context | “Operating Systems: Three Easy Pieces” | Ch. 4, 6 |
| CPU Exceptions | “Computer Systems: A Programmer’s Perspective” | Ch. 8 |
| ARM Exception Model | ARM Cortex-M Technical Reference | Section 7 |
| Assembly Details | “The Art of ARM Assembly, Volume 1” | All |

Common Pitfalls and Debugging

Problem 1: “Second task never runs”

  • Why: Task’s initial stack is wrong, or scheduler always picks task 0.
  • Fix: Verify task_init sets up a valid stack frame. Check scheduler alternates between tasks.
  • Quick test: Print task IDs on context switch. Should alternate.

Problem 2: “Program crashes during switch”

  • Why: Saved stack pointer is invalid, or register save/restore is wrong.
  • Fix: Check that SP is within the task’s stack bounds after switch.
  • Quick test: Use GDB to single-step through context switch. Watch SP and registers.

Problem 3: “LEDs don’t blink independently”

  • Why: Tasks aren’t running concurrently, or both tasks are doing the same thing.
  • Fix: Ensure SysTick interrupt is firing and switching tasks. Verify each task does different GPIO.
  • Quick test: Put delays in each task. If LEDs blink at different rates, tasks are switching.

Definition of Done

  • Two task structures created with separate stacks
  • SysTick timer configured to interrupt regularly
  • Context save/restore works for all registers
  • Tasks initialize correctly and start executing
  • Task scheduler alternates between tasks
  • Both tasks can run and produce independent behavior (LEDs blink differently)
  • No crashes or register corruption during switches
  • Can explain the context switch flow in detail

Learning Milestones

  1. You save and restore CPU state → You understand that “programs” are just register values and memory
  2. You implement preemptive switching → You see how OS abstractions emerge from hardware
  3. You run multiple “tasks” concurrently on one core → You’ve built the core of an OS

Project 8: ARM Emulator (Cycle-Accurate CPU Simulator)

  • File: P08-arm-emulator.md
  • Main Programming Language: C / C++
  • Alternative Programming Languages: Rust, Python (slow but educational)
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 3: Advanced
  • Knowledge Area: CPU Architecture / Emulation
  • Software or Tool: ARM GCC, any assembler
  • Main Book: “Computer Systems: A Programmer’s Perspective” by Bryant & O’Hallaron

What you’ll build: A software-based ARM CPU emulator that can execute ARM binaries, simulating the processor’s state (registers, memory, flags) and implementing the instruction execution cycle.

Why it teaches ARM: By building an emulator, you’ll understand EVERY instruction in intimate detail. You’ll see exactly how registers change, how memory is accessed, how flags are set. This is the deepest possible understanding of CPU behavior.

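Core challenges: Implementing a working subset of the ARM instruction set (at least 20 instructions for the Definition of Done, 50+ for broad coverage), managing CPU state, flag calculation, addressing modes, instruction decoding.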

Key Concepts:

  • Instruction Fetch-Decode-Execute Cycle: “Computer Organization and Design” by Patterson & Hennessy
  • ARM Instruction Set Reference: ARM Manual - All instruction definitions
  • CPU State Management: “Computer Systems: A Programmer’s Perspective” - Ch. 4
  • Testing and Validation: Your emulator’s register, flag, and memory results must match real ARM behavior

Difficulty: Advanced Time estimate: 3-4 weeks

Prerequisites:

  • Completed Project 1 (understand instruction format)
  • Completed Project 2 (ARM assembly fluency)
  • Strong C or C++ programming skills
  • Comfortable with bitwise operations and state machines

Real World Outcome

$ ./arm_emulator firmware.bin
[CPU] Loaded 1024 bytes into memory
[CPU] Starting execution at 0x8000000
[CPU] R0 = 0x00000000, R1 = 0x00000000, ..., PC = 0x8000000

Step 1: Fetch [PC=0x8000000] = 0xE3A00001 (MOV R0, #1)
Step 2: Decode: MOV instruction, Rd=R0, Operand=#1
Step 3: Execute: R0 ← 1
        Flags: Z=0, C=0, V=0, N=0
        PC ← PC + 4

[Execution continues...]

Final State:
  R0 = 0x00000042
  R1 = 0x00000100
  PC = 0x800001C
  Cycles: 47

The Core Question You’re Answering

“How does a CPU actually execute instructions?”

Concepts You Must Understand First

  1. Fetch-Decode-Execute Cycle: Every CPU iteration has three steps. You’ll implement each one.
  2. Instruction Format Decoding: How to extract bits and determine instruction type.
  3. State Representation: How to store and update registers, flags, memory.
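
The decode step can be sketched for one instruction class. The example below assumes the classic 32-bit ARM (A32) encoding used in the sample output, and handles only data-processing instructions with an immediate operand; the struct and function names are illustrative. A full decoder would also need to separate out the few non-data-processing encodings that share the same top bits.

```c
#include <stdint.h>

struct dp_imm {
    uint8_t cond;       /* bits 31:28, 0xE = always */
    uint8_t opcode;     /* bits 24:21, e.g. 0xD = MOV, 0x4 = ADD */
    uint8_t set_flags;  /* bit 20 (the S suffix) */
    uint8_t rn;         /* bits 19:16, first operand register */
    uint8_t rd;         /* bits 15:12, destination register */
    uint32_t imm;       /* imm8 rotated right by 2 * rotate */
};

static uint32_t ror32(uint32_t value, unsigned amount)
{
    amount &= 31u;
    return amount ? (value >> amount) | (value << (32u - amount)) : value;
}

/* Returns 1 and fills `out` if bits 27:25 are 001 (immediate operand form),
   otherwise returns 0 and leaves `out` untouched. */
int decode_dp_imm(uint32_t insn, struct dp_imm *out)
{
    if (((insn >> 25) & 0x7u) != 0x1u) {
        return 0;
    }
    out->cond = (uint8_t)(insn >> 28);
    out->opcode = (uint8_t)((insn >> 21) & 0xFu);
    out->set_flags = (uint8_t)((insn >> 20) & 0x1u);
    out->rn = (uint8_t)((insn >> 16) & 0xFu);
    out->rd = (uint8_t)((insn >> 12) & 0xFu);
    unsigned rotate = (insn >> 8) & 0xFu;
    out->imm = ror32(insn & 0xFFu, 2u * rotate);
    return 1;
}
```

Decoding 0xE3A00001 with this function yields cond 0xE, opcode 0xD (MOV), Rd 0, and immediate 1, which matches the MOV R0, #1 trace shown earlier.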

Definition of Done

  • Emulator loads ARM binaries
  • At least 20 ARM instructions working (MOV, ADD, SUB, LDR, STR, BL, BEQ, etc.)
  • Registers update correctly
  • Memory reads/writes work
  • Flags (Z, C, V, N) calculate correctly
  • Can step through execution and display state
  • Test binaries from Project 1 execute correctly
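
Flag calculation is where many emulators go subtly wrong, so it is worth isolating. The following sketch shows one way to compute N, Z, C, and V for a 32-bit addition with the S bit set (ADDS); the struct and function names are assumptions made for this example.

```c
#include <stdint.h>

struct flags {
    unsigned n;  /* result is negative (bit 31 set) */
    unsigned z;  /* result is zero */
    unsigned c;  /* unsigned carry out of bit 31 */
    unsigned v;  /* signed overflow */
};

/* Add two 32-bit values and compute the condition flags as ADDS would. */
uint32_t adds32(uint32_t a, uint32_t b, struct flags *f)
{
    uint32_t r = a + b;                     /* wraps modulo 2^32 */
    f->n = r >> 31;
    f->z = (r == 0);
    f->c = (r < a);                         /* wrapped, so a carry occurred */
    f->v = (~(a ^ b) & (a ^ r)) >> 31;      /* same-sign inputs, different-sign result */
    return r;
}
```

Two useful test vectors: 0x7FFFFFFF + 1 sets N and V but not C (signed overflow without unsigned carry), while 0xFFFFFFFF + 1 sets Z and C but not V. Subtraction needs its own treatment because ARM defines C as "no borrow" for SUBS and CMP.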

Project 9: Exception Handler & Interrupt Table

  • File: P09-exception-handler.md
  • Main Programming Language: ARM Assembly + C
  • Alternative Programming Languages: ARM Assembly
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: Exception Handling / Real-time Systems
  • Software or Tool: STM32F407, OpenOCD
  • Main Book: ARM Cortex-M Technical Reference Manual

What you’ll build: A complete exception vector table and handlers for at least 5 different exceptions (Reset, HardFault, UsageFault, SysTick, GPIO interrupts), each with proper context saving and handler logic.

Why it teaches ARM: Exceptions are how real hardware wakes up and responds. Understanding exception vectors, priorities, and handlers is fundamental to real-time systems, kernel development, and security research.

Core challenges: Exception priority, nested interrupts, context preservation, exception stack frames.

Key Concepts:

  • Exception Model: ARM Cortex-M Technical Reference Manual - Section 7
  • Vector Table: STM32F4 Reference Manual - NVIC section
  • Exception Stack Frames: ARM Cortex-M Architecture details

Difficulty: Intermediate Time estimate: 1-2 weeks

Prerequisites:

  • Completed Project 3 (hardware interaction)
  • Completed Project 6 (bootloader, startup code)
  • Understanding of interrupts and priorities

Real World Outcome

When you trigger different exceptions, handlers execute and you can see results:

# Press button (GPIO interrupt)
GPIO interrupt triggered (EXTI Line 0)
  Handler executed at: 0x800xxxx
  Interrupt priority: 15
  Execution time: 234 cycles

# Divide by zero (UsageFault, requires DIV_0_TRP set in SCB->CCR)
UsageFault triggered: Division by zero
  Fault address: 0x800yyyy
  Instruction causing fault: [disassembled]

# Unaligned memory access (UsageFault with UNALIGN_TRP set; escalates to HardFault if UsageFault is not enabled)
HardFault triggered: Unaligned memory access
  Attempted address: 0x20000001 (odd address)
  Corrective action: Exception handler running...

The Core Question You’re Answering

“How does the CPU know which handler to call when an interrupt happens?”
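
On Cortex-M the short answer is a table lookup: each exception has a number, and the core fetches the handler address from the word at the vector table base plus four times that number. The sketch below models that arithmetic; the constant names are illustrative, and the example IRQ number for EXTI line 0 (6 on the STM32F4) should be confirmed against your reference manual.

```c
#include <stdint.h>

/* Architecturally fixed Cortex-M exception numbers (a subset). */
#define EXC_RESET       1u
#define EXC_HARDFAULT   3u
#define EXC_USAGEFAULT  6u
#define EXC_SYSTICK    15u

/* External interrupts start at exception number 16. */
uint32_t irq_to_exception(uint32_t irqn)
{
    return 16u + irqn;
}

/* Address of the vector table slot that holds the handler's address. */
uint32_t vector_entry_address(uint32_t table_base, uint32_t exception_number)
{
    return table_base + 4u * exception_number;
}
```

For a table at 0x08000000, SysTick’s slot is at 0x0800003C and IRQ 6 lands at 0x08000058. Slot 0 is not a handler at all: it holds the initial stack pointer value.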


Project 10: I2C/OLED Display Driver

  • File: P10-i2c-oled-driver.md
  • Main Programming Language: C with ARM Assembly
  • Alternative Programming Languages: C
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 2. The “Micro-SaaS / Pro Tool”
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: Protocols / Communication / Display
  • Software or Tool: STM32F407, I2C OLED display (SSD1306)
  • Main Book: “Making Embedded Systems, 2nd Edition” by Elecia White

What you’ll build: An I2C driver that communicates with an OLED display, implementing the I2C protocol from scratch and displaying text/graphics on the screen.

Why it teaches ARM: I2C is a real-world protocol with timing requirements, acknowledge bits, and addressing. You’ll learn protocol implementation, state machines, timing constraints, and how to build a reusable driver.

Core challenges: I2C timing, start/stop/repeated start conditions, ACK/NACK handling, OLED command set, graphics rendering.

Key Concepts:

  • I2C Protocol: I2C Specification documentation
  • OLED Display Control: SSD1306 datasheet
  • State Machines: “The Art of ARM Assembly” - Ch. 10
  • Timing Constraints: “Making Embedded Systems” - Ch. 4

Difficulty: Intermediate Time estimate: 1-2 weeks

Prerequisites:

  • Completed Project 4 (serial communication basics)
  • Completed Project 3 (GPIO and timing)
  • Access to I2C OLED breakout board

Real World Outcome

When your I2C driver works, you’ll see text appear on the physical OLED display:

┌─────────────────────┐
│ Hello from ARM!     │
│                     │
│ Temperature: 42°C   │
│ CPU Cycles: 245,678 │
│                     │
│ Uptime: 0:01:23     │
└─────────────────────┘
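
Before any text reaches the screen, pixels have to be placed in a framebuffer that mirrors the display’s memory. The SSD1306 organizes its RAM in pages of 8 rows, with each byte holding a vertical strip of 8 pixels. A minimal sketch, assuming a 128x64 panel and a RAM buffer you later send over I2C (the buffer and function names are hypothetical):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define OLED_WIDTH  128
#define OLED_HEIGHT 64

/* One bit per pixel: 8 pages of 128 bytes. */
static uint8_t framebuffer[OLED_WIDTH * OLED_HEIGHT / 8];

void fb_clear(void)
{
    memset(framebuffer, 0, sizeof framebuffer);
}

/* Set or clear one pixel. Returns 0 on success, -1 if out of bounds. */
int fb_set_pixel(int x, int y, int on)
{
    if (x < 0 || x >= OLED_WIDTH || y < 0 || y >= OLED_HEIGHT) {
        return -1;
    }
    size_t index = (size_t)x + (size_t)(y / 8) * OLED_WIDTH;  /* page * width + column */
    uint8_t mask = (uint8_t)(1u << (y % 8));                   /* bit 0 is the top row of the page */
    if (on) {
        framebuffer[index] |= mask;
    } else {
        framebuffer[index] &= (uint8_t)~mask;
    }
    return 0;
}
```

Keeping the whole frame in RAM costs 1 KB but lets you draw in any order and push the result to the display in one burst, which is usually simpler than read-modify-write over I2C.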

Project 11: SPI/SD Card Driver

  • File: P11-spi-sd-driver.md
  • Main Programming Language: C with ARM Assembly
  • Alternative Programming Languages: C
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 2. The “Micro-SaaS / Pro Tool”
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Storage / Filesystems
  • Software or Tool: STM32F407, SD Card
  • Main Book: “Making Embedded Systems, 2nd Edition” by Elecia White

What you’ll build: An SPI driver and basic SD card interface that can read blocks from an SD card, implementing the SPI protocol and SD card command sequence.

Why it teaches ARM: SPI is faster than I2C and requires different timing. SD cards have a complex initialization sequence and state machine. You’ll learn DMA setup, chip select management, and how to interact with complex peripherals.

Core challenges: SPI clock speed, SD card initialization, command format, CRC calculation, DMA setup.

Key Concepts:

  • SPI Protocol: SPI Specification + STM32 SPI peripheral
  • SD Card Commands: SD Card Association specification
  • DMA Configuration: STM32F4 Reference Manual - DMA section
  • CRC Calculation: SD card CRC-7 and CRC-16 algorithms
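
The CRC-7 mentioned above protects every SD command frame. As a rough sketch, a bitwise implementation of the polynomial x^7 + x^3 + 1 looks like this; the function names are chosen for this example, and a table-driven version would be faster on real hardware.

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-7 (polynomial 0x09, initial value 0) over `len` bytes. */
uint8_t sd_crc7(const uint8_t *data, size_t len)
{
    uint8_t crc = 0;
    for (size_t i = 0; i < len; i++) {
        uint8_t byte = data[i];
        for (int bit = 0; bit < 8; bit++) {
            crc = (uint8_t)(crc << 1);
            if ((byte ^ crc) & 0x80u) {
                crc ^= 0x09u;
            }
            byte = (uint8_t)(byte << 1);
        }
    }
    return crc & 0x7Fu;
}

/* The last byte of a command frame: CRC-7 in bits 7:1, end bit 1 in bit 0. */
uint8_t sd_command_crc_byte(const uint8_t *frame5)
{
    return (uint8_t)((sd_crc7(frame5, 5) << 1) | 1u);
}
```

A handy check is CMD0 (bytes 40 00 00 00 00), whose final frame byte must be 0x95, and CMD8 with argument 0x000001AA, whose final byte must be 0x87. In SPI mode the card only verifies the CRC for these early commands unless CRC checking is explicitly turned on.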

Difficulty: Advanced Time estimate: 2 weeks

Prerequisites:

  • Completed Project 4 (communication protocols)
  • Completed Project 10 (driver architecture)
  • Access to SD card and SPI-capable board

Real World Outcome

When working, you can read data from an SD card:

SD Card Detected: SDHC, 4GB
   Manufacturer: Kingston
   Serial: 0x12345678

Read Sector 0:
  Bytes 0-15:   55 AA 00 12 34 56 78 9A BC DE F0 11 22 33 44 55
  Bytes 16-31:  ... (continues)

Read Success! CRC verified.

Project 12: PWM Motor Control

  • File: P12-pwm-motor.md
  • Main Programming Language: C with ARM Assembly
  • Alternative Programming Languages: C
  • Coolness Level: Level 2: Practical but Forgettable
  • Business Potential: 2. The “Micro-SaaS / Pro Tool”
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: Real-time Control / PWM
  • Software or Tool: STM32F407, DC motor, H-bridge
  • Main Book: “Making Embedded Systems, 2nd Edition” by Elecia White

What you’ll build: A PWM configuration that controls a DC motor’s speed and direction using an H-bridge and duty cycle variation.

Why it teaches ARM: PWM is fundamental to embedded control. Understanding timer configuration, frequency, duty cycle, and how they translate to physical motor behavior is essential for robotics and control systems.

Core challenges: Timer configuration for precise PWM, duty cycle calculation, frequency selection, motor control.

Key Concepts:

  • PWM Fundamentals: “Making Embedded Systems” - Ch. 5
  • Timer Peripherals: STM32F4 Reference Manual - Timer section
  • H-Bridge Control: Basic motor control electronics
  • PID Loops: Optional but educational for speed control
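
The relationship between timer registers, PWM frequency, and duty cycle can be worked out on paper before touching hardware. The sketch below assumes an STM32-style timer with a prescaler (PSC), an auto-reload value (ARR), and a compare value (CCR) in PWM mode 1; the function names are illustrative, and the 84 MHz timer clock used in the example is an assumption that depends on your clock tree.

```c
#include <stdint.h>

/* PWM frequency = timer_clk / ((PSC + 1) * (ARR + 1)). */
uint32_t pwm_frequency_hz(uint32_t timer_clk_hz, uint32_t psc, uint32_t arr)
{
    uint64_t divider = ((uint64_t)psc + 1u) * ((uint64_t)arr + 1u);
    return (uint32_t)(timer_clk_hz / divider);
}

/* Compare value for a duty cycle given in percent (clamped to 0..100).
   In PWM mode 1 the output is active while the counter is below CCR. */
uint32_t pwm_compare_for_duty(uint32_t arr, uint32_t duty_percent)
{
    if (duty_percent > 100u) {
        duty_percent = 100u;
    }
    return (uint32_t)((((uint64_t)arr + 1u) * duty_percent) / 100u);
}
```

With an 84 MHz timer clock, PSC = 83 and ARR = 999 give a 1 kHz PWM signal with 1000 steps of resolution, so 50% duty is a compare value of 500. Motor drivers often work better at frequencies above the audible range, which means trading resolution for frequency.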

Difficulty: Intermediate Time estimate: 1-2 weeks

Prerequisites:

  • Completed Project 3 (GPIO and hardware)
  • Completed Project 6 (bootloader and initialization)
  • Understanding of motor electronics (voltage, current)

Real World Outcome

Motor Speed Control:
  Speed: 50% → Motor runs at 50% of max RPM
  Speed: 100% → Motor runs at full RPM
  Speed: 0% → Motor stopped

Direction Control:
  Forward → H-Bridge: in1=HIGH, in2=LOW
  Reverse → H-Bridge: in1=LOW, in2=HIGH

Acceleration Test:
  0% → 25% → 50% → 75% → 100% (smooth ramp)

Project 13: DMA Audio (Direct Memory Access + Audio)

  • File: P13-dma-audio.md
  • Main Programming Language: C with ARM Assembly
  • Alternative Programming Languages: C
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Real-time Audio / DMA
  • Software or Tool: STM32F407, I2S Audio codec
  • Main Book: “Making Embedded Systems, 2nd Edition” by Elecia White

What you’ll build: DMA-driven I2S audio interface that streams audio from memory to a codec with minimal CPU overhead, implementing double-buffering and sample conversion.

Why it teaches ARM: DMA is advanced: the CPU programs a transfer and the DMA controller does it independently. You’ll learn advanced interrupt handling, double-buffering, and how real-time systems handle high-bandwidth data streams.

Core challenges: DMA configuration, I2S protocol, sample buffering, interrupt-driven transfers, audio codec setup.

Key Concepts:

  • DMA Theory: STM32F4 Reference Manual - DMA section
  • I2S Protocol: I2S Specification
  • Audio Codec: Codec datasheet (WM8731 or similar)
  • Double Buffering: “Making Embedded Systems” - Real-time concepts

Difficulty: Advanced Time estimate: 2-3 weeks

Prerequisites:

  • Completed Project 4 (serial/interrupt protocols)
  • Completed Project 10 (I2C configuration)
  • Completed Project 9 (exception handling)
  • Audio hardware knowledge (optional but helpful)

Real World Outcome

Audio DMA Transfer:
  DAC configured for 48kHz, 16-bit stereo
  DMA Channel 7 → I2S peripheral

Double-Buffer Setup:
  Buffer A: Loading PCM samples from SD card
  Buffer B: Playing via I2S → Codec → Speaker

When Buffer B finishes → Swap buffers, switch to Buffer A
When Buffer A finishes → Swap back

Result: Continuous audio streaming with only brief CPU interrupt every 512 samples
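
One common way to get this ping-pong behavior is a single circular DMA buffer treated as two halves, refilled from the half-transfer and transfer-complete interrupts. The sketch below simulates that on a host; the callback names, the ramp used as a placeholder sample source, and the reading of "512 samples" as 512 frames per half are all assumptions for illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define HALF_SAMPLES 512u

/* One circular DMA buffer, treated as two halves (Buffer A and Buffer B). */
static int16_t dma_buffer[2u * HALF_SAMPLES];
static uint16_t phase;  /* state of the placeholder sample source */

/* Placeholder source: a rising ramp. A real player would copy PCM data here. */
static void fill_half(size_t half)
{
    int16_t *dst = &dma_buffer[half * HALF_SAMPLES];
    for (size_t i = 0; i < HALF_SAMPLES; i++) {
        dst[i] = (int16_t)(phase++ & 0x7FFFu);
    }
}

/* DMA finished playing the first half: refill it while the second half plays. */
void on_half_transfer(void)
{
    fill_half(0);
}

/* DMA finished the second half and wrapped around: refill the second half. */
void on_transfer_complete(void)
{
    fill_half(1);
}

/* Time between refill interrupts, in microseconds (truncated). */
uint32_t refill_interval_us(uint32_t sample_rate_hz, uint32_t frames_per_half)
{
    return (uint32_t)(((uint64_t)frames_per_half * 1000000u) / sample_rate_hz);
}
```

At 48 kHz with 512 frames per half, the CPU has roughly 10.6 ms to refill each half before the DMA comes back around. Missing that deadline is what produces audible glitches.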

Project 14: GDB Stub (Remote Debugging)

  • File: P14-gdb-stub.md
  • Main Programming Language: ARM Assembly + C
  • Alternative Programming Languages: C
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Debugging / Remote Protocol
  • Software or Tool: GDB, Serial port, ARM GCC
  • Main Book: “The Art of Debugging with GDB” by Matloff & Salzman

What you’ll build: A GDB stub that allows remote debugging of bare-metal ARM code via serial connection, supporting breakpoints, step execution, and register inspection.

Why it teaches ARM: Implementing a GDB stub forces you to understand the GDB remote protocol, exception handling, and all the machinery that debuggers use. You’ll gain insight into how debugging actually works.

Core challenges: GDB protocol implementation, breakpoint handling (software and hardware), register marshaling, exception vectors.

Key Concepts:

  • GDB Remote Protocol: GDB Documentation
  • Software Breakpoints: Exception vectors and instruction patching
  • Hardware Breakpoints and Watchpoints: Cortex-M FPB (Flash Patch and Breakpoint) unit for breakpoints, DWT (Data Watchpoint and Trace) for watchpoints
  • Serial Protocol: Checksum calculation, packet framing
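
Packet framing in the GDB remote protocol is simple enough to test on a host first: each packet is a dollar sign, the payload, a hash, and two lowercase hex digits holding the sum of the payload bytes modulo 256. A minimal sketch follows, with function names chosen for this example; it does not handle the escaping that payloads containing special characters require.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Sum of all payload bytes, modulo 256. */
uint8_t rsp_checksum(const char *payload)
{
    uint8_t sum = 0;
    while (*payload != '\0') {
        sum = (uint8_t)(sum + (uint8_t)*payload++);
    }
    return sum;
}

/* Write "$<payload>#<checksum>" into `out`.
   Returns the packet length, or -1 if `out` is too small. */
int rsp_frame(const char *payload, char *out, size_t out_size)
{
    int n = snprintf(out, out_size, "$%s#%02x", payload, rsp_checksum(payload));
    if (n < 0 || (size_t)n >= out_size) {
        return -1;
    }
    return n;
}
```

Framing the reply OK produces $OK#9a. The receiver acknowledges each good packet with a single + and requests retransmission with a single -, so your stub needs to handle those bytes outside of any packet.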

Difficulty: Advanced Time estimate: 2-3 weeks

Prerequisites:

  • Completed Project 4 (UART/serial communication)
  • Completed Project 9 (exception handlers)
  • Comfortable with protocol parsing
  • Understanding of GDB basics

Real World Outcome

$ arm-none-eabi-gdb firmware.elf
(gdb) target remote /dev/ttyUSB0
Remote debugging using /dev/ttyUSB0
0x08000000 in _start ()

(gdb) break main
Breakpoint 1 at 0x8000100

(gdb) continue
Breakpoint 1, main () at main.c:15

(gdb) print array[0]
$1 = 42

(gdb) step
15    int result = calculate();

(gdb) info registers
r0             0x42        66
r1             0x100       256
...

Project 15: Tiny OS (Minimal Operating System Kernel)

  • File: P15-tiny-os.md
  • Main Programming Language: C + ARM Assembly
  • Alternative Programming Languages: C with inline assembly
  • Coolness Level: Level 5: Pure Magic
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 4: Expert
  • Knowledge Area: Operating Systems / Kernels
  • Software or Tool: ARM GCC, linker scripts, OpenOCD
  • Main Book: “Operating Systems: Three Easy Pieces” by Remzi H. Arpaci-Dusseau and Andrea C. Arpaci-Dusseau

What you’ll build: A minimal but complete operating system kernel with task scheduling, exception handling, memory management, and system calls. A complete system from bootloader to user programs.

Why it teaches ARM: This is the capstone project. You’ll integrate everything: bootloader (Project 6), context switching (Project 7), exceptions (Project 9), memory management (Project 5), and drivers. You’ll build a complete system that could theoretically run hundreds of embedded applications.

Core challenges: All previous challenges, plus interprocess communication, privilege levels, MPU regions (if you add memory protection; Cortex-M has no MMU or page tables), system call interface.

Key Concepts:

  • Kernel Architecture: “Operating Systems: Three Easy Pieces” - All chapters
  • Privilege Levels: ARM Cortex-M privileged and unprivileged execution (Thread and Handler modes)
  • System Calls: OS interface design
  • Task Management: Complete scheduler with priority and preemption
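
A system call interface usually comes down to a numbered table of kernel functions. The sketch below shows the dispatch step only, with two hypothetical calls (yield and get-ticks) invented for this example. On Cortex-M the SVC handler would typically recover the call number from the SVC instruction just before the stacked return address and the arguments from the stacked R0-R3, then call something like this.

```c
#include <stddef.h>
#include <stdint.h>

typedef int32_t (*syscall_fn)(uint32_t arg0, uint32_t arg1);

static uint32_t tick_count;  /* would be incremented by the SysTick handler */

/* Hypothetical call 0: give up the rest of the time slice. */
static int32_t sys_yield(uint32_t arg0, uint32_t arg1)
{
    (void)arg0;
    (void)arg1;
    return 0;
}

/* Hypothetical call 1: read the kernel tick counter. */
static int32_t sys_get_ticks(uint32_t arg0, uint32_t arg1)
{
    (void)arg0;
    (void)arg1;
    return (int32_t)tick_count;
}

static const syscall_fn syscall_table[] = { sys_yield, sys_get_ticks };

/* Validate the call number, then jump through the table.
   Returns -1 for an unknown call number. */
int32_t syscall_dispatch(uint32_t number, uint32_t arg0, uint32_t arg1)
{
    if (number >= sizeof syscall_table / sizeof syscall_table[0]) {
        return -1;
    }
    return syscall_table[number](arg0, arg1);
}
```

The bounds check matters more than it looks: the call number comes from unprivileged code, so treating it as trusted would let a task jump the kernel to an arbitrary address.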

Difficulty: Expert Time estimate: 4-6 weeks

Prerequisites:

  • Completed ALL previous projects (1-14)
  • Deep understanding of ARM architecture
  • Strong C programming skills
  • Familiarity with OS concepts

Real World Outcome

$ # Build and run your tiny OS on the STM32F407
$ arm-none-eabi-gcc -o tiny_os.elf kernel.c driver/*.c
$ openocd -f board/stm32f4discovery.cfg
$ # In a second terminal:
$ arm-none-eabi-gdb tiny_os.elf
(gdb) target remote localhost:3333
(gdb) load
(gdb) continue

# On the board:
Tiny OS Kernel v0.1
  Bootloader: OK
  Memory: 192 KB SRAM, 1 MB Flash

Loading tasks...
  Task 1: Blinker (Priority 1)
  Task 2: Monitor (Priority 2)
  Task 3: Telemetry (Priority 3)

Scheduler: Round-robin, 1000 Hz tick
  T=0ms:   Task 1 executes
  T=1ms:   Task 2 executes
  T=2ms:   Task 3 executes
  T=3ms:   Task 1 executes (resuming from context save)
  ...

System running. Press CTRL-C to halt.

Project Comparison Table

| # | Project | Difficulty | Time | Coolness | Prerequisites |
|---|---|---|---|---|---|
| 1 | Instruction Decoder | Intermediate | 1-2w | ⭐⭐⭐ | C basics |
| 2 | Assembly Calculator | Beginner | 1w | ⭐⭐⭐ | Proj 1 helpful |
| 3 | LED Blinker | Advanced | 1-2w | ⭐⭐⭐⭐ | Proj 2 |
| 4 | UART Driver | Advanced | 1-2w | ⭐⭐⭐⭐ | Proj 3 |
| 5 | Memory Allocator | Advanced | 1-2w | ⭐⭐⭐ | Proj 2 |
| 6 | Bootloader | Expert | 2-3w | ⭐⭐⭐⭐⭐ | Proj 3 |
| 7 | Context Switcher | Advanced | 2-3w | ⭐⭐⭐⭐ | Proj 3, 4, 6 |
| 8 | ARM Emulator | Advanced | 3-4w | ⭐⭐⭐⭐ | Proj 1, 2 |
| 9 | Exception Handler | Intermediate | 1-2w | ⭐⭐⭐ | Proj 3, 6 |
| 10 | I2C/OLED Driver | Intermediate | 1-2w | ⭐⭐⭐ | Proj 3, 4 |
| 11 | SPI/SD Card | Advanced | 2w | ⭐⭐⭐ | Proj 4, 10 |
| 12 | PWM Motor | Intermediate | 1-2w | ⭐⭐ | Proj 3, 6 |
| 13 | DMA Audio | Advanced | 2-3w | ⭐⭐⭐⭐ | Proj 4, 9, 10 |
| 14 | GDB Stub | Advanced | 2-3w | ⭐⭐⭐⭐ | Proj 4, 9 |
| 15 | Tiny OS | Expert | 4-6w | ⭐⭐⭐⭐⭐ | Proj 1-14 |

Recommendation

If you’re brand new to ARM: Start with Project 1 (Instruction Decoder). You need to understand how 32 bits encode an instruction before writing assembly code. This is foundational.

If you want to write code immediately: Start with Project 2 (Assembly Calculator). This gives you hands-on assembly experience quickly.

If you want to build real hardware: Start with Project 3 (LED Blinker). This is your first “success”—visible, physical feedback.

If you want to understand the entire CPU: Follow the systems programmer path: Projects 1→2→6→7→8→9→14→15. This builds from the ISA through OS concepts.

If you’re time-constrained: Focus on Projects 1, 2, 3, 4, 8, 15. These cover breadth of concepts with focused depth.


Final Overall Project: ARM-Based Retro Game Console

The Goal: Integrate Projects 1-15 into a cohesive embedded system: a playable game console running on bare-metal ARM.

What you’ll build:

  • Bootloader (Project 6) loads game firmware from SD card (Project 11)
  • Game uses DMA audio (Project 13) for sound
  • PWM controls backlight brightness (Project 12)
  • I2C reads battery voltage via ADC (Project 10)
  • Task scheduler (Project 7) manages game loops and audio
  • Emulator (Project 8) runs legacy 8-bit game instructions
  • Exception handlers (Project 9) catch faults gracefully
  • UART (Project 4) provides debug console
  • malloc (Project 5) manages game assets in RAM

Success Criteria:

  • Game runs at 30+ FPS
  • Audio plays without crackling
  • No memory leaks over 10 minutes
  • Gracefully handles cartridge errors
  • Can load different game cartridges

From Learning to Production: What’s Next

| Your Project | Production Equivalent | Gap to Fill |
|---|---|---|
| Instruction Decoder | IDA Pro, Ghidra, objdump | Disassembly with symbolic info, CFG recovery |
| Assembly Calculator | Shell, REPL | Error handling, history, floating-point |
| LED Blinker | Device drivers | Multi-tasking, resource management, hot-swapping |
| UART Driver | Linux serial driver | Buffering, flow control, error recovery |
| Memory Allocator | dlmalloc, jemalloc | Thread safety, statistics, performance tuning |
| Bootloader | U-Boot, Barebox | Secure boot, redundancy, A/B partitions |
| Context Switcher | OS kernel | Preemption, real-time priorities, SMP |
| Emulator | QEMU | Full system simulation, debugging interfaces |
| Exception Handler | OS fault handlers | Recovery strategies, core dumps |
| Drivers | Linux kernel modules | Hot-plug, power management, abstraction layers |
| Tiny OS | Linux, RTOS | Virtual memory, networking, file systems |

Summary: All Projects and Languages

This learning path covers ARM architecture mastery through 15 hands-on projects.

| # | Project Name | Main Language | Difficulty | Time Estimate |
|---|---|---|---|---|
| 1 | Instruction Decoder | C | Intermediate | 1-2 weeks |
| 2 | Assembly Calculator | Assembly | Beginner | 1 week |
| 3 | LED Blinker | C + Assembly | Advanced | 1-2 weeks |
| 4 | UART Driver | C | Advanced | 1-2 weeks |
| 5 | Memory Allocator | C | Advanced | 1-2 weeks |
| 6 | Simple Bootloader | Assembly + C | Expert | 2-3 weeks |
| 7 | Context Switcher | Assembly + C | Advanced | 2-3 weeks |
| 8 | ARM Emulator | C / C++ | Advanced | 3-4 weeks |
| 9 | Exception Handler | Assembly + C | Intermediate | 1-2 weeks |
| 10 | I2C/OLED Driver | C + Assembly | Intermediate | 1-2 weeks |
| 11 | SPI/SD Card Driver | C + Assembly | Advanced | 2 weeks |
| 12 | PWM Motor Control | C + Assembly | Intermediate | 1-2 weeks |
| 13 | DMA Audio Player | C + Assembly | Advanced | 2-3 weeks |
| 14 | GDB Stub Debugger | Assembly + C | Advanced | 2-3 weeks |
| 15 | Tiny Operating System | C + Assembly | Expert | 4-6 weeks |

  • For beginners: 1→2→3→4→8→15 (6-8 weeks, covers breadth and depth)
  • For embedded engineers: 1→2→3→4→10→11→12→15 (8-10 weeks, peripheral-focused)
  • For systems programmers: 1→2→6→7→8→9→14→15 (10-12 weeks, OS-focused)
  • For accelerated learners: 1→2→3→5→6→8→15 (4-6 weeks full-time)

Expected Outcomes

After completing these projects, you will:

  • Understand ARM ISA from first principles - registers, instruction encoding, pipeline
  • Write ARM assembly fluently - understand control flow, calling conventions, system calls
  • Build bare-metal systems - bootloaders, device drivers, interrupt handlers
  • Optimize code for performance - understand pipeline stalls, cache, branch prediction
  • Debug embedded systems - trace execution, analyze memory layouts, find bugs
  • Reverse-engineer ARM binaries - decode instructions, find vulnerabilities
  • Implement CPU abstractions - build emulators, schedulers, memory allocators
  • Manage peripherals at register level - GPIO, UART, I2C, SPI, DMA, PWM

You’ll have built a complete embedded system from first principles—from understanding CPU instructions to running a tiny operating system with task scheduling and peripheral management.


Additional Resources and References

Books

  1. “The Art of ARM Assembly, Volume 1” - Randall Hyde (best for learning ARM-specific details)
  2. “Computer Systems: A Programmer’s Perspective” - Bryant & O’Hallaron (foundational systems thinking)
  3. “Making Embedded Systems, 2nd Edition” - Elecia White (practical embedded knowledge)
  4. “The Linux Programming Interface” - Michael Kerrisk (understanding system calls and OS)
  5. “Understanding the Linux Kernel” - Bovet & Cesati (advanced OS concepts)

Web Resources

  • Azeria Labs ARM Assembly Series - Free, comprehensive ARM assembly tutorials
  • ARM Assembly By Example - armasm.com, practical examples
  • Vivonomicon’s Bare Metal STM32 Series - Step-by-step bare-metal guides
  • OSDev.org - Operating system development reference

Tools & Emulators

  • QEMU - Full-system ARM emulator
  • GNU ARM Embedded Toolchain - GCC for cross-compilation
  • OpenOCD - On-chip debugger interface
  • Unicorn Engine - Lightweight multi-architecture CPU emulator framework (C core with bindings for Python and other languages)

Final Note

ARM architecture powers billions of devices. By mastering it from first principles through these projects, you join an elite group of engineers who truly understand how modern computing works. You’ll be prepared for roles in embedded systems, security research, kernel development, and systems optimization—wherever deep technical knowledge is valued.

Start with Project 1. One project at a time. Build something real.

Good luck, and enjoy the journey! 🚀