LEARN ARM DEEP DIVE

Learn ARM: From Zero to ARM Architecture Master

Goal: Deeply understand ARM architecture—from basic registers and instruction sets to bare-metal programming, writing bootloaders, and building your own ARM emulator. ARM powers billions of devices, from smartphones to Raspberry Pis to Apple Silicon Macs.


Why ARM Matters

ARM (Advanced RISC Machines) is the most widely used processor architecture in the world. More than 200 billion ARM-based chips have shipped. From your iPhone to your smart thermostat, from data center servers to the Nintendo Switch, ARM is everywhere.

Yet most developers treat it as a black box. After completing these projects, you will:

  • Understand every register and its purpose
  • Read and write ARM assembly fluently
  • Know how instructions flow through the pipeline
  • Build bare-metal systems without an operating system
  • Create bootloaders, drivers, and schedulers
  • Understand the difference between ARM32, Thumb, and AArch64
  • Debug ARM binaries like a professional reverse engineer

Core Concept Analysis

ARM vs x86: The Philosophy

x86 (CISC):                          ARM (RISC):
├── Complex instructions              ├── Simple, uniform instructions
├── Variable-length (1-15 bytes)      ├── Fixed-length (4 bytes ARM, 2 bytes Thumb)
├── Many addressing modes             ├── Load/Store architecture
├── Fewer registers (8-16)            ├── Many registers (16 general ARM32, 31 in AArch64)
└── Hardware does more work           └── Software does more work

ARM Register Set (32-bit ARMv7)

General Purpose Registers:
┌─────────────────────────────────┐
│ R0-R3   : Arguments/Return      │  ← Function parameters
│ R4-R11  : Callee-saved          │  ← Preserved across calls
│ R12 (IP): Intra-procedure call  │  ← Scratch register
│ R13 (SP): Stack Pointer         │  ← Points to top of stack
│ R14 (LR): Link Register         │  ← Return address
│ R15 (PC): Program Counter       │  ← Current instruction address
├─────────────────────────────────┤
│ CPSR    : Current Program Status│  ← Flags (N,Z,C,V) + mode bits
└─────────────────────────────────┘

AArch64 Register Set (64-bit ARMv8)

┌─────────────────────────────────┐
│ X0-X7   : Arguments/Return      │  ← Function parameters
│ X8      : Indirect result       │  ← Large struct returns
│ X9-X15  : Caller-saved temps    │  ← Scratch registers
│ X16-X17 : Intra-procedure call  │  ← IP0/IP1, linker veneer scratch
│ X18     : Platform reserved     │  ← (TLS on some platforms)
│ X19-X28 : Callee-saved          │  ← Preserved across calls
│ X29 (FP): Frame Pointer         │  ← Stack frame base
│ X30 (LR): Link Register         │  ← Return address
│ SP      : Stack Pointer         │  ← Points to top of stack
│ PC      : Program Counter       │  ← Not directly accessible
├─────────────────────────────────┤
│ W0-W30  : 32-bit views of X regs│  ← Lower 32 bits
└─────────────────────────────────┘

Fundamental Concepts You’ll Master

  1. Load/Store Architecture: ARM can only operate on registers. Data must be loaded from memory to registers, operated on, then stored back. No ADD [memory], value like x86.

  2. Conditional Execution: Most ARM32 instructions can be conditionally executed based on flags. ADDEQ R0, R1, R2 only adds if the Zero flag is set. (AArch64 drops this in favor of a few conditional instructions such as CSEL.)

  3. Barrel Shifter: ARM’s secret weapon. Shift operands as part of any data processing instruction: ADD R0, R1, R2, LSL #2 (R0 = R1 + R2*4)

  4. Instruction Pipelining: ARM uses a 3-stage (or more) pipeline: Fetch → Decode → Execute. Understanding this explains why PC is 8 bytes ahead.

  5. Processor Modes: User, FIQ, IRQ, Supervisor, Abort, Undefined, System. Each exception mode has its own banked SP and LR (FIQ also banks R8-R12) for fast context switching; System shares User’s registers.

  6. Exception Handling: Reset, Undefined, SWI, Prefetch Abort, Data Abort, IRQ, FIQ. The vector table sits at address 0x00000000 by default.

  7. Memory Management: MMU, TLB, cache hierarchies. How virtual memory maps to physical.


Project List

Projects are ordered from fundamental understanding to advanced implementations.


Project 1: ARM Instruction Decoder & Disassembler

  • File: LEARN_ARM_DEEP_DIVE.md
  • Main Programming Language: C
  • Alternative Programming Languages: Rust, Python, Go
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: Binary Parsing / Instruction Sets
  • Software or Tool: Custom Disassembler (like objdump)
  • Main Book: “The Art of ARM Assembly, Volume 1” by Randall Hyde

What you’ll build: A command-line tool that takes raw ARM binary code (or ELF files) and decodes each instruction into human-readable assembly, showing the opcode breakdown, registers used, and instruction effects.

Why it teaches ARM: Before you can write ARM assembly, you need to understand how instructions are encoded. Every ARM instruction fits into 32 bits with a specific structure. This project forces you to internalize the encoding scheme—condition codes, opcodes, register fields, immediate values, and shift operations.

Core challenges you’ll face:

  • Decoding the condition field (bits 31-28) → maps to understanding conditional execution
  • Parsing different instruction formats → maps to data processing, load/store, branch formats
  • Handling the barrel shifter encoding → maps to shift types and amounts in operand 2
  • Decoding immediate values with rotation → maps to 8-bit immediate with 4-bit rotation
  • Distinguishing instruction classes → maps to understanding opcode space allocation

Key Concepts:

  • ARM Instruction Encoding: ARM Architecture Reference Manual - Section A5
  • Condition Codes: “The Art of ARM Assembly, Vol 1” Chapter 4 - Randall Hyde
  • Binary Parsing in C: “Computer Systems: A Programmer’s Perspective” Chapter 2 - Bryant & O’Hallaron
  • ELF Format Basics: “Practical Binary Analysis” Chapter 2 - Dennis Andriesse

Difficulty: Intermediate
Time estimate: 1-2 weeks
Prerequisites: C programming fundamentals, understanding of binary/hexadecimal, basic knowledge of what assembly language is. No prior ARM experience required.

Real world outcome:

$ ./arm-decode firmware.bin

0x00000000: E3A00001  MOV   R0, #1         ; R0 = 1
0x00000004: E3A01002  MOV   R1, #2         ; R1 = 2
0x00000008: E0802001  ADD   R2, R0, R1     ; R2 = R0 + R1
0x0000000C: E1A0F00E  MOV   PC, LR         ; Return (PC = LR)
0x00000010: 0A000003  BEQ   0x00000024     ; Branch if Z=1
0x00000014: E59F0010  LDR   R0, [PC, #16]  ; Load from 0x14+8+16 = 0x2C
0x00000018: E1520000  CMP   R2, R0         ; Compare R2 with R0
0x0000001C: C2833005  ADDGT R3, R3, #5     ; If greater, R3 += 5

Instruction breakdown for 0xE0802001:
  Cond: 1110 (AL - Always)
  Type: 00 (Data Processing)
  I-bit: 0 (Operand2 is a register)
  OpCode: 0100 (ADD)
  S-bit: 0 (Don't update flags)
  Rn: 0000 (R0)
  Rd: 0010 (R2)
  Operand2: 000000000001 (R1, no shift)
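
The breakdown above can be reproduced with nothing more than shifts and masks. Here is a minimal sketch; the `dp_fields_t` struct and the `bits` and `decode_dp` helpers are hypothetical names chosen for illustration, not part of any ARM toolchain.

```c
#include <stdint.h>

/* Hypothetical container for the fields of a data-processing instruction. */
typedef struct {
    uint8_t  cond;      /* bits 31-28 */
    uint8_t  imm_flag;  /* bit 25 (I): 1 = Operand2 is an immediate */
    uint8_t  opcode;    /* bits 24-21 */
    uint8_t  s_bit;     /* bit 20 */
    uint8_t  rn;        /* bits 19-16 */
    uint8_t  rd;        /* bits 15-12 */
    uint16_t operand2;  /* bits 11-0 */
} dp_fields_t;

/* Extract `width` bits (width < 32) starting at bit position `lsb`. */
static uint32_t bits(uint32_t word, unsigned lsb, unsigned width)
{
    return (word >> lsb) & ((1u << width) - 1u);
}

/* Split a 32-bit ARM data-processing instruction into its fields. */
static dp_fields_t decode_dp(uint32_t insn)
{
    dp_fields_t f;
    f.cond     = (uint8_t)bits(insn, 28, 4);
    f.imm_flag = (uint8_t)bits(insn, 25, 1);
    f.opcode   = (uint8_t)bits(insn, 21, 4);
    f.s_bit    = (uint8_t)bits(insn, 20, 1);
    f.rn       = (uint8_t)bits(insn, 16, 4);
    f.rd       = (uint8_t)bits(insn, 12, 4);
    f.operand2 = (uint16_t)bits(insn, 0, 12);
    return f;
}
```

Calling `decode_dp(0xE0802001)` yields cond 0xE, opcode 0x4, Rn 0, Rd 2 and Operand2 1, matching the listing. A real disassembler would first check bits 27-26 before trusting these fields.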

Implementation Hints:

ARM instruction encoding follows a structured pattern. The top 4 bits (31-28) are always the condition code:

Condition Codes (bits 31-28):
0000 = EQ (Equal, Z=1)
0001 = NE (Not Equal, Z=0)
0010 = CS/HS (Carry Set/Unsigned Higher or Same)
...
1110 = AL (Always - most common)
1111 = NV (Never - special meaning in ARMv5+)
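
If you later turn the decoder into an emulator, each condition code becomes a small predicate over the N, Z, C and V flags. The sketch below uses the standard condition table from the ARM Architecture Reference Manual; the function name `cond_passes` is a hypothetical choice.

```c
#include <stdbool.h>
#include <stdint.h>

/* Returns true if an instruction whose condition field (bits 31-28) is `cond`
 * would execute, given the N, Z, C, V flags from the CPSR. */
static bool cond_passes(uint8_t cond, bool n, bool z, bool c, bool v)
{
    switch (cond & 0xF) {
    case 0x0: return z;               /* EQ */
    case 0x1: return !z;              /* NE */
    case 0x2: return c;               /* CS/HS */
    case 0x3: return !c;              /* CC/LO */
    case 0x4: return n;               /* MI */
    case 0x5: return !n;              /* PL */
    case 0x6: return v;               /* VS */
    case 0x7: return !v;              /* VC */
    case 0x8: return c && !z;         /* HI */
    case 0x9: return !c || z;         /* LS */
    case 0xA: return n == v;          /* GE */
    case 0xB: return n != v;          /* LT */
    case 0xC: return !z && (n == v);  /* GT */
    case 0xD: return z || (n != v);   /* LE */
    case 0xE: return true;            /* AL */
    default:  return false;           /* 0xF: not an ordinary condition, handle separately */
    }
}
```

Treating 0xF as "never" is a simplification for this sketch; on ARMv5 and later that encoding selects a separate group of unconditional instructions.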

The next bits determine instruction class:

Bits 27-25 determine the instruction type:
000 = Data Processing / Multiply
001 = Data Processing (immediate)
010 = Load/Store (immediate offset)
011 = Load/Store (register offset)
100 = Load/Store Multiple
101 = Branch
110 = Coprocessor
111 = Software Interrupt / Coprocessor

Questions to guide your implementation:

  • How do you extract specific bit ranges from a 32-bit word?
  • What’s the difference between (instruction >> 28) & 0xF and instruction >> 28?
  • How do you build a lookup table for opcode mnemonics?
  • When Operand2 is an immediate, how does the 4-bit rotation work?
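
For the last question, the immediate form of Operand2 stores an 8-bit value in bits 7-0 and a 4-bit rotation count in bits 11-8; the value is rotated right by twice that count. A small sketch (the helper names `ror32` and `decode_imm_operand2` are made up for this example):

```c
#include <stdint.h>

/* Rotate a 32-bit value right by n bits. */
static uint32_t ror32(uint32_t value, unsigned n)
{
    n &= 31u;
    return n ? (value >> n) | (value << (32u - n)) : value;
}

/* Decode the immediate form of Operand2 from a data-processing instruction:
 * result = imm8 rotated right by (2 * rotate field). */
static uint32_t decode_imm_operand2(uint32_t insn)
{
    uint32_t rot  = (insn >> 8) & 0xFu;  /* bits 11-8 */
    uint32_t imm8 = insn & 0xFFu;        /* bits 7-0 */
    return ror32(imm8, 2u * rot);
}
```

With this, `MOV R0, #1` (0xE3A00001) decodes to 1, while an Operand2 of 0x4FF (rotation field 4, value 0xFF) decodes to 0xFF000000. This is why only some 32-bit constants can be encoded directly in an instruction.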

Start simple: decode just MOV and ADD with register operands. Then expand to immediates, then load/store, then branches.

Learning milestones:

  1. You decode condition codes correctly → You understand conditional execution is ARM’s core feature
  2. You parse data processing instructions → You understand the ALU instruction format
  3. You handle the barrel shifter and immediates → You understand ARM’s flexible operand encoding
  4. You decode load/store and branches → You understand the complete instruction set structure

Project 2: ARM Assembly Calculator

  • File: LEARN_ARM_DEEP_DIVE.md
  • Main Programming Language: ARM Assembly
  • Alternative Programming Languages: N/A (this must be pure assembly)
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 1: Beginner
  • Knowledge Area: ARM Assembly / Basic Arithmetic
  • Software or Tool: QEMU (ARM emulator), or Raspberry Pi
  • Main Book: “ARM Assembly By Example” (armasm.com)

What you’ll build: A four-function calculator entirely in ARM assembly that reads two numbers and an operator from stdin, performs the calculation (add, subtract, multiply, divide), and outputs the result.

Why it teaches ARM: This is your “Hello World” for ARM assembly. You’ll learn registers, arithmetic instructions, system calls, branching, and basic program structure. By handling I/O without libc, you understand how programs interact with the operating system at the lowest level.

Core challenges you’ll face:

  • Understanding ARM calling conventions → maps to which registers hold what
  • Making Linux system calls → maps to SVC instruction and syscall numbers
  • Converting ASCII to integers and back → maps to arithmetic and loops
  • Implementing multiplication/division → maps to MUL, UDIV instructions (or software division)
  • Branching based on operator → maps to conditional execution and CMP

Key Concepts:

  • ARM Registers and Basic Instructions: Azeria Labs “Writing ARM Assembly Part 1”
  • Linux System Calls on ARM: “ARM Assembly By Example” - armasm.com
  • ASCII and Number Conversion: “Introduction to Computer Organization: ARM Edition” Chapter 7 - Robert Plantz
  • ARM Multiply Instructions: “The Art of ARM Assembly, Vol 1” Chapter 7 - Randall Hyde

Difficulty: Beginner
Time estimate: Weekend
Prerequisites: Understanding of basic programming concepts (variables, loops, conditionals). Ability to use command line and text editor. Project 1 helps but isn’t required.

Real world outcome:

$ ./armcalc
Enter first number: 42
Enter operator (+, -, *, /): *
Enter second number: 13
Result: 546

$ ./armcalc
Enter first number: 100
Enter operator (+, -, *, /): /
Enter second number: 7
Result: 14 remainder 2

$ ./armcalc
Enter first number: 255
Enter operator (+, -, *, /): +
Enter second number: 256
Result: 511
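
The division case in the sample session (100 / 7 = 14 remainder 2) can be computed without a hardware divide instruction. Below is a C model of the classic shift-and-subtract algorithm, meant as a reference to check your assembly against; `udiv_soft` is a hypothetical name, and returning 0 on division by zero is an assumption made for this sketch.

```c
#include <stdint.h>

/* Unsigned division by shift-and-subtract (restoring division).
 * Returns the quotient and stores the remainder in *rem. */
static uint32_t udiv_soft(uint32_t n, uint32_t d, uint32_t *rem)
{
    uint32_t q = 0;
    uint32_t r = 0;

    if (d == 0) {            /* assumption: report quotient 0, remainder n */
        *rem = n;
        return 0;
    }
    for (int i = 31; i >= 0; i--) {
        r = (r << 1) | ((n >> i) & 1u);  /* bring down the next dividend bit */
        if (r >= d) {                    /* divisor fits: subtract and set quotient bit */
            r -= d;
            q |= (1u << i);
        }
    }
    *rem = r;
    return q;
}
```

Each loop iteration maps to a handful of ARM instructions (a shift, a compare, and a conditional subtract), which makes this a good exercise in conditional execution.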

Implementation Hints:

Your program structure will look like this:

.data section:
    - Prompt strings ("Enter first number: ", etc.)
    - Input buffers
    - Result format strings

.text section:
    - _start: Entry point
    - read_number: Read string, convert ASCII → integer
    - print_number: Convert integer → ASCII, print
    - do_add, do_sub, do_mul, do_div: Arithmetic routines

Linux system calls on ARM use the SVC #0 instruction:

- R7 = syscall number (1=exit, 3=read, 4=write)
- R0 = first argument (fd for read/write)
- R1 = second argument (buffer address)
- R2 = third argument (count)
- Return value in R0

Key questions to answer:

  • How do you convert the character ‘5’ to the number 5? (Hint: subtract ‘0’)
  • How do you handle multi-digit numbers? (Hint: multiply accumulator by 10, add digit)
  • What if your core doesn’t have a divide instruction? (UDIV/SDIV are optional in ARMv7-A: Cortex-A8 and Cortex-A9 lack them, Cortex-A7 and Cortex-A15 have them. Older architectures always need software division.)
  • How do you compare the operator character to ‘+’, ‘-’, ‘*’, ‘/’?
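
The first two hints translate into two short routines. Here is a C reference model you could port to assembly line by line; `parse_uint` and `format_uint` are hypothetical names, and the sketch assumes unsigned input with no overflow checking.

```c
#include <stddef.h>
#include <stdint.h>

/* ASCII to integer: for each digit, multiply the accumulator by 10 and add
 * (character - '0'). Stops at the first non-digit, such as the newline. */
static uint32_t parse_uint(const char *s)
{
    uint32_t value = 0;
    while (*s >= '0' && *s <= '9') {
        value = value * 10u + (uint32_t)(*s - '0');
        s++;
    }
    return value;
}

/* Integer to ASCII: peel off digits with % 10 (least significant first),
 * then reverse them into `out`, which needs room for 11 bytes.
 * Returns the number of digits written. */
static size_t format_uint(uint32_t value, char *out)
{
    char tmp[10];  /* a 32-bit value has at most 10 decimal digits */
    size_t n = 0;

    do {
        tmp[n++] = (char)('0' + value % 10u);
        value /= 10u;
    } while (value != 0);

    for (size_t i = 0; i < n; i++)
        out[i] = tmp[n - 1 - i];
    out[n] = '\0';
    return n;
}
```

Note that `format_uint` itself needs division by 10, so on a core without UDIV the print routine depends on your software division as well.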

Start with just addition. Get input working, parsing working, then output. Once that works, add the other operations.

Learning milestones:

  1. You print “Hello, ARM!” → You understand system calls and basic structure
  2. You read a number from input → You understand data sections and I/O
  3. You perform calculations on registers → You understand arithmetic instructions
  4. You handle all four operations → You understand branching and program flow

Project 3: Bare-Metal LED Blinker

  • File: LEARN_ARM_DEEP_DIVE.md
  • Main Programming Language: C with ARM Assembly startup
  • Alternative Programming Languages: Pure ARM Assembly, Rust
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Bare-Metal / GPIO / Embedded Systems
  • Software or Tool: STM32 Nucleo board or Raspberry Pi (bare-metal mode)
  • Main Book: “Making Embedded Systems, 2nd Edition” by Elecia White

What you’ll build: An LED blinker that runs directly on ARM hardware with NO operating system. You’ll write the startup code, linker script, and GPIO control—just you and the silicon.

Why it teaches ARM: This strips away all abstractions. No OS, no libraries, no runtime. You’ll understand: how ARM boots, what the vector table does, how to configure GPIO registers directly, and what “bare metal” really means. This is the foundational skill for all embedded systems work.

Core challenges you’ll face:

  • Writing the vector table and reset handler → maps to ARM exception model
  • Creating a linker script → maps to memory layout (Flash, RAM, stack)
  • Configuring system clocks → maps to RCC registers and clock tree
  • Controlling GPIO registers → maps to memory-mapped I/O
  • Creating accurate delays without OS → maps to timer peripherals or cycle counting

Key Concepts:

  • ARM Startup Sequence: “Making Embedded Systems” Chapter 3 - Elecia White
  • Linker Scripts: “Bare Metal C” Chapter 2 - Steve Oualline
  • Memory-Mapped I/O: “Computer Systems: A Programmer’s Perspective” Chapter 9 - Bryant & O’Hallaron
  • GPIO Configuration: STM32F4 Reference Manual - ST Microelectronics (free PDF)
  • Clock Configuration: “Making Embedded Systems” Chapter 5 - Elecia White

Difficulty: Advanced
Time estimate: 1-2 weeks
Prerequisites: Project 2 (ARM assembly basics), C programming, understanding of hexadecimal and bitwise operations. Need either STM32 Nucleo board (~$15) or Raspberry Pi.

Real world outcome:

Physical result: An LED on your board blinks at 1Hz

Serial output (if you add UART later):
Starting bare-metal LED blinker...
System clock: 16 MHz
GPIO configured: PA5 as output
Blink cycle: 500ms on, 500ms off
[LED toggles visibly on the board]

Implementation Hints:

Your project will have this structure:

project/
├── startup.s       # Vector table, reset handler, stack init
├── main.c          # LED blink logic
├── linker.ld       # Memory layout script
├── Makefile        # Build with arm-none-eabi-gcc
└── stm32f4xx.h     # Register definitions (or write your own!)

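The ARM Cortex-M boot sequence (classic ARM cores such as ARM7 or Cortex-A instead start executing at the reset vector itself):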

1. Power on → CPU fetches initial SP from address 0x00000000
2. CPU fetches reset vector from address 0x00000004
3. CPU jumps to reset handler address
4. Reset handler: copy .data from Flash to RAM, zero .bss, call main()
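
Step 4 is ordinary C once the section boundaries are known. In a real project the five pointers below come from symbols defined in your linker script (names such as `_sdata` or `_ebss` are common conventions, not requirements); this sketch takes them as parameters so the logic can be tried on a host machine. `init_ram` is a hypothetical name.

```c
#include <stdint.h>

/* Copy initialized data from its load address in Flash to its run address
 * in RAM, then zero the .bss region. All regions are word-aligned. */
static void init_ram(const uint32_t *load_src,
                     uint32_t *data_start, uint32_t *data_end,
                     uint32_t *bss_start, uint32_t *bss_end)
{
    for (uint32_t *dst = data_start; dst < data_end; )
        *dst++ = *load_src++;   /* .data: Flash image -> RAM */

    for (uint32_t *dst = bss_start; dst < bss_end; )
        *dst++ = 0;             /* .bss: C requires zero-initialized statics */
}
```

On the target, the reset handler would call this with the linker symbols and then call `main()`. Skipping it is a classic source of bugs where global variables hold garbage on startup.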

The vector table (first thing in Flash memory):

.section .vectors
vectors:
    .word   _stack_top      @ Initial Stack Pointer
    .word   reset_handler   @ Reset Handler
    .word   nmi_handler     @ NMI
    .word   hardfault_handler @ Hard Fault
    ... more exception vectors ...

GPIO control on STM32 (memory-mapped registers):

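#include <stdint.h>

// These are memory addresses, not variables! (STM32F4 addresses)
#define RCC_AHB1ENR   (*(volatile uint32_t *)(0x40023800 + 0x30))
#define GPIOA_BASE    0x40020000
#define GPIOA_MODER   (*(volatile uint32_t *)(GPIOA_BASE + 0x00))
#define GPIOA_ODR     (*(volatile uint32_t *)(GPIOA_BASE + 0x14))

// Enable the GPIOA clock first; without it the GPIO writes have no effect
RCC_AHB1ENR |= (1u << 0);

// Set PA5 as output: MODER bits [11:10] = 01 (clear both, then set bit 10)
GPIOA_MODER = (GPIOA_MODER & ~(3u << 10)) | (1u << 10);

// Toggle LED
GPIOA_ODR ^= (1u << 5);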

Key questions:

  • Why must GPIO registers be declared volatile?
  • What happens if you forget to enable the GPIO clock in RCC?
  • How do you calculate the number of loop iterations for a 500ms delay at 16MHz?
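
For the last question, the arithmetic is clock cycles divided by cycles per loop iteration. The sketch below assumes you know how many cycles one iteration of your delay loop takes; that number depends on the core, the compiler output, and Flash wait states, so treat the value 4 used in the example as a placeholder to be measured. `delay_iterations` is a hypothetical helper.

```c
#include <stdint.h>

/* Number of busy-loop iterations needed for `delay_ms` milliseconds,
 * given the CPU clock and the measured cost of one loop iteration. */
static uint32_t delay_iterations(uint32_t cpu_hz, uint32_t delay_ms,
                                 uint32_t cycles_per_iter)
{
    /* 64-bit intermediate so cpu_hz * delay_ms cannot overflow */
    uint64_t cycles = (uint64_t)cpu_hz * delay_ms / 1000u;
    return (uint32_t)(cycles / cycles_per_iter);
}
```

At 16 MHz, 500 ms is 8,000,000 cycles, so a 4-cycle loop needs 2,000,000 iterations. A hardware timer is more accurate, but this gets the LED blinking.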

Learning milestones:

  1. Your code compiles with arm-none-eabi-gcc → You understand cross-compilation
  2. You flash the binary and CPU doesn’t crash → Vector table is correct
  3. LED lights up (even if stuck) → GPIO configuration works
  4. LED blinks at correct rate → You understand timing without an OS

Project 4: UART Driver from Scratch

  • File: LEARN_ARM_DEEP_DIVE.md
  • Main Programming Language: C
  • Alternative Programming Languages: ARM Assembly, Rust
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 2. The “Micro-SaaS / Pro Tool”
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Serial Communication / Peripheral Programming
  • Software or Tool: STM32 board, USB-to-Serial adapter
  • Main Book: “The Linux Programming Interface” by Michael Kerrisk (for comparison)

What you’ll build: A complete UART (serial) driver for bare-metal ARM that supports both polling and interrupt-driven I/O, with configurable baud rates, and can serve as a debug console.

Why it teaches ARM: UART is the “printf debugging” of embedded systems. Building it from scratch teaches you: peripheral register programming, interrupt configuration (NVIC), baud rate calculations, and how the CPU interacts with external hardware. This becomes your foundation for all other peripheral drivers.

Core challenges you’ll face:

  • Configuring UART registers → maps to understanding peripheral register maps
  • Calculating baud rate divisors → maps to clock relationships and math
  • Implementing interrupt handlers → maps to ARM exception model and NVIC
  • Creating ring buffers for async I/O → maps to data structures for real-time systems
  • Handling framing errors → maps to error detection and recovery

Key Concepts:

  • UART Protocol Basics: “Making Embedded Systems” Chapter 9 - Elecia White
  • ARM Interrupt Handling: “The Definitive Guide to ARM Cortex-M3 and Cortex-M4” Chapter 8 - Joseph Yiu
  • Ring Buffer Implementation: “Mastering Algorithms with C” Chapter 6 - Kyle Loudon
  • Baud Rate Generation: STM32 Reference Manual Section on USART
  • NVIC Configuration: ARM Cortex-M Technical Reference Manual

Difficulty: Advanced
Time estimate: 1-2 weeks
Prerequisites: Project 3 (bare-metal basics), understanding of serial communication concepts, interrupt concepts.

Real world outcome:

# On your computer, connected via USB-serial adapter:
$ screen /dev/ttyUSB0 115200

=== ARM UART Driver Test ===
UART initialized at 115200 baud
Echo mode active. Type characters:

> Hello, ARM!
You typed: Hello, ARM!

> test interrupt
IRQ count: 47, TX: 14, RX: 14, Errors: 0

> stats
Buffer status: RX 0/64, TX 0/64
Overruns: 0, Framing errors: 0

Implementation Hints:

UART register structure (simplified for STM32):

USART_BASE
├── SR   (Status Register)     - TXE, RXNE, ORE flags
├── DR   (Data Register)       - Read/write data here
├── BRR  (Baud Rate Register)  - Divisor for baud rate
├── CR1  (Control Register 1)  - Enable UART, TX, RX, interrupts
├── CR2  (Control Register 2)  - Stop bits
└── CR3  (Control Register 3)  - Flow control

Baud rate calculation:

USARTDIV = fPCLK / (16 * BaudRate)

For 115200 baud at 16MHz:
USARTDIV = 16,000,000 / (16 * 115200) = 8.68 ≈ 8 + 11/16
Mantissa = 8, Fraction = 11
BRR = (8 << 4) | 11 = 0x8B
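
Because the register stores the divisor as a 12-bit mantissa plus a 4-bit fraction in sixteenths, the whole calculation collapses to one rounded integer division of the peripheral clock by the baud rate. A sketch assuming 16x oversampling, with `usart_brr` as a hypothetical helper name:

```c
#include <stdint.h>

/* BRR value for 16x oversampling. The register holds the divisor times 16
 * (mantissa << 4 | fraction), which equals pclk / baud rounded to nearest. */
static uint32_t usart_brr(uint32_t pclk_hz, uint32_t baud)
{
    return (pclk_hz + baud / 2u) / baud;
}
```

`usart_brr(16000000, 115200)` returns 0x8B, the same value as the manual calculation. The actual baud rate is then 16,000,000 / 139 ≈ 115,108, an error of about 0.08%, well within what UART receivers tolerate.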

Polling vs Interrupts:

Polling (simple but wastes CPU):
while (!(USART_SR & USART_SR_TXE));  // Wait for TX empty
USART_DR = character;

Interrupts (efficient but complex):
1. Enable RXNE interrupt in CR1
2. Configure NVIC priority for USARTx_IRQn
3. In interrupt handler:
   - Check which flag triggered (RXNE? TXE?)
   - Read/write DR as appropriate
   - Clear flags if needed

Ring buffer structure for interrupt-driven I/O:

struct ring_buffer {
    uint8_t data[64];
    volatile uint8_t head;  // Write position
    volatile uint8_t tail;  // Read position
};

// In RX interrupt: add to buffer
buf.data[buf.head] = USART_DR;
buf.head = (buf.head + 1) % 64;

Key questions:

  • Why must the buffer indices be volatile?
  • What happens if the ring buffer overflows?
  • How do you handle the case where the user types faster than you can process?
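
One common answer to the overflow question is to refuse the write when the buffer is full and count the dropped byte as an overrun. The sketch below extends the structure above with that check; it keeps one slot empty to tell "full" from "empty", so a 64-byte buffer holds 63 bytes. The names `ring_buffer_t`, `rb_put` and `rb_get` are illustrative.

```c
#include <stdbool.h>
#include <stdint.h>

#define RB_SIZE 64u

typedef struct {
    uint8_t data[RB_SIZE];
    volatile uint8_t head;  /* write position, changed only by the producer (RX ISR) */
    volatile uint8_t tail;  /* read position, changed only by the consumer (main loop) */
} ring_buffer_t;

/* Called from the RX interrupt. Returns false if the buffer is full. */
static bool rb_put(ring_buffer_t *rb, uint8_t byte)
{
    uint8_t next = (uint8_t)((rb->head + 1u) % RB_SIZE);
    if (next == rb->tail)
        return false;            /* full: drop the byte, caller counts an overrun */
    rb->data[rb->head] = byte;
    rb->head = next;             /* publish only after the data is stored */
    return true;
}

/* Called from the main loop. Returns false if the buffer is empty. */
static bool rb_get(ring_buffer_t *rb, uint8_t *out)
{
    if (rb->tail == rb->head)
        return false;            /* empty */
    *out = rb->data[rb->tail];
    rb->tail = (uint8_t)((rb->tail + 1u) % RB_SIZE);
    return true;
}
```

Since each index has exactly one writer, this single-producer, single-consumer design needs no interrupt masking on a single-core Cortex-M.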

Learning milestones:

  1. You see characters in the terminal → Basic TX works
  2. Echo mode works → Both TX and RX work
  3. Interrupts don’t crash the system → NVIC is configured correctly
  4. No characters lost at high speed → Ring buffers work correctly

Project 5: ARM Memory Allocator (malloc from Scratch)

  • File: LEARN_ARM_DEEP_DIVE.md
  • Main Programming Language: C
  • Alternative Programming Languages: Rust, ARM Assembly
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Memory Management / Data Structures
  • Software or Tool: QEMU ARM emulator or bare-metal STM32
  • Main Book: “Computer Systems: A Programmer’s Perspective” by Bryant & O’Hallaron

What you’ll build: A custom heap memory allocator for bare-metal ARM systems, implementing malloc(), free(), and realloc() with proper alignment, coalescing, and fragmentation handling.

Why it teaches ARM: ARM expects data at its natural alignment (4 bytes for a 32-bit word, 8 bytes for a 64-bit value), so an allocator has to return suitably aligned pointers. Building malloc teaches you: memory layout, pointer arithmetic, alignment padding, and how dynamic memory really works. On bare-metal, there’s no sbrk(); you manage a fixed memory pool.

Core challenges you’ll face:

  • Maintaining free lists → maps to linked list structures in raw memory
  • Block splitting and coalescing → maps to algorithms for memory efficiency
  • Alignment requirements → maps to ARM alignment rules and performance
  • Heap corruption detection → maps to debug techniques and canary values
  • Working without sbrk → maps to fixed-pool allocation

Key Concepts:

  • Dynamic Memory Allocation: “Computer Systems: A Programmer’s Perspective” Chapter 9.9 - Bryant & O’Hallaron
  • Free List Algorithms: “The Art of Computer Programming” Vol 1, Chapter 2.5 - Donald Knuth
  • ARM Alignment Rules: “The Art of ARM Assembly, Vol 1” Chapter 3 - Randall Hyde
  • Pool Allocators: “Game Programming Patterns” Chapter 19 - Robert Nystrom

Difficulty: Advanced
Time estimate: 1-2 weeks
Prerequisites: Strong C pointer skills, understanding of memory layout, Project 3 (bare-metal environment).

Real world outcome:

// Test program output:
=== ARM Malloc Test Suite ===

Test 1: Basic allocation
  malloc(32) = 0x20001008 
  malloc(64) = 0x20001030 
  malloc(16) = 0x20001078 

Test 2: Free and reuse
  free(0x20001030)
  malloc(32) = 0x20001030   (reused!)

Test 3: Coalescing
  free(0x20001008)
  free(0x20001030)
  malloc(100) = 0x20001008   (coalesced block)

Test 4: Alignment
  malloc(1) = 0x20001090 (aligned to 8)
  malloc(7) = 0x200010A0 (aligned to 8)

Heap stats:
  Total pool: 4096 bytes
  Allocated: 148 bytes
  Free: 3948 bytes
  Fragments: 2
  Largest free block: 3916 bytes

Implementation Hints:

Block header structure:

typedef struct block_header {
    size_t size;                    // Block size (including header)
    struct block_header *next;      // Next free block (if free)
    uint32_t magic;                 // 0xDEADBEEF = allocated, 0xFEEDFACE = free
} block_header_t;

Memory layout:

Heap Pool (e.g., 4KB starting at 0x20001000):
┌──────────────────────────────────────────────────┐
│ HDR │ User Data... │ HDR │ User Data... │ FREE...│
└──────────────────────────────────────────────────┘
  ↑                     ↑
  Block 1 (allocated)   Block 2 (allocated)

Alignment calculation:

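// Round up to an 8-byte boundary (safe for 64-bit values on ARM32 and AArch64)
#define ALIGN8(x) (((x) + 7u) & ~(size_t)7)

void *malloc(size_t size) {
    // Align header and payload separately so the returned pointer stays aligned
    size_t aligned_size = ALIGN8(sizeof(block_header_t)) + ALIGN8(size);
    // ... find or split a free block of at least aligned_size ...
    return NULL;  // Placeholder until the free-list search is implemented
}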

Free list strategies:

  • First fit: Take the first block that’s big enough (fast, but fragments)
  • Best fit: Find the smallest adequate block (slower, less fragmentation)
  • Segregated lists: Multiple lists for different size classes (fast and efficient)

Coalescing:

Before free(B):
┌─────────┬─────────┬─────────┐
│ A (free)│ B (used)│ C (free)│
└─────────┴─────────┴─────────┘

After free(B) with coalescing:
┌─────────────────────────────┐
│      A+B+C (free)           │
└─────────────────────────────┘
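
To make splitting and coalescing concrete, here is a compact first-fit sketch over a fixed pool. It is deliberately simpler than the header shown earlier: it uses a hypothetical two-field header (`hdr_t`) and walks blocks by size instead of keeping a linked free list, so neighbors are found by address arithmetic. The names `pool_alloc` and `pool_free` are chosen to avoid clashing with the C library.

```c
#include <stddef.h>
#include <stdint.h>

#define POOL_SIZE 4096u
#define ALIGN8(x) (((x) + 7u) & ~(size_t)7)

typedef struct {
    size_t size;  /* whole block in bytes, header included */
    size_t used;  /* 1 = allocated, 0 = free */
} hdr_t;

#define HDR_SIZE ALIGN8(sizeof(hdr_t))

static uint64_t pool_words[POOL_SIZE / 8];           /* uint64_t forces 8-byte alignment */
static uint8_t *const pool = (uint8_t *)pool_words;
static int pool_ready;

static hdr_t *next_block(hdr_t *h) { return (hdr_t *)((uint8_t *)h + h->size); }
static int in_pool(hdr_t *h) { return (uint8_t *)h < pool + POOL_SIZE; }

void *pool_alloc(size_t size)
{
    if (!pool_ready) {                 /* first call: one big free block */
        hdr_t *first = (hdr_t *)pool;
        first->size = POOL_SIZE;
        first->used = 0;
        pool_ready = 1;
    }
    if (size == 0 || size > POOL_SIZE)
        return NULL;

    size_t need = HDR_SIZE + ALIGN8(size);
    for (hdr_t *h = (hdr_t *)pool; in_pool(h); h = next_block(h)) {
        if (h->used || h->size < need)
            continue;                              /* first fit: skip unsuitable blocks */
        if (h->size - need >= HDR_SIZE + 8u) {     /* split if the remainder is usable */
            hdr_t *rest = (hdr_t *)((uint8_t *)h + need);
            rest->size = h->size - need;
            rest->used = 0;
            h->size = need;
        }
        h->used = 1;
        return (uint8_t *)h + HDR_SIZE;
    }
    return NULL;                                   /* out of memory */
}

void pool_free(void *ptr)
{
    if (ptr == NULL)
        return;
    hdr_t *h = (hdr_t *)((uint8_t *)ptr - HDR_SIZE);
    h->used = 0;

    /* Coalesce: merge every run of adjacent free blocks in one pass. */
    for (hdr_t *b = (hdr_t *)pool; in_pool(b); b = next_block(b)) {
        while (!b->used && in_pool(next_block(b)) && !next_block(b)->used)
            b->size += next_block(b)->size;
    }
}
```

This sketch does no double-free or corruption detection, and the full-pool walk in `pool_free` is O(n). Adding the magic field and footers (boundary tags) are natural next steps.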

Key questions:

  • How do you find neighboring blocks for coalescing?
  • What happens if someone frees a pointer twice?
  • How do you detect heap corruption?

Learning milestones:

  1. malloc/free work for single allocations → Basic structure is correct
  2. Memory is reused after free → Free list works
  3. Coalescing prevents fragmentation → Block merging works
  4. Stress test passes → Ready for real use

Project 6: Simple Bootloader

  • File: LEARN_ARM_DEEP_DIVE.md
  • Main Programming Language: ARM Assembly + C
  • Alternative Programming Languages: Pure Assembly
  • Coolness Level: Level 5: Pure Magic (Super Cool)
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 4: Expert
  • Knowledge Area: Boot Process / Flash Programming / Firmware
  • Software or Tool: STM32 with dual-bank Flash, or QEMU
  • Main Book: “Bare Metal C” by Steve Oualline

What you’ll build: A bootloader that initializes hardware, provides a serial interface for firmware updates, validates new firmware (CRC check), and chains to the main application. Like a mini U-Boot for your microcontroller.

Why it teaches ARM: Bootloaders are the first code that runs. You’ll understand: ARM reset sequence, Flash programming from code, memory remapping, jump to application code, and firmware update protocols. This is critical knowledge for any production embedded system.

Core challenges you’ll face:

  • Flash self-programming → maps to unlocking and writing to Flash from code
  • XMODEM or custom protocol → maps to reliable data transfer over serial
  • CRC validation → maps to integrity checking algorithms
  • Jumping to application → maps to vector table relocation, stack setup
  • Fitting in limited space → maps to code size optimization

Key Concepts:

  • ARM Boot Sequence: “Making Embedded Systems” Chapter 10 - Elecia White
  • Flash Programming: STM32 Flash Programming Manual - ST Microelectronics
  • XMODEM Protocol: Chuck Forsberg’s Original Specification
  • CRC Algorithms: “Hacker’s Delight” Chapter 14 - Henry S. Warren
  • Vector Table Offset Register: ARM Cortex-M Technical Reference Manual

Difficulty: Expert
Time estimate: 2-4 weeks
Prerequisites: Projects 3 and 4 (bare-metal and UART), understanding of Flash memory concepts.

Real world outcome:

$ screen /dev/ttyUSB0 115200

========================================
    ARM Custom Bootloader v1.0
========================================
Flash: 512KB (Bootloader: 32KB, App: 480KB)
Current app: CRC 0x1A2B3C4D, Valid

Commands:
  [1] Boot application
  [2] Upload new firmware (XMODEM)
  [3] Verify current firmware
  [4] Dump flash info
  [5] Erase application area

> 2
Ready to receive firmware via XMODEM...
Send file now (XMODEM protocol)

CCCC
Receiving: [====================] 48KB
CRC check: PASS (0x5E6F7A8B)
Flashing: [====================] Done

Firmware updated successfully!
Rebooting into new application...

> 1
Jumping to application at 0x08008000...

=== Application Starting ===
Hello from the new firmware!

Implementation Hints:

Memory layout:

Flash Memory (512KB example):
┌──────────────────────────────────────────────────┐
│ 0x08000000-0x08007FFF: Bootloader (32KB)         │
├──────────────────────────────────────────────────┤
│ 0x08008000-0x0807FFFF: Application (480KB)       │
│   ├── Vector Table (0x08008000)                  │
│   ├── .text (code)                               │
│   └── .rodata (constants)                        │
└──────────────────────────────────────────────────┘

Jumping to application:

void jump_to_app(uint32_t app_address) {
    // Mask interrupts so no bootloader ISR fires during the handover
    // (the application must re-enable them with __enable_irq())
    __disable_irq();

    // 1. Get the application's initial stack pointer
    uint32_t app_sp = *(volatile uint32_t *)app_address;

    // 2. Get the application's reset handler address (bit 0 set = Thumb)
    uint32_t app_reset = *(volatile uint32_t *)(app_address + 4);

    // 3. Set the Vector Table Offset Register
    SCB->VTOR = app_address;

    // 4. Set stack pointer and jump
    __set_MSP(app_sp);
    void (*app_entry)(void) = (void (*)(void))app_reset;
    app_entry();  // Never returns
}

Flash programming sequence (STM32):

1. Unlock Flash: Write keys 0x45670123, then 0xCDEF89AB, to FLASH_KEYR
2. Wait for BSY flag to clear
3. Erase the target sector or page (programming can only change bits from 1 to 0)
4. Set PG (Program) bit in FLASH_CR
5. Write data to Flash address (word at a time)
6. Wait for BSY flag
7. Check for errors (flag names vary by family, e.g. PGERR, WRPERR)
8. Lock Flash: Set LOCK bit

XMODEM basics:

Receiver sends: 'C' (CRC mode) or NAK
Sender sends: SOH (0x01), Block#, ~Block#, 128 bytes, CRC16
Receiver sends: ACK or NAK
Repeat until EOT
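
The receiver side of that exchange boils down to checking the header bytes and the CRC of each packet. A sketch of both pieces follows: CRC-16/XMODEM uses polynomial 0x1021 with an initial value of 0, and the CRC is sent high byte first. The function names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* CRC-16/XMODEM: polynomial 0x1021, initial value 0, no bit reflection. */
static uint16_t crc16_xmodem(const uint8_t *data, size_t len)
{
    uint16_t crc = 0;
    for (size_t i = 0; i < len; i++) {
        crc ^= (uint16_t)(data[i] << 8);
        for (int bit = 0; bit < 8; bit++) {
            if (crc & 0x8000u)
                crc = (uint16_t)((crc << 1) ^ 0x1021u);
            else
                crc = (uint16_t)(crc << 1);
        }
    }
    return crc;
}

/* Validate one 133-byte CRC-mode packet:
 * SOH, block number, inverted block number, 128 data bytes, CRC high, CRC low. */
static bool xmodem_packet_ok(const uint8_t pkt[133], uint8_t expected_block)
{
    if (pkt[0] != 0x01)                         /* SOH */
        return false;
    if (pkt[1] != expected_block)
        return false;
    if ((uint8_t)(pkt[1] + pkt[2]) != 0xFF)     /* second byte must be ~block */
        return false;
    uint16_t received = (uint16_t)((pkt[131] << 8) | pkt[132]);
    return crc16_xmodem(&pkt[3], 128) == received;
}
```

On success the receiver replies ACK (0x06) and increments the expected block number; on failure it replies NAK (0x15) and the sender retransmits. The bitwise CRC is small, which matters when the bootloader has to fit in 32KB.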

Key questions:

  • Why must you disable interrupts before jumping to the app?
  • How do you ensure the bootloader can recover from a failed update?
  • What if power is lost mid-flash?
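
One inexpensive safeguard for the recovery questions is to sanity-check the application's first two vector table words before jumping, since erased Flash reads as 0xFFFFFFFF. The sketch below uses the application range from the memory layout above; the RAM range (128 KB at 0x20000000) is an assumption you would adjust for your chip, and `app_vectors_plausible` is a hypothetical name.

```c
#include <stdbool.h>
#include <stdint.h>

#define APP_START  0x08008000u
#define APP_END    0x08080000u  /* one past 0x0807FFFF */
#define RAM_START  0x20000000u  /* assumption: SRAM base */
#define RAM_END    0x20020000u  /* assumption: 128 KB of SRAM */

/* Returns true if the initial SP and reset vector look like a real image. */
static bool app_vectors_plausible(uint32_t initial_sp, uint32_t reset_vector)
{
    /* The stack grows down, so the initial SP may equal the end of RAM. */
    bool sp_ok = initial_sp > RAM_START && initial_sp <= RAM_END
                 && (initial_sp % 8u) == 0;

    /* Cortex-M code is Thumb, so the vector must have bit 0 set. */
    bool thumb = (reset_vector & 1u) != 0;

    uint32_t entry = reset_vector & ~1u;
    bool pc_ok = entry >= APP_START && entry < APP_END;

    return sp_ok && thumb && pc_ok;
}
```

This only catches blank or grossly damaged images. It complements the CRC check rather than replacing it: a partially written image can still have a valid-looking vector table.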

Learning milestones:

  1. Bootloader boots and shows menu → Basic structure works
  2. You can receive data over XMODEM → Protocol implementation works
  3. Flash programming succeeds → You can modify your own Flash
  4. Application boots from bootloader → Vector table and jump work

Project 7: Context Switcher & Mini-Scheduler

  • File: LEARN_ARM_DEEP_DIVE.md
  • Main Programming Language: ARM Assembly + C
  • Alternative Programming Languages: Rust with inline assembly
  • Coolness Level: Level 5: Pure Magic (Super Cool)
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 4: Expert
  • Knowledge Area: Operating Systems / Concurrency
  • Software or Tool: STM32 board or QEMU
  • Main Book: “Operating Systems: Three Easy Pieces” by Arpaci-Dusseau

What you’ll build: A preemptive scheduler that runs multiple “tasks” concurrently on a single ARM core, with context switching driven by the SysTick timer. Like a tiny FreeRTOS kernel.

Why it teaches ARM: This is where you truly understand how operating systems work. You’ll learn: ARM processor modes, banked registers, exception entry/exit, stack management per task, and how multitasking is an illusion created by fast switching. This is the foundation of every RTOS.

Core challenges you’ll face:

  • Saving and restoring full context → maps to ARM register banking and stack frames
  • SysTick timer configuration → maps to periodic interrupts for preemption
  • Task Control Blocks (TCBs) → maps to per-task state management
  • Stack allocation per task → maps to memory partitioning
  • Avoiding race conditions → maps to critical sections and interrupt masking

Key Concepts:

  • Context Switching: “Operating Systems: Three Easy Pieces” Chapter 6 - Arpaci-Dusseau
  • ARM Exception Handling: “The Definitive Guide to ARM Cortex-M3” Chapter 8 - Joseph Yiu
  • SysTick Timer: “Making Embedded Systems” Chapter 6 - Elecia White
  • Process Control Blocks: “Modern Operating Systems” Chapter 2 - Tanenbaum
  • Critical Sections: “Operating Systems: Three Easy Pieces” Chapter 28

Difficulty: Expert
Time estimate: 2-4 weeks
Prerequisites: Projects 3-4 (bare-metal, UART), strong understanding of ARM assembly and stacks.

Real world outcome:

=== ARM Mini-Scheduler Demo ===

Creating tasks:
  Task 1: LED blinker (priority 1)
  Task 2: Serial echo (priority 2)
  Task 3: Counter display (priority 3)

Scheduler started, tick = 10ms

[00000010] Task3: Count = 1
[00000020] Task3: Count = 2
[00000025] Task2: Echo 'H'
[00000026] Task2: Echo 'i'
[00000030] Task3: Count = 3
[00000040] Task3: Count = 4
[00000500] Task1: LED ON
[00001000] Task1: LED OFF
[00001010] Task3: Count = 101

Context switches: 247
Task 1: 5% CPU, 12 switches
Task 2: 15% CPU, 89 switches
Task 3: 80% CPU, 146 switches

Implementation Hints:

Task Control Block (TCB):

typedef struct {
    uint32_t *sp;           // Saved stack pointer
    uint32_t stack[256];    // Task's private stack
    uint8_t priority;       // Scheduling priority
    uint8_t state;          // READY, RUNNING, BLOCKED
    char name[16];          // For debugging
} TCB_t;

ARM Cortex-M exception stack frame (automatically pushed on exception entry):

High addresses
    ┌─────────────┐
    │    xPSR     │  ← Original program status
    │     PC      │  ← Where to return to
    │     LR      │  ← Was the Link Register
    │     R12     │
    │     R3      │
    │     R2      │
    │     R1      │
    │     R0      │  ← SP points here after exception
    └─────────────┘
Low addresses

You must also save R4-R11 manually!
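A task that has never run still needs a stack that looks as if it had been interrupted, so that the first context switch can "return" into it. The sketch below builds such an initial image: eight zeroed words for R4-R11 beneath a fake hardware frame. The function name `task_stack_init` and its parameters are assumptions for illustration, and addresses are passed as plain 32-bit integers so the logic can also be exercised on a host machine.

```c
#include <stdint.h>

#define INITIAL_XPSR 0x01000000u  /* Only the Thumb bit set; required on Cortex-M */

/*
 * Build the initial stack image for a task that has never run.
 * stack_top:  one past the highest word of the task's stack (8-byte aligned)
 * entry_addr: address of the task function (hypothetical parameter)
 * exit_addr:  address to land on if the task function ever returns
 * Returns the value to store in the TCB's saved stack pointer.
 */
uint32_t *task_stack_init(uint32_t *stack_top, uint32_t entry_addr,
                          uint32_t exit_addr)
{
    uint32_t *sp = stack_top;

    /* Hardware-stacked frame, written from high to low addresses */
    *(--sp) = INITIAL_XPSR;       /* xPSR */
    *(--sp) = entry_addr & ~1u;   /* PC: task entry point */
    *(--sp) = exit_addr;          /* LR */
    *(--sp) = 0;                  /* R12 */
    *(--sp) = 0;                  /* R3 */
    *(--sp) = 0;                  /* R2 */
    *(--sp) = 0;                  /* R1 */
    *(--sp) = 0;                  /* R0: first argument to the task */

    /* Software-saved registers R11..R4, later restored by the context switch */
    for (int i = 0; i < 8; i++) {
        *(--sp) = 0;
    }
    return sp;
}
```

The returned pointer sits 16 words below `stack_top`, which is exactly where the saved stack pointer of a preempted task would be.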

Context switch in assembly (PendSV handler):

PendSV_Handler:
    CPSID   I                   @ Disable interrupts

    MRS     R0, PSP             @ Get current task's stack pointer
    STMDB   R0!, {R4-R11}       @ Push R4-R11 onto task stack

    LDR     R1, =current_task   @ Get current TCB pointer
    LDR     R2, [R1]
    STR     R0, [R2]            @ Save SP to current TCB (sp is the first field)

    PUSH    {R3, LR}            @ Preserve EXC_RETURN: BL overwrites LR (R3 keeps MSP 8-byte aligned)
    BL      scheduler           @ Call C scheduler, returns next TCB in R0
    POP     {R3, LR}            @ Restore EXC_RETURN

    LDR     R1, =current_task
    STR     R0, [R1]            @ Update current_task pointer
    LDR     R0, [R0]            @ Load new task's SP from TCB

    LDMIA   R0!, {R4-R11}       @ Pop R4-R11 from new task stack
    MSR     PSP, R0             @ Set PSP to new task's stack

    CPSIE   I                   @ Re-enable interrupts
    BX      LR                  @ Return via EXC_RETURN (hardware pops R0-R3, R12, LR, PC, xPSR)
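The handler calls a C function named `scheduler` and expects the next TCB pointer back in R0. One possible shape for that function is a round-robin pick over a fixed task table. This is only a sketch: the `tasks` array, `MAX_TASKS`, and the `TASK_*` state constants are assumptions, and the TCB is trimmed to the fields the policy needs.

```c
#include <stdint.h>
#include <stddef.h>

enum { TASK_READY, TASK_RUNNING, TASK_BLOCKED };

typedef struct {
    uint32_t *sp;      /* Must stay first: the handler reads it at offset 0 */
    uint8_t   state;   /* TASK_READY, TASK_RUNNING, or TASK_BLOCKED */
} TCB_t;               /* Trimmed-down TCB for this sketch */

#define MAX_TASKS 3
TCB_t  tasks[MAX_TASKS];
TCB_t *current_task = &tasks[0];

/* Pick the next READY task after the current one, wrapping around.
 * If nothing else is runnable, keep running the current task.
 * The caller (the PendSV handler) stores the result in current_task. */
TCB_t *scheduler(void)
{
    size_t cur = (size_t)(current_task - tasks);

    for (size_t step = 1; step <= MAX_TASKS; step++) {
        TCB_t *candidate = &tasks[(cur + step) % MAX_TASKS];
        if (candidate->state == TASK_READY) {
            if (current_task->state == TASK_RUNNING)
                current_task->state = TASK_READY;
            candidate->state = TASK_RUNNING;
            return candidate;
        }
    }
    return current_task;
}
```

Because BLOCKED tasks are simply skipped, blocking on I/O reduces to setting a task's state and triggering PendSV; a real system would also want an idle task for the case where every task is blocked.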

SysTick configuration for 10ms tick at 16MHz:

SysTick->LOAD = 16000000 / 100 - 1;  // 160000 cycles = 10ms
SysTick->VAL = 0;
SysTick->CTRL = 7;  // Enable, use processor clock, enable interrupt
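The SysTick reload register is only 24 bits wide, so not every clock and tick-rate combination fits. A small helper can compute the LOAD value and reject periods that overflow. The name `systick_reload` and its signature are illustrative assumptions, not part of CMSIS.

```c
#include <stdint.h>
#include <stdbool.h>

#define SYSTICK_MAX_RELOAD 0x00FFFFFFu  /* LOAD is a 24-bit register */

/* Compute the SysTick LOAD value for a given core clock and tick rate.
 * Returns false if the inputs are invalid or the period needs more than 24 bits. */
bool systick_reload(uint32_t cpu_hz, uint32_t tick_hz, uint32_t *reload)
{
    if (tick_hz == 0 || cpu_hz < tick_hz) return false;

    uint32_t cycles = cpu_hz / tick_hz;
    if (cycles - 1u > SYSTICK_MAX_RELOAD) return false;

    *reload = cycles - 1u;  /* Counter runs from LOAD down to 0: LOAD + 1 cycles */
    return true;
}
```

At 16 MHz and 100 ticks per second this yields 159999, matching the hard-coded expression; a 1 Hz tick at 72 MHz is rejected because it would need about 72 million cycles.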

Key questions:

  • Why use PendSV instead of SysTick directly for context switch?
  • What happens if a task overflows its stack?
  • How do you implement task blocking (e.g., for I/O)?

Learning milestones:

  1. SysTick fires regularly → Timer interrupt works
  2. You can switch between two tasks manually → Context save/restore works
  3. Preemption works automatically → Scheduler and PendSV integrated
  4. Three+ tasks run smoothly → Round-robin or priority scheduling works

Project 8: ARM Emulator

  • File: LEARN_ARM_DEEP_DIVE.md
  • Main Programming Language: C
  • Alternative Programming Languages: Rust, C++, Go
  • Coolness Level: Level 5: Pure Magic (Super Cool)
  • Business Potential: 4. The “Open Core” Infrastructure
  • Difficulty: Level 5: Master
  • Knowledge Area: CPU Architecture / Virtualization
  • Software or Tool: Building something like a mini-QEMU
  • Main Book: “Computer Organization and Design ARM Edition” by Patterson & Hennessy

What you’ll build: An ARM emulator that can execute real ARM binaries, emulating the CPU, memory, and basic I/O. It will run simple bare-metal programs and show the internal state of registers and memory.

Why it teaches ARM: Building an emulator forces complete understanding. Every instruction must be decoded and executed correctly. You’ll internalize the entire instruction set, the pipeline effects, condition codes, and edge cases. After this, ARM will have no secrets from you.

Core challenges you’ll face:

  • Instruction decoding for all formats → maps to complete ISA understanding
  • Emulating the barrel shifter → maps to operand processing
  • Condition code evaluation → maps to CPSR flags and conditional execution
  • Memory access emulation → maps to address translation concepts
  • Handling exceptions → maps to exception model and vector table

Key Concepts:

  • ARM Instruction Set: ARM Architecture Reference Manual (ARM ARM)
  • Emulator Design: “Computer Systems: A Programmer’s Perspective” Chapter 4 - Bryant & O’Hallaron
  • CPU Pipelines: “Computer Organization and Design ARM Edition” Chapter 4 - Patterson & Hennessy
  • Condition Codes: “The Art of ARM Assembly, Vol 1” Chapter 4 - Randall Hyde
  • Memory Systems: “Computer Architecture” Chapter 5 - Hennessy & Patterson

Difficulty: Master Time estimate: 1 month+ Prerequisites: Project 1 (instruction decoder), Projects 2-3 (deep ARM understanding), strong C programming skills.

Real world outcome:

$ ./arm-emu -v program.bin

ARM Emulator v1.0
Loading binary: program.bin (256 bytes)
Memory: 64KB @ 0x00000000

[0x00000000] E3A00001  MOV R0, #1
    R0: 0x00000000 → 0x00000001

[0x00000004] E3A01002  MOV R1, #2
    R1: 0x00000000 → 0x00000002

[0x00000008] E0802001  ADD R2, R0, R1
    R2: 0x00000000 → 0x00000003

[0x0000000C] E3520005  CMP R2, #5
    CPSR: N=1 Z=0 C=0 V=0 (R2 < 5)

[0x00000010] AA000002  BGE 0x00000020
    Branch NOT taken (condition GE failed)

[0x00000014] E2822001  ADD R2, R2, #1
    R2: 0x00000003 → 0x00000004

... execution continues ...

=== Execution Complete ===
Cycles: 47
Instructions: 42
Final state:
  R0=0x00000005  R1=0x00000002  R2=0x00000007  R3=0x00000000
  R4=0x00000000  R5=0x00000000  R6=0x00000000  R7=0x00000000
  ...
  PC=0x00000038  CPSR=0x60000010 [nZCv, User mode]

Implementation Hints:

CPU state structure:

typedef struct {
    uint32_t r[16];        // R0-R15 (R13=SP, R14=LR, R15=PC)
    uint32_t cpsr;         // Current Program Status Register
    uint32_t spsr;         // Saved PSR (for exceptions)

    uint8_t *memory;       // Emulated memory
    size_t mem_size;

    uint64_t cycles;       // Cycle counter
    bool halted;           // CPU halted flag
} ARM_CPU;

Main execution loop:

void run(ARM_CPU *cpu) {
    while (!cpu->halted) {
        // 1. Fetch
        uint32_t insn = fetch(cpu);

        // 2. Check condition
        if (!condition_passed(cpu, insn >> 28)) {
            cpu->r[15] += 4;  // Skip instruction
            cpu->cycles++;    // A skipped instruction still costs a cycle
            continue;
        }

        // 3. Decode and execute
        execute(cpu, insn);

        cpu->cycles++;
    }
}
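The loop relies on a condition check against the top four bits of each instruction. A minimal sketch of that check follows. It uses a hypothetical helper, `cond_holds`, that takes the CPSR value directly instead of the `ARM_CPU` pointer so it stays self-contained, and it treats condition `0xF` as "never execute", which is a simplification (later architecture versions use that space for unconditional instructions).

```c
#include <stdint.h>
#include <stdbool.h>

/* Evaluate an ARM condition field (bits 31-28 of the instruction)
 * against the N, Z, C, V flags in bits 31-28 of the CPSR. */
bool cond_holds(uint32_t cpsr, uint32_t cond)
{
    bool n = (cpsr >> 31) & 1, z = (cpsr >> 30) & 1;
    bool c = (cpsr >> 29) & 1, v = (cpsr >> 28) & 1;

    switch (cond & 0xF) {
        case 0x0: return z;             /* EQ: equal */
        case 0x1: return !z;            /* NE: not equal */
        case 0x2: return c;             /* CS/HS: unsigned >= */
        case 0x3: return !c;            /* CC/LO: unsigned < */
        case 0x4: return n;             /* MI: negative */
        case 0x5: return !n;            /* PL: positive or zero */
        case 0x6: return v;             /* VS: overflow */
        case 0x7: return !v;            /* VC: no overflow */
        case 0x8: return c && !z;       /* HI: unsigned > */
        case 0x9: return !c || z;       /* LS: unsigned <= */
        case 0xA: return n == v;        /* GE: signed >= */
        case 0xB: return n != v;        /* LT: signed < */
        case 0xC: return !z && n == v;  /* GT: signed > */
        case 0xD: return z || n != v;   /* LE: signed <= */
        case 0xE: return true;          /* AL: always */
        default:  return false;         /* 0xF: simplified to "never" here */
    }
}
```

With N=1 and V=0, as after `CMP R2, #5` with R2 = 3, the GE condition fails and LT holds, which is why the `BGE` in the sample trace is not taken.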

Instruction dispatch (by bits 27-25):

void execute(ARM_CPU *cpu, uint32_t insn) {
    uint32_t type = (insn >> 25) & 0x7;

    switch (type) {
        case 0: // Data processing (register) or Multiply
            if ((insn & 0x0FC000F0) == 0x00000090)
                exec_multiply(cpu, insn);
            else
                exec_data_proc_reg(cpu, insn);
            break;
        case 1: // Data processing (immediate)
            exec_data_proc_imm(cpu, insn);
            break;
        case 2: // Load/Store (immediate offset)
            exec_load_store_imm(cpu, insn);
            break;
        case 5: // Branch
            exec_branch(cpu, insn);
            break;
        // ... more cases ...
    }
}
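The branch case is a good first handler to write because it exercises both sign extension and the PC offset. The sketch below computes the target of an ARM-state `B`/`BL`; the helper name `branch_target` is an assumption, and a real `exec_branch` would also set LR when the link bit (bit 24) is set.

```c
#include <stdint.h>

/* Compute the target address of an ARM-state B/BL instruction.
 * insn_addr is the address of the branch instruction itself. */
uint32_t branch_target(uint32_t insn_addr, uint32_t insn)
{
    uint32_t imm24 = insn & 0x00FFFFFFu;

    /* Sign-extend the 24-bit word offset to 32 bits */
    uint32_t offset = (imm24 & 0x00800000u) ? (imm24 | 0xFF000000u) : imm24;

    /* Convert from words to bytes; unsigned wraparound handles negative offsets */
    offset <<= 2;

    /* PC reads as the instruction address + 8 in ARM state */
    return insn_addr + 8u + offset;
}
```

For the `BGE` at `0x00000010` in the sample trace (`AA000002`), the offset is 2 words, so the target is `0x10 + 8 + 8 = 0x20`.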

Barrel shifter for Operand2:

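uint32_t decode_operand2(ARM_CPU *cpu, uint32_t insn, bool is_immediate, bool *carry_out) {
    bool carry_in = (cpu->cpsr >> 29) & 1;  // Current C flag
    *carry_out = carry_in;                  // Default: shifter leaves C unchanged

    if (is_immediate) {
        // 8-bit immediate rotated right by 2 * rotate
        uint32_t imm = insn & 0xFF;
        uint32_t rotate = ((insn >> 8) & 0xF) * 2;
        if (rotate == 0) return imm;        // Shifting by 32 would be undefined in C
        uint32_t result = (imm >> rotate) | (imm << (32 - rotate));
        *carry_out = result >> 31;
        return result;
    }

    // Register shifted by an immediate amount (bit 4 = 0).
    // Register-specified shift amounts (bit 4 = 1) still need handling.
    uint32_t rm = cpu->r[insn & 0xF];
    uint32_t shift_type = (insn >> 5) & 0x3;
    uint32_t n = (insn >> 7) & 0x1F;

    switch (shift_type) {
        case 0:  // LSL; #0 means "no shift"
            if (n == 0) return rm;
            *carry_out = (rm >> (32 - n)) & 1;
            return rm << n;
        case 1:  // LSR; #0 encodes LSR #32
            *carry_out = (rm >> ((n ? n : 32) - 1)) & 1;
            return n ? rm >> n : 0;
        case 2:  // ASR; #0 encodes ASR #32 (assumes arithmetic >> on signed ints)
            *carry_out = (rm >> ((n ? n : 32) - 1)) & 1;
            if (n == 0) return (rm >> 31) ? 0xFFFFFFFFu : 0;
            return (uint32_t)((int32_t)rm >> n);
        default: // ROR; #0 encodes RRX (rotate right by one through carry)
            if (n == 0) { *carry_out = rm & 1; return ((uint32_t)carry_in << 31) | (rm >> 1); }
            *carry_out = (rm >> (n - 1)) & 1;
            return (rm >> n) | (rm << (32 - n));
    }
}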

Key questions:

  • What value does a read of PC return while an instruction executes? (Hint: the instruction’s own address + 8 in ARM mode)
  • How do you handle LDR PC, [Rx] (loading into PC = branch)?
  • What happens when an instruction modifies the PC?

Learning milestones:

  1. Simple programs run (MOV, ADD) → Basic decode/execute works
  2. Branches and loops work → Control flow is correct
  3. Fibonacci computes correctly → Arithmetic and memory work
  4. You can run real test binaries → Emulator holds up beyond hand-written examples

Project 9: Exception Handler & Fault Analyzer

  • File: LEARN_ARM_DEEP_DIVE.md
  • Main Programming Language: C with ARM Assembly
  • Alternative Programming Languages: Rust
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 4: Expert
  • Knowledge Area: Debugging / Exception Handling
  • Software or Tool: STM32 board
  • Main Book: “The Definitive Guide to ARM Cortex-M3 and Cortex-M4” by Joseph Yiu

What you’ll build: A comprehensive fault handler that catches all Cortex-M fault exceptions (HardFault, MemManage, BusFault, UsageFault), decodes the fault cause, prints a detailed stack trace, and optionally recovers or reboots.

Why it teaches ARM: Faults are inevitable in embedded development. Understanding how ARM reports errors—through fault status registers, stacked PC, and fault addresses—is essential for debugging. This project makes you the person who can diagnose any crash.

Core challenges you’ll face:

  • Understanding stacked exception frames → maps to what’s on stack at fault time
  • Decoding fault status registers → maps to CFSR, HFSR, MMFAR, BFAR
  • Determining faulting instruction → maps to stacked PC and instruction analysis
  • Stack unwinding for call trace → maps to frame pointer following
  • Safe recovery strategies → maps to exception return and system reset

Key Concepts:

  • ARM Exception Model: “The Definitive Guide to ARM Cortex-M3” Chapter 8 - Joseph Yiu
  • Fault Status Registers: “The Definitive Guide to ARM Cortex-M3” Chapter 12 - Joseph Yiu
  • Stack Unwinding: “Computer Systems: A Programmer’s Perspective” Chapter 3 - Bryant & O’Hallaron
  • Debug Features: ARM Cortex-M Technical Reference Manual

Difficulty: Expert Time estimate: 1-2 weeks Prerequisites: Projects 3-4 (bare-metal), strong understanding of stack and calling conventions.

Real world outcome:

!!! HARD FAULT DETECTED !!!

Fault Type: BusFault (Precise)
Fault Address: 0xE0100000 (invalid peripheral access)
Faulting Instruction: 0x08001234 (LDR R0, [R1])

Stacked Registers:
  R0  = 0x00000042    R1  = 0xE0100000
  R2  = 0x00000000    R3  = 0x20001234
  R12 = 0x08004567    LR  = 0x0800089B
  PC  = 0x08001234    xPSR= 0x61000000

Fault Status:
  CFSR  = 0x00008200  [PRECISERR, BFARVALID]
  HFSR  = 0x40000000  [FORCED]
  BFAR  = 0xE0100000  (Valid)

Stack Trace:
  #0 0x08001234 in read_sensor() at sensors.c:47
  #1 0x0800089A in main() at main.c:123
  #2 0x080000A2 in Reset_Handler() at startup.s:45

Call Stack Memory:
  0x20003FF0: 0x08001234  <- Fault PC
  0x20003FEC: 0x0800089A  <- Called from
  0x20003FE8: 0x080000A2  <- Called from

Action: System reset in 5 seconds...

Implementation Hints:

Exception frame structure (pushed automatically):

typedef struct {
    uint32_t r0;
    uint32_t r1;
    uint32_t r2;
    uint32_t r3;
    uint32_t r12;
    uint32_t lr;
    uint32_t pc;      // Faulting instruction address
    uint32_t xpsr;
} exception_frame_t;

Fault status registers:

#define SCB_CFSR    (*(volatile uint32_t *)0xE000ED28)  // Configurable Fault Status
#define SCB_HFSR    (*(volatile uint32_t *)0xE000ED2C)  // HardFault Status
#define SCB_MMFAR   (*(volatile uint32_t *)0xE000ED34)  // MemManage Fault Address
#define SCB_BFAR    (*(volatile uint32_t *)0xE000ED38)  // BusFault Address

// CFSR bits
#define CFSR_IACCVIOL   (1 << 0)   // Instruction access violation
#define CFSR_DACCVIOL   (1 << 1)   // Data access violation
#define CFSR_MUNSTKERR  (1 << 3)   // MemManage fault on unstacking
#define CFSR_MSTKERR    (1 << 4)   // MemManage fault on stacking
#define CFSR_IBUSERR    (1 << 8)   // Bus fault on instruction fetch
#define CFSR_PRECISERR  (1 << 9)   // Precise data bus error
#define CFSR_IMPRECISERR (1 << 10) // Imprecise data bus error
// ... more bits ...
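Rather than testing each bit by hand, the decoding can be table-driven. The sketch below lists the ARMv7-M CFSR flags with their bit positions and turns a raw register value into a readable string. The names `fault_bit_t`, `cfsr_bits`, and `cfsr_decode` are assumptions for illustration, and the table should be checked against the reference manual for your specific core.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

typedef struct { uint32_t mask; const char *name; } fault_bit_t;

/* MemManage flags: bits 0-7, BusFault: bits 8-15, UsageFault: bits 16-31 */
static const fault_bit_t cfsr_bits[] = {
    { 1u << 0,  "IACCVIOL"    }, { 1u << 1,  "DACCVIOL"  },
    { 1u << 3,  "MUNSTKERR"   }, { 1u << 4,  "MSTKERR"   },
    { 1u << 7,  "MMARVALID"   },
    { 1u << 8,  "IBUSERR"     }, { 1u << 9,  "PRECISERR" },
    { 1u << 10, "IMPRECISERR" }, { 1u << 11, "UNSTKERR"  },
    { 1u << 12, "STKERR"      }, { 1u << 15, "BFARVALID" },
    { 1u << 16, "UNDEFINSTR"  }, { 1u << 17, "INVSTATE"  },
    { 1u << 18, "INVPC"       }, { 1u << 19, "NOCP"      },
    { 1u << 24, "UNALIGNED"   }, { 1u << 25, "DIVBYZERO" },
};

/* Write a space-separated list of the set CFSR flags into out.
 * Returns the number of flags found; output is truncated if out is too small. */
int cfsr_decode(uint32_t cfsr, char *out, size_t out_size)
{
    int count = 0;
    if (out_size == 0) return 0;
    out[0] = '\0';

    for (size_t i = 0; i < sizeof cfsr_bits / sizeof cfsr_bits[0]; i++) {
        if (!(cfsr & cfsr_bits[i].mask)) continue;
        size_t used = strlen(out);
        snprintf(out + used, out_size - used, "%s%s",
                 count ? " " : "", cfsr_bits[i].name);
        count++;
    }
    return count;
}
```

A precise bus fault with a valid fault address sets bits 9 and 15, so `0x00008200` decodes to `PRECISERR BFARVALID`.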

HardFault handler:

void HardFault_Handler_C(exception_frame_t *frame) {
    printf("\n!!! HARD FAULT !!!\n");
    printf("PC = 0x%08lX (faulting instruction)\n", (unsigned long)frame->pc);
    printf("LR = 0x%08lX (return address)\n", (unsigned long)frame->lr);

    // Decode CFSR
    uint32_t cfsr = SCB_CFSR;
    if (cfsr & CFSR_PRECISERR) {
        // BFAR only holds the faulting address while BFARVALID (bit 15) is set
        if (cfsr & (1u << 15)) {
            printf("BusFault: Precise error at 0x%08lX\n", (unsigned long)SCB_BFAR);
        } else {
            printf("BusFault: Precise error (BFAR not valid)\n");
        }
    }
    // ... decode more bits ...

    // Print stack trace
    unwind_stack(frame);

    // Reset
    NVIC_SystemReset();
}

// Assembly wrapper: pass a pointer to the stacked exception frame in R0
__attribute__((naked)) void HardFault_Handler(void) {
    __asm volatile (
        "TST LR, #4      \n"  // Test bit 2 of LR
        "ITE EQ          \n"
        "MRSEQ R0, MSP   \n"  // If 0, was using MSP
        "MRSNE R0, PSP   \n"  // If 1, was using PSP
        "B HardFault_Handler_C\n"
    );
}
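The handler calls `unwind_stack`, which is left for you to write. When no frame pointers or unwind tables are available, one crude fallback is to scan the stack for words that look like Thumb return addresses. The sketch below does only that: `scan_return_addresses` is a hypothetical helper, the code-region bounds are supplied by the caller, and the result can contain false positives (any odd data word that happens to fall in the code range).

```c
#include <stdint.h>
#include <stddef.h>

/* Scan stack words for values that look like Thumb return addresses:
 * odd (Thumb bit set) and inside the code region [code_start, code_end).
 * Stores up to max_out candidates in out and returns how many were found. */
size_t scan_return_addresses(const uint32_t *stack, size_t words,
                             uint32_t code_start, uint32_t code_end,
                             uint32_t *out, size_t max_out)
{
    size_t found = 0;

    for (size_t i = 0; i < words && found < max_out; i++) {
        uint32_t v = stack[i];
        if ((v & 1u) && v >= code_start && v < code_end) {
            out[found++] = v & ~1u;  /* Clear the Thumb bit for display */
        }
    }
    return found;
}
```

Feeding the candidates to `addr2line` on the host turns them into function names; a proper unwinder would instead follow frame pointers or parse the `.ARM.exidx` tables.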

Key questions:

  • Why is LR sometimes 0xFFFFFFFx in exception handlers?
  • How do you get a call trace without debug symbols?
  • What’s the difference between precise and imprecise bus faults?

Learning milestones:

  1. HardFault handler prints PC → Basic exception handling works
  2. You decode the fault type → Status register parsing works
  3. Stack trace shows call chain → Unwinding works
  4. System recovers gracefully → Exception return understood

Project 10: I2C Driver for OLED Display

  • File: LEARN_ARM_DEEP_DIVE.md
  • Main Programming Language: C
  • Alternative Programming Languages: Rust, MicroPython (for comparison)
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 2. The “Micro-SaaS / Pro Tool”
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Serial Protocols / Peripheral Programming
  • Software or Tool: STM32 + SSD1306 OLED display
  • Main Book: “Making Embedded Systems, 2nd Edition” by Elecia White

What you’ll build: A complete I2C driver from scratch that communicates with an SSD1306 OLED display, drawing pixels, text, and simple graphics without using any libraries.

Why it teaches ARM: I2C is one of the most common peripheral protocols. Building it from scratch teaches you: GPIO alternate functions, timing constraints, clock stretching, ACK/NACK handling, and the general pattern for all peripheral drivers. Plus, you get visible output!

Core challenges you’ll face:

  • Configuring GPIO for I2C → maps to alternate functions and open-drain
  • Generating proper timing → maps to I2C clock configuration
  • Handling ACK/NACK → maps to protocol state machine
  • Sending commands vs data → maps to SSD1306 protocol specifics
  • Frame buffer management → maps to memory organization for display

Key Concepts:

  • I2C Protocol: “Making Embedded Systems” Chapter 9 - Elecia White
  • SSD1306 Controller: SSD1306 Datasheet (Solomon Systech)
  • GPIO Alternate Functions: STM32 Reference Manual GPIO chapter
  • Graphics Primitives: “Computer Graphics from Scratch” Chapter 1 - Gabriel Gambetta

Difficulty: Advanced Time estimate: 1-2 weeks Prerequisites: Project 3 (bare-metal basics), Project 4 (peripheral experience), understanding of bit manipulation.

Real world outcome:

Physical result: 128x64 OLED display shows:

┌──────────────────────────────────────┐
│  ARM I2C Demo                        │
│                                      │
│  ╭──────────────────────────╮        │
│  │    CPU: 72MHz            │        │
│  │    Temp: 32°C            │        │
│  │    Time: 14:23:45        │        │
│  ╰──────────────────────────╯        │
│                      ___             │
│     ★              /ARM \            │
│                    \_____/           │
└──────────────────────────────────────┘

Serial output:
I2C initialized at 400kHz
SSD1306 found at address 0x3C
Display initialized (128x64)
Drawing test pattern...
Framebuffer: 1024 bytes (128x64/8)

Implementation Hints:

I2C transaction structure:

START → Address+W → ACK → Data → ACK → ... → STOP
        [7 bits][R/W]     [8 bits]

I2C register configuration (STM32):

// 1. Enable clocks for I2C and GPIO
RCC->APB1ENR |= RCC_APB1ENR_I2C1EN;
RCC->AHB1ENR |= RCC_AHB1ENR_GPIOBEN;

// 2. Configure GPIO for I2C (open-drain, alternate function)
GPIOB->MODER |= (2 << (6*2)) | (2 << (7*2));  // AF mode
GPIOB->OTYPER |= (1 << 6) | (1 << 7);         // Open drain
GPIOB->AFR[0] |= (4 << (6*4)) | (4 << (7*4)); // AF4 = I2C

// 3. Configure I2C timing for 400kHz (fast mode)
I2C1->CR1 = 0;                    // Disable I2C
I2C1->CR2 = 36;                   // APB1 clock = 36MHz
I2C1->CCR = I2C_CCR_FS | 30;      // Fast mode, DUTY=0: 36MHz / (3 * 400kHz) = 30
I2C1->TRISE = 11;                 // Max rise time 300ns: 36MHz * 300ns + 1
I2C1->CR1 = I2C_CR1_PE;           // Enable I2C
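The timing registers can be derived from the APB1 clock instead of being hard-coded. The sketch below assumes the older STM32 I2C peripheral (the one with `CCR` and `TRISE` registers, as on F1/F4 parts) with `DUTY = 0`, and a nonzero `scl_hz`. The helper names `i2c_ccr` and `i2c_trise` are illustrative, so check the formulas against your reference manual.

```c
#include <stdint.h>

#define CCR_FS_BIT (1u << 15)  /* F/S bit: selects fast mode in I2C_CCR */

/* Value for I2C_CCR given the APB1 clock and the desired SCL frequency. */
uint32_t i2c_ccr(uint32_t pclk_hz, uint32_t scl_hz)
{
    if (scl_hz <= 100000u) {
        /* Standard mode: T_high = T_low = CCR * T_pclk */
        return pclk_hz / (2u * scl_hz);
    }
    /* Fast mode, DUTY = 0: T_low = 2 * T_high, so one period is 3 * CCR * T_pclk */
    return CCR_FS_BIT | (pclk_hz / (3u * scl_hz));
}

/* Value for I2C_TRISE: maximum SCL rise time in PCLK cycles, plus one. */
uint32_t i2c_trise(uint32_t pclk_hz, uint32_t scl_hz)
{
    uint32_t max_rise_ns = (scl_hz <= 100000u) ? 1000u : 300u;
    uint32_t pclk_mhz = pclk_hz / 1000000u;
    return (pclk_mhz * max_rise_ns) / 1000u + 1u;
}
```

For a 36 MHz APB1 clock this gives `CCR = FS | 30` and `TRISE = 11` at 400 kHz, or `CCR = 180` and `TRISE = 37` at 100 kHz.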

I2C write sequence:

void i2c_write(uint8_t addr, uint8_t *data, uint8_t len) {
    // 1. Generate START
    I2C1->CR1 |= I2C_CR1_START;
    while (!(I2C1->SR1 & I2C_SR1_SB));

    // 2. Send address with write bit
    I2C1->DR = addr << 1;
    while (!(I2C1->SR1 & I2C_SR1_ADDR));
    (void)I2C1->SR2;  // Clear ADDR flag

    // 3. Send data bytes
    for (int i = 0; i < len; i++) {
        I2C1->DR = data[i];
        while (!(I2C1->SR1 & I2C_SR1_TXE));
    }

    // 4. Wait for the final byte to finish shifting out (BTF), then generate STOP
    while (!(I2C1->SR1 & I2C_SR1_BTF));
    I2C1->CR1 |= I2C_CR1_STOP;
}

SSD1306 initialization sequence:

uint8_t init_cmds[] = {
    0xAE,       // Display off
    0xD5, 0x80, // Clock divide
    0xA8, 0x3F, // Multiplex ratio (64-1)
    0xD3, 0x00, // Display offset
    0x40,       // Start line
    0x8D, 0x14, // Charge pump on
    0x20, 0x00, // Horizontal addressing mode
    0xA1,       // Segment remap
    0xC8,       // COM scan direction
    0xDA, 0x12, // COM pins config
    0x81, 0xCF, // Contrast
    0xD9, 0xF1, // Pre-charge
    0xDB, 0x40, // VCOMH deselect
    0xA4,       // Display from RAM
    0xA6,       // Normal display
    0xAF        // Display on
};
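With horizontal addressing mode selected, the display RAM is organized as 8 pages of 128 bytes, and each byte covers 8 vertically stacked pixels. A framebuffer that mirrors this layout can be sent to the display unchanged. The sketch below shows the index arithmetic; `framebuffer` and `set_pixel` are illustrative names.

```c
#include <stdint.h>
#include <stdbool.h>

#define SSD1306_WIDTH  128
#define SSD1306_HEIGHT 64

/* One bit per pixel: 128 * 64 / 8 = 1024 bytes */
uint8_t framebuffer[SSD1306_WIDTH * SSD1306_HEIGHT / 8];

/* Set or clear one pixel; out-of-range coordinates are ignored. */
void set_pixel(int x, int y, bool on)
{
    if (x < 0 || x >= SSD1306_WIDTH || y < 0 || y >= SSD1306_HEIGHT) return;

    int index = x + (y / 8) * SSD1306_WIDTH;  /* Page = y / 8, column = x */
    uint8_t mask = (uint8_t)(1u << (y % 8));  /* Bit 0 is the top row of the page */

    if (on) framebuffer[index] |= mask;
    else    framebuffer[index] &= (uint8_t)~mask;
}
```

Pixel (5, 10) lands in page 1, so it maps to byte 133 with bit 2 set.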

Key questions:

  • What happens if you forget open-drain configuration?
  • How do you handle clock stretching by the slave?
  • Why does SSD1306 use 8 vertical pixels per byte?

Learning milestones:

  1. I2C peripheral responds → Address ACK received
  2. Display turns on → Init sequence works
  3. Single pixel lights up → Framebuffer to display works
  4. Text and graphics display → Higher-level drawing works

Project 11: SPI Driver for SD Card

  • File: LEARN_ARM_DEEP_DIVE.md
  • Main Programming Language: C
  • Alternative Programming Languages: Rust
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Serial Protocols / File Systems
  • Software or Tool: STM32 + MicroSD card module
  • Main Book: “Making Embedded Systems, 2nd Edition” by Elecia White

What you’ll build: An SPI driver that initializes an SD card, reads/writes sectors, and optionally implements FAT16/FAT32 to read actual files.

Why it teaches ARM: SPI is the other workhorse serial protocol on ARM microcontrollers (along with I2C). SD cards require careful timing, CRC checks, and a multi-step initialization sequence. This teaches you: DMA for high-speed transfers, protocol state machines, and persistent storage.

Core challenges you’ll face:

  • SD card SPI initialization dance → maps to protocol timing and command sequences
  • Command/response handling → maps to state machine design
  • CRC calculation → maps to data integrity
  • Block read/write → maps to 512-byte sector handling
  • FAT filesystem parsing → maps to data structure interpretation

Key Concepts:

  • SPI Protocol: “Making Embedded Systems” Chapter 9 - Elecia White
  • SD Card SPI Mode: SD Physical Layer Specification (SDA)
  • FAT Filesystem: “FAT32 File System Specification” - Microsoft
  • DMA Transfers: STM32 Reference Manual DMA chapter

Difficulty: Advanced Time estimate: 2-3 weeks Prerequisites: Project 3 (bare-metal), Project 4 (peripheral experience), bit manipulation skills.

Real world outcome:

=== SD Card Driver Test ===

Initializing SPI at 400kHz...
Sending CMD0 (GO_IDLE)... OK (R1=0x01)
Sending CMD8 (SEND_IF_COND)... OK, SDHC card detected
Sending ACMD41 (SD_SEND_OP_COND)...
  Attempt 1: busy
  Attempt 2: busy
  Attempt 3: ready!
Switching to 25MHz SPI mode

Card info:
  Type: SDHC
  Capacity: 16 GB (31116288 sectors)
  Manufacturer: SanDisk

Reading sector 0 (MBR)...
  Boot signature: 0x55AA ✓
  Partition 1: FAT32, starts at sector 2048

Mounting FAT32...
  Sectors per cluster: 64
  Reserved sectors: 32
  Root directory: cluster 2

Directory listing of /:
  HELLO.TXT       42 bytes
  FIRMWARE.BIN    65536 bytes
  DATA/           <DIR>

Reading HELLO.TXT:
"Hello from SD card!"

Write test:
  Writing "Test 12345" to TEST.TXT... OK
  Reading back... "Test 12345"

Implementation Hints:

SPI configuration:

// Configure SPI1 for SD card
SPI1->CR1 = 0;
SPI1->CR1 |= SPI_CR1_MSTR;        // Master mode
SPI1->CR1 |= SPI_CR1_BR;          // Clock = PCLK/256 (BR = 111; about 281kHz at 72MHz, under the 400kHz init limit)
SPI1->CR1 |= SPI_CR1_SSM | SPI_CR1_SSI; // Software CS
SPI1->CR1 |= SPI_CR1_SPE;         // Enable SPI

SD command format (6 bytes):

┌──────────┬───────────┬────────┐
│ 01xxxxxx │ argument  │ CRC7+1 │
│ (cmd)    │ (4 bytes) │        │
└──────────┴───────────┴────────┘
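The six-byte frame can be assembled in a few lines once a CRC7 routine exists. The sketch below uses the SD CRC7 polynomial (x^7 + x^3 + 1) computed bit by bit; the names `crc7` and `sd_build_command` are assumptions, and a table-driven CRC would be faster if it ever matters.

```c
#include <stdint.h>
#include <stddef.h>

/* CRC7 with polynomial x^7 + x^3 + 1, processed MSB first.
 * The working value is kept left-aligned in crc's top 7 bits;
 * the caller appends the stop bit. */
static uint8_t crc7(const uint8_t *data, size_t len)
{
    uint8_t crc = 0;

    for (size_t i = 0; i < len; i++) {
        uint8_t d = data[i];
        for (int bit = 0; bit < 8; bit++) {
            crc <<= 1;
            if ((d ^ crc) & 0x80) crc ^= 0x09;
            d <<= 1;
        }
    }
    return crc & 0x7F;
}

/* Fill frame[6] with: start/transmission bits + command index,
 * 32-bit argument (big-endian), then CRC7 followed by the stop bit. */
void sd_build_command(uint8_t cmd, uint32_t arg, uint8_t frame[6])
{
    frame[0] = (uint8_t)(0x40u | (cmd & 0x3Fu));
    frame[1] = (uint8_t)(arg >> 24);
    frame[2] = (uint8_t)(arg >> 16);
    frame[3] = (uint8_t)(arg >> 8);
    frame[4] = (uint8_t)arg;
    frame[5] = (uint8_t)((crc7(frame, 5) << 1) | 1u);
}
```

CMD0 with a zero argument produces the familiar `40 00 00 00 00 95`, and CMD8 with argument `0x1AA` ends in `87`. In SPI mode the card only checks the CRC for these early commands unless CRC checking is explicitly enabled.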

SD initialization sequence:

bool sd_init(void) {
    // 1. Send 80+ clock pulses with CS high (10 bytes x 8 clocks)
    cs_high();
    for (int i = 0; i < 10; i++) spi_transfer(0xFF);

    // 2. CMD0: GO_IDLE_STATE
    cs_low();
    if (sd_command(0, 0x00000000) != 0x01) {  // Expect R1 = 0x01 (idle)
        cs_high();
        return false;
    }

    // 3. CMD8: SEND_IF_COND (check if SDv2)
    sd_command(8, 0x000001AA);  // 2.7-3.6V, check pattern
    for (int i = 0; i < 4; i++) spi_transfer(0xFF);  // Drain the 4 trailing R7 bytes

    // 4. ACMD41: SD_SEND_OP_COND (poll until ready; the retry limit is arbitrary)
    uint8_t r = 0xFF;
    for (int attempt = 0; attempt < 1000 && r != 0; attempt++) {
        sd_command(55, 0);               // APP_CMD prefix
        r = sd_command(41, 0x40000000);  // HCS bit
    }
    if (r != 0) {  // Card never left the idle state
        cs_high();
        return false;
    }

    // 5. CMD58: Read OCR (check CCS bit for SDHC)

    // 6. Switch to high-speed SPI
    spi_set_speed(25000000);

    return true;
}

Reading a sector:

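bool sd_read_sector(uint32_t sector, uint8_t *buffer) {
    // SDHC/SDXC take a block number; SDSC cards need a byte address (sector * 512)
    if (sd_command(17, sector) != 0x00) return false;  // READ_SINGLE_BLOCK

    // Wait for data token (0xFE), giving up after a bounded number of polls
    int tries = 100000;  // Arbitrary limit; tune for your SPI clock
    while (spi_transfer(0xFF) != 0xFE) {
        if (--tries == 0) return false;
    }

    // Read 512 bytes
    for (int i = 0; i < 512; i++) {
        buffer[i] = spi_transfer(0xFF);
    }

    // Read (and ignore) CRC
    spi_transfer(0xFF);
    spi_transfer(0xFF);

    return true;
}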

Key questions:

  • Why must you start at 400kHz and only go faster after init?
  • What’s the difference between SD (SDSC) and SDHC addressing?
  • How do you handle multi-block transfers for speed?

Learning milestones:

  1. Card responds to CMD0 → SPI and timing work
  2. Initialization completes → Protocol state machine works
  3. You read the MBR → Sector reads work
  4. You read a file → FAT parsing works

Project 12: Timer-Based PWM Motor Controller

  • File: LEARN_ARM_DEEP_DIVE.md
  • Main Programming Language: C
  • Alternative Programming Languages: Rust, MicroPython (for comparison)
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: Timers / PWM / Motor Control
  • Software or Tool: STM32 + DC motor or servo
  • Main Book: “Making Embedded Systems, 2nd Edition” by Elecia White

What you’ll build: A PWM generator using ARM timer peripherals to control motor speed or servo position, with smooth acceleration and position feedback via encoder.

Why it teaches ARM: Timers are among the most complex peripherals on ARM microcontrollers. PWM generation teaches you: timer configurations, compare/capture units, dead-time insertion, and DMA-triggered updates. This is essential for robotics and power electronics.

Core challenges you’ll face:

  • Timer clock configuration → maps to prescaler and period calculations
  • PWM mode setup → maps to output compare modes
  • Smooth speed ramping → maps to software control loops
  • Encoder reading → maps to timer encoder mode
  • Dead-time for H-bridge → maps to complementary outputs

Key Concepts:

  • PWM Fundamentals: “Making Embedded Systems” Chapter 6 - Elecia White
  • ARM Timer Architecture: STM32 Reference Manual TIM chapter
  • PID Control: “Making Embedded Systems” Chapter 11 - Elecia White
  • Encoder Interface: STM32 Timer Application Notes

Difficulty: Intermediate Time estimate: 1 week Prerequisites: Project 3 (bare-metal), basic understanding of DC motors.

Real world outcome:

=== PWM Motor Controller ===

Timer: TIM1, 72MHz base, PWM @ 20kHz
Motor: DC brushed, H-bridge driver

Commands:
  S<n>  - Set speed (-100 to 100)
  A<n>  - Set acceleration (1-100)
  P     - Print position

> S50
Ramping to 50%...
  10% -> 20% -> 30% -> 40% -> 50%
Current: 50%, Duty: 1800/3600

> S-30
Reversing direction...
  50% -> 40% -> 30% -> 20% -> 10% -> 0%
  0% -> -10% -> -20% -> -30%
Current: -30%, Duty: 1080/3600 (reversed)

> P
Encoder count: 4523
Estimated RPM: 1200
Position: 12.6 revolutions

Physical result: Motor smoothly accelerates, holds speed, reverses

Implementation Hints:

Timer PWM configuration:

// TIM1 for PWM, 72MHz clock, 20kHz PWM
RCC->APB2ENR |= RCC_APB2ENR_TIM1EN;

TIM1->PSC = 0;               // No prescaler
TIM1->ARR = 3600 - 1;        // 72MHz / 3600 = 20kHz
TIM1->CCR1 = 1800;           // 50% duty cycle

// Configure channel 1 as PWM mode 1
TIM1->CCMR1 = (6 << 4) | (1 << 3);  // OC1M = PWM mode 1, OC1PE = preload enable
TIM1->CCER = TIM_CCER_CC1E;  // Enable output

// Enable main output (required for TIM1)
TIM1->BDTR |= TIM_BDTR_MOE;

TIM1->CR1 |= TIM_CR1_CEN;    // Start timer

Smooth ramping:

void set_speed_smooth(int target) {
    while (current_speed != target) {
        if (current_speed < target) current_speed++;
        else current_speed--;

        update_pwm(current_speed);
        delay_ms(10);  // Ramp rate
    }
}

Encoder mode (reading motor position):

// TIM2 in encoder mode
TIM2->SMCR = (3 << 0);       // Encoder mode 3 (count both edges)
TIM2->CCMR1 = (1 << 0) | (1 << 8);  // CC1/CC2 as inputs
TIM2->CCER = 0;              // Non-inverted
TIM2->CNT = 0;               // Reset count
TIM2->CR1 |= TIM_CR1_CEN;    // Start

// Read position anytime:
int32_t position = (int16_t)TIM2->CNT;  // Signed for direction
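A 16-bit counter wraps after 65536 counts, which a motor can reach quickly. One common approach is to sample the counter regularly and accumulate the signed difference into a wider variable. The sketch below assumes the counter is read often enough that it moves fewer than 32768 counts between samples; `encoder_t` and `encoder_update` are illustrative names.

```c
#include <stdint.h>

typedef struct {
    uint16_t last_count;  /* Previous raw counter reading */
    int32_t  position;    /* Accumulated signed position in counts */
} encoder_t;

/* Fold a new raw 16-bit counter reading into the 32-bit position.
 * The subtraction wraps modulo 2^16, so counter overflow and underflow
 * both come out as small signed deltas (two's complement conversion assumed). */
int32_t encoder_update(encoder_t *enc, uint16_t raw_count)
{
    int16_t delta = (int16_t)(uint16_t)(raw_count - enc->last_count);

    enc->last_count = raw_count;
    enc->position += delta;
    return enc->position;
}
```

Calling this from the SysTick handler with the current timer count keeps `position` valid across any number of wraps in either direction.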

Key questions:

  • Why is 20kHz a good PWM frequency for motors?
  • How do you handle direction reversal (H-bridge control)?
  • What happens if encoder counts overflow?

Learning milestones:

  1. LED brightness varies → Basic PWM works
  2. Motor spins at controlled speed → PWM duty cycle correct
  3. Motor reverses smoothly → Direction and ramping work
  4. Position feedback accurate → Encoder mode works

Project 13: DMA-Driven Audio Player

  • File: LEARN_ARM_DEEP_DIVE.md
  • Main Programming Language: C
  • Alternative Programming Languages: Rust
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 2. The “Micro-SaaS / Pro Tool”
  • Difficulty: Level 4: Expert
  • Knowledge Area: DMA / DAC / Audio
  • Software or Tool: STM32 with DAC + speaker/headphones
  • Main Book: “Making Embedded Systems, 2nd Edition” by Elecia White

What you’ll build: A WAV file player that uses DMA to stream audio from SD card to DAC, with the CPU stepping in only to refill buffers during playback.

Why it teaches ARM: DMA (Direct Memory Access) is how real systems achieve high performance. By offloading memory transfers to hardware, the CPU is free for other tasks. This project combines: DMA configuration, double-buffering, DAC output, and real-time audio constraints.

Core challenges you’ll face:

  • DMA configuration → maps to peripheral-to-memory and memory-to-peripheral
  • Double buffering → maps to avoiding audio glitches
  • DAC timing → maps to timer-triggered DAC updates
  • WAV parsing → maps to file format understanding
  • Sample rate conversion → maps to timer frequency calculations

Key Concepts:

  • DMA Controllers: “Making Embedded Systems” Chapter 7 - Elecia White
  • DAC Operation: STM32 Reference Manual DAC chapter
  • WAV File Format: RIFF specification
  • Double Buffering: “Game Programming Patterns” Chapter 8 - Robert Nystrom

Difficulty: Expert Time estimate: 2-3 weeks Prerequisites: Projects 10-11 (I2C/SPI), understanding of audio concepts, DMA basics.

Real world outcome:

=== DMA Audio Player ===

SD card mounted, FAT32
DMA: Circular mode, half/full interrupts
DAC: 12-bit, TIM6 triggered

Loading: music.wav
  Format: PCM
  Channels: 2 (stereo, mixing to mono)
  Sample rate: 44100 Hz
  Bits per sample: 16

Configuring TIM6 for 44.1kHz...
  Timer clock: 72MHz
  Prescaler: 0
  Period: 1632 (actual: 44117 Hz, 0.04% error)

Playing... [=====>              ] 25%
  Buffer: 2048 samples, double-buffered
  DMA interrupts: 1247 (half), 1247 (full)
  CPU usage: 3% (mostly SD reads)

Press 'p' to pause, 's' to stop, '+/-' for volume

Physical result: Audio plays through speaker/headphones clearly!

Implementation Hints:

DMA configuration for DAC:

// Configure DMA1 Channel 3 for DAC1
DMA1_Channel3->CPAR = (uint32_t)&DAC->DHR12R1;  // Destination: DAC
DMA1_Channel3->CMAR = (uint32_t)audio_buffer;   // Source: buffer
DMA1_Channel3->CNDTR = BUFFER_SIZE;              // Transfer count

DMA1_Channel3->CCR = 0;
DMA1_Channel3->CCR |= DMA_CCR_MINC;    // Memory increment
DMA1_Channel3->CCR |= DMA_CCR_CIRC;    // Circular mode
DMA1_Channel3->CCR |= DMA_CCR_DIR;     // Memory to peripheral
DMA1_Channel3->CCR |= DMA_CCR_MSIZE_0; // 16-bit memory
DMA1_Channel3->CCR |= DMA_CCR_PSIZE_0; // 16-bit peripheral
DMA1_Channel3->CCR |= DMA_CCR_HTIE;    // Half-transfer interrupt
DMA1_Channel3->CCR |= DMA_CCR_TCIE;    // Transfer complete interrupt
DMA1_Channel3->CCR |= DMA_CCR_EN;      // Enable

Double-buffering strategy:

Buffer: [    First Half    |    Second Half    ]
         ^^^^^^^^^^^^^^^^
         DMA playing this
                            ^^^^^^^^^^^^^^^^^^
                            CPU filling this

When DMA finishes first half: HT interrupt, CPU fills first half
When DMA finishes second half: TC interrupt, CPU fills second half

Timer-triggered DAC:

// TIM6 triggers DAC at sample rate
TIM6->PSC = 0;
TIM6->ARR = (72000000 / 44100) - 1;  // = 1631, i.e. a period of 1632 timer ticks
TIM6->CR2 |= TIM_CR2_MMS_1;  // TRGO on update
TIM6->CR1 |= TIM_CR1_CEN;

// DAC configuration
DAC->CR |= DAC_CR_TEN1;      // Trigger enable
DAC->CR &= ~DAC_CR_TSEL1;    // TSEL1 = 000 selects TIM6 TRGO (STM32F1/F4; check your reference manual)
DAC->CR |= DAC_CR_DMAEN1;    // DMA enable
DAC->CR |= DAC_CR_EN1;       // Enable DAC
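WAV files usually store signed 16-bit samples, while the DAC's right-aligned 12-bit register expects unsigned values from 0 to 4095. A sketch of the conversion, including the stereo-to-mono mix shown in the sample output, follows. The function name `pcm16_stereo_to_dac12` is an assumption, and no dithering or volume scaling is attempted.

```c
#include <stdint.h>

/* Mix a stereo pair of signed 16-bit samples to mono and convert to the
 * unsigned 12-bit, right-aligned format used by the DAC data register. */
uint16_t pcm16_stereo_to_dac12(int16_t left, int16_t right)
{
    /* Average the channels in 32 bits so the sum cannot overflow */
    int32_t mono = ((int32_t)left + (int32_t)right) / 2;  /* -32768..32767 */

    /* Shift the signed range up to 0..65535, then drop the 4 low bits */
    uint32_t offset_binary = (uint32_t)(mono + 32768);
    return (uint16_t)(offset_binary >> 4);                /* 0..4095 */
}
```

Silence (0, 0) maps to mid-scale 2048, so the speaker rests at half the reference voltage; an output coupling capacitor removes that DC offset.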

Key questions:

  • Why circular DMA mode for audio?
  • What happens if CPU can’t fill buffer fast enough?
  • How do you handle 16-bit audio on a 12-bit DAC?

Learning milestones:

  1. DAC outputs a sine wave → Basic DAC works
  2. DMA runs without CPU → DMA configuration correct
  3. Audio plays continuously → Double-buffering works
  4. Music sounds correct → Sample rate and bit depth right

Project 14: ARM Debugger (GDB Stub)

  • File: LEARN_ARM_DEEP_DIVE.md
  • Main Programming Language: C
  • Alternative Programming Languages: Rust
  • Coolness Level: Level 5: Pure Magic (Super Cool)
  • Business Potential: 4. The “Open Core” Infrastructure
  • Difficulty: Level 5: Master
  • Knowledge Area: Debugging / Debug Hardware
  • Software or Tool: STM32 (target) + USB-serial
  • Main Book: “Building a Debugger” by Sy Brand

What you’ll build: A GDB remote stub that runs on your ARM target, allowing GDB to connect over serial and debug programs—set breakpoints, single-step, inspect memory and registers.

Why it teaches ARM: This requires deep understanding of: ARM debug architecture, software breakpoints (BKPT instruction), the debug monitor exception, and the GDB remote protocol. You’re building the tool that debugs other tools.

Core challenges you’ll face:

  • Implementing GDB Remote Serial Protocol → maps to packet format, checksums
  • Software breakpoints → maps to BKPT instruction insertion
  • Single-stepping → maps to debug monitor and step flags
  • Register access → maps to reading stacked exception frames
  • Memory read/write → maps to arbitrary memory access safely

Key Concepts:

  • GDB Remote Protocol: GDB Documentation
  • ARM Debug Architecture: “The Definitive Guide to ARM Cortex-M3” Chapter 14 - Joseph Yiu
  • Software Breakpoints: “Building a Debugger” - Sy Brand
  • Debug Monitor: ARM Cortex-M Debug Technical Reference

Difficulty: Master Time estimate: 1 month+ Prerequisites: Projects 4, 9 (UART, exception handling), strong understanding of ARM internals.

Real world outcome:

# On your computer:
$ arm-none-eabi-gdb program.elf
(gdb) target remote /dev/ttyUSB0
Remote debugging using /dev/ttyUSB0
0x08000100 in Reset_Handler ()

(gdb) break main
Breakpoint 1 at 0x08000234: file main.c, line 12.

(gdb) continue
Continuing.
Breakpoint 1, main () at main.c:12
12	    int x = 42;

(gdb) print x
$1 = 0

(gdb) step
13	    int y = x * 2;

(gdb) print x
$2 = 42

(gdb) info registers
r0             0x42                66
r1             0x0                 0
r2             0x20000100          536871168
...
pc             0x8000238           0x8000238 <main+4>
cpsr           0x61000000          1627389952

(gdb) x/4x 0x20000000
0x20000000:	0x00000042	0x00000054	0x00000000	0x00000000

Implementation Hints:

GDB packet format:

$<data>#<checksum>

Examples:
  $g#67             - Read all registers
  $m8000000,10#xx   - Read 16 bytes at 0x08000000
  $M8000000,4:12345678#xx - Write 4 bytes
  $c#63             - Continue execution
  $s#73             - Single step
  $Z0,8000234,2#xx  - Set breakpoint at 0x08000234
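
The checksum in the examples above is the modulo-256 sum of the payload bytes, written as two hex digits. A minimal sketch of framing a reply follows; the names gdb_checksum and gdb_frame_packet are illustrative, not part of any GDB library, and escaping of '$', '#' and '}' inside the payload is left out.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Modulo-256 sum of the payload bytes (everything between '$' and '#'). */
uint8_t gdb_checksum(const char *payload, size_t len) {
    uint8_t sum = 0;
    for (size_t i = 0; i < len; i++) {
        sum = (uint8_t)(sum + (uint8_t)payload[i]);
    }
    return sum;
}

/* Writes "$<payload>#<two hex digits>" into out.
 * Returns the framed length, or -1 if out is too small. */
int gdb_frame_packet(const char *payload, char *out, size_t out_size) {
    size_t len = strlen(payload);
    if (out_size < len + 5) {  /* '$' + payload + '#' + 2 hex digits + NUL */
        return -1;
    }
    return snprintf(out, out_size, "$%s#%02x", payload,
                    (unsigned)gdb_checksum(payload, len));
}
```

Framing "g" this way gives $g#67, matching the first example. On a small target you may prefer to emit the bytes directly over UART instead of pulling in snprintf.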

Main stub loop:

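void gdb_stub_main(void) {
    while (1) {
        char packet[256];
        gdb_receive_packet(packet);  // Expected to verify the checksum and ACK with '+'

        switch (packet[0]) {
            case '?':  // Why did the target stop? (GDB asks this on connect)
                gdb_send_stop_reply('T', 5);  // SIGTRAP
                break;
            case 'g':  // Read registers
                send_registers();
                break;
            case 'G':  // Write registers
                write_registers(packet + 1);
                send_ok();
                break;
            case 'm':  // Read memory
                read_memory(packet);
                break;
            case 'M':  // Write memory
                write_memory(packet);
                send_ok();
                break;
            case 'c':  // Continue
                continue_execution();
                return;  // Leave the stub so the target can run again
            case 's':  // Step
                single_step();
                return;  // The next stop re-enters the stub
            case 'Z':  // Set breakpoint: "Z0,<addr>,<kind>"
                // parse_hex_addr is a hypothetical helper that reads the hex address
                set_breakpoint(parse_hex_addr(packet + 3));
                send_ok();
                break;
            // ... more commands ...
            default:   // Unsupported command: GDB expects an empty reply
                send_empty();  // Hypothetical helper that sends "$#00"
                break;
        }
    }
}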
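
The read_memory handler has to pull the address and length out of the packet and reply with hex-encoded bytes. A minimal host-testable sketch of those two steps follows; parse_mem_read and hex_encode are hypothetical helper names, and the parser leans on strtoul, so it is more lenient than a strict parser would be.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Parses "m<addr>,<length>" (both hexadecimal).
 * Returns 0 on success, -1 on a malformed packet. */
int parse_mem_read(const char *packet, uint32_t *addr, uint32_t *len) {
    char *end;
    if (packet[0] != 'm') {
        return -1;
    }
    unsigned long a = strtoul(packet + 1, &end, 16);
    if (end == packet + 1 || *end != ',') {
        return -1;
    }
    const char *len_str = end + 1;
    unsigned long l = strtoul(len_str, &end, 16);
    if (end == len_str || *end != '\0') {
        return -1;
    }
    *addr = (uint32_t)a;
    *len = (uint32_t)l;
    return 0;
}

/* Encodes bytes as lowercase hex, two digits per byte, in memory order.
 * out must hold at least 2 * len + 1 characters. */
void hex_encode(const uint8_t *src, size_t len, char *out) {
    static const char digits[] = "0123456789abcdef";
    for (size_t i = 0; i < len; i++) {
        out[2 * i] = digits[src[i] >> 4];
        out[2 * i + 1] = digits[src[i] & 0x0F];
    }
    out[2 * len] = '\0';
}
```

Because bytes go out in memory order, a little-endian word holding 0x42 is sent as 42000000. The same encoder can serve the 'g' reply, which sends each register in target byte order.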

Software breakpoints:

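// BKPT encoding: 0xBExx (Thumb) or 0xE12xxx7x (ARM)
// Patching an instruction only works for code in RAM. Code in flash needs
// the hardware comparators of the Flash Patch and Breakpoint (FPB) unit.
void set_breakpoint(uint32_t addr) {
    if (n >= MAX_BREAKPOINTS) return;  // Table full (n counts the used slots)

    // Save original instruction
    breakpoints[n].addr = addr;
    breakpoints[n].original = *(volatile uint16_t *)addr;
    n++;

    // Insert BKPT
    *(volatile uint16_t *)addr = 0xBE00;  // BKPT #0

    // Make sure the write completes before the next instruction fetch
    __DSB();
    __ISB();
}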
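
GDB also sends z0 packets to remove breakpoints, and it may set the same address twice. A minimal sketch of a breakpoint table that handles both cases follows; the names bp_insert, bp_remove and bp_table are illustrative, and the barriers are omitted so the logic can be tried on a host against an ordinary array.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_BREAKPOINTS 8
#define THUMB_BKPT 0xBE00u  /* BKPT #0 */

typedef struct {
    volatile uint16_t *addr;
    uint16_t original;
    int in_use;
} sw_breakpoint_t;

static sw_breakpoint_t bp_table[MAX_BREAKPOINTS];

/* Patches BKPT over the instruction at addr. Returns 0 on success,
 * -1 if the table is full. Setting the same address twice is a no-op,
 * so the saved original is never overwritten with BKPT itself. */
int bp_insert(volatile uint16_t *addr) {
    for (int i = 0; i < MAX_BREAKPOINTS; i++) {
        if (bp_table[i].in_use && bp_table[i].addr == addr) {
            return 0;
        }
    }
    for (int i = 0; i < MAX_BREAKPOINTS; i++) {
        if (!bp_table[i].in_use) {
            bp_table[i].addr = addr;
            bp_table[i].original = *addr;
            bp_table[i].in_use = 1;
            *addr = THUMB_BKPT;
            return 0;
        }
    }
    return -1;
}

/* Restores the original instruction. Returns -1 if no breakpoint is set there. */
int bp_remove(volatile uint16_t *addr) {
    for (int i = 0; i < MAX_BREAKPOINTS; i++) {
        if (bp_table[i].in_use && bp_table[i].addr == addr) {
            *addr = bp_table[i].original;
            bp_table[i].in_use = 0;
            return 0;
        }
    }
    return -1;
}
```

On the target, follow each patch with __DSB() and __ISB(). To continue from a breakpoint, one common approach is to restore the original instruction, single-step over it, then re-insert the BKPT.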

Debug monitor exception:

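void DebugMon_Handler(void) {
    // Requires MON_EN set in DEMCR. DFSR bits are sticky: write 1 to clear.
    uint32_t dfsr = SCB->DFSR;
    SCB->DFSR = dfsr;

    // BKPT hit, or a single step completed (HALTED is set after MON_STEP)
    if (dfsr & (SCB_DFSR_BKPT_Msk | SCB_DFSR_HALTED_Msk)) {
        save_context();
        gdb_send_stop_reply('T', 5);  // SIGTRAP
        gdb_stub_main();
    }
}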
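
The handler only fires if monitor mode is enabled in the Debug Exception and Monitor Control Register (DEMCR), and single-stepping is requested through the same register. A minimal sketch of the bit handling follows, written as pure functions so it can be checked off-target; the function names are hypothetical, while the bit positions (MON_EN at bit 16, MON_STEP at bit 18) follow the ARMv7-M architecture.

```c
#include <stdint.h>

#define DEMCR_MON_EN   (1u << 16)  /* Enable the DebugMonitor exception */
#define DEMCR_MON_STEP (1u << 18)  /* Take DebugMonitor after one instruction */

/* DEMCR value for resuming at full speed: monitor on, stepping off. */
uint32_t demcr_for_continue(uint32_t demcr) {
    return (demcr | DEMCR_MON_EN) & ~DEMCR_MON_STEP;
}

/* DEMCR value for executing exactly one instruction, then trapping. */
uint32_t demcr_for_step(uint32_t demcr) {
    return demcr | DEMCR_MON_EN | DEMCR_MON_STEP;
}
```

On the target this would be applied as CoreDebug->DEMCR = demcr_for_step(CoreDebug->DEMCR); before returning from the handler. Monitor mode is bypassed while an external halting debugger (such as an ST-Link session) is attached, so test the stub with the probe disconnected from debug.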

Key questions:

  • How do you handle a breakpoint inside a Thumb-2 IT block, where BKPT executes regardless of the condition?
  • What if the user sets a breakpoint on the GDB stub itself?
  • How do you implement hardware breakpoints (limited number)?

Learning milestones:

  1. GDB connects and reads registers → Basic protocol works
  2. Memory dump works → Memory access correct
  3. Breakpoints stop execution → BKPT and DebugMon work
  4. Single-step works → Step flag handling correct

Project 15: Tiny Operating System

  • File: LEARN_ARM_DEEP_DIVE.md
  • Main Programming Language: C + ARM Assembly
  • Alternative Programming Languages: Rust
  • Coolness Level: Level 5: Pure Magic (Super Cool)
  • Business Potential: 5. The “Industry Disruptor”
  • Difficulty: Level 5: Master
  • Knowledge Area: Operating Systems / Kernel Development
  • Software or Tool: STM32 board
  • Main Book: “Operating Systems: Three Easy Pieces” by Arpaci-Dusseau

What you’ll build: A minimal but complete operating system with: preemptive multitasking, memory protection (using MPU), IPC (semaphores, queues), and a simple shell—like a tiny FreeRTOS you built yourself.

Why it teaches ARM: This is the capstone that ties everything together. You’ll implement: privileged vs unprivileged modes, MPU regions, SVCall for system calls, proper task isolation, and all the OS primitives. After this, you understand both ARM and operating systems at the deepest level.

Core challenges you’ll face:

  • User/Kernel mode separation → maps to ARM privilege levels
  • MPU configuration → maps to memory protection regions
  • System call interface → maps to SVC instruction and handler
  • Inter-task communication → maps to queues, semaphores
  • Priority-based scheduling → maps to scheduler algorithms
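
For the priority-based scheduling challenge, one simple structure is a bitmask with one bit per priority level. The sketch below assumes that priority 0 is the highest, which matches the sample output where the idle task sits at priority 7; the function name is illustrative.

```c
#include <stdint.h>

#define NUM_PRIORITIES 8

/* Bit p of ready_mask is set when at least one task at priority p is READY.
 * Returns the most urgent ready priority (0 is highest), or -1 if none. */
int highest_ready_priority(uint8_t ready_mask) {
    for (int p = 0; p < NUM_PRIORITIES; p++) {
        if (ready_mask & (1u << p)) {
            return p;
        }
    }
    return -1;
}
```

The scheduler would then pick the next task from a per-priority queue, rotating within a level for round-robin. With the idle task always ready, the function never returns -1 in practice.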

Key Concepts:

  • OS Fundamentals: “Operating Systems: Three Easy Pieces” - Arpaci-Dusseau
  • ARM Privilege Levels: “The Definitive Guide to ARM Cortex-M3” Chapter 3 - Joseph Yiu
  • MPU Configuration: “The Definitive Guide to ARM Cortex-M3” Chapter 11 - Joseph Yiu
  • System Calls: “Operating Systems: Three Easy Pieces” Chapter 6
  • Synchronization: “Operating Systems: Three Easy Pieces” Chapters 27-31

Difficulty: Master
Time estimate: 1-2 months
Prerequisites: All previous projects, especially 6-7, 9. Strong OS theory background.

Real world outcome:

=== TinyOS v1.0 ===
Kernel: 8KB Flash, 2KB RAM
User space: 120KB Flash, 30KB RAM
Tasks: 8 max, priority 0-7

Boot sequence:
  [OK] MPU configured: 8 regions
  [OK] Kernel in privileged mode
  [OK] User tasks in unprivileged mode
  [OK] SysTick @ 1ms
  [OK] Shell task started

TinyOS> ps
PID  NAME       PRI  STATE     STACK  CPU%
  1  idle         7  READY     128    85%
  2  shell        2  RUNNING   512     2%
  3  blinker      3  READY     256     1%
  4  sensor       1  BLOCKED   256    12%

TinyOS> exec counter
Starting task 'counter' (PID 5)

TinyOS> kill 3
Task 'blinker' terminated

TinyOS> mem
Kernel heap: 1024/2048 bytes used
User heap:   4096/30720 bytes used
Task stacks: 1280 bytes total

TinyOS> sem
SEM       VALUE  WAITERS
uart_tx       1  (none)
sensor_rdy    0  sensor(4)

TinyOS> msg
QUEUE     SIZE  PENDING
cmd_q     16    2 messages
data_q    64    0 messages

Implementation Hints:

System call mechanism:

// User space: request service via SVC
int sys_write(int fd, const char *buf, int len) {
    register int r0 __asm("r0") = fd;  // First argument in, return value out
    register const char *r1 __asm("r1") = buf;
    register int r2 __asm("r2") = len;

    __asm volatile (
        "SVC #1"  // System call number in immediate
        : "+r" (r0)  // r0 is both read and written by the call
        : "r" (r1), "r" (r2)
        : "memory"
    );
    return r0;
}

// Kernel: SVC handler (assumes user tasks always run on the process stack, PSP)
void SVC_Handler(void) {
    // Get stacked PC, read SVC instruction to get number
    uint32_t *sp = get_psp();
    uint32_t pc = sp[6];
    uint8_t svc_num = ((uint8_t *)pc)[-2];  // Low byte of the 16-bit SVC opcode

    // Store each result in the stacked r0 so the caller sees it after return
    switch (svc_num) {
        case 0: syscall_yield(); break;
        case 1: sp[0] = syscall_write(sp[0], (void*)sp[1], sp[2]); break;
        case 2: sp[0] = syscall_read(sp[0], (void*)sp[1], sp[2]); break;
        // ... more syscalls ...
    }
}
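
As the number of system calls grows, a function-pointer table with a bounds check is a common alternative to the switch. The sketch below is one possible shape, not the only design; the handler names, the stub bodies and the -38 error code are illustrative assumptions.

```c
#include <stddef.h>
#include <stdint.h>

typedef int32_t (*syscall_fn)(uint32_t a0, uint32_t a1, uint32_t a2);

#define SYSCALL_ENOSYS (-38)  /* Illustrative "no such syscall" code */

/* Stub handlers so the sketch is self-contained. */
static int32_t sys_yield_impl(uint32_t a0, uint32_t a1, uint32_t a2) {
    (void)a0; (void)a1; (void)a2;
    return 0;
}

static int32_t sys_write_impl(uint32_t fd, uint32_t buf, uint32_t len) {
    (void)fd; (void)buf;
    return (int32_t)len;  /* Pretend every byte was written */
}

static const syscall_fn syscall_table[] = {
    sys_yield_impl,  /* SVC #0 */
    sys_write_impl,  /* SVC #1 */
};

/* frame points at the stacked r0-r3, r12, lr, pc, xpsr of the caller.
 * The result is written to frame[0], which becomes r0 after return. */
void syscall_dispatch(uint8_t svc_num, uint32_t *frame) {
    size_t count = sizeof syscall_table / sizeof syscall_table[0];
    if (svc_num >= count) {
        frame[0] = (uint32_t)SYSCALL_ENOSYS;
        return;
    }
    frame[0] = (uint32_t)syscall_table[svc_num](frame[0], frame[1], frame[2]);
}
```

A real handler should also check that any user-supplied pointer, such as the write buffer, lies inside memory the calling task is allowed to access before the kernel dereferences it.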

MPU configuration:

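// Each base address must be aligned to its region size. REGION_*,
// MPU_RASR_AP_* and MPU_RASR_XN_* are project-defined macros; the *_Msk
// names come from the CMSIS core header.
void mpu_configure_task(TCB_t *task) {
    // Region 0: Code (read-only, execute)
    MPU->RBAR = task->code_start | MPU_RBAR_VALID_Msk | 0;
    MPU->RASR = MPU_RASR_ENABLE_Msk | REGION_32K |
                MPU_RASR_AP_RO_RO | MPU_RASR_XN_NO;

    // Region 1: Data (read-write, no execute)
    MPU->RBAR = task->data_start | MPU_RBAR_VALID_Msk | 1;
    MPU->RASR = MPU_RASR_ENABLE_Msk | REGION_4K |
                MPU_RASR_AP_RW_RW | MPU_RASR_XN_YES;

    // Region 2: Stack (read-write, no execute)
    MPU->RBAR = task->stack_start | MPU_RBAR_VALID_Msk | 2;
    MPU->RASR = MPU_RASR_ENABLE_Msk | REGION_1K |
                MPU_RASR_AP_RW_RW | MPU_RASR_XN_YES;

    // Make the new regions take effect before returning to the task
    __DSB();
    __ISB();
}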
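
The REGION_* values above encode the region size in the RASR SIZE field, where the region covers 2^(SIZE+1) bytes and the minimum is 32 bytes. A minimal sketch of computing that field and a full RASR value follows; the helper names are hypothetical, the bit positions follow the ARMv7-M MPU, and the memory-attribute bits (TEX, S, C, B) are left at zero for brevity.

```c
#include <stdint.h>

#define RASR_ENABLE 1u
#define RASR_XN     (1u << 28)
#define AP_RW_RW    3u  /* Full access for privileged and unprivileged code */
#define AP_RO_RO    6u  /* Read-only for both */

/* Returns the SIZE field for a region of the given length, or -1 if the
 * length is not a power of two of at least 32 bytes. */
int mpu_size_field(uint32_t bytes) {
    if (bytes < 32 || (bytes & (bytes - 1)) != 0) {
        return -1;
    }
    int bits = 0;
    while ((bytes >> bits) != 1u) {
        bits++;
    }
    return bits - 1;  /* Region size is 2^(SIZE+1) */
}

/* A region base must be a multiple of the region size. */
int mpu_base_is_aligned(uint32_t base, uint32_t bytes) {
    return (base & (bytes - 1)) == 0;
}

/* Builds a RASR value, or 0 (region disabled) for an invalid size. */
uint32_t mpu_rasr(uint32_t bytes, uint32_t ap, int xn) {
    int size = mpu_size_field(bytes);
    if (size < 0) {
        return 0;
    }
    return (xn ? RASR_XN : 0u) | (ap << 24) | ((uint32_t)size << 1) | RASR_ENABLE;
}
```

For example, a 1 KB read-write, no-execute stack region comes out as 0x13000013. The alignment rule is why task stacks are usually allocated from a pool of power-of-two blocks.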

Task state machine:

                      preempt
         ┌──────────────────────────┐
         ▼                          │
    ┌─────────┐    schedule    ┌─────────┐    exit    ┌─────────┐
    │  READY  │───────────────▶│ RUNNING │───────────▶│ ZOMBIE  │
    └─────────┘                └─────────┘            └─────────┘
         ▲                          │
         │ signal                   │ wait
         │                          ▼
         │                     ┌─────────┐
         └─────────────────────│ BLOCKED │
                               └─────────┘
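
The diagram can be read as five transitions: READY to RUNNING on schedule, RUNNING to READY on preempt, RUNNING to BLOCKED on wait, BLOCKED to READY on signal, and RUNNING to ZOMBIE on exit. A minimal sketch that enforces exactly those follows; the enum and function names are illustrative.

```c
typedef enum { TASK_READY, TASK_RUNNING, TASK_BLOCKED, TASK_ZOMBIE } task_state_t;
typedef enum { EV_SCHEDULE, EV_PREEMPT, EV_WAIT, EV_SIGNAL, EV_EXIT } task_event_t;

/* Applies ev to *state. Returns 0 if the transition is legal,
 * -1 (leaving *state unchanged) otherwise. */
int task_transition(task_state_t *state, task_event_t ev) {
    switch (*state) {
    case TASK_READY:
        if (ev == EV_SCHEDULE) { *state = TASK_RUNNING; return 0; }
        break;
    case TASK_RUNNING:
        if (ev == EV_PREEMPT) { *state = TASK_READY;   return 0; }
        if (ev == EV_WAIT)    { *state = TASK_BLOCKED; return 0; }
        if (ev == EV_EXIT)    { *state = TASK_ZOMBIE;  return 0; }
        break;
    case TASK_BLOCKED:
        if (ev == EV_SIGNAL)  { *state = TASK_READY;   return 0; }
        break;
    case TASK_ZOMBIE:
        break;  /* Terminal state: waits to be reclaimed */
    }
    return -1;
}
```

Rejecting illegal transitions in one place makes scheduler bugs show up as a failed call rather than as a corrupted run queue. A shell kill command would need an extra transition from READY or BLOCKED to ZOMBIE, which the diagram does not show.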

Key questions:

  • How do you switch from Handler mode to Thread mode with unprivileged access?
  • What happens when a user task tries to access kernel memory?
  • How do you implement priority inheritance for mutexes?

Learning milestones:

  1. Tasks run in unprivileged mode → Privilege separation works
  2. MPU faults on bad access → Memory protection works
  3. System calls work → SVC mechanism works
  4. Shell can spawn/kill tasks → Process management works

Project Comparison Table

Project               | Difficulty   | Time       | Depth of Understanding | Fun Factor
----------------------+--------------+------------+------------------------+-----------
1. Instruction Decoder | Intermediate | 1-2 weeks  | ⭐⭐⭐⭐⭐ (ISA encoding) | ⭐⭐⭐
2. Assembly Calculator | Beginner     | Weekend    | ⭐⭐⭐ (registers, syscalls) | ⭐⭐⭐
3. Bare-Metal LED      | Advanced     | 1-2 weeks  | ⭐⭐⭐⭐⭐ (boot, GPIO) | ⭐⭐⭐⭐⭐
4. UART Driver         | Advanced     | 1-2 weeks  | ⭐⭐⭐⭐ (peripherals, IRQ) | ⭐⭐⭐⭐
5. Memory Allocator    | Advanced     | 1-2 weeks  | ⭐⭐⭐⭐ (heap, alignment) | ⭐⭐⭐
6. Bootloader          | Expert       | 2-4 weeks  | ⭐⭐⭐⭐⭐ (flash, boot) | ⭐⭐⭐⭐⭐
7. Context Switcher    | Expert       | 2-4 weeks  | ⭐⭐⭐⭐⭐ (multitasking) | ⭐⭐⭐⭐⭐
8. ARM Emulator        | Master       | 1 month+   | ⭐⭐⭐⭐⭐ (complete ISA) | ⭐⭐⭐⭐⭐
9. Fault Analyzer      | Expert       | 1-2 weeks  | ⭐⭐⭐⭐ (exceptions) | ⭐⭐⭐⭐
10. I2C OLED Driver    | Advanced     | 1-2 weeks  | ⭐⭐⭐ (I2C protocol) | ⭐⭐⭐⭐⭐
11. SPI SD Card        | Advanced     | 2-3 weeks  | ⭐⭐⭐⭐ (SPI, filesystem) | ⭐⭐⭐⭐
12. PWM Motor Control  | Intermediate | 1 week     | ⭐⭐⭐ (timers, PWM) | ⭐⭐⭐⭐⭐
13. DMA Audio Player   | Expert       | 2-3 weeks  | ⭐⭐⭐⭐ (DMA, DAC) | ⭐⭐⭐⭐⭐
14. GDB Stub           | Master       | 1 month+   | ⭐⭐⭐⭐⭐ (debug arch) | ⭐⭐⭐⭐
15. Tiny OS            | Master       | 1-2 months | ⭐⭐⭐⭐⭐ (everything) | ⭐⭐⭐⭐⭐

If you’re completely new to ARM:

Start here:

  1. Project 2: Assembly Calculator - Get comfortable with ARM assembly syntax
  2. Project 1: Instruction Decoder - Understand how instructions are encoded
  3. Project 3: Bare-Metal LED - Your first real hardware project

If you have some embedded experience:

Start here:

  1. Project 3: Bare-Metal LED - Verify your bare-metal skills
  2. Project 4: UART Driver - Build your debug console
  3. Project 10: I2C OLED - Visual feedback is motivating!
  4. Project 7: Context Switcher - Core RTOS concept

If you want the deep understanding:

Follow this path:

  1. Projects 1-3 (foundations)
  2. Projects 4-5 (peripherals and memory)
  3. Projects 6-7 (boot and multitasking)
  4. Project 8: ARM Emulator - The ultimate learning project
  5. Project 15: Tiny OS - Put it all together

Hardware Requirements

Minimum kit (~$25):

  • STM32 Nucleo-F411RE or Nucleo-F446RE board
  • USB cable (included)
  • Breadboard and jumper wires

Recommended additions (~$50 more):

  • SSD1306 OLED display (I2C)
  • MicroSD card module
  • Small speaker/buzzer
  • DC motor with L298N driver
  • Rotary encoder

Final Capstone: ARM-Based Retro Game Console

  • File: LEARN_ARM_DEEP_DIVE.md
  • Main Programming Language: C + ARM Assembly
  • Alternative Programming Languages: Rust
  • Coolness Level: Level 5: Pure Magic (Super Cool)
  • Business Potential: 2. The “Micro-SaaS / Pro Tool”
  • Difficulty: Level 5: Master
  • Knowledge Area: Complete System / Graphics / Audio / Input
  • Software or Tool: STM32F4 + LCD + Buttons + Speaker
  • Main Book: “Computer Graphics from Scratch” by Gabriel Gambetta

What you’ll build: A complete handheld game console with: color LCD display (SPI), audio output (DAC/PWM), button input, game ROM loading from SD card, and a simple game (Tetris/Snake/Breakout).

Why this is the capstone: This project integrates everything:

  • Project 3: Bare-metal initialization and GPIO
  • Project 4: UART for debugging
  • Project 11: SPI for LCD and SD card
  • Project 12: PWM for audio
  • Project 13: DMA for efficient transfers
  • Project 7: Game loop timing (optional RTOS)
  • Project 5: Memory management for game assets

Core challenges you’ll face:

  • Fast LCD updates → SPI + DMA for 60fps
  • Double-buffered graphics → Tear-free rendering
  • Game timing → Consistent frame rate
  • Audio mixing → Multiple sound effects
  • Asset management → Loading sprites/sounds from SD
  • Power management → Battery-friendly operation
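
For the audio mixing challenge, the simplest mixer adds the samples of each channel and clamps the result so that loud passages clip instead of wrapping around. A minimal sketch for signed 16-bit samples follows; the function name is illustrative.

```c
#include <stdint.h>

/* Adds two signed 16-bit samples and saturates to the int16_t range. */
int16_t mix_samples(int16_t a, int16_t b) {
    int32_t sum = (int32_t)a + (int32_t)b;
    if (sum > INT16_MAX) {
        sum = INT16_MAX;
    }
    if (sum < INT16_MIN) {
        sum = INT16_MIN;
    }
    return (int16_t)sum;
}
```

Without the clamp, 30000 + 10000 would wrap to a large negative value and produce a loud click. On a Cortex-M4 the CMSIS __SSAT intrinsic can replace the two comparisons.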

Key Concepts:

  • Game Loop Architecture: “Game Programming Patterns” Chapter 1 - Robert Nystrom
  • Sprite Rendering: “Computer Graphics from Scratch” - Gabriel Gambetta
  • Audio Synthesis: “The Audio Programming Book” - Boulanger & Lazzarini
  • Embedded Graphics: “Making Embedded Systems” Chapter 8 - Elecia White

Difficulty: Master
Time estimate: 2-3 months
Prerequisites: Most previous projects, especially 3, 10-13, graphics/game programming interest.

Real world outcome:

Physical device: A handheld game console you built from scratch!

┌─────────────────────────────────┐
│      ARM Game Console v1.0      │
│  ┌───────────────────────────┐  │
│  │                           │  │
│  │   ████████████████████    │  │
│  │   ████  TETRIS   ████    │  │
│  │   ████████████████████    │  │
│  │                           │  │
│  │     Score: 12450          │  │
│  │     Level: 5              │  │
│  │                           │  │
│  │      ░░██░░               │  │
│  │      ░░██░░               │  │
│  │    ██████████             │  │
│  │    ████░░████             │  │
│  │    ██████████             │  │
│  └───────────────────────────┘  │
│                                 │
│   [←] [→]    [↓]    [A] [B]    │
│                                 │
└─────────────────────────────────┘

Serial debug output:
FPS: 60.02 | CPU: 45% | DMA: active
Audio: 22kHz mono, 2 channels mixed
LCD: 320x240 RGB565, SPI @ 40MHz

Implementation Hints:

System architecture:

┌─────────────────────────────────────────────────┐
│                   Game Loop                      │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐       │
│  │  Input   │─▶│  Update  │─▶│  Render  │       │
│  └──────────┘  └──────────┘  └──────────┘       │
│       │                            │             │
│       ▼                            ▼             │
│  GPIO Buttons              Framebuffer[2]        │
│                                    │             │
│                                    ▼             │
│                            DMA to LCD (SPI)      │
└─────────────────────────────────────────────────┘
          │
          ▼
    Timer Interrupt (60Hz)
          │
          ▼
    Audio DMA (background)

LCD configuration (ILI9341 example):

void lcd_init(void) {
    // Hardware reset
    gpio_clear(LCD_RST); delay_ms(10);
    gpio_set(LCD_RST); delay_ms(120);

    lcd_cmd(0x01);  // Software reset
    delay_ms(5);

    lcd_cmd(0x11);  // Sleep out
    delay_ms(120);

    lcd_cmd(0x3A); lcd_data(0x55);  // 16-bit color
    lcd_cmd(0x36); lcd_data(0x48);  // Rotation
    lcd_cmd(0x29);  // Display on

    // Configure DMA for fast writes
    setup_lcd_dma();
}
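
Command 0x3A with parameter 0x55 selects 16 bits per pixel, so every color written to the panel is RGB565: 5 bits of red, 6 of green, 5 of blue. A minimal sketch of packing 8-bit channels into that format follows; the function names are illustrative, and the byte-swapped variant assumes the SPI peripheral sends 8-bit frames from a little-endian buffer while the panel expects the high byte first.

```c
#include <stdint.h>

/* Packs 8-bit red, green and blue into RGB565 by dropping the low bits. */
static inline uint16_t rgb565(uint8_t r, uint8_t g, uint8_t b) {
    return (uint16_t)(((r & 0xF8u) << 8) | ((g & 0xFCu) << 3) | (b >> 3));
}

/* Same color with the two bytes swapped, for byte-wise SPI transfers
 * where the high byte must reach the panel first. */
static inline uint16_t rgb565_swapped(uint8_t r, uint8_t g, uint8_t b) {
    uint16_t c = rgb565(r, g, b);
    return (uint16_t)((c << 8) | (c >> 8));
}
```

Pure red is 0xF800, pure green 0x07E0 and pure blue 0x001F. If colors look wrong on the first test pattern, byte order and the BGR bit in command 0x36 are the usual suspects.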

Double buffering:

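// RGB565: 320 * 240 * 2 bytes = 150 KB per buffer, 300 KB for both.
// That exceeds the 128 KB of SRAM on an STM32F411/F446, so on those parts
// shrink the buffers (lower resolution, strips, or dirty rectangles).
uint16_t framebuffer[2][320 * 240];
volatile uint8_t current_buffer = 0;
volatile uint8_t dma_busy = 0;  // Assumed: set on DMA start, cleared in the DMA-complete ISR

void swap_buffers(void) {
    // Wait for previous DMA to complete
    while (dma_busy);

    // Start DMA transfer of current buffer. An STM32F4 DMA stream moves at
    // most 65,535 items per transfer, so lcd_dma_start must split the frame.
    lcd_dma_start(framebuffer[current_buffer], 320 * 240);

    // Switch to other buffer for rendering
    current_buffer ^= 1;
}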

Simple sprite blitting:

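void draw_sprite(int x, int y, const uint16_t *sprite, int w, int h) {
    uint16_t *fb = framebuffer[current_buffer];
    for (int row = 0; row < h; row++) {
        int py = y + row;
        if (py < 0 || py >= 240) continue;  // Clip rows outside the screen
        for (int col = 0; col < w; col++) {
            int px = x + col;
            if (px < 0 || px >= 320) continue;  // Clip columns outside the screen
            uint16_t color = sprite[row * w + col];
            if (color != TRANSPARENT) {  // Color key: skip transparent pixels
                fb[py * 320 + px] = color;
            }
        }
    }
}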

Game loop timing:

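#define TARGET_FPS 60
#define FRAME_TIME_MS (1000 / TARGET_FPS)  // Integer division: 16 ms, about 62.5 fps

void game_loop(void) {
    uint32_t frame_start = get_ms();

    while (1) {
        // Input
        uint8_t buttons = read_buttons();

        // Update game state
        game_update(buttons);

        // Render to back buffer
        render_game();

        // Swap buffers (triggers DMA)
        swap_buffers();

        // Wait out the rest of the frame. The next frame is scheduled from
        // the previous deadline, so a slightly long delay is made up later.
        uint32_t elapsed = get_ms() - frame_start;
        if (elapsed < FRAME_TIME_MS) {
            delay_ms(FRAME_TIME_MS - elapsed);
            frame_start += FRAME_TIME_MS;
        } else {
            frame_start = get_ms();  // Frame overran: resynchronize
        }
    }
}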
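
The read_buttons() call in the loop sees raw pin levels, which bounce for a few milliseconds around each press. One simple approach is to accept a new state only after it has been seen on several consecutive samples. The sketch below assumes one sample per frame and a threshold of three samples; the type and function names are illustrative.

```c
#include <stdint.h>

#define DEBOUNCE_SAMPLES 3

typedef struct {
    uint8_t stable;     /* Last accepted button bitmask */
    uint8_t candidate;  /* New bitmask waiting to be confirmed */
    uint8_t count;      /* Consecutive samples matching candidate */
} debounce_t;

/* Feeds one raw sample and returns the debounced button bitmask. */
uint8_t debounce_update(debounce_t *d, uint8_t raw) {
    if (raw == d->stable) {
        d->count = 0;                 /* Nothing changed, or a glitch ended */
    } else if (raw == d->candidate) {
        if (++d->count >= DEBOUNCE_SAMPLES) {
            d->stable = raw;          /* Seen long enough: accept it */
            d->count = 0;
        }
    } else {
        d->candidate = raw;           /* Start timing a new state */
        d->count = 1;
    }
    return d->stable;
}
```

At 60 samples per second, three samples add roughly 50 ms of latency. Sampling from a faster timer interrupt instead of the game loop reduces that delay while keeping the same logic.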

Key questions:

  • How do you handle button debouncing in real-time?
  • How do you manage RAM when you only have 128KB?
  • How do you achieve 60fps with limited SPI bandwidth?
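
The RAM and bandwidth questions both come down to short arithmetic. The sketch below gives the bytes needed for one framebuffer and an upper bound on full-frame updates per second over SPI, ignoring command overhead and gaps between transfers; the function names are illustrative.

```c
#include <stdint.h>

/* Bytes needed for one framebuffer at the given size and color depth. */
uint32_t framebuffer_bytes(uint32_t width, uint32_t height, uint32_t bits_per_pixel) {
    return width * height * (bits_per_pixel / 8);
}

/* Upper bound on full-frame updates per second: one SPI clock per bit. */
uint32_t spi_max_fps(uint32_t width, uint32_t height, uint32_t bits_per_pixel,
                     uint32_t spi_hz) {
    return spi_hz / (width * height * bits_per_pixel);
}
```

A full 320x240 RGB565 frame is 153,600 bytes and tops out at 32 frames per second on a 40 MHz bus, so a 60 fps game has to send less than a full frame: dirty rectangles, a smaller play area, or a 160x120 buffer (38,400 bytes, up to 130 fps) are typical options.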

Learning milestones:

  1. LCD displays solid color → SPI and LCD init work
  2. Sprites render correctly → Framebuffer and blitting work
  3. Game runs smoothly → Timing and input work
  4. Sound plays during game → Audio integration works

Essential Resources

Books (Priority Order)

  1. “The Art of ARM Assembly, Volume 1” by Randall Hyde - The definitive modern ARM assembly book
  2. “The Definitive Guide to ARM Cortex-M3 and Cortex-M4” by Joseph Yiu - Essential for embedded ARM
  3. “Making Embedded Systems, 2nd Edition” by Elecia White - Practical embedded development
  4. “Computer Organization and Design ARM Edition” by Patterson & Hennessy - Computer architecture fundamentals
  5. “Operating Systems: Three Easy Pieces” by Arpaci-Dusseau - OS concepts (free online)

Hardware Recommendations

  • STM32 Nucleo-F446RE (~$15) - Great all-around board, Cortex-M4F
  • Raspberry Pi - Linux-based ARM, good for assembly practice
  • STM32F4 Discovery (~$20) - More peripherals built-in

Summary: All Projects and Languages

#  | Project                 | Main Language    | Alternative Languages
---+-------------------------+------------------+----------------------
1  | ARM Instruction Decoder | C                | Rust, Python, Go
2  | ARM Assembly Calculator | ARM Assembly     | N/A
3  | Bare-Metal LED Blinker  | C + ARM Assembly | Pure Assembly, Rust
4  | UART Driver             | C                | ARM Assembly, Rust
5  | Memory Allocator        | C                | Rust, ARM Assembly
6  | Simple Bootloader       | ARM Assembly + C | Pure Assembly
7  | Context Switcher        | ARM Assembly + C | Rust with inline asm
8  | ARM Emulator            | C                | Rust, C++, Go
9  | Exception Handler       | C + ARM Assembly | Rust
10 | I2C OLED Driver         | C                | Rust, MicroPython
11 | SPI SD Card Driver      | C                | Rust
12 | PWM Motor Controller    | C                | Rust, MicroPython
13 | DMA Audio Player        | C                | Rust
14 | GDB Stub                | C                | Rust
15 | Tiny Operating System   | C + ARM Assembly | Rust
🎮 | Capstone: Game Console  | C + ARM Assembly | Rust

Remember: ARM is learned by doing, not just reading. Pick a project that excites you, get the hardware, and start building. Every bug you fix teaches you something new about how ARM really works.

Good luck on your ARM journey!