LEARN ARM DEEP DIVE

Learn ARM: From Zero to ARM Architecture Master

Goal: Deeply understand ARM architecture—from basic registers and instruction sets to bare-metal programming, writing bootloaders, and building your own ARM emulator. ARM powers billions of devices, from smartphones to Raspberry Pis to Apple Silicon Macs.

Why ARM Matters

ARM (originally Acorn RISC Machine, later Advanced RISC Machines) is the most widely used processor architecture family in the world. More than 200 billion ARM-based chips have shipped. From your iPhone to your smart thermostat, from data center servers to the Nintendo Switch, ARM is everywhere.

Yet most developers treat it as a black box. After completing these projects, you will:

Understand every register and its purpose
Read and write ARM assembly fluently
Know how instructions flow through the pipeline
Build bare-metal systems without an operating system
Create bootloaders, drivers, and schedulers
Understand the difference between ARM32, Thumb, and AArch64
Debug ARM binaries like a professional reverse engineer

Core Concept Analysis

ARM vs x86: The Philosophy

x86 (CISC):                          ARM (RISC):
├── Complex instructions              ├── Simple, uniform instructions
├── Variable-length (1-15 bytes)      ├── Fixed-length (4 bytes ARM/AArch64; Thumb-2 mixes 2 and 4)
├── Many addressing modes             ├── Load/Store architecture
├── Fewer registers (8-16)            ├── Many registers (16 general ARM32, 31 in AArch64)
└── Hardware does more work           └── Software does more work

ARM Register Set (32-bit ARMv7)

General Purpose Registers:
┌─────────────────────────────────┐
│ R0-R3   : Arguments/Return      │  ← Function parameters
│ R4-R11  : Callee-saved          │  ← Preserved across calls
│ R12 (IP): Intra-procedure call  │  ← Scratch register
│ R13 (SP): Stack Pointer         │  ← Points to top of stack
│ R14 (LR): Link Register         │  ← Return address
│ R15 (PC): Program Counter       │  ← Reads as current instruction + 8 (ARM state)
├─────────────────────────────────┤
│ CPSR    : Current Program Status│  ← Flags (N,Z,C,V) + mode bits
└─────────────────────────────────┘

AArch64 Register Set (64-bit ARMv8)

┌─────────────────────────────────┐
│ X0-X7   : Arguments/Return      │  ← Function parameters
│ X8      : Indirect result       │  ← Large struct returns
│ X9-X15  : Caller-saved temps    │  ← Scratch registers
│ X16-X17 : Intra-procedure call  │  ← IP0/IP1 scratch (linker veneers)
│ X18     : Platform reserved     │  ← (TLS on some platforms)
│ X19-X28 : Callee-saved          │  ← Preserved across calls
│ X29 (FP): Frame Pointer         │  ← Stack frame base
│ X30 (LR): Link Register         │  ← Return address
│ SP      : Stack Pointer         │  ← Points to top of stack
│ PC      : Program Counter       │  ← Not directly accessible
├─────────────────────────────────┤
│ W0-W30  : 32-bit views of X regs│  ← Lower 32 bits
└─────────────────────────────────┘

Fundamental Concepts You’ll Master

Load/Store Architecture: ARM can only operate on registers. Data must be loaded from memory to registers, operated on, then stored back. No ADD [memory], value like x86.
Conditional Execution: Most ARM instructions can be conditionally executed based on flags. ADDEQ R0, R1, R2 only adds if the Zero flag is set.
Barrel Shifter: ARM’s secret weapon. Shift operands as part of any data processing instruction: ADD R0, R1, R2, LSL #2 (R0 = R1 + R2*4)
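Instruction Pipelining: Classic ARM cores such as the ARM7TDMI use a 3-stage pipeline: Fetch → Decode → Execute (later cores use more stages). This explains why reading PC in ARM state yields the current instruction's address plus 8.
Processor Modes: User, FIQ, IRQ, Supervisor, Abort, Undefined, System. Each exception mode has its own banked SP and LR (FIQ also banks R8-R12) for fast context switching; System mode shares the User registers.
Exception Handling: Reset, Undefined, SWI, Prefetch Abort, Data Abort, IRQ, FIQ. The vector table sits at address 0x00000000 by default (or 0xFFFF0000 when high vectors are enabled).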
Memory Management: MMU, TLB, cache hierarchies. How virtual memory maps to physical.
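
To make the conditional execution rule concrete, here is a minimal C sketch of how a decoder or emulator might decide whether an instruction such as ADDEQ runs. The name cond_passed and the FLAG_* macros are illustrative, not taken from any ARM header; the flag positions are the CPSR bits N=31, Z=30, C=29, V=28.

```c
#include <stdbool.h>
#include <stdint.h>

/* CPSR flag bit positions on 32-bit ARM. */
#define FLAG_N (1u << 31)
#define FLAG_Z (1u << 30)
#define FLAG_C (1u << 29)
#define FLAG_V (1u << 28)

/* Returns true if an instruction whose 4-bit condition field is `cond`
 * would execute with the flags held in `cpsr`. */
bool cond_passed(uint32_t cond, uint32_t cpsr)
{
    bool n = (cpsr & FLAG_N) != 0, z = (cpsr & FLAG_Z) != 0;
    bool c = (cpsr & FLAG_C) != 0, v = (cpsr & FLAG_V) != 0;

    switch (cond & 0xF) {
    case 0x0: return z;               /* EQ */
    case 0x1: return !z;              /* NE */
    case 0x2: return c;               /* CS/HS */
    case 0x3: return !c;              /* CC/LO */
    case 0x4: return n;               /* MI */
    case 0x5: return !n;              /* PL */
    case 0x6: return v;               /* VS */
    case 0x7: return !v;              /* VC */
    case 0x8: return c && !z;         /* HI */
    case 0x9: return !c || z;         /* LS */
    case 0xA: return n == v;          /* GE */
    case 0xB: return n != v;          /* LT */
    case 0xC: return !z && n == v;    /* GT */
    case 0xD: return z || n != v;     /* LE */
    case 0xE: return true;            /* AL */
    default:  return false;           /* 0xF: treated as "never" in this sketch */
    }
}
```

With only Z set, cond_passed(0x0, FLAG_Z) is true, so ADDEQ would execute; with no flags set it would be skipped.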

Project List

Projects are ordered from fundamental understanding to advanced implementations.

Project 1: ARM Instruction Decoder & Disassembler

File: LEARN_ARM_DEEP_DIVE.md
Main Programming Language: C
Alternative Programming Languages: Rust, Python, Go
Coolness Level: Level 3: Genuinely Clever
Business Potential: 1. The “Resume Gold”
Difficulty: Level 2: Intermediate
Knowledge Area: Binary Parsing / Instruction Sets
Software or Tool: Custom Disassembler (like objdump)
Main Book: “The Art of ARM Assembly, Volume 1” by Randall Hyde

What you’ll build: A command-line tool that takes raw ARM binary code (or ELF files) and decodes each instruction into human-readable assembly, showing the opcode breakdown, registers used, and instruction effects.

Why it teaches ARM: Before you can write ARM assembly, you need to understand how instructions are encoded. Every ARM instruction fits into 32 bits with a specific structure. This project forces you to internalize the encoding scheme—condition codes, opcodes, register fields, immediate values, and shift operations.

Core challenges you’ll face:

Decoding the condition field (bits 31-28) → maps to understanding conditional execution
Parsing different instruction formats → maps to data processing, load/store, branch formats
Handling the barrel shifter encoding → maps to shift types and amounts in operand 2
Decoding immediate values with rotation → maps to 8-bit immediate with 4-bit rotation
Distinguishing instruction classes → maps to understanding opcode space allocation

Key Concepts:

ARM Instruction Encoding: ARM Architecture Reference Manual - Section A5
Condition Codes: “The Art of ARM Assembly, Vol 1” Chapter 4 - Randall Hyde
Binary Parsing in C: “Computer Systems: A Programmer’s Perspective” Chapter 2 - Bryant & O’Hallaron
ELF Format Basics: “Practical Binary Analysis” Chapter 2 - Dennis Andriesse

Difficulty: Intermediate
Time estimate: 1-2 weeks
Prerequisites: C programming fundamentals, understanding of binary/hexadecimal, basic knowledge of what assembly language is. No prior ARM experience required.

Real world outcome:

$ ./arm-decode firmware.bin

0x00000000: E3A00001  MOV   R0, #1         ; R0 = 1
0x00000004: E3A01002  MOV   R1, #2         ; R1 = 2
0x00000008: E0802001  ADD   R2, R0, R1     ; R2 = R0 + R1
0x0000000C: E1A0F00E  MOV   PC, LR         ; Return (PC = LR)
0x00000010: 0A000003  BEQ   0x00000024     ; Branch if Z=1
0x00000014: E59F0010  LDR   R0, [PC, #16]  ; Load from 0x0000002C (0x14 + 8 + 16)
0x00000018: E1520000  CMP   R2, R0         ; Compare R2 with R0
0x0000001C: C2833005  ADDGT R3, R3, #5     ; If greater, R3 += 5

Instruction breakdown for 0xE0802001:
  Cond: 1110 (AL - Always)
  Type: 00 (bits 27-26, Data Processing)
  I-bit: 0 (Operand2 is a register)
  OpCode: 0100 (ADD)
  S-bit: 0 (Don't update flags)
  Rn: 0000 (R0)
  Rd: 0010 (R2)
  Operand2: 000000000001 (R1, no shift)
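
The breakdown above can be reproduced with a few shifts and masks. The following is a minimal sketch; the helper bits() and the dp_fields_t struct are names invented for this example.

```c
#include <stdint.h>

/* Extract bits [hi:lo] (inclusive) from a 32-bit word.
 * The field must be narrower than 32 bits. */
static inline uint32_t bits(uint32_t word, unsigned hi, unsigned lo)
{
    return (word >> lo) & ((1u << (hi - lo + 1)) - 1u);
}

/* Fields of an ARM32 data processing instruction. */
typedef struct {
    uint32_t cond;      /* bits 31-28 */
    uint32_t imm;       /* bit 25: 1 = Operand2 is an immediate */
    uint32_t opcode;    /* bits 24-21 */
    uint32_t s;         /* bit 20: update flags */
    uint32_t rn;        /* bits 19-16 */
    uint32_t rd;        /* bits 15-12 */
    uint32_t operand2;  /* bits 11-0 */
} dp_fields_t;

dp_fields_t decode_dp(uint32_t insn)
{
    dp_fields_t f;
    f.cond     = bits(insn, 31, 28);
    f.imm      = bits(insn, 25, 25);
    f.opcode   = bits(insn, 24, 21);
    f.s        = bits(insn, 20, 20);
    f.rn       = bits(insn, 19, 16);
    f.rd       = bits(insn, 15, 12);
    f.operand2 = bits(insn, 11, 0);
    return f;
}
```

Calling decode_dp(0xE0802001) yields cond 0xE, opcode 0x4, S 0, Rn 0, Rd 2 and Operand2 0x001, matching the breakdown.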

Implementation Hints:

ARM instruction encoding follows a structured pattern. The top 4 bits (31-28) are always the condition code:

Condition Codes (bits 31-28):
0000 = EQ (Equal, Z=1)
0001 = NE (Not Equal, Z=0)
0010 = CS/HS (Carry Set/Unsigned Higher or Same)
...
1110 = AL (Always - most common)
1111 = NV (Never - special meaning in ARMv5+)

The next bits determine instruction class:

Bits 27-25 determine the instruction type:
000 = Data Processing / Multiply
001 = Data Processing (immediate)
010 = Load/Store (immediate offset)
011 = Load/Store (register offset)
100 = Load/Store Multiple
101 = Branch
110 = Coprocessor
111 = Software Interrupt / Coprocessor
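
One possible way to turn these two bit fields into text is a pair of lookup tables indexed by the extracted bits. This is a sketch, not a full decoder: the names cond_name and class_name are made up here, and the class strings simply repeat the list above.

```c
#include <stdint.h>

/* Mnemonic suffix for each value of the condition field (bits 31-28). */
static const char *const COND_NAMES[16] = {
    "EQ", "NE", "CS", "CC", "MI", "PL", "VS", "VC",
    "HI", "LS", "GE", "LT", "GT", "LE", "AL", "NV"
};

/* Coarse instruction class for each value of bits 27-25. */
static const char *const CLASS_NAMES[8] = {
    "Data Processing / Multiply",
    "Data Processing (immediate)",
    "Load/Store (immediate offset)",
    "Load/Store (register offset)",
    "Load/Store Multiple",
    "Branch",
    "Coprocessor",
    "Software Interrupt / Coprocessor"
};

const char *cond_name(uint32_t insn)
{
    return COND_NAMES[insn >> 28];
}

const char *class_name(uint32_t insn)
{
    return CLASS_NAMES[(insn >> 25) & 0x7];
}
```

For the sample word 0x0A000003 this gives "EQ" and "Branch"; a real disassembler would then dispatch to a per-class decoder.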

Questions to guide your implementation:

How do you extract specific bit ranges from a 32-bit word?
What’s the difference between (instruction >> 28) & 0xF and instruction >> 28?
How do you build a lookup table for opcode mnemonics?
When Operand2 is an immediate, how does the 4-bit rotation work?
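
For the last question, a short sketch may help: when Operand2 is an immediate, the low 8 bits are rotated right by twice the 4-bit rotation field. The function names below are illustrative.

```c
#include <stdint.h>

/* Rotate a 32-bit value right by `amount` bits (0-31). */
static inline uint32_t ror32(uint32_t value, unsigned amount)
{
    amount &= 31;
    return amount ? (value >> amount) | (value << (32 - amount)) : value;
}

/* Decode the 12-bit immediate form of Operand2:
 * bits 11-8 = rotation field, bits 7-0 = 8-bit immediate. */
uint32_t decode_imm_operand2(uint32_t operand2)
{
    uint32_t imm8 = operand2 & 0xFF;
    unsigned rot  = (operand2 >> 8) & 0xF;
    return ror32(imm8, 2 * rot);   /* rotation is always an even amount */
}
```

For example, Operand2 = 0x4FF (rotation field 4, immediate 0xFF) decodes to 0xFF000000, which is why only some 32-bit constants fit in a single MOV.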

Start simple: decode just MOV and ADD with register operands. Then expand to immediates, then load/store, then branches.

Learning milestones:

You decode condition codes correctly → You understand conditional execution is ARM’s core feature
You parse data processing instructions → You understand the ALU instruction format
You handle the barrel shifter and immediates → You understand ARM’s flexible operand encoding
You decode load/store and branches → You understand the complete instruction set structure

Project 2: ARM Assembly Calculator

File: LEARN_ARM_DEEP_DIVE.md
Main Programming Language: ARM Assembly
Alternative Programming Languages: N/A (this must be pure assembly)
Coolness Level: Level 3: Genuinely Clever
Business Potential: 1. The “Resume Gold”
Difficulty: Level 1: Beginner
Knowledge Area: ARM Assembly / Basic Arithmetic
Software or Tool: QEMU (ARM emulator), or Raspberry Pi
Main Book: “ARM Assembly By Example” (armasm.com)

What you’ll build: A four-function calculator entirely in ARM assembly that reads two numbers and an operator from stdin, performs the calculation (add, subtract, multiply, divide), and outputs the result.

Why it teaches ARM: This is your “Hello World” for ARM assembly. You’ll learn registers, arithmetic instructions, system calls, branching, and basic program structure. By handling I/O without libc, you understand how programs interact with the operating system at the lowest level.

Core challenges you’ll face:

Understanding ARM calling conventions → maps to which registers hold what
Making Linux system calls → maps to SVC instruction and syscall numbers
Converting ASCII to integers and back → maps to arithmetic and loops
Implementing multiplication/division → maps to MUL, UDIV instructions (or software division)
Branching based on operator → maps to conditional execution and CMP

Key Concepts:

ARM Registers and Basic Instructions: Azeria Labs “Writing ARM Assembly Part 1”
Linux System Calls on ARM: “ARM Assembly By Example” - armasm.com
ASCII and Number Conversion: “Introduction to Computer Organization: ARM Edition” Chapter 7 - Robert Plantz
ARM Multiply Instructions: “The Art of ARM Assembly, Vol 1” Chapter 7 - Randall Hyde

Difficulty: Beginner
Time estimate: Weekend
Prerequisites: Understanding of basic programming concepts (variables, loops, conditionals). Ability to use command line and text editor. Project 1 helps but isn’t required.

Real world outcome:

$ ./armcalc
Enter first number: 42
Enter operator (+, -, *, /): *
Enter second number: 13
Result: 546

$ ./armcalc
Enter first number: 100
Enter operator (+, -, *, /): /
Enter second number: 7
Result: 14 remainder 2

$ ./armcalc
Enter first number: 255
Enter operator (+, -, *, /): +
Enter second number: 256
Result: 511

Implementation Hints:

Your program structure will look like this:

.data section:
    - Prompt strings ("Enter first number: ", etc.)
    - Input buffers
    - Result format strings

.text section:
    - _start: Entry point
    - read_number: Read string, convert ASCII → integer
    - print_number: Convert integer → ASCII, print
    - do_add, do_sub, do_mul, do_div: Arithmetic routines

Linux system calls on ARM use the SVC #0 instruction:

- R7 = syscall number (1=exit, 3=read, 4=write)
- R0 = first argument (fd for read/write)
- R1 = second argument (buffer address)
- R2 = third argument (count)
- Return value in R0

Key questions to answer:

How do you convert the character ‘5’ to the number 5? (Hint: subtract ‘0’)
How do you handle multi-digit numbers? (Hint: multiply accumulator by 10, add digit)
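What if your ARM core doesn’t have a divide instruction? (SDIV/UDIV are optional in ARMv7-A: present on cores such as Cortex-A7 and Cortex-A15, absent on Cortex-A8 and Cortex-A9. Without them you need software division.)
How do you compare the operator character to ‘+’, ‘-’, ‘*’, ‘/’?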
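
Before writing the conversion routines in assembly, it can help to pin down the logic in C. The sketch below shows the two hints from the questions above; parse_uint and format_uint are hypothetical names, and the code assumes unsigned input without overflow checking.

```c
#include <stddef.h>
#include <stdint.h>

/* ASCII -> integer: for each digit, acc = acc * 10 + (c - '0').
 * Stops at the first non-digit (for example the newline from read). */
uint32_t parse_uint(const char *s)
{
    uint32_t acc = 0;
    while (*s >= '0' && *s <= '9') {
        acc = acc * 10 + (uint32_t)(*s - '0');
        s++;
    }
    return acc;
}

/* Integer -> ASCII: peel off digits with % 10, then reverse them.
 * `buf` needs room for 10 digits plus the terminator. Returns the length. */
size_t format_uint(uint32_t value, char *buf)
{
    char tmp[10];
    size_t n = 0;
    do {
        tmp[n++] = (char)('0' + value % 10);
        value /= 10;
    } while (value != 0);
    for (size_t i = 0; i < n; i++)
        buf[i] = tmp[n - 1 - i];
    buf[n] = '\0';
    return n;
}
```

In assembly, the multiply by 10 maps to MUL (or shifts and adds), and the length returned by format_uint is what you would pass in R2 to the write syscall.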

Start with just addition. Get input working, parsing working, then output. Once that works, add the other operations.
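
If your target has no hardware divide, one common fallback is shift-and-subtract (restoring) division. The C sketch below is a reference model to port to assembly, not a tuned routine; soft_udiv and udiv_result_t are names chosen for this example, and returning zeros for a zero divisor is an arbitrary choice.

```c
#include <stdint.h>

typedef struct {
    uint32_t quotient;
    uint32_t remainder;
} udiv_result_t;

/* Unsigned division, one quotient bit per iteration, most significant first. */
udiv_result_t soft_udiv(uint32_t dividend, uint32_t divisor)
{
    udiv_result_t result = {0, 0};
    uint64_t rem = 0;   /* 64-bit so the shift below cannot lose a bit */

    if (divisor == 0)
        return result;  /* caller should report divide-by-zero itself */

    for (int bit = 31; bit >= 0; bit--) {
        rem = (rem << 1) | ((dividend >> bit) & 1u);  /* bring down next bit */
        if (rem >= divisor) {
            rem -= divisor;
            result.quotient |= 1u << bit;
        }
    }
    result.remainder = (uint32_t)rem;
    return result;
}
```

soft_udiv(100, 7) gives quotient 14 and remainder 2, the same values as the calculator session shown earlier.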

Learning milestones:

You print “Hello, ARM!” → You understand system calls and basic structure
You read a number from input → You understand data sections and I/O
You perform calculations on registers → You understand arithmetic instructions
You handle all four operations → You understand branching and program flow

Project 3: Bare-Metal LED Blinker

File: LEARN_ARM_DEEP_DIVE.md
Main Programming Language: C with ARM Assembly startup
Alternative Programming Languages: Pure ARM Assembly, Rust
Coolness Level: Level 4: Hardcore Tech Flex
Business Potential: 1. The “Resume Gold”
Difficulty: Level 3: Advanced
Knowledge Area: Bare-Metal / GPIO / Embedded Systems
Software or Tool: STM32 Nucleo board or Raspberry Pi (bare-metal mode)
Main Book: “Making Embedded Systems, 2nd Edition” by Elecia White

What you’ll build: An LED blinker that runs directly on ARM hardware with NO operating system. You’ll write the startup code, linker script, and GPIO control—just you and the silicon.

Why it teaches ARM: This strips away all abstractions. No OS, no libraries, no runtime. You’ll understand: how ARM boots, what the vector table does, how to configure GPIO registers directly, and what “bare metal” really means. This is the foundational skill for all embedded systems work.

Core challenges you’ll face:

Writing the vector table and reset handler → maps to ARM exception model
Creating a linker script → maps to memory layout (Flash, RAM, stack)
Configuring system clocks → maps to RCC registers and clock tree
Controlling GPIO registers → maps to memory-mapped I/O
Creating accurate delays without OS → maps to timer peripherals or cycle counting

Resources for key challenges:

“cpq/bare-metal-programming-guide” - Complete walkthrough for STM32F4
“Vivonomicon’s Bare Metal STM32 Series” - Excellent step-by-step guide

Key Concepts:

ARM Startup Sequence: “Making Embedded Systems” Chapter 3 - Elecia White
Linker Scripts: “Bare Metal C” Chapter 2 - Steve Oualline
Memory-Mapped I/O: “Computer Systems: A Programmer’s Perspective” Chapter 9 - Bryant & O’Hallaron
GPIO Configuration: STM32F4 Reference Manual - ST Microelectronics (free PDF)
Clock Configuration: “Making Embedded Systems” Chapter 5 - Elecia White

Difficulty: Advanced
Time estimate: 1-2 weeks
Prerequisites: Project 2 (ARM assembly basics), C programming, understanding of hexadecimal and bitwise operations. You need either an STM32 Nucleo board (~$15) or a Raspberry Pi.

Real world outcome:

Physical result: An LED on your board blinks at 1Hz

Serial output (if you add UART later):
Starting bare-metal LED blinker...
System clock: 16 MHz
GPIO configured: PA5 as output
Blink cycle: 500ms on, 500ms off
[LED toggles visibly on the board]

Implementation Hints:

Your project will have this structure:

project/
├── startup.s       # Vector table, reset handler, stack init
├── main.c          # LED blink logic
├── linker.ld       # Memory layout script
├── Makefile        # Build with arm-none-eabi-gcc
└── stm32f4xx.h     # Register definitions (or write your own!)

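The Cortex-M boot sequence (used by STM32; classic ARM cores instead start executing code at the reset vector):

Power on → CPU fetches initial SP from address 0x00000000 (aliased to the start of Flash on STM32 in the default boot mode)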
CPU fetches reset vector from address 0x00000004
CPU jumps to reset handler address
Reset handler: copy .data from Flash to RAM, zero .bss, call main()
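
The last step of the sequence can be sketched in C. In a real reset handler the five pointers would come from symbols defined in your linker script (names such as _sidata, _sdata, _edata, _sbss and _ebss are a common convention, but they are an assumption here); this version takes them as parameters so the logic can be tried on a host machine. The function name init_ram is also made up.

```c
#include <stdint.h>

/* Copy initialized data from its load address (Flash) to its run
 * address (RAM), then zero the .bss region. Works word by word, so
 * all regions are assumed to be 4-byte aligned. */
void init_ram(const uint32_t *data_load,
              uint32_t *data_start, uint32_t *data_end,
              uint32_t *bss_start, uint32_t *bss_end)
{
    for (uint32_t *dst = data_start; dst < data_end; )
        *dst++ = *data_load++;      /* .data: Flash -> RAM */

    for (uint32_t *dst = bss_start; dst < bss_end; )
        *dst++ = 0;                 /* .bss: must read as zero in C */
}
```

After this runs, the reset handler calls main(); if main() ever returns, the usual choice is to loop forever.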

The vector table (first thing in Flash memory):

.section .vectors
vectors:
    .word   _stack_top      @ Initial Stack Pointer
    .word   reset_handler   @ Reset Handler
    .word   nmi_handler     @ NMI
    .word   hardfault_handler @ Hard Fault
    ... more exception vectors ...

GPIO control on STM32 (memory-mapped registers):

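#include <stdint.h>

// These are memory addresses, not variables!
#define RCC_AHB1ENR   (*(volatile uint32_t *)0x40023830)  // STM32F4 GPIO clock enables
#define GPIOA_BASE    0x40020000
#define GPIOA_MODER   (*(volatile uint32_t *)(GPIOA_BASE + 0x00))
#define GPIOA_ODR     (*(volatile uint32_t *)(GPIOA_BASE + 0x14))

// Enable the GPIOA clock first; the port ignores writes without it
RCC_AHB1ENR |= (1u << 0);

// Set PA5 as output: clear both mode bits (11:10), then write 01
GPIOA_MODER &= ~(3u << 10);
GPIOA_MODER |= (1u << 10);

// Toggle LED
GPIOA_ODR ^= (1u << 5);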

Key questions:

Why must GPIO registers be declared volatile?
What happens if you forget to enable the GPIO clock in RCC?
How do you calculate the number of loop iterations for a 500ms delay at 16MHz?
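
For the delay question, the arithmetic is clock cycles divided by cycles per loop iteration. The sketch below assumes a hypothetical cost of 4 cycles per iteration; the real figure depends on your compiler output and core, so measure it or use a timer instead. The name delay_iterations is invented for this example.

```c
#include <stdint.h>

/* Number of busy-loop iterations needed for `delay_ms` milliseconds,
 * given the CPU clock and an assumed cost per loop iteration. */
uint32_t delay_iterations(uint32_t cpu_hz, uint32_t delay_ms,
                          uint32_t cycles_per_iter)
{
    /* 64-bit intermediate: cpu_hz * delay_ms overflows 32 bits quickly. */
    uint64_t cycles = (uint64_t)cpu_hz * delay_ms / 1000u;
    return (uint32_t)(cycles / cycles_per_iter);
}
```

At 16 MHz, 500 ms is 8,000,000 cycles, so with the assumed 4 cycles per iteration the loop count is 2,000,000. Declare the loop counter volatile, or the compiler may remove the loop entirely.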

Learning milestones:

Your code compiles with arm-none-eabi-gcc → You understand cross-compilation
You flash the binary and CPU doesn’t crash → Vector table is correct
LED lights up (even if stuck) → GPIO configuration works
LED blinks at correct rate → You understand timing without an OS

Project 4: UART Driver from Scratch

File: LEARN_ARM_DEEP_DIVE.md
Main Programming Language: C
Alternative Programming Languages: ARM Assembly, Rust
Coolness Level: Level 4: Hardcore Tech Flex
Business Potential: 2. The “Micro-SaaS / Pro Tool”
Difficulty: Level 3: Advanced
Knowledge Area: Serial Communication / Peripheral Programming
Software or Tool: STM32 board, USB-to-Serial adapter
Main Book: “The Linux Programming Interface” by Michael Kerrisk (for comparison)

What you’ll build: A complete UART (serial) driver for bare-metal ARM that supports both polling and interrupt-driven I/O, with configurable baud rates, and can serve as a debug console.

Why it teaches ARM: UART is the “printf debugging” of embedded systems. Building it from scratch teaches you: peripheral register programming, interrupt configuration (NVIC), baud rate calculations, and how the CPU interacts with external hardware. This becomes your foundation for all other peripheral drivers.

Core challenges you’ll face:

Configuring UART registers → maps to understanding peripheral register maps
Calculating baud rate divisors → maps to clock relationships and math
Implementing interrupt handlers → maps to ARM exception model and NVIC
Creating ring buffers for async I/O → maps to data structures for real-time systems
Handling framing errors → maps to error detection and recovery

Key Concepts:

UART Protocol Basics: “Making Embedded Systems” Chapter 9 - Elecia White
ARM Interrupt Handling: “The Definitive Guide to ARM Cortex-M3 and Cortex-M4” Chapter 8 - Joseph Yiu
Ring Buffer Implementation: “Mastering Algorithms with C” Chapter 6 - Kyle Loudon
Baud Rate Generation: STM32 Reference Manual Section on USART
NVIC Configuration: ARM Cortex-M Technical Reference Manual

Difficulty: Advanced
Time estimate: 1-2 weeks
Prerequisites: Project 3 (bare-metal basics), understanding of serial communication concepts, interrupt concepts.

Real world outcome:

# On your computer, connected via USB-serial adapter:
$ screen /dev/ttyUSB0 115200

=== ARM UART Driver Test ===
UART initialized at 115200 baud
Echo mode active. Type characters:

> Hello, ARM!
You typed: Hello, ARM!

> test interrupt
IRQ count: 47, TX: 14, RX: 14, Errors: 0

> stats
Buffer status: RX 0/64, TX 0/64
Overruns: 0, Framing errors: 0

Implementation Hints:

UART register structure (simplified for STM32):

USART_BASE
├── SR   (Status Register)     - TXE, RXNE, ORE flags
├── DR   (Data Register)       - Read/write data here
├── BRR  (Baud Rate Register)  - Divisor for baud rate
├── CR1  (Control Register 1)  - Enable UART, TX, RX, interrupts
├── CR2  (Control Register 2)  - Stop bits
└── CR3  (Control Register 3)  - Flow control

Baud rate calculation:

BRR = fPCLK / (16 * BaudRate)

For 115200 baud at 16MHz:
BRR = 16,000,000 / (16 * 115200) = 8.68 ≈ 8 + 11/16
Mantissa = 8, Fraction = 11
BRR = (8 << 4) | 11 = 0x8B
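
The same calculation can be done in integer arithmetic. With 16x oversampling, the mantissa and 4-bit fraction together are just fPCLK / BaudRate rounded to the nearest integer. The function name usart_brr is illustrative, and the sketch ignores the 8x oversampling mode.

```c
#include <stdint.h>

/* BRR value for 16x oversampling: mantissa in bits 15-4, fraction
 * (in sixteenths) in bits 3-0. Rounded to the nearest sixteenth. */
uint32_t usart_brr(uint32_t pclk_hz, uint32_t baud)
{
    return (pclk_hz + baud / 2u) / baud;
}
```

usart_brr(16000000, 115200) returns 0x8B, matching the hand calculation: mantissa 8, fraction 11. The resulting baud rate is 16,000,000 / 139, about 115,108, an error of roughly 0.08%.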

Polling vs Interrupts:

Polling (simple but wastes CPU):
while (!(USART_SR & USART_SR_TXE));  // Wait for TX empty
USART_DR = character;

Interrupts (efficient but complex):
1. Enable RXNE interrupt in CR1
2. Set the priority of USARTx_IRQn in the NVIC and enable it
3. In interrupt handler:
   - Check which flag triggered (RXNE? TXE?)
   - Read/write DR as appropriate
   - Clear flags if needed

Ring buffer structure for interrupt-driven I/O:

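#define RB_SIZE 64

struct ring_buffer {
    uint8_t data[RB_SIZE];
    volatile uint8_t head;  // Write position (advanced only by the ISR)
    volatile uint8_t tail;  // Read position (advanced only by the main loop)
};

static struct ring_buffer buf;

// In RX interrupt: add to buffer, dropping the byte if the buffer is full
uint8_t byte = USART_DR;                  // Reading DR also clears RXNE
uint8_t next = (buf.head + 1) % RB_SIZE;
if (next != buf.tail) {
    buf.data[buf.head] = byte;
    buf.head = next;
}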

Key questions:

Why must the buffer indices be volatile?
What happens if the ring buffer overflows?
How do you handle the case where the user types faster than you can process?
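
One way to answer the overflow question is to make the buffer refuse new bytes when full and count the drops. The sketch below is a single-producer, single-consumer design (ISR writes, main loop reads); the names ring_buffer_t, rb_put and rb_get are chosen for this example. One slot is kept empty so that head == tail always means "empty".

```c
#include <stdbool.h>
#include <stdint.h>

#define RB_SIZE 64u

typedef struct {
    uint8_t data[RB_SIZE];
    volatile uint8_t head;       /* next write index (producer only) */
    volatile uint8_t tail;       /* next read index (consumer only) */
    volatile uint32_t overruns;  /* bytes dropped because the buffer was full */
} ring_buffer_t;

/* Called from the RX interrupt. Returns false if the byte was dropped. */
bool rb_put(ring_buffer_t *rb, uint8_t byte)
{
    uint8_t next = (uint8_t)((rb->head + 1u) % RB_SIZE);
    if (next == rb->tail) {      /* full: usable capacity is RB_SIZE - 1 */
        rb->overruns++;
        return false;
    }
    rb->data[rb->head] = byte;
    rb->head = next;             /* publish only after the data is written */
    return true;
}

/* Called from the main loop. Returns false if the buffer is empty. */
bool rb_get(ring_buffer_t *rb, uint8_t *out)
{
    if (rb->tail == rb->head)
        return false;
    *out = rb->data[rb->tail];
    rb->tail = (uint8_t)((rb->tail + 1u) % RB_SIZE);
    return true;
}
```

Dropping the newest byte is only one policy; overwriting the oldest byte or asserting hardware flow control are alternatives worth comparing.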

Learning milestones:

You see characters in the terminal → Basic TX works
Echo mode works → Both TX and RX work
Interrupts don’t crash the system → NVIC is configured correctly
No characters lost at high speed → Ring buffers work correctly

Project 5: ARM Memory Allocator (malloc from Scratch)

File: LEARN_ARM_DEEP_DIVE.md
Main Programming Language: C
Alternative Programming Languages: Rust, ARM Assembly
Coolness Level: Level 3: Genuinely Clever
Business Potential: 1. The “Resume Gold”
Difficulty: Level 3: Advanced
Knowledge Area: Memory Management / Data Structures
Software or Tool: QEMU ARM emulator or bare-metal STM32
Main Book: “Computer Systems: A Programmer’s Perspective” by Bryant & O’Hallaron

What you’ll build: A custom heap memory allocator for bare-metal ARM systems, implementing malloc(), free(), and realloc() with proper alignment, coalescing, and fragmentation handling.

Why it teaches ARM: ARM cares about alignment: word accesses should be naturally aligned, and an allocator must return memory suitably aligned for any type (conventionally 8 bytes on ARM32, 16 bytes on AArch64). Building malloc teaches you memory layout, pointer arithmetic, alignment padding, and how dynamic memory really works. On bare-metal there is no sbrk(); you manage a fixed memory pool.

Core challenges you’ll face:

Maintaining free lists → maps to linked list structures in raw memory
Block splitting and coalescing → maps to algorithms for memory efficiency
Alignment requirements → maps to ARM alignment rules and performance
Heap corruption detection → maps to debug techniques and canary values
Working without sbrk → maps to fixed-pool allocation

Key Concepts:

Dynamic Memory Allocation: “Computer Systems: A Programmer’s Perspective” Chapter 9.9 - Bryant & O’Hallaron
Free List Algorithms: “The Art of Computer Programming” Vol 1, Chapter 2.5 - Donald Knuth
ARM Alignment Rules: “The Art of ARM Assembly, Vol 1” Chapter 3 - Randall Hyde
Pool Allocators: “Game Programming Patterns” Chapter 19 - Robert Nystrom

Difficulty: Advanced
Time estimate: 1-2 weeks
Prerequisites: Strong C pointer skills, understanding of memory layout, Project 3 (bare-metal environment).

Real world outcome:

// Test program output:
=== ARM Malloc Test Suite ===

Test 1: Basic allocation
  malloc(32) = 0x20001008 ✓
  malloc(64) = 0x20001030 ✓
  malloc(16) = 0x20001078 ✓

Test 2: Free and reuse
  free(0x20001030)
  malloc(32) = 0x20001030 ✓  (reused!)

Test 3: Coalescing
  free(0x20001008)
  free(0x20001078)
  malloc(100) = 0x20001008 ✓  (coalesced block)

Test 4: Alignment
  malloc(1) = 0x20001070 (aligned to 8) ✓
  malloc(7) = 0x20001080 (aligned to 8) ✓

Heap stats:
  Total pool: 4096 bytes
  Allocated: 148 bytes
  Free: 3948 bytes
  Fragments: 2
  Largest free block: 3916 bytes

Implementation Hints:

Block header structure:

typedef struct block_header {
    size_t size;                    // Block size (including header)
    struct block_header *next;      // Next free block (if free)
    uint32_t magic;                 // 0xDEADBEEF = allocated, 0xFEEDFACE = free
} __attribute__((aligned(8))) block_header_t;  // Padded to a multiple of 8 so user data stays 8-byte aligned

Memory layout:

Heap Pool (e.g., 4KB starting at 0x20001000):
┌──────────────────────────────────────────────────┐
│ HDR │ User Data... │ HDR │ User Data... │ FREE...│
└──────────────────────────────────────────────────┘
  ↑                     ↑
  Block 1 (allocated)   Block 2 (allocated)

Alignment calculation:

// Round up to an 8-byte boundary (enough for any ARM32 type; AArch64 allocators conventionally use 16)
#define ALIGN8(x) (((x) + 7) & ~(size_t)7)

void *malloc(size_t size) {
    size_t aligned_size = ALIGN8(size + sizeof(block_header_t));
    // ... find or split a free block of at least aligned_size ...
}

Free list strategies:

First fit: Take the first block that’s big enough (fast, but fragments)
Best fit: Find the smallest adequate block (slower, less fragmentation)
Segregated lists: Multiple lists for different size classes (fast and efficient)

Coalescing:

Before free(B):
┌─────────┬─────────┬─────────┐
│ A (free)│ B (used)│ C (free)│
└─────────┴─────────┴─────────┘

After free(B) with coalescing:
┌─────────────────────────────┐
│      A+B+C (free)           │
└─────────────────────────────┘
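
To tie first fit, splitting and coalescing together, here is a compact sketch of a fixed-pool allocator. It is deliberately simpler than the header shown earlier: it walks all blocks by address (an implicit list) instead of keeping a free list, and it uses a used flag rather than magic values. The names pool_init, pool_alloc, pool_free and hdr_t are invented for this example.

```c
#include <stddef.h>
#include <stdint.h>

#define POOL_SIZE 4096u
#define ALIGN8(x) (((x) + 7u) & ~(size_t)7u)

typedef struct {
    size_t size;   /* whole block size in bytes, header included */
    size_t used;   /* 1 = allocated, 0 = free */
} hdr_t;           /* 8 bytes on ARM32, 16 on 64-bit targets */

static _Alignas(8) uint8_t pool[POOL_SIZE];

static hdr_t *next_block(hdr_t *b) { return (hdr_t *)((uint8_t *)b + b->size); }
static int in_pool(hdr_t *b) { return (uint8_t *)b < pool + POOL_SIZE; }

void pool_init(void)
{
    hdr_t *first = (hdr_t *)pool;   /* one big free block */
    first->size = POOL_SIZE;
    first->used = 0;
}

void *pool_alloc(size_t size)
{
    if (size > POOL_SIZE)
        return NULL;
    size_t need = ALIGN8(size + sizeof(hdr_t));

    /* First fit: take the first free block that is large enough. */
    for (hdr_t *b = (hdr_t *)pool; in_pool(b); b = next_block(b)) {
        if (b->used || b->size < need)
            continue;
        if (b->size - need >= sizeof(hdr_t) + 8u) {   /* split off the tail */
            hdr_t *rest = (hdr_t *)((uint8_t *)b + need);
            rest->size = b->size - need;
            rest->used = 0;
            b->size = need;
        }
        b->used = 1;
        return b + 1;   /* user data starts right after the header */
    }
    return NULL;
}

void pool_free(void *ptr)
{
    if (ptr == NULL)
        return;
    ((hdr_t *)ptr - 1)->used = 0;

    /* Coalesce: merge every run of adjacent free blocks in one pass. */
    for (hdr_t *b = (hdr_t *)pool; in_pool(b); b = next_block(b)) {
        while (!b->used && in_pool(next_block(b)) && !next_block(b)->used)
            b->size += next_block(b)->size;
    }
}
```

The full-pool walk in pool_free is O(n) per call; boundary tags (a footer holding the block size) would let you find the previous neighbor in constant time. This sketch also does no double-free or corruption checking, which is where the magic field from the header above comes in.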

Key questions:

How do you find neighboring blocks for coalescing?
What happens if someone frees a pointer twice?
How do you detect heap corruption?

Learning milestones:

malloc/free work for single allocations → Basic structure is correct
Memory is reused after free → Free list works
Coalescing prevents fragmentation → Block merging works
Stress test passes → Ready for real use

Project 6: Simple Bootloader

File: LEARN_ARM_DEEP_DIVE.md
Main Programming Language: ARM Assembly + C
Alternative Programming Languages: Pure Assembly
Coolness Level: Level 5: Pure Magic (Super Cool)
Business Potential: 3. The “Service & Support” Model
Difficulty: Level 4: Expert
Knowledge Area: Boot Process / Flash Programming / Firmware
Software or Tool: STM32 with dual-bank Flash, or QEMU
Main Book: “Bare Metal C” by Steve Oualline

What you’ll build: A bootloader that initializes hardware, provides a serial interface for firmware updates, validates new firmware (CRC check), and chains to the main application. Like a mini U-Boot for your microcontroller.

Why it teaches ARM: Bootloaders are the first code that runs. You’ll understand: ARM reset sequence, Flash programming from code, memory remapping, jump to application code, and firmware update protocols. This is critical knowledge for any production embedded system.

Core challenges you’ll face:

Flash self-programming → maps to unlocking and writing to Flash from code
XMODEM or custom protocol → maps to reliable data transfer over serial
CRC validation → maps to integrity checking algorithms
Jumping to application → maps to vector table relocation, stack setup
Fitting in limited space → maps to code size optimization

Key Concepts:

ARM Boot Sequence: “Making Embedded Systems” Chapter 10 - Elecia White
Flash Programming: STM32 Flash Programming Manual - ST Microelectronics
XMODEM Protocol: Chuck Forsberg’s Original Specification
CRC Algorithms: “Hacker’s Delight” Chapter 14 - Henry S. Warren
Vector Table Offset Register: ARM Cortex-M Technical Reference Manual

Difficulty: Expert
Time estimate: 2-4 weeks
Prerequisites: Projects 3 and 4 (bare-metal and UART), understanding of Flash memory concepts.

Real world outcome:

$ screen /dev/ttyUSB0 115200

========================================
    ARM Custom Bootloader v1.0
========================================
Flash: 512KB (Bootloader: 32KB, App: 480KB)
Current app: CRC 0x1A2B3C4D, Valid

Commands:
  [1] Boot application
  [2] Upload new firmware (XMODEM)
  [3] Verify current firmware
  [4] Dump flash info
  [5] Erase application area

> 2
Ready to receive firmware via XMODEM...
Send file now (XMODEM protocol)

CCCC
Receiving: [====================] 48KB
CRC check: PASS (0x5E6F7A8B)
Flashing: [====================] Done

Firmware updated successfully!
Rebooting into new application...

> 1
Jumping to application at 0x08008000...

=== Application Starting ===
Hello from the new firmware!

Implementation Hints:

Memory layout:

Flash Memory (512KB example):
┌──────────────────────────────────────────────────┐
│ 0x08000000-0x08007FFF: Bootloader (32KB)         │
├──────────────────────────────────────────────────┤
│ 0x08008000-0x0807FFFF: Application (480KB)       │
│   ├── Vector Table (0x08008000)                  │
│   ├── .text (code)                               │
│   └── .rodata (constants)                        │
└──────────────────────────────────────────────────┘

Jumping to application:

void jump_to_app(uint32_t app_address) {
    // 1. Get the application's initial stack pointer
    uint32_t app_sp = *(uint32_t *)app_address;

    // 2. Get the application's reset handler address
    uint32_t app_reset = *(uint32_t *)(app_address + 4);

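    // 3. Mask interrupts so no bootloader ISR fires against the app's
    //    vector table (the app re-enables them), then set the
    //    Vector Table Offset Register
    __disable_irq();
    SCB->VTOR = app_address;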

    // 4. Set stack pointer and jump
    __set_MSP(app_sp);
    void (*app_entry)(void) = (void (*)(void))app_reset;
    app_entry();  // Never returns
}
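
Before jumping, a bootloader usually wants a cheap sanity check that an application is actually present, since erased Flash reads as 0xFFFFFFFF. The sketch below checks the first two vector table words. The address ranges are assumptions for a part with 128 KB of SRAM and the 512 KB layout used in this project; adjust them to your chip. The function name is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed memory map: adjust to your device. */
#define RAM_START  0x20000000u
#define RAM_END    0x20020000u   /* assumption: 128 KB of SRAM */
#define APP_START  0x08008000u
#define APP_END    0x08080000u

/* Returns true if the first two vector table entries look like a real
 * application: a stack pointer inside RAM and a Thumb reset handler
 * inside the application's Flash region. */
bool app_vectors_plausible(uint32_t initial_sp, uint32_t reset_vector)
{
    bool sp_ok = initial_sp > RAM_START && initial_sp <= RAM_END &&
                 (initial_sp & 0x3u) == 0;
    bool thumb = (reset_vector & 1u) != 0;   /* Cortex-M runs Thumb only */
    uint32_t entry = reset_vector & ~1u;
    bool pc_ok = entry >= APP_START && entry < APP_END;
    return sp_ok && thumb && pc_ok;
}
```

This only catches a blank or obviously wrong image; it complements the CRC check rather than replacing it.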

Flash programming sequence (STM32):

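Unlock Flash: Write keys 0x45670123, then 0xCDEF89AB to FLASH_KEYR
Wait for BSY flag to clear
Erase the target sector or page first (programming can only change bits from 1 to 0)
Set PG (Program) bit in FLASH_CR
Write data to Flash address (word at a time)
Wait for BSY flag
Check the error flags in FLASH_SR (names vary by STM32 family, e.g. PGERR, WRPERR)
Lock Flash: Set LOCK bit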

XMODEM basics:

Receiver sends: 'C' (CRC mode) or NAK
Sender sends: SOH (0x01), Block#, ~Block#, 128 bytes, CRC16
Receiver sends: ACK or NAK
Repeat until EOT
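
The CRC16 used by XMODEM-CRC is the CCITT polynomial 0x1021 with an initial value of 0, sent high byte first. Below is a bitwise sketch plus a packet check for the 128-byte block format described above. The function and macro names are illustrative, and a production bootloader might prefer a table-driven CRC for speed.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define XMODEM_SOH        0x01
#define XMODEM_DATA_LEN   128
#define XMODEM_PACKET_LEN 133   /* SOH + blk + ~blk + 128 data + 2 CRC */

/* CRC-16/XMODEM: polynomial 0x1021, initial value 0, no reflection. */
uint16_t crc16_xmodem(const uint8_t *data, size_t len)
{
    uint16_t crc = 0;
    for (size_t i = 0; i < len; i++) {
        crc ^= (uint16_t)(data[i] << 8);
        for (int bit = 0; bit < 8; bit++) {
            if (crc & 0x8000u)
                crc = (uint16_t)((crc << 1) ^ 0x1021u);
            else
                crc = (uint16_t)(crc << 1);
        }
    }
    return crc;
}

/* Validate one received packet against the expected block number. */
bool xmodem_packet_ok(const uint8_t pkt[XMODEM_PACKET_LEN],
                      uint8_t expected_block)
{
    if (pkt[0] != XMODEM_SOH)
        return false;
    if (pkt[1] != expected_block)
        return false;
    if ((uint8_t)(pkt[1] ^ pkt[2]) != 0xFF)   /* block and its complement */
        return false;
    uint16_t received = (uint16_t)((pkt[131] << 8) | pkt[132]);
    return crc16_xmodem(&pkt[3], XMODEM_DATA_LEN) == received;
}
```

A handy self-test: the CRC of the ASCII string "123456789" is 0x31C3. The receiver replies ACK when xmodem_packet_ok returns true and NAK otherwise.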

Key questions:

Why must you disable interrupts before jumping to the app?
How do you ensure the bootloader can recover from a failed update?
What if power is lost mid-flash?

Learning milestones:

Bootloader boots and shows menu → Basic structure works
You can receive data over XMODEM → Protocol implementation works
Flash programming succeeds → You can modify your own Flash
Application boots from bootloader → Vector table and jump work

Project 7: Context Switcher & Mini-Scheduler

File: LEARN_ARM_DEEP_DIVE.md
Main Programming Language: ARM Assembly + C
Alternative Programming Languages: Rust with inline assembly
Coolness Level: Level 5: Pure Magic (Super Cool)
Business Potential: 1. The “Resume Gold”
Difficulty: Level 4: Expert
Knowledge Area: Operating Systems / Concurrency
Software or Tool: STM32 board or QEMU
Main Book: “Operating Systems: Three Easy Pieces” by Arpaci-Dusseau

What you’ll build: A preemptive scheduler that runs multiple “tasks” concurrently on a single ARM core, with context switching driven by the SysTick timer. Like a tiny FreeRTOS kernel.

Why it teaches ARM: This is where you truly understand how operating systems work. You’ll learn: Cortex-M Thread and Handler modes, the banked MSP and PSP stack pointers, exception entry/exit, stack management per task, and how multitasking is an illusion created by fast switching. This is the foundation of every RTOS.

Core challenges you’ll face:

Saving and restoring full context → maps to ARM register banking and stack frames
SysTick timer configuration → maps to periodic interrupts for preemption
Task Control Blocks (TCBs) → maps to per-task state management
Stack allocation per task → maps to memory partitioning
Avoiding race conditions → maps to critical sections and interrupt masking

Key Concepts:

Context Switching: “Operating Systems: Three Easy Pieces” Chapter 6 - Arpaci-Dusseau
ARM Exception Handling: “The Definitive Guide to ARM Cortex-M3” Chapter 8 - Joseph Yiu
SysTick Timer: “Making Embedded Systems” Chapter 6 - Elecia White
Process Control Blocks: “Modern Operating Systems” Chapter 2 - Tanenbaum
Critical Sections: “Operating Systems: Three Easy Pieces” Chapter 28

Difficulty: Expert
Time estimate: 2-4 weeks
Prerequisites: Projects 3-4 (bare-metal, UART), strong understanding of ARM assembly and stacks.

Real world outcome:

=== ARM Mini-Scheduler Demo ===

Creating tasks:
  Task 1: LED blinker (priority 1)
  Task 2: Serial echo (priority 2)
  Task 3: Counter display (priority 3)

Scheduler started, tick = 10ms

[00000010] Task3: Count = 1
[00000020] Task3: Count = 2
[00000025] Task2: Echo 'H'
[00000026] Task2: Echo 'i'
[00000030] Task3: Count = 3
[00000040] Task3: Count = 4
[00000500] Task1: LED ON
[00001000] Task1: LED OFF
[00001010] Task3: Count = 100

Context switches: 247
Task 1: 5% CPU, 12 switches
Task 2: 15% CPU, 89 switches
Task 3: 80% CPU, 146 switches

Implementation Hints:

Task Control Block (TCB):

typedef struct {
    uint32_t *sp;           // Saved stack pointer
    uint32_t stack[256];    // Task's private stack
    uint8_t priority;       // Scheduling priority
    uint8_t state;          // READY, RUNNING, BLOCKED
    char name[16];          // For debugging
} TCB_t;

ARM Cortex-M exception stack frame (automatically pushed on exception entry):

High addresses
    ┌─────────────┐
    │    xPSR     │  ← Original program status
    │     PC      │  ← Where to return to
    │     LR      │  ← Was the Link Register
    │     R12     │
    │     R3      │
    │     R2      │
    │     R1      │
    │     R0      │  ← SP points here after exception
    └─────────────┘
Low addresses

You must also save R4-R11 manually!
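
A new task has never been interrupted, so its stack must be pre-filled to look as if it had. The sketch below builds that fake frame: the eight hardware-stacked words from the diagram above, followed by eight words for R4-R11. The function name and parameters are hypothetical, and addresses are passed as plain 32-bit values so the code can also be tried on a host machine.

```c
#include <stdint.h>

#define INITIAL_XPSR 0x01000000u   /* only the Thumb bit set */

/* Prepare a task stack so the first context switch "returns" into the
 * task. `stack_top` points one past the highest word of the stack.
 * Returns the value to store as the task's saved stack pointer. */
uint32_t *task_stack_init(uint32_t *stack_top, uint32_t entry, uint32_t on_exit)
{
    uint32_t *sp = stack_top;

    /* Frame the hardware pops on exception return (high to low). */
    *(--sp) = INITIAL_XPSR;      /* xPSR */
    *(--sp) = entry & ~1u;       /* PC: task entry point */
    *(--sp) = on_exit;           /* LR: where the task goes if it returns */
    *(--sp) = 0;                 /* R12 */
    *(--sp) = 0;                 /* R3 */
    *(--sp) = 0;                 /* R2 */
    *(--sp) = 0;                 /* R1 */
    *(--sp) = 0;                 /* R0: could carry a task argument */

    /* Registers the context switch code restores itself. */
    for (int i = 0; i < 8; i++)
        *(--sp) = 0;             /* R11 down to R4 */

    return sp;
}
```

The returned pointer sits 16 words below stack_top, which is exactly what the handler expects to find in the TCB's sp field.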

Context switch in assembly (PendSV handler):

PendSV_Handler:
    CPSID   I                   @ Disable interrupts

    MRS     R0, PSP             @ Get current task's stack pointer
    STMDB   R0!, {R4-R11}       @ Push R4-R11 onto task stack

    LDR     R1, =current_task   @ Get current TCB pointer
    LDR     R2, [R1]
    STR     R0, [R2]            @ Save SP to current TCB (sp must be the first TCB field)

    PUSH    {R3, LR}            @ Preserve EXC_RETURN: BL overwrites LR (R3 keeps MSP 8-byte aligned)
    BL      scheduler           @ Call C scheduler, returns next TCB in R0
    POP     {R3, LR}            @ Restore EXC_RETURN

    LDR     R1, =current_task
    STR     R0, [R1]            @ Update current_task pointer
    LDR     R0, [R0]            @ Load new task's SP from TCB

    LDMIA   R0!, {R4-R11}       @ Pop R4-R11 from new task stack
    MSR     PSP, R0             @ Set PSP to new task's stack

    CPSIE   I                   @ Re-enable interrupts
    BX      LR                  @ Return (hardware pops R0-R3, R12, LR, PC, xPSR)
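The handler above expects a C function named `scheduler` that returns the next TCB. A minimal round-robin sketch might look like the following; the reduced `tcb_sketch_t` struct, the `tasks` array, `task_count`, and the state constants are assumptions made for illustration, not part of the original design.

```c
#include <stdint.h>
#include <stddef.h>

enum { TASK_READY, TASK_RUNNING, TASK_BLOCKED };

/* Reduced TCB for illustration; sp stays first because the assembly
 * handler stores the stack pointer at offset 0 of the TCB. */
typedef struct {
    uint32_t *sp;
    uint8_t priority;
    uint8_t state;
} tcb_sketch_t;

#define MAX_TASKS 4
tcb_sketch_t tasks[MAX_TASKS];
size_t task_count;
tcb_sketch_t *current_task;

/* Round-robin: pick the next READY task after the current one. */
tcb_sketch_t *scheduler(void)
{
    size_t cur = (size_t)(current_task - tasks);

    if (current_task->state == TASK_RUNNING)
        current_task->state = TASK_READY;   /* preempted, still runnable */

    for (size_t i = 1; i <= task_count; i++) {
        tcb_sketch_t *t = &tasks[(cur + i) % task_count];
        if (t->state == TASK_READY) {
            t->state = TASK_RUNNING;
            return t;
        }
    }
    return current_task;  /* nothing runnable: a real kernel runs an idle task */
}
```

The current task is checked last, so it keeps running only when no other task is ready. Priority scheduling would replace the loop with a search for the highest-priority READY task.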

SysTick configuration for 10ms tick at 16MHz:

SysTick->LOAD = 16000000 / 100 - 1;  // 160000 cycles = 10ms
SysTick->VAL = 0;
SysTick->CTRL = 7;  // Enable, use processor clock, enable interrupt
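The reload arithmetic above generalizes to other clocks and tick rates, with one catch: the SysTick LOAD register is only 24 bits wide. A small sketch of the calculation, where `systick_reload` is a hypothetical helper name:

```c
#include <stdint.h>

#define SYSTICK_MAX_RELOAD 0x00FFFFFFu  /* LOAD is a 24-bit register */

/* Returns the LOAD value for the requested tick rate,
 * or 0 if the period cannot be represented. */
uint32_t systick_reload(uint32_t clock_hz, uint32_t tick_hz)
{
    if (tick_hz == 0)
        return 0;
    uint32_t cycles = clock_hz / tick_hz;   /* clock cycles per tick */
    if (cycles < 2 || cycles - 1 > SYSTICK_MAX_RELOAD)
        return 0;
    return cycles - 1;                      /* counter runs LOAD..0 inclusive */
}
```

For example, `systick_reload(16000000, 100)` gives 159999, matching the value written above, while a 100ms tick at 168MHz does not fit and needs a slower clock source or a software divider.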

Key questions:

Why use PendSV instead of SysTick directly for context switch?
What happens if a task overflows its stack?
How do you implement task blocking (e.g., for I/O)?

Learning milestones:

SysTick fires regularly → Timer interrupt works
You can switch between two tasks manually → Context save/restore works
Preemption works automatically → Scheduler and PendSV integrated
Three+ tasks run smoothly → Round-robin or priority scheduling works

Project 8: ARM Emulator

File: LEARN_ARM_DEEP_DIVE.md
Main Programming Language: C
Alternative Programming Languages: Rust, C++, Go
Coolness Level: Level 5: Pure Magic (Super Cool)
Business Potential: 4. The “Open Core” Infrastructure
Difficulty: Level 5: Master
Knowledge Area: CPU Architecture / Virtualization
Software or Tool: Building something like a mini-QEMU
Main Book: “Computer Organization and Design ARM Edition” by Patterson & Hennessy

What you’ll build: An ARM emulator that can execute real ARM binaries, emulating the CPU, memory, and basic I/O. It will run simple bare-metal programs and show the internal state of registers and memory.

Why it teaches ARM: Building an emulator forces complete understanding. Every instruction must be decoded and executed correctly. You’ll internalize the entire instruction set, the pipeline effects, condition codes, and edge cases. After this, ARM will have no secrets from you.

Core challenges you’ll face:

Instruction decoding for all formats → maps to complete ISA understanding
Emulating the barrel shifter → maps to operand processing
Condition code evaluation → maps to CPSR flags and conditional execution
Memory access emulation → maps to address translation concepts
Handling exceptions → maps to exception model and vector table

Key Concepts:

ARM Instruction Set: ARM Architecture Reference Manual (ARM ARM)
Emulator Design: “Computer Systems: A Programmer’s Perspective” Chapter 4 - Bryant & O’Hallaron
CPU Pipelines: “Computer Organization and Design ARM Edition” Chapter 4 - Patterson & Hennessy
Condition Codes: “The Art of ARM Assembly, Vol 1” Chapter 4 - Randall Hyde
Memory Systems: “Computer Architecture” Chapter 5 - Hennessy & Patterson

Difficulty: Master
Time estimate: 1 month+
Prerequisites: Project 1 (instruction decoder), Projects 2-3 (deep ARM understanding), strong C programming skills.

Real world outcome:

$ ./arm-emu -v program.bin

ARM Emulator v1.0
Loading binary: program.bin (256 bytes)
Memory: 64KB @ 0x00000000

[0x00000000] E3A00001  MOV R0, #1
    R0: 0x00000000 → 0x00000001

[0x00000004] E3A01002  MOV R1, #2
    R1: 0x00000000 → 0x00000002

[0x00000008] E0802001  ADD R2, R0, R1
    R2: 0x00000000 → 0x00000003

[0x0000000C] E3520005  CMP R2, #5
    CPSR: N=1 Z=0 C=0 V=0 (R2 < 5)

[0x00000010] AA000002  BGE 0x00000020
    Branch NOT taken (condition GE failed)

[0x00000014] E2822001  ADD R2, R2, #1
    R2: 0x00000003 → 0x00000004

... execution continues ...

=== Execution Complete ===
Cycles: 47
Instructions: 42
Final state:
  R0=0x00000005  R1=0x00000002  R2=0x00000007  R3=0x00000000
  R4=0x00000000  R5=0x00000000  R6=0x00000000  R7=0x00000000
  ...
  PC=0x00000038  CPSR=0x60000010 [nZCv, User mode]

Implementation Hints:

CPU state structure:

typedef struct {
    uint32_t r[16];        // R0-R15 (R13=SP, R14=LR, R15=PC)
    uint32_t cpsr;         // Current Program Status Register
    uint32_t spsr;         // Saved PSR (for exceptions)

    uint8_t *memory;       // Emulated memory
    size_t mem_size;

    uint64_t cycles;       // Cycle counter
    bool halted;           // CPU halted flag
} ARM_CPU;

Main execution loop:

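void run(ARM_CPU *cpu) {
    while (!cpu->halted) {
        // 1. Fetch the instruction at PC, then advance PC to the next one
        uint32_t insn = fetch(cpu);
        cpu->r[15] += 4;

        // 2. Check condition (bits 31-28); a failed condition turns the instruction into a no-op
        if (condition_passed(cpu, insn >> 28)) {
            // 3. Decode and execute (branches overwrite r[15];
            //    an instruction that reads PC must see r[15] + 4, i.e. its own address + 8)
            execute(cpu, insn);
        }

        cpu->cycles++;
    }
}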
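The condition check in step 2 is a pure function of the four flag bits and the 4-bit condition field. Here is one way to sketch it; `cond_passed` is a hypothetical standalone variant that takes the CPSR value directly instead of the CPU struct.

```c
#include <stdint.h>
#include <stdbool.h>

/* Evaluates an ARM condition field (instruction bits 31-28) against CPSR. */
bool cond_passed(uint32_t cpsr, uint32_t cond)
{
    bool n = (cpsr >> 31) & 1;
    bool z = (cpsr >> 30) & 1;
    bool c = (cpsr >> 29) & 1;
    bool v = (cpsr >> 28) & 1;

    switch (cond & 0xF) {
        case 0x0: return z;              /* EQ */
        case 0x1: return !z;             /* NE */
        case 0x2: return c;              /* CS/HS */
        case 0x3: return !c;             /* CC/LO */
        case 0x4: return n;              /* MI */
        case 0x5: return !n;             /* PL */
        case 0x6: return v;              /* VS */
        case 0x7: return !v;             /* VC */
        case 0x8: return c && !z;        /* HI */
        case 0x9: return !c || z;        /* LS */
        case 0xA: return n == v;         /* GE */
        case 0xB: return n != v;         /* LT */
        case 0xC: return !z && n == v;   /* GT */
        case 0xD: return z || n != v;    /* LE */
        case 0xE: return true;           /* AL */
        default:  return false;          /* 0xF: simplification, treated as never */
    }
}
```

With N=1 and V=0, as after the `CMP R2, #5` in the trace above, GE fails and LT passes. Later architecture versions reuse the 0xF encoding for unconditional instructions, which this sketch does not model.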

Instruction dispatch (by bits 27-25):

void execute(ARM_CPU *cpu, uint32_t insn) {
    uint32_t type = (insn >> 25) & 0x7;

    switch (type) {
        case 0: // Data processing (register) or Multiply
            if ((insn & 0x0FC000F0) == 0x00000090)
                exec_multiply(cpu, insn);
            else
                exec_data_proc_reg(cpu, insn);
            break;
        case 1: // Data processing (immediate)
            exec_data_proc_imm(cpu, insn);
            break;
        case 2: // Load/Store (immediate offset)
            exec_load_store_imm(cpu, insn);
            break;
        case 5: // Branch
            exec_branch(cpu, insn);
            break;
        // ... more cases ...
    }
}
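The data-processing handlers dispatched above also have to produce the N, Z, C, and V flags when the S bit is set (and always for CMP). A sketch for subtraction follows; `sub_flags` and the `FLAG_*` macros are hypothetical names. Note that on ARM the carry flag after a subtraction means "no borrow".

```c
#include <stdint.h>

#define FLAG_N (1u << 31)
#define FLAG_Z (1u << 30)
#define FLAG_C (1u << 29)
#define FLAG_V (1u << 28)

/* Computes a - b and returns the resulting NZCV bits in CPSR positions. */
uint32_t sub_flags(uint32_t a, uint32_t b, uint32_t *result)
{
    uint32_t r = a - b;
    uint32_t flags = 0;

    if (r & 0x80000000u) flags |= FLAG_N;   /* negative result */
    if (r == 0)          flags |= FLAG_Z;   /* zero result */
    if (a >= b)          flags |= FLAG_C;   /* C = NOT borrow */
    /* Signed overflow: operands had different signs and the
     * result's sign differs from the first operand's. */
    if (((a ^ b) & (a ^ r)) & 0x80000000u) flags |= FLAG_V;

    *result = r;
    return flags;
}
```

`sub_flags(3, 5, &r)` yields only N set, which is the `N=1 Z=0 C=0 V=0` state shown for `CMP R2, #5` in the sample trace.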

Barrel shifter for Operand2:

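uint32_t decode_operand2(ARM_CPU *cpu, uint32_t insn, bool is_immediate, bool *carry_out) {
    uint32_t c_in = (cpu->cpsr >> 29) & 1;
    *carry_out = c_in;  // Default: shifter carry-out is the current C flag

    if (is_immediate) {
        // 8-bit immediate rotated right by 2 * rotate
        uint32_t imm = insn & 0xFF;
        uint32_t rotate = ((insn >> 8) & 0xF) * 2;
        if (rotate == 0) return imm;  // Shifting by 32 would be undefined in C
        uint32_t result = (imm >> rotate) | (imm << (32 - rotate));
        *carry_out = result >> 31;
        return result;
    }

    // Register shifted by a 5-bit immediate (register-specified shift amounts are not handled here)
    uint32_t rm = cpu->r[insn & 0xF];
    uint32_t shift_type = (insn >> 5) & 0x3;
    uint32_t n = (insn >> 7) & 0x1F;

    switch (shift_type) {
        case 0:  // LSL (LSL #0 passes Rm through unchanged)
            if (n == 0) return rm;
            *carry_out = (rm >> (32 - n)) & 1;
            return rm << n;
        case 1:  // LSR (encoded amount 0 means LSR #32)
            *carry_out = (rm >> ((n ? n : 32) - 1)) & 1;
            return n ? rm >> n : 0;
        case 2:  // ASR (encoded amount 0 means ASR #32)
            *carry_out = (rm >> ((n ? n : 32) - 1)) & 1;
            return n ? (uint32_t)((int32_t)rm >> n) : (rm >> 31 ? 0xFFFFFFFFu : 0);
        default: // ROR (encoded amount 0 means RRX: rotate right by one through carry)
            *carry_out = (rm >> (n ? n - 1 : 0)) & 1;
            return n ? (rm >> n) | (rm << (32 - n)) : (c_in << 31) | (rm >> 1);
    }
}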

Key questions:

What value does an instruction see when it reads PC? (Hint: its own address + 8 in ARM mode)
How do you handle LDR PC, [Rx] (loading into PC = branch)?
What happens when an instruction modifies the PC?
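The PC+8 rule shows up directly in branch decoding: the 24-bit offset is relative to the branch instruction's address plus 8. A small sketch, with `branch_target` as a hypothetical helper name:

```c
#include <stdint.h>

/* Computes the destination of an ARM-mode B/BL instruction.
 * insn_addr is the address of the branch instruction itself. */
uint32_t branch_target(uint32_t insn_addr, uint32_t insn)
{
    uint32_t imm24 = insn & 0x00FFFFFFu;
    uint32_t offset = imm24 << 2;            /* word offset -> byte offset */

    if (imm24 & 0x00800000u)
        offset |= 0xFC000000u;               /* sign-extend the 26-bit value */

    return insn_addr + 8u + offset;          /* PC reads as address + 8 */
}
```

For the `BGE` at 0x00000010 in the trace above (encoding `AA000002`), this gives 0x00000020. The classic infinite loop `EAFFFFFE` branches to its own address.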

Learning milestones:

Simple programs run (MOV, ADD) → Basic decode/execute works
Branches and loops work → Control flow is correct
Fibonacci computes correctly → Arithmetic and memory work
You can run real test binaries → Emulator is robust enough for real code

Project 9: Exception Handler & Fault Analyzer

File: LEARN_ARM_DEEP_DIVE.md
Main Programming Language: C with ARM Assembly
Alternative Programming Languages: Rust
Coolness Level: Level 4: Hardcore Tech Flex
Business Potential: 3. The “Service & Support” Model
Difficulty: Level 4: Expert
Knowledge Area: Debugging / Exception Handling
Software or Tool: STM32 board
Main Book: “The Definitive Guide to ARM Cortex-M3 and Cortex-M4” by Joseph Yiu

What you’ll build: A comprehensive fault handler that catches the Cortex-M fault exceptions (HardFault, MemManage, BusFault, UsageFault), decodes the fault cause, prints a detailed stack trace, and optionally recovers or reboots.

Why it teaches ARM: Faults are inevitable in embedded development. Understanding how ARM reports errors—through fault status registers, stacked PC, and fault addresses—is essential for debugging. This project makes you the person who can diagnose any crash.

Core challenges you’ll face:

Understanding stacked exception frames → maps to what’s on stack at fault time
Decoding fault status registers → maps to CFSR, HFSR, MMFAR, BFAR
Determining faulting instruction → maps to stacked PC and instruction analysis
Stack unwinding for call trace → maps to frame pointer following
Safe recovery strategies → maps to exception return and system reset

Key Concepts:

ARM Exception Model: “The Definitive Guide to ARM Cortex-M3” Chapter 8 - Joseph Yiu
Fault Status Registers: “The Definitive Guide to ARM Cortex-M3” Chapter 12 - Joseph Yiu
Stack Unwinding: “Computer Systems: A Programmer’s Perspective” Chapter 3 - Bryant & O’Hallaron
Debug Features: ARM Cortex-M Technical Reference Manual

Difficulty: Expert
Time estimate: 1-2 weeks
Prerequisites: Projects 3-4 (bare-metal), strong understanding of stack and calling conventions.

Real world outcome:

!!! HARD FAULT DETECTED !!!

Fault Type: BusFault (Precise)
Fault Address: 0xE0100000 (invalid peripheral access)
Faulting Instruction: 0x08001234 (LDR R0, [R1])

Stacked Registers:
  R0  = 0x00000042    R1  = 0xE0100000
  R2  = 0x00000000    R3  = 0x20001234
  R12 = 0x08004567    LR  = 0x0800089B
  PC  = 0x08001234    xPSR= 0x61000000

Fault Status:
  CFSR  = 0x00008200  [PRECISERR, BFARVALID]
  HFSR  = 0x40000000  [FORCED]
  BFAR  = 0xE0100000  (Valid)

Stack Trace:
  #0 0x08001234 in read_sensor() at sensors.c:47
  #1 0x0800089A in main() at main.c:123
  #2 0x080000A2 in Reset_Handler() at startup.s:45

Call Stack Memory:
  0x20003FF0: 0x08001234  <- Fault PC
  0x20003FEC: 0x0800089A  <- Called from
  0x20003FE8: 0x080000A2  <- Called from

Action: System reset in 5 seconds...

Implementation Hints:

Exception frame structure (pushed automatically):

typedef struct {
    uint32_t r0;
    uint32_t r1;
    uint32_t r2;
    uint32_t r3;
    uint32_t r12;
    uint32_t lr;
    uint32_t pc;      // Faulting instruction address
    uint32_t xpsr;
} exception_frame_t;

Fault status registers:

#define SCB_CFSR    (*(volatile uint32_t *)0xE000ED28)  // Configurable Fault Status
#define SCB_HFSR    (*(volatile uint32_t *)0xE000ED2C)  // HardFault Status
#define SCB_MMFAR   (*(volatile uint32_t *)0xE000ED34)  // MemManage Fault Address
#define SCB_BFAR    (*(volatile uint32_t *)0xE000ED38)  // BusFault Address

// CFSR bits
#define CFSR_IACCVIOL   (1 << 0)   // Instruction access violation
#define CFSR_DACCVIOL   (1 << 1)   // Data access violation
#define CFSR_MUNSTKERR  (1 << 3)   // MemManage fault on unstacking
#define CFSR_MSTKERR    (1 << 4)   // MemManage fault on stacking
#define CFSR_IBUSERR    (1 << 8)   // Bus fault on instruction fetch
#define CFSR_PRECISERR  (1 << 9)   // Precise data bus error
#define CFSR_IMPRECISERR (1 << 10) // Imprecise data bus error
// ... more bits ...
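Rather than testing each bit by hand, a table-driven decoder can turn a CFSR value into a readable list. The sketch below is one possible shape; `decode_cfsr` and `cfsr_bit_t` are hypothetical names, and the bit positions follow the ARMv7-M CFSR layout (MemManage bits 0-7, BusFault bits 8-15, UsageFault bits 16-31).

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

typedef struct { uint32_t mask; const char *name; } cfsr_bit_t;

static const cfsr_bit_t cfsr_bits[] = {
    { 1u << 0,  "IACCVIOL"   }, { 1u << 1,  "DACCVIOL"    },
    { 1u << 3,  "MUNSTKERR"  }, { 1u << 4,  "MSTKERR"     },
    { 1u << 7,  "MMARVALID"  }, { 1u << 8,  "IBUSERR"     },
    { 1u << 9,  "PRECISERR"  }, { 1u << 10, "IMPRECISERR" },
    { 1u << 11, "UNSTKERR"   }, { 1u << 12, "STKERR"      },
    { 1u << 15, "BFARVALID"  }, { 1u << 16, "UNDEFINSTR"  },
    { 1u << 17, "INVSTATE"   }, { 1u << 18, "INVPC"       },
    { 1u << 19, "NOCP"       }, { 1u << 24, "UNALIGNED"   },
    { 1u << 25, "DIVBYZERO"  },
};

/* Writes the names of all set bits, space-separated, into out.
 * Returns the number of characters written (excluding the terminator). */
size_t decode_cfsr(uint32_t cfsr, char *out, size_t out_size)
{
    size_t used = 0;
    if (out_size == 0)
        return 0;
    out[0] = '\0';

    for (size_t i = 0; i < sizeof cfsr_bits / sizeof cfsr_bits[0]; i++) {
        if (!(cfsr & cfsr_bits[i].mask))
            continue;
        int n = snprintf(out + used, out_size - used, "%s%s",
                         used ? " " : "", cfsr_bits[i].name);
        if (n < 0 || (size_t)n >= out_size - used)
            break;                      /* buffer full: stop early */
        used += (size_t)n;
    }
    return used;
}
```

A precise bus fault with a valid BFAR reads back as 0x00008200 and decodes to `PRECISERR BFARVALID`.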

HardFault handler:

void HardFault_Handler_C(exception_frame_t *frame) {
    printf("\n!!! HARD FAULT !!!\n");
    printf("PC = 0x%08X (faulting instruction)\n", frame->pc);
    printf("LR = 0x%08X (return address)\n", frame->lr);

    // Decode CFSR
    uint32_t cfsr = SCB_CFSR;
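    if (cfsr & CFSR_PRECISERR) {
        if (cfsr & (1u << 15))  // BFARVALID: BFAR holds the faulting address
            printf("BusFault: Precise error at 0x%08X\n", SCB_BFAR);
        else
            printf("BusFault: Precise error (BFAR not valid)\n");
    }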
    // ... decode more bits ...

    // Print stack trace
    unwind_stack(frame);

    // Reset
    NVIC_SystemReset();
}

// Assembly wrapper to get frame pointer
__attribute__((naked)) void HardFault_Handler(void) {
    __asm volatile (
        "TST LR, #4      \n"  // Test bit 2 of LR
        "ITE EQ          \n"
        "MRSEQ R0, MSP   \n"  // If 0, was using MSP
        "MRSNE R0, PSP   \n"  // If 1, was using PSP
        "B HardFault_Handler_C\n"
    );
}

Key questions:

Why is LR sometimes 0xFFFFFFFx in exception handlers?
How do you get a call trace without debug symbols?
What’s the difference between precise and imprecise bus faults?
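The first question has a concrete answer worth encoding: on exception entry the hardware loads LR with an EXC_RETURN value, and its low bits say where the frame was stacked. Below is a sketch of a decoder for the ARMv7-M encodings; `exc_return_t` and `decode_exc_return` are hypothetical names, and ARMv8-M adds further bits that are not modeled here.

```c
#include <stdint.h>
#include <stdbool.h>

typedef struct {
    bool valid;        /* value has the EXC_RETURN bit pattern */
    bool thread_mode;  /* returns to Thread mode (else Handler mode) */
    bool uses_psp;     /* frame was stacked on PSP (else MSP) */
    bool fpu_frame;    /* extended frame including FP registers */
} exc_return_t;

exc_return_t decode_exc_return(uint32_t lr)
{
    exc_return_t r = { false, false, false, false };

    if ((lr & 0xFFFFFFE0u) != 0xFFFFFFE0u)
        return r;                       /* an ordinary return address */

    r.valid       = true;
    r.thread_mode = (lr >> 3) & 1;      /* bit 3: 1 = Thread mode */
    r.uses_psp    = (lr >> 2) & 1;      /* bit 2: 1 = PSP, 0 = MSP */
    r.fpu_frame   = !((lr >> 4) & 1);   /* bit 4: 0 = extended FP frame */
    return r;
}
```

This is the same bit 2 test that the assembly wrapper performs with `TST LR, #4`. For example, 0xFFFFFFFD means "return to Thread mode using PSP", the usual value when a task under an RTOS faults.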

Learning milestones:

HardFault handler prints PC → Basic exception handling works
You decode the fault type → Status register parsing works
Stack trace shows call chain → Unwinding works
System recovers gracefully → Exception return understood

Project 10: I2C Driver for OLED Display

File: LEARN_ARM_DEEP_DIVE.md
Main Programming Language: C
Alternative Programming Languages: Rust, MicroPython (for comparison)
Coolness Level: Level 4: Hardcore Tech Flex
Business Potential: 2. The “Micro-SaaS / Pro Tool”
Difficulty: Level 3: Advanced
Knowledge Area: Serial Protocols / Peripheral Programming
Software or Tool: STM32 + SSD1306 OLED display
Main Book: “Making Embedded Systems, 2nd Edition” by Elecia White

What you’ll build: A complete I2C driver from scratch that communicates with an SSD1306 OLED display, drawing pixels, text, and simple graphics without using any libraries.

Why it teaches ARM: I2C is one of the most common peripheral protocols on ARM microcontrollers. Building it from scratch teaches you GPIO alternate functions, timing constraints, clock stretching, ACK/NACK handling, and the general pattern for all peripheral drivers. Plus, you get visible output!

Core challenges you’ll face:

Configuring GPIO for I2C → maps to alternate functions and open-drain
Generating proper timing → maps to I2C clock configuration
Handling ACK/NACK → maps to protocol state machine
Sending commands vs data → maps to SSD1306 protocol specifics
Frame buffer management → maps to memory organization for display

Key Concepts:

I2C Protocol: “Making Embedded Systems” Chapter 9 - Elecia White
SSD1306 Controller: SSD1306 Datasheet (Solomon Systech)
GPIO Alternate Functions: STM32 Reference Manual GPIO chapter
Graphics Primitives: “Computer Graphics from Scratch” Chapter 1 - Gabriel Gambetta

Difficulty: Advanced
Time estimate: 1-2 weeks
Prerequisites: Project 3 (bare-metal basics), Project 4 (peripheral experience), understanding of bit manipulation.

Real world outcome:

Physical result: 128x64 OLED display shows:

┌──────────────────────────────────────┐
│  ARM I2C Demo                        │
│                                      │
│  ╭──────────────────────────╮        │
│  │    CPU: 72MHz            │        │
│  │    Temp: 32°C            │        │
│  │    Time: 14:23:45        │        │
│  ╰──────────────────────────╯        │
│                      ___             │
│     ★              /ARM \            │
│                    \_____/           │
└──────────────────────────────────────┘

Serial output:
I2C initialized at 400kHz
SSD1306 found at address 0x3C
Display initialized (128x64)
Drawing test pattern...
Framebuffer: 1024 bytes (128x64/8)

Implementation Hints:

I2C transaction structure:

START → Address+W → ACK → Data → ACK → ... → STOP
        [7 bits][R/W]     [8 bits]

I2C register configuration (STM32):

// 1. Enable clocks for I2C and GPIO
RCC->APB1ENR |= RCC_APB1ENR_I2C1EN;
RCC->AHB1ENR |= RCC_AHB1ENR_GPIOBEN;

// 2. Configure GPIO for I2C (open-drain, alternate function)
GPIOB->MODER |= (2 << (6*2)) | (2 << (7*2));  // AF mode
GPIOB->OTYPER |= (1 << 6) | (1 << 7);         // Open drain
GPIOB->AFR[0] |= (4 << (6*4)) | (4 << (7*4)); // AF4 = I2C

// 3. Configure I2C timing for 400kHz
I2C1->CR1 = 0;                    // Disable I2C
I2C1->CR2 = 36;                   // APB1 clock = 36MHz
I2C1->CCR = I2C_CCR_FS | 30;      // Fast mode: 36MHz / (3 * 400kHz) = 30
I2C1->TRISE = 11;                 // Rise time
I2C1->CR1 = I2C_CR1_PE;           // Enable I2C
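The CCR and TRISE values are easy to get wrong, so it helps to compute them rather than hard-code them. The sketch below assumes the STM32F1/F4-style I2C peripheral (the one with CCR and TRISE registers) and fast mode with DUTY = 0; the helper names are hypothetical.

```c
#include <stdint.h>

/* Standard mode (up to 100kHz): SCL high = low = CCR * Tpclk. */
uint32_t i2c_ccr_standard(uint32_t pclk_hz, uint32_t scl_hz)
{
    return pclk_hz / (2u * scl_hz);
}

/* Fast mode, DUTY = 0: SCL high = CCR * Tpclk, low = 2 * CCR * Tpclk.
 * The F/S bit must also be set in the CCR register. */
uint32_t i2c_ccr_fast(uint32_t pclk_hz, uint32_t scl_hz)
{
    return pclk_hz / (3u * scl_hz);
}

/* TRISE = (maximum SCL rise time / Tpclk) + 1.
 * The I2C specification allows 1000ns in standard mode, 300ns in fast mode. */
uint32_t i2c_trise(uint32_t pclk_hz, uint32_t max_rise_ns)
{
    return (uint32_t)(((uint64_t)pclk_hz * max_rise_ns) / 1000000000u) + 1u;
}
```

At a 36MHz APB1 clock this gives CCR = 30 and TRISE = 11 for 400kHz, and CCR = 180 and TRISE = 37 for 100kHz.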

I2C write sequence:

void i2c_write(uint8_t addr, uint8_t *data, uint8_t len) {
    // 1. Generate START
    I2C1->CR1 |= I2C_CR1_START;
    while (!(I2C1->SR1 & I2C_SR1_SB));

    // 2. Send address with write bit
    I2C1->DR = addr << 1;
    while (!(I2C1->SR1 & I2C_SR1_ADDR));
    (void)I2C1->SR2;  // Clear ADDR flag

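    // 3. Send data bytes
    for (int i = 0; i < len; i++) {
        I2C1->DR = data[i];
        while (!(I2C1->SR1 & I2C_SR1_TXE));
    }

    // 4. Wait until the last byte has left the shift register, then generate STOP
    //    (assumes len > 0; a robust driver also checks for NACK and adds timeouts)
    while (!(I2C1->SR1 & I2C_SR1_BTF));
    I2C1->CR1 |= I2C_CR1_STOP;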
}

SSD1306 initialization sequence:

uint8_t init_cmds[] = {
    0xAE,       // Display off
    0xD5, 0x80, // Clock divide
    0xA8, 0x3F, // Multiplex ratio (64-1)
    0xD3, 0x00, // Display offset
    0x40,       // Start line
    0x8D, 0x14, // Charge pump on
    0x20, 0x00, // Horizontal addressing mode
    0xA1,       // Segment remap
    0xC8,       // COM scan direction
    0xDA, 0x12, // COM pins config
    0x81, 0xCF, // Contrast
    0xD9, 0xF1, // Pre-charge
    0xDB, 0x40, // VCOMH deselect
    0xA4,       // Display from RAM
    0xA6,       // Normal display
    0xAF        // Display on
};
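With horizontal addressing mode selected above, the 1024-byte framebuffer maps onto the panel as 8 pages of 128 columns, and each byte holds 8 vertically stacked pixels with the least significant bit at the top. A sketch of the pixel routine under that layout; `framebuffer` and `set_pixel` are hypothetical names.

```c
#include <stdint.h>
#include <stdbool.h>

#define SSD1306_WIDTH  128
#define SSD1306_HEIGHT 64

/* One bit per pixel: 128 * 64 / 8 = 1024 bytes. */
uint8_t framebuffer[SSD1306_WIDTH * SSD1306_HEIGHT / 8];

/* Sets or clears one pixel; out-of-range coordinates are ignored. */
void set_pixel(int x, int y, bool on)
{
    if (x < 0 || x >= SSD1306_WIDTH || y < 0 || y >= SSD1306_HEIGHT)
        return;

    /* Page (row of bytes) = y / 8, bit within the byte = y % 8. */
    uint8_t *cell = &framebuffer[(y / 8) * SSD1306_WIDTH + x];
    uint8_t bit = (uint8_t)(1u << (y % 8));

    if (on)
        *cell |= bit;
    else
        *cell &= (uint8_t)~bit;
}
```

After drawing, the whole buffer can be sent in one I2C transfer prefixed with the data control byte (0x40); commands such as the init sequence use the command control byte (0x00) instead.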

Key questions:

What happens if you forget open-drain configuration?
How do you handle clock stretching by the slave?
Why does SSD1306 use 8 vertical pixels per byte?

Learning milestones:

I2C peripheral responds → Address ACK received
Display turns on → Init sequence works
Single pixel lights up → Framebuffer to display works
Text and graphics display → Higher-level drawing works

Project 11: SPI Driver for SD Card

File: LEARN_ARM_DEEP_DIVE.md
Main Programming Language: C
Alternative Programming Languages: Rust
Coolness Level: Level 4: Hardcore Tech Flex
Business Potential: 3. The “Service & Support” Model
Difficulty: Level 3: Advanced
Knowledge Area: Serial Protocols / File Systems
Software or Tool: STM32 + MicroSD card module
Main Book: “Making Embedded Systems, 2nd Edition” by Elecia White

What you’ll build: An SPI driver that initializes an SD card, reads/writes sectors, and optionally implements FAT16/FAT32 to read actual files.

Why it teaches ARM: SPI is the other workhorse serial protocol on ARM microcontrollers (along with I2C). SD cards require precise timing, CRC checks, and a complex initialization sequence. This teaches you DMA for high-speed transfers, protocol state machines, and persistent storage.

Core challenges you’ll face:

SD card SPI initialization dance → maps to protocol timing and command sequences
Command/response handling → maps to state machine design
CRC calculation → maps to data integrity
Block read/write → maps to 512-byte sector handling
FAT filesystem parsing → maps to data structure interpretation

Key Concepts:

SPI Protocol: “Making Embedded Systems” Chapter 9 - Elecia White
SD Card SPI Mode: SD Physical Layer Specification (SDA)
FAT Filesystem: “FAT32 File System Specification” - Microsoft
DMA Transfers: STM32 Reference Manual DMA chapter

Difficulty: Advanced
Time estimate: 2-3 weeks
Prerequisites: Project 3 (bare-metal), Project 4 (peripheral experience), bit manipulation skills.

Real world outcome:

=== SD Card Driver Test ===

Initializing SPI at 400kHz...
Sending CMD0 (GO_IDLE)... OK (R1=0x01)
Sending CMD8 (SEND_IF_COND)... OK, SDHC card detected
Sending ACMD41 (SD_SEND_OP_COND)...
  Attempt 1: busy
  Attempt 2: busy
  Attempt 3: ready!
Switching to 25MHz SPI mode

Card info:
  Type: SDHC
  Capacity: 16 GB (31116288 sectors)
  Manufacturer: SanDisk

Reading sector 0 (MBR)...
  Boot signature: 0x55AA ✓
  Partition 1: FAT32, starts at sector 2048

Mounting FAT32...
  Sectors per cluster: 64
  Reserved sectors: 32
  Root directory: cluster 2

Directory listing of /:
  HELLO.TXT       42 bytes
  FIRMWARE.BIN    65536 bytes
  DATA/           <DIR>

Reading HELLO.TXT:
"Hello from SD card!"

Write test:
  Writing "Test 12345" to TEST.TXT... OK
  Reading back... "Test 12345" ✓

Implementation Hints:

SPI configuration:

// Configure SPI1 for SD card
SPI1->CR1 = 0;
SPI1->CR1 |= SPI_CR1_MSTR;        // Master mode
SPI1->CR1 |= SPI_CR1_BR;          // Clock = PCLK/256 (slow for init, e.g. 72MHz/256 ≈ 281kHz)
SPI1->CR1 |= SPI_CR1_SSM | SPI_CR1_SSI; // Software CS
SPI1->CR1 |= SPI_CR1_SPE;         // Enable SPI

SD command format (6 bytes):

┌──────────┬───────────┬──────────┐
│ 01xxxxxx │ argument  │ CRC7 + 1 │
│ (cmd)    │ (4 bytes) │ (1 byte) │
└──────────┴───────────┴──────────┘
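The frame above can be assembled with a few lines of code. In SPI mode the card only checks the CRC for CMD0 and CMD8 by default, but computing it for every command is cheap. This sketch uses the SD CRC7 polynomial x^7 + x^3 + 1; `sd_crc7` and `sd_build_command` are hypothetical names.

```c
#include <stdint.h>
#include <stddef.h>

/* Bitwise CRC7 (polynomial x^7 + x^3 + 1), MSB first. */
uint8_t sd_crc7(const uint8_t *data, size_t len)
{
    uint8_t crc = 0;
    for (size_t i = 0; i < len; i++) {
        uint8_t d = data[i];
        for (int bit = 0; bit < 8; bit++) {
            crc <<= 1;
            if ((d ^ crc) & 0x80)
                crc ^= 0x09;
            d <<= 1;
        }
    }
    return crc & 0x7F;
}

/* Fills a 6-byte command frame: start bits + index, argument, CRC7 + stop bit. */
void sd_build_command(uint8_t cmd, uint32_t arg, uint8_t frame[6])
{
    frame[0] = (uint8_t)(0x40u | (cmd & 0x3Fu));  /* 01xxxxxx */
    frame[1] = (uint8_t)(arg >> 24);              /* argument, big-endian */
    frame[2] = (uint8_t)(arg >> 16);
    frame[3] = (uint8_t)(arg >> 8);
    frame[4] = (uint8_t)arg;
    frame[5] = (uint8_t)((sd_crc7(frame, 5) << 1) | 1u);  /* CRC7 + stop bit */
}
```

CMD0 with a zero argument produces the familiar final byte 0x95, and CMD8 with argument 0x000001AA produces 0x87.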

SD initialization sequence:

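bool sd_init(void) {
    // 1. Send 80+ clock pulses with CS high
    cs_high();
    for (int i = 0; i < 10; i++) spi_transfer(0xFF);

    // 2. CMD0: GO_IDLE_STATE
    cs_low();
    if (sd_command(0, 0x00000000) != 0x01) return false;  // Expect R1 = 0x01 (idle)

    // 3. CMD8: SEND_IF_COND (check if SDv2)
    sd_command(8, 0x000001AA);  // 2.7-3.6V, check pattern
    // Assuming sd_command() returns only the R1 byte, drain the other 4 bytes of the R7 response
    for (int i = 0; i < 4; i++) spi_transfer(0xFF);

    // 4. ACMD41: SD_SEND_OP_COND (wait for ready, but give up eventually)
    int attempts = 1000;  // Arbitrary retry limit
    while (1) {
        sd_command(55, 0);  // APP_CMD prefix
        uint8_t r = sd_command(41, 0x40000000);  // HCS bit
        if (r == 0) break;  // Ready!
        if (--attempts == 0) return false;  // Card never left the idle state
    }

    // 5. CMD58: Read OCR (check CCS bit for SDHC)

    // 6. Switch to high-speed SPI
    spi_set_speed(25000000);

    return true;
}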

Reading a sector:

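bool sd_read_sector(uint32_t sector, uint8_t *buffer) {
    // SDHC/SDXC cards take a block number; SDSC cards need a byte address (sector * 512)
    if (sd_command(17, sector) != 0x00) return false;  // READ_SINGLE_BLOCK

    // Wait for data token (0xFE), with a retry limit so a dead card cannot hang the system
    int tries = 100000;  // Arbitrary limit
    while (spi_transfer(0xFF) != 0xFE) {
        if (--tries == 0) return false;
    }

    // Read 512 bytes
    for (int i = 0; i < 512; i++) {
        buffer[i] = spi_transfer(0xFF);
    }

    // Read (and ignore) CRC
    spi_transfer(0xFF);
    spi_transfer(0xFF);

    return true;
}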
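Once sector 0 is in a buffer, finding the FAT partition means reading the MBR partition table: four 16-byte entries starting at offset 446, followed by the bytes 0x55 0xAA at offsets 510 and 511. A sketch of that step; `mbr_partition_t` and `mbr_read_partition` are hypothetical names.

```c
#include <stdint.h>
#include <stdbool.h>

typedef struct {
    uint8_t  type;          /* 0x0B or 0x0C = FAT32, 0x00 = unused entry */
    uint32_t first_lba;     /* first sector of the partition */
    uint32_t sector_count;  /* partition length in sectors */
} mbr_partition_t;

/* Assembles a little-endian 32-bit value byte by byte (alignment-safe). */
static uint32_t read_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Parses partition entry 0..3 from a 512-byte MBR sector.
 * Returns false for a bad signature, bad index, or unused entry. */
bool mbr_read_partition(const uint8_t *sector, int index, mbr_partition_t *out)
{
    if (index < 0 || index > 3)
        return false;
    if (sector[510] != 0x55 || sector[511] != 0xAA)
        return false;

    const uint8_t *entry = sector + 446 + 16 * index;
    out->type         = entry[4];
    out->first_lba    = read_le32(entry + 8);
    out->sector_count = read_le32(entry + 12);
    return out->type != 0x00;
}
```

The `first_lba` value is what gets added to every sector number inside the partition before calling `sd_read_sector`.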

Key questions:

Why must you start at 400kHz and only go faster after init?
What’s the difference between SD (SDSC) and SDHC addressing?
How do you handle multi-block transfers for speed?

Learning milestones:

Card responds to CMD0 → SPI and timing work
Initialization completes → Protocol state machine works
You read the MBR → Sector reads work
You read a file → FAT parsing works

Project 12: Timer-Based PWM Motor Controller

File: LEARN_ARM_DEEP_DIVE.md
Main Programming Language: C
Alternative Programming Languages: Rust, MicroPython (for comparison)
Coolness Level: Level 4: Hardcore Tech Flex
Business Potential: 3. The “Service & Support” Model
Difficulty: Level 2: Intermediate
Knowledge Area: Timers / PWM / Motor Control
Software or Tool: STM32 + DC motor or servo
Main Book: “Making Embedded Systems, 2nd Edition” by Elecia White

What you’ll build: A PWM generator using ARM timer peripherals to control motor speed or servo position, with smooth acceleration and position feedback via encoder.

Why it teaches ARM: Timers are among the most complex peripherals on ARM microcontrollers. PWM generation teaches you timer configuration, capture/compare units, dead-time insertion, and DMA-triggered updates. This is essential for robotics and power electronics.

Core challenges you’ll face:

Timer clock configuration → maps to prescaler and period calculations
PWM mode setup → maps to output compare modes
Smooth speed ramping → maps to software control loops
Encoder reading → maps to timer encoder mode
Dead-time for H-bridge → maps to complementary outputs

Key Concepts:

PWM Fundamentals: “Making Embedded Systems” Chapter 6 - Elecia White
ARM Timer Architecture: STM32 Reference Manual TIM chapter
PID Control: “Making Embedded Systems” Chapter 11 - Elecia White
Encoder Interface: STM32 Timer Application Notes

Difficulty: Intermediate
Time estimate: 1 week
Prerequisites: Project 3 (bare-metal), basic understanding of DC motors.

Real world outcome:

=== PWM Motor Controller ===

Timer: TIM1, 72MHz base, PWM @ 20kHz
Motor: DC brushed, H-bridge driver

Commands:
  S<n>  - Set speed (-100 to 100)
  A<n>  - Set acceleration (1-100)
  P     - Print position

> S50
Ramping to 50%...
  10% -> 20% -> 30% -> 40% -> 50%
Current: 50%, Duty: 1800/3600

> S-30
Reversing direction...
  50% -> 40% -> 30% -> 20% -> 10% -> 0%
  0% -> -10% -> -20% -> -30%
Current: -30%, Duty: 1080/3600 (reversed)

> P
Encoder count: 4523
Estimated RPM: 1200
Position: 12.6 revolutions

Physical result: Motor smoothly accelerates, holds speed, reverses

Implementation Hints:

Timer PWM configuration:

// TIM1 for PWM, 72MHz clock, 20kHz PWM
RCC->APB2ENR |= RCC_APB2ENR_TIM1EN;

TIM1->PSC = 0;               // No prescaler
TIM1->ARR = 3600 - 1;        // 72MHz / 3600 = 20kHz
TIM1->CCR1 = 1800;           // 50% duty cycle

// Configure channel 1 as PWM mode 1
TIM1->CCMR1 = (6 << 4);      // OC1M = PWM mode 1
TIM1->CCER = TIM_CCER_CC1E;  // Enable output

// Enable main output (required for TIM1)
TIM1->BDTR |= TIM_BDTR_MOE;

TIM1->CR1 |= TIM_CR1_CEN;    // Start timer
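The ARR and CCR values follow directly from the timer clock, the PWM frequency, and the requested duty cycle. A sketch of the arithmetic, assuming no prescaler; `pwm_arr` and `pwm_ccr` are hypothetical helper names.

```c
#include <stdint.h>

/* Auto-reload value for a given PWM frequency (prescaler = 0).
 * Returns 0 if the request is invalid. */
uint32_t pwm_arr(uint32_t timer_hz, uint32_t pwm_hz)
{
    if (pwm_hz == 0 || pwm_hz > timer_hz)
        return 0;
    return timer_hz / pwm_hz - 1u;   /* counter runs 0..ARR inclusive */
}

/* Compare value for a duty cycle in percent (clamped to 100).
 * In PWM mode 1 the output is active while CNT < CCR. */
uint32_t pwm_ccr(uint32_t arr, uint32_t duty_percent)
{
    if (duty_percent > 100u)
        duty_percent = 100u;
    return (uint32_t)(((uint64_t)(arr + 1u) * duty_percent) / 100u);
}
```

For a 72MHz timer and 20kHz PWM this yields ARR = 3599, with CCR = 1800 for 50% and 1080 for 30%. TIM1 is a 16-bit timer, so an ARR above 65535 means a prescaler is needed.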

Smooth ramping:

void set_speed_smooth(int target) {
    while (current_speed != target) {
        if (current_speed < target) current_speed++;
        else current_speed--;

        update_pwm(current_speed);
        delay_ms(10);  // Ramp rate
    }
}

Encoder mode (reading motor position):

// TIM2 in encoder mode
TIM2->SMCR = (3 << 0);       // Encoder mode 3 (count both edges)
TIM2->CCMR1 = (1 << 0) | (1 << 8);  // CC1/CC2 as inputs
TIM2->CCER = 0;              // Non-inverted
TIM2->CNT = 0;               // Reset count
TIM2->CR1 |= TIM_CR1_CEN;    // Start

// Read position anytime:
int32_t position = (int16_t)TIM2->CNT;  // Signed for direction
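A 16-bit counter wraps after 65536 counts, so long runs need a wider software position. One common approach is to accumulate signed differences between successive readings, which stays correct as long as the counter is sampled before it moves more than 32767 counts. A sketch, with `encoder_t` and `encoder_update` as hypothetical names:

```c
#include <stdint.h>

typedef struct {
    uint16_t last;    /* previous raw counter reading */
    int32_t  total;   /* accumulated position in counts */
} encoder_t;

/* Call periodically with the current counter value (e.g. TIM2->CNT). */
int32_t encoder_update(encoder_t *e, uint16_t count)
{
    /* Modulo-65536 difference reinterpreted as signed: handles wraparound
     * in both directions (relies on two's complement conversion). */
    int16_t delta = (int16_t)(uint16_t)(count - e->last);

    e->last = count;
    e->total += delta;
    return e->total;
}
```

Going from a reading of 65530 to 5 is treated as 11 steps forward, not 65525 steps backward.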

Key questions:

Why is 20kHz a good PWM frequency for motors?
How do you handle direction reversal (H-bridge control)?
What happens if encoder counts overflow?

Learning milestones:

LED brightness varies → Basic PWM works
Motor spins at controlled speed → PWM duty cycle correct
Motor reverses smoothly → Direction and ramping work
Position feedback accurate → Encoder mode works

Project 13: DMA-Driven Audio Player

File: LEARN_ARM_DEEP_DIVE.md
Main Programming Language: C
Alternative Programming Languages: Rust
Coolness Level: Level 4: Hardcore Tech Flex
Business Potential: 2. The “Micro-SaaS / Pro Tool”
Difficulty: Level 4: Expert
Knowledge Area: DMA / DAC / Audio
Software or Tool: STM32 with DAC + speaker/headphones
Main Book: “Making Embedded Systems, 2nd Edition” by Elecia White

What you’ll build: A WAV file player that uses DMA to stream audio from SD card to DAC, so the CPU only refills buffers and never touches individual samples during playback.

Why it teaches ARM: DMA (Direct Memory Access) is how real systems achieve high performance. By offloading memory transfers to hardware, the CPU is free for other tasks. This project combines: DMA configuration, double-buffering, DAC output, and real-time audio constraints.

Core challenges you’ll face:

DMA configuration → maps to peripheral-to-memory and memory-to-peripheral
Double buffering → maps to avoiding audio glitches
DAC timing → maps to timer-triggered DAC updates
WAV parsing → maps to file format understanding
Sample rate conversion → maps to timer frequency calculations

Key Concepts:

DMA Controllers: “Making Embedded Systems” Chapter 7 - Elecia White
DAC Operation: STM32 Reference Manual DAC chapter
WAV File Format: RIFF specification
Double Buffering: “Game Programming Patterns” Chapter 8 - Robert Nystrom

Difficulty: Expert
Time estimate: 2-3 weeks
Prerequisites: Projects 10-11 (I2C/SPI), understanding of audio concepts, DMA basics.

Real world outcome:

=== DMA Audio Player ===

SD card mounted, FAT32
DMA: Circular mode, half/full interrupts
DAC: 12-bit, TIM6 triggered

Loading: music.wav
  Format: PCM
  Channels: 2 (stereo, mixing to mono)
  Sample rate: 44100 Hz
  Bits per sample: 16

Configuring TIM6 for 44.1kHz...
  Timer clock: 72MHz
  Prescaler: 0
  Period: 1632 (actual: 44117 Hz, 0.04% error)

Playing... [=====>              ] 25%
  Buffer: 2048 samples, double-buffered
  DMA interrupts: 1247 (half), 1247 (full)
  CPU usage: 3% (mostly SD reads)

Press 'p' to pause, 's' to stop, '+/-' for volume

Physical result: Audio plays through speaker/headphones clearly!

Implementation Hints:

DMA configuration for DAC:

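// Configure DMA1 Channel 3 for DAC1 (the DMA channel wired to the DAC varies by STM32 family;
// check the DMA request mapping table in your reference manual)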
DMA1_Channel3->CPAR = (uint32_t)&DAC->DHR12R1;  // Destination: DAC
DMA1_Channel3->CMAR = (uint32_t)audio_buffer;   // Source: buffer
DMA1_Channel3->CNDTR = BUFFER_SIZE;              // Transfer count

DMA1_Channel3->CCR = 0;
DMA1_Channel3->CCR |= DMA_CCR_MINC;    // Memory increment
DMA1_Channel3->CCR |= DMA_CCR_CIRC;    // Circular mode
DMA1_Channel3->CCR |= DMA_CCR_DIR;     // Memory to peripheral
DMA1_Channel3->CCR |= DMA_CCR_MSIZE_0; // 16-bit memory
DMA1_Channel3->CCR |= DMA_CCR_PSIZE_0; // 16-bit peripheral
DMA1_Channel3->CCR |= DMA_CCR_HTIE;    // Half-transfer interrupt
DMA1_Channel3->CCR |= DMA_CCR_TCIE;    // Transfer complete interrupt
DMA1_Channel3->CCR |= DMA_CCR_EN;      // Enable

Double-buffering strategy:

Buffer: [    First Half    |    Second Half    ]
         ^^^^^^^^^^^^^^^^
         DMA playing this
                            ^^^^^^^^^^^^^^^^^^
                            CPU filling this

When DMA finishes first half: HT interrupt, CPU fills first half
When DMA finishes second half: TC interrupt, CPU fills second half
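The two interrupt cases above can share one refill routine that is told which half to fill. The sketch below also folds in the format conversion: signed 16-bit stereo PCM is mixed to mono and scaled to the unsigned 12-bit range the DAC expects. `pcm16_stereo_to_dac12` and `refill_half` are hypothetical names, and the buffer is kept tiny for illustration.

```c
#include <stdint.h>
#include <stddef.h>

#define BUFFER_SIZE 8   /* tiny for illustration; a real player uses far more */

uint16_t audio_buffer[BUFFER_SIZE];

/* Signed 16-bit stereo frame -> unsigned 12-bit mono sample. */
uint16_t pcm16_stereo_to_dac12(int16_t left, int16_t right)
{
    int32_t mono = ((int32_t)left + (int32_t)right) / 2;  /* -32768..32767 */
    return (uint16_t)((mono + 32768) >> 4);               /* 0..4095 */
}

/* half = 0: called from the half-transfer interrupt (DMA is now in half 1).
 * half = 1: called from the transfer-complete interrupt (DMA wrapped to half 0).
 * src holds interleaved left/right samples; returns frames consumed. */
size_t refill_half(int half, const int16_t *src)
{
    uint16_t *dst = &audio_buffer[half ? BUFFER_SIZE / 2 : 0];

    for (size_t i = 0; i < BUFFER_SIZE / 2; i++)
        dst[i] = pcm16_stereo_to_dac12(src[2 * i], src[2 * i + 1]);

    return BUFFER_SIZE / 2;
}
```

Digital silence (0, 0) maps to mid-scale 2048, so the output sits at half the reference voltage when nothing is playing.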

Timer-triggered DAC:

// TIM6 triggers DAC at sample rate
TIM6->PSC = 0;
TIM6->ARR = (72000000 / 44100) - 1;  // ARR = 1631, so 72MHz / 1632 ≈ 44.118kHz
TIM6->CR2 |= TIM_CR2_MMS_1;  // TRGO on update
TIM6->CR1 |= TIM_CR1_CEN;

// DAC configuration
DAC->CR |= DAC_CR_TEN1;      // Trigger enable
DAC->CR &= ~DAC_CR_TSEL1;    // TSEL1 = 000 selects TIM6 TRGO on STM32F1/F4 (check your part)
DAC->CR |= DAC_CR_DMAEN1;    // DMA enable
DAC->CR |= DAC_CR_EN1;       // Enable DAC
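Integer division truncates, so the divider it produces is not always the one closest to the target sample rate. Rounding to the nearest divider is a small change that can reduce the pitch error. A sketch, with `sample_timer_divider` and `actual_sample_rate` as hypothetical helper names:

```c
#include <stdint.h>

/* Divider (ARR + 1) closest to timer_hz / sample_hz, rounded to nearest.
 * Returns 0 for an invalid sample rate. */
uint32_t sample_timer_divider(uint32_t timer_hz, uint32_t sample_hz)
{
    if (sample_hz == 0)
        return 0;
    return (uint32_t)(((uint64_t)timer_hz + sample_hz / 2u) / sample_hz);
}

/* Sample rate actually produced by a divider, truncated to whole Hz. */
uint32_t actual_sample_rate(uint32_t timer_hz, uint32_t divider)
{
    return divider ? timer_hz / divider : 0;
}
```

For 72MHz and 44.1kHz the nearest divider is 1633 (about 44090Hz, roughly 0.02% low), compared with 1632 from truncation (about 44117Hz, roughly 0.04% high). The value written to ARR is the divider minus one.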

Key questions:

Why circular DMA mode for audio?
What happens if CPU can’t fill buffer fast enough?
How do you handle 16-bit audio on a 12-bit DAC?

Learning milestones:

DAC outputs a sine wave → Basic DAC works
DMA runs without CPU → DMA configuration correct
Audio plays continuously → Double-buffering works
Music sounds correct → Sample rate and bit depth right

Project 14: ARM Debugger (GDB Stub)

File: LEARN_ARM_DEEP_DIVE.md
Main Programming Language: C
Alternative Programming Languages: Rust
Coolness Level: Level 5: Pure Magic (Super Cool)
Business Potential: 4. The “Open Core” Infrastructure
Difficulty: Level 5: Master
Knowledge Area: Debugging / Debug Hardware
Software or Tool: STM32 (target) + USB-serial
Main Book: “Building a Debugger” by Sy Brand

What you’ll build: A GDB remote stub that runs on your ARM target, allowing GDB to connect over serial and debug programs—set breakpoints, single-step, inspect memory and registers.

Why it teaches ARM: This requires deep understanding of: ARM debug architecture, software breakpoints (BKPT instruction), the debug monitor exception, and the GDB remote protocol. You’re building the tool that debugs other tools.

Core challenges you’ll face:

Implementing GDB Remote Serial Protocol → maps to packet format, checksums
Software breakpoints → maps to BKPT instruction insertion
Single-stepping → maps to debug monitor and step flags
Register access → maps to reading stacked exception frames
Memory read/write → maps to arbitrary memory access safely

Resources for key challenges:

“GDB Remote Serial Protocol” - Official specification

Key Concepts:

GDB Remote Protocol: GDB Documentation
ARM Debug Architecture: “The Definitive Guide to ARM Cortex-M3” Chapter 14 - Joseph Yiu
Software Breakpoints: “Building a Debugger” - Sy Brand
Debug Monitor: ARM Cortex-M Debug Technical Reference

Difficulty: Master
Time estimate: 1 month+
Prerequisites: Projects 4, 9 (UART, exception handling), strong understanding of ARM internals.

Real world outcome:

# On your computer:
$ arm-none-eabi-gdb program.elf
(gdb) target remote /dev/ttyUSB0
Remote debugging using /dev/ttyUSB0
0x08000100 in Reset_Handler ()

(gdb) break main
Breakpoint 1 at 0x08000234: file main.c, line 12.

(gdb) continue
Continuing.
Breakpoint 1, main () at main.c:12
12	    int x = 42;

(gdb) print x
$1 = 0

(gdb) step
13	    int y = x * 2;

(gdb) print x
$2 = 42

(gdb) info registers
r0             0x42                66
r1             0x0                 0
r2             0x20000100          536871168
...
pc             0x8000238           0x8000238 <main+4>
cpsr           0x61000000          1627389952

(gdb) x/4x 0x20000000
0x20000000:	0x00000042	0x00000054	0x00000000	0x00000000

Implementation Hints:

GDB packet format:

$<data>#<checksum>

Examples:
  $g#67             - Read all registers
  $m8000000,10#xx   - Read 16 bytes at 0x08000000
  $M8000000,4:12345678#xx - Write 4 bytes
  $c#63             - Continue execution
  $s#73             - Single step
  $Z0,8000234,2#xx  - Set breakpoint at 0x08000234
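The checksum is the sum of the payload bytes modulo 256, written as two hex digits. A sketch of the framing step follows; `gdb_checksum` and `gdb_frame_packet` are hypothetical names, and the sketch does not handle escaping of the reserved characters `$`, `#`, and `}` inside the payload.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Modulo-256 sum of all payload characters. */
uint8_t gdb_checksum(const char *data)
{
    uint8_t sum = 0;
    while (*data)
        sum = (uint8_t)(sum + (uint8_t)*data++);
    return sum;
}

/* Writes "$<data>#<xx>" into out.
 * Returns the packet length, or -1 if the buffer is too small. */
int gdb_frame_packet(const char *data, char *out, size_t out_size)
{
    int n = snprintf(out, out_size, "$%s#%02x", data, gdb_checksum(data));
    return (n < 0 || (size_t)n >= out_size) ? -1 : n;
}
```

Framing `"OK"` gives `$OK#9a`, and an empty payload gives `$#00`, which is the reply GDB expects for any command the stub does not support.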

Main stub loop:

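void gdb_stub_main(void) {
    char packet[256];

    while (1) {
        gdb_receive_packet(packet);  // Payload only, framing already stripped

        switch (packet[0]) {
            case 'g':  // Read registers
                send_registers();
                break;
            case 'G':  // Write registers
                write_registers(packet + 1);
                send_ok();
                break;
            case 'm':  // Read memory
                read_memory(packet);
                break;
            case 'M':  // Write memory
                write_memory(packet);
                send_ok();
                break;
            case 'c':  // Continue: leave the stub so the target can run
                continue_execution();
                return;
            case 's':  // Step: arm single-step, then leave the stub
                single_step();
                return;
            case 'Z':  // Set breakpoint, packet is "Z0,addr,kind"
                // parse_breakpoint_addr() is a hypothetical helper
                set_breakpoint(parse_breakpoint_addr(packet));
                send_ok();
                break;
            // ... more commands ...
            default:   // Unsupported: GDB expects an empty reply ($#00)
                send_empty();  // Hypothetical helper
                break;
        }
    }
}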
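The read_memory() handler above has to pull an address and a length out of the packet text and answer with hex digits. A rough sketch of those two steps, runnable on a host, is shown here. The function names are made up for illustration, and strtoul is more lenient than the protocol (it tolerates a leading 0x or whitespace), which a stricter stub might reject.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Parse "m<addr>,<length>" (both hexadecimal).
 * Returns 0 on success, -1 on a malformed packet. */
static int parse_m_packet(const char *packet, uint32_t *addr, uint32_t *len) {
    char *end;
    if (packet[0] != 'm') return -1;
    unsigned long a = strtoul(packet + 1, &end, 16);
    if (end == packet + 1 || *end != ',') return -1;
    const char *len_str = end + 1;
    unsigned long n = strtoul(len_str, &end, 16);
    if (end == len_str || *end != '\0') return -1;
    *addr = (uint32_t)a;
    *len = (uint32_t)n;
    return 0;
}

/* Encode bytes as two lowercase hex digits each, in memory order.
 * `out` must hold 2 * len + 1 characters. */
static void hex_encode(const uint8_t *data, size_t len, char *out) {
    static const char digits[] = "0123456789abcdef";
    for (size_t i = 0; i < len; i++) {
        out[2 * i]     = digits[data[i] >> 4];
        out[2 * i + 1] = digits[data[i] & 0x0F];
    }
    out[2 * len] = '\0';
}
```

Parsing "m8000000,10" yields address 0x08000000 and length 16. The reply lists bytes in memory order, so a little-endian word 0x12345678 is sent as 78563412.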

Software breakpoints:

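// BKPT encodings: 0xBExx (16-bit Thumb) or 0xE12xxx7x (32-bit ARM).
// Cortex-M executes only Thumb, so the stub patches a 16-bit BKPT.
// A plain store only patches code that lives in RAM; code in flash
// needs the FPB hardware comparators or a flash reprogramming step.
void set_breakpoint(uint32_t addr) {
    if (n >= MAX_BREAKPOINTS) return;  // Table full (MAX_BREAKPOINTS is project-defined)

    // Save original instruction so it can be restored later
    breakpoints[n].addr = addr;
    breakpoints[n].original = *(volatile uint16_t *)addr;
    n++;

    // Insert BKPT
    *(volatile uint16_t *)addr = 0xBE00;  // BKPT #0

    // Make the write visible to instruction fetch
    // (cores with an I-cache also need a cache invalidate here)
    __DSB();
    __ISB();
}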
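GDB also removes breakpoints (the z0 packet), so the stub needs to restore the saved instruction and must not save a BKPT as the "original" when the same address is set twice. The following sketch shows one way to keep that bookkeeping. It operates on an ordinary array standing in for code memory so it can run on a host; the names bp_table, bp_insert, and bp_remove and the table size are assumptions, and target code would still need the volatile accesses and barriers shown above.

```c
#include <stdint.h>
#include <stddef.h>

#define MAX_BREAKPOINTS 8       /* Hypothetical table size */
#define THUMB_BKPT      0xBE00u /* BKPT #0 */

typedef struct {
    uint16_t *where;    /* Patched halfword, NULL if the slot is free */
    uint16_t  original; /* Instruction that BKPT replaced */
} breakpoint_t;

static breakpoint_t bp_table[MAX_BREAKPOINTS];

/* Insert a breakpoint; returns 0 on success, -1 if the table is full.
 * Setting the same address twice is treated as success. */
static int bp_insert(uint16_t *where) {
    breakpoint_t *free_slot = NULL;
    for (size_t i = 0; i < MAX_BREAKPOINTS; i++) {
        if (bp_table[i].where == where) return 0;  /* Already set */
        if (bp_table[i].where == NULL && free_slot == NULL) {
            free_slot = &bp_table[i];
        }
    }
    if (free_slot == NULL) return -1;
    free_slot->where = where;
    free_slot->original = *where;
    *where = THUMB_BKPT;
    return 0;
}

/* Remove a breakpoint and restore the original instruction.
 * Returns 0 on success, -1 if no breakpoint exists at that address. */
static int bp_remove(uint16_t *where) {
    for (size_t i = 0; i < MAX_BREAKPOINTS; i++) {
        if (bp_table[i].where == where) {
            *where = bp_table[i].original;
            bp_table[i].where = NULL;
            return 0;
        }
    }
    return -1;
}
```

Searching the whole table before claiming a free slot is what prevents a duplicate insert from overwriting the saved instruction with 0xBE00.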

Debug monitor exception:

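// Runs only when the debug monitor is enabled (DEMCR.MON_EN) and no
// external debugger has halting debug active.
void DebugMon_Handler(void) {
    uint32_t dfsr = SCB->DFSR;
    SCB->DFSR = dfsr;  // Status bits are write-one-to-clear

    // BKPT executed, or single-step completed (reported as HALTED)
    if (dfsr & (SCB_DFSR_BKPT_Msk | SCB_DFSR_HALTED_Msk)) {
        save_context();
        gdb_send_stop_reply('T', 5);  // SIGTRAP
        gdb_stub_main();
    }
}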

Key questions:

How do you handle a breakpoint placed inside an IT block, or on a 32-bit Thumb-2 instruction?
What if the user sets a breakpoint on the GDB stub itself?
How do you implement hardware breakpoints (limited number)?

Learning milestones:

GDB connects and reads registers → Basic protocol works
Memory dump works → Memory access correct
Breakpoints stop execution → BKPT and DebugMon work
Single-step works → Step flag handling correct

Project 15: Tiny Operating System

File: LEARN_ARM_DEEP_DIVE.md
Main Programming Language: C + ARM Assembly
Alternative Programming Languages: Rust
Coolness Level: Level 5: Pure Magic (Super Cool)
Business Potential: 5. The “Industry Disruptor”
Difficulty: Level 5: Master
Knowledge Area: Operating Systems / Kernel Development
Software or Tool: STM32 board
Main Book: “Operating Systems: Three Easy Pieces” by Arpaci-Dusseau

What you’ll build: A minimal but complete operating system with: preemptive multitasking, memory protection (using MPU), IPC (semaphores, queues), and a simple shell—like a tiny FreeRTOS you built yourself.

Why it teaches ARM: This is the capstone that ties everything together. You’ll implement: privileged vs unprivileged modes, MPU regions, SVCall for system calls, proper task isolation, and all the OS primitives. After this, you understand both ARM and operating systems at the deepest level.

Core challenges you’ll face:

User/Kernel mode separation → maps to ARM privilege levels
MPU configuration → maps to memory protection regions
System call interface → maps to SVC instruction and handler
Inter-task communication → maps to queues, semaphores
Priority-based scheduling → maps to scheduler algorithms

Key Concepts:

OS Fundamentals: “Operating Systems: Three Easy Pieces” - Arpaci-Dusseau
ARM Privilege Levels: “The Definitive Guide to ARM Cortex-M3” Chapter 3 - Joseph Yiu
MPU Configuration: “The Definitive Guide to ARM Cortex-M3” Chapter 11 - Joseph Yiu
System Calls: “Operating Systems: Three Easy Pieces” Chapter 6
Synchronization: “Operating Systems: Three Easy Pieces” Chapters 27-31

Difficulty: Master Time estimate: 1-2 months Prerequisites: All previous projects, especially 6-7, 9. Strong OS theory background.

Real world outcome:

=== TinyOS v1.0 ===
Kernel: 8KB Flash, 2KB RAM
User space: 120KB Flash, 30KB RAM
Tasks: 8 max, priority 0-7

Boot sequence:
  [OK] MPU configured: 8 regions
  [OK] Kernel in privileged mode
  [OK] User tasks in unprivileged mode
  [OK] SysTick @ 1ms
  [OK] Shell task started

TinyOS> ps
PID  NAME       PRI  STATE     STACK  CPU%
  1  idle         7  READY     128    85%
  2  shell        2  RUNNING   512     2%
  3  blinker      3  READY     256     1%
  4  sensor       1  BLOCKED   256    12%

TinyOS> exec counter
Starting task 'counter' (PID 5)

TinyOS> kill 3
Task 'blinker' terminated

TinyOS> mem
Kernel heap: 1024/2048 bytes used
User heap:   4096/30720 bytes used
Task stacks: 1280 bytes total

TinyOS> sem
SEM       VALUE  WAITERS
uart_tx       1  (none)
sensor_rdy    0  sensor(4)

TinyOS> msg
QUEUE     SIZE  PENDING
cmd_q     16    2 messages
data_q    64    0 messages

Implementation Hints:

System call mechanism:

// User space: request service via SVC
int sys_write(int fd, const char *buf, int len) {
    // Pin the arguments to r0-r2; r0 also carries the return value
    register int r0 __asm("r0") = fd;
    register const char *r1 __asm("r1") = buf;
    register int r2 __asm("r2") = len;

    __asm volatile (
        "SVC #1"  // System call number in immediate
        : "+r" (r0)
        : "r" (r1), "r" (r2)
        : "memory"
    );
    return r0;
}

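// Kernel: SVC handler
void SVC_Handler(void) {
    // Exception frame on the process stack: r0-r3, r12, lr, pc, xPSR
    uint32_t *sp = get_psp();  // Assumes the caller was running on PSP
    uint32_t pc = sp[6];       // Stacked return address (just after the SVC)
    uint8_t svc_num = ((uint8_t *)pc)[-2];  // Low byte of the 16-bit SVC opcode

    switch (svc_num) {
        case 0: syscall_yield(); break;
        // Results reach the caller through the stacked r0
        case 1: sp[0] = syscall_write(sp[0], (void*)sp[1], sp[2]); break;
        case 2: sp[0] = syscall_read(sp[0], (void*)sp[1], sp[2]); break;
        // ... more syscalls ...
        default: sp[0] = (uint32_t)-1; break;  // Unknown syscall number
    }
}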
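The indexing in the handler (sp[6] for the PC, then two bytes back for the immediate) follows from the eight-word exception frame and the little-endian layout of the 16-bit SVC opcode 0xDFnn. The sketch below replays that decode on a host, using a byte array as a stand-in for code memory; svc_number and the FRAME_* names are illustrative only.

```c
#include <stdint.h>

/* Index of each register in the 8-word frame stacked on exception entry. */
enum { FRAME_R0, FRAME_R1, FRAME_R2, FRAME_R3,
       FRAME_R12, FRAME_LR, FRAME_PC, FRAME_XPSR };

/* Extract the SVC immediate.
 * `mem` is a byte view of code memory whose first byte sits at address
 * `base` (a host-side stand-in for real memory).
 * Returns the immediate (0-255), or -1 if the halfword before the stacked
 * PC is not an SVC instruction. */
static int svc_number(const uint8_t *mem, uint32_t base, const uint32_t *frame) {
    uint32_t pc = frame[FRAME_PC];
    if (pc < base + 2) {
        return -1;  /* No room for an instruction before PC */
    }
    uint32_t off = pc - base;
    /* Thumb SVC is the halfword 0xDFnn, stored little-endian: nn, then 0xDF */
    uint8_t imm = mem[off - 2];
    uint8_t opcode = mem[off - 1];
    return (opcode == 0xDF) ? imm : -1;
}
```

Checking the opcode byte is optional on real hardware, since the SVC exception is only taken from an SVC instruction, but it makes a corrupted frame easier to spot during bring-up.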

MPU configuration:

// Each base address must be aligned to its region size (ARMv7-M MPU rule).
// REGION_* and MPU_RASR_AP_*/XN_* are project-defined field encodings.
void mpu_configure_task(TCB_t *task) {
    // Region 0: Code (read-only, execute)
    MPU->RBAR = task->code_start | MPU_RBAR_VALID_Msk | 0;
    MPU->RASR = MPU_RASR_ENABLE_Msk | REGION_32K |
                MPU_RASR_AP_RO_RO | MPU_RASR_XN_NO;

    // Region 1: Data (read-write, no execute)
    MPU->RBAR = task->data_start | MPU_RBAR_VALID_Msk | 1;
    MPU->RASR = MPU_RASR_ENABLE_Msk | REGION_4K |
                MPU_RASR_AP_RW_RW | MPU_RASR_XN_YES;

    // Region 2: Stack (read-write, no execute)
    MPU->RBAR = task->stack_start | MPU_RBAR_VALID_Msk | 2;
    MPU->RASR = MPU_RASR_ENABLE_Msk | REGION_1K |
                MPU_RASR_AP_RW_RW | MPU_RASR_XN_YES;

    // Make sure the new regions are in effect before the task runs
    __DSB();
    __ISB();
}
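Constants like REGION_32K above encode the RASR SIZE field, where a region covers 2^(SIZE+1) bytes with a 32-byte minimum on ARMv7-M. A small sketch of deriving that field and checking the alignment rule is given here; both function names are hypothetical, and the 4 GB region (SIZE = 31) is left out because its byte count does not fit in a uint32_t.

```c
#include <stdint.h>

/* ARMv7-M MPU: a region covers 2^(SIZE+1) bytes, minimum 32 bytes.
 * Returns the 5-bit SIZE value, or -1 if `bytes` is not a valid size. */
static int mpu_size_field(uint32_t bytes) {
    if (bytes < 32 || (bytes & (bytes - 1)) != 0) {
        return -1;  /* Too small or not a power of two */
    }
    int exponent = 0;
    while ((bytes >> exponent) != 1) {
        exponent++;
    }
    return exponent - 1;
}

/* A region's base address must be a multiple of its size. */
static int mpu_region_is_valid(uint32_t base, uint32_t bytes) {
    return mpu_size_field(bytes) >= 0 && (base & (bytes - 1)) == 0;
}
```

For the three regions above this gives SIZE values of 14 (32 KB), 11 (4 KB), and 9 (1 KB). Running the validity check when a task is created catches a misaligned stack before the MPU silently protects the wrong range.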

Task state machine:

                   preempt
         ┌──────────────────────────┐
         ▼                          │
    ┌─────────┐    schedule    ┌─────────┐    exit    ┌─────────┐
    │  READY  │───────────────▶│ RUNNING │───────────▶│ ZOMBIE  │
    └─────────┘                └─────────┘            └─────────┘
         ▲                          │
         │ signal                   │ wait
         │                          ▼
         │                     ┌─────────┐
         └─────────────────────│ BLOCKED │
                               └─────────┘
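The transitions in the diagram can be written down as a small table-like function that rejects anything the diagram does not allow. This is only a sketch: the enum and function names are invented here, and it models exactly the five edges drawn above.

```c
typedef enum { TASK_READY, TASK_RUNNING, TASK_BLOCKED, TASK_ZOMBIE } task_state_t;
typedef enum { EV_SCHEDULE, EV_PREEMPT, EV_WAIT, EV_SIGNAL, EV_EXIT } task_event_t;

/* Apply one event. Returns 0 and updates *state if the transition is legal,
 * or -1 (leaving *state unchanged) if it is not. */
static int task_transition(task_state_t *state, task_event_t ev) {
    switch (*state) {
        case TASK_READY:
            if (ev == EV_SCHEDULE) { *state = TASK_RUNNING; return 0; }
            break;
        case TASK_RUNNING:
            if (ev == EV_PREEMPT) { *state = TASK_READY;   return 0; }
            if (ev == EV_WAIT)    { *state = TASK_BLOCKED; return 0; }
            if (ev == EV_EXIT)    { *state = TASK_ZOMBIE;  return 0; }
            break;
        case TASK_BLOCKED:
            if (ev == EV_SIGNAL)  { *state = TASK_READY;   return 0; }
            break;
        case TASK_ZOMBIE:
            break;  /* Terminal state */
    }
    return -1;
}
```

The shell's kill command acts on tasks that are not running, so a real kernel would likely add edges from READY and BLOCKED to ZOMBIE; those are left out to stay faithful to the diagram.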

Key questions:

How do you switch from Handler mode to Thread mode with unprivileged access?
What happens when a user task tries to access kernel memory?
How do you implement priority inheritance for mutexes?

Learning milestones:

Tasks run in unprivileged mode → Privilege separation works
MPU faults on bad access → Memory protection works
System calls work → SVC mechanism works
Shell can spawn/kill tasks → Process management works

Project Comparison Table

Project	Difficulty	Time	Depth of Understanding	Fun Factor
1. Instruction Decoder	Intermediate	1-2 weeks	⭐⭐⭐⭐⭐ (ISA encoding)	⭐⭐⭐
2. Assembly Calculator	Beginner	Weekend	⭐⭐⭐ (registers, syscalls)	⭐⭐⭐
3. Bare-Metal LED	Advanced	1-2 weeks	⭐⭐⭐⭐⭐ (boot, GPIO)	⭐⭐⭐⭐⭐
4. UART Driver	Advanced	1-2 weeks	⭐⭐⭐⭐ (peripherals, IRQ)	⭐⭐⭐⭐
5. Memory Allocator	Advanced	1-2 weeks	⭐⭐⭐⭐ (heap, alignment)	⭐⭐⭐
6. Bootloader	Expert	2-4 weeks	⭐⭐⭐⭐⭐ (flash, boot)	⭐⭐⭐⭐⭐
7. Context Switcher	Expert	2-4 weeks	⭐⭐⭐⭐⭐ (multitasking)	⭐⭐⭐⭐⭐
8. ARM Emulator	Master	1 month+	⭐⭐⭐⭐⭐ (complete ISA)	⭐⭐⭐⭐⭐
9. Fault Analyzer	Expert	1-2 weeks	⭐⭐⭐⭐ (exceptions)	⭐⭐⭐⭐
10. I2C OLED Driver	Advanced	1-2 weeks	⭐⭐⭐ (I2C protocol)	⭐⭐⭐⭐⭐
11. SPI SD Card	Advanced	2-3 weeks	⭐⭐⭐⭐ (SPI, filesystem)	⭐⭐⭐⭐
12. PWM Motor Control	Intermediate	1 week	⭐⭐⭐ (timers, PWM)	⭐⭐⭐⭐⭐
13. DMA Audio Player	Expert	2-3 weeks	⭐⭐⭐⭐ (DMA, DAC)	⭐⭐⭐⭐⭐
14. GDB Stub	Master	1 month+	⭐⭐⭐⭐⭐ (debug arch)	⭐⭐⭐⭐
15. Tiny OS	Master	1-2 months	⭐⭐⭐⭐⭐ (everything)	⭐⭐⭐⭐⭐

Recommended Learning Path

If you’re completely new to ARM:

Start here:

Project 2: Assembly Calculator - Get comfortable with ARM assembly syntax
Project 1: Instruction Decoder - Understand how instructions are encoded
Project 3: Bare-Metal LED - Your first real hardware project

If you have some embedded experience:

Start here:

Project 3: Bare-Metal LED - Verify your bare-metal skills
Project 4: UART Driver - Build your debug console
Project 10: I2C OLED - Visual feedback is motivating!
Project 7: Context Switcher - Core RTOS concept

If you want the deep understanding:

Follow this path:

Projects 1-3 (foundations)
Projects 4-5 (peripherals and memory)
Projects 6-7 (boot and multitasking)
Project 8: ARM Emulator - The ultimate learning project
Project 15: Tiny OS - Put it all together

Hardware Requirements

Minimum kit (~$25):

STM32 Nucleo-F411RE or Nucleo-F446RE board
USB cable (included)
Breadboard and jumper wires

Recommended additions (~$50 more):

SSD1306 OLED display (I2C)
MicroSD card module
Small speaker/buzzer
DC motor with L298N driver
Rotary encoder

Final Capstone: ARM-Based Retro Game Console

File: LEARN_ARM_DEEP_DIVE.md
Main Programming Language: C + ARM Assembly
Alternative Programming Languages: Rust
Coolness Level: Level 5: Pure Magic (Super Cool)
Business Potential: 2. The “Micro-SaaS / Pro Tool”
Difficulty: Level 5: Master
Knowledge Area: Complete System / Graphics / Audio / Input
Software or Tool: STM32F4 + LCD + Buttons + Speaker
Main Book: “Computer Graphics from Scratch” by Gabriel Gambetta

What you’ll build: A complete handheld game console with: color LCD display (SPI), audio output (DAC/PWM), button input, game ROM loading from SD card, and a simple game (Tetris/Snake/Breakout).

Why this is the capstone: This project integrates everything:

Project 3: Bare-metal initialization and GPIO
Project 4: UART for debugging
Project 11: SPI for LCD and SD card
Project 12: PWM for audio
Project 13: DMA for efficient transfers
Project 7: Game loop timing (optional RTOS)
Project 5: Memory management for game assets

Core challenges you’ll face:

Fast LCD updates → SPI + DMA for 60fps
Double-buffered graphics → Tear-free rendering
Game timing → Consistent frame rate
Audio mixing → Multiple sound effects
Asset management → Loading sprites/sounds from SD
Power management → Battery-friendly operation

Key Concepts:

Game Loop Architecture: “Game Programming Patterns” Chapter 1 - Robert Nystrom
Sprite Rendering: “Computer Graphics from Scratch” - Gabriel Gambetta
Audio Synthesis: “The Audio Programming Book” - Boulanger & Lazzarini
Embedded Graphics: “Making Embedded Systems” Chapter 8 - Elecia White

Difficulty: Master Time estimate: 2-3 months Prerequisites: Most previous projects, especially 3, 10-13, graphics/game programming interest.

Real world outcome:

Physical device: A handheld game console you built from scratch!

┌─────────────────────────────────┐
│      ARM Game Console v1.0      │
│  ┌───────────────────────────┐  │
│  │                           │  │
│  │   ████████████████████    │  │
│  │   ████  TETRIS   ████    │  │
│  │   ████████████████████    │  │
│  │                           │  │
│  │     Score: 12450          │  │
│  │     Level: 5              │  │
│  │                           │  │
│  │      ░░██░░               │  │
│  │      ░░██░░               │  │
│  │    ██████████             │  │
│  │    ████░░████             │  │
│  │    ██████████             │  │
│  └───────────────────────────┘  │
│                                 │
│   [←] [→]    [↓]    [A] [B]    │
│                                 │
└─────────────────────────────────┘

Serial debug output:
FPS: 60.02 | CPU: 45% | DMA: active
Audio: 22kHz mono, 2 channels mixed
LCD: 320x240 RGB565, SPI @ 40MHz

Implementation Hints:

System architecture:

┌─────────────────────────────────────────────────┐
│                   Game Loop                      │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐       │
│  │  Input   │─▶│  Update  │─▶│  Render  │       │
│  └──────────┘  └──────────┘  └──────────┘       │
│       │                            │             │
│       ▼                            ▼             │
│  GPIO Buttons              Framebuffer[2]        │
│                                    │             │
│                                    ▼             │
│                            DMA to LCD (SPI)      │
└─────────────────────────────────────────────────┘
          │
          ▼
    Timer Interrupt (60Hz)
          │
          ▼
    Audio DMA (background)

LCD configuration (ILI9341 example):

void lcd_init(void) {
    // Hardware reset
    gpio_clear(LCD_RST); delay_ms(10);
    gpio_set(LCD_RST); delay_ms(120);

    lcd_cmd(0x01);  // Software reset
    delay_ms(5);

    lcd_cmd(0x11);  // Sleep out
    delay_ms(120);

    lcd_cmd(0x3A); lcd_data(0x55);  // 16-bit color
    lcd_cmd(0x36); lcd_data(0x48);  // Rotation
    lcd_cmd(0x29);  // Display on

    // Configure DMA for fast writes
    setup_lcd_dma();
}

Double buffering:

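// RGB565 at 320x240 is 150 KB per buffer (300 KB for two), which is more
// than the 128 KB of RAM this guide assumes. Shrink the buffers (lower
// resolution, 8-bit palette, or strip buffers) or add external RAM.
uint16_t framebuffer[2][320 * 240];
volatile uint8_t current_buffer = 0;
volatile uint8_t dma_busy = 0;  // Set by lcd_dma_start(), cleared in the DMA-complete ISR

void swap_buffers(void) {
    // Wait for previous DMA to complete
    while (dma_busy);

    // Start DMA transfer of the buffer that was just rendered
    lcd_dma_start(framebuffer[current_buffer], 320 * 240);

    // Switch to other buffer for rendering
    current_buffer ^= 1;
}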
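Before committing to a buffer layout it helps to check the arithmetic for RAM and SPI bandwidth. The helpers below are a back-of-the-envelope sketch with made-up names; the bandwidth figure is an upper bound that ignores command bytes and gaps between transfers.

```c
#include <stdint.h>

/* Bytes needed for one framebuffer. */
static uint32_t framebuffer_bytes(uint32_t width, uint32_t height,
                                  uint32_t bytes_per_pixel) {
    return width * height * bytes_per_pixel;
}

/* Upper bound on full-frame updates per second over SPI, in milli-fps
 * to avoid floating point. Real throughput is lower. */
static uint32_t spi_max_millifps(uint32_t spi_hz, uint32_t frame_bytes) {
    uint64_t bits_per_frame = (uint64_t)frame_bytes * 8u;
    return (uint32_t)(((uint64_t)spi_hz * 1000u) / bits_per_frame);
}
```

For 320x240 RGB565 this gives 153600 bytes per buffer, and at 40 MHz the bound is 32552 milli-fps, roughly 32.5 full frames per second. Reaching 60 fps on this link therefore means sending less than a full frame each time, for example only the rectangles that changed.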

Simple sprite blitting:

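void draw_sprite(int x, int y, const uint16_t *sprite, int w, int h) {
    uint16_t *fb = framebuffer[current_buffer];
    for (int row = 0; row < h; row++) {
        int py = y + row;
        if (py < 0 || py >= 240) continue;  // Clip rows outside the screen
        for (int col = 0; col < w; col++) {
            int px = x + col;
            if (px < 0 || px >= 320) continue;  // Clip columns outside the screen
            uint16_t color = sprite[row * w + col];
            if (color != TRANSPARENT) {  // Color key marks see-through pixels
                fb[py * 320 + px] = color;
            }
        }
    }
}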
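Sprite data and the TRANSPARENT color key are RGB565 values, so a helper that packs 8-bit channels into that layout is handy when preparing assets. A minimal sketch, with rgb565 as an assumed name:

```c
#include <stdint.h>

/* Pack 8-bit-per-channel color into RGB565: RRRRRGGG GGGBBBBB. */
static uint16_t rgb565(uint8_t r, uint8_t g, uint8_t b) {
    return (uint16_t)(((r & 0xF8u) << 8) | ((g & 0xFCu) << 3) | (b >> 3));
}
```

Pure red packs to 0xF800, green to 0x07E0, and blue to 0x001F. Magenta (0xF81F) is a common choice for the color key, though that is a convention rather than anything this guide specifies. Many SPI panels expect the high byte first, so byte order on the wire is worth checking against the controller's datasheet.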

Game loop timing:

#define TARGET_FPS 60
#define FRAME_TIME_MS (1000 / TARGET_FPS)  // Integer division: 16 ms, about 62.5 fps

void game_loop(void) {
    uint32_t last_time = get_ms();

    while (1) {
        // Input
        uint8_t buttons = read_buttons();

        // Update game state
        game_update(buttons);

        // Render to back buffer
        render_game();

        // Swap buffers (triggers DMA)
        swap_buffers();

        // Wait for frame timing
        uint32_t elapsed = get_ms() - last_time;
        if (elapsed < FRAME_TIME_MS) {
            delay_ms(FRAME_TIME_MS - elapsed);
        }
        last_time = get_ms();
    }
}

Key questions:

How do you handle button debouncing in real-time?
How do you manage RAM when you only have 128KB?
How do you achieve 60fps with limited SPI bandwidth?
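As a starting point for the debouncing question, one common approach is to accept a new button state only after it has been read unchanged for several consecutive samples. The sketch below assumes one sample per frame and a threshold of three samples (about 50 ms at 60 Hz); debouncer_t, debounce_update, and DEBOUNCE_SAMPLES are illustrative names.

```c
#include <stdint.h>

#define DEBOUNCE_SAMPLES 3  /* Hypothetical: consecutive equal samples needed */

typedef struct {
    uint8_t stable;     /* Debounced button bitmask */
    uint8_t candidate;  /* Most recent raw reading */
    uint8_t count;      /* How many times in a row `candidate` was seen */
} debouncer_t;

/* Feed one raw sample (one bit per button); returns the debounced bitmask. */
static uint8_t debounce_update(debouncer_t *d, uint8_t raw) {
    if (raw == d->stable) {
        d->count = 0;                 /* Nothing pending */
    } else if (raw == d->candidate) {
        if (++d->count >= DEBOUNCE_SAMPLES) {
            d->stable = raw;          /* Held long enough: accept */
            d->count = 0;
        }
    } else {
        d->count = 1;                 /* New candidate, start counting */
    }
    d->candidate = raw;
    return d->stable;
}
```

In the game loop this would wrap the raw reading, as in debounce_update(&d, read_buttons()). Because the whole bitmask is treated as one value, a bouncing button delays changes on the others; per-button counters avoid that at the cost of a little more state.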

Learning milestones:

LCD displays solid color → SPI and LCD init work
Sprites render correctly → Framebuffer and blitting work
Game runs smoothly → Timing and input work
Sound plays during game → Audio integration works

Essential Resources

Official Documentation

Books (Priority Order)

“The Art of ARM Assembly, Volume 1” by Randall Hyde - The definitive modern ARM assembly book
“The Definitive Guide to ARM Cortex-M3 and Cortex-M4” by Joseph Yiu - Essential for embedded ARM
“Making Embedded Systems, 2nd Edition” by Elecia White - Practical embedded development
“Computer Organization and Design ARM Edition” by Patterson & Hennessy - Computer architecture fundamentals
“Operating Systems: Three Easy Pieces” by Arpaci-Dusseau - OS concepts (free online)

Online Tutorials

Azeria Labs - Writing ARM Assembly - Security-focused ARM assembly
ARM Assembly By Example - Hands-on exercises
Kevin Boone - Raspberry Pi ARM Assembly - Progressive examples
cpq/bare-metal-programming-guide - Excellent STM32 bare-metal guide

Video Courses

freeCodeCamp - Assembly Language Programming with ARM - Comprehensive free course
Udemy - ARM Assembly Language From Ground Up - Paid, thorough course

Hardware Recommendations

STM32 Nucleo-F446RE (~$15) - Great all-around board, Cortex-M4F
Raspberry Pi - Linux-based ARM, good for assembly practice
STM32F4 Discovery (~$20) - More peripherals built-in

Summary: All Projects and Languages

#	Project	Main Language	Alternative Languages
1	ARM Instruction Decoder	C	Rust, Python, Go
2	ARM Assembly Calculator	ARM Assembly	N/A
3	Bare-Metal LED Blinker	C + ARM Assembly	Pure Assembly, Rust
4	UART Driver	C	ARM Assembly, Rust
5	Memory Allocator	C	Rust, ARM Assembly
6	Simple Bootloader	ARM Assembly + C	Pure Assembly
7	Context Switcher	ARM Assembly + C	Rust with inline asm
8	ARM Emulator	C	Rust, C++, Go
9	Exception Handler	C + ARM Assembly	Rust
10	I2C OLED Driver	C	Rust, MicroPython
11	SPI SD Card Driver	C	Rust
12	PWM Motor Controller	C	Rust, MicroPython
13	DMA Audio Player	C	Rust
14	GDB Stub	C	Rust
15	Tiny Operating System	C + ARM Assembly	Rust
🎮	Capstone: Game Console	C + ARM Assembly	Rust

Remember: ARM is learned by doing, not just reading. Pick a project that excites you, get the hardware, and start building. Every bug you fix teaches you something new about how ARM really works.

Good luck on your ARM journey!