Sprint: ARM Assembly Mastery - Real World Projects

Goal: Build a first-principles understanding of ARM assembly across microcontrollers and application-class systems, with enough depth to reason about performance, correctness, and hardware behavior. You will internalize how ARM profiles, execution states, registers, and memory maps shape the machine-level reality beneath high-level code, and how toolchains turn assembly into runnable binaries. You will be able to analyze and explain boot flows, interrupts, calling conventions, and memory ordering, and you will build real, observable projects that validate each concept. By the end, you can navigate both Cortex-M (Thumb) and AArch64 worlds with confidence and produce artifacts that prove you understand how ARM systems actually work.

Introduction

What is ARM assembly? It is the human-readable form of the instructions that ARM CPUs execute. ARM is a family of architectures with multiple profiles (A, M, R), each optimized for different constraints and use cases.
What problem does it solve today? It gives you control and insight over performance, timing, and hardware interaction in environments where abstractions leak: microcontrollers, real-time systems, firmware, OS kernels, and low-level tooling.
What will you build? A toolchain explorer, instruction encoder/decoder, register and stack visualizer, vector table builder, interrupt-driven I/O demos, ABI audits, memory-ordering litmus tests, and a capstone bare-metal monitor.
In scope vs out of scope: In scope are ARMv6-M/ARMv7-M (Thumb) and ARMv8-A (AArch64) concepts, toolchains, and bare-metal workflows. Out of scope are high-level OS driver frameworks, full Linux kernel bring-up, and vendor-specific BSPs (you will reference them, but not depend on them).

Big Picture: Why ARM Assembly Matters
─────────────────────────────────────────────────────────────────────────
Apps/OS code ─▶ Compiler ─▶ ARM ISA ─▶ Registers/Memory ─▶ Peripherals
     ^                  ^         ^            ^               ^
     |                  |         |            |               |
Readability         Toolchain  Instruction  Boot/Interrupts  MMIO
                   choices     encoding      & exceptions     timing

How to Use This Guide

Read the Theory Primer first; it is the mental model that makes the projects meaningful.
Pick a learning path that fits your target: Cortex-M (embedded) or AArch64 (systems).
Treat each project as an experiment: define the expected outcome, build it, then verify.

Prerequisites & Background Knowledge

Essential Prerequisites (Must Have)

Comfortable with C or Rust at the level of pointers, structs, and function calls.
Binary/hex conversions and bitwise operations.
Basic CPU concepts: registers, memory, stack, and instruction execution.
Recommended Reading: “Computer Organization and Design: ARM Edition” by Patterson & Hennessy - Ch. 1-3.

Helpful But Not Required

Familiarity with Linux command-line tooling (objdump, readelf, gdb).
Prior microcontroller exposure (GPIO, UART) or Raspberry Pi bring-up.

Self-Assessment Questions

Can you explain the difference between a register and a memory address?
Can you convert 0b10110100 to hex without a calculator?
Can you describe what happens when a function call returns?

Development Environment Setup

Required Tools:

GNU binutils (as, ld, objdump, readelf) - recent version.
AArch64 toolchain (cross-compiler or assembler) - recent version.

Recommended Tools:

QEMU for ARM emulation.
OpenOCD + GDB for hardware debugging.

Testing Your Setup:

$ aarch64-none-elf-as --version
GNU assembler (aarch64-unknown-elf) version X.Y

Time Investment

Simple projects: 4-8 hours each
Moderate projects: 10-20 hours each
Complex projects: 20-40 hours each
Total sprint: 2-4 months

Important Reality Check

ARM assembly exposes details that most developers never see. Expect to be slow at first; speed and intuition only come after repeated, observable experiments.

Big Picture / Mental Model

ARM assembly is not a single language. It is a family of instruction sets and conventions that map to the same architectural lineage but diverge by profile and execution state. A mental model that connects architecture → registers → memory map → toolchain is what makes the projects coherent.

Mental Model: From Architecture to Observable Behavior
─────────────────────────────────────────────────────────────────────────
Architecture Profile (A/M/R)
         │
         ├── Execution State (AArch64 / AArch32 / Thumb)
         │        │
         │        ├── Register Model & ABI
         │        │        │
         │        │        ├── Instruction Encoding & Addressing
         │        │        │        │
         │        │        │        ├── Memory Map & MMIO
         │        │        │        │        │
         │        │        │        │        └── Interrupts/Exceptions
         │        │        │        │
         │        │        └── Toolchain & Binary Formats (ELF)
         │        │
         └── Observable Outcomes (timing, I/O, boot, correctness)

Theory Primer

Concept 1: ARM Architecture Profiles and Execution States

Fundamentals

ARM is a family of architectures organized into profiles optimized for different constraints. The A-profile targets application processors that run rich OSes (phones, laptops, servers), the M-profile targets microcontrollers with tight power and memory budgets, and the R-profile targets deterministic real-time systems. Each profile implies a different set of instructions, privilege models, and system features. Within a profile, ARM defines execution states (such as AArch64 or AArch32) that determine register width, instruction encoding, and address space. AArch64 is the 64-bit execution state introduced in ARMv8-A, while AArch32 is the 32-bit state; M-profile uses Thumb encodings for compact code density and simpler decode logic. This is why “ARM assembly” is not a single language: the same mnemonic can encode differently, or even be invalid, depending on profile and state.

Deep Dive

The profile split is the most important high-level idea in ARM. A-profile cores are built to host complex operating systems with virtual memory, multi-core scheduling, and high performance. That means features like exception levels, MMUs, and richer instruction sets matter. In contrast, M-profile focuses on minimal latency, low power, and deterministic behavior: it strips away many features to reduce silicon cost and simplify real-time response. R-profile sits in between: it offers more deterministic timing than A-profile and more performance than M-profile. When you choose to write assembly, you’re implicitly choosing a profile, and that choice changes everything from the boot flow to the toolchain arguments you use.

Execution states deepen the split. In ARMv8-A, AArch64 brings a new 64-bit register file and 32-bit fixed-length instruction encoding (A64). AArch32 keeps the 32-bit model (A32/T32). This means that for A-profile hardware, you must tell the toolchain which execution state you are targeting; otherwise, even valid mnemonics may assemble into the wrong encoding or fail to assemble. M-profile, by contrast, uses Thumb encodings by design, favoring compact instructions and simpler decode paths. These constraints are not academic. They drive register availability, calling convention differences, and even the structure of your interrupt handlers. If you write code that assumes AArch64 and then run it on Cortex-M, the encodings and semantics are incompatible.

Another subtle but critical effect of profile and state is the system-level context. A-profile expects multiple privilege levels and potentially a hypervisor. M-profile’s exception model is simpler: its vector table is a plain array of handler addresses that the hardware reads directly, and it typically boots straight into a single firmware image. R-profile targets systems where real-time guarantees trump throughput; this affects interrupt priority, memory latency assumptions, and peripheral access patterns. Understanding profile choice lets you reason about why an instruction exists, why a particular addressing mode is missing, and why certain system registers are visible or hidden.

Finally, architecture profiles determine the ecosystem around your work. A-profile benefits from abundant tooling, standardized ABIs, and OS integration, while M-profile leans on vendor SDKs, board-specific memory maps, and smaller toolchains. This guide intentionally spans both because many real-world systems combine them: a Linux-capable application processor for high-level features and a microcontroller for deterministic control. Once you see that split, you can design experiments and projects that map to the right target without confusion.

How this fits on projects

Shapes target selection in P01 (Toolchain Pipeline Explorer) and P07 (Exception Level Lab).
Determines encoding assumptions in P03 (Thumb Encoder/Decoder).

Definitions & key terms

Profile: A family of ARM features optimized for a market segment (A/M/R).
Execution state: The architectural mode (AArch64 or AArch32) that defines register width and the available instruction sets. Thumb is strictly an instruction set rather than an execution state; this guide lists it alongside them because it is the only instruction set M-profile executes.
AArch64: 64-bit execution state introduced in ARMv8-A.
AArch32: 32-bit execution state in ARMv8-A.
Thumb: Compact instruction encoding; the only instruction set on M-profile, and also available in AArch32 as T32.

Mental model diagram

ARM Architecture Evolution:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

                              ┌─────────────────────────────────────────────┐
                              │            ARM Holdings (IP owner)           │
                              │   Designs architectures, licenses to others  │
                              └─────────────────────┬───────────────────────┘
                                                    │
           ┌────────────────────────────────────────┼────────────────────────────┐
           │                                        │                            │
           ▼                                        ▼                            ▼
    ┌──────────────┐                    ┌────────────────┐              ┌────────────────┐
    │   M-Profile  │                    │    A-Profile   │              │   R-Profile    │
    │ Microcontrollers                  │  Applications  │              │   Real-Time    │
    │ (Embedded)   │                    │ (Phones, PCs)  │              │   (Automotive) │
    └──────────────┘                    └────────────────┘              └────────────────┘
           │                                        │
    ┌──────┴──────┐                    ┌────────────┼────────────┐
    │             │                    │            │            │
    ▼             ▼                    ▼            ▼            ▼
 ┌───────┐   ┌────────┐          ┌─────────┐  ┌─────────┐  ┌─────────┐
 │Cortex │   │Cortex  │          │Cortex-A7│  │Cortex-A │  │Cortex-A │
 │ -M0+  │   │-M4/M7  │          │Cortex-A9│  │53/55/72 │  │76/78/X  │
 │       │   │        │          │(32-bit) │  │(64-bit) │  │(64-bit) │
 └───────┘   └────────┘          └─────────┘  └─────────┘  └─────────┘
     │            │                   │            │            │
     │            │                   │            │            │
  Thumb       Thumb-2              ARM32      AArch64      AArch64
  only        + DSP               + Thumb    + NEON       + SVE2
                + FPU

┌──────────────────────────────────────────────────────────────────────────────┐
│ YOUR TARGETS:                                                                 │
│                                                                               │
│ Raspberry Pi Pico (RP2040)     Raspberry Pi 3/4/5                            │
│ ├─ Dual Cortex-M0+ cores       ├─ Cortex-A53/A72/A76 cores                   │
│ ├─ ARMv6-M architecture        ├─ ARMv8-A architecture (AArch64)             │
│ ├─ Thumb instruction set       ├─ A64 instruction set                        │
│ ├─ 16 registers (r0-r15)       ├─ 31 registers (x0-x30)                      │
│ ├─ 133 MHz max clock           ├─ 1.2-2.4 GHz clock                          │
│ └─ 264 KB RAM, no OS           └─ 1-8 GB RAM, Linux capable                  │
└──────────────────────────────────────────────────────────────────────────────┘


How it works (step-by-step, with invariants and failure modes)

Choose the target profile (A/M/R) based on system constraints and OS expectations.
Select execution state (AArch64, AArch32, Thumb) based on ISA and toolchain output.
Assemble and link with profile/state-specific flags; encoding mismatches yield invalid opcodes.
Boot into the expected privilege level; if the firmware expects EL2 and you start at EL1, early setup fails.
Validate on target or emulator; incorrect profile assumptions manifest as illegal instruction faults or boot hangs.
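
The "assemble with profile/state-specific flags" step can be sketched as a small lookup that builds the assembler command line. This is a minimal sketch, not a build system: the target keys and the assembler_command helper are hypothetical names, and the exact tool prefix (arm-none-eabi-, aarch64-none-elf-) depends on how your toolchain was packaged. The -mcpu and -mthumb options are standard GNU as options.

```python
# Hypothetical target table: keys are illustrative names, not toolchain identifiers.
# The tool prefixes are assumptions; the -mcpu/-mthumb flags are GNU as options.
TARGETS = {
    "cortex-m0plus": {
        "assembler": "arm-none-eabi-as",
        "flags": ["-mcpu=cortex-m0plus", "-mthumb"],
    },
    "cortex-a53": {
        "assembler": "aarch64-none-elf-as",
        "flags": ["-mcpu=cortex-a53"],
    },
}


def assembler_command(target, source, output):
    """Return the argv list that assembles `source` for `target`."""
    cfg = TARGETS[target]
    return [cfg["assembler"], *cfg["flags"], source, "-o", output]


if __name__ == "__main__":
    print(" ".join(assembler_command("cortex-m0plus", "blink.s", "blink.o")))
    print(" ".join(assembler_command("cortex-a53", "boot.s", "boot.o")))
```

Keeping the profile and state in one table makes a mismatch visible before you ever see an illegal-instruction fault: the wrong assembler or a missing -mthumb shows up in the printed command.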

Minimal concrete example (pseudo-assembly, not runnable)

Select Target = {Profile: M, State: Thumb}
Assemble([LOAD R0, [ADDR]], Target)
If Target != CPU_State → Fault: Illegal Instruction

Common misconceptions

“ARM assembly is one language” → It is a family with profile/state splits.
“Thumb is only a compact mode” → It also shapes register access and available instructions.
“AArch64 is just ARM32 with bigger registers” → It changes the register file and encoding model.

Check-your-understanding questions

Why can Cortex-M code not run on a Cortex-A core without translation?
What is the difference between AArch64 and AArch32?
How does the profile choice affect your toolchain flags?

Check-your-understanding answers

Cortex-M uses the M-profile with Thumb encodings and a different system model; Cortex-A expects A-profile with AArch64/AArch32 states.
AArch64 is a 64-bit execution state with a new register file and A64 encoding; AArch32 is 32-bit with different encodings.
The assembler and linker must emit instructions for the correct ISA and object format; mismatches produce illegal opcodes or link errors.

Real-world applications

Microcontroller firmware (M-profile) in sensors, robotics, and embedded control.
Application processors (A-profile) in mobile, desktop, and servers.

Where you’ll apply it

P01 Toolchain Pipeline Explorer
P03 Thumb Instruction Encoder/Decoder
P07 AArch64 Exception Level Lab

References

Arm M-profile overview.
Arm R-profile overview.
Arm A-profile overview and execution states.

Key insights

Your “ARM assembly” only makes sense once you name the profile and execution state.

Summary

Profiles and execution states are the root of every other difference in ARM assembly. When you get this right, the rest of the system becomes predictable.

Homework/Exercises to practice the concept

Pick two devices (one microcontroller, one phone) and identify their ARM profile and execution state.
Write a one-paragraph explanation of why Thumb exists.

Solutions to the homework/exercises

Example: RP2040 is M-profile with Thumb; a modern smartphone SoC is A-profile with AArch64.
Thumb improves code density and decoder simplicity, which is crucial for small embedded systems.

Concept 2: Registers, Stack, and Calling Conventions

Fundamentals

Registers are the CPU’s fastest storage and form the working set for computation. ARM profiles differ in register count and special-purpose roles, but all share the idea that function calls require a contract: how arguments and return values are passed, which registers must be preserved, and how the stack is organized. The procedure call standard defines this contract so independently compiled code can interoperate. In Cortex-M, the register file is small and heavily constrained by Thumb encoding, whereas AArch64 offers 31 general-purpose registers, each usable through a 64-bit (X) or a 32-bit (W) view. The stack pointer and link register govern call/return flow, and misusing them corrupts control flow even if individual instructions look correct.

Deep Dive

The register file is the interface between the ISA and your mental model. On Cortex-M0+, you have 16 architectural registers (r0–r15), with r13 as the stack pointer, r14 as the link register (return address), and r15 as the program counter. Thumb encodings make low registers (r0–r7) more convenient, and this shapes how you allocate values and temporaries. On AArch64, the register file expands to 31 general-purpose registers (x0–x30), plus a dedicated SP. The lower 32-bit view (w0–w30) is an alias, not separate storage. This abundance reduces register pressure but increases the importance of ABI rules to maintain interoperability.
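
The "alias, not separate storage" point can be made concrete with a small model. This is a minimal sketch of the architectural behavior, not an emulator: the A64Registers class and its method names are hypothetical. It relies on one A64 rule: a write to Wn zero-extends into the full Xn.

```python
MASK64 = (1 << 64) - 1
MASK32 = (1 << 32) - 1


class A64Registers:
    """Toy model of x0-x30 showing that Wn is a view of Xn, not separate storage."""

    def __init__(self):
        self.x = [0] * 31  # x0..x30

    def write_x(self, n, value):
        self.x[n] = value & MASK64

    def read_w(self, n):
        # Reading Wn returns the low 32 bits of Xn.
        return self.x[n] & MASK32

    def write_w(self, n, value):
        # Writing Wn clears the upper 32 bits of Xn (zero-extension).
        self.x[n] = value & MASK32


if __name__ == "__main__":
    regs = A64Registers()
    regs.write_x(0, 0x1122334455667788)
    print(hex(regs.read_w(0)))
    regs.write_w(0, 0xFFFFFFFF)
    print(hex(regs.x[0]))
```

The zero-extension on W writes is why 32-bit arithmetic in AArch64 never leaves stale data in the upper half of a register.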

Calling conventions are not optional. They define which registers hold arguments, which registers are caller-saved or callee-saved, and how the stack is aligned. The AAPCS64 procedure call standard formalizes this for AArch64, and the same principle applies to Cortex-M via the 32-bit AAPCS that is part of the Arm EABI. If a function trashes a callee-saved register, its caller silently continues with corrupted values; if it returns with an unbalanced stack pointer, a later return can jump to a garbage address. The stack is not merely a place to store locals: it is a control-flow structure with strict invariants. Typical invariants include: the stack pointer must remain aligned to a fixed boundary at call boundaries, return addresses must be preserved (often via LR), and exception frames must be compatible with hardware expectations.
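
The alignment invariant is easy to check mechanically. The sketch below assumes the commonly cited boundaries of 8 bytes for the 32-bit AAPCS and 16 bytes for AAPCS64 at public call boundaries; the helper names and ABI keys are hypothetical.

```python
# Stack alignment required at public call boundaries (assumed values:
# 8 bytes for the 32-bit AAPCS, 16 bytes for AAPCS64).
CALL_ALIGNMENT = {"aapcs32": 8, "aapcs64": 16}


def sp_aligned_at_call(sp, abi):
    """True if `sp` satisfies the ABI's alignment rule at a call boundary."""
    return sp % CALL_ALIGNMENT[abi] == 0


def round_frame_size(size, abi):
    """Round a frame size up so that subtracting it keeps SP aligned."""
    align = CALL_ALIGNMENT[abi]
    return (size + align - 1) & ~(align - 1)


if __name__ == "__main__":
    sp = 0x20001000
    frame = round_frame_size(20, "aapcs32")
    print(frame, sp_aligned_at_call(sp - frame, "aapcs32"))
    print(sp_aligned_at_call(sp - 20, "aapcs32"))
```

Rounding the frame size once in the prologue is simpler than re-checking alignment at each call site, which is the approach compilers generally take.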

The difference between M-profile and A-profile also affects how you reason about stack frames and interrupts. Cortex-M pushes a standard register frame on interrupt entry and uses special EXC_RETURN values to restore state. AArch64 exceptions are higher-level and interact with exception levels (EL0–EL3). This means the calling convention is entangled with the exception model: you must ensure that the register context saved by an ISR matches what the hardware expects, and that your handler preserves the right registers. Even if you never write an ISR, debugging requires you to recognize how the stack frame was laid out and which registers are live at a given moment.
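
The standard frame that Cortex-M pushes on interrupt entry can be decoded from a raw stack dump. This is a minimal sketch assuming a little-endian core and the basic eight-word frame (r0-r3, r12, LR, PC, xPSR, in ascending address order) with no floating-point extension; the function name and the sample values are hypothetical.

```python
import struct

# Order of the eight words the hardware stacks, lowest address first.
FRAME_FIELDS = ("r0", "r1", "r2", "r3", "r12", "lr", "pc", "xpsr")


def decode_exception_frame(stack_bytes):
    """Decode the first 32 bytes at the stacked SP into named registers."""
    if len(stack_bytes) < 32:
        raise ValueError("a basic exception frame is 8 words (32 bytes)")
    words = struct.unpack("<8I", stack_bytes[:32])
    return dict(zip(FRAME_FIELDS, words))


if __name__ == "__main__":
    # Made-up dump for illustration only.
    dump = struct.pack("<8I", 1, 2, 3, 4, 12, 0x08000101, 0x08000200, 0x01000000)
    frame = decode_exception_frame(dump)
    print(hex(frame["pc"]), hex(frame["lr"]))
```

In a debugger, the stacked PC is usually the first thing to read: it tells you which instruction was interrupted or which one faulted.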

Finally, registers are a performance and correctness story. Keeping hot values in registers avoids memory latency, but retaining too many values across calls increases spill overhead and complexity. The ABI is a compromise: caller-saved registers allow fast leaf functions to avoid stack usage, while callee-saved registers allow values to persist across deeper calls. Understanding this trade-off is what turns disassembly from a wall of text into a readable narrative.

How this fits on projects

Central to P02 (Register and Stack Visualizer) and P08 (Calling Convention Audit).

Definitions & key terms

Register file: The set of architectural registers visible to instructions.
Caller-saved: Registers a caller must preserve if it needs their values after a call.
Callee-saved: Registers a callee must preserve across the call.
Stack frame: A structured region on the stack holding locals and saved state.
Procedure call standard: ABI rules for passing arguments and preserving registers.

Mental model diagram

Cortex-M0+ Register File:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

                    32 bits wide
              ◀──────────────────▶

    ┌───────────────────────────────┐
r0  │  General Purpose (argument 1) │  ─┐
    ├───────────────────────────────┤   │
r1  │  General Purpose (argument 2) │   │  Low registers:
    ├───────────────────────────────┤   │  - All Thumb instructions work
r2  │  General Purpose (argument 3) │   │  - Used for function arguments
    ├───────────────────────────────┤   │    and return values
r3  │  General Purpose (argument 4) │   │  - Caller-saved (scratch)
    ├───────────────────────────────┤   │
r4  │  General Purpose (preserved)  │   │
    ├───────────────────────────────┤   │
r5  │  General Purpose (preserved)  │   │
    ├───────────────────────────────┤   │
r6  │  General Purpose (preserved)  │   │
    ├───────────────────────────────┤   │
r7  │  General Purpose (frame ptr)  │  ─┘
    ├───────────────────────────────┤
r8  │  General Purpose (preserved)  │  ─┐  High registers:
    ├───────────────────────────────┤   │  - Only some instructions work
r9  │  General Purpose (preserved)  │   │  - Must move to low reg for
    ├───────────────────────────────┤   │    most operations
r10 │  General Purpose (preserved)  │   │  - Callee-saved
    ├───────────────────────────────┤   │
r11 │  General Purpose (preserved)  │   │
    ├───────────────────────────────┤   │
r12 │  Intra-Procedure Call scratch │  ─┘
    ├═══════════════════════════════┤  ────  SPECIAL REGISTERS  ────
r13 │  Stack Pointer (SP)           │  Points to top of stack
    ├───────────────────────────────┤  (actually 2 SPs: MSP and PSP)
r14 │  Link Register (LR)           │  Return address for functions
    ├───────────────────────────────┤
r15 │  Program Counter (PC)         │  Address of next instruction
    └───────────────────────────────┘

    ┌───────────────────────────────┐
xPSR│ N│Z│C│V│ ... │ Exception #   │  Program Status Register:
    └─┬─┴─┴─┴─────────────────────┴─┘  N = Negative, Z = Zero
      │                                 C = Carry, V = Overflow
      └─ Condition flags set by
         arithmetic operations

IMPORTANT: Cortex-M0+ cannot encode a full 32-bit constant or address as an
immediate. Load it PC-relative from a literal pool (LDR Rt, =value) or build
it up in several instructions!


AArch64 Register File:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

             64 bits wide (X registers)
      ◀──────────────────────────────────────────▶
      │    32 bits (W register alias)            │
      │                    ◀─────────────────────┤

    ┌───────────────────────────────────────────────────────────────┐
x0  │  Argument 1 / Return value                                    │  w0
    ├───────────────────────────────────────────────────────────────┤
x1  │  Argument 2 / Return value (for 128-bit returns)              │  w1
    ├───────────────────────────────────────────────────────────────┤
x2  │  Argument 3                                                   │  w2
    ├───────────────────────────────────────────────────────────────┤
...                        ...
    ├───────────────────────────────────────────────────────────────┤
x7  │  Argument 8                                                   │  w7
    ├───────────────────────────────────────────────────────────────┤
x8  │  Indirect result location (for large struct returns)         │  w8
    ├───────────────────────────────────────────────────────────────┤
x9  │  Temporary / Caller-saved                                     │  w9
    ├───────────────────────────────────────────────────────────────┤
... │  x9-x15: Temporaries (caller-saved)                           │
    ├───────────────────────────────────────────────────────────────┤
x16 │  IP0 - Intra-procedure-call scratch (PLT, veneers)           │  w16
    ├───────────────────────────────────────────────────────────────┤
x17 │  IP1 - Intra-procedure-call scratch                          │  w17
    ├───────────────────────────────────────────────────────────────┤
x18 │  Platform register (reserved on some OSes)                    │  w18
    ├───────────────────────────────────────────────────────────────┤
x19 │  Callee-saved (must preserve across calls)                    │  w19
    ├───────────────────────────────────────────────────────────────┤
... │  x19-x28: Callee-saved                                        │
    ├═══════════════════════════════════════════════════════════════┤
x29 │  Frame Pointer (FP)                                           │  w29
    ├───────────────────────────────────────────────────────────────┤
x30 │  Link Register (LR) - return address                          │  w30
    ├───────────────────────────────────────────────────────────────┤
SP  │  Stack Pointer (not a GPR, dedicated register)                │  wsp
    ├───────────────────────────────────────────────────────────────┤
PC  │  Program Counter (not directly accessible like ARM32!)        │
    └───────────────────────────────────────────────────────────────┘

    ┌───────────────────────────────────────────────────────────────┐
XZR │  Zero Register (reads as 0, writes discarded)                 │  wzr
    └───────────────────────────────────────────────────────────────┘
      ^ Any operand can be zero without first loading 0 into a register.


SIMD/Floating-Point Registers (32 × 128-bit):
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

    ┌───────────────────────────────────────────────────────────────┐
v0  │  B0│B1│B2│...│B15│  ← 16 bytes = 128 bits (Q0/V0)            │
    │  H0│H1│...│H7    │  ← 8 halfwords                             │
    │  S0│S1│S2│S3     │  ← 4 singles (float)                       │
    │  D0│D1           │  ← 2 doubles                               │
    └───────────────────────────────────────────────────────────────┘

    Used for: floating-point, SIMD (NEON), and crypto operations


Key Differences from ARM32:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
  • 31 GPRs (vs 16) → Less register pressure, fewer spills
  • PC not directly readable/writable → Use ADR/ADRP for addresses
  • Zero register (xzr/wzr) → MOV x0, xzr instead of MOV r0, #0
  • No conditional execution → Use CSEL, CSINC instead
  • 64-bit addresses → Can address all of RAM directly
  • All instructions 32-bit → No 16-bit Thumb encoding


How it works (step-by-step, with invariants and failure modes)

Arguments are placed in registers according to the ABI contract.
Callee preserves its required registers and sets up a stack frame if needed.
Return value is placed in the agreed register(s) before restoring SP and LR.
Failure mode: If SP alignment is violated or callee-saved registers are clobbered, returns jump to wrong addresses or data corrupts silently.

Minimal concrete example (pseudo-assembly, not runnable)

CALL f(a,b):
  ARG0 <- a
  ARG1 <- b
  LR   <- return_address
  SP   <- SP - frame_size
  ...
  return: restore SP, jump LR
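
The failure mode of an unbalanced stack can be shown with a toy simulation of the sequence above. This is a minimal sketch, not a model of any real core: the stack is a Python list, the callee saves LR because it makes a nested call, and simulate_call is a hypothetical name.

```python
def simulate_call(return_address, locals_pushed, locals_popped):
    """Return the value the epilogue loads as the return address."""
    stack = []
    lr = return_address            # the call places the return address in LR
    stack.append(lr)               # prologue: save LR before making nested calls
    for i in range(locals_pushed):
        stack.append(("local", i))  # allocate locals above the saved LR
    lr = "clobbered-by-nested-call"  # a nested call overwrites LR
    for _ in range(locals_popped):
        stack.pop()                # epilogue: release locals
    return stack.pop()             # epilogue: pop what should be the saved LR


if __name__ == "__main__":
    print(simulate_call(0x8000, 3, 3))  # balanced: the saved return address
    print(simulate_call(0x8000, 3, 2))  # unbalanced: a local is used as the return address
```

When pushes and pops do not match, the "return address" is whatever happens to sit at the top of the stack, which is exactly the garbage-jump behavior described in the failure mode.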

Common misconceptions

“More registers always mean simpler code” → ABI rules still constrain usage.
“Stack is just a data structure” → It also encodes control flow.

Check-your-understanding questions

Why does the ABI require callee-saved registers?
What happens if a function returns without restoring SP?
Why does AArch64 expose 31 GPRs instead of 16?

Check-your-understanding answers

It lets callers rely on certain registers surviving across calls, enabling composition.
The epilogue reads the saved return address and callee-saved registers from the wrong stack location, and the caller's own frame is left misplaced, leading to a crash or silent corruption.
AArch64’s design prioritizes performance and reduced spills; the larger register file supports that.

Real-world applications

Disassembly analysis for security or performance auditing.
Interfacing assembly routines with C libraries.

Where you’ll apply it

P02 Register and Stack Visualizer
P08 Calling Convention Audit

References

AArch64 register model overview.
AAPCS64 procedure call standard.

Key insights

Calling conventions are the glue that makes low-level code composable.

Summary

Registers and the stack are not independent; they form a contract that every function must obey.

Homework/Exercises to practice the concept

Draw a stack frame for a function that calls two helpers and uses three local variables.
Identify which registers you would preserve in a callee according to a generic ABI.

Solutions to the homework/exercises

The frame must allocate locals, save the return address, and preserve any callee-saved registers used.
Preserve registers designated as callee-saved in the ABI; all others are caller-saved.

Concept 3: Instruction Encoding and Addressing Modes

Fundamentals

Every assembly instruction is encoded into bits. Encoding determines which registers and immediates are accessible, how large constants can be, and which addressing modes are legal. Thumb encodings prioritize compactness and energy efficiency for microcontrollers, while AArch64 uses fixed 32-bit instruction widths to simplify decode and improve pipeline predictability. Understanding encoding explains why some instructions are missing or require multi-instruction sequences and why certain address calculations must be split into steps.

Deep Dive

Instruction encoding is the “physics” of assembly. A mnemonic like MOV is a human label for a specific bit pattern; if that pattern cannot fit your operands, the assembler will either refuse or emit a different instruction sequence. Thumb uses a mix of 16-bit and 32-bit encodings (Thumb-2), which means register and immediate fields are often smaller. This is why Cortex-M uses a small set of low registers more naturally and why large constants must be loaded via literal pools or multi-step sequences. AArch64, in contrast, uses 32-bit instructions exclusively, providing more predictable decoding and a richer register space. The trade-off is code density versus decode simplicity.
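
To see a mnemonic as a bit pattern, here is a minimal sketch of an encoder for the 16-bit Thumb move-immediate form (opcode bits 00100, a 3-bit Rd, an 8-bit immediate). The function name is hypothetical; the field layout is the one used by the 16-bit MOVS Rd, #imm8 encoding.

```python
def encode_movs_imm8(rd, imm8):
    """Encode the 16-bit Thumb instruction MOVS Rd, #imm8 as a halfword."""
    if not 0 <= rd <= 7:
        raise ValueError("this encoding only reaches low registers r0-r7")
    if not 0 <= imm8 <= 0xFF:
        raise ValueError("the immediate field is 8 bits wide")
    return (0b00100 << 11) | (rd << 8) | imm8


if __name__ == "__main__":
    halfword = encode_movs_imm8(0, 42)
    print(hex(halfword), halfword.to_bytes(2, "little").hex())
```

The two range checks are the encoding constraints in miniature: a 3-bit register field is why high registers are awkward in 16-bit Thumb, and an 8-bit immediate field is why larger constants need a literal pool or a multi-instruction sequence.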

Addressing modes compound encoding constraints. Load/store architectures like ARM separate arithmetic from memory access: you compute addresses in registers, then load or store. But the address computation itself is limited by encoding fields. For example, immediate offsets might be limited to a certain bit width or require alignment. When you understand the bit fields, you can predict when the assembler will need to generate extra instructions, which in turn affects performance and size. This is especially important in microcontrollers where code size is constrained and instruction fetches may come from slow flash.
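
The limits on immediate offsets can be checked before you assemble. This minimal sketch covers one case: the 16-bit Thumb LDR Rt, [Rn, #imm] form, whose 5-bit offset field is scaled by 4 (so word-aligned offsets from 0 to 124) and whose register fields reach only r0-r7. The function names are hypothetical.

```python
def ldr_imm_fits_16bit(offset):
    """True if `offset` fits the 16-bit LDR Rt, [Rn, #imm] encoding."""
    return 0 <= offset <= 124 and offset % 4 == 0


def encode_ldr_imm_16bit(rt, rn, offset):
    """Encode 16-bit Thumb LDR Rt, [Rn, #offset] (opcode bits 01101)."""
    if not (0 <= rt <= 7 and 0 <= rn <= 7):
        raise ValueError("this encoding only reaches low registers r0-r7")
    if not ldr_imm_fits_16bit(offset):
        raise ValueError("offset must be a multiple of 4 in the range 0..124")
    return (0b01101 << 11) | ((offset >> 2) << 6) | (rn << 3) | rt


if __name__ == "__main__":
    print(hex(encode_ldr_imm_16bit(0, 1, 4)))
    print(ldr_imm_fits_16bit(128), ldr_imm_fits_16bit(6))
```

When the check fails, the address has to be computed in a register first, which costs extra instructions. That is the size and performance effect the paragraph above describes.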

Encoding also interacts with endianness and instruction alignment. Many ARM cores require instructions to be aligned to 2 or 4 bytes depending on the ISA. Misalignment results in faults or unintended behavior. The assembler handles alignment for you, but if you build a binary layout manually (for example in a boot image), you must respect alignment rules to avoid hard-to-debug startup failures. This is why toolchain awareness is an essential complement to ISA knowledge.

Finally, encoding shapes the patterns you see in disassembly. A sequence of machine code bytes can decode to different instructions depending on execution state. This is a common pitfall in reverse engineering: decoding AArch64 bytes as Thumb or ARM32 yields nonsense. Knowing the encoding class prevents misinterpretation and helps you verify that your build pipeline is producing the ISA you intended.
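
The same bytes really do present differently to different decoders. This minimal sketch splits one little-endian byte string two ways: as 16-bit Thumb halfwords and as 32-bit words of the size an A64 decoder consumes. The function names are hypothetical, and the sample bytes are the Thumb sequence MOVS r0, #42 followed by BX LR.

```python
import struct


def as_thumb_halfwords(code):
    """View little-endian code bytes as 16-bit Thumb halfwords."""
    return list(struct.unpack(f"<{len(code) // 2}H", code))


def as_32bit_words(code):
    """View the same bytes as 32-bit words, the unit an A64 decoder reads."""
    return list(struct.unpack(f"<{len(code) // 4}I", code))


if __name__ == "__main__":
    code = bytes.fromhex("2a207047")  # MOVS r0, #42 ; BX LR in Thumb
    print([hex(h) for h in as_thumb_halfwords(code)])
    print([hex(w) for w in as_32bit_words(code)])
```

A disassembler told the wrong execution state starts from the second view and produces output that has nothing to do with the two Thumb instructions that were assembled.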

How this fits on projects

Core to P03 (Thumb Instruction Encoder/Decoder) and P01 (Toolchain Pipeline Explorer).

Definitions & key terms

Encoding: Bit-level representation of an instruction.
Addressing mode: How an instruction specifies the location of its operands.
Thumb/Thumb-2: Compact ARM instruction encodings for M-profile.
A64: 32-bit fixed-length instruction encoding for AArch64.

Mental model diagram

Thumb Instruction Encoding Examples:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

16-bit Thumb instruction: MOVS r0, #42  (on Cortex-M0+ this form always sets flags)
┌──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┐
│ 0│ 0│ 1│ 0│ 0│ Rd  │     imm8 (immediate)    │
│ 0│ 0│ 1│ 0│ 0│0│0│0│ 0│ 0│ 1│ 0│ 1│ 0│ 1│ 0│
└──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┘
              │r0 │         42 = 0x2A
              │   │
              ▼   ▼
          Encodes to: 0x202A (little-endian: 2A 20)


32-bit Thumb-2 instruction: LDR r0, [r1, #offset]  (when offset > 124 or not a multiple of 4)
┌──────────────────────────────────────────────────────────────────┐
│ First halfword (16 bits)  │  Second halfword (16 bits)           │
│   encoding prefix + Rn    │     Rt + imm12 offset                │
└──────────────────────────────────────────────────────────────────┘


Common Thumb Instructions You'll Use:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Data Movement:
  MOVS Rd, #imm8        Move 8-bit immediate to register (sets flags)
  MOV  Rd, Rm           Move register to register
  LDR  Rt, [Rn, #off]   Load word from memory
  STR  Rt, [Rn, #off]   Store word to memory
  PUSH {reglist}        Push registers to stack
  POP  {reglist}        Pop registers from stack

Arithmetic:
  ADDS Rd, Rn, #imm3    Add 3-bit immediate (sets flags)
  ADDS Rd, #imm8        Add 8-bit immediate to Rd (sets flags)
  SUBS Rd, Rn, #imm3    Subtract 3-bit immediate (sets flags)
  SUBS Rd, Rn, Rm       Subtract registers (sets flags)

Logic:
  ANDS Rd, Rm           Bitwise AND (sets flags)
  ORRS Rd, Rm           Bitwise OR (sets flags)
  EORS Rd, Rm           Bitwise XOR, exclusive OR (sets flags)
  LSLS Rd, Rm, #imm5    Logical shift left (sets flags)
  LSRS Rd, Rm, #imm5    Logical shift right (sets flags)

Control Flow:
  B    label            Unconditional branch
  BEQ  label            Branch if equal (Z=1)
  BNE  label            Branch if not equal (Z=0)
  BL   function         Branch with link (function call)
  BX   Rm               Branch to address in register
  BLX  Rm               Branch with link to address in register


LIMITATION: Cortex-M0+ is MISSING many instructions!
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
✗ No hardware divide (UDIV, SDIV)     → Must use software division
✗ No bit-field instructions (BFI)     → Must use shift/mask sequences
✗ No conditional execution (IT block) → Must use branches
✗ Limited addressing modes            → Can't do [Rn, Rm, LSL #2]
✗ No saturation arithmetic            → Must check overflow manually

Thumb Instruction Encoding

How it works (step-by-step, with invariants and failure modes)

The assembler maps mnemonics to encoding templates for the target ISA.
Register and immediate fields are packed into fixed bit positions.
If a value doesn’t fit, the assembler emits a sequence or errors out.
Failure mode: decoding with the wrong ISA yields invalid instructions or faults.

Minimal concrete example (pseudo, not runnable)

ENCODE(op=ADD, rd=R0, rn=R1, imm=5)
→ [opcode bits][rd bits][rn bits][imm bits]
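
The pseudo-example above can be sketched as real bit packing. The two functions below encode the 16-bit Thumb forms MOVS Rd, #imm8 and ADDS Rd, Rn, #imm3; the function names are made up for illustration, and only these two encodings are covered.

```python
def encode_movs_imm8(rd: int, imm8: int) -> int:
    """MOVS Rd, #imm8 (16-bit Thumb): 00100 | Rd(3) | imm8(8)."""
    if not (0 <= rd <= 7 and 0 <= imm8 <= 0xFF):
        raise ValueError("operand does not fit the encoding fields")
    return (0b00100 << 11) | (rd << 8) | imm8


def encode_adds_imm3(rd: int, rn: int, imm3: int) -> int:
    """ADDS Rd, Rn, #imm3 (16-bit Thumb): 0001110 | imm3(3) | Rn(3) | Rd(3)."""
    if not (0 <= rd <= 7 and 0 <= rn <= 7 and 0 <= imm3 <= 7):
        raise ValueError("operand does not fit the encoding fields")
    return (0b0001110 << 9) | (imm3 << 6) | (rn << 3) | rd


print(hex(encode_movs_imm8(0, 42)))                         # 0x202a
print(encode_movs_imm8(0, 42).to_bytes(2, "little").hex())  # 2a20
print(hex(encode_adds_imm3(0, 1, 5)))                       # 0x1d48
```

The range checks are where the "assembler error" lives: ADDS r0, r1, #9 has no 16-bit encoding because 9 does not fit in three bits.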

Common misconceptions

“Assembler will always accept my operands” → Encoding limits still apply.
“Instruction length doesn’t matter” → It affects alignment and memory layout.

Check-your-understanding questions

Why does Thumb use shorter encodings than AArch64?
What happens if an immediate is too large for its field?
Why can decoding with the wrong execution state break disassembly?

Check-your-understanding answers

Thumb optimizes for code density and decoder simplicity in M-profile contexts.
The assembler emits a sequence or reports an error because it cannot fit the value.
The same bytes map to different instruction sets depending on state, so decoding mismatches yield nonsense.

Real-world applications

Building encoders/decoders for tooling and reverse engineering.
Size-sensitive firmware builds for microcontrollers.

Where you’ll apply it

P03 Thumb Instruction Encoder/Decoder
P01 Toolchain Pipeline Explorer

References

Arm A-profile overview (AArch64, instruction model).

Key insights Encoding constraints explain most “mysterious” assembler errors.

Summary Instruction encodings are the boundary between human mnemonics and machine reality; mastering them unlocks predictability.

Homework/Exercises to practice the concept

Choose a Thumb instruction and manually identify which bits encode the register fields.
Explain why a large constant may require multiple instructions.

Solutions to the homework/exercises

The register fields are fixed bit slices in the instruction encoding; their size limits which registers are directly addressable.
If the immediate field is too small, the assembler must build the constant through multiple steps.
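
One way to see the "multiple steps" idea is to generate the sequence mechanically. The sketch below builds a 32-bit constant one byte at a time using only MOVS, LSLS, and ADDS with 8-bit immediates, which are all available on a Cortex-M0+. It is a naive strategy chosen for clarity, not what an assembler does: in practice a literal-pool load (LDR Rd, =const) is usually shorter. The helper names are hypothetical.

```python
def build_const_m0(reg: str, value: int) -> list:
    """Emit a MOVS/LSLS/ADDS sequence that builds a 32-bit constant bytewise."""
    if not 0 <= value <= 0xFFFFFFFF:
        raise ValueError("value must fit in 32 bits")
    # Drop leading zero bytes; keep at least one byte so zero still works.
    data = value.to_bytes(4, "big").lstrip(b"\x00") or b"\x00"
    lines = [f"MOVS {reg}, #{data[0]}"]
    for b in data[1:]:
        lines.append(f"LSLS {reg}, {reg}, #8")
        if b:
            lines.append(f"ADDS {reg}, #{b}")
    return lines


def simulate(lines: list) -> int:
    """Tiny interpreter for the three instructions above, to check the result."""
    acc = 0
    for line in lines:
        op = line.split(" ", 1)[0]
        imm = int(line.rsplit("#", 1)[1])
        if op == "MOVS":
            acc = imm
        elif op == "LSLS":
            acc = (acc << imm) & 0xFFFFFFFF
        elif op == "ADDS":
            acc = (acc + imm) & 0xFFFFFFFF
    return acc


seq = build_const_m0("r0", 0x12345678)
print("\n".join(seq))
print(hex(simulate(seq)))
```

A value that fits in 8 bits costs one instruction; 0x12345678 costs seven, which is the size and speed penalty the encoding limit imposes.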

Concept 4: Memory Maps, MMIO, and Memory Ordering

Fundamentals ARM systems expose peripherals through memory-mapped I/O (MMIO): reading or writing specific addresses triggers hardware behavior rather than normal memory access. This is central to microcontrollers and still vital on A-profile SoCs. The memory map defines which address ranges are RAM, flash, peripherals, and internal control regions. Memory ordering adds another layer: modern CPUs can reorder memory accesses for performance, so barriers (DMB/DSB/ISB) are required to guarantee visibility and ordering to devices or other cores. Understanding MMIO and ordering is the key to controlling hardware reliably.

Deep Dive A memory map is a contract between the CPU and the SoC. Addresses are not abstract: they correspond to real hardware blocks. In Cortex-M systems, large fixed ranges map to flash, SRAM, peripherals, and internal control registers. These ranges determine what happens when you load or store. For example, a store to a GPIO register flips a pin; a load from a UART data register consumes a byte from a FIFO. MMIO behaves differently from RAM: it is often non-cacheable, may have side effects on read, and is frequently write-only or read-only. When you treat it like ordinary memory, bugs emerge: stale values, missing updates, or unintended state changes.

Memory ordering complicates this further. ARM cores, like most modern CPUs, can reorder memory operations to improve performance. This is invisible in single-threaded logic but catastrophic for devices and multi-core coordination. If you write a command buffer to memory and then write a “doorbell” MMIO register that tells the device to consume it, the device might see the doorbell first unless you insert a barrier. ARM provides three barrier instructions (DMB, DSB, ISB), each with distinct strength. DMB ensures prior memory accesses are observed before subsequent ones; DSB additionally holds back later instructions until those accesses have completed; ISB flushes the instruction pipeline so that earlier context changes, such as system-register writes, take effect. These are not optional: they are the difference between “mostly works” and “always correct.”

On microcontrollers, you may not have caches or complex reorder buffers, but the bus fabric and peripheral interactions still require ordering. On A-profile systems with caches, speculation, and out-of-order execution, the need is even greater. DMA engines read and write memory independently of the CPU; if you don’t synchronize caches or enforce ordering, the DMA sees stale or partial data. This is why firmware often combines barriers with explicit cache maintenance. The principle is simple: your mental model must include the device, the bus, and the CPU pipeline, not just the instruction sequence.

MMIO access patterns also introduce concurrency hazards. Read-modify-write sequences can race with interrupts or other cores. Hardware often provides SET/CLEAR registers specifically to avoid these races by allowing atomic bit operations. If you ignore these and perform a naive read-modify-write, you can silently clear unrelated bits. The safest approach is to understand the register semantics and use the atomic registers provided. That is not assembly-specific, but assembly exposes the pattern directly and makes it obvious.

How this fits on projects

Core to P04 (Memory Map & MMIO Field Notebook) and P09 (Memory Ordering Litmus Tests).

Definitions & key terms

Memory map: The assignment of address ranges to RAM, flash, and peripherals.
MMIO: Memory addresses that control hardware rather than store data.
DMB/DSB/ISB: Memory barrier instructions for ordering and visibility.

Mental model diagram

Cortex-M Memory Map (4GB address space):
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

    0xFFFFFFFF ┌─────────────────────────────────────────┐
               │         Vendor-Specific                 │
    0xE0100000 ├─────────────────────────────────────────┤
               │         Private Peripheral Bus          │  ← NVIC lives here
               │         (Internal peripherals)          │    at 0xE000E000
    0xE0000000 ├─────────────────────────────────────────┤
               │                                         │
               │         External Device                 │  ← Memory-mapped
               │         (Peripherals, etc.)             │    devices
               │                                         │
    0xA0000000 ├─────────────────────────────────────────┤
               │                                         │
               │         External RAM                    │
               │                                         │
    0x60000000 ├─────────────────────────────────────────┤
               │                                         │
               │         Peripheral                      │  ← GPIO, UART, SPI,
               │         (On-chip I/O)                   │    I2C, PWM, etc.
               │                                         │
    0x40000000 ├─────────────────────────────────────────┤
               │                                         │
               │         SRAM                            │  ← Variables, stack,
               │         (On-chip RAM)                   │    heap
               │                                         │
    0x20000000 ├─────────────────────────────────────────┤
               │                                         │
               │         Code                            │  ← Flash/ROM with
               │         (Flash/ROM)                     │    your program
               │                                         │
    0x00000000 └─────────────────────────────────────────┘
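
The map above can be turned into a small lookup, which is handy when reading fault addresses or disassembly. This is a sketch of the architectural default regions only; the table and function names are invented here, and a specific chip will populate only parts of each region.

```python
# (first address, last address, region name), taken from the diagram above.
CORTEX_M_REGIONS = [
    (0x00000000, 0x1FFFFFFF, "Code"),
    (0x20000000, 0x3FFFFFFF, "SRAM"),
    (0x40000000, 0x5FFFFFFF, "Peripheral"),
    (0x60000000, 0x9FFFFFFF, "External RAM"),
    (0xA0000000, 0xDFFFFFFF, "External Device"),
    (0xE0000000, 0xE00FFFFF, "Private Peripheral Bus"),
    (0xE0100000, 0xFFFFFFFF, "Vendor-Specific"),
]


def classify(addr: int) -> str:
    """Return the default Cortex-M region that contains addr."""
    for first, last, name in CORTEX_M_REGIONS:
        if first <= addr <= last:
            return name
    raise ValueError("address is outside the 32-bit address space")


for addr in (0x10000000, 0x20000100, 0x40034000, 0xE000E000):
    print(f"0x{addr:08X} -> {classify(addr)}")
```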


RP2040-Specific Memory Map:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

    Address         │  Size      │  Contents
    ────────────────┼────────────┼─────────────────────────────────────
    0x10000000      │  2 MB      │  External Flash (XIP)
                    │            │  ↳ Your code runs from here
    ────────────────┼────────────┼─────────────────────────────────────
    0x20000000      │  256 KB    │  Main SRAM (4 banks × 64KB)
                    │            │  ↳ Variables, stack, heap
    0x20040000      │  4 KB      │  SRAM4 (scratch bank, e.g. a core stack)
    0x20041000      │  4 KB      │  SRAM5 (scratch bank, e.g. a core stack)
    ────────────────┼────────────┼─────────────────────────────────────
    0x40000000      │  -         │  APB Peripherals
                    │            │  ↳ UART, SPI, I2C, PWM...
    ────────────────┼────────────┼─────────────────────────────────────
    0x50000000      │  -         │  AHB-Lite Peripherals
                    │            │  ↳ DMA, USB, PIO...
    ────────────────┼────────────┼─────────────────────────────────────
    0xD0000000      │  -         │  SIO (Single-cycle I/O)
                    │            │  ↳ GPIO (fast access!)
    ────────────────┼────────────┼─────────────────────────────────────
    0xE0000000      │  -         │  Cortex-M0+ internal
                    │            │  ↳ NVIC, SysTick, Debug

Cortex-M Memory Map

Memory-Mapped I/O Concept:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

    Normal Memory:                  Peripheral Register:
    ──────────────                  ────────────────────
    LDR r0, [addr]                  LDR r0, [UART_DATA]
         │                               │
         ▼                               ▼
    Read from RAM                   Read TRIGGERS HARDWARE!
    Data was sitting there          Byte removed from RX FIFO
    Memory unchanged                Status flags updated

    STR r0, [addr]                  STR r0, [GPIO_OUT]
         │                               │
         ▼                               ▼
    Write to RAM                    Write CAUSES ACTION!
    Data now stored there           Pin voltage changes
    Can read it back                May not read same value back


Example: GPIO Control on RP2040:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

    SIO Base: 0xD0000000

    Offset  │ Register       │ Purpose
    ────────┼────────────────┼──────────────────────────────────
    0x000   │ CPUID          │ Processor ID (read-only)
    0x004   │ GPIO_IN        │ Read current GPIO input state
    0x010   │ GPIO_OUT       │ Read/write GPIO output state
    0x014   │ GPIO_OUT_SET   │ Set bits in GPIO_OUT (write-only)
    0x018   │ GPIO_OUT_CLR   │ Clear bits in GPIO_OUT (write-only)
    0x01C   │ GPIO_OUT_XOR   │ Toggle bits in GPIO_OUT (write-only)
    0x020   │ GPIO_OE        │ Output enable (1=output, 0=input)
    0x024   │ GPIO_OE_SET    │ Set bits in GPIO_OE
    0x028   │ GPIO_OE_CLR    │ Clear bits in GPIO_OE


    To turn ON GPIO25 (Pico's LED):
    ─────────────────────────────────────────────────────────────────

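    // Assumes IO_BANK0/PADS_BANK0 are out of reset and GPIO25 is
    // already routed to SIO (GPIO25_CTRL FUNCSEL = 5).
    LDR  r0, =0xD0000000     // SIO base address
    MOVS r1, #1              // Cortex-M0+ has only the flag-setting MOVS
    LSLS r1, r1, #25         // r1 = 0x02000000 (bit 25)
    STR  r1, [r0, #0x024]    // GPIO_OE_SET: enable output
    STR  r1, [r0, #0x014]    // GPIO_OUT_SET: set high → LED ON!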


    Why SET/CLR registers instead of just GPIO_OUT?
    ─────────────────────────────────────────────────────────────────

    Without SET/CLR (DANGEROUS):
    ┌────────────────────────────────────────────────────────────────┐
    │ LDR r1, [r0, #GPIO_OUT]   // Read current value                │
    │ ORRS r1, r2               // Set bit 25 (r2 = 1<<25)           │
    │ STR r1, [r0, #GPIO_OUT]   // Write back                        │
    │                                                                 │
    │ PROBLEM: If another core or interrupt modifies GPIO_OUT        │
    │ between the LDR and STR, those changes are LOST!               │
    │ This is a classic "read-modify-write race condition."          │
    └────────────────────────────────────────────────────────────────┘

    With SET/CLR (ATOMIC and SAFE):
    ┌────────────────────────────────────────────────────────────────┐
    │ LDR r1, =(1<<25)                                                │
    │ STR r1, [r0, #GPIO_OUT_SET]  // Hardware atomically sets bit   │
    │                                                                 │
    │ Other bits are UNAFFECTED - hardware handles it!               │
    └────────────────────────────────────────────────────────────────┘
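
The lost-update problem is easy to reproduce in a toy model. The sketch below simulates a GPIO_OUT register with a SET alias and replays the same interleaving both ways: once as read-modify-write with an interrupt landing between the load and the store, and once using the SET alias. The class and the bit chosen for the "other" pin are assumptions made for the example, not RP2040 code.

```python
class GpioOut:
    """Toy model of a GPIO_OUT register with a write-only SET alias."""

    def __init__(self):
        self.value = 0

    def read(self) -> int:
        return self.value

    def write(self, v: int) -> None:
        self.value = v & 0xFFFFFFFF

    def write_set(self, mask: int) -> None:
        # The hardware ORs the mask into the register in a single bus write.
        self.value |= mask & 0xFFFFFFFF


LED = 1 << 25
OTHER = 1 << 3

# Read-modify-write, interrupted between the load and the store.
reg = GpioOut()
snapshot = reg.read()            # main code: LDR
reg.write(reg.read() | OTHER)    # interrupt handler sets bit 3 here
reg.write(snapshot | LED)        # main code: ORR + STR with the stale value
rmw_result = reg.read()

# Same interleaving using the SET alias.
reg = GpioOut()
reg.write_set(OTHER)             # interrupt handler
reg.write_set(LED)               # main code
set_result = reg.read()

print(f"read-modify-write: 0x{rmw_result:08X}")
print(f"SET alias:         0x{set_result:08X}")
```

In the first run bit 3 is silently lost; in the second both bits survive because neither writer ever held a stale copy of the register.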

Memory-Mapped I/O

Why Memory Barriers Are Needed:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Modern CPUs reorder memory accesses for performance. This is usually
invisible to single-threaded code, but becomes critical when:

  1. Communicating with peripherals (they have side effects!)
  2. Multi-core systems (other cores see different ordering)
  3. DMA operations (hardware sees memory, not caches)

Example WITHOUT barrier (BROKEN):
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

    You write:                  CPU might execute as:
    ──────────────────────      ────────────────────────────────
    mailbox_buffer[0] = cmd     mailbox_write = buffer_addr ← FIRST!
    mailbox_buffer[1] = arg     mailbox_buffer[0] = cmd     ← TOO LATE
    mailbox_write = buffer_addr mailbox_buffer[1] = arg

    The peripheral reads garbage because the buffer wasn't filled yet!


ARM Memory Barrier Instructions:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

    DMB (Data Memory Barrier)
    ├── Ensures memory accesses before the barrier are observed
    │   before memory accesses after it
    ├── Does NOT affect instruction execution order
    └── Use between: data writes and peripheral write

    DSB (Data Synchronization Barrier)
    ├── Like DMB, but no later instruction executes until all
    │   earlier memory accesses have completed (stronger than DMB)
    └── Use before: peripheral access that must be visible

    ISB (Instruction Synchronization Barrier)
    ├── Flushes the instruction pipeline
    ├── Ensures previous context changes take effect
    └── Use after: changing system registers, enabling MMU


Correct Pattern for Peripheral Access:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

    // Fill mailbox buffer
    str  w1, [x0]           // Write data to buffer
    str  w2, [x0, #4]       // Write more data

    dsb  sy                 // ← BARRIER: Complete all writes

    str  w3, [x4]           // Now write to mailbox register
                            // Hardware now sees complete buffer

ARM Memory Barriers

How it works (step-by-step, with invariants and failure modes)

Identify which addresses are MMIO and which are normal memory.
Use atomic SET/CLEAR registers when available to avoid races.
Insert barriers before device “doorbell” writes to guarantee ordering.
Failure mode: devices read partial buffers, interrupts race, or GPIO bits flip incorrectly.

Minimal concrete example (pseudo, not runnable)

WRITE buffer
BARRIER
WRITE device_register
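
The three-step pattern above can be explored with a deliberately simplified model. The sketch below treats stores as sitting in a buffer that may drain to memory in any order, and a barrier as "drain everything issued so far". Real hardware ordering is defined by the architecture's memory model and memory types, not by a single buffer, so treat this as an intuition aid only. All names and addresses are made up.

```python
class WriteBuffer:
    """Toy model: stores wait in a buffer and may reach memory in any order."""

    def __init__(self, memory: dict):
        self.memory = memory
        self.pending = []

    def store(self, addr: int, value: int) -> None:
        self.pending.append((addr, value))

    def drain_one(self, index: int) -> None:
        addr, value = self.pending.pop(index)
        self.memory[addr] = value

    def barrier(self) -> None:
        # Everything issued before the barrier becomes visible first.
        while self.pending:
            self.drain_one(0)


BUF0, BUF1, DOORBELL = 0x1000, 0x1004, 0x2000


def device_view(memory: dict) -> tuple:
    """What the device reads at the moment the doorbell becomes visible."""
    return (memory.get(BUF0), memory.get(BUF1))


# Without a barrier: the doorbell write happens to drain first.
mem = {}
wb = WriteBuffer(mem)
wb.store(BUF0, 0xC0DE)
wb.store(BUF1, 0x0042)
wb.store(DOORBELL, 1)
wb.drain_one(2)                  # doorbell overtakes the buffer writes
broken = device_view(mem)

# With a barrier between the buffer writes and the doorbell.
mem = {}
wb = WriteBuffer(mem)
wb.store(BUF0, 0xC0DE)
wb.store(BUF1, 0x0042)
wb.barrier()
wb.store(DOORBELL, 1)
wb.drain_one(0)
fixed = device_view(mem)

print("without barrier:", broken)
print("with barrier:   ", fixed)
```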

Common misconceptions

“MMIO behaves like RAM” → Reads and writes can trigger side effects.
“Ordering is always preserved” → CPUs and buses can reorder operations.

Check-your-understanding questions

Why can reading a UART data register change system state?
When do you need a DSB instead of a DMB?
Why are SET/CLEAR registers safer than read-modify-write?

Check-your-understanding answers

MMIO reads can pop FIFO entries or clear flags, which changes hardware state.
When earlier memory accesses must have completed, not merely be ordered, before any later instruction executes (for example before WFI, or after cache or TLB maintenance).
They avoid races because the hardware performs the atomic bit update.

Real-world applications

GPIO control, DMA setup, and peripheral initialization in firmware.

Where you’ll apply it

P04 Memory Map & MMIO Field Notebook
P09 Memory Ordering Litmus Tests

References

Arm ACLE barrier intrinsics and semantics.

Key insights MMIO and ordering are the difference between “works once” and “always correct.”

Summary Memory maps define what addresses mean; barriers define when writes become real.

Homework/Exercises to practice the concept

Describe a race condition caused by a read-modify-write on GPIO.
Sketch an ordering bug where a peripheral sees stale data.

Solutions to the homework/exercises

Another core sets a different bit between your read and write; your write erases it.
You signal the device before writing the buffer; it reads garbage.

Concept 5: Boot, Exceptions, and Interrupts

Fundamentals Boot and exception handling define how control flow starts and changes when the system is interrupted. On Cortex-M, reset reads a vector table at a fixed address to obtain the initial stack pointer and reset handler. On AArch64, exception levels (EL0–EL3) define privilege and isolation across kernel, hypervisor, and secure monitor. Interrupts are structured events with defined entry and exit behavior; when misunderstood, they cause the most common low-level failures (silent lockups, corrupted stacks, and unacknowledged interrupts).

Deep Dive Boot flow is architecture-specific, but it always starts with the hardware choosing a program counter and stack pointer. In Cortex-M, the vector table is a literal list of addresses at the start of flash (or a remapped location). The CPU loads the initial SP from offset 0 and the reset handler from offset 4; execution begins there. This is why vector tables are so critical: a single incorrect address prevents boot. In AArch64 systems, boot is more complex. Firmware (or a ROM) selects the initial exception level and execution state, then transfers control to your image. Your code may be entered at EL3, EL2, or EL1 depending on the platform; understanding the starting level is essential for setting up the MMU and interrupt controller.

Exceptions and interrupts are structured transitions. On Cortex-M, hardware automatically saves a register frame on the stack and switches to handler mode. This means your ISR is effectively running on a known stack layout; if you violate it, return from interrupt fails. AArch64 exceptions follow a different path: they are taken to the same or a higher exception level and use per-EL banked registers (SP_ELx, ELR_ELx, SPSR_ELx) and exception vector tables that differ per EL. This makes exception handling on A-profile both more powerful and more complex. In practice, you must know which registers are saved by hardware and which you must save manually, and you must understand the difference between synchronous exceptions (e.g., illegal instruction) and asynchronous interrupts (e.g., timer).

Interrupt latency is also a systems-level trade-off. M-profile is designed for low-latency, deterministic responses, which is why it dominates microcontroller workloads. This is a critical difference from A-profile, where throughput and virtualization might be prioritized. When you design firmware, you need to decide which tasks are best done in an ISR versus in the main loop; an ISR that does too much can starve other interrupts and introduce jitter.

Finally, exceptions connect directly to debugging. Many “mysterious” crashes are just unhandled faults. On Cortex-M, a hard fault may indicate an invalid memory access or misaligned stack. On AArch64, synchronous exceptions reveal illegal instructions or permission violations. By understanding the exception model, you can interpret fault codes and correlate them to your assembly-level behavior, which is a core skill in systems programming and security analysis.

How this fits on projects

Central to P05 (Vector Table Builder), P06 (Interrupt-Driven UART), and P07 (Exception Level Lab).

Definitions & key terms

Vector table: Table of exception handler addresses used at reset or interrupt.
Exception level: Privilege tier in AArch64 (EL0–EL3).
ISR: Interrupt service routine.
HardFault: Cortex-M fault handler for severe errors.

Mental model diagram

Cortex-M Boot Sequence:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

    Power Applied
         │
         ▼
    ┌─────────────────────────────────────────────────────────────────┐
    │  1. CPU comes out of reset                                      │
    │     - All registers undefined (except SP and PC)                │
    │     - Processor in Thread mode, privileged                      │
    │     - Using Main Stack Pointer (MSP)                            │
    └─────────────────────────────────────────────────────────────────┘
         │
         ▼
    ┌─────────────────────────────────────────────────────────────────┐
    │  2. CPU reads address 0x00000000 (or VTOR)                      │
    │     - Loads INITIAL STACK POINTER value                         │
    │     - This value goes into SP/r13                               │
    └─────────────────────────────────────────────────────────────────┘
         │
         ▼
    ┌─────────────────────────────────────────────────────────────────┐
    │  3. CPU reads address 0x00000004                                │
    │     - Loads RESET HANDLER address                               │
    │     - This value goes into PC/r15                               │
    │     - Bit 0 MUST be 1 (Thumb mode indicator)                    │
    └─────────────────────────────────────────────────────────────────┘
         │
         ▼
    ┌─────────────────────────────────────────────────────────────────┐
    │  4. Execution begins at Reset_Handler                           │
    │     - Your code starts running!                                 │
    │     - Stack is ready to use                                     │
    │     - All peripherals need initialization                       │
    └─────────────────────────────────────────────────────────────────┘


Vector Table Structure (first 16 entries are standard Cortex-M):
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

    Offset   │  Exception #  │  Contents
    ─────────┼───────────────┼────────────────────────────────────────
    0x0000   │  -            │  Initial Stack Pointer value
    0x0004   │  1 (Reset)    │  Reset_Handler address (| 1 for Thumb)
    0x0008   │  2 (NMI)      │  NMI_Handler address
    0x000C   │  3 (HardFault)│  HardFault_Handler address
    0x0010   │  4            │  Reserved (M0+ doesn't use)
    ...      │  ...          │  ...
    0x003C   │  15 (SysTick) │  SysTick_Handler address
    0x0040   │  16 (IRQ0)    │  First peripheral interrupt
    0x0044   │  17 (IRQ1)    │  Second peripheral interrupt
    ...      │  ...          │  (RP2040 has 26 IRQs)


Example minimal vector table in assembly:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

    .section .vectors, "a"
    .align 2

    .word   _stack_top          // 0x00: Initial SP
    .word   Reset_Handler + 1   // 0x04: Reset (bit 0 = Thumb)
    .word   NMI_Handler + 1     // 0x08: NMI
    .word   HardFault_Handler+1 // 0x0C: HardFault
    .word   0                   // 0x10: Reserved
    // ... more entries ...

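NOTE: On RP2040, address 0x00000000 is the boot ROM and flash is mapped
at 0x10000000. The first 256 bytes of flash hold the second-stage
bootloader (boot2), so with the Pico SDK layout your vector table sits
at 0x10000100; boot2 points VTOR at it and loads SP and PC from there.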
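
Because a vector table is just a list of 32-bit words, it can be built and inspected with a few lines of host-side code. The sketch below packs an initial SP and handler addresses into little-endian words and forces bit 0 on each handler entry. The function name and the handler addresses are placeholders chosen for the example; only the layout follows the table above.

```python
import struct


def build_vector_table(stack_top: int, handlers: list) -> bytes:
    """Pack an initial SP plus handler addresses into little-endian words.

    handlers holds handler addresses, or None for reserved slots.
    Bit 0 is forced to 1 on every non-reserved entry to mark Thumb state.
    """
    words = [stack_top]
    for addr in handlers:
        words.append(0 if addr is None else (addr | 1))
    return struct.pack("<%dI" % len(words), *words)


table = build_vector_table(
    0x20042000,                                   # assumed top of SRAM
    [0x10000200, 0x10000210, 0x10000220, None],   # Reset, NMI, HardFault, reserved
)
print(table.hex())
```

Comparing this output against a hex dump of the first bytes of your linked image is a quick way to confirm that the linker script placed the table where you expect.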

Cortex-M Boot Sequence

AArch64 Exception Levels:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

    ┌─────────────────────────────────────────────────────────────────┐
    │  EL3: Secure Monitor                                            │
    │       - Highest privilege, manages secure/non-secure worlds     │
    │       - TrustZone firmware lives here                           │
    ├─────────────────────────────────────────────────────────────────┤
    │  EL2: Hypervisor                                                │
    │       - Virtualization support                                  │
    │       - Controls virtual machines                               │
    ├─────────────────────────────────────────────────────────────────┤
    │  EL1: OS Kernel                                                 │
    │       - Where Linux kernel runs                                 │
    │       - Your bare-metal code runs here!                         │
    ├─────────────────────────────────────────────────────────────────┤
    │  EL0: User Applications                                         │
    │       - Lowest privilege                                        │
    │       - Normal programs run here under Linux                    │
    └─────────────────────────────────────────────────────────────────┘

    On Raspberry Pi boot:
    ┌──────────────────────────────────────────────────────────────┐
    │ GPU firmware loads kernel8.img at 0x80000. The ARM stub then │
    │ starts at EL3, drops to EL2, and jumps to 0x80000 at EL2.    │
    │ Your bare-metal code typically runs at EL1 after setup.      │
    └──────────────────────────────────────────────────────────────┘
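
When you are unsure which level your image was entered at, the usual first step is to read the CurrentEL system register (mrs x0, CurrentEL) and look at bits [3:2]. The sketch below shows only the decoding step on the host; the role strings come from the diagram above, and the sample value is what the register holds at EL2.

```python
EL_ROLES = {
    0: "user applications",
    1: "OS kernel",
    2: "hypervisor",
    3: "secure monitor",
}


def decode_current_el(current_el: int) -> int:
    """Extract the exception level from a CurrentEL value (bits [3:2])."""
    return (current_el >> 2) & 0b11


value = 0x8  # CurrentEL as read while running at EL2
el = decode_current_el(value)
print(f"EL{el}: {EL_ROLES[el]}")
```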

AArch64 Exception Levels

Interrupt Flow on Cortex-M:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

    Main Code Running
           │
           │ ← UART receives byte
           │   Hardware sets interrupt flag
           │   NVIC sees enabled interrupt
           ▼
    ┌──────────────────────────────────────────────────────────────────┐
    │  AUTOMATIC HARDWARE ACTIONS (you don't write code for this):    │
    │  1. Finish current instruction                                   │
    │  2. Push 8 registers to stack: r0-r3, r12, LR, PC, xPSR         │
    │  3. Load new PC from vector table (exception #)                  │
    │  4. Load EXC_RETURN (e.g. 0xFFFFFFF9) into LR                   │
    │  5. Enter Handler mode (privileged)                              │
    └──────────────────────────────────────────────────────────────────┘
           │
           ▼
    ┌──────────────────────────────────────────────────────────────────┐
    │  YOUR ISR EXECUTES:                                              │
    │  - Must save r4-r11 if you use them (push {r4-r7})              │
    │  - Read UART data register (clears interrupt flag)               │
    │  - Process byte (store in buffer, set flag, etc.)                │
    │  - Restore r4-r11 if saved                                       │
    │  - Return with: BX LR (the magic EXC_RETURN value)               │
    └──────────────────────────────────────────────────────────────────┘
           │
           ▼
    ┌──────────────────────────────────────────────────────────────────┐
    │  AUTOMATIC HARDWARE ACTIONS:                                     │
    │  1. Hardware detects EXC_RETURN in LR                            │
    │  2. Pop 8 registers from stack                                   │
    │  3. Resume execution exactly where interrupted                   │
    │  4. Return to Thread mode                                        │
    └──────────────────────────────────────────────────────────────────┘
           │
           ▼
    Main Code Continues (unaware anything happened!)


Stack During Interrupt:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

    BEFORE interrupt:           AFTER entry, BEFORE ISR code:
    ┌──────────────┐           ┌──────────────┐
    │  (old data)  │           │  (old data)  │
    │              │           ├──────────────┤
    │              │           │  xPSR        │ ← +0x1C from new SP
    │              │           ├──────────────┤
    │              │           │  PC (return) │ ← +0x18
    │              │           ├──────────────┤
    │              │           │  LR          │ ← +0x14
    │              │           ├──────────────┤
    │              │           │  r12         │ ← +0x10
    │              │           ├──────────────┤
    │              │           │  r3          │ ← +0x0C
    │              │           ├──────────────┤
    │              │           │  r2          │ ← +0x08
    │              │           ├──────────────┤
    │              │           │  r1          │ ← +0x04
    │              │           ├──────────────┤
SP →│              │        SP→│  r0          │ ← +0x00 (new SP)
    └──────────────┘           └──────────────┘

    The 32 bytes (8 × 4) are pushed automatically by hardware!
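
When debugging a fault, this frame is the first thing to decode. The sketch below turns 32 raw bytes read from the stack pointer at handler entry into named registers, following the offsets in the diagram above. The function name and the sample values are invented for the example, and it assumes a core without floating-point state stacking.

```python
import struct

# Order matches increasing addresses from the new SP: +0x00 .. +0x1C.
FRAME_REGS = ("r0", "r1", "r2", "r3", "r12", "lr", "pc", "xpsr")


def parse_exception_frame(raw: bytes) -> dict:
    """Decode the 8-word frame that Cortex-M hardware pushes on exception entry."""
    if len(raw) < 32:
        raise ValueError("need at least 32 bytes of stack")
    values = struct.unpack_from("<8I", raw, 0)
    return dict(zip(FRAME_REGS, values))


# Made-up stack contents for illustration.
sample = struct.pack(
    "<8I", 1, 2, 3, 4, 0x12, 0x10000235, 0x10000300, 0x01000000
)
frame = parse_exception_frame(sample)
for name in FRAME_REGS:
    print(f"{name:>4} = 0x{frame[name]:08X}")
```

The stacked PC tells you which instruction was interrupted or faulted, and the stacked LR tells you who called the function it was in.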

Cortex-M Interrupt Flow

How it works (step-by-step, with invariants and failure modes)

Boot loads initial SP and PC from the vector table (Cortex-M) or firmware-defined entry (AArch64).
An interrupt triggers hardware context save and branches to the handler.
Handler restores context and returns using the architecture-specific mechanism.
Failure mode: wrong vector address or corrupted stack → boot hang or fault loop.
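
The return step on Cortex-M hinges on the EXC_RETURN value that hardware places in LR. As a sketch, the lookup below decodes the three values defined for cores without a floating-point or security extension; the function and table names are made up for the example.

```python
# EXC_RETURN -> (mode returned to, stack pointer used for unstacking)
EXC_RETURN_MEANING = {
    0xFFFFFFF1: ("Handler", "MSP"),
    0xFFFFFFF9: ("Thread", "MSP"),
    0xFFFFFFFD: ("Thread", "PSP"),
}


def decode_exc_return(lr: int) -> tuple:
    """Explain where an exception return with this LR value will go."""
    try:
        return EXC_RETURN_MEANING[lr]
    except KeyError:
        raise ValueError(f"0x{lr:08X} is not a basic EXC_RETURN value") from None


mode, stack = decode_exc_return(0xFFFFFFF9)
print(f"return to {mode} mode, unstack from {stack}")
```

If LR no longer holds one of these values when the handler returns (for example because the handler clobbered it), the core performs an ordinary branch instead of an exception return, which is one way a corrupted stack turns into a fault loop.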

Minimal concrete example (pseudo, not runnable)

VECTOR_TABLE[Reset] -> Reset_Handler
On interrupt: push context, branch handler, restore, return

Common misconceptions

“Interrupt handlers are just normal functions” → They obey different entry/exit rules.
“Boot is just jump to main” → Boot is a structured sequence with strict alignment rules.

Check-your-understanding questions

Why must the reset handler entry in the vector table have bit 0 set on Cortex-M?
What does EL1 represent in AArch64?
Why must ISRs be short?

Check-your-understanding answers

Bit 0 of the address indicates Thumb state; if it is 0, the core tries to enter ARM state, which Cortex-M does not support, and takes a fault.
EL1 is the kernel-level privilege where OS code typically runs.
Long ISRs increase latency and block equal- and lower-priority interrupts for as long as they run.

Real-world applications

Firmware bootloaders, interrupt-driven I/O, and OS exception handling.

Where you’ll apply it

P05 Vector Table Builder
P06 Interrupt-Driven UART
P07 AArch64 Exception Level Lab

References

AArch64 exception model and privilege levels.
Cortex-M profile emphasis on low-latency interrupt response.

Key insights

Boot and exceptions are not features you add later; they are the foundation of control flow.

Summary

Once you understand boot and exceptions, most “mysterious” bare-metal failures become obvious.

Homework/Exercises to practice the concept

Draw the Cortex-M vector table layout and label the first 8 entries.
Explain how an interrupt differs from a synchronous exception.

Solutions to the homework/exercises

Entry 0 is the initial SP, followed by Reset (1), NMI (2), and HardFault (3). Entries 4-6 are MemManage, BusFault, and UsageFault on ARMv7-M (reserved on ARMv6-M), and entry 7 is reserved.
Interrupts are asynchronous hardware events; synchronous exceptions are triggered by the current instruction.

Concept 6: Toolchain and Binary Formats (ELF)

Fundamentals

Assembly alone is not executable; you need a toolchain to assemble, link, and package code into a binary format. GNU as (the GNU assembler) accepts assembly source and emits object files; the linker combines objects into an executable with sections and symbols. On most ARM systems, the object format is ELF, defined by the System V ABI family. Understanding sections, symbols, and relocations is essential for boot images, firmware layout, and disassembly.

Deep Dive

The toolchain is a pipeline: source → object → linked image. The assembler parses directives, encodes instructions for the target ISA, and emits relocatable objects. The linker then resolves symbols, assigns addresses, applies relocations, and produces a final ELF file or a raw binary. This is not a black box: if your startup code lands at the wrong address or your vector table is misaligned, the linker script and the section names feeding it are the first place to look. The GNU assembler manual documents directive syntax and how the assembler handles sections, alignment, and symbols.

ELF (Executable and Linkable Format) is the standard container for compiled objects. It defines headers, sections, and symbol tables so tools can reason about what is in a binary. ELF’s strength is transparency: you can inspect sections such as .text (code), .data (initialized data), .bss (zero-initialized data), and custom sections for vector tables or boot metadata. In embedded contexts, you often convert ELF into a raw binary that can be flashed, but the ELF remains the authoritative artifact for debugging because it keeps the symbols, section headers, and (when built with -g) debug information that the raw binary discards.

Relocations are where everything connects. When the assembler emits an instruction that references a symbol whose address is not yet known, it emits a relocation entry. The linker later resolves it. This is how references to labels, functions, and global variables are patched. If you understand relocations, you can interpret why certain instructions appear in disassembly, and you can identify errors like “relocation overflow” or “undefined reference.” The same reasoning applies to position-independent code or shared libraries on A-profile systems.
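Relocation processing can be shown with a toy linker. The sketch below is in Python and assumes a single absolute 32-bit relocation type, in the style of R_ARM_ABS32, where the addend is stored in the word being patched. It leaves out the Thumb bit that the real relocation ORs in for function symbols. The function names and addresses are hypothetical.

```python
import struct


def apply_abs32(section: bytearray, offset: int, symbol_addr: int) -> None:
    """Patch a little-endian 32-bit word with symbol address + stored addend."""
    (addend,) = struct.unpack_from("<I", section, offset)
    value = (symbol_addr + addend) & 0xFFFFFFFF
    struct.pack_into("<I", section, offset, value)


def link(section: bytearray, relocs, symtab: dict) -> None:
    """Resolve each (offset, symbol_name) pair, or fail as a linker would."""
    for offset, name in relocs:
        if name not in symtab:
            raise KeyError(f"undefined reference to {name}")
        apply_abs32(section, offset, symtab[name])


# A two-word data section: word 0 refers to `buffer`, word 1 to `buffer + 8`.
data = bytearray(struct.pack("<II", 0, 8))
relocs = [(0, "buffer"), (4, "buffer")]
link(data, relocs, {"buffer": 0x2000_0100})
print([hex(word) for word in struct.unpack("<II", data)])
```

Calling `link` with a symbol table that lacks `buffer` raises the toy equivalent of an "undefined reference" error. A "relocation overflow" would be the case where the computed value does not fit in the field being patched, which this 32-bit model sidesteps by masking.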

In practical terms, mastering the toolchain lets you answer questions like: Why is my vector table not at the start of flash? Why does the linker place my .data in RAM but my .text in flash? Why does a symbol show up as undefined? These are the exact questions you will encounter in bare-metal ARM development, and they can only be solved by understanding ELF and the linker. The toolchain also connects to diagnostics: objdump and readelf are not just utilities; they are the microscope that lets you see what the assembler and linker actually produced.
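The question about .data living in RAM while .text stays in flash has a concrete answer in startup code. Initial values are stored in flash and copied to RAM before main runs, and .bss is zero-filled. Here is a minimal Python model of that step; the symbol names in the `sym` dictionary and all addresses are assumptions chosen for illustration, not names your linker script necessarily defines.

```python
def run_startup(flash: bytes, ram: bytearray, sym: dict) -> None:
    """Model what a reset handler does before calling main()."""
    size = sym["data_end"] - sym["data_start"]
    # .data: initial values are stored in flash (load address) and must be
    # copied to RAM (run address) because the variables are writable.
    src = sym["data_load"] - sym["flash_base"]
    dst = sym["data_start"] - sym["ram_base"]
    ram[dst:dst + size] = flash[src:src + size]
    # .bss: takes no space in the image; it is simply zero-filled in RAM.
    lo = sym["bss_start"] - sym["ram_base"]
    hi = sym["bss_end"] - sym["ram_base"]
    ram[lo:hi] = bytes(hi - lo)


# Hypothetical layout: 16 bytes of code, then 4 bytes of .data initializers.
flash_image = b"\xAA" * 16 + b"\x01\x02\x03\x04"
ram_image = bytearray(b"\xFF" * 16)  # power-on RAM content is arbitrary
symbols = {
    "flash_base": 0x1000_0000, "ram_base": 0x2000_0000,
    "data_load": 0x1000_0010,
    "data_start": 0x2000_0000, "data_end": 0x2000_0004,
    "bss_start": 0x2000_0004, "bss_end": 0x2000_000C,
}
run_startup(flash_image, ram_image, symbols)
print(ram_image.hex())
```

If the linker script and the startup code disagree about any of these boundaries, initialized globals hold garbage even though the build succeeds, which is why this class of bug is a layout problem rather than a code problem.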

How this fits on projects

Core to P01 (Toolchain Pipeline Explorer) and P10 (Capstone Monitor).

Definitions & key terms

Assembler: Translates assembly source into object files.
Linker: Resolves symbols and produces an executable or binary.
ELF: Executable and Linkable Format for binaries.
Relocation: A record telling the linker which location to patch once a symbol’s final address is known.

Mental model diagram

Toolchain Flow
──────────────
Source (.s) → Assembler → Object (.o) → Linker → ELF (.elf) → Binary (.bin)
                    │                 │
               Symbols/Relocs    Sections/Addresses

How it works (step-by-step, with invariants and failure modes)

Assemble source into relocatable objects.
Link with a linker script or default layout.
Verify ELF sections and symbols.
Failure mode: wrong section placement → boot hangs or interrupts jump to wrong address.
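The verification step can start with nothing more than the ELF file header. The sketch below is a small Python parser that reads the class, machine, and entry point; it is an illustration rather than a replacement for readelf. The `fake_header` helper is hypothetical and only builds in-memory test input so the example runs without a toolchain.

```python
import struct

EM_ARM, EM_AARCH64 = 40, 183  # e_machine values from the ELF specifications


def parse_elf_header(blob: bytes) -> dict:
    """Return class, type, machine, and entry point from an ELF file header."""
    if blob[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    ei_class, ei_data = blob[4], blob[5]
    endian = "<" if ei_data == 1 else ">"
    # After the 16-byte e_ident: e_type, e_machine, e_version, then e_entry,
    # which is 4 bytes wide in ELF32 and 8 bytes wide in ELF64.
    entry_fmt = "I" if ei_class == 1 else "Q"
    e_type, e_machine, _version, e_entry = struct.unpack_from(
        endian + "HHI" + entry_fmt, blob, 16)
    return {"bits": 32 if ei_class == 1 else 64,
            "type": e_type, "machine": e_machine, "entry": e_entry}


def fake_header(ei_class: int, machine: int, entry: int) -> bytes:
    """Build just enough of a little-endian ELF header for the demo."""
    ident = b"\x7fELF" + bytes([ei_class, 1, 1]) + bytes(9)
    fmt = "<HHI" + ("I" if ei_class == 1 else "Q")
    return ident + struct.pack(fmt, 2, machine, 1, entry)  # e_type 2 = ET_EXEC


thumb_elf = parse_elf_header(fake_header(1, EM_ARM, 0x1000_0041))
a64_elf = parse_elf_header(fake_header(2, EM_AARCH64, 0x8_0000))
print(thumb_elf, a64_elf)
```

With a real build, pass the first 64 bytes of the file instead of `fake_header(...)`. An odd entry address on a 32-bit ARM ELF indicates a Thumb entry point, which ties the header back to the boot rules from the previous concept.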

Minimal concrete example (pseudo, not runnable)

.section .vectors
.word reset_handler
.linker: place .vectors at flash start

Common misconceptions

“ELF is only for OS programs” → It is central in embedded, too.
“Linker script is optional” → Not when you need precise memory layout.

Check-your-understanding questions

What is the role of a relocation entry?
Why do embedded projects often convert ELF to raw binary?
What is the difference between .text and .bss?

Check-your-understanding answers

It records a reference the linker must patch with a final address.
Flashing tools often want raw bytes, but ELF holds symbols for debugging.
.text holds code; .bss holds zero-initialized data.

Real-world applications

Firmware image layout, boot loaders, and disassembly tooling.

Where you’ll apply it

P01 Toolchain Pipeline Explorer
P10 Capstone Monitor

References

GNU assembler manual.
ELF format and ABI overview.

Key insights

The toolchain is the bridge between assembly and hardware; without it, nothing runs.

Summary

Understanding ELF and linking turns build failures into solvable layout problems.

Homework/Exercises to practice the concept

Identify three sections you expect in a bare-metal ELF and explain why.
Explain how a symbol reference becomes a concrete address.

Solutions to the homework/exercises

.text for code, .data for initialized globals, .bss for zeroed globals.
The assembler emits a relocation that the linker resolves to the final address.

Glossary

AArch64: 64-bit execution state for ARMv8-A.
ABI: Application Binary Interface; rules for calling and binary layout.
ELF: Executable and Linkable Format.
MMIO: Memory-mapped I/O; addresses that control hardware.
Vector table: Table of exception handler addresses used by the CPU.

Why ARM Assembly Matters

Modern motivation: ARM is the dominant architecture for mobile and embedded systems, and it is rapidly expanding in servers and cloud workloads.
Real-world statistics: Arm reported ~7 billion Arm-based chips shipped in the quarter ending March 31, 2024, with a cumulative total of 287+ billion chips shipped.
Impact: Reuters (2023) reports Arm-based CPUs power about 99% of smartphones, underscoring why ARM expertise maps to the dominant mobile ecosystem.

Old Model (Black Box)               New Model (Assembly-Aware)
┌──────────────────────────┐        ┌──────────────────────────┐
│ High-level language only │        │ Knows registers & MMIO   │
│ "Why did this fail?"     │   →    │ "Which register spilled?"│
│                          │        │ "Which barrier missing?" │
└──────────────────────────┘        └──────────────────────────┘

Context & Evolution

ARM evolved from embedded efficiency into a broad computing ecosystem; profiles and execution states reflect that expansion.

Concept Summary Table

Concept Cluster	What You Need to Internalize
Profiles & Execution States	A/M/R profiles and AArch64/AArch32/Thumb states define which instructions and system features are valid.
Registers & Calling Convention	Register roles, stack behavior, and ABI rules are mandatory for correct control flow.
Instruction Encoding	Encodings explain operand limits, instruction size, and why some sequences require expansion.
Memory & Ordering	MMIO semantics and barriers determine when devices see your writes.
Boot & Exceptions	Vector tables and exception levels define how control flow begins and changes.
Toolchain & ELF	Assemblers and linkers turn text into runnable binaries with explicit memory layout.

Project-to-Concept Map

Project	Concepts Applied
Project 1	Profiles & Execution States, Toolchain & ELF
Project 2	Registers & Calling Convention
Project 3	Instruction Encoding
Project 4	Memory & Ordering
Project 5	Boot & Exceptions, Toolchain & ELF
Project 6	Boot & Exceptions, Memory & Ordering
Project 7	Profiles & Execution States, Boot & Exceptions
Project 8	Registers & Calling Convention
Project 9	Memory & Ordering
Project 10	Toolchain & ELF, Boot & Exceptions

Deep Dive Reading by Concept

Concept	Book and Chapter	Why This Matters
Profiles & Execution States	“Computer Organization and Design: ARM Edition” - Ch. 1-2	ISA and architecture context.
Registers & Calling Convention	“ARM Assembly Language” by William Hohl - Ch. 3-4	Register usage and calling patterns.
Instruction Encoding	“ARM Assembly Language” by William Hohl - Ch. 5	Encoding constraints and instruction forms.
Memory & Ordering	“The Definitive Guide to ARM Cortex-M3/M4” by Joseph Yiu - Ch. 4-6	Memory map and peripheral access.
Boot & Exceptions	“The Definitive Guide to ARM Cortex-M3/M4” by Joseph Yiu - Ch. 10-12	Exception handling and vector tables.
Toolchain & ELF	“Linkers and Loaders” by John R. Levine - Ch. 1-3	Sections, symbols, and relocations.

Quick Start

Day 1:

Read Concept 1 and Concept 2 in the Theory Primer.
Start Project 1 and produce the first toolchain artifacts.

Day 2:

Validate Project 1 against the Definition of Done.
Read Concept 3 and sketch the Thumb encoding diagram by hand.

Recommended Learning Paths

Path 1: The Embedded Builder

Project 1 → Project 3 → Project 4 → Project 5 → Project 6 → Project 10

Path 2: The Systems Engineer

Project 1 → Project 2 → Project 7 → Project 8 → Project 9 → Project 10

Success Metrics

You can explain why a specific instruction encoding fails without trial-and-error.
You can trace an interrupt from hardware event to handler and back.
You can inspect an ELF and explain where code and data live.

Tooling & Debugging Appendix

Toolchain Cheat Sheet (No code, just intent)

as: assemble text into object files.
ld: link object files into ELF images.
objdump: inspect instructions and sections.
readelf: inspect ELF headers and symbols.

Project Overview Table

#	Project Name	Main Language	Difficulty	Time Estimate	Core Concepts	Coolness
1	Toolchain Pipeline Explorer	Assembly + C	Level 2	6-10 hrs	Profiles, ELF	★★★☆☆
2	Register & Stack Visualizer	Assembly + C	Level 2	8-12 hrs	Registers, ABI	★★★☆☆
3	Thumb Encoder/Decoder	Python/C	Level 3	10-16 hrs	Encoding	★★★★☆
4	Memory Map & MMIO Notebook	Assembly + C	Level 3	10-16 hrs	MMIO, Ordering	★★★☆☆
5	Vector Table Builder	Assembly	Level 3	12-18 hrs	Boot, Exceptions	★★★☆☆
6	Interrupt-Driven UART Echo	Assembly + C	Level 4	16-24 hrs	Interrupts, MMIO	★★★★☆
7	AArch64 Exception Lab	Assembly + C	Level 4	16-24 hrs	ELs, Boot	★★★★☆
8	ABI Conformance Audit	Python	Level 3	10-14 hrs	ABI, Registers	★★★☆☆
9	Memory Ordering Litmus	C	Level 4	16-24 hrs	Barriers, Ordering	★★★★☆
10	Capstone Hardware Monitor	Assembly + C	Level 5	30-40 hrs	All	★★★★★

Project List

The following projects guide you from toolchain fluency to full bare-metal reasoning.

Project 1: Toolchain Pipeline Explorer

File: P01-toolchain-pipeline-explorer.md
Main Programming Language: ARM Assembly + C
Alternative Programming Languages: Rust, Zig
Coolness Level: Level 3 (See REFERENCE.md)
Business Potential: Level 2 (See REFERENCE.md)
Difficulty: Level 2 (See REFERENCE.md)
Knowledge Area: Toolchains, Binary Formats
Software or Tool: GNU binutils, objdump, readelf
Main Book: “Linkers and Loaders” by John R. Levine

What you will build: A repeatable pipeline that assembles, links, and inspects both Cortex-M and AArch64 binaries.

Why it teaches ARM assembly: You will observe how the same assembly intent produces different encodings and ELF layouts across execution states.

Core challenges you will face:

Target selection -> Profiles & Execution States
Section layout -> Toolchain & ELF
Binary inspection -> Instruction Encoding

Real World Outcome

You can generate and inspect a minimal ELF for both targets and explain the role of each section.

CLI output (example):

$ arm-toolchain-lab --target cortex-m0 --show-sections
ELF: sample.elf
.vectors @ 0x10000000  size 0x40
.text  @ 0x10000040  size 0x120
.bss   @ 0x20000000  size 0x80

$ arm-toolchain-lab --target aarch64 --show-sections
ELF: sample.elf
.text  @ 0x00080000  size 0x180
.rodata @ 0x00082000 size 0x40
.bss   @ 0x00100000  size 0x100

The Core Question You Are Answering

“How do assembly source files become concrete bytes in memory, and how does that differ across ARM execution states?”

This reveals the precise place where intent becomes machine reality: the assembler and linker.

Concepts You Must Understand First

Profiles & Execution States
- What does AArch64 vs Thumb mean for encoding and sections?
- Book Reference: “Computer Organization and Design: ARM Edition” - Ch. 1-2
Toolchain & ELF
- What does an ELF section represent?
- Book Reference: “Linkers and Loaders” - Ch. 1-3

Questions to Guide Your Design

Binary Layout
- How will you verify where .text and .vectors are placed?
- How will you ensure the correct entry point for each target?
Verification Strategy
- Which tool will you trust to inspect sections and symbols?
- How will you compare two binaries for encoding differences?

Thinking Exercise

Trace the Pipeline

Draw the path from source to flash and label where errors can occur.

Questions to answer:

Where does a label become an address?
What happens if two sections overlap in the linker script?
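The overlap question can be answered mechanically before the linker ever runs. This is a minimal Python sketch of the check, assuming each section is described by a hypothetical `(name, address, size)` tuple that you would fill in from your own map file or section dump.

```python
from itertools import combinations


def find_overlaps(sections):
    """Return pairs of section names whose [addr, addr + size) ranges overlap."""
    clashes = []
    for (n1, a1, s1), (n2, a2, s2) in combinations(sections, 2):
        if s1 and s2 and a1 < a2 + s2 and a2 < a1 + s1:
            clashes.append((n1, n2))
    return clashes


# Hypothetical layouts: the second one puts .text on top of the vector table.
good = [(".vectors", 0x1000_0000, 0x40),
        (".text", 0x1000_0040, 0x120),
        (".bss", 0x2000_0000, 0x80)]
bad = [(".vectors", 0x1000_0000, 0x40),
       (".text", 0x1000_0000, 0x120),
       (".bss", 0x2000_0000, 0x80)]
print(find_overlaps(good), find_overlaps(bad))
```

A linker normally rejects overlapping output sections, but a raw binary produced from a hand-edited layout carries no such protection, so running a check like this against the final section table is a cheap safeguard.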

The Interview Questions They Will Ask

“What is the difference between an object file and an ELF executable?”
“Why do embedded projects use linker scripts?”
“What is a relocation entry?”
“How do you verify your vector table is at the correct address?”
“Why would the same assembly source produce different binaries on AArch64 and Thumb?”

Hints in Layers

Hint 1: Start with just one section. Use a single .text section and verify its address before adding .vectors or .bss.

Hint 2: Compare disassembly, not source. Focus on the machine code bytes to see how encodings differ.

Hint 3: Inspect symbols. Dump the symbol table and ensure _start and reset handlers resolve correctly.

Hint 4: Validate layout. Use an ELF inspector to confirm section order and addresses.

Books That Will Help

Topic	Book	Chapter
Linking and layout	“Linkers and Loaders” by John R. Levine	Ch. 1-3
ARM ISA context	“Computer Organization and Design: ARM Edition”	Ch. 1-2

Common Pitfalls and Debugging

Problem 1: “Binary boots on one target but not the other”

Why: Wrong execution state or entry point.
Fix: Verify target triple and entry symbol.
Quick test: Inspect ELF header and entry address.

Definition of Done

ELF sections are correctly placed for both targets
Entry points match the expected reset or start address
Disassembly matches expected encoding class (Thumb vs A64)
Results are reproducible across builds

Project 2: Register & Stack Visualizer

File: P02-register-stack-visualizer.md
Main Programming Language: ARM Assembly + C
Alternative Programming Languages: Rust
Coolness Level: Level 3 (See REFERENCE.md)
Business Potential: Level 1 (See REFERENCE.md)
Difficulty: Level 2 (See REFERENCE.md)
Knowledge Area: Calling Conventions, Debugging
Software or Tool: GDB, QEMU or hardware debugger
Main Book: “ARM Assembly Language” by William Hohl

What you will build: A small program that logs register and stack changes across a chain of function calls.

Why it teaches ARM assembly: You will see the ABI rules in action as registers and stack frames evolve.

Core challenges you will face:

Register preservation -> Registers & Calling Conventions
Stack alignment -> Registers & Calling Conventions
Call/return flow -> Toolchain & ELF

Real World Outcome

You can run a trace and see a step-by-step log of register values and stack pointer changes.

$ regstack-trace --target aarch64 --depth 3
CALL depth=1  SP=0x0000000000107ff0  LR=0x0000000000080120
CALL depth=2  SP=0x0000000000107fd0  LR=0x0000000000080154
CALL depth=3  SP=0x0000000000107fb0  LR=0x0000000000080188
RETURN depth=3 SP=0x0000000000107fd0
RETURN depth=2 SP=0x0000000000107ff0
RETURN depth=1 SP=0x0000000000108000

The Core Question You Are Answering

“What does a function call actually do to registers and the stack on ARM?”

Concepts You Must Understand First

Registers & Calling Convention
- Which registers must be preserved?
- Book Reference: “ARM Assembly Language” - Ch. 3-4
Toolchain & ELF
- How do you read symbols for function addresses?
- Book Reference: “Linkers and Loaders” - Ch. 1-2

Questions to Guide Your Design

Trace Design
- Which registers will you log at entry and exit?
- How will you avoid disturbing the state you are measuring?
Output Format
- How will you ensure output is comparable across runs?
- How will you include the call depth clearly?

Thinking Exercise

Draw a Stack Frame

Sketch the stack before and after a call and label which values must be preserved.

Questions to answer:

Which register holds the return address on your target?
What is the alignment requirement at call boundaries?
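Both questions can be turned into automated checks on the trace itself. The Python sketch below assumes the log format shown in the sample output, where a RETURN line reports SP after the callee's frame has been released, and the 16-byte SP alignment that AArch64 requires. The `check_trace` name and the message strings are hypothetical.

```python
import re

LINE = re.compile(r"(CALL|RETURN)\s+depth=(\d+)\s+SP=(0x[0-9a-fA-F]+)")


def check_trace(text: str, align: int = 16):
    """Return a list of problems found in a CALL/RETURN trace (empty = OK)."""
    problems, sp_at_call = [], {}
    for lineno, line in enumerate(text.strip().splitlines(), 1):
        m = LINE.match(line.strip())
        if not m:
            problems.append(f"line {lineno}: unparsable")
            continue
        kind, depth, sp = m.group(1), int(m.group(2)), int(m.group(3), 16)
        if sp % align:
            problems.append(f"line {lineno}: SP not {align}-byte aligned")
        parent = sp_at_call.get(depth - 1)
        if kind == "CALL":
            sp_at_call[depth] = sp
            if parent is not None and sp > parent:
                problems.append(f"line {lineno}: stack grew upward")
        elif parent is not None and sp != parent:
            # After a return, SP should match what the caller logged.
            problems.append(f"line {lineno}: SP not restored")
    return problems


SAMPLE = """
CALL depth=1  SP=0x0000000000107ff0  LR=0x0000000000080120
CALL depth=2  SP=0x0000000000107fd0  LR=0x0000000000080154
CALL depth=3  SP=0x0000000000107fb0  LR=0x0000000000080188
RETURN depth=3 SP=0x0000000000107fd0
RETURN depth=2 SP=0x0000000000107ff0
RETURN depth=1 SP=0x0000000000108000
"""
print(check_trace(SAMPLE))
```

For a Cortex-M trace, pass `align=8`, which is the AAPCS requirement at public interfaces on 32-bit ARM.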

The Interview Questions They Will Ask

“What is the role of the link register?”
“What is the difference between caller-saved and callee-saved registers?”
“Why must the stack be aligned?”
“How can you verify a function preserved its registers?”
“Why do AArch64 and Cortex-M have different register usage patterns?”

Hints in Layers

Hint 1: Start with entry/exit markers. Log just SP and LR at function entry/exit before logging anything else.

Hint 2: Preserve the state. Use a scratch buffer so your logging doesn’t overwrite registers you want to observe.

Hint 3: Compare against ABI rules. Check whether callee-saved registers are unchanged across calls.

Hint 4: Validate with a debugger. Use single-step debugging to confirm the trace is accurate.

Books That Will Help

Topic	Book	Chapter
Register usage	“ARM Assembly Language” by William Hohl	Ch. 3-4
ABI basics	“Linkers and Loaders” by John R. Levine	Ch. 1

Common Pitfalls and Debugging

Problem 1: “Trace output changes when logging is enabled”

Why: Logging perturbs registers and stack.
Fix: Save/restore any registers you touch for logging.
Quick test: Compare trace with and without logging.

Definition of Done

Stack pointer changes are consistent with call depth
Callee-saved registers are preserved
Trace output is deterministic
You can explain each line in the trace

Project 3: Thumb Instruction Encoder/Decoder

File: P03-thumb-encoder-decoder.md
Main Programming Language: Python or C
Alternative Programming Languages: Rust, Go
Coolness Level: Level 4 (See REFERENCE.md)
Business Potential: Level 2 (See REFERENCE.md)
Difficulty: Level 3 (See REFERENCE.md)
Knowledge Area: Instruction Encoding
Software or Tool: Reference manuals, hex dump tools
Main Book: “ARM Assembly Language” by William Hohl

What you will build: A small tool that converts between a subset of Thumb mnemonics and machine-code bit patterns.

Why it teaches ARM assembly: Encoding forces you to understand the exact bit layout of instructions.

Core challenges you will face:

Bitfield packing -> Instruction Encoding
Immediate range limits -> Instruction Encoding
Endianness -> Instruction Encoding

Real World Outcome

You can enter a mnemonic and see its exact 16-bit encoding (and vice versa).

$ thumb-encode "MOV r0, #42"
encoding: 0b00100 000 00101010
hex: 0x202A

$ thumb-decode 0xE000
B label (unconditional branch)

The Core Question You Are Answering

“How do mnemonic fields become concrete bits, and what constraints do they impose?”

Concepts You Must Understand First

Instruction Encoding
- What are the bit fields for registers and immediates?
- Book Reference: “ARM Assembly Language” - Ch. 5
Profiles & Execution States
- Why does Thumb encoding differ from A64?
- Book Reference: “Computer Organization and Design: ARM Edition” - Ch. 1

Questions to Guide Your Design

Instruction Subset
- Which 6-8 instructions are most representative of encoding patterns?
- How will you handle invalid patterns?
Round-trip Tests
- How will you verify encode→decode→encode consistency?
- What is your error format for unsupported inputs?

Thinking Exercise

Bitfield Sketching

Draw the bit layout for a single Thumb MOV and label each field.

Questions to answer:

Which bits represent the destination register?
What is the maximum immediate size for this encoding?
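Both answers fall out of the bit layout once it is written as code. The sketch below handles only the 16-bit move-immediate form (spelled MOVS in unified syntax, since it sets flags) and uses hypothetical function names. It also shows how the halfword is stored in little-endian memory.

```python
import struct

MOVS_IMM_OPCODE = 0b00100  # bits [15:11] of the 16-bit MOVS Rd, #imm8 form


def encode_movs_imm(rd: int, imm: int) -> int:
    """Pack opcode[15:11], Rd[10:8], and imm8[7:0] into one halfword."""
    if not 0 <= rd <= 7:
        raise ValueError("Rd must be a low register (r0-r7)")
    if not 0 <= imm <= 0xFF:
        raise ValueError("immediate does not fit in 8 bits")
    return (MOVS_IMM_OPCODE << 11) | (rd << 8) | imm


def decode_movs_imm(halfword: int):
    """Return (rd, imm8), or None if the halfword is a different instruction."""
    if halfword >> 11 != MOVS_IMM_OPCODE:
        return None
    return (halfword >> 8) & 0x7, halfword & 0xFF


def to_memory_bytes(halfword: int) -> bytes:
    """Little-endian storage: the low byte comes first in memory."""
    return struct.pack("<H", halfword)


encoded = encode_movs_imm(0, 42)
print(hex(encoded), decode_movs_imm(encoded), to_memory_bytes(encoded).hex())
```

The range checks are the encoder's version of "immediate doesn't fit": three bits for Rd limit the destination to r0-r7, and eight bits for the immediate limit it to 0-255. Larger constants need a different instruction or a literal load.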

The Interview Questions They Will Ask

“Why does Thumb use 16-bit encodings?”
“What happens if an immediate doesn’t fit?”
“How does little-endian affect instruction bytes?”
“How would you detect an illegal opcode?”
“Why is decoding state-dependent?”

Hints in Layers

Hint 1: Encode one instruction first. Implement MOV or ADD before tackling branches.

Hint 2: Represent bitfields explicitly. Use a table of field positions and widths.

Hint 3: Add decode tests. Verify that known hex values decode to expected mnemonics.

Hint 4: Handle errors early. Return a clear error when a field overflows.

Books That Will Help

Topic	Book	Chapter
Instruction encoding	“ARM Assembly Language” by William Hohl	Ch. 5
ISA overview	“Computer Organization and Design: ARM Edition”	Ch. 1

Common Pitfalls and Debugging

Problem 1: “Decoder outputs nonsense”

Why: Mixing endianness or wrong execution state.
Fix: Normalize byte order and confirm Thumb context.
Quick test: Decode known reference instructions.

Definition of Done

Encodes at least 6 Thumb instructions accurately
Decodes known hex sequences correctly
Round-trip tests pass
Error handling is explicit and documented

Project 4: Memory Map & MMIO Field Notebook

File: P04-mmio-memory-map-notebook.md
Main Programming Language: ARM Assembly + C
Alternative Programming Languages: Rust
Coolness Level: Level 3 (See REFERENCE.md)
Business Potential: Level 2 (See REFERENCE.md)
Difficulty: Level 3 (See REFERENCE.md)
Knowledge Area: MMIO, Hardware Interfaces
Software or Tool: Datasheets, emulator or hardware
Main Book: “The Definitive Guide to ARM Cortex-M3/M4” by Joseph Yiu

What you will build: A documented MMIO reference and a small tool to compute register addresses and bitfields.

Why it teaches ARM assembly: You will translate raw addresses into meaningful hardware behavior.

Core challenges you will face:

Address calculation -> Memory Maps & MMIO
Atomic register semantics -> Memory Maps & MMIO
Ordering requirements -> Memory Ordering

Real World Outcome

You can query a register by name and get its address, fields, and access rules.

$ mmio-notebook lookup GPIO_OUT_SET
Base: 0xD0000000
Offset: 0x014
Address: 0xD0000014
Access: write-only
Behavior: atomically sets GPIO output bits

The Core Question You Are Answering

“What exactly happens when I read or write a peripheral address?”

Concepts You Must Understand First

Memory Maps & MMIO
- Why are peripherals mapped into address space?
- Book Reference: “The Definitive Guide to ARM Cortex-M3/M4” - Ch. 4-6
Memory Ordering
- When do you need barriers?
- Book Reference: “Computer Architecture: A Quantitative Approach” - Ch. 5

Questions to Guide Your Design

Data Model
- How will you represent base addresses, offsets, and bitfields?
- How will you encode read-only vs write-only semantics?
Validation
- How will you verify that computed addresses match the datasheet?
- What does a mismatch imply?
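One possible data model for these questions is sketched below in Python. The `Register` class, its field names, and `field_mask` are hypothetical. The GPIO_OUT_SET values are the ones from the sample lookup above and should still be checked against your datasheet.

```python
from dataclasses import dataclass

ALLOWED = {"RW": {"read", "write"}, "RO": {"read"}, "WO": {"write"}}


@dataclass(frozen=True)
class Register:
    name: str
    base: int
    offset: int
    access: str  # "RW", "RO", or "WO"

    @property
    def address(self) -> int:
        return self.base + self.offset

    def check(self, operation: str) -> None:
        """Raise if `operation` ("read" or "write") violates the access rule."""
        if operation not in ALLOWED[self.access]:
            raise PermissionError(
                f"{self.name} is {self.access}; cannot {operation}")


def field_mask(lsb: int, width: int) -> int:
    """Mask covering `width` bits starting at bit `lsb`."""
    return ((1 << width) - 1) << lsb


gpio_out_set = Register("GPIO_OUT_SET", 0xD000_0000, 0x014, "WO")
print(hex(gpio_out_set.address), hex(field_mask(4, 3)))
```

Keeping base and offset separate, rather than storing only the final address, makes a datasheet mismatch easier to diagnose: a wrong base affects every register in the peripheral, while a wrong offset affects one.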

Thinking Exercise

Map a Peripheral

Pick UART and manually compute its data and status register addresses.

Questions to answer:

Which offsets are read-only vs write-only?
Which registers must be accessed atomically?

The Interview Questions They Will Ask

“What is memory-mapped I/O and why is it used?”
“Why can a read from a register change hardware state?”
“How do you avoid read-modify-write races?”
“When do you need a memory barrier for device access?”
“What is the difference between MMIO and normal RAM?”

Hints in Layers

Hint 1: Start with one peripheral. Build the notebook for just GPIO first.

Hint 2: Encode access rules. Track read/write/clear-on-read semantics explicitly.

Hint 3: Compare to a known map. Cross-check one register with a second source (datasheet or SDK).

Hint 4: Add ordering notes. Annotate where barriers are required.

Books That Will Help

Topic	Book	Chapter
MMIO fundamentals	“The Definitive Guide to ARM Cortex-M3/M4”	Ch. 4-6
Memory systems	“Computer Architecture: A Quantitative Approach”	Ch. 5

Common Pitfalls and Debugging

Problem 1: “Writes appear to do nothing”

Why: Wrong address or missing clock enable.
Fix: Verify base address and peripheral enable sequence.
Quick test: Read a status register to confirm peripheral state.

Definition of Done

Notebook includes at least 3 peripherals with full address maps
Access rules are documented (R/W/RO/WO)
Atomic vs non-atomic registers are identified
Computed addresses match datasheet values

Project 5: Vector Table Builder

File: P05-vector-table-builder.md
Main Programming Language: ARM Assembly
Alternative Programming Languages: C, Rust
Coolness Level: Level 3 (See REFERENCE.md)
Business Potential: Level 1 (See REFERENCE.md)
Difficulty: Level 3 (See REFERENCE.md)
Knowledge Area: Boot, Exceptions
Software or Tool: Linker scripts, hex viewer
Main Book: “The Definitive Guide to ARM Cortex-M3/M4” by Joseph Yiu

What you will build: A correct Cortex-M vector table and reset sequence that reliably boots on an emulator or hardware.

Why it teaches ARM assembly: Boot is the first test of correct encoding, layout, and system understanding.

Core challenges you will face:

Vector layout -> Boot & Exceptions
Alignment -> Toolchain & ELF
Reset handler flow -> Registers & Calling Conventions

Real World Outcome

A boot image that prints a recognizable startup banner and does not hard-fault.

$ bootlab-run --board pico
[BOOT] Vector table OK
[BOOT] Reset handler entered
[BOOT] Stack initialized at 0x20041000

The Core Question You Are Answering

“What exactly must exist at address 0x00000000 for a Cortex-M to boot?”

Concepts You Must Understand First

Boot & Exceptions
- What is in the vector table?
- Book Reference: “The Definitive Guide to ARM Cortex-M3/M4” - Ch. 10-12
Toolchain & ELF
- How do you place sections at fixed addresses?
- Book Reference: “Linkers and Loaders” - Ch. 2

Questions to Guide Your Design

Vector Table Placement
- How will you ensure the table is at the correct address?
- How do you verify alignment?
Reset Handler
- What minimal initialization must occur before C code?
- How do you confirm the handler is in Thumb state?

Thinking Exercise

Vector Layout Check

Write down the first eight vector entries and their meaning.

Questions to answer:

Why does the reset handler address have bit 0 set?
Which entry corresponds to SysTick?
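The layout questions can be checked against a table built in memory. The following Python sketch packs a 16-entry table (entry 0 is the initial SP, entry 1 is Reset, entry 15 is SysTick) and validates the two words the core reads at reset. The handler addresses and the RAM range are hypothetical values chosen for the example.

```python
import struct

SYSTICK_INDEX = 15  # last of the 16 architecturally defined entries


def build_vector_table(initial_sp: int, handlers: dict) -> bytes:
    """Pack a 16-entry table. `handlers` maps entry index -> code address."""
    words = [0] * 16
    words[0] = initial_sp
    for index, addr in handlers.items():
        words[index] = addr | 1  # bit 0 set: handler executes in Thumb state
    return struct.pack("<16I", *words)


def check_vector_table(blob: bytes, ram_start: int, ram_end: int):
    """Return problems with the initial SP and reset vector (empty = OK)."""
    sp, reset = struct.unpack_from("<II", blob, 0)
    problems = []
    if not ram_start < sp <= ram_end:
        problems.append("initial SP is outside RAM")
    if sp % 8:
        problems.append("initial SP is not 8-byte aligned")
    if reset & 1 == 0:
        problems.append("reset vector has bit 0 clear (not Thumb)")
    return problems


table = build_vector_table(
    0x2004_1000, {1: 0x1000_0100, SYSTICK_INDEX: 0x1000_0180})
systick_word = struct.unpack_from("<I", table, 4 * SYSTICK_INDEX)[0]
print(len(table), hex(systick_word),
      check_vector_table(table, 0x2000_0000, 0x2004_2000))
```

The same `check_vector_table` function works on the first bytes read back from flash, which turns the "inspect the first words of flash" debugging step into a repeatable test.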

The Interview Questions They Will Ask

“What is a vector table?”
“Why must the reset vector have its Thumb bit (bit 0) set?”
“What happens if the stack pointer is wrong at reset?”
“How do you place a section at a fixed address?”
“What causes a HardFault on boot?”

Hints in Layers

Hint 1: Start with a minimal table. Only include initial SP and reset handler before adding other vectors.

Hint 2: Verify section placement. Confirm the .vectors section address in the ELF.

Hint 3: Add a known pattern. Place a recognizable constant in flash to confirm mapping.

Hint 4: Use a debugger. Check SP and PC immediately after reset.

Books That Will Help

Topic	Book	Chapter
Boot sequence	“The Definitive Guide to ARM Cortex-M3/M4”	Ch. 10-12
Linking	“Linkers and Loaders”	Ch. 2

Common Pitfalls and Debugging

Problem 1: “Boot loops or hard-faults immediately”

Why: Vector table at wrong address or invalid reset handler.
Fix: Verify the vector table section placement and alignment.
Quick test: Inspect first 8 words of flash.

Definition of Done

Vector table placed at correct flash address
Reset handler runs and outputs a banner
SP initialized to valid RAM
No HardFault on startup

Project 6: Interrupt-Driven UART Echo

File: P06-interrupt-driven-uart-echo.md
Main Programming Language: ARM Assembly + C
Alternative Programming Languages: Rust
Coolness Level: Level 4 (See REFERENCE.md)
Business Potential: Level 2 (See REFERENCE.md)
Difficulty: Level 4 (See REFERENCE.md)
Knowledge Area: Interrupts, MMIO
Software or Tool: UART, debugger
Main Book: “The Definitive Guide to ARM Cortex-M3/M4” by Joseph Yiu

What you will build: An interrupt-driven UART echo loop on Cortex-M that uses an ISR to capture bytes.

Why it teaches ARM assembly: It combines MMIO, vector tables, and interrupt context preservation.

Core challenges you will face:

ISR correctness -> Boot & Exceptions
Register preservation -> Registers & Calling Conventions
MMIO access -> Memory Maps & MMIO

Real World Outcome

Typing into the UART console returns each character via ISR-based echo.

$ uart-echo
> hello
hello
> arm
arm

The Core Question You Are Answering

“How does hardware preempt my code and return safely after servicing an interrupt?”

Concepts You Must Understand First

Boot & Exceptions
- How are interrupt handlers located?
- Book Reference: “The Definitive Guide to ARM Cortex-M3/M4” - Ch. 10-12
Memory Maps & MMIO
- Which UART registers must be read to clear interrupts?
- Book Reference: “The Definitive Guide to ARM Cortex-M3/M4” - Ch. 6

Questions to Guide Your Design

Interrupt Entry/Exit
- Which registers are automatically saved by hardware?
- Which must you preserve manually?
UART Behavior
- What clears the RX interrupt flag?
- How will you avoid buffer overruns?

Thinking Exercise

Interrupt Timeline

Draw a timeline showing when the UART interrupt fires and when it is cleared.

Questions to answer:

What happens if you do not read the data register?
How long can the ISR run before losing bytes?

The Interview Questions They Will Ask

“How does Cortex-M save state on interrupt entry?”
“Why must ISRs be short?”
“How do you acknowledge a UART interrupt?”
“What is EXC_RETURN?”
“How do you prevent nested interrupts from corrupting state?”

Hints in Layers

Hint 1: Start with polling. Verify UART by polling before switching to interrupts.

Hint 2: Keep ISR minimal. Only read the byte and store it; process later in main loop.

Hint 3: Validate vector table. Confirm the UART handler address in the vector table.

Hint 4: Use a ring buffer. Prevent data loss by buffering bytes quickly.
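The ring buffer from Hint 4 can be prototyped on the host before it is ported to assembly or C. Below is a minimal single-producer, single-consumer model in Python, where `put` stands in for the ISR and `get` for the main loop. The class name and the choice to count dropped bytes are assumptions made for illustration.

```python
class RingBuffer:
    """Single-producer (ISR), single-consumer (main loop) byte queue."""

    def __init__(self, capacity: int = 16):
        self.buf = bytearray(capacity)
        self.head = 0      # next slot the ISR writes
        self.tail = 0      # next slot the main loop reads
        self.dropped = 0   # bytes lost because the buffer was full

    def put(self, byte: int) -> bool:
        """Called from the ISR: store one byte, or count it as dropped."""
        nxt = (self.head + 1) % len(self.buf)
        if nxt == self.tail:  # full: one slot is always kept empty
            self.dropped += 1
            return False
        self.buf[self.head] = byte
        self.head = nxt
        return True

    def get(self):
        """Called from the main loop: return the oldest byte, or None."""
        if self.tail == self.head:
            return None
        byte = self.buf[self.tail]
        self.tail = (self.tail + 1) % len(self.buf)
        return byte


rb = RingBuffer(capacity=4)
results = [rb.put(b) for b in b"arm!"]
print(results, rb.dropped, bytes(iter(rb.get, None)))
```

Because only the ISR writes `head` and only the main loop writes `tail`, the design needs no lock on a single core. Keeping one slot empty is what lets "full" and "empty" be told apart without a separate counter.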

Books That Will Help

Topic	Book	Chapter
Interrupt handling	“The Definitive Guide to ARM Cortex-M3/M4”	Ch. 10-12
UART basics	“Embedded Systems” by Jonathan Valvano	Ch. 7

Common Pitfalls and Debugging

Problem 1: “ISR triggers once, then never again”

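Why: Interrupt flag not cleared.
Fix: Clear the interrupt source the way your UART requires; on many parts this means reading the data register, on others writing an interrupt-clear register.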
Quick test: Inspect interrupt status register after ISR.

Definition of Done

UART echo works with interrupts enabled
ISR preserves required registers
No data loss at typical baud rates
Interrupt flags are cleared correctly

Project 7: AArch64 Exception Level Lab

File: P07-aarch64-exception-level-lab.md
Main Programming Language: ARM Assembly + C
Alternative Programming Languages: Rust
Coolness Level: Level 4 (See REFERENCE.md)
Business Potential: Level 3 (See REFERENCE.md)
Difficulty: Level 4 (See REFERENCE.md)
Knowledge Area: Privilege Levels, Boot
Software or Tool: Raspberry Pi or QEMU
Main Book: “Armv8-A Architecture Reference Manual” (spec)

What you will build: A minimal lab that reports current exception level and transitions to EL1.

Why it teaches ARM assembly: It exposes the privilege model and system register setup for A-profile systems.

Core challenges you will face:

Execution state handling -> Profiles & Execution States
Exception model -> Boot & Exceptions
System register setup -> Toolchain & ELF

Real World Outcome

You can boot and print a trace of exception level transitions.

$ aarch64-el-lab
EL at entry: EL2
Configured EL1 state
Dropped to EL1
EL now: EL1
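
As a rough illustration of the arithmetic behind a trace like this, the Python sketch below models the bit fields involved: CurrentEL holds the exception level in bits [3:2], HCR_EL2.RW (bit 31) selects AArch64 for EL1, and SPSR_EL2 carries the DAIF masks plus the target mode. The helper names (current_el, spsr_for_el1h) and the sample register values are assumptions for illustration; the real lab would access these registers with MRS/MSR in assembly.

```python
def current_el(currentel):
    """Extract the EL field, bits [3:2], from a raw CurrentEL value."""
    return (currentel >> 2) & 0b11


# SPSR_EL2 fields used when returning to AArch64 EL1.
SPSR_D = 1 << 9        # mask debug exceptions
SPSR_A = 1 << 8        # mask SError
SPSR_I = 1 << 7        # mask IRQ
SPSR_F = 1 << 6        # mask FIQ
SPSR_M_EL1T = 0b0100   # EL1 using SP_EL0
SPSR_M_EL1H = 0b0101   # EL1 using SP_EL1

HCR_EL2_RW = 1 << 31   # EL1 executes in AArch64 when set


def spsr_for_el1h(mask_interrupts=True):
    """Build an SPSR_EL2 value that makes ERET land in EL1h."""
    daif = (SPSR_D | SPSR_A | SPSR_I | SPSR_F) if mask_interrupts else 0
    return daif | SPSR_M_EL1H


print(f"EL at entry: EL{current_el(0x8)}")  # a CurrentEL read of 0x8 means EL2
print(f"HCR_EL2  = {HCR_EL2_RW:#x}")        # 0x80000000
print(f"SPSR_EL2 = {spsr_for_el1h():#x}")   # 0x3c5
print(f"EL now: EL{current_el(0x4)}")       # a CurrentEL read of 0x4 means EL1
```

Computing the constants on the host first makes it easier to spot a wrong mode field before you are debugging a silent hang after ERET.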

The Core Question You Are Answering

“What privilege level am I running in, and how do I safely change it?”

Concepts You Must Understand First

Profiles & Execution States
- What is AArch64, and why does it matter here?
- Book Reference: “Computer Organization and Design: ARM Edition” - Ch. 1
Boot & Exceptions
- What do EL0–EL3 represent?
- Book Reference: “Armv8-A Architecture Reference Manual” - Exception model chapters

Questions to Guide Your Design

State Transition
- Which registers control target EL and state?
- How will you verify the transition succeeded?
Diagnostics
- How will you report errors if the transition fails?
- What is your fallback if the system traps unexpectedly?

Thinking Exercise

Exception Path

Draw the path from EL2 to EL1 and mark which register changes are required.

Questions to answer:

Which exception vector is used at EL2?
How do you prevent re-entry into EL2?

The Interview Questions They Will Ask

“What are the four exception levels in AArch64?”
“Why do some platforms boot at EL2?”
“What happens if you execute privileged instructions at EL0?”
“How do you set the stack pointer for EL1?”
“Why is exception level management critical for OS bring-up?”

Hints in Layers

Hint 1: Read the current EL. Start by reporting the current exception level without changing it.

Hint 2: Use a minimal transition. Only set the required registers for EL1 and keep the rest default.

Hint 3: Confirm with a trace. Print the EL before and after the transition.

Hint 4: Watch for traps. If you land in an exception handler unexpectedly, log the syndrome register.
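
For Hint 4, a logged syndrome value is only useful once it is split into fields. The sketch below shows one way to decode an ESR_ELx value on the host: EC is bits [31:26], IL is bit 25, and ISS is bits [24:0]. The function names and the partial EC table are illustrative assumptions; consult the architecture manual for the full list of exception classes.

```python
# Partial table of exception classes (EC); not exhaustive.
EC_NAMES = {
    0x00: "Unknown reason",
    0x15: "SVC from AArch64",
    0x16: "HVC from AArch64",
    0x17: "SMC from AArch64",
    0x18: "Trapped MSR/MRS/system instruction",
    0x20: "Instruction Abort from a lower EL",
    0x21: "Instruction Abort, same EL",
    0x22: "PC alignment fault",
    0x24: "Data Abort from a lower EL",
    0x25: "Data Abort, same EL",
    0x26: "SP alignment fault",
}


def decode_esr(esr):
    """Split a raw ESR_ELx value into (EC, IL, ISS)."""
    ec = (esr >> 26) & 0x3F
    il = (esr >> 25) & 0x1
    iss = esr & 0x1FFFFFF
    return ec, il, iss


def describe_esr(esr):
    """Format a one-line, log-friendly description of a syndrome value."""
    ec, il, iss = decode_esr(esr)
    name = EC_NAMES.get(ec, "not in this table")
    return f"ESR={esr:#010x} EC={ec:#04x} ({name}) IL={il} ISS={iss:#x}"


print(describe_esr(0x56000000))
print(describe_esr(0x96000045))
```

The first sample value decodes as an SVC from AArch64 and the second as a same-EL data abort, which is the kind of distinction that tells you whether the transition itself or the code after it went wrong.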

Books That Will Help

Topic	Book	Chapter
Exception levels	“Armv8-A Architecture Reference Manual”	Exception model
AArch64 boot	“ARM 64-Bit Assembly Language” by Larry Pyeatt	Ch. 9

Common Pitfalls and Debugging

Problem 1: “System hangs immediately after transition”

Why: Stack pointer not set for the target EL.
Fix: Initialize SP_EL1 (for example with msr sp_el1, x0 while still at EL2) before the ERET that performs the transition.
Quick test: Print EL and SP registers before and after.

Definition of Done

Current EL is correctly reported
Transition to EL1 succeeds
No unexpected exceptions occur
Output log is deterministic

Project 8: ABI Conformance Audit

File: P08-abi-conformance-audit.md
Main Programming Language: Python
Alternative Programming Languages: Rust, Go
Coolness Level: Level 3 (See REFERENCE.md)
Business Potential: Level 3 (See REFERENCE.md)
Difficulty: Level 3 (See REFERENCE.md)
Knowledge Area: ABI, Disassembly
Software or Tool: objdump, readelf
Main Book: “ARM Assembly Language” by William Hohl

What you will build: A static checker that inspects compiled functions for ABI compliance.

Why it teaches ARM assembly: You will read real instruction sequences and verify calling convention rules.

Core challenges you will face:

Register classification -> Registers & Calling Conventions
Stack alignment -> Registers & Calling Conventions
Function boundaries -> Toolchain & ELF

Real World Outcome

You can analyze a binary and flag ABI violations with clear explanations.

$ abi-audit sample.elf
OK: function foo preserves x19-x20
WARN: function bar modifies x19 without saving
OK: stack aligned to 16 bytes at call sites

The Core Question You Are Answering

“Does this compiled code obey the ABI rules required for safe interop?”

Concepts You Must Understand First

Registers & Calling Conventions
- Which registers are callee-saved in AAPCS64?
- Book Reference: “ARM Assembly Language” - Ch. 3-4
Toolchain & ELF
- How do you identify function boundaries in a binary?
- Book Reference: “Linkers and Loaders” - Ch. 1

Questions to Guide Your Design

Function Identification
- Will you use symbol tables or pattern heuristics?
- How will you handle stripped binaries?
Rule Enforcement
- Which ABI rules will you enforce first?
- How will you report violations?

Thinking Exercise

ABI Checklist

Write a checklist of rules you expect every function to obey.

Questions to answer:

Which registers must a callee preserve across a call?
How can you detect stack misalignment from disassembly?
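
One possible starting point for the stack-alignment question is to look at every instruction that changes SP and check that the adjustment is a multiple of 16. The Python sketch below does this over plain instruction strings; the function name misaligned_sp_updates, the regular expressions, and the sample listing are assumptions for illustration, and the check is a heuristic that ignores SP updates made through other registers.

```python
import re

# sub/add on SP with an immediate, optionally shifted left by 12.
SP_ADJUST = re.compile(
    r"^(?:sub|add)\s+sp,\s*sp,\s*#(0x[0-9a-fA-F]+|\d+)(,\s*lsl\s+#12)?"
)
# Pre-indexed ([sp, #-32]!) or post-indexed ([sp], #32) writeback on SP.
SP_WRITEBACK = re.compile(r"\[sp,\s*#(-?\d+)\]!|\[sp\],\s*#(-?\d+)")


def misaligned_sp_updates(insns):
    """Return the instructions that move SP by a non-multiple of 16."""
    bad = []
    for insn in insns:
        text = insn.strip()
        m = SP_ADJUST.match(text)
        if m:
            amount = int(m.group(1), 0)
            if m.group(2):
                amount <<= 12
        else:
            m = SP_WRITEBACK.search(text)
            if not m:
                continue  # this instruction does not adjust SP
            amount = int(m.group(1) or m.group(2))
        if amount % 16:
            bad.append(text)
    return bad


sample = [
    "stp x29, x30, [sp, #-32]!",
    "sub sp, sp, #0x30",
    "sub sp, sp, #24",            # suspicious: 24 is not a multiple of 16
    "str x19, [sp, #-8]!",        # suspicious: 8-byte push
    "stp x19, x20, [sp, #16]",    # plain store, SP unchanged
    "ldp x29, x30, [sp], #32",
]
for text in misaligned_sp_updates(sample):
    print("WARN: SP adjusted by non-multiple of 16:", text)
```

Checking each adjustment separately is stricter than checking alignment only at call sites, so treat a hit as a prompt to inspect the function rather than as proof of a violation.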

The Interview Questions They Will Ask

“What is the AAPCS64 and why does it matter?”
“Which registers are callee-saved?”
“How can you detect a missing stack frame?”
“Why can ABI violations cause random crashes?”
“What does it mean for a function to be a leaf function?”

Hints in Layers

Hint 1: Use symbol tables first. Start with binaries that contain symbols.

Hint 2: Track prologue/epilogue patterns. Look for standard stack frame setup and teardown.

Hint 3: Flag only high-confidence violations. Avoid false positives by requiring clear evidence.

Hint 4: Provide explanations. Show which rule was violated and where.
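
The hints above can be sketched as a small rule over already-disassembled text. The Python example below flags a write to x19-x28 (callee-saved under AAPCS64) when no earlier store of that register to the stack was seen. Everything about the input is an assumption: the instructions are hand-written strings rather than real objdump output, the name audit and the mnemonic lists are illustrative, and the "first operand is the destination" rule is a heuristic that a real tool would need to refine.

```python
import re

CALLEE_SAVED = {f"x{n}" for n in range(19, 29)}  # x19-x28 under AAPCS64
STORES = {"str", "stp", "stur"}
NO_DEST = STORES | {"cmp", "cmn", "tst", "b", "bl", "br", "blr", "ret",
                    "cbz", "cbnz", "tbz", "tbnz", "nop"}
REG = re.compile(r"\b[xw](\d+)\b")


def regs(text):
    """Return the registers named in text, normalizing wN to xN."""
    return [f"x{num}" for num in REG.findall(text)]


def audit(name, insns):
    """Flag writes to callee-saved registers not yet spilled to the stack."""
    saved, flagged, findings = set(), set(), []
    for insn in insns:
        parts = insn.split(None, 1)
        if not parts:
            continue
        mnemonic = parts[0]
        operands = parts[1] if len(parts) > 1 else ""
        if mnemonic in STORES and "[sp" in operands:
            # Registers before the '[' are the values being stored.
            saved.update(regs(operands.split("[")[0]))
            continue
        if mnemonic in NO_DEST or mnemonic.startswith("b."):
            continue
        # ldp writes its first two operands; most others write the first.
        ndest = 2 if mnemonic == "ldp" else 1
        for reg in regs(operands.split("[")[0])[:ndest]:
            if reg in CALLEE_SAVED and reg not in saved and reg not in flagged:
                flagged.add(reg)
                findings.append(
                    f"WARN: function {name} modifies {reg} without saving")
    return findings or [f"OK: function {name} preserves callee-saved registers"]


foo = ["stp x29, x30, [sp, #-32]!", "stp x19, x20, [sp, #16]",
       "mov x19, x0", "ldp x19, x20, [sp, #16]",
       "ldp x29, x30, [sp], #32", "ret"]
bar = ["mov x19, x0", "add x0, x19, #1", "ret"]
for name, body in (("foo", foo), ("bar", bar)):
    print("\n".join(audit(name, body)))
```

In line with Hint 3, the rule only reports a write when it has seen no matching store at all; it does not try to prove that the saved value is restored on every path.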

Books That Will Help

Topic	Book	Chapter
Calling conventions	“ARM Assembly Language” by William Hohl	Ch. 3-4
Binary structure	“Linkers and Loaders”	Ch. 1

Common Pitfalls and Debugging

Problem 1: “Too many false positives”

Why: Incomplete function boundary detection.
Fix: Require symbol support or implement a stronger heuristic.
Quick test: Validate against a known-good binary.

Definition of Done

Detects at least two categories of ABI violations
Produces clear, deterministic reports
Handles binaries with symbols correctly
Documents limitations (e.g., stripped binaries)

Project 9: Memory Ordering Litmus Tests

File: P09-memory-ordering-litmus.md
Main Programming Language: C
Alternative Programming Languages: Rust
Coolness Level: Level 4 (See REFERENCE.md)
Business Potential: Level 2 (See REFERENCE.md)
Difficulty: Level 4 (See REFERENCE.md)
Knowledge Area: Concurrency, Memory Ordering
Software or Tool: QEMU or multi-core hardware
Main Book: “Computer Architecture: A Quantitative Approach” by Hennessy & Patterson

What you will build: A small suite of litmus tests that demonstrate ordering anomalies and barrier fixes.

Why it teaches ARM assembly: You will see how DMB/DSB/ISB change observable behavior.

Core challenges you will face:

Reordering effects -> Memory Ordering
Barrier placement -> Memory Ordering
Deterministic reporting -> Toolchain & ELF

Real World Outcome

You can run a test and observe reordering cases disappear when barriers are added.

$ ordering-test --case store-buffering
Observed outcomes (100000 runs):
  0,0  -> 42
  0,1  -> 55231
  1,0  -> 44727
  1,1  -> 0
After adding DMB:
  0,0  -> 0

The Core Question You Are Answering

“Why does correct-looking code fail on weakly ordered systems, and how do barriers fix it?”

Concepts You Must Understand First

Memory Ordering
- What do DMB/DSB/ISB guarantee?
- Book Reference: “Computer Architecture: A Quantitative Approach” - Ch. 5
Toolchain & ELF
- How do you ensure consistent builds for reproducible tests?
- Book Reference: “Linkers and Loaders” - Ch. 1

Questions to Guide Your Design

Test Design
- Which litmus cases are representative (store buffering, load buffering)?
- How will you capture and count outcomes?
Barrier Placement
- Where exactly should barriers be inserted?
- How will you show the difference before/after?

Thinking Exercise

Reordering Scenario

Draw two cores writing and reading shared variables without barriers.

Questions to answer:

Which outcome is unexpected?
What would a barrier change?
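
To check your drawing, you can enumerate the store-buffering case exhaustively on the host. The Python sketch below uses a deliberately simplified model: without a barrier, each thread's store and its later load of a different variable may take effect in either order; with a barrier between them, program order is kept. The names T0, T1, run, and outcomes are illustrative, and this is a toy model for reasoning, not a description of how any particular Arm core behaves.

```python
from itertools import permutations

# Store-buffering litmus test. Each op is (thread, kind, variable).
# Thread 0: x = 1; r0 = y        Thread 1: y = 1; r1 = x
T0 = [(0, "store", "x"), (0, "load", "y")]
T1 = [(1, "store", "y"), (1, "load", "x")]


def run(schedule):
    """Execute the ops in the given global order and return (r0, r1)."""
    mem = {"x": 0, "y": 0}
    result = {}
    for thread, kind, var in schedule:
        if kind == "store":
            mem[var] = 1
        else:
            result[thread] = mem[var]
    return result[0], result[1]


def outcomes(allow_store_load_reorder):
    """Collect every (r0, r1) reachable under the chosen model."""
    seen = set()
    for schedule in permutations(T0 + T1):
        if not allow_store_load_reorder:
            # Barrier model: each thread's store must precede its load.
            if any(schedule.index(t[0]) > schedule.index(t[1])
                   for t in (T0, T1)):
                continue
        seen.add(run(schedule))
    return sorted(seen)


print("no barrier:", outcomes(True))
print("with DMB:  ", outcomes(False))
```

In this model the (0, 0) outcome appears only when the store and load are allowed to swap, which matches the outcome the litmus test is designed to expose.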

The Interview Questions They Will Ask

“What does a memory barrier guarantee?”
“Why can two cores see writes in different orders?”
“What is the difference between DMB and DSB?”
“Why are litmus tests useful?”
“How do you make concurrency tests deterministic?”

Hints in Layers

Hint 1: Start with a known litmus test. Implement store-buffering first.

Hint 2: Fix the measurement. Use a fixed number of iterations and report counts deterministically.

Hint 3: Add one barrier at a time. Show the smallest change that eliminates the anomaly.

Hint 4: Confirm on two targets. Run in QEMU and on hardware if possible.

Books That Will Help

Topic	Book	Chapter
Memory ordering	“Computer Architecture: A Quantitative Approach”	Ch. 5
Concurrency	“Operating Systems: Three Easy Pieces”	Ch. 28

Common Pitfalls and Debugging

Problem 1: “Results are inconsistent between runs”

Why: Non-deterministic scheduling and timing.
Fix: Fix iteration counts and seed any randomness.
Quick test: Run with a single fixed CPU frequency.

Definition of Done

At least two litmus tests implemented
Barrier effects are observable and documented
Output format and iteration counts are deterministic across runs (outcome counts may vary)
Explanation of results is clear

Project 10: Capstone Hardware Monitor

File: P10-capstone-hardware-monitor.md
Main Programming Language: ARM Assembly + C
Alternative Programming Languages: Rust
Coolness Level: Level 5 (See REFERENCE.md)
Business Potential: Level 4 (See REFERENCE.md)
Difficulty: Level 5 (See REFERENCE.md)
Knowledge Area: Systems Integration
Software or Tool: UART, timers, debugger
Main Book: “The Definitive Guide to ARM Cortex-M3/M4” by Joseph Yiu

What you will build: A bare-metal hardware monitor that boots, initializes UART, handles interrupts, and exposes a minimal command interface.

Why it teaches ARM assembly: It forces you to combine every concept into a working system with observable outputs.

Core challenges you will face:

Boot + Vector table -> Boot & Exceptions
MMIO control -> Memory Maps & MMIO
Toolchain layout -> Toolchain & ELF

Real World Outcome

A bootable monitor that accepts simple commands and reports hardware state.

$ monitor
[BOOT] OK
> regs
SP=0x20041000 LR=0x10000124
> mmio GPIO_OUT
0x00000020
> help
commands: regs, mmio, echo, reboot

The Core Question You Are Answering

“Can I build a minimal, reliable ARM system from scratch without an OS?”

Concepts You Must Understand First

Boot & Exceptions
- How does the system start and handle interrupts?
- Book Reference: “The Definitive Guide to ARM Cortex-M3/M4” - Ch. 10-12
Memory Maps & MMIO
- How do you read and write peripheral registers safely?
- Book Reference: “The Definitive Guide to ARM Cortex-M3/M4” - Ch. 4-6
Toolchain & ELF
- How do you place code and data precisely?
- Book Reference: “Linkers and Loaders” - Ch. 2

Questions to Guide Your Design

Command Interface
- How will you parse commands without dynamic memory?
- How will you handle unknown commands?
System Initialization
- What is the minimum init sequence for UART and timers?
- How will you confirm correct clock setup?

Thinking Exercise

Boot-to-Command Flow

Draw the exact sequence from reset to first prompt.

Questions to answer:

Which step depends on MMIO ordering?
Which step depends on vector table correctness?

The Interview Questions They Will Ask

“How do you bring up a bare-metal system?”
“What are the first three registers you check on boot?”
“How do you handle input without an OS?”
“How do you debug a boot hang?”
“What does a minimal monitor need to be useful?”

Hints in Layers

Hint 1: Start with a boot banner. Only print a startup line before adding commands.

Hint 2: Add one command at a time. Implement regs, then mmio, then echo.

Hint 3: Keep state explicit. Store command buffers in a fixed memory region.

Hint 4: Validate with a debugger. Break at reset and confirm SP and PC.
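
Hints 2 and 3 can be prototyped on a host before any target code exists. The Python sketch below models a fixed-capacity line buffer and a static command table; MAX_LINE, LineBuffer, dispatch, and the two commands are hypothetical names chosen for illustration. Python itself allocates memory behind the scenes, so this only models the structure you would later write with a static array and a table of function pointers.

```python
MAX_LINE = 16  # hypothetical fixed capacity, chosen for illustration


class LineBuffer:
    """Accumulates bytes into a preallocated buffer until CR or LF."""

    def __init__(self):
        self.buf = bytearray(MAX_LINE)  # allocated once, never resized
        self.length = 0
        self.overflow = False

    def feed(self, byte):
        """Return (line, overflowed) at end of line, otherwise None."""
        if byte not in (0x0A, 0x0D):
            if self.length < MAX_LINE:
                self.buf[self.length] = byte
                self.length += 1
            else:
                self.overflow = True  # drop the byte but remember the error
            return None
        done = (self.buf[:self.length].decode("ascii", "replace"),
                self.overflow)
        self.length, self.overflow = 0, False
        return done


def cmd_help(args):
    return "commands: " + ", ".join(COMMANDS)


def cmd_echo(args):
    return " ".join(args)


COMMANDS = {"help": cmd_help, "echo": cmd_echo}  # static dispatch table


def dispatch(line, overflowed=False):
    """Look up the first word in the table and run its handler."""
    if overflowed:
        return "error: line too long"
    words = line.split()
    if not words:
        return ""
    handler = COMMANDS.get(words[0])
    if handler is None:
        return f"unknown command: {words[0]}"
    return handler(words[1:])


lb = LineBuffer()
for byte in b"echo hi\rbogus\r":
    done = lb.feed(byte)
    if done is not None:
        print(dispatch(*done))
```

Adding a command then means adding one table entry, which keeps the Hint 2 workflow of one command at a time cheap to follow.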

Books That Will Help

Topic	Book	Chapter
Bare-metal bring-up	“The Definitive Guide to ARM Cortex-M3/M4”	Ch. 10-12
Memory maps	“The Definitive Guide to ARM Cortex-M3/M4”	Ch. 4-6
Linking	“Linkers and Loaders”	Ch. 2

Common Pitfalls and Debugging

Problem 1: “Boot banner never appears”

Why: Vector table misplacement or UART not initialized.
Fix: Verify reset handler and UART registers.
Quick test: Toggle a GPIO pin in the reset handler.

Definition of Done

Boot banner prints reliably
UART commands operate deterministically
MMIO reads/writes are correct and documented
System survives repeated resets

Project Comparison Table

Project	Difficulty	Time	Depth of Understanding	Fun Factor
1. Toolchain Pipeline Explorer	Level 2	Weekend	Medium	★★★☆☆
2. Register & Stack Visualizer	Level 2	Weekend	Medium	★★★☆☆
3. Thumb Encoder/Decoder	Level 3	1-2 weeks	High	★★★★☆
4. Memory Map & MMIO Notebook	Level 3	1-2 weeks	High	★★★☆☆
5. Vector Table Builder	Level 3	1-2 weeks	High	★★★☆☆
6. Interrupt-Driven UART Echo	Level 4	2-3 weeks	High	★★★★☆
7. AArch64 Exception Lab	Level 4	2-3 weeks	High	★★★★☆
8. ABI Conformance Audit	Level 3	1-2 weeks	Medium	★★★☆☆
9. Memory Ordering Litmus	Level 4	2-3 weeks	High	★★★★☆
10. Capstone Hardware Monitor	Level 5	1 month	Very High	★★★★★

Recommendation

If you are new to ARM assembly: Start with Project 1 to demystify the toolchain and binary layout.
If you are an embedded developer: Start with Project 4 to anchor your understanding in MMIO and memory maps.
If you want OS-level systems skills: Start with Project 7 to master exception levels and privilege transitions.

Final Overall Project: Unified ARM Bring-Up Suite

The Goal: Combine Projects 1, 5, 6, and 10 into a unified bring-up suite that can boot, initialize peripherals, and expose a diagnostic console.

Build a validated toolchain pipeline (Project 1).
Generate a correct vector table and reset flow (Project 5).
Add interrupt-driven UART and timer events (Project 6).
Expand into a minimal monitor (Project 10).

Success Criteria: The system boots reliably, prints a banner, and responds to at least three commands without faults.

From Learning to Production: What Is Next

Your Project	Production Equivalent	Gap to Fill
Toolchain Pipeline Explorer	Firmware build system	CI reproducibility, vendor integration
Vector Table Builder	Bootloader	Secure boot, update strategy
Interrupt-Driven UART	Device driver	Power management, robustness testing
Capstone Monitor	Embedded firmware shell	Security hardening, diagnostics

Summary

This learning path covers ARM assembly through 10 hands-on projects.

#	Project Name	Main Language	Difficulty	Time Estimate
1	Toolchain Pipeline Explorer	Assembly + C	Level 2	6-10 hrs
2	Register & Stack Visualizer	Assembly + C	Level 2	8-12 hrs
3	Thumb Encoder/Decoder	Python/C	Level 3	10-16 hrs
4	Memory Map & MMIO Notebook	Assembly + C	Level 3	10-16 hrs
5	Vector Table Builder	Assembly	Level 3	12-18 hrs
6	Interrupt-Driven UART Echo	Assembly + C	Level 4	16-24 hrs
7	AArch64 Exception Lab	Assembly + C	Level 4	16-24 hrs
8	ABI Conformance Audit	Python	Level 3	10-14 hrs
9	Memory Ordering Litmus	C	Level 4	16-24 hrs
10	Capstone Hardware Monitor	Assembly + C	Level 5	30-40 hrs

Expected Outcomes

You can inspect an ELF and explain its layout.
You can reason about register state across calls and interrupts.
You can build a minimal ARM system from reset to I/O.

Additional Resources and References

Standards and Specifications

Arm A-profile architecture overview (execution states).
AAPCS64 procedure call standard repository.
ELF format and ABI overview.

Industry Analysis

Arm SEC filing on cumulative chip shipments (2024).
Reuters on Arm’s smartphone CPU share.

Books

“ARM Assembly Language” by William Hohl - practical assembly patterns.
“The Definitive Guide to ARM Cortex-M3/M4” by Joseph Yiu - boot and interrupts.
“Linkers and Loaders” by John R. Levine - ELF and linking.