Project 4: Raspberry Pi Bare Metal — Hello World
Build a bare metal kernel for the Raspberry Pi that boots without any operating system, initializes the UART, and prints “Hello World” to the serial console—demonstrating full control over a modern ARM system.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Advanced |
| Time Estimate | 2 weeks |
| Language | C + ARM Assembly |
| Platform | Raspberry Pi 3/4 |
| Prerequisites | Projects 1-3, ARM basics |
| Key Topics | ARM Cortex-A, boot process, linker scripts, UART, BCM2837/BCM2711 |
1. Learning Objectives
By completing this project, you will:
- Understand the Raspberry Pi boot process — How the GPU loads firmware and your kernel
- Write ARM64 (AArch64) assembly — Startup code that initializes the CPU
- Create linker scripts — Control memory layout for bare metal code
- Configure memory-mapped peripherals — Access UART through physical addresses
- Handle multi-core startup — Park secondary cores while core 0 runs your code
- Debug without an OS — Use UART and LED for diagnostics
- Cross-compile for ARM — Build on your development machine, run on Pi
The Core Question: How does code start running on a computer before any operating system exists?
2. Theoretical Foundation
2.1 Raspberry Pi Boot Process
The Raspberry Pi has one of the most unusual boot processes among embedded systems. Unlike x86 PCs where the CPU starts first, the Pi’s GPU runs before the ARM CPU even powers on.
Raspberry Pi Boot Sequence:
Power On
│
▼
┌─────────────────────────────────────────────────────────────────────┐
│ GPU DOMAIN │
│ │
│ ┌─────────────────┐ │
│ │ 1. GPU ROM │ Hardcoded in silicon │
│ │ (First Stage)│ Loads bootcode.bin from SD │
│ └────────┬────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────┐ │
│ │ 2. bootcode.bin │ GPU firmware │
│ │ (Second Stage)│ Enables SDRAM │
│ └────────┬────────┘ Loads start.elf │
│ │ │
│ ▼ │
│ ┌─────────────────┐ │
│ │ 3. start.elf │ Main GPU firmware │
│ │ (Third Stage)│ Reads config.txt │
│ └────────┬────────┘ Loads kernel8.img to 0x80000 │
│ │ Releases ARM core 0 from reset │
└───────────┼─────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────┐
│ ARM DOMAIN │
│ │
│ ┌─────────────────┐ │
│ │ 4. kernel8.img │ YOUR CODE STARTS HERE! │
│ │ @ 0x80000 │ ARM64 execution begins │
│ └────────┬────────┘ All 4 cores wake up simultaneously │
│ │ │
│ ▼ │
│ ┌─────────────────┐ │
│ │ 5. Startup Code │ Park cores 1-3 in spinloop │
│ │ (Assembly) │ Set up stack for core 0 │
│ └────────┬────────┘ Clear BSS section │
│ │ Jump to C main() │
│ ▼ │
│ ┌─────────────────┐ │
│ │ 6. C Kernel │ Initialize UART │
│ │ main() │ Print "Hello World!" │
│ └─────────────────┘ Your application logic │
│ │
└─────────────────────────────────────────────────────────────────────┘
Key insight: You don’t write a bootloader for the Pi—the GPU handles that. Your kernel8.img is loaded to address 0x80000 and ARM execution begins there. The “8” in kernel8.img indicates 64-bit ARM (AArch64).
Files required on SD card:
- bootcode.bin: GPU second-stage bootloader, loaded by the boot ROM (from Raspberry Pi Foundation)
- start.elf: GPU firmware (from Raspberry Pi Foundation)
- fixup.dat: Memory configuration (from Raspberry Pi Foundation)
- kernel8.img: YOUR kernel binary
- config.txt: (Optional) Boot configuration
2.2 ARM Cortex-A Architecture
The Raspberry Pi 3 uses the Cortex-A53 (ARMv8-A) and Pi 4 uses Cortex-A72 (ARMv8-A). Both support 64-bit (AArch64) execution.
ARM Cortex-A53/A72 Register Set (AArch64):
General Purpose Registers (64-bit):
┌────────────────────────────────────────────────────────────┐
│ X0-X7 │ Arguments/Results (caller-saved) │
│ X8 │ Indirect result location register │
│ X9-X15 │ Temporary (caller-saved) │
│ X16-X17 │ Intra-procedure-call scratch (IP0, IP1) │
│ X18 │ Platform register (reserved) │
│ X19-X28 │ Callee-saved registers │
│ X29 │ Frame pointer (FP) │
│ X30 │ Link register (LR) - return address │
│ SP │ Stack pointer (per exception level) │
│ PC │ Program counter (not directly accessible) │
└────────────────────────────────────────────────────────────┘
W0-W30 = Lower 32 bits of X0-X30
Special Registers:
┌────────────────────────────────────────────────────────────┐
│ MPIDR_EL1 │ Multiprocessor affinity - which core am I? │
│ SCTLR_EL1 │ System control register │
│ CurrentEL │ Current exception level │
│ VBAR_EL1 │ Vector base address (exception handlers) │
└────────────────────────────────────────────────────────────┘
Exception Levels (EL):
┌─────────────────────────────────────────────────────────────┐
│ EL3 │ Secure Monitor (TrustZone) - Highest privilege │
├───────┼─────────────────────────────────────────────────────┤
│ EL2 │ Hypervisor - Virtualization │
├───────┼─────────────────────────────────────────────────────┤
│ EL1 │ Kernel/OS - Where bare metal code typically runs │
├───────┼─────────────────────────────────────────────────────┤
│ EL0 │ User mode - Applications (lowest privilege) │
└───────┴─────────────────────────────────────────────────────┘
On Pi boot, your kernel is entered at EL2 (the firmware’s ARM stub runs at EL3 and drops to EL2 before jumping to your code)
What you need to know for this project:
- Core identification: Read MPIDR_EL1 to get core ID (bits 0-1)
- Exception level: Kernel code runs at EL1
- Stack setup: Each core needs its own stack, pointed to by SP
- Calling convention: First 8 arguments in X0-X7, return in X0
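The first two points come down to extracting small bit fields from system registers. A minimal sketch, assuming the raw register values have already been read with `mrs` in assembly; the helper names are illustrative and not part of any SDK:

```c
#include <stdint.h>

/* Assumption: on the Pi 3/4 the core number is the low two bits of
 * MPIDR_EL1 (affinity level 0), as described in the list above. */
static inline unsigned core_id_from_mpidr(uint64_t mpidr)
{
    return (unsigned)(mpidr & 0x3u);
}

/* CurrentEL reports the exception level in bits [3:2]. */
static inline unsigned el_from_currentel(uint64_t currentel)
{
    return (unsigned)((currentel >> 2) & 0x3u);
}
```

Keeping the masking in plain functions like these makes it possible to check the bit positions on a development machine before moving the logic into boot.S.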
2.3 Linker Scripts Deep Dive
A linker script tells the linker how to arrange your code in memory. For bare metal, this is critical—there’s no OS to relocate your code.
Linker Script Structure:
┌─────────────────────────────────────────────────────────────────────┐
│ ENTRY(_start) <- Execution entry point │
│ │
│ SECTIONS { <- Memory section definitions │
│ . = 0x80000; <- Start address (where GPU loads) │
│ │
│ .text.boot : { <- Boot code MUST be first │
│ *(.text.boot) <- Wildcard: all .text.boot inputs │
│ } │
│ │
│ .text : { <- Main code section │
│ *(.text) │
│ } │
│ │
│ .rodata : { <- Read-only data (strings, etc.) │
│ *(.rodata) │
│ } │
│ │
│ .data : { <- Initialized global variables │
│ *(.data) │
│ } │
│ │
│ .bss : { <- Uninitialized globals (zeroed) │
│ __bss_start = .; <- Symbol for BSS start │
│ *(.bss) │
│ __bss_end = .; <- Symbol for BSS end │
│ } │
│ } │
└─────────────────────────────────────────────────────────────────────┘
Memory Layout After Linking:
0x00000000 ─────────────────────────────────┐
│ Firmware stub at 0x0, then the │
│ core 0 stack (grows down from 0x80000) │
0x00080000 ─────────────────────────────────┤
│ .text.boot (entry point) │ ◄── _start
│ .text (C code) │
│ .rodata (strings) │
│ .data (initialized vars) │
│ .bss (zero-initialized) │
0x000XXXXX ─────────────────────────────────┤
│ │
│ FREE MEMORY │
│ (available for a heap later) │
│ │
0x3B400000 ─────────────────────────────────┤
│ GPU memory (varies) │
0x40000000 ─────────────────────────────────┘
(1GB address space - Pi 3)
Critical requirements:
- Entry point at 0x80000: GPU loads kernel here
- Boot code first: _start assembly must be at the very beginning
- BSS symbols exported: Startup code needs to zero BSS
- Alignment: Some sections need 4K alignment for MMU (not needed for hello world)
2.4 BCM2837/BCM2711 Peripherals
The Raspberry Pi’s peripherals are memory-mapped—you read and write to specific physical addresses to control hardware.
Raspberry Pi Peripheral Memory Map:
Pi 3 (BCM2837): Pi 4 (BCM2711):
┌────────────────────────────────┐ ┌────────────────────────────────┐
│ Peripheral Base: 0x3F000000 │ │ Peripheral Base: 0xFE000000 │
├────────────────────────────────┤ ├────────────────────────────────┤
│ GPIO: Base + 0x200000 │ │ GPIO: Base + 0x200000 │
│ UART0: Base + 0x201000 │ │ UART0: Base + 0x201000 │
│ AUX/UART1: Base + 0x215000 │ │ AUX/UART1: Base + 0x215000 │
│ Mailbox: Base + 0x00B880 │ │ Mailbox: Base + 0x00B880 │
│ Interrupt: Base + 0x00B200 │ │ GIC: 0xFF840000 │
└────────────────────────────────┘ └────────────────────────────────┘
UART0 (PL011) Register Map @ 0x3F201000 (Pi 3):
┌─────────────────────────────────────────────────────────────────────┐
│ Offset │ Name │ Description │
├────────┼───────────┼───────────────────────────────────────────────┤
│ 0x00 │ DR │ Data Register - read/write characters │
│ 0x04 │ RSRECR │ Receive status / error clear │
│ 0x18 │ FR │ Flag Register - TX/RX status │
│ 0x24 │ IBRD │ Integer baud rate divisor │
│ 0x28 │ FBRD │ Fractional baud rate divisor │
│ 0x2C │ LCRH │ Line control (data bits, parity, FIFO) │
│ 0x30 │ CR │ Control Register (enable TX/RX) │
│ 0x34 │ IFLS │ Interrupt FIFO level select │
│ 0x38 │ IMSC │ Interrupt mask set/clear │
│ 0x44 │ ICR │ Interrupt clear register │
└─────────────────────────────────────────────────────────────────────┘
Flag Register (FR) Bits:
┌─────────────────────────────────────────────────────────────────────┐
│ Bit │ Name │ Description │
├─────┼──────┼───────────────────────────────────────────────────────┤
│ 7 │ TXFE │ Transmit FIFO empty │
│ 5 │ TXFF │ Transmit FIFO full (wait if set before writing) │
│ 4 │ RXFE │ Receive FIFO empty (no data to read) │
│ 3 │ BUSY │ UART busy transmitting │
└─────────────────────────────────────────────────────────────────────┘
Transmit Algorithm:
┌──────────────────────────┐
│ while (FR & TXFF) │ ← Wait until TX FIFO not full
│ ; // spin │
│ DR = character; │ ← Write character to data register
└──────────────────────────┘
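One way to keep the offsets in the register map straight is to describe the PL011 as a C struct and let the compiler compute them. The sketch below is an illustration under that assumption: the struct and function names are hypothetical, and the reserved gaps simply pad over registers this project does not use.

```c
#include <stddef.h>
#include <stdint.h>

/* PL011 register block; field order mirrors the offset table above. */
struct pl011_regs {
    volatile uint32_t dr;      /* 0x00 data register */
    volatile uint32_t rsrecr;  /* 0x04 receive status / error clear */
    uint32_t reserved0[4];     /* 0x08-0x17 */
    volatile uint32_t fr;      /* 0x18 flag register */
    uint32_t reserved1[2];     /* 0x1C-0x23 */
    volatile uint32_t ibrd;    /* 0x24 integer baud divisor */
    volatile uint32_t fbrd;    /* 0x28 fractional baud divisor */
    volatile uint32_t lcrh;    /* 0x2C line control */
    volatile uint32_t cr;      /* 0x30 control */
    volatile uint32_t ifls;    /* 0x34 interrupt FIFO level select */
    volatile uint32_t imsc;    /* 0x38 interrupt mask set/clear */
    uint32_t reserved2[2];     /* 0x3C-0x43 */
    volatile uint32_t icr;     /* 0x44 interrupt clear */
};

#define PL011_FR_TXFF (1u << 5) /* transmit FIFO full */

/* Transmit algorithm from the box above: spin while full, then write. */
static inline void pl011_putc(struct pl011_regs *uart, unsigned char c)
{
    while (uart->fr & PL011_FR_TXFF) {
        /* spin until the FIFO has room */
    }
    uart->dr = c;
}
```

On a Pi 3 the pointer would be `(struct pl011_regs *)0x3F201000`; on a host machine the same function can be exercised against an ordinary zero-initialized struct.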
GPIO Configuration for UART:
The UART pins (GPIO 14 = TXD, GPIO 15 = RXD) must be configured for their “alternate function” (ALT0).
GPIO Function Select Registers (GPFSEL):
Each GPIO pin uses 3 bits to select its function:
┌─────────────────────────────────────────────────────────────────────┐
│ Value │ Function │
├───────┼────────────────────────────────────────────────────────────┤
│ 000 │ Input │
│ 001 │ Output │
│ 010 │ ALT5 │
│ 011 │ ALT4 │
│ 100 │ ALT0 ← UART0 TX/RX for GPIO 14/15 │
│ 101 │ ALT1 │
│ 110 │ ALT2 │
│ 111 │ ALT3 │
└─────────────────────────────────────────────────────────────────────┘
GPFSEL1 @ GPIO Base + 0x04 (covers GPIO 10-19):
┌─────────────────────────────────────────────────────────────────────┐
│ Bits │ 20-19-18 │ 17-16-15 │ 14-13-12 │ ... │
│ │ GPIO 16 │ GPIO 15 │ GPIO 14 │ │
│ │ │ (RXD) │ (TXD) │ │
└─────────────────────────────────────────────────────────────────────┘
To set GPIO 14 and 15 to ALT0:
GPFSEL1 = (GPFSEL1 & ~0x3F000) | 0x24000;
└─ clear bits 12-17 ─┘ └─ set 100 100 ─┘
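The same read-modify-write can be expressed per pin, which avoids hand-computing masks. This is a sketch with hypothetical helper names; it operates on a plain value so the arithmetic can be checked without hardware.

```c
#include <stdint.h>

#define GPIO_FUNC_ALT0 4u /* binary 100, from the function table above */

/* Returns the GPFSEL value with the 3-bit field for `pin` set to `func`.
 * Each GPFSEL register covers ten pins, three bits per pin. */
static inline uint32_t gpfsel_with_function(uint32_t gpfsel, unsigned pin,
                                            uint32_t func)
{
    unsigned shift = (pin % 10u) * 3u;
    gpfsel &= ~(UINT32_C(7) << shift); /* clear the old function */
    gpfsel |= (func & 7u) << shift;    /* install the new one */
    return gpfsel;
}
```

Applying it for pins 14 and 15 to a register that reads as zero yields 0x24000, matching the constant in the one-line form above.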
2.5 Startup Assembly Requirements
The startup assembly code must handle several critical tasks before C code can run:
Startup Assembly Flow:
┌────────────────────────────────────┐
│ All 4 cores wake up │
│ simultaneously │
└───────────────┬────────────────────┘
│
▼
┌────────────────────────────────────┐
│ 1. Read MPIDR_EL1 to get core ID │
│ Core 0 continues │
│ Cores 1-3 → spin loop │
└───────────────┬────────────────────┘
│
┌───────────────┴───────────────────┐
│ │
▼ ▼
┌───────────────────────────┐ ┌───────────────────────────┐
│ CORE 0 (continues) │ │ CORES 1-3 (parked) │
└─────────────┬─────────────┘ └───────────────────────────┘
│ │
▼ ▼
┌───────────────────────────┐ ┌───────────────────────────┐
│ 2. Check exception level │ │ wfe │
│ (should be EL2 or EL1)│ │ b spin_loop │
└─────────────┬─────────────┘ │ (wait for event, loop) │
│ └───────────────────────────┘
▼
┌───────────────────────────┐
│ 3. Drop from EL2 to EL1 │
│ (if starting in EL2) │
└─────────────┬─────────────┘
│
▼
┌───────────────────────────┐
│ 4. Set up stack pointer │
│ SP = 0x80000 │
│ (grows down) │
└─────────────┬─────────────┘
│
▼
┌───────────────────────────┐
│ 5. Clear BSS section │
│ memset(bss, 0, len) │
└─────────────┬─────────────┘
│
▼
┌───────────────────────────┐
│ 6. Branch to kernel_main │
│ bl kernel_main │
└─────────────┬─────────────┘
│
▼
┌───────────────────────────┐
│ 7. Infinite loop (halt) │
│ (if main returns) │
└───────────────────────────┘
Stack Layout:
0x80000 ─────────────────────────────────┐
│ ▲ kernel_main stack frame │
│ │ │
│ │ Stack grows DOWN │
│ │ │
0x7XXXX ────────────────────────────────┘
Why park secondary cores?
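Unless the firmware holds them back (recent firmware typically keeps cores 1-3 spinning in its own stub), all four cores of the Cortex-A53 wake up and start executing at 0x80000 simultaneously. If cores 1-3 do reach your code and you don’t park them, they’ll all try to: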
- Use the same stack pointer (corrupting each other’s stack)
- Initialize the same peripherals (race conditions)
- Print to UART (garbled output)
For this project, we park them. Multi-core programming comes later.
2.6 Common Misconceptions
Misconception 1: “I need to write a bootloader”
- Reality: The GPU handles boot. You write a “kernel” that the GPU loads.
Misconception 2: “The ARM CPU starts first”
- Reality: The GPU boots first, initializes RAM, then releases the ARM cores.
Misconception 3: “I can use any address for my kernel”
- Reality: It must be 0x80000 unless you override the load address in config.txt (not recommended for learning).
Misconception 4: “UART just works if I write to it”
- Reality: GPIO pins must be configured for ALT0 function first.
Misconception 5: “All cores start at different times”
- Reality: All 4 cores wake up simultaneously and jump to 0x80000.
Misconception 6: “I can use the standard C library”
- Reality: No OS = no libc. You implement everything yourself.
Misconception 7: “Debugging is easy with printf”
- Reality: You must implement UART first to get any output. Before that, blink an LED.
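Misconception 6 has practical consequences: without libc there is no `printf`, so even printing a byte in hex is something you write yourself. A minimal sketch (the function name is hypothetical):

```c
#include <stdint.h>

/* Writes "0x", two lowercase hex digits, and a terminating NUL into out.
 * The caller must supply at least 5 bytes. */
static void format_hex8(uint8_t value, char out[5])
{
    static const char digits[] = "0123456789abcdef";
    out[0] = '0';
    out[1] = 'x';
    out[2] = digits[value >> 4];    /* high nibble */
    out[3] = digits[value & 0x0fu]; /* low nibble */
    out[4] = '\0';
}
```

The resulting buffer can be handed to a string output routine once the UART driver exists.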
3. Project Specification
3.1 What You Will Build
A minimal bare metal kernel for Raspberry Pi 3/4 that:
- Boots from SD card (replacing the default Linux kernel)
- Parks secondary CPU cores in a spin loop
- Sets up a stack for the primary core
- Initializes the PL011 UART at 115200 baud
- Prints “Hello from bare metal!” to the serial console
- Optionally blinks the onboard ACT LED as a visual indicator
3.2 Functional Requirements
| ID | Requirement | Priority |
|---|---|---|
| FR1 | Kernel boots on Pi 3 or Pi 4 | Required |
| FR2 | Secondary cores parked (not executing random memory) | Required |
| FR3 | BSS section cleared to zero | Required |
| FR4 | UART0 (PL011) initialized at 115200 baud, 8N1 | Required |
| FR5 | Print string to serial terminal | Required |
| FR6 | Support both Pi 3 (BCM2837) and Pi 4 (BCM2711) | Optional |
| FR7 | Blink ACT LED as visual indicator | Optional |
| FR8 | Print CPU identification (core type, revision) | Optional |
3.3 Non-Functional Requirements
| ID | Requirement | Target |
|---|---|---|
| NFR1 | Kernel binary size | < 4KB |
| NFR2 | Boot to UART output | < 100ms |
| NFR3 | Cross-compilation support | Linux/macOS host |
| NFR4 | QEMU emulation support | For development without hardware |
3.4 Example Output
When you connect a USB-to-TTL serial adapter to GPIO 14 (TXD) and GPIO 15 (RXD) and run screen /dev/ttyUSB0 115200:
=== Raspberry Pi 3 Bare Metal ===
UART initialized at 115200 baud
Hello from bare metal!
CPU: Cortex-A53
Core ID: 0
Exception Level: EL1
System running. Press any key...
You pressed: 'a' (0x61)
You pressed: 'b' (0x62)
3.5 Real World Outcome
After completing this project, you will have:
On Your Development Machine:
$ ls -la build/
total 24
-rw-r--r-- 1 user user 512 Dec 15 10:30 boot.S
-rw-r--r-- 1 user user 2048 Dec 15 10:30 kernel.c
-rw-r--r-- 1 user user 256 Dec 15 10:30 linker.ld
-rw-r--r-- 1 user user 56 Dec 15 10:30 Makefile
-rwxr-xr-x 1 user user 3584 Dec 15 10:30 kernel8.img
$ file build/kernel8.img
build/kernel8.img: data
$ xxd build/kernel8.img | head -4
00000000: d53800a1 12000c21 7100003f 54000040 .8.....!q..?T..@
00000010: 580000c2 d61f0040 00008000 00000000 X......@........
On Your SD Card:
boot/
├── bootcode.bin # From Raspberry Pi firmware
├── start.elf # From Raspberry Pi firmware
├── fixup.dat # From Raspberry Pi firmware
├── config.txt # Optional: arm_64bit=1
└── kernel8.img # YOUR compiled kernel!
On Serial Terminal:
=== Raspberry Pi 3 Bare Metal ===
Hello from bare metal!
Verification checklist:
- Pi boots without Linux kernel panic
- Serial output appears within 1 second of power-on
- Output is not garbled (correct baud rate)
- Pressing keys echoes them back (if you implement RX)
- ACT LED blinks (if you implement GPIO)
4. Solution Architecture
4.1 Boot Sequence Diagram
Complete Boot Flow (Your Code Path Highlighted):
Power On
│
▼
[GPU ROM] ──────────────────────────────────────────────────────────┐
│ Hardcoded in Broadcom SoC │
│ Looks for bootcode.bin on SD card │
│ │
▼ │
[bootcode.bin] ─────────────────────────────────────────────────────┤
│ Enables SDRAM │
│ Loads start.elf │
│ │ NOT YOUR
▼ │ CODE
[start.elf] ────────────────────────────────────────────────────────┤
│ GPU firmware with ARM bootloader │
│ Reads config.txt (optional) │
│ Loads kernel8.img to 0x80000 │
│ Sets up ATAGs or device tree (optional) │
│ Releases ARM cores from reset │
└───────────────────────────────────────────────────────────────┘
│
════════════════════════════════════╪════════════════════════════════
│
┌───────────────────────────────┘
│
▼
╔═══════════════════════════════════════════════════════════════════╗
║ YOUR CODE STARTS HERE ║
╠═══════════════════════════════════════════════════════════════════╣
║ ║
║ [boot.S - _start] ◄────── Entry point at 0x80000 ║
║ │ ║
║ ├─► Get core ID from MPIDR_EL1 ║
║ │ │ ║
║ │ ├─► Core 0: Continue ║
║ │ └─► Cores 1-3: Jump to spin_loop ║
║ │ ║
║ ├─► Set stack pointer (SP = 0x80000) ║
║ │ ║
║ ├─► Clear BSS section ║
║ │ │ ║
║ │ └─► Loop: while (bss_ptr < bss_end) *bss_ptr++ = 0; ║
║ │ ║
║ └─► Branch to kernel_main (bl kernel_main) ║
║ ║
║ [kernel.c - kernel_main] ║
║ │ ║
║ ├─► uart_init() ║
║ │ │ ║
║ │ ├─► Disable UART ║
║ │ ├─► Set GPIO 14, 15 to ALT0 ║
║ │ ├─► Disable pull-up/down ║
║ │ ├─► Clear pending interrupts ║
║ │ ├─► Set baud rate (115200) ║
║ │ ├─► Enable FIFO, 8-bit ║
║ │ └─► Enable UART, TX, RX ║
║ │ ║
║ ├─► uart_puts("Hello from bare metal!") ║
║ │ │ ║
║ │ └─► Loop: while (*s) uart_putc(*s++); ║
║ │ ║
║ └─► while(1) { ... } // Main loop ║
║ ║
╚═══════════════════════════════════════════════════════════════════╝
4.2 Memory Map
Raspberry Pi 3 Physical Memory Map:
0x00000000 ┌─────────────────────────────────────────────┐
│ Low RAM: firmware stub at 0x0 │
│ Core 0 stack: SP starts at 0x80000 │
│ and grows DOWN toward 0x0 │
0x00080000 ├─────────────────────────────────────────────┤ ◄── kernel8.img loaded here
│ ┌─────────────────────────────────────────┐ │
│ │ .text.boot section │ │ _start (entry point)
│ │ - Core parking logic │ │
│ │ - Stack setup │ │
│ │ - BSS clearing │ │
│ │ - Jump to C │ │
│ ├─────────────────────────────────────────┤ │
│ │ .text section │ │
│ │ - kernel_main() │ │
│ │ - uart_init(), uart_putc(), etc. │ │
│ ├─────────────────────────────────────────┤ │
│ │ .rodata section │ │
│ │ - "Hello from bare metal!\n" │ │
│ │ - Other string constants │ │
│ ├─────────────────────────────────────────┤ │
│ │ .data section │ │
│ │ - Initialized global variables │ │
│ ├─────────────────────────────────────────┤ │
│ │ .bss section │ │ __bss_start
│ │ - Uninitialized globals (zeroed) │ │ __bss_end
│ └─────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────┐ │
│ │ │ │
│ │ FREE MEMORY │ │
│ │ │ │
│ │ (Heap could grow up here) │ │
│ │ │ │
│ └─────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────┐ │
│ │ GPU memory (top of RAM; size set by │ │
│ │ gpu_mem in config.txt, default 64MB) │ │
│ └─────────────────────────────────────────┘ │
0x3F000000 ├─────────────────────────────────────────────┤
│ │
│ Peripheral I/O (Memory-Mapped) │
│ │
│ +0x200000: GPIO │
│ +0x201000: UART0 (PL011) │
│ +0x215000: AUX (UART1, SPI) │
│ +0x00B880: Mailbox │
│ │
0x40000000 ├─────────────────────────────────────────────┤
│ Local peripherals │
│ (Core timers, mailboxes) │
0x40000100 └─────────────────────────────────────────────┘
Note: Pi 4 has peripherals at 0xFE000000 and more RAM (up to 8GB)
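Because only the peripheral base differs between the two boards for the registers used here, one option is to select it at build time. The sketch below assumes a hypothetical `RPI_VERSION` macro passed by the build (for example `-DRPI_VERSION=4`); it is not a standard compiler or firmware define.

```c
#include <stdint.h>

/* Hypothetical build-time switch; defaults to the Pi 3 if not given. */
#ifndef RPI_VERSION
#define RPI_VERSION 3
#endif

#if RPI_VERSION == 4
#define PERIPHERAL_BASE UINT64_C(0xFE000000) /* BCM2711 */
#elif RPI_VERSION == 3
#define PERIPHERAL_BASE UINT64_C(0x3F000000) /* BCM2837 */
#else
#error "Unsupported RPI_VERSION (expected 3 or 4)"
#endif

/* Offsets are identical on both SoCs, per the maps above. */
#define GPIO_BASE  (PERIPHERAL_BASE + 0x200000)
#define UART0_BASE (PERIPHERAL_BASE + 0x201000)
```

A single source tree can then produce either image by changing one compiler flag, which is one way to approach optional requirement FR6.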
4.3 File Structure
project/
├── Makefile # Build automation
├── linker.ld # Memory layout for linker
├── src/
│ ├── boot.S # Assembly entry point
│ ├── kernel.c # Main C code
│ ├── uart.c # UART driver
│ ├── uart.h # UART interface
│ ├── gpio.c # GPIO driver (optional)
│ ├── gpio.h # GPIO interface (optional)
│ └── mmio.h # Memory-mapped I/O helpers
├── build/ # Build output directory
│ ├── boot.o
│ ├── kernel.o
│ ├── uart.o
│ ├── kernel.elf # Linked executable
│ └── kernel8.img # Final binary for SD card
└── README.md # Build instructions
Recommended: Start with a single file approach, then split
Minimal single-file approach (recommended for learning):
project/
├── Makefile
├── linker.ld
├── boot.S
├── kernel.c # Everything in one file initially
└── kernel8.img
4.4 UART Implementation Strategy
UART Initialization Sequence:
Step 1: Disable UART
────────────────────────────────────────
CR = 0 # Control Register = 0
│
└── Turns off UART for safe reconfiguration
Step 2: Configure GPIO for UART
────────────────────────────────────────
GPFSEL1 &= ~(7 << 12) # Clear GPIO 14 function
GPFSEL1 |= (4 << 12) # GPIO 14 = ALT0 (TXD0)
GPFSEL1 &= ~(7 << 15) # Clear GPIO 15 function
GPFSEL1 |= (4 << 15) # GPIO 15 = ALT0 (RXD0)
│
└── Sets pins to UART alternate function
Step 3: Disable Pull-up/Pull-down
────────────────────────────────────────
GPPUD = 0 # Disable pull-up/down
delay(150) # Wait 150 cycles
GPPUDCLK0 = (1 << 14) | (1 << 15) # Clock GPIO 14, 15
delay(150) # Wait 150 cycles
GPPUDCLK0 = 0 # Remove clock
│
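└── Required sequence from BCM2835 datasheet (applies to the Pi 3; the Pi 4’s BCM2711 replaces GPPUD/GPPUDCLK with GPIO_PUP_PDN_CNTRL_REG registers)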
Step 4: Clear Pending Interrupts
────────────────────────────────────────
ICR = 0x7FF # Clear all interrupt flags
│
└── Start with clean state
Step 5: Set Baud Rate
────────────────────────────────────────
# For 115200 baud with 48MHz UART clock:
# Divider = 48000000 / (16 * 115200) = 26.0416...
# Integer part: 26
# Fractional part: 0.0416 * 64 = 2.67 ≈ 3
IBRD = 26 # Integer baud rate divisor
FBRD = 3 # Fractional baud rate divisor
│
└── Configures 115200 baud (close approximation)
Step 6: Configure Line Control
────────────────────────────────────────
LCRH = (3 << 5) | (1 << 4) # 8 bits, FIFO enabled
│ │ │
│ │ └── FIFO enable (bit 4)
│ └── Word length = 8 bits (bits 5-6 = 11)
│
└── 8N1: 8 data bits, no parity, 1 stop bit
Step 7: Enable UART
────────────────────────────────────────
CR = (1 << 9) | (1 << 8) | (1 << 0)
│ │ │ │
│ │ │ └── UART enable (bit 0)
│ │ └── TX enable (bit 8)
│ └── RX enable (bit 9)
│
└── UART is now ready for communication
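The divisor arithmetic in Step 5 can be done in integer math, which is useful if the UART clock is not the 48MHz assumed above. A sketch with hypothetical names; it computes the divisor scaled by 64 and rounds to the nearest value:

```c
#include <stdint.h>

struct pl011_baud {
    uint32_t ibrd; /* integer part of the divisor */
    uint32_t fbrd; /* fractional part, in 1/64 units */
};

/* divisor = clock / (16 * baud); scaled by 64 this is clock * 4 / baud. */
static inline struct pl011_baud pl011_baud_divisors(uint32_t uart_clock_hz,
                                                    uint32_t baud)
{
    uint64_t scaled = ((uint64_t)uart_clock_hz * 4u + baud / 2u) / baud;
    struct pl011_baud d;
    d.ibrd = (uint32_t)(scaled >> 6);
    d.fbrd = (uint32_t)(scaled & 0x3fu);
    return d;
}
```

For a 48MHz clock and 115200 baud this gives IBRD = 26 and FBRD = 3, the values used in Step 5. If output is garbled on hardware, trying the calculation with a different assumed clock is a quick experiment.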
5. Implementation Guide
5.1 Environment Setup
Required tools:
# On Ubuntu/Debian:
sudo apt-get update
sudo apt-get install gcc-aarch64-linux-gnu
sudo apt-get install qemu-system-arm
# On macOS (via Homebrew):
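# (arm-none-eabi-gcc targets 32-bit ARM only; AArch64 needs aarch64-elf-gcc.
#  With this toolchain, set CROSS = aarch64-elf- in the Makefile.)
brew install aarch64-elf-gcc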
brew install qemu
# On Arch Linux:
sudo pacman -S aarch64-linux-gnu-gcc qemu-system-aarch64
# Verify installation:
aarch64-linux-gnu-gcc --version
qemu-system-aarch64 --version
Download Raspberry Pi firmware files:
# Create project directory
mkdir rpi-bare-metal && cd rpi-bare-metal
# Download firmware files (bootcode.bin, start.elf, fixup.dat)
wget https://github.com/raspberrypi/firmware/raw/master/boot/bootcode.bin
wget https://github.com/raspberrypi/firmware/raw/master/boot/start.elf
wget https://github.com/raspberrypi/firmware/raw/master/boot/fixup.dat
5.2 Step 1: Create the Linker Script
The linker script controls where your code goes in memory. Think through these questions:
Design Questions:
- At what address does the GPU load kernel8.img?
- What must be at the very first byte of the kernel?
- What sections does your code need?
- How will your startup code know where BSS starts and ends?
Linker Script Skeleton:
/* linker.ld - Fill in the values */
ENTRY(/* What symbol is your entry point? */)
SECTIONS
{
/* Where does the GPU load the kernel? */
. = ???;
/* Boot code MUST be first - why? */
.text.boot : { *(/* pattern? */) }
/* Rest of code */
.text : { *(.text) }
/* Read-only data (strings) */
.rodata : { *(.rodata) }
/* Initialized global variables */
.data : { *(.data) }
/* Uninitialized data - needs symbols for startup code */
.bss : {
/* Export symbol: where does BSS start? */
__bss_start = .;
*(.bss)
/* Export symbol: where does BSS end? */
__bss_end = .;
}
}
5.3 Step 2: Write the Assembly Entry Point
Your startup code runs before C. It must set up the environment C expects.
Design Questions:
- How do you determine which core you are?
- What should cores 1-3 do?
- What must be true about the stack before calling C?
- What is the BSS section and why clear it?
Assembly Skeleton:
// boot.S - Fill in the implementation
.section ".text.boot"
.global _start
_start:
// Step 1: Which core am I?
// Read MPIDR_EL1, check bits [1:0] for core ID
// Instruction: mrs x?, mpidr_el1
// Then mask: and x?, x?, #0x3
// Step 2: If not core 0, go to parking loop
// Instruction: cbnz x?, label
// Step 3: Set stack pointer
// Stack grows DOWN, starts BELOW kernel
// Instruction: ldr x?, =??? then mov sp, x?
// Step 4: Clear BSS
// Load __bss_start and __bss_end symbols
// Loop storing zeros
// Instructions: ldr x?, =__bss_start, etc.
// Step 5: Jump to C
// Instruction: bl kernel_main
// Step 6: Hang if main returns
halt:
// Wait for event, branch to self
// Instructions: wfe, b halt
// Parking loop for secondary cores
secondary_spin:
// Wait for event, loop forever
// Instructions: wfe, b secondary_spin
Key ARM64 Instructions:
- mrs Xd, <sysreg>: Move from system register to Xd
- and Xd, Xn, #imm: Bitwise AND with immediate
- cbnz Xn, label: Compare and branch if not zero
- cbz Xn, label: Compare and branch if zero
- ldr Xd, =symbol: Load address of symbol (pseudo-instruction)
- mov sp, Xn: Set stack pointer from register
- str Xzr, [Xn], #8: Store zero, post-increment address by 8
- cmp Xn, Xm: Compare two registers
- b.lt label: Branch if less than, signed (after cmp); b.lo is the unsigned form, which suits address comparisons
- bl label: Branch with link (call function)
- wfe: Wait for event (low power wait)
- b label: Unconditional branch
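As a reference for Step 4, the BSS loop built from `str xzr`, `cmp`, and a conditional branch corresponds to the C below. This is a sketch for reasoning about the loop, not a replacement for the assembly; the function name is hypothetical, and it assumes both bounds are 8-byte aligned, which the linker script would need to guarantee.

```c
#include <stdint.h>

/* Zeroes [start, end) eight bytes at a time. An empty range is a no-op. */
static inline void zero_range(uint64_t *start, uint64_t *end)
{
    while (start < end) { /* cmp + branch while start is below end */
        *start++ = 0;     /* str xzr, [xN], #8 */
    }
}
```

Checking the condition before the first store matters when BSS is empty, as it may be in a very small kernel. Writing the loop in assembly also avoids a known hazard: compilers may replace a C zeroing loop with a call to `memset`, which does not exist until you provide it.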
5.4 Step 3: Implement UART Driver
Your first C code. Keep it simple.
Design Questions:
- How do you read/write to a specific memory address in C?
- What is volatile and why is it critical for hardware registers?
- In what order must UART be configured?
- How do you know when it’s safe to write another character?
UART Skeleton:
// kernel.c - Fill in the implementation
// Step 1: Define peripheral base addresses
// Pi 3: 0x3F000000, Pi 4: 0xFE000000
#define PERIPHERAL_BASE ???
// Step 2: Define UART register offsets
#define UART0_BASE (PERIPHERAL_BASE + ???)
#define UART0_DR (UART0_BASE + 0x00) // Data register
#define UART0_FR (UART0_BASE + ???) // Flag register
// ... more registers
// Step 3: Helper to write to memory-mapped register
// Think: How do you write to an arbitrary address in C?
// What does volatile mean and why is it needed?
static inline void mmio_write(unsigned long reg, unsigned int data) {
*(volatile unsigned int *)reg = data;
}
// Step 4: Helper to read from memory-mapped register
static inline unsigned int mmio_read(unsigned long reg) {
return *(volatile unsigned int *)reg;
}
// Step 5: Transmit single character
// Think: What must you check before writing to data register?
void uart_putc(unsigned char c) {
// Wait while transmit FIFO is full
// Check bit ??? of flag register
while (???) {
// spin
}
// Write character to data register
???;
}
// Step 6: Transmit string
void uart_puts(const char *str) {
while (*str) {
???;
}
}
// Step 7: UART initialization
// Follow the sequence from section 4.4
void uart_init(void) {
// 1. Disable UART
// 2. Configure GPIO
// 3. Disable pull-up/down
// 4. Clear interrupts
// 5. Set baud rate
// 6. Configure line control
// 7. Enable UART
}
// Step 8: Main function
void kernel_main(void) {
uart_init();
uart_puts("Hello from bare metal!\r\n");
// Loop forever
while (1) {
// Optional: echo received characters
}
}
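For the optional echo in the main loop, receiving mirrors transmitting: wait on a flag, then touch the data register. The sketch below uses a hypothetical helper that takes the register addresses as parameters so it can be tried on a host; in the kernel these would be the UART0_DR and UART0_FR addresses.

```c
#include <stdint.h>

#define PL011_FR_RXFE (1u << 4) /* receive FIFO empty */

/* Blocks until a byte is available, then returns it.
 * DR bits above 7 carry error flags, so only the low byte is kept. */
static inline unsigned char uart_getc_from(volatile uint32_t *dr,
                                           volatile uint32_t *fr)
{
    while (*fr & PL011_FR_RXFE) {
        /* spin until the receive FIFO holds data */
    }
    return (unsigned char)(*dr & 0xffu);
}
```

An echo loop is then a matter of passing each received byte to `uart_putc`.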
5.5 Step 4: Create the Makefile
Design Questions:
- What CPU architecture flags does the compiler need?
- What does -ffreestanding mean and why is it needed?
- How do you convert an ELF file to a raw binary?
Makefile Skeleton:
# Makefile - Fill in the commands
# Toolchain prefix for cross-compilation
CROSS = aarch64-linux-gnu-
# Compiler flags
# -mcpu=??? : Which CPU?
# -fpic : Position independent code
# -ffreestanding : No standard library
CFLAGS = ???
# Assembler flags
ASFLAGS = ???
# Source files
SOURCES = boot.S kernel.c
# Build targets
all: kernel8.img
# Compile assembly
boot.o: boot.S
	$(CROSS)gcc $(ASFLAGS) -c $< -o $@
# Compile C
kernel.o: kernel.c
	$(CROSS)gcc $(CFLAGS) -c $< -o $@
# Link
kernel.elf: boot.o kernel.o linker.ld
	$(CROSS)ld -T linker.ld -o $@ boot.o kernel.o
# Convert to raw binary
# Hint: objcopy with -O binary
kernel8.img: kernel.elf
	$(CROSS)objcopy ??? $< $@
clean:
	rm -f *.o *.elf *.img
5.6 Step 5: Test with QEMU
Before putting it on hardware, test in QEMU:
# Run in QEMU (serial to stdio)
qemu-system-aarch64 -M raspi3b -kernel kernel8.img -serial stdio
# With debugging:
qemu-system-aarch64 -M raspi3b -kernel kernel8.img -serial stdio -d int,cpu_reset
# With GDB server (for debugging):
qemu-system-aarch64 -M raspi3b -kernel kernel8.img -serial stdio -S -s
# In another terminal:
# (on Debian/Ubuntu the multi-architecture debugger is installed as gdb-multiarch)
aarch64-linux-gnu-gdb kernel.elf -ex "target remote :1234"
Expected QEMU output:
Hello from bare metal!
5.7 Step 6: Deploy to Hardware
Prepare SD card:
# Format SD card as FAT32 (use lsblk to find device, e.g., /dev/sdb)
sudo mkfs.vfat -F 32 /dev/sdX1
# Mount and copy files
sudo mount /dev/sdX1 /mnt
sudo cp bootcode.bin start.elf fixup.dat kernel8.img /mnt/
sudo umount /mnt
Optional config.txt:
# config.txt (optional)
arm_64bit=1
kernel=kernel8.img
Connect serial adapter:
USB-TTL Adapter Raspberry Pi GPIO
──────────────── ───────────────────
GND ───────────── Pin 6 (Ground)
RXD ───────────── Pin 8 (GPIO 14, TXD)
TXD ───────────── Pin 10 (GPIO 15, RXD)
(DO NOT connect VCC - Pi has its own power)
Open terminal:
# Linux
screen /dev/ttyUSB0 115200
# macOS
screen /dev/cu.usbserial-* 115200
# Exit screen: Ctrl-A, then K, then Y
Power on Pi and watch output.
6. Testing Strategy
6.1 Unit Testing (Limited in Bare Metal)
Without an OS, traditional unit testing is difficult. Instead:
Test each component in isolation:
- Assembly entry only: Confirm that it reaches the halt loop without faulting (in QEMU, -d int logs any exceptions taken)
- Stack test: Write a C function that uses locals, verify it returns
- UART TX only: Hardcode characters before implementing putc
- UART RX: Echo characters back
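The stack test can be as small as a function that forces locals onto the stack and returns a value you can predict. A sketch with a hypothetical name; `volatile` is there to discourage the compiler from optimizing the array away:

```c
#include <stdint.h>

/* Fills a local array and sums it. With a working stack the result is
 * 3 * (0 + 1 + ... + 15) = 360; a bad SP usually faults or returns junk. */
static uint32_t stack_smoke_test(void)
{
    volatile uint32_t scratch[16];
    uint32_t sum = 0;

    for (uint32_t i = 0; i < 16; i++) {
        scratch[i] = i * 3u;
    }
    for (uint32_t i = 0; i < 16; i++) {
        sum += scratch[i];
    }
    return sum;
}
```

Before the UART works, the result can be checked in GDB or signalled with the LED.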
6.2 Integration Testing
Test Matrix:
| Test | QEMU | Pi 3 | Pi 4 | Expected |
|---|---|---|---|---|
| Boots without hang | Pass | Pass | Pass | UART output appears |
| Correct baud rate | N/A | Pass | Pass | No garbled output |
| String output | Pass | Pass | Pass | “Hello from bare metal!” |
| Character echo | Pass | Pass | Pass | Typed chars echoed |
| Multi-core parking | Pass | Pass | Pass | Only one “Hello” message |
6.3 Debugging Techniques
When nothing works (no output):
Debugging Flowchart:
[No output]
│
├──► Is QEMU working?
│ │
│ ├── Yes: Problem is hardware-specific
│ │ → Check GPIO wiring
│ │ → Check serial adapter
│ │ → Verify firmware files on SD
│ │
│ └── No: Problem is in code
│ → Check linker script origin
│ → Verify _start is at 0x80000
│ → Add infinite LED blink BEFORE uart_init
│
├──► Is it garbled output?
│ │
│ └── Baud rate mismatch
│ → Check IBRD/FBRD calculation
│ → Verify UART clock frequency
│
└──► Is it partial output?
│
└── Missing newlines or buffer issues
→ Use \r\n not just \n
→ Add uart_flush() if needed
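The `\r\n` advice in the flowchart can be built into the string routine so callers only ever write `\n`. This sketch takes the character sink as a function pointer; `capture_putc` and its buffer are host-side stand-ins for `uart_putc`, included only so the translation can be tried without hardware. All names are hypothetical.

```c
typedef void (*putc_fn)(unsigned char c);

/* Sends s through put, inserting '\r' before every '\n'. */
static void puts_crlf(putc_fn put, const char *s)
{
    for (; *s; s++) {
        if (*s == '\n') {
            put('\r');
        }
        put((unsigned char)*s);
    }
}

/* Host-side stand-in for uart_putc: records characters in a buffer. */
static char captured[64];
static unsigned captured_len;

static void capture_putc(unsigned char c)
{
    if (captured_len < sizeof captured) {
        captured[captured_len++] = (char)c;
    }
}
```

With this in place, strings should contain only `\n`; a literal `\r\n` in the input would be sent as `\r\r\n`.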
LED debugging (before UART works):
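// GPIO 47 drives the ACT LED on older boards such as the Pi Zero (active low).
// The LED is wired differently on other models: the Pi 3B routes it through a
// GPU-controlled expander, the Pi 3B+ uses GPIO 29, and the Pi 4 uses GPIO 42.
// Adjust the pin, registers, and polarity below for your board.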
#define GPIO_BASE (PERIPHERAL_BASE + 0x200000)
#define GPFSEL4 (GPIO_BASE + 0x10)
#define GPSET1 (GPIO_BASE + 0x20)
#define GPCLR1 (GPIO_BASE + 0x2C)
void led_init(void) {
// Set GPIO 47 as output
unsigned int val = mmio_read(GPFSEL4);
val &= ~(7 << 21); // Clear bits 23-21
val |= (1 << 21); // Set as output
mmio_write(GPFSEL4, val);
}
void led_on(void) {
mmio_write(GPCLR1, (1 << 15)); // GPIO 47 is bit 15 of GPCLR1
}
void led_off(void) {
mmio_write(GPSET1, (1 << 15));
}
Use LED as progress indicator:
void kernel_main(void) {
led_init();
led_on(); // LED on = we got to main
uart_init();
led_off(); // LED off = UART init complete
uart_puts("Hello");
led_on(); // LED on = message sent
}
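The register offsets and bit positions in the LED helpers all follow from the pin number, so moving the LED code to a different pin is mostly arithmetic. A sketch with hypothetical names that computes them, using the GPFSEL, GPSET, and GPCLR layout from the BCM2835-style GPIO block:

```c
#include <stdint.h>

struct gpio_pin_location {
    uint32_t fsel_offset; /* GPFSELn offset from GPIO base */
    uint32_t fsel_shift;  /* bit position of the 3-bit function field */
    uint32_t set_offset;  /* GPSETn offset from GPIO base */
    uint32_t clr_offset;  /* GPCLRn offset from GPIO base */
    uint32_t bit_mask;    /* bit for this pin in GPSETn/GPCLRn */
};

static inline struct gpio_pin_location gpio_locate(unsigned pin)
{
    struct gpio_pin_location loc;
    loc.fsel_offset = (pin / 10u) * 4u;         /* GPFSEL0 is at +0x00 */
    loc.fsel_shift  = (pin % 10u) * 3u;
    loc.set_offset  = 0x1Cu + (pin / 32u) * 4u; /* GPSET0 is at +0x1C */
    loc.clr_offset  = 0x28u + (pin / 32u) * 4u; /* GPCLR0 is at +0x28 */
    loc.bit_mask    = UINT32_C(1) << (pin % 32u);
    return loc;
}
```

For pin 47 this reproduces the constants used above: GPFSEL4 at +0x10 with a shift of 21, GPSET1 at +0x20, GPCLR1 at +0x2C, and bit 15.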
7. Common Pitfalls & Debugging
7.1 Build Issues
| Symptom | Cause | Fix |
|---|---|---|
| undefined reference to _start | Wrong linker script | Check ENTRY(_start) |
| kernel8.img is 0 bytes | objcopy failed | Check section names in linker script |
| cannot find -lgcc | Missing compiler libs | Use -nostdlib flag |
| relocation truncated | Code too far from address | Use -fpic or check linker script |
7.2 Boot Issues
| Symptom | Cause | Fix |
|---|---|---|
| Rainbow screen | Kernel not found | Check kernel8.img name, FAT32 format |
| Hangs with blank screen | Crash before UART | Add LED blink at _start |
| Pi doesn’t power on | SD card issue | Verify firmware files present |
| Boot loops | Kernel crashes immediately | Check stack pointer setup |
7.3 UART Issues
| Symptom | Cause | Fix |
|---|---|---|
| No output at all | GPIO not configured | Set ALT0 function for GPIO 14/15 |
| Garbled output | Wrong baud rate | Recalculate IBRD/FBRD |
| Missing characters | TX FIFO overflow | Wait for FIFO not full before write |
| Only first char shows | Missing loop in puts | Check string iteration |
| Works in QEMU, not hardware | Pull-up/down not configured | Implement GPPUD sequence (Pi 3); the Pi 4 uses the GPIO_PUP_PDN_CNTRL registers instead |
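Two rows in the table above (missing characters and only the first character showing) come down to the transmit path. Here is a minimal sketch of that path, assuming a PL011 UART with the data register at offset 0x00, the flag register at 0x18, and the TX-FIFO-full flag in bit 5. The names uart_putc_at and uart_puts_at are hypothetical; they take the UART base as a parameter so the logic can also be exercised against a plain array on a development machine.

```c
#include <stdint.h>

/* PL011 register indices in 32-bit words from the UART base. */
enum { UART_DR = 0x00 / 4, UART_FR = 0x18 / 4 };
#define UART_FR_TXFF (1u << 5)   /* Transmit FIFO full */

/* Send one character, waiting first until the TX FIFO has room. */
void uart_putc_at(volatile uint32_t *uart, char c)
{
    while (uart[UART_FR] & UART_FR_TXFF) {
        /* Spin: writing while the FIFO is full would drop the character */
    }
    uart[UART_DR] = (uint32_t)(unsigned char)c;
}

/* Send a NUL-terminated string, expanding '\n' to "\r\n" for terminals. */
void uart_puts_at(volatile uint32_t *uart, const char *s)
{
    for (; *s != '\0'; s++) {
        if (*s == '\n')
            uart_putc_at(uart, '\r');
        uart_putc_at(uart, *s);
    }
}
```

On hardware the base pointer would be the PL011 address (PERIPHERAL_BASE + 0x201000 on the Pi 3 and Pi 4). The loop advances the pointer on every iteration, which is the step that is missing when only the first character appears.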
7.4 Multi-Core Issues
| Symptom | Cause | Fix |
|---|---|---|
| Output repeated 4x | All cores running main | Add core ID check in _start |
| Random crashes | Stack corruption: all cores share one stack | Park secondary cores, or give each core its own stack region |
| Inconsistent behavior | Race conditions | Park secondary cores with WFE |
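If secondary cores are ever woken rather than parked, each one needs its own stack. The sketch below shows the two calculations involved, assuming the core number sits in affinity level 0 of MPIDR_EL1 and that the stacks are laid out back to back. CORE_STACK_SIZE, core_id_from_mpidr and stack_top_for_core are illustrative names, and the 16 KiB size is an arbitrary example.

```c
#include <stdint.h>

#define CORE_COUNT      4u
#define CORE_STACK_SIZE 0x4000u   /* 16 KiB per core, example value only */

/* Core number from an MPIDR_EL1 value: affinity level 0 is bits [7:0]. */
uint32_t core_id_from_mpidr(uint64_t mpidr)
{
    return (uint32_t)(mpidr & 0xFF);
}

/*
 * Initial stack pointer for `core`, given the lowest address of a region
 * holding CORE_COUNT stacks back to back. Stacks grow downward, so core 0
 * owns [base, base + CORE_STACK_SIZE) and starts at base + CORE_STACK_SIZE.
 */
uintptr_t stack_top_for_core(uintptr_t stacks_base, uint32_t core)
{
    return stacks_base + (uintptr_t)(core + 1u) * CORE_STACK_SIZE;
}
```

In a real _start the MPIDR_EL1 value would be read with an mrs instruction and the result loaded into sp before any C code runs. This C version only demonstrates the arithmetic.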
7.5 Memory Issues
| Symptom | Cause | Fix |
|---|---|---|
| Global variables corrupted | BSS not cleared | Add BSS clearing loop in _start |
| Function calls crash | Stack pointer wrong | Check SP initialization |
| Strings show garbage | .rodata not in binary | Check linker script sections |
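The BSS fix in the table is usually a short loop run before any code reads a global. The following is a sketch in C, assuming both bounds are 8-byte aligned. The name clear_bss is hypothetical, and in a real kernel the bounds would come from linker-script symbols (names such as __bss_start and __bss_end are assumptions here).

```c
#include <stdint.h>

/* Zero every 64-bit word in [start, end). Both bounds must be 8-byte aligned. */
void clear_bss(uint64_t *start, uint64_t *end)
{
    /*
     * The volatile pointer keeps the compiler from turning this loop into
     * a call to memset, which a -nostdlib build may not provide.
     */
    for (volatile uint64_t *p = start; p < end; p++) {
        *p = 0;
    }
}
```

Many projects do this step in the assembly startup code instead. The C form works as long as it runs before anything depends on zero-initialized globals.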
8. Extensions & Challenges
Once “Hello World” works, try these extensions:
8.1 Easy Extensions
- Implement uart_getc() — Read characters from serial
- Echo terminal — Echo typed characters back
- Blink ACT LED — Visual feedback without serial
8.2 Intermediate Extensions
- Print hex numbers — Implement simple hex formatting
- Simple command parser — “led on”, “led off”, “help”
- System information — Print CPU ID, memory size from mailbox
- Timer-based LED blink — Use ARM timer instead of delay loop
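For the hex-printing extension, one possible starting point is a formatter that fills a caller-supplied buffer, which can then be passed to uart_puts. This is a sketch only: format_hex64 is a hypothetical name, and fixed-width output with leading zeros is a design choice made here for simplicity.

```c
#include <stdint.h>

/* Write `value` as "0x" followed by 16 hex digits and a NUL (19 bytes total). */
void format_hex64(uint64_t value, char out[19])
{
    static const char digits[] = "0123456789abcdef";

    out[0] = '0';
    out[1] = 'x';
    for (int i = 0; i < 16; i++) {
        /* Most significant nibble first: shifts of 60, 56, ..., 0 */
        out[2 + i] = digits[(value >> (60 - 4 * i)) & 0xF];
    }
    out[18] = '\0';
}
```

Fixed width keeps the code free of division and of any dependency on a C library, and makes register dumps line up in the terminal.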
8.3 Advanced Extensions
- Framebuffer output — Print to screen via mailbox/framebuffer
- USB keyboard input — Much harder, requires USB stack
- Multi-core activation — Wake secondary cores, give them tasks
- Exception handlers — Catch and report exceptions
- Simple shell — Command line with history
8.4 Project Ideas Using This Foundation
- Bare metal music player — PWM audio output
- Logic analyzer — GPIO sampling with timing
- LED matrix controller — SPI or bit-banged output
- Temperature logger — I2C sensor reading
- Mini operating system — Memory management, task switching
9. Real-World Connections
9.1 Where This Knowledge Applies
| Domain | Application |
|---|---|
| Embedded Systems | IoT firmware, sensor nodes, industrial controllers |
| Operating Systems | Linux kernel development, boot process |
| Bootloaders | U-Boot, Raspberry Pi bootloader |
| Security Research | Firmware analysis, rootkit development |
| Game Development | Retro console homebrew |
| Automotive | ECU firmware, CAN bus controllers |
| Aerospace | Flight controllers, satellite systems |
9.2 Production Examples
Raspberry Pi Foundation’s own kernel:
- Same boot process
- Same UART initialization
- Goes on from there to bring up a full Linux system
U-Boot bootloader:
- Similar early initialization
- Loads and boots Linux
- Used on most ARM boards
FreeRTOS on Pi:
- Similar bare metal start
- Adds scheduler, tasks
- Real-time applications
9.3 Career Relevance
Job roles that use this:
- Embedded Systems Engineer
- Firmware Developer
- OS/Kernel Developer
- Security Researcher
- IoT Developer
Interview topics from this project:
- “Explain the ARM boot process”
- “What is a linker script?”
- “How does memory-mapped I/O work?”
- “Why use volatile for hardware registers?”
10. Resources
10.1 Primary References
| Resource | Why It’s Useful |
|---|---|
| bztsrc/raspi3-tutorial | Excellent step-by-step bare metal tutorials |
| BCM2837 ARM Peripherals | Official peripheral documentation |
| ARM Architecture Reference Manual | Definitive ARM instruction reference |
| Valvers Pi Tutorial | Classic bare metal tutorial series |
10.2 Books
| Book | Chapters | Why |
|---|---|---|
| “Bare-metal programming for ARM” by Umanovskis | Ch. 3-5 | Free ebook, perfect for this project |
| “The Definitive Guide to ARM Cortex-M3/M4” | Ch. 1-4 | ARM architecture fundamentals (Cortex-M focused; this project targets Cortex-A) |
| “Operating Systems: Three Easy Pieces” | Ch. 1-6 | OS concepts that apply here |
10.3 Tools
| Tool | Purpose |
|---|---|
| aarch64-linux-gnu-gcc | Cross-compiler for ARM64 |
| qemu-system-aarch64 | ARM64 emulator |
| screen or minicom | Serial terminal |
| aarch64-linux-gnu-objdump | Disassembly and inspection |
| xxd | Hex dump for binary inspection |
10.4 Community
- OSDev Wiki — OS development knowledge base
- OSDev Forums — Community help
- r/osdev — Reddit community
- Raspberry Pi Forums — Hardware-specific help
11. Self-Assessment Checklist
Use this to verify your understanding before, during, and after the project:
Before Starting
- I can explain what bare metal programming means
- I understand why the Raspberry Pi GPU boots before the ARM CPU
- I know what a linker script does
- I understand the difference between ELF and raw binary formats
- I know what memory-mapped I/O means
During Implementation
- I can read the MPIDR_EL1 register to get core ID
- I understand why secondary cores must be parked
- I know why volatile is required for hardware registers
- I can calculate UART baud rate divisors
- I understand the GPIO function select register format
After Completion
- I can explain every line of my assembly startup code
- I know what would happen if I didn’t clear BSS
- I can describe the complete boot sequence from power-on to my first UART output
- I can debug UART issues systematically
- I could implement UART on a different ARM board using a different datasheet
Extension Readiness
- I could add interrupt handling to this kernel
- I understand how to access the mailbox for system information
- I could implement a simple timer-based scheduler
- I know what additional steps would be needed for MMU/paging
12. Completion Criteria
You have successfully completed this project when:
Minimum Requirements
- Builds without errors: make produces kernel8.img
- Boots in QEMU: qemu-system-aarch64 -M raspi3b -kernel kernel8.img -serial stdio shows output
- Prints message: “Hello from bare metal!” appears in serial terminal
- Works on hardware: Same output when running on physical Pi 3/4
- Clean halt: After printing, the kernel parks in an infinite loop without crashing
Quality Criteria
- Code is readable: Comments explain non-obvious operations
- Proper separation: Startup assembly, main kernel, UART driver are distinct
- No hardcoded magic: Constants are named and documented
- Portable base addresses: Easy to switch between Pi 3 and Pi 4
Understanding Criteria
- Can explain boot sequence: From power-on to your first UART output
- Understands linker script: Can modify memory layout if needed
- Knows register purpose: Can describe each UART register used
- Can debug issues: Systematic approach to fixing problems
Stretch Goals
- Echo terminal: Receives and echoes typed characters
- LED feedback: ACT LED indicates system state
- Pi 3 and Pi 4: Same code works on both (peripheral base detection)
- System info: Prints CPU type and memory size via mailbox
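For the Pi 3 and Pi 4 stretch goal, one common approach to peripheral base detection is to look at the CPU part number in MIDR_EL1, since the Pi 3 uses Cortex-A53 cores and the Pi 4 uses Cortex-A72 cores. The sketch below assumes that approach. The macro and function names are hypothetical, and the Pi 4 base shown is the one used in the default low-peripheral mode.

```c
#include <stdint.h>

#define PERIPHERAL_BASE_PI3 0x3F000000u   /* BCM2837 */
#define PERIPHERAL_BASE_PI4 0xFE000000u   /* BCM2711, low-peripheral mode */

#define MIDR_PARTNUM_CORTEX_A53 0xD03u    /* Pi 3 */
#define MIDR_PARTNUM_CORTEX_A72 0xD08u    /* Pi 4 */

/* Pick a peripheral base from a MIDR_EL1 value; returns 0 for an unknown CPU. */
uint32_t peripheral_base_from_midr(uint64_t midr)
{
    uint32_t partnum = (uint32_t)((midr >> 4) & 0xFFF);   /* PartNum is bits [15:4] */

    switch (partnum) {
    case MIDR_PARTNUM_CORTEX_A53: return PERIPHERAL_BASE_PI3;
    case MIDR_PARTNUM_CORTEX_A72: return PERIPHERAL_BASE_PI4;
    default:                      return 0;
    }
}
```

On hardware the input would come from reading MIDR_EL1 with an mrs instruction early in startup. Returning 0 for an unrecognized part lets the caller halt instead of writing to a guessed address.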
Summary
This project takes you from zero to running your own code on a Raspberry Pi with no operating system. You’ve learned:
- The Pi boot process — GPU loads firmware, then your kernel
- ARM64 assembly basics — Core parking, stack setup, BSS clearing
- Linker scripts — Controlling memory layout for bare metal
- Memory-mapped I/O — Accessing hardware through physical addresses
- UART programming — Serial communication for debugging
This is the foundation for all embedded systems and operating system development. Every OS kernel, every bootloader, every firmware project starts with code like this.
Next steps:
- Project 5: x86 Bootloader (learn another architecture)
- Project 9: STM32 Bare Metal (professional ARM Cortex-M)
- Project 10: Simple Kernel with Multitasking (use this as a base)
The transition from “code that runs on an OS” to “code that IS the first thing running” is a fundamental shift in understanding. You now have that understanding.
Time to boot into your own code. Good luck!