Learn Bare Metal Programming: From Blinking LEDs to Operating Systems
Goal: Build a mental model for code that runs with zero OS help: bring-up from reset vector to main(), place code with linker scripts, drive peripherals through MMIO, handle interrupts deterministically, and debug failures with UART/JTAG so you can bootstrap real systems (MCUs, Raspberry Pi, x86 PCs, UEFI apps).
Introduction
Bare metal programming means your code is the first thing the CPU executes after power-on—no kernel, no runtime, no drivers. You own the reset vector, the memory map, the interrupt table, and every register write. This guide turns that responsibility into a repeatable pipeline you can reproduce across MCUs, ARM SoCs, and x86 PCs.
What you’ll build across the projects:
- Minimal firmware that blinks LEDs, prints over UART, and configures timers
- Boot sectors and kernels that switch CPU modes and map memory
- Drivers for interrupts, storage, and graphics in pre-OS environments
- A small multitasking kernel that schedules work without an OS
Big picture (the control you take over):
Power → Reset → Startup code → Memory map → Peripherals → Interrupts → Scheduler/Services
│ │ │ │ │
└─ you own all of this, there is no OS safety net
How to Use This Guide
- Read the primer first: Internalize the pipeline (reset → memory → interrupts → timing → toolchain → debugging).
- Map concepts to projects: Use the Project-to-Concept Map to pick the next build.
- Work in loops: Build a minimal slice, verify with a physical signal (GPIO/UART), then extend.
- Use layered hints: Each project has progressive hints; check them only after you form a hypothesis.
- Validate: Use the Definition of Done checklists and “Common Pitfalls” as your test plan.
Prerequisites & Background
Essential Prerequisites (Must Have)
- Confident with C (pointers, structs, bitwise ops) and basic assembly reading
- Comfortable reading datasheets/reference manuals and wiring a serial console
- Familiar with Make/CMake basics and cross-compiling
Helpful But Not Required
- OS concepts (interrupts, paging, scheduling)
- Digital logic (registers, buses, pull-ups/downs)
- Exposure to GDB, OpenOCD, or QEMU
Self-Assessment Questions
- Can you explain the difference between stack, heap, and .bss?
- Can you locate a peripheral register map in a datasheet and decode bitfields?
- Can you write and flash a C program with -ffreestanding and no libc?
- Do you know what a vector table is and how an ISR returns?
Development Environment Setup
- Toolchains: avr-gcc, arm-none-eabi-gcc, nasm, qemu-system-*
- Debug: gdb-multiarch, OpenOCD/JTAG/SWD, UART adapter (screen/minicom)
- Flashing/Sim: avrdude, st-flash, qemu-system-{i386,aarch64}
Time Investment (realistic)
- Projects 1–3: 3–4 weeks
- Projects 4–9: 2–3 months
- Projects 10–12: 2–4 months
Important Reality Check
Most failures are address/clock mistakes, not logic bugs. Expect to read schematics, map registers, and prove each step with a physical or simulated signal.
Big Picture / Mental Model
Power-on / Reset
│
▼
Vector table / Reset handler (sets SP, clears .bss, copies .data)
│
▼
Clock + Memory map (flash, RAM, peripherals)
│
▼
Peripherals via MMIO (GPIO, UART, timers, DMA)
│
▼
Interrupts/Exceptions (vector -> ISR -> acknowledge)
│
▼
Scheduler/Services (tasks, paging, filesystem)
│
▼
User code / demos
If any layer is wrong, nothing “crashes nicely”—it just stops. Always add one verified signal per layer (GPIO toggle, UART byte, timer tick, ISR counter).
Theory Primer (Mini-Book)
Each concept below has its own mini-chapter. Every concept listed in the Concept Summary Table is mapped here and back to projects.
Concept 1: Power-On, Reset, and Startup Chain
Fundamentals (what & why)
The reset chain is the CPU’s on-ramp: a fixed address (reset vector) points to startup code that sets the stack, clears .bss, copies .data, parks extra cores, and jumps into C. Without this, the first C instruction runs with random RAM, wrong stack, and undefined peripherals. This chain is short but decisive: a single misplaced vector or uncleared section causes silent lockups. Treat it as a checklist: known entry point, valid stack in writable RAM, memory sanitized, minimal clock policy, then hand-off to C.
Deep Dive (step-by-step, ~500+ words)
The CPU begins in a deterministic state defined by its architecture: x86 starts in real mode and fetches its first instruction from the reset vector (0xFFFF0 on the original 8086, 0xFFFFFFF0 on later CPUs); ARM application processors start at SoC-defined addresses (e.g., the Raspberry Pi 3 firmware loads a 64-bit kernel at 0x80000); Cortex-M loads the initial SP and reset handler from the vector table in flash. The startup code must: (1) load a valid stack pointer that points to writable RAM; (2) zero .bss so globals start clean; (3) copy .data from flash to RAM so initialized globals have the right values; (4) configure minimal clocks if required for memory/flash access; (5) optionally disable/park secondary cores; (6) branch to main. Linker scripts place .text, .data, .bss, and stacks at exact addresses so these steps work. A mental invariant: every byte the CPU might fetch must already be mapped and executable; every byte it might write must live in writable memory with clocks enabled. On x86, BIOS/UEFI provides early services, but you still control the boot sector (real mode) or PE/COFF entry (UEFI) and must handle mode switches. On MCUs, reset hits the vector table directly, with no firmware safety net, so the first instruction must be valid and correctly aligned. Failure modes: wrong origin in linker script (CPU executes garbage), uninitialized SP (stack corruption before main), missing .bss clear (stale data), wrong load address or length when copying sections, or leaving caches/MPU in undefined states. Verification strategy: (a) toggle a GPIO at _start to prove execution reached your code; (b) emit a UART byte before calling main; (c) dump the linker map to confirm section placement; (d) single-step _start in QEMU/GDB to ensure SP, .bss, and .data are correct. Cross-platform notes: Cortex-M demands a correctly aligned vector table and optional VTOR relocation; ARMv8-A bootloaders may start in EL3/EL2 and need exception level configuration; x86 requires segment setup before switching to protected/long mode. In all cases, the startup chain is a deterministic finite sequence: make it explicit, measurable, and repeatable.
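Steps (2) and (3) of the startup sequence can be sketched in C. This is a minimal sketch under stated assumptions, not a definitive implementation: the function name `startup_init_memory` and its pointer parameters are hypothetical stand-ins for the linker symbols (`_sdata`, `_edata`, `_sbss`, `_ebss`, plus a flash load address) that a real startup file would reference directly. Passing them as parameters lets the logic be exercised on a host before it runs on a board.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: on real hardware the five pointers would come from
 * linker-script symbols; here they are parameters so the copy/zero logic
 * can be tested on a host machine. */
void startup_init_memory(const uint32_t *data_load,  /* .data image in flash */
                         uint32_t *data_start,       /* .data start in RAM   */
                         uint32_t *data_end,         /* one past .data end   */
                         uint32_t *bss_start,        /* .bss start in RAM    */
                         uint32_t *bss_end)          /* one past .bss end    */
{
    assert(data_start <= data_end && bss_start <= bss_end);

    /* Copy initialized globals from their load address to their run address. */
    for (uint32_t *dst = data_start; dst < data_end; ++dst) {
        *dst = *data_load++;
    }

    /* Zero-fill .bss so zero-initialized globals really start at zero. */
    for (uint32_t *dst = bss_start; dst < bss_end; ++dst) {
        *dst = 0u;
    }
}
```

Word-sized copies assume the linker script aligns the section boundaries to 4 bytes; a byte-wise loop would be the safer choice if that is not guaranteed.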
Practical “what can go wrong” catalog: caches may retain stale instructions if you relocate code; some SoCs require cache invalidation or branch predictor disable during early boot. Flash wait states may need configuration before you raise the core clock; if they are too low, corrupted instruction fetches look like random hangs. Multi-core bring-up (Raspberry Pi, SMP x86) requires holding secondary cores in a tight loop at a known address until memory and page tables are stable. On UEFI, the firmware loads your PE/COFF image and hands you rich services, but once you call ExitBootServices you own the MMU, APIC, and memory map; failing to copy the map or pin stacks causes elusive faults. Tooling influences startup: linker relaxation can change code size and addresses; LTO and section garbage collection can remove seemingly “unused” symbols (like your vector table) unless marked __attribute__((used, section(".vectors"))). Secure boot flows insert another step (signature verification or measured boot), which means the very first instruction you own might be in a verified payload, not in raw flash, and you must maintain that chain of trust when handing off to the next stage. Supply dips during power-up can cause brown-out resets or, if the brown-out detector (BOR) is not configured, erratic execution that mimics code bugs; make brown-out detection part of startup policy. Finally, document invariants: the exact stack top address, the expected size of .bss, and the reset cause register. These become your regression checklist when porting to a new board: if the reset cause shows watchdog instead of POR, revisit clock/voltage; if .bss size changes unexpectedly, inspect new globals.
Portability tactics: keep startup code tiny and architecture-specific, but keep policy (clock values, memory sizes) in a single header so ports are data-driven. For x86, consider using multiboot2 headers so you can boot via GRUB in addition to raw BIOS; for ARM, provide both position-dependent and position-independent _start options in case the ROM loader relocates you. Add assertions: compare _edata - _sdata to the copy size you emit; zero .bss with an inline loop that double-checks it stayed zeroed when you return; emit a “magic” pattern to the top of stack and verify it after boot to detect overflow. Capture the reset reason in a global variable and print it—brown-out, watchdog, software reset—so intermittent failures are visible. By codifying these habits you turn early boot from folklore into a testable, repeatable sequence you can reuse on every board.
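The stack "magic pattern" check mentioned above can be sketched as follows. This is a hedged illustration, not the guide's required method: `STACK_MAGIC`, `stack_paint`, and `stack_words_unused` are hypothetical names, and the sketch assumes a descending stack whose lowest address is passed in explicitly. On real hardware the painting would run in startup code before the region is in use.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define STACK_MAGIC 0xC0DEC0DEu  /* hypothetical fill pattern */

/* Fill a not-yet-used stack region with a known pattern. */
void stack_paint(uint32_t *lowest, size_t words)
{
    assert(lowest != NULL);
    for (size_t i = 0; i < words; ++i) {
        lowest[i] = STACK_MAGIC;
    }
}

/* Count untouched words starting from the low end of a descending stack.
 * A result of 0 means the word at the very bottom was overwritten, so the
 * stack has reached (or passed) its limit. */
size_t stack_words_unused(const uint32_t *lowest, size_t words)
{
    size_t unused = 0;
    while (unused < words && lowest[unused] == STACK_MAGIC) {
        ++unused;
    }
    return unused;
}
```

Printing `stack_words_unused()` over UART after boot, and again after stress tests, gives a rough high-water mark; a value that trends toward zero suggests the stack allocation should grow.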
How this fits in projects: Projects 4–8 (Pi boot, x86 bootloader, protected mode, paging) start here; Projects 1–3 rely on correct MCU reset vectors; Project 12 depends on UEFI entry discipline.
Definitions & key terms: reset vector, vector table, _start, .bss, .data, stack pointer, linker script, origin, entry point, VTOR.
Mental model diagram
Reset vector -> _start -> set SP -> zero .bss -> copy .data -> minimal clocks -> jump main
How it works (succinct): A pointer in ROM directs execution; startup code prepares RAM and environment; hand-off to C occurs only after memory is sane.
Minimal example (C/ASM sketch):
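    .section .text.start
    .global _start
_start:
    ldr sp, =_stack_top       @ set stack pointer (symbol from linker script)
    bl  zero_bss_copy_data    @ clear .bss and copy .data before any C runs
    bl  main                  @ hand off to C
    b   .                     @ spin here if main ever returns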
Common misconceptions
- “The compiler handles startup.” → Not when you use -nostdlib/-ffreestanding.
- “BSS can be skipped.” → Uninitialized globals will hold random data.
- “Linker scripts are optional.” → Without correct origins, the CPU executes garbage.
Check-your-understanding
- Q: What happens if SP points to flash? A: Pushes target read-only memory, so the writes are dropped or fault; the first return or local-variable read then fails.
- Q: Why clear .bss? A: The language guarantees zero-initialized globals; logic depends on it.
- Q: Why park secondary cores? A: Prevent them from running stale vectors and corrupting state.
Real-world applications: Firmware bring-up, trusted boot, secure boot chains, RTOS startups.
Where you’ll apply it: Projects 1–8, 10, 12.
References: “Bare Metal C” (Ch. 1–3), “Write Great Code Vol. 1” (Ch. 1–3), vendor datasheets.
Key insight: Startup is deterministic; prove each step with a signal before layering complexity.
Summary: Place code correctly, set SP, sanitize memory, then jump to C—nothing else runs first.
Homework & solutions
- Exercise: Write a startup that only toggles a GPIO, no C. Solution: Set SP, enable clock, set DDR, toggle pin, loop.
- Exercise: Modify linker script origin by +0x100 and observe failure. Solution: CPU fetches wrong address → no UART → fix origin.
Concept 2: Memory Maps & MMIO Discipline
Fundamentals
A memory map is a contract: every address corresponds to flash, RAM, peripheral registers, or is reserved. MMIO makes peripherals appear as memory, but reads/writes have side effects. Correctness depends on volatile access, proper widths, and respecting reserved bits. The discipline is to know the base, mask only the bits you intend to change, verify write-readback, and ensure clocks/bridges are enabled so the contract is actually live.
Deep Dive (~500+ words)
Memory maps differ per platform: AVR has flat 8-bit I/O space; Cortex-M maps peripherals in the 0x4000_0000 region; x86 uses memory plus port I/O. The linker script must place .text in executable flash/ROM, .data and .bss in RAM, stacks at safe high addresses, and optionally map kernel virtual addresses (paging). MMIO discipline: mark pointers volatile, use read-modify-write to avoid clobbering neighboring bits, and ensure the bus is clocked/enabled before access (e.g., RCC on STM32, the RESETS and CLOCKS blocks on RP2040, PCI/ACPI enabling on PCs). Endianness and alignment matter: misaligned or wrong-width accesses can bus-fault MCUs or produce device-specific misbehavior on x86. Cache and buffering layers (write combining, store buffers) mean you sometimes need barriers (DSB/ISB on ARM, mfence on x86) around critical MMIO sequences. Address decoding errors manifest as constant 0x0/0xFFFF reads, bus faults, or lockups; fix by checking base addresses, enabling clocks, and verifying the memory map against the datasheet. For paging projects, virtual memory introduces another layer: virtual→physical translations through page directories/tables; identity-map early code before enabling paging to avoid triple faults. Safety patterns: (1) define registers as structs of volatile uint32_t; (2) precompute masks; (3) audit reset values and write sequences; (4) add instrumentation (UART prints or GPIO toggles) after each register write to prove it stuck; (5) when uncertain, read back the register and log it. On MCUs, peripheral offsets are small and predictable; on SoCs/PCs, MMIO can be behind bridges with different ordering rules. MMIO is your “syscall interface” to hardware. Treat it with the same rigor as kernel system calls: validate inputs, keep critical sequences short, and document required timing/ordering.
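Safety patterns (1), (2), and (5) can be combined in one small helper. The sketch below is illustrative only: `demo_periph_t` and its field names are hypothetical and not taken from any datasheet, and `reg_update` is an assumed helper name. It shows a struct of volatile registers, a masked read-modify-write, and a read-back check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical register block; field names and order are illustrative. */
typedef struct {
    volatile uint32_t MODE;   /* offset 0x00 */
    volatile uint32_t STATUS; /* offset 0x04 */
    volatile uint32_t DATA;   /* offset 0x08 */
} demo_periph_t;

/* Read-modify-write: change only the bits in `mask`, then read back to
 * confirm the write stuck. Returns false if the read-back disagrees. */
bool reg_update(volatile uint32_t *reg, uint32_t mask, uint32_t value)
{
    assert((value & ~mask) == 0u);   /* value must fit inside the mask */
    uint32_t tmp = *reg;             /* read   */
    tmp = (tmp & ~mask) | value;     /* modify */
    *reg = tmp;                      /* write  */
    return (*reg & mask) == value;   /* verify */
}
```

On hardware, a `false` return is a useful early signal that the peripheral clock is off or the base address is wrong. Note that the read-back check is not appropriate for write-only or write-1-to-clear registers, where reading back a different value is expected.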
Deeper practice: always draw the memory map with exact ranges—flash, SRAM, peripheral windows, reserved holes. Mark which regions are executable, cacheable, bufferable, device-type. For Cortex-A and x86, memory attributes in page tables matter: mapping device memory as cacheable can reorder writes and stall peripherals; mapping normal memory as device can tank performance. Device errata sometimes require specific access widths (16-bit writes only); violating them yields silent failures. On PCI/SoC buses, MMIO may traverse bridges with posted writes—your write returns before hardware sees it; insert read-backs or barriers where necessary. On MCUs with bit-banding, consider atomic bit set/clear to avoid RMW races. When adding virtual memory, preserve identity maps for early handlers and map stacks as guard-paged regions to catch overflows. A disciplined approach is to create a single header describing all addresses and masks; every project in this guide can then reuse the same contract. Testing MMIO: use a logic analyzer to observe pin toggles when writing GPIO, or read known ID registers (e.g., STM32 DBGMCU_IDCODE, PCI vendor/device IDs) to validate base addresses. When things read as 0xFFFF, confirm that the peripheral clock is enabled and that you are not accidentally reading from a bus that requires aligned halfword/word access. For storage and PCIe, remember that MMIO BARs are discovered at runtime—your driver must parse configuration space, not hardcode addresses. For DMA-driven peripherals, your memory map also includes coherency domains: buffers may need to be uncached or cache-flushed to ensure devices see the right data. Every MMIO bug you fix should add a rule of thumb to your checklist: enable clock, confirm base, use correct width, RMW carefully, read-back, and if paged, map with the right attributes.
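One way to make the "draw the memory map with exact ranges" habit executable is a small table that code and tests can query. The ranges below are assumptions loosely modeled on a Cortex-M part and must be replaced with values from your own datasheet; `region_t`, `memory_map`, and `classify_address` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { REGION_FLASH, REGION_SRAM, REGION_PERIPH, REGION_RESERVED } region_kind_t;

typedef struct {
    uint32_t      base;
    uint32_t      size;
    region_kind_t kind;
} region_t;

/* Assumed example ranges; replace with the values from your datasheet. */
static const region_t memory_map[] = {
    { 0x08000000u, 0x00100000u, REGION_FLASH  },  /* 1 MiB flash        */
    { 0x20000000u, 0x00020000u, REGION_SRAM   },  /* 128 KiB SRAM       */
    { 0x40000000u, 0x20000000u, REGION_PERIPH },  /* peripheral window  */
};

/* Return the kind of region that contains `addr`, or REGION_RESERVED. */
region_kind_t classify_address(uint32_t addr)
{
    for (size_t i = 0; i < sizeof memory_map / sizeof memory_map[0]; ++i) {
        assert(memory_map[i].size != 0u);  /* table sanity check */
        /* Subtract first so the range comparison cannot overflow. */
        if (addr >= memory_map[i].base &&
            addr - memory_map[i].base < memory_map[i].size) {
            return memory_map[i].kind;
        }
    }
    return REGION_RESERVED;
}
```

A debug build could call `classify_address()` before dereferencing a computed pointer and halt with a UART message instead of taking an opaque bus fault.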
Advanced pitfalls: overlapping mappings in paging can alias two virtual addresses to the same physical region, leading to surprising cache interactions; always invalidate TLBs after changing page tables. On systems with MPU instead of MMU, region size and alignment constraints (power-of-two, base aligned) can force you to reorganize memory layout—plan early. Pay attention to access permissions: marking peripheral regions as executable by accident can open security holes; marking flash as writable can corrupt code during stray writes. In mixed-endian peripherals, convert explicitly when reading multi-byte registers. If you must share buffers between DMA and CPU caches, use cache maintenance primitives or place buffers in non-cacheable regions; failing to do so yields “heisenbugs” where data looks correct in RAM but not on the wire. Finally, annotate every register write in code with the reset value and the reason for each bit—future-you (or reviewers) can then audit quickly when hardware revisions arrive.
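The MPU constraint described above (power-of-two size, base aligned to size) is easy to check before committing to a layout. This is a minimal sketch with hypothetical helper names; the minimum region size is passed in as a parameter because it is device-specific and should be taken from the reference manual.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if `size` is a power of two of at least `min_size` bytes and
 * `base` is aligned to `size`. */
bool mpu_region_valid(uint32_t base, uint32_t size, uint32_t min_size)
{
    assert(min_size != 0u);
    bool pow2 = size != 0u && (size & (size - 1u)) == 0u;
    return pow2 && size >= min_size && (base & (size - 1u)) == 0u;
}

/* Round a requested size up to the next power of two.
 * Returns 0 if the result would not fit in 32 bits. */
uint32_t mpu_round_up_size(uint32_t size)
{
    uint32_t p = 1u;
    while (p < size) {
        if (p > UINT32_MAX / 2u) {
            return 0u;
        }
        p <<= 1;
    }
    return p;
}
```

For example, a 24 KiB buffer would need a 32 KiB region, and its base address would then have to sit on a 32 KiB boundary, which is the kind of layout pressure worth discovering before the linker script is finalized.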
How this fits in projects: Projects 1–3 (AVR GPIO/UART/timers), 4 (RPi UART), 5–8 (x86 VGA, PIC, paging), 9 (STM32 RCC/GPIO/NVIC), 11 (ATA PIO registers).
Definitions & key terms: MMIO, PIO, base address, offset, mask, volatile, barrier, endianness, alignment.
Mental model diagram
Address space
┌───────────────┬───────────────┬─────────────┬───────────────┐
│ Flash / ROM │ RAM │ Peripherals │ Reserved │
│ (exec) │ (R/W) │ (MMIO) │ (fault) │
└───────────────┴───────────────┴─────────────┴───────────────┘
How it works: CPU issues loads/stores; bus fabric routes them to RAM or peripheral IP blocks; side effects occur on write/read.
Minimal example:
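#include <stdint.h>
#define GPIOB_BASE  0x40020400u  /* assumed STM32F4-style GPIOB base; confirm in your datasheet */
#define GPIOB_MODER (*(volatile uint32_t *)(GPIOB_BASE + 0x00))
#define GPIOB_ODR   (*(volatile uint32_t *)(GPIOB_BASE + 0x14))
/* Inside an init function, after the GPIOB peripheral clock is enabled: */
GPIOB_MODER = (GPIOB_MODER & ~(3u << 10)) | (1u << 10); // PB5 mode bits [11:10] = 01 (output)
GPIOB_ODR ^= (1u << 5); // toggle PB5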
Common misconceptions
- “Reading back is optional.” → Many peripherals require read-back to clear or to confirm.
- “Any width works.” → Some registers require 16/32-bit aligned access.
- “Volatile is a hint.” → It is required for correctness with MMIO.
Check-your-understanding
- Q: Why identity-map early code before paging? A: So current PC/SP remain valid when paging enables.
- Q: What happens if you write without enabling the peripheral clock? A: Writes are ignored; reads return default/bus fault.
- Q: Why RMW instead of direct write? A: To avoid clearing adjacent control bits.
Real-world applications: GPIO control, UART baud setup, timer PWM, DMA configuration, PCIe BAR mapping.
Where you’ll apply it: Projects 1–4, 7–9, 11.
References: Datasheets, “Making Embedded Systems” (Ch. 2), “Bare Metal C” (Ch. 2), OSDev MMIO pages.
Key insight: Treat addresses as contracts; prove each register write with a read-back or external signal.
Summary: Correct memory maps + disciplined volatile access turn raw addresses into predictable hardware control.
Homework & solutions
- Exercise: Derive GPIO base + offset for an LED and toggle it; verify with scope. Solution: Enable clock, set MODER, write ODR, observe waveform.
- Exercise: Modify one bit with RMW vs direct write; observe unintended side effects. Solution: Direct write clears other bits; RMW preserves them.
Concept 3: Interrupts, Exceptions, and Timing
Fundamentals
Interrupts deliver hardware reality (timers, UART RX, buttons) into software asynchronously. Correct handling requires a vector table, ISR prologue/epilogue, masking/acknowledging sources, and predictable latency. Timing builds on this via hardware timers, prescalers, and compare units that make time measurable without busy loops. Determinism is the goal: fixed tick rates, bounded ISR time, and measurable jitter that you can constrain.
Deep Dive (~500+ words)
An interrupt is a hardware request that diverts execution through a vector entry to an ISR. The CPU saves a minimal context automatically (x86 pushes FLAGS, CS, and IP; Cortex-M stacks R0–R3, R12, LR, PC, and xPSR), jumps to the handler, which saves anything else it uses, acknowledges the source, and returns. Interrupt controllers (NVIC on Cortex-M, GIC on Cortex-A, PIC/APIC on x86) prioritize, mask, and route interrupts. Failure patterns: wrong vector address, unaligned table, not enabling the controller, forgetting global enable (e.g., sei() on AVR, cpsie i on ARM), or not acknowledging the source leading to repeated triggers. ISRs must be minimal: clear flags, enqueue work, and exit; heavy work moves to main loop or a scheduler. Timing: hardware timers divide a clock (with prescalers) and compare counts to generate periodic interrupts or PWM waveforms. Determinism comes from known tick rates, bounded ISR latency, and verified jitter (measure with GPIO toggles). Clocks feed timers; if the system clock differs from assumptions, baud rates and PWM frequencies drift. On x86, PIT (8253/8254) or HPET provide ticks; on MCUs, SysTick/Timer peripherals do. For preemption (Project 10), the timer ISR performs a context switch: save general registers, stack pointer, flags; pick next task; restore registers; return with iret/bx lr to the new task. Exceptions (faults) use the same mechanism but signal errors (divide-by-zero, page fault). Treat them as structured interrupts with richer diagnostics. Testing strategy: (1) install a single handler that flips a GPIO to prove vectors are correct; (2) log ISR entry counts over UART; (3) intentionally trigger a known interrupt (timer compare) to validate priorities; (4) measure latency/jitter with a scope or cycle counter. Concurrency hazards: shared state between ISR and main must use volatile and, when needed, atomic/critical sections. Nested interrupts require priorities and possibly tail-chaining (Cortex-M) awareness. In virtual memory systems, ISRs run with kernel mappings; ensure stacks are mapped and large enough. For PC hardware, remap the PIC to avoid conflicts with CPU exceptions (vectors 0–31 reserved). Deterministic timing also means avoiding long critical sections that block ISRs and carefully selecting prescalers to minimize error (e.g., 16 MHz / 64 = 250 kHz, so a compare value of 249 gives an exact 1 kHz tick). For PWM, align updates to safe phases to avoid glitches. The key invariant: each interrupt source has (a) a vector, (b) enable, (c) flag/ack, (d) priority, (e) handler that keeps latency bounded.
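The prescaler arithmetic can be checked in code before touching a timer register. The sketch below assumes a counter that resets on compare match (so it counts 0..N and the period is N+1 counts); `timer_compare_value` is a hypothetical helper name. A non-zero residual means the chosen prescaler cannot hit the requested tick rate exactly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Compute the compare value for a clear-on-match timer.
 * *residual_hz receives f_clk_hz modulo (prescaler * tick_hz); zero means
 * the tick rate is exact for this prescaler. */
uint32_t timer_compare_value(uint32_t f_clk_hz, uint32_t prescaler,
                             uint32_t tick_hz, uint32_t *residual_hz)
{
    assert(prescaler != 0u && tick_hz != 0u);

    uint64_t divisor = (uint64_t)prescaler * tick_hz;
    uint64_t counts  = f_clk_hz / divisor;   /* timer counts per tick */
    assert(counts >= 1u);                    /* tick rate must be reachable */

    if (residual_hz != NULL) {
        *residual_hz = (uint32_t)(f_clk_hz % divisor);
    }
    return (uint32_t)(counts - 1u);          /* counter runs 0..compare */
}
```

With a 16 MHz clock and a 1 kHz tick, a prescaler of 64 yields a compare value of 249 with zero residual, while a prescaler of 1024 yields 14 with a non-zero residual, meaning the tick would run fast. The caller still has to confirm the result fits the timer's width (8 or 16 bits on many MCUs).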
Practical determinism work: measure, don’t guess. Add a scope probe to a GPIO toggled at ISR entry/exit to quantify latency and jitter; capture logs over UART with timestamps from a hardware timer; run stress tests with nested interrupts to ensure priorities behave as expected. On Cortex-M, tail-chaining can hide latent latency if ISRs run back-to-back; design ISRs to be short enough that nesting isn’t required unless deliberate. On x86 SMP, an interrupt can arrive on any core, so decide whether to mask or use per-CPU IDTs/APIC routing. Timer selection impacts accuracy: HPET gives stable ticks but slower access; PIT is simple but low resolution; LAPIC timers are per-core and suitable for preemption. For PWM motor control, align compare updates with a shadow register or update-on-wrap to avoid audible glitches. For UART RX, decide between polling, interrupt-driven ring buffers, or DMA; each changes latency and CPU utilization. Fault handlers should dump registers, stack pointer, and faulting address, then halt with a known pattern on screen/UART so you can root-cause early boot faults. When timing drifts, re-check clock tree math and external crystal specs; a combined baud mismatch of more than a few percent shows up as garbled characters. Power-saving modes often gate timers/interrupts, so verify wake-up sources and that the vector table is still mapped after sleep. Always acknowledge interrupts in the order required by the controller (e.g., write EOI before leaving the ISR on PIC/APIC; clear peripheral flag first on many MCUs). Document maximum acceptable ISR runtime; if exceeded, move work to deferred contexts. This discipline keeps systems responsive and prevents rare “once-a-day” stalls in real deployments.
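Baud-rate error is one piece of clock tree math that can be computed up front. The sketch below assumes an AVR-style UART with 16x oversampling, where the divisor register holds f_clk / (16 * baud) - 1; the function names are hypothetical. Other UARTs use different formulas (fractional dividers, 8x oversampling), so treat this as an illustration of the method rather than a universal formula.

```c
#include <assert.h>
#include <stdint.h>

/* Nearest divisor register value for a 16x-oversampling UART (assumed model). */
uint32_t uart_divisor(uint32_t f_clk_hz, uint32_t baud)
{
    assert(baud != 0u);
    uint32_t div = (f_clk_hz + 8u * baud) / (16u * baud);  /* round to nearest */
    assert(div >= 1u);
    return div - 1u;
}

/* Error of the achieved baud rate relative to the target, in parts per
 * thousand (negative means the link runs slow). Integer math truncates
 * toward zero. */
int32_t uart_error_permille(uint32_t f_clk_hz, uint32_t baud, uint32_t divisor)
{
    assert(baud != 0u);
    int64_t actual = (int64_t)f_clk_hz / (16 * ((int64_t)divisor + 1));
    return (int32_t)(((actual - (int64_t)baud) * 1000) / (int64_t)baud);
}
```

For a 16 MHz clock, 9600 baud comes out with a divisor of 103 and a sub-percent error, while 115200 baud comes out with a divisor of 8 and an error of roughly -3.5%, which is large enough to be worth checking against the receiver's tolerance.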
To close the loop, create a test plan per source: (a) synthetic timer storm to measure maximum jitter; (b) UART flood at maximum baud to test RX buffer overflow handling; (c) spurious interrupt injection to ensure unhandled vectors fall back to a safe stub; (d) deliberate divide-by-zero/page-fault to verify exception paths. Use perf counters or cycle counters (DWT_CYCCNT on Cortex-M, rdtsc on x86) to benchmark ISR entry/exit. Track latency budgets in documentation—e.g., <10 µs worst-case for a 1 kHz control loop—and re-measure after code changes. For preemptive kernels, ensure scheduler data structures are interrupt-safe and that critical sections are short; if you disable interrupts too long, you violate real-time deadlines. Finally, remember that an interrupt system is also a security boundary: mask or gate debug-triggered interrupts in production, validate input lengths in ISR-driven protocols, and ensure fault handlers don’t leak secrets over UART.
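Test (b), the UART flood, needs an RX buffer whose overflow behavior is explicit. A minimal single-producer/single-consumer ring buffer sketch follows; `ring_t`, `ring_put`, `ring_get`, and `RB_SIZE` are hypothetical names. It assumes a single-core MCU where the ISR is the only writer of `head` and the main loop is the only writer of `tail`; multi-core or out-of-order systems would additionally need atomics or memory barriers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define RB_SIZE 8u  /* hypothetical size; must be a power of two */
static_assert((RB_SIZE & (RB_SIZE - 1u)) == 0u, "RB_SIZE must be a power of two");

typedef struct {
    volatile uint8_t  buf[RB_SIZE];
    volatile uint32_t head;     /* written only by the producer (ISR)       */
    volatile uint32_t tail;     /* written only by the consumer (main loop) */
    volatile uint32_t dropped;  /* bytes discarded because the buffer was full */
} ring_t;

/* Producer side (call from the RX ISR): store a byte or count it as dropped. */
bool ring_put(ring_t *r, uint8_t byte)
{
    if (r->head - r->tail == RB_SIZE) {  /* full: unsigned wrap keeps this valid */
        r->dropped++;
        return false;
    }
    r->buf[r->head & (RB_SIZE - 1u)] = byte;
    r->head++;
    return true;
}

/* Consumer side (call from the main loop): fetch the oldest byte, if any. */
bool ring_get(ring_t *r, uint8_t *out)
{
    if (r->head == r->tail) {
        return false;  /* empty */
    }
    *out = r->buf[r->tail & (RB_SIZE - 1u)];
    r->tail++;
    return true;
}
```

Logging `dropped` after the flood test turns "RX overflow handling" into a number you can track across builds.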
How this fits in projects: Project 2 (UART RX/TX), 3 (timers/PWM), 7 (IDT), 8 (page faults), 9 (NVIC), 10 (scheduler), 11 (ATA IRQs).
Definitions & key terms: ISR, vector table/IDT, PIC/NVIC/GIC/APIC, mask/unmask, acknowledge/EOI, latency, jitter, prescaler, compare match, PWM.
Mental model diagram
Event -> Interrupt controller -> Vector lookup -> ISR prologue -> Work -> Acknowledge -> Return
How it works: Hardware asserts a line; controller prioritizes; CPU saves state; handler runs; flag cleared; execution resumes.
Minimal example (AVR timer CTC):
#include <avr/io.h>
#include <avr/interrupt.h>
ISR(TIMER1_COMPA_vect) {
    PORTB ^= (1 << PB5); // toggle LED on each compare match
}
Common misconceptions
- “Interrupts replace polling entirely.” → ISRs still require careful shared-state handling.
- “Any prescaler works.” → Frequency error impacts UART baud/PWM tone.
- “Unhandled interrupts are harmless.” → They often crash or freeze the system.
Check-your-understanding
- Q: Why remap PIC on x86? A: To avoid clashing with CPU exceptions 0–31.
- Q: What must every ISR do? A: Clear/acknowledge the source and exit quickly.
- Q: How do you measure jitter? A: Toggle GPIO in ISR and measure with scope or cycle counter.
Real-world applications: RTOS ticks, motor control PWM, UART RX buffering, watchdog kicks, audio timing.
Where you’ll apply it: Projects 2, 3, 7, 9, 10, 11.
References: “Making Embedded Systems” (Ch. 2, 7), OSDev (Interrupts, PIC), AVR datasheet timer sections.
Key insight: Deterministic systems depend on predictable interrupts; prove latency with measurements, not guesses.
Summary: Correct vectors + fast ISRs + verified timing = stable hardware interactions.
Homework & solutions
- Exercise: Configure a timer for 1 kHz and log tick drift over 10 seconds. Solution: Compare expected 10,000 ticks to actual; adjust prescaler/compare.
- Exercise: Trigger divide-by-zero in Project 7 and print fault info. Solution: Install exception handler, dump CS:EIP or rip.
Concept 4: Toolchain, Linker Scripts, and Debugging Observability
Fundamentals
Your “runtime” is the toolchain: compiler flags, startup objects, linker script, and binary format. Observability comes from UART, JTAG/SWD, GDB, and minimal printf/trace hooks. Without these, debugging devolves into guessing. The contract is twofold: reproducible binaries (same flags, same placement) and an early, reliable way to see what the CPU is doing.
Deep Dive (~500+ words)
Freestanding builds require -ffreestanding -nostdlib to avoid pulling libc and host assumptions. Startup objects (crt0) are either supplied (UEFI/EFI apps) or written by you (MCUs/x86 boot). The linker script sets section layout, memory origins, and symbols you reference in startup (_stack_top, _sbss, _ebss, _sdata, _edata). On x86 kernel builds, scripts also place multiboot headers or higher-half mappings. Binary formats: raw BIN for boot sectors, ELF for kernels, PE/COFF for UEFI. Size and alignment constraints matter (512-byte boot sectors need 0xAA55 signature; UEFI requires aligned sections). Build reproducibility: pin compiler versions, use -mcpu/-march flags, and inspect map files to confirm placement. Observability: first channel is UART (initialize early, even before C), second is GPIO toggles for timing, third is GDB (via QEMU -s -S or OpenOCD). For MCUs, SWD/JTAG provides memory/register access; for PCs, QEMU and Bochs logs help decode triple faults. Logging discipline: avoid heavy printf in ISRs; implement ring buffers; include timestamps from hardware timers. Safety flags: -fno-pic for tiny MCUs, -fpic for some ARM SoCs, -fno-omit-frame-pointer during bring-up, -Wall -Wextra -Werror to catch UB. Memory safety: mark MMIO as volatile, disable optimizations around timing-critical handshakes if needed (__asm__ __volatile__). Tooling scripts: Makefiles that separate compile/link/objcopy; post-build steps to create HEX/BIN; flashing commands that verify writes. Debug playbook: (1) build with symbols (-g); (2) run in QEMU with GDB stub; (3) set breakpoints at _start, main, ISR; (4) inspect SP, PC, and registers; (5) disassemble around faults; (6) dump memory to confirm section addresses; (7) on hardware, use UART heartbeat and GPIO scope probes. For paging and multitasking, add panic handlers that dump registers and halt. For storage/network drivers, add hexdumps of buffers. 
Ultimately, the toolchain is part of the system contract—if flags or scripts change, the binary layout changes, and so does boot behavior.
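The boot-sector size and signature constraint is simple to enforce as a post-build check. The sketch below uses hypothetical helper names and assumes the classic BIOS convention: a 512-byte sector whose last two bytes are 0x55 then 0xAA (the little-endian encoding of 0xAA55).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SECTOR_SIZE 512u

/* Stamp the BIOS boot signature into the last two bytes of a sector image. */
void boot_sector_sign(uint8_t *sector)
{
    assert(sector != NULL);
    sector[SECTOR_SIZE - 2u] = 0x55u;
    sector[SECTOR_SIZE - 1u] = 0xAAu;
}

/* True if the image is exactly one sector long and carries the signature. */
bool boot_sector_is_bootable(const uint8_t *image, size_t len)
{
    return image != NULL &&
           len == SECTOR_SIZE &&
           image[SECTOR_SIZE - 2u] == 0x55u &&
           image[SECTOR_SIZE - 1u] == 0xAAu;
}
```

Running a check like this in the Makefile after objcopy catches the common case where code grows past 510 bytes and silently overwrites the signature.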
Go deeper: reproducibility and provenance are part of correctness. Record exact compiler, binutils, and linker script versions; keep map files and linker scripts under version control; add a make map target to inspect size regressions. Introduce -Wl,--gc-sections judiciously: great for size, dangerous if you forget __attribute__((used)) on vectors/startup. For UEFI, -fshort-wchar matters; for ARM, -mthumb vs -marm selects the instruction set, which changes code size, alignment, and interworking. Build once with -O0 -g for debug, then with -O2 -g for release; compare disassembly to ensure critical sequences (e.g., MMIO) survive optimizations. Observability scaffolding should include a minimal panic logger that writes both to UART and to a reserved RAM buffer you can dump over GDB after a crash. For early bring-up, prefer “static traces”: GPIO heartbeat per milestone and a single-byte UART code for each major init stage (0xA1 for clocks, 0xA2 for memory, etc.). When integrating with hardware debuggers, script OpenOCD to halt at _start, inspect the vector table, and verify SP before you ever run C. For x86, leverage QEMU tracing (-d int,cpu_reset) to decode faults; for ARM, use semihosting or ITM if available. Don’t neglect binary post-processing: objcopy to BIN/HEX, checksum/sign for secure boot, and align images to SD-card expectations. Every project in this guide benefits from a “one-command build+flash” script that also prints the git hash, toolchain version, and resulting binary sizes; this keeps experiments repeatable and debuggable months later. Observability closes the loop: without early UART/GPIO/GDB hooks, you are debugging blind. With them, you can bisect failures in minutes.
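The "static trace" idea (one byte per init stage, mirrored into RAM) can be sketched as below. All names here are hypothetical: `boot_trace_t` would live in a reserved RAM region on real hardware, `TRACE_MAGIC` is an arbitrary marker that makes the buffer easy to find in a GDB memory dump, and the sink callback stands in for a polled UART write.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TRACE_MAGIC 0xB0077ACEu  /* arbitrary marker, easy to spot in a dump */
#define TRACE_SLOTS 16u

typedef struct {
    uint32_t magic;
    uint32_t count;               /* total codes recorded; may exceed TRACE_SLOTS */
    uint8_t  codes[TRACE_SLOTS];  /* most recent codes, oldest overwritten first  */
} boot_trace_t;

typedef void (*trace_sink_t)(uint8_t code);  /* e.g. a polled UART putc */

void boot_trace_init(boot_trace_t *t)
{
    t->magic = TRACE_MAGIC;
    t->count = 0u;
    for (size_t i = 0; i < TRACE_SLOTS; ++i) {
        t->codes[i] = 0u;
    }
}

/* Record a stage code in RAM and, if a sink is provided, emit it as well. */
void boot_trace_stage(boot_trace_t *t, uint8_t code, trace_sink_t sink)
{
    assert(t->magic == TRACE_MAGIC);  /* catches use before init */
    t->codes[t->count % TRACE_SLOTS] = code;
    t->count++;
    if (sink != NULL) {
        sink(code);
    }
}
```

After a hang, the last recorded code tells you which init stage was reached even when the UART itself was the thing that failed to come up.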
Operationalize the toolchain: add CI jobs that build all targets, run QEMU smoke tests (boot to UART banner), and publish artifacts (ELF + map + bin). Track binary size trends to catch sudden regressions when enabling features. Incorporate static analysis (-fanalyzer, clang-tidy) for UB hot spots, and use sanitizers in hosted unit tests for algorithms you later drop into freestanding builds. When porting between architectures, keep per-target CFLAGS/LDFLAGS in separate make fragments to avoid accidental cross-pollution. Document your “first 5 minutes” debug checklist next to the Makefile so teammates can reproduce it. Finally, treat your toolchain itself as an artifact: version-lock Docker/Podman images or devcontainers with the exact compilers—firmware that boots today should still boot in six months with the same image.
How this fits in projects: All projects rely on correct scripts/flags; Projects 4–8 and 12 need specific binary formats; Project 10 depends on debug traces for context switches; Project 11 needs hexdumps.
Definitions & key terms: freestanding, crt0, linker script, map file, objcopy, ELF, BIN, PE/COFF, GDB stub, SWD, ring buffer.
Mental model diagram
Source → Compile (-ffreestanding) → Link (script) → Objcopy (bin/hex) → Flash/Boot → Observe (UART/GDB/GPIO)
How it works: Compiler emits objects; linker places them per script; objcopy converts format; flasher writes; debugger observes.
Minimal example (linker excerpt):
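ENTRY(_start)
SECTIONS {
    . = 0x80000;                                  /* load address, e.g. 64-bit Raspberry Pi 3 kernel */
    .text   : { KEEP(*(.text.boot)) *(.text*) }   /* boot code first; KEEP survives --gc-sections */
    .rodata : { *(.rodata*) }
    .data   : { *(.data*) }
    .bss    : { *(.bss*) *(COMMON) }
}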
Common misconceptions
- “If it runs in QEMU, hardware will match.” → Clocks/cache/MMIO differ.
- “Optimization level is irrelevant.” → -O0 changes timing; -O2 may remove volatile-less code.
- “Printf is free.” → It can stall ISRs and mask timing bugs.
Check-your-understanding
- Q: Why use -ffreestanding? A: Tell compiler no hosted assumptions; avoid implicit libc.
- Q: Why inspect map files? A: To verify sections align with the memory map.
- Q: How to prove you’re executing your binary? A: Early UART byte or GPIO toggle at _start.
Real-world applications: Firmware reproducibility, safety certification traces, secure boot measurements.
Where you’ll apply it: All projects.
References: “Bare Metal C” (Ch. 2–3), “Making Embedded Systems” (Ch. 4), OSDev (linker scripts), UEFI Spec.
Key insight: The toolchain defines your runtime—control it as tightly as you control code.
Summary: Deterministic builds + early observability make bring-up measurable and debuggable.
Homework & solutions
- Exercise: Add `-ffreestanding -nostdlib` and observe missing `crt0`; write minimal `_start`. Solution: Provide your own startup, link map cleanly.
- Exercise: Add a UART heartbeat in `_start` and verify on scope/terminal. Solution: Initialize UART registers early, emit 0x55.
Glossary
- Reset Vector: Fixed address CPU jumps to after power-on.
- Vector Table / IDT: Table mapping interrupt numbers to handler addresses.
- MMIO: Memory-mapped I/O; peripherals accessed as memory.
- ISR: Interrupt Service Routine.
- Linker Script: File describing memory layout and section placement.
- Prescaler: Divider that scales a clock source for timers.
- EOI: End-of-interrupt acknowledgment to a controller.
- Freestanding: Build mode with no hosted environment/OS assumptions.
Why Bare Metal Programming Matters (Context & Evolution)
Modern systems depend on firmware correctness. In 2024, industry reports estimate ~28.5 billion MCUs shipped worldwide—far more than PCs or phones—driven by automotive, IoT, and industrial controls. Open architectures like RISC-V captured the fastest growth, with Asia-Pacific accounting for 55%+ of RISC-V core shipments in 2024. Secure boot, OTA updates, EV motor controllers, and medical devices all start as bare metal firmware—errors here propagate upward. The shift from opaque BIOS to UEFI and from proprietary ISAs to open ones (RISC-V) shows the ecosystem moving toward transparency and verifiability.
ASCII: Old vs new boot approaches
Legacy BIOS Chain Modern Bare Metal / UEFI
┌──────────┐ ┌──────────────┐
│ BIOS int │ text mode I/O │ UEFI services│ GOP, files
├──────────┤ ├──────────────┤
│ Bootsect │ 512 B limit │ PE/COFF app │ richer protocols
└────┬─────┘ └─────┬────────┘
│ │
▼ ▼
Kernel (real mode) Kernel (protected/long)
Concept Summary Table
| Concept | What to internalize | Primer Section | Used In Projects |
|---|---|---|---|
| Power-On & Startup Chain | Vector, stack, .bss/.data, hand-off to C | Concept 1 | 1–8, 10, 12 |
| Memory Maps & MMIO | Volatile, RMW, base offsets, identity maps | Concept 2 | 1–4, 7–9, 11 |
| Interrupts & Timing | Vectors, ISRs, PIC/NVIC, prescalers, jitter | Concept 3 | 2, 3, 7, 9, 10, 11 |
| Toolchain & Debugging | Linker scripts, freestanding flags, UART/JTAG traces | Concept 4 | All |
Project-to-Concept Map
| Project | Startup | MMIO | Interrupts/Timing | Toolchain/Debug |
|---|---|---|---|---|
| P01 Blink | ✓ | ✓ | - | ✓ |
| P02 UART | ✓ | ✓ | ✓ | ✓ |
| P03 Timer/PWM | ✓ | ✓ | ✓ | ✓ |
| P04 RPi Bare Metal | ✓ | ✓ | ✓ | ✓ |
| P05 x86 Bootloader | ✓ | - | - | ✓ |
| P06 Protected Mode | ✓ | ✓ (VGA) | ✓ (faults) | ✓ |
| P07 IDT | ✓ | ✓ | ✓ | ✓ |
| P08 Paging | ✓ | ✓ | ✓ (page faults) | ✓ |
| P09 STM32 | ✓ | ✓ | ✓ | ✓ |
| P10 Multitasking | ✓ | ✓ | ✓ | ✓ |
| P11 Filesystem | ✓ | ✓ (ATA) | ✓ (IRQs) | ✓ |
| P12 UEFI App | ✓ | ✓ | ✓ | ✓ |
Deep Dive Reading by Concept
| Concept | Book/Chapter | Why |
|---|---|---|
| Startup Chain | Bare Metal C Ch.1–3; Write Great Code Vol.1 Ch.1–3 | Startup, memory prep |
| MMIO & Memory Maps | Making Embedded Systems Ch.2; Vendor datasheets | Register discipline |
| Interrupts & Timing | Making Embedded Systems Ch.7; OSDev Interrupts | Predictable latency |
| Toolchain & Debug | Bare Metal C Ch.2–3; OSDev linker scripts; UEFI Spec | Deterministic builds & visibility |
Quick Start (First 48 Hours)
- Day 1: Install toolchains; read one datasheet GPIO section; build & flash Project 1 blink; add UART hello byte in startup.
- Day 2: Compute UART baud divisor; bring up Project 2 echo; sketch your memory map and verify with `objdump`/map file; toggle GPIO inside an ISR to prove interrupts.
Recommended Learning Paths
- Embedded-first: 1 → 2 → 3 → 9 → 4 → 5 → 6 → 7 → 8 → 10 → 11 → 12
- Kernel-first: 5 → 6 → 7 → 8 → 10 → 11 → 12 → 1 → 2 → 3 → 4 → 9
- Security/Boot: 12 → 5 → 6 → 7 → 8 → 4 → 9 → 1 → 2 → 3 → 10 → 11
Success Metrics
- Explain power-on to `main()` verbally with addresses for your target.
- Read linker map and predict where each section lives.
- Bring up UART + one GPIO proof signal on new hardware in <60 minutes.
- Handle three interrupts (timer, UART, external) without lockups.
- Enable paging and catch/log page faults without triple faults.
- Context switch between two tasks under a timer tick.
Optional Appendix: Debugging & Bring-Up Workflow
- When nothing prints: toggle GPIO in `_start`, verify power/clock, read SP/PC in GDB/QEMU.
- When UART garbles: recompute divisor, verify clock tree, measure on scope/logic analyzer.
- When ISRs fail: check vector table address, global enable, and acknowledge flags.
- Tools cheat sheet: `objdump -D`, `readelf -S`, `nm`, `hexdump`, `qemu -d int`, `openocd gdb_port 3333`.
Project Overview Table
| # | Project | Platform | Difficulty | Time | Outcome |
|---|---|---|---|---|---|
| 1 | Blink LED (AVR) | Arduino/AVR | Beginner | Weekend | Direct GPIO control |
| 2 | UART Serial (AVR) | Arduino/AVR | Intermediate | 1 week | Debug console |
| 3 | Timer & PWM (AVR) | Arduino/AVR | Intermediate | 1 week | Precise timing/PWM |
| 4 | RPi Bare Metal Hello | Raspberry Pi 3/4 | Advanced | 2 weeks | ARM boot + UART |
| 5 | x86 Bootloader | x86 (QEMU) | Expert | 1–2 weeks | 512B boot sector |
| 6 | Protected Mode Kernel | x86 (QEMU) | Expert | 2–3 weeks | 32-bit C kernel |
| 7 | Interrupt Descriptor Table | x86 (QEMU) | Expert | 2 weeks | Full IDT/IRQs |
| 8 | Paging | x86 (QEMU) | Master | 3–4 weeks | Virtual memory |
| 9 | STM32 Bare Metal | STM32F4 | Advanced | 3–4 weeks | Clocks, NVIC, DMA |
| 10 | Multitasking Kernel | x86 (QEMU) | Master | 4–6 weeks | Scheduler |
| 11 | File System Driver | x86 (QEMU) | Expert | 3–4 weeks | FAT16 read |
| 12 | UEFI Application | x86 (OVMF) | Advanced | 2 weeks | GOP + files pre-OS |
Project List
Project 1: Blink LED on AVR (Arduino Bare Metal)
Real World Outcome
```bash
$ avr-gcc -mmcu=atmega328p -Os -o blink.elf blink.c
$ avr-objcopy -O ihex blink.elf blink.hex
$ avrdude -p m328p -c arduino -P /dev/ttyACM0 -b 115200 -U flash:w:blink.hex
avrdude: writing flash (176 bytes)
avrdude: 176 bytes of flash verified
```
LED on pin 13 blinks; code size ~176 bytes vs ~900 bytes Arduino core.
#### The Core Question You're Answering
How does software drive a pin with no OS, HAL, or runtime?
#### Concepts You Must Understand First
- GPIO register maps (AVR Workshop Ch.2)
- Volatile + MMIO (Bare Metal C Ch.1–2)
- Bitwise masks/shifts (K&R C Ch.2)
- Reset vector basics (Write Great Code Vol.1 Ch.1–2)
#### Questions to Guide Your Design
1. Which register controls pin direction and what is its reset value?
2. Is the LED active-high or active-low on your board?
3. How will you delay (busy loop vs timer)?
4. How do you prevent the compiler from optimizing away the loop?
#### Thinking Exercise
Trace PB5 state:
DDRB: 0b00000000 -> 0b00100000
PORTB: 0 -> 1 -> 0 in loop
Explain how each write changes the pin voltage.
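The trace above can be replayed on a host machine by modelling the two registers as plain bytes. This is only a sketch: `fake_ddrb`, `fake_portb`, `led_init`, and `led_toggle` are hypothetical names invented for illustration; on a real ATmega328P you would write to `DDRB` and `PORTB` from `<avr/io.h>` instead.

```c
#include <assert.h>
#include <stdint.h>

#define PB5_BIT 5u /* PB5 drives the LED on Arduino pin 13 */

/* Hypothetical stand-ins for the real MMIO registers so the logic runs on a host. */
volatile uint8_t fake_ddrb;  /* data direction: 1 = output */
volatile uint8_t fake_portb; /* output latch: 1 = drive high */

/* Make PB5 an output without disturbing the other seven pins. */
void led_init(void) {
    fake_ddrb |= (uint8_t)(1u << PB5_BIT);
}

/* Flip PB5 and return the new logical level (0 or 1). */
int led_toggle(void) {
    assert(fake_ddrb & (1u << PB5_BIT)); /* the pin must be an output first */
    fake_portb ^= (uint8_t)(1u << PB5_BIT);
    return (int)((fake_portb >> PB5_BIT) & 1u);
}
```

With the direction bit set, each toggle of the output latch moves the pin between the supply rail and ground; with the direction bit clear, the same write on an AVR would only switch the pull-up on an input pin.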
#### The Interview Questions They'll Ask
- Why must MMIO be `volatile`?
- What happens if the port clock isn’t enabled?
- Why is a busy-wait delay fragile across clock sources?
- Where does execution start after reset on AVR?
#### Hints in Layers
1. Set PB5 as output:
```c
DDRB |= (1 << PB5);
```
2. Toggle it:
```c
PORTB ^= (1 << PB5);
```
3. Replace busy-wait with timer compare ISR.
Books That Will Help
| Book | Chapters | What You’ll Learn |
|---|---|---|
| AVR Workshop | 2 | GPIO registers and bit ops |
| Bare Metal C | 1–2 | Volatile, startup |
| Write Great Code Vol.1 | 1–2 | Execution model |
Common Pitfalls & Debugging
- LED never blinks: Wrong pin/active-low. Quick test: set pin high once and probe.
- Blink too fast/slow: Clock assumption wrong (8 vs 16 MHz). Quick test: measure frequency.
- Code does nothing: Wrong MCU target or flash settings. Quick test: check the device signature reported by `avrdude -v`.
Definition of Done
- LED toggles at expected rate on pin 13
- Code size < 300 bytes (shows minimalism)
- Delay uses timer or calibrated loop
- Datasheet addresses documented in code comments
Project 2: UART Serial Communication (AVR)
Real World Outcome
$ make flash
$ screen /dev/ttyACM0 115200
=== AVR Bare Metal Serial Demo ===
hello
You typed: hello
led on
LED is now ON
status
Uptime: 42s Temp: 23.5C Free RAM: 1847 bytes
The Core Question You’re Answering
How does a bare metal system communicate without stdio or an OS?
Concepts You Must Understand First
- UART framing/baud (Make: AVR Programming Ch.5)
- Interrupt-driven I/O (AVR Workshop Ch.7)
- Ring buffers (Algorithms in C Ch.1)
- Volatile MMIO (Bare Metal C Ch.1–2)
Questions to Guide Your Design
- Block on TX or use interrupts?
- How will you prevent RX overflow?
- Exact baud divisor for your clock?
- How will you handle backspace/newlines?
Thinking Exercise
Compute UBRR for 16 MHz @115200:
UBRR = (16,000,000/(16*115200)) - 1 = 7.68 → round to 8 (actual rate ≈ 111,111 baud, about -3.5% error)
What happens if UBRR=7 or 9?
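To explore the UBRR=7 and UBRR=9 cases numerically, the divisor arithmetic can be run on a host. The sketch below assumes normal-speed mode (16 samples per bit, `U2X0` clear); the function names are made up for illustration and are not part of avr-libc.

```c
#include <assert.h>
#include <stdint.h>

/* Nearest-integer UBRR for normal-speed mode (16 samples per bit). */
uint16_t ubrr_for(uint32_t f_cpu_hz, uint32_t baud) {
    assert(baud > 0u && f_cpu_hz >= 16u * baud);
    uint32_t divisor = 16u * baud;
    return (uint16_t)((f_cpu_hz + divisor / 2u) / divisor - 1u);
}

/* Baud rate the hardware actually produces for a given UBRR value. */
uint32_t actual_baud(uint32_t f_cpu_hz, uint16_t ubrr) {
    return f_cpu_hz / (16u * ((uint32_t)ubrr + 1u));
}

/* Signed error in tenths of a percent (truncated), e.g. -35 means -3.5%. */
int32_t baud_error_permille(uint32_t f_cpu_hz, uint16_t ubrr, uint32_t baud) {
    int64_t actual = (int64_t)actual_baud(f_cpu_hz, ubrr);
    return (int32_t)(((actual - (int64_t)baud) * 1000) / (int64_t)baud);
}
```

At 16 MHz and 115200 baud this gives UBRR=8 with roughly -3.5% error, while UBRR=7 lands near +8.5% and UBRR=9 near -13.1%. Errors that large usually corrupt frames, and even -3.5% is marginal, which is why many setups enable double-speed mode for this combination.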
The Interview Questions They’ll Ask
- Why does UART need a divisor?
- Polling vs interrupts trade-offs?
- How to avoid losing bytes at high baud?
- Why is `printf` challenging on bare metal?
Hints in Layers
- Init TX only: `UBRR0L = 8; UCSR0B = (1 << TXEN0);`
- Add RX polling: `while (!(UCSR0A & (1<<RXC0))) {} char c = UDR0;`
- Move RX to ISR with ring buffer.
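The last hint, moving RX into an ISR with a ring buffer, can be sketched as a single-producer/single-consumer queue. This is one possible shape under stated assumptions, not a reference implementation: `rx_ring_t`, `rx_ring_put`, and `rx_ring_get` are hypothetical names, the buffer size is arbitrary, and the code is written to run on a host so the logic can be checked before it is wired to the UART.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define RX_BUF_SIZE 16u /* power of two so index wrap is a cheap mask */

typedef struct {
    volatile uint8_t data[RX_BUF_SIZE];
    volatile uint8_t head; /* written only by the producer (the RX ISR) */
    volatile uint8_t tail; /* written only by the consumer (the main loop) */
} rx_ring_t;

/* Producer: call from the RX ISR with the byte just read from the UART.
   Returns false and drops the byte when the buffer is full. */
bool rx_ring_put(rx_ring_t *rb, uint8_t byte) {
    uint8_t next = (uint8_t)((rb->head + 1u) & (RX_BUF_SIZE - 1u));
    if (next == rb->tail) {
        return false; /* full: one slot stays empty to tell full from empty */
    }
    rb->data[rb->head] = byte;
    rb->head = next;
    return true;
}

/* Consumer: call from the main loop. Returns false when nothing is pending. */
bool rx_ring_get(rx_ring_t *rb, uint8_t *out) {
    assert(out != 0);
    if (rb->head == rb->tail) {
        return false; /* empty */
    }
    *out = rb->data[rb->tail];
    rb->tail = (uint8_t)((rb->tail + 1u) & (RX_BUF_SIZE - 1u));
    return true;
}
```

Because each index has exactly one writer and both are single bytes, this layout needs no interrupt masking on an 8-bit AVR; the usable capacity is `RX_BUF_SIZE - 1` bytes.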
Books That Will Help
| Book | Chapters | What You’ll Learn |
|---|---|---|
| Make: AVR Programming | 5, 7 | UART + interrupts |
| AVR Workshop | 7 | ISR patterns |
| Bare Metal C | 1–2 | MMIO discipline |
Common Pitfalls & Debugging
- Garbled chars: Wrong divisor/clock. Quick test: loopback TX→RX.
- RX works once: Forgot to read UDR0/clear flag.
- Interrupt never fires: Global interrupts disabled. Quick test: toggle GPIO in ISR.
Definition of Done
- TX and RX verified at target baud
- RX buffer survives burst traffic without overflow in normal use
- UART ISR clears/acks flags
- Basic shell/echo works over serial
Project 3: Hardware Timer and PWM (AVR)
Real World Outcome
$ make flash
=== Timer Demo ===
[0.000s] System started
[1.000s] Tick
[2.500s] Button pressed (capture)
PWM: LED fades 0→100%→0%
Tone: 440 Hz for 500 ms
The Core Question You’re Answering
How do you make time predictable when the CPU is busy?
Concepts You Must Understand First
- Timer modes/prescalers (AVR Workshop Ch.5–7)
- Clock sources and drift (Making Embedded Systems Ch.2)
- ISR latency (Write Great Code Vol.1 Ch.5)
Questions to Guide Your Design
- Which timer for PWM vs timekeeping?
- Prescaler for 1 kHz with minimal error?
- How to avoid blocking delays while audio plays?
- What if multiple timers share a prescaler?
Thinking Exercise
Given 16 MHz, prescaler 64:
Tick = 16,000,000/64 = 250 kHz → 250 ticks per 1 ms → OCR = 249 (in CTC mode the counter runs 0..OCR, so the period is OCR + 1 ticks)
Which prescaler minimizes error for 100 Hz?
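The prescaler question can be answered by brute force on a host. The helper names below are hypothetical, and the sketch assumes CTC mode, where the compare register holds the tick count minus one.

```c
#include <assert.h>
#include <stdint.h>

/* Timer ticks per period, rounded to nearest. In CTC mode OCRnA = result - 1. */
uint32_t ctc_counts(uint32_t f_cpu_hz, uint32_t prescaler, uint32_t target_hz) {
    assert(prescaler > 0u && target_hz > 0u);
    uint32_t tick_hz = f_cpu_hz / prescaler;
    return (tick_hz + target_hz / 2u) / target_hz;
}

/* Frequency actually produced, in millihertz, for a given tick count. */
uint32_t ctc_actual_millihertz(uint32_t f_cpu_hz, uint32_t prescaler, uint32_t counts) {
    assert(prescaler > 0u && counts > 0u);
    uint64_t tick_hz = f_cpu_hz / prescaler;
    return (uint32_t)((tick_hz * 1000u) / counts);
}
```

For 100 Hz at 16 MHz, prescaler 256 needs 625 ticks and is exact, but 625 only fits a 16-bit timer. On an 8-bit timer the only prescaler from the usual 1/8/64/256/1024 set that fits is 1024, which gives 156 ticks and about 100.16 Hz.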
The Interview Questions They’ll Ask
- Difference between CTC and PWM modes?
- Why are hardware timers more precise than busy loops?
- How to measure input frequency?
- What is interrupt jitter?
Hints in Layers
- Configure CTC for 1 ms tick.
- Use PWM via OCR0A duty cycle.
- Add input capture for pulse measurement.
Books That Will Help
| Book | Chapters | What You’ll Learn |
|---|---|---|
| AVR Workshop | 5–7 | Timers, PWM, ISRs |
| Making Embedded Systems | 2–3 | Timing constraints |
Common Pitfalls & Debugging
- PWM flicker: Updating OCR mid-cycle → update at safe point.
- ISR never runs: OCIE bit off → enable mask and `sei()`.
- Frequency off by 2x: Wrong prescaler bits → log TCCRnB.
Definition of Done
- 1 kHz tick measured within 1% error
- PWM duty cycle spans 0–100% smoothly
- Timer ISR acknowledged and counted
- Input capture verified with known pulse
Project 4: Raspberry Pi Bare Metal - Hello World
Real World Outcome
$ make && cp kernel8.img /mnt/sdcard/
$ screen /dev/ttyUSB0 115200
=== Raspberry Pi 3 Bare Metal ===
UART initialized at 115200 baud
Hello from bare metal!
CPU: Cortex-A53 @1.2GHz
RAM: 1024 MB
The Core Question You’re Answering
How does an ARM SoC boot into your code without Linux or a runtime?
Concepts You Must Understand First
- ARM startup sequence (Write Great Code Vol.1 Ch.1–3)
- Linker scripts for SoCs (Bare Metal C Ch.2–3)
- UART bring-up (Making Embedded Systems Ch.4)
- MMIO on ARM (Computer Organization and Design Ch.1–2)
Questions to Guide Your Design
- Where does the Pi CPU start executing after GPU boot?
- Where should vector table/stack live?
- How to prevent libc pull-in?
- What is the UART base for your Pi revision?
Thinking Exercise
Sketch minimal flow:
Reset → _start → set SP → zero .bss → init UART → main()
What breaks if .bss isn’t cleared?
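The "zero .bss" step in that flow is a short loop. In the sketch below the bounds are passed in as pointers so it can be exercised on a host; on target they would come from symbols exported by your linker script (names such as `__bss_start` and `__bss_end` are a common convention, assumed here rather than taken from this guide's excerpt).

```c
#include <assert.h>
#include <stdint.h>

/* Zero every 32-bit word in [bss_start, bss_end).
   Both bounds are assumed to be word-aligned. */
void zero_bss(uint32_t *bss_start, uint32_t *bss_end) {
    assert(bss_start <= bss_end);
    for (uint32_t *p = bss_start; p < bss_end; ++p) {
        *p = 0u;
    }
}
```

If this step is skipped, C objects with static storage that the standard says start at zero (uninitialized globals and statics) instead hold whatever was left in RAM, so flags, counters, and pointers begin with arbitrary values.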
The Interview Questions They’ll Ask
- Why is a linker script required here?
- What is the `.bss` section for?
- How do you verify execution before UART works?
- Why park secondary cores?
Hints in Layers
- Minimal `_start` sets SP and jumps to main.
- Linker places `.text` at 0x80000.
- Initialize UART and print banner.
Books That Will Help
| Book | Chapters | What You’ll Learn |
|---|---|---|
| Bare Metal C | 2–3 | Startup + linker |
| Making Embedded Systems | 4 | Debug-first UART |
| Write Great Code Vol.1 | 1–3 | Execution flow |
Common Pitfalls & Debugging
- No UART output: Wrong alt-function/pin mux. Quick test: blink ACT LED.
- Works in QEMU, not hardware: Cache/clock differences. Start minimal, disable caches.
- Crash before main: SP/BSS not set. Quick test: write known value then halt.
Definition of Done
- `kernel8.img` boots on Pi 3/4 and prints banner
- Linker script origins match SoC expectations
- Secondary cores parked or held
- UART TX verified with scope/logic analyzer
Project 5: x86 Bootloader - Real Mode
Real World Outcome
$ nasm -f bin boot.asm -o boot.bin
$ qemu-system-x86_64 -drive format=raw,file=boot.bin
Hello from the boot sector!
The Core Question You’re Answering
How does a PC decide what code runs first, and how do you control it in 512 bytes?
Concepts You Must Understand First
- Real mode segmentation (Write Great Code Vol.1 Ch.4–6)
- BIOS interrupts (OSDev + Ralph Brown list)
- Boot sector layout/signature
Questions to Guide Your Design
- Where does BIOS load the sector (address)?
- How do you fit code+data in 512 bytes?
- How will you load the next stage?
- Which video mode assumptions are safe?
Thinking Exercise
Layout:
0x0000–0x01BD code/data
0x01BE–0x01FD optional partition
0x01FE–0x01FF 0x55 0xAA
Why must signature be last?
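The layout can also be expressed as a host-side image builder, which is handy for checking sizes before involving an assembler. This is an illustrative sketch with hypothetical function names; it pads with zeros and assumes no partition table (with one, the code limit drops to 0x1BE bytes).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SECTOR_SIZE 512u
#define SIG_OFFSET  510u /* 0x01FE */

/* Fill a 512-byte image: code at offset 0, zero padding, 0x55 0xAA at the end.
   Returns 0 on success, -1 if the code would overlap the signature. */
int make_boot_sector(uint8_t image[SECTOR_SIZE], const uint8_t *code, size_t code_len) {
    assert(image != NULL && (code != NULL || code_len == 0));
    if (code_len > SIG_OFFSET) {
        return -1;
    }
    memset(image, 0, SECTOR_SIZE);
    if (code_len > 0) {
        memcpy(image, code, code_len);
    }
    image[SIG_OFFSET] = 0x55;
    image[SIG_OFFSET + 1u] = 0xAA;
    return 0;
}

/* The check a BIOS performs before jumping to the loaded sector. */
int has_boot_signature(const uint8_t image[SECTOR_SIZE]) {
    return image[SIG_OFFSET] == 0x55 && image[SIG_OFFSET + 1u] == 0xAA;
}
```

Read as a little-endian 16-bit word, the two bytes 0x55 0xAA form 0xAA55, which is why both spellings appear in boot-sector discussions.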
The Interview Questions They’ll Ask
- Why 512-byte limit?
- Real vs protected mode differences?
- How does BIOS find boot sector?
- What breaks without signature?
Hints in Layers
- Print one char with int 0x10.
- Loop through string with `lodsb`.
- Use int 0x13 to load next sector.
Books That Will Help
| Book | Chapters | What You’ll Learn |
|---|---|---|
| Write Great Code Vol.1 | 4–6 | x86 real mode |
| The Art of Assembly | 6–8 | Boot-time ASM |
Common Pitfalls & Debugging
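- Immediate hang: Missing 0xAA55. Quick test: `hexdump -C boot.bin | tail -2` (the last line of `hexdump -C` output is only the end offset, so the signature bytes appear on the line before it).
- Garbage text: DS not set. Fix: init segment registers.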
- Next stage fails: Wrong CHS/LBA. Test: load one known sector.
Definition of Done
- 512-byte image with 0xAA55 signature
- Boots in QEMU and prints text
- Cleanly loops or loads second stage
- Source comments show segment assumptions
Project 6: x86 Protected Mode Kernel
Real World Outcome
$ make && qemu-system-i386 -kernel kernel.elf ============================================================================== My x86 Protected Mode Kernel v0.1 [OK] GDT loaded at 0x00000800 [OK] Switched to 32-bit protected mode [OK] VGA driver initialized (80x25)
The Core Question You’re Answering
How do you transition from 16-bit real mode into a 32-bit environment that can run C code?
Concepts You Must Understand First
- GDT/segmentation (OSDev GDT; Write Great Code Vol.1 Ch.4–6)
- Control registers (CR0) and mode switch
- Calling conventions/stack frames (Write Great Code Vol.1 Ch.7)
- VGA text MMIO
Questions to Guide Your Design
- When do you disable interrupts during switch?
- Minimum GDT entries to run C?
- How to prove 32-bit mode is active?
- How to output text without BIOS?
Thinking Exercise
Pseudo-sequence:
cli → lgdt → set CR0.PE → far jump 0x08:pm_start
Why far jump? (It reloads CS with the 32-bit code selector and discards instructions prefetched while still in 16-bit mode.)
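The `0x08` in that far jump is the selector for GDT entry 1, so the descriptor stored there must describe a 32-bit code segment. The sketch below packs the standard 8-byte segment descriptor layout; the function name `gdt_entry` is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Pack one 8-byte segment descriptor.
   limit is 20 bits; flags is the upper nibble field (G, D/B, L, AVL). */
uint64_t gdt_entry(uint32_t base, uint32_t limit, uint8_t access, uint8_t flags) {
    assert(limit <= 0xFFFFFu && flags <= 0xFu);
    uint64_t d = 0;
    d |= (uint64_t)(limit & 0xFFFFu);            /* limit 15:0   -> bits 0..15  */
    d |= (uint64_t)(base & 0xFFFFFFu) << 16;     /* base 23:0    -> bits 16..39 */
    d |= (uint64_t)access << 40;                 /* access byte  -> bits 40..47 */
    d |= (uint64_t)((limit >> 16) & 0xFu) << 48; /* limit 19:16  -> bits 48..51 */
    d |= (uint64_t)flags << 52;                  /* flags nibble -> bits 52..55 */
    d |= (uint64_t)(base >> 24) << 56;           /* base 31:24   -> bits 56..63 */
    return d;
}
```

A flat 4 GiB ring-0 setup uses `gdt_entry(0, 0xFFFFF, 0x9A, 0xC)` for code and `gdt_entry(0, 0xFFFFF, 0x92, 0xC)` for data, after the mandatory all-zero null descriptor.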
The Interview Questions They’ll Ask
- Why can’t you call BIOS in protected mode?
- Purpose of GDT null descriptor?
- What happens if you forget to reload segment registers?
- How does CPU know instruction width after switch?
Hints in Layers
- Build GDT with null/code/data.
- Switch to PM and print to VGA memory.
- Call a C function and return.
Books That Will Help
| Book | Chapters | What You’ll Learn |
|---|---|---|
| Write Great Code Vol.1 | 4–7 | Mode transitions |
| Operating Systems: Three Easy Pieces | 2–4 | Execution context |
Common Pitfalls & Debugging
- Triple fault: Bad GDT selector. Use Bochs/QEMU logs.
- Garbage VGA: Wrong address/attribute. Hardcode one char to 0xB8000.
- C crashes: Stack not set. Set ESP before calling C.
Definition of Done
- GDT installed and verified in logs
- CPU in protected mode (confirmed by CR0.PE = 1 and 32-bit code executing)
- VGA text output works without BIOS
- C function executes and returns
Project 7: Interrupt Descriptor Table (IDT)
Real World Outcome
$ qemu-system-i386 -kernel kernel.elf
=== Interrupt Test Suite ===
[EXCEPTION 0] Divide Error at 0x00100234 ✓
[IRQ 0] Timer tick #1 ✓
[IRQ 1] Keyboard scancode 0x1E ✓
The Core Question You’re Answering
How does the CPU safely jump to your code when hardware or faults interrupt execution?
Concepts You Must Understand First
- IDT structure (OSDev IDT)
- PIC remapping (OSDev PIC)
- Context save/restore (Write Great Code Vol.1 Ch.7–8)
- Exceptions vs IRQs (Intel SDM Vol.3)
Questions to Guide Your Design
- Which vectors are reserved for exceptions?
- How to pass interrupt number to C?
- Where to acknowledge IRQ (PIC EOI)?
- Strategy for unhandled interrupts?
Thinking Exercise
IRQ0 remap example:
Master PIC offset 0x20 → IRQ0 at vector 0x20
Why must 0–31 be reserved?
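To make the remap concrete, here is a sketch of the two pieces involved: mapping an IRQ line to its vector after remapping, and packing a 32-bit IDT gate for that vector. It assumes the common 0x20/0x28 offsets and a kernel code selector of 0x08; the type and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define PIC1_VECTOR_BASE 0x20u /* master PIC offset after remap (slave follows at 0x28) */
#define IDT_INT_GATE32   0x8Eu /* present, DPL 0, 32-bit interrupt gate */

/* 32-bit protected-mode IDT gate; the natural layout is already 8 bytes. */
typedef struct {
    uint16_t offset_low;  /* handler address bits 15:0 */
    uint16_t selector;    /* code segment selector */
    uint8_t  zero;        /* reserved, must be 0 */
    uint8_t  type_attr;   /* gate type, DPL, present bit */
    uint16_t offset_high; /* handler address bits 31:16 */
} idt_gate_t;

_Static_assert(sizeof(idt_gate_t) == 8, "IDT gate must be 8 bytes");

/* Split a handler address across the two offset fields. */
idt_gate_t idt_make_gate(uint32_t handler, uint16_t selector, uint8_t type_attr) {
    idt_gate_t g;
    g.offset_low = (uint16_t)(handler & 0xFFFFu);
    g.selector = selector;
    g.zero = 0;
    g.type_attr = type_attr;
    g.offset_high = (uint16_t)(handler >> 16);
    return g;
}

/* Vector number for a legacy IRQ line once both PICs are remapped contiguously. */
uint8_t irq_to_vector(uint8_t irq) {
    assert(irq < 16u);
    return (uint8_t)(PIC1_VECTOR_BASE + irq);
}
```

Without the remap, the master PIC's power-on offset places IRQ0 to IRQ7 on vectors 0x08 to 0x0F, which collide with CPU exceptions such as the double fault (vector 8).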
The Interview Questions They’ll Ask
- Interrupt vs exception difference?
- Why remap the PIC?
- Interrupt gate vs trap gate?
- How do you implement syscalls?
Hints in Layers
- Single handler IDT entry, load with `lidt`.
- Generic ISR stub saves registers, calls C.
- Remap PIC, handle IRQ0/IRQ1.
Books That Will Help
| Book | Chapters | What You’ll Learn |
|---|---|---|
| Write Great Code Vol.1 | 7–8 | Stack frames |
| Operating Systems: Three Easy Pieces | 3–5 | Traps/interrupts |
Common Pitfalls & Debugging
- Triple fault on first interrupt: Invalid IDT entry. Verify base/present bits.
- IRQs stop after first tick: Missing EOI. Send EOI to the master PIC, and also to the slave for IRQ8–15.
- Garbage keyboard data: Wrong scancode handling. Print raw scancodes.
Definition of Done
- Exceptions logged with vector and address
- Timer and keyboard IRQs fire repeatedly
- PIC remapped to 0x20/0x28
- ISR saves/restores full context
Project 8: Memory Management - Paging
Real World Outcome
$ qemu-system-i386 -kernel kernel.elf -m 128M
=== Paging Demo ===
[OK] Page directory created
[OK] Paging enabled
[PAGE FAULT] Allocating new page for 0x40000000 → phys 0x00400000 ✓
The Core Question You’re Answering
How does the CPU translate virtual addresses to physical memory, and how do you control that mapping?
Concepts You Must Understand First
- Page tables/flags (OSDev Paging; OSTEP Ch.18–21)
- CR0/CR3 registers (Intel SDM)
- TLB behavior (Computer Architecture Ch.5)
- Page fault handling
Questions to Guide Your Design
- Virtual layout (identity vs higher half)?
- How to prevent mapping null pages?
- What if a fault occurs in the handler itself?
- How will you allocate/track frames?
Thinking Exercise
Translate 0xC0101234:
PDE = va>>22; PTE = (va>>12)&0x3FF; offset = va&0xFFF
Which structures are touched?
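The split for 0xC0101234 can be checked mechanically. The sketch below assumes classic 32-bit two-level paging with 4 KiB pages (no PAE, no large pages); the type and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint32_t pd_index; /* bits 31..22: page directory entry */
    uint32_t pt_index; /* bits 21..12: page table entry */
    uint32_t offset;   /* bits 11..0: byte within the 4 KiB page */
} va_parts_t;

/* Break a 32-bit virtual address into its three translation fields. */
va_parts_t va_split(uint32_t va) {
    va_parts_t p;
    p.pd_index = va >> 22;
    p.pt_index = (va >> 12) & 0x3FFu;
    p.offset = va & 0xFFFu;
    return p;
}

/* Inverse of va_split, useful when building mappings for a chosen address. */
uint32_t va_join(va_parts_t p) {
    assert(p.pd_index < 1024u && p.pt_index < 1024u && p.offset < 4096u);
    return (p.pd_index << 22) | (p.pt_index << 12) | p.offset;
}
```

For 0xC0101234 this yields directory index 768, table index 257, and offset 0x234, so the walk touches the page directory (located via CR3), one page table, and finally the data page itself.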
The Interview Questions They’ll Ask
- Why is paging required for isolation?
- Role of CR3?
- What is a TLB and how to flush it?
- How do you safely handle a page fault?
Hints in Layers
- Identity-map first 4 MB.
- Enable paging and confirm kernel runs.
- Add higher-half mapping for kernel.
Books That Will Help
| Book | Chapters | What You’ll Learn |
|---|---|---|
| Operating Systems: Three Easy Pieces | 18–21 | Virtual memory |
| Computer Architecture | 5 | TLB/memory hierarchy |
Common Pitfalls & Debugging
- Immediate triple fault: No identity map. Ensure current code mapped.
- Page fault loop: Handler/stack unmapped. Map them first.
- Corruption: Wrong flags/reused frames. Validate PTE flags and allocator.
Definition of Done
- Paging enabled without triple fault
- Page faults trapped and logged
- Higher-half mapping (if used) confirmed via address print
- Free frame count tracked and non-negative
Project 9: STM32 Bare Metal (ARM Cortex-M)
Real World Outcome
$ make && st-flash write firmware.bin 0x8000000
$ screen /dev/ttyUSB0 115200
=== STM32F4 Bare Metal Demo ===
System Clock: 168 MHz
LED blink ✓
UART ✓
Timer IRQ ✓
DMA transfer ✓
The Core Question You’re Answering
How do you bring up a professional MCU without HALs and still get reliable peripherals?
Concepts You Must Understand First
- Clock tree/RCC (STM32 RM0090; Making Embedded Systems Ch.2)
- NVIC/vector table (Definitive Guide to ARM Cortex-M3/M4 Ch.3–5)
- DMA basics (Making Embedded Systems Ch.7)
- CMSIS naming/layout
Questions to Guide Your Design
- Which clocks must be enabled before GPIO/UART?
- Where is the vector table placed/aligned?
- Safe IRQ priorities for UART vs timers?
- When is DMA worth it for UART/ADC?
Thinking Exercise
Compute flash wait states for 168 MHz; what happens if skipped?
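As a starting point for that exercise, the sketch below encodes the rule of one wait state per 30 MHz of HCLK. That rule is an assumption based on the STM32F405/407 table for the 2.7–3.6 V supply range; other voltage ranges and parts use different step sizes, so confirm the numbers against the flash section of RM0090 for your device. The function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define HCLK_MAX_HZ       168000000u /* assumed maximum HCLK for this part */
#define HZ_PER_WAIT_STATE 30000000u  /* assumed step for the 2.7-3.6 V range */

/* Wait states needed so flash reads keep up with the CPU clock:
   0 WS up to 30 MHz, 1 WS up to 60 MHz, and so on. */
uint32_t flash_wait_states(uint32_t hclk_hz) {
    assert(hclk_hz > 0u && hclk_hz <= HCLK_MAX_HZ);
    return (hclk_hz - 1u) / HZ_PER_WAIT_STATE;
}
```

Under these assumptions 168 MHz needs 5 wait states. If the latency field is not raised before switching to the faster clock, the core fetches instructions faster than flash can deliver them, and the typical symptom is a crash or HardFault immediately after the clock switch.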
The Interview Questions They’ll Ask
- What is the NVIC and how does it differ from the PIC?
- How to configure alternate function pins?
- Why is DMA useful for UART/ADC?
- How to debug a HardFault?
Hints in Layers
- Enable RCC for GPIO and USART.
- Configure GPIO mode + UART baud.
- Add DMA for UART RX into ring buffer.
Books That Will Help
| Book | Chapters | What You’ll Learn |
|---|---|---|
| Making Embedded Systems | 2, 7 | Clocks, DMA |
| Definitive Guide to ARM Cortex-M3/M4 | 3–5 | NVIC, exceptions |
Common Pitfalls & Debugging
- GPIO won’t toggle: Clock not enabled. Read RCC_AHB1ENR.
- Wrong UART baud: PLL/prescalers off. Verify with MCO pin.
- HardFault on interrupt: Wrong ISR name/vector table location.
Definition of Done
- SysCLK configured and measured (MCO or timer)
- UART TX/RX works at target baud
- Timer IRQ fires and acknowledged
- DMA transfer completes and verified in memory
Project 10: Simple Kernel with Multitasking
Real World Outcome
$ qemu-system-i386 -kernel kernel.elf
=== Simple Multitasking Kernel ===
Task 1: Counter A
Task 2: Counter B
[Context switch: Task1 -> Task2]
... tasks alternate under timer tick
The Core Question You’re Answering
How can one CPU appear to run multiple tasks by saving and restoring context?
Concepts You Must Understand First
- Context switch mechanics (OSTEP Ch.6–9)
- Stack frames/calling conventions (Write Great Code Vol.1 Ch.7)
- Timer interrupts for preemption (OSDev PIT)
- Task state (ready/running/blocked)
Questions to Guide Your Design
- Which registers must be saved to resume a task?
- Start cooperative or preemptive?
- Where are task stacks allocated/initialized?
- How to prevent scheduler re-entry?
Thinking Exercise
Build fake stack frame for a new task; which values must be preset so iret/ret lands in entry?
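One way to approach this exercise is to build the frame in an ordinary array first. The sketch assumes the switch path ends with `popa` followed by `iret`, that tasks run in ring 0 with code selector 0x08, and that the entry address fits in 32 bits; `task_stack_init` and the constants are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define KERNEL_CS   0x08u  /* assumed ring-0 code selector */
#define EFLAGS_INIT 0x202u /* IF set, plus reserved bit 1 which always reads as 1 */
#define FRAME_WORDS 11u    /* 8 popa registers + EIP, CS, EFLAGS */

/* Lay out the frame that `popa; iret` expects and return the initial saved ESP
   to store in the task control block. */
uint32_t *task_stack_init(uint32_t *stack_base, size_t words, uint32_t entry) {
    assert(stack_base != NULL && words >= FRAME_WORDS);
    uint32_t *sp = stack_base + words; /* x86 stacks grow down from the top */
    *--sp = EFLAGS_INIT;               /* popped last by iret */
    *--sp = KERNEL_CS;
    *--sp = entry;                     /* EIP: where the task starts running */
    for (int i = 0; i < 8; ++i) {
        *--sp = 0u;                    /* EAX..EDI, restored by popa */
    }
    return sp;
}
```

Setting IF in the saved EFLAGS matters: a task that starts with interrupts disabled never receives the timer tick, so it can never be preempted.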
The Interview Questions They’ll Ask
- Cooperative vs preemptive differences?
- Why disable interrupts during context switch?
- How does a timer ISR lead to context switch?
- What does a TCB store?
Hints in Layers
- Cooperative `yield()` swap stacks.
- Add PIT timer to call `schedule()`.
- Add ready queue and blocking API.
Books That Will Help
| Book | Chapters | What You’ll Learn |
|---|---|---|
| Operating Systems: Three Easy Pieces | 6–9 | Scheduling/context |
| Write Great Code Vol.1 | 7 | Stack frames |
Common Pitfalls & Debugging
- Task returns to nowhere: Bad initial frame. Test with infinite loop task.
- Random crashes after switch: Not saving all regs/flags. Verify pusha/popa set.
- Scheduler re-enters: Mask interrupts inside scheduler.
Definition of Done
- Two tasks alternate under timer preemption
- All general-purpose registers + flags preserved across preemptive switches (callee-saved registers alone suffice only for a cooperative `yield()`)
- Scheduler non-reentrant (guarded)
- Tasks can block and resume
Project 11: File System Driver (FAT16)
Real World Outcome
$ qemu-system-i386 -kernel kernel.elf -hda disk.img
=== FAT16 Driver Demo ===
Root: HELLO.TXT 17 bytes SUBDIR/
cat hello.txt
Hello from file!
The Core Question You’re Answering
How do raw disk sectors become files/directories without OS help?
Concepts You Must Understand First
- Block devices/sectors (OSTEP Ch.13–14)
- FAT16 layout (Microsoft FAT spec)
- Endianness parsing; PIO vs DMA (Making Embedded Systems Ch.7)
Questions to Guide Your Design
- Translate cluster → LBA?
- Handle fragmented files how?
- FAT chain end marker?
- How to cache sectors to reduce reads?
Thinking Exercise
Given 512 B/sector, 4 sectors/cluster, data at LBA 2048, what is LBA for cluster 5?
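The arithmetic behind this exercise is short enough to capture in two helpers. The sketch relies on two facts from the FAT specification: data clusters are numbered from 2, and FAT16 entries of 0xFFF8 and above mark the end of a chain. The function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* First sector (LBA) of a data cluster. Cluster numbering starts at 2,
   so cluster 2 maps to the first sector of the data region. */
uint32_t cluster_to_lba(uint32_t data_start_lba, uint32_t sectors_per_cluster,
                        uint32_t cluster) {
    assert(cluster >= 2u && sectors_per_cluster > 0u);
    return data_start_lba + (cluster - 2u) * sectors_per_cluster;
}

/* True when a FAT16 entry marks the last cluster of a file. */
bool fat16_is_end_of_chain(uint16_t fat_entry) {
    return fat_entry >= 0xFFF8u;
}
```

With the data region at LBA 2048 and 4 sectors per cluster, cluster 5 starts at LBA 2048 + (5 - 2) * 4 = 2060. Forgetting the offset of 2 is a common cause of reading the wrong file contents.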
The Interview Questions They’ll Ask
- FAT table vs directory entries difference?
- How does fragmentation affect performance?
- Why cache sectors?
- Minimum data needed from boot sector?
Hints in Layers
- Read sector 0 (BPB) and parse BPB fields.
- List root directory entries.
- Follow FAT chain to read a file.
Books That Will Help
| Book | Chapters | What You’ll Learn |
|---|---|---|
| Operating Systems: Three Easy Pieces | 13–14 | Disks & filesystems |
| Making Embedded Systems | 7 | Block I/O performance |
Common Pitfalls & Debugging
- Garbage directory: Wrong offsets/endianness. Print BPB fields.
- Reads stop early: FAT chain not followed. Print each cluster.
- Slow reads: No caching. Cache FAT + directory sectors.
Definition of Done
- BPB parsed and validated
- Root directory listed correctly
- Files read through FAT chains
- Sector caching reduces repeated reads
Project 12: UEFI Application
Real World Outcome
$ qemu-system-x86_64 -bios OVMF.fd -drive file=fat:rw:esp
UEFI APPLICATION v1.0
UEFI Version: 2.70
Screen: 1024x768 @ 32bpp
Memory: 127 MB
The Core Question You’re Answering
How do you build software that runs before any OS on modern PCs using UEFI services?
Concepts You Must Understand First
- UEFI boot phases/protocols (UEFI Spec)
- PE/COFF format basics
- GOP (Graphics Output Protocol) and file protocol
- Memory map + ExitBootServices contract
Questions to Guide Your Design
- Which protocols must you locate first (ConsoleOut, GOP, File)?
- How to handle graphics vs text?
- What must be done before `ExitBootServices()`?
- How to parse a FAT file on the ESP?
Thinking Exercise
Minimal bootloader sequence:
Locate GOP → locate File protocol → load kernel → get memory map → ExitBootServices()
What breaks if memory map key changes?
The Interview Questions They’ll Ask
- How is UEFI different from BIOS?
- What is GOP?
- Why must `ExitBootServices()` be called exactly once with fresh map?
- Role of the EFI System Partition?
Hints in Layers
- Print text via ConsoleOut.
- Draw pixels via GOP framebuffer.
- Load a file and print its size.
Books That Will Help
| Book | Chapters | What You’ll Learn |
|---|---|---|
| UEFI Specification | Boot Services, GOP | Pre-OS services |
| Write Great Code Vol.1 | 1–2 | Execution flow |
Common Pitfalls & Debugging
- App not found: Wrong PE/COFF or path. Place at EFI/BOOT/BOOTX64.EFI.
- ExitBootServices fails: Memory map key stale; re-fetch map right before call.
- Black screen: Wrong pixel format/stride; read GOP mode info.
Definition of Done
- Application loads under OVMF/real firmware
- Console and GOP output verified
- File protocol reads a file from ESP
- Memory map fetched and `ExitBootServices` succeeds
Sources
- OSDev Wiki (Bootloader, Protected Mode, Paging, Interrupts)
- AVR Workshop (GPIO, timers, UART)
- Bare Metal C (startup, linker scripts)
- Making Embedded Systems (timing, DMA, debugging)
- Write Great Code Vol.1 (execution model, stacks)
- UEFI Specification (Boot Services, GOP)
- Raspberry Pi Bare Metal Tutorial (bztsrc)
- Valvers Raspberry Pi Tutorial
- Ciro Santilli x86 Paging
- Vivonomicon STM32 Bare Metal
- AVR Bare Metal Examples
- Baeldung UEFI Programming
- Microcontroller shipment stats 2024 (~28.5B MCUs) — Semiconductor Insight: https://semiconductorinsight.com/report/global-microcontroller-unit-mcu-market/
- RISC-V market/share 2024 (APAC >55% shipments, fastest growth) — MarketReportsWorld RISC-V Cores: https://www.marketreportsworld.com/market-reports/risc-v-cores-market-14713441 and Mordor Intelligence: https://www.mordorintelligence.com/industry-reports/risc-v-tech-market