Advanced Rust Ecosystem Mastery - Real World Projects
Goal: Build an end-to-end mental model of the advanced Rust ecosystem—pinning, async state machines, custom allocators, const generics, atomics, macros, and
no_std—so you can design production-grade systems that stay safe, fast, and observable. You will learn to see how the borrow checker, layout rules, and runtime glue interact, and you will be able to trace every byte from CPU register to heap allocator to IO device. By the end, you can architect high-performance services, embedded firmware, and runtime components with confidence, measurable outcomes, and verifiable safety invariants.
Introduction
- What is Advanced Rust? It is the set of patterns, runtime mechanics, and compile-time guarantees that let you control memory, concurrency, and I/O with zero-cost abstractions.
- What problem does it solve today? Eliminates entire vulnerability classes (use-after-free, data races) while still enabling low-latency systems and embedded targets.
- What you will build across the projects: custom futures, allocators, no_std kernels, lock-free structures, macros, parsers, and a high-performance KV store.
- Scope in: ownership, pinning, async internals, allocators, type-level programming, atomics, macros, no_std.
- Scope out: GUIs, front-end ergonomics, and large-scale UI frameworks.
Big-picture architecture (data + control flow)
Application Layer (services, CLI, firmware)
│
▼
Async State Machines ──► Executors/Wakers ──► Syscalls/Drivers
│ │
▼ ▼
Type System (traits, lifetimes, const generics, macros)
│
▼
Memory Control (allocators, pinning, layout) ──► Concurrency (atomics, fences)
│
▼
Hardware / no_std (MMIO, interrupts, boot code)
How to Use This Guide
- Read the Theory Primer chapters before starting each project; the Concept Summary Table tells you what to internalize.
- Pick a learning path (beginner/intermediate/advanced) and follow the recommended project order.
- After each project, run the Definition of Done checklist and log measurable outputs (throughput, latency, code size).
- Use the Project-to-Concept Map to decide which concept to revisit when you get stuck.
Prerequisites & Background Knowledge
Essential Prerequisites (Must Have)
- Comfortable with Rust ownership/borrowing, traits, and the module system.
- Ability to read compiler errors and use cargo test, cargo bench, cargo fmt, and cargo clippy.
- Systems basics: virtual memory, caches, syscalls, and threading.
- Recommended Reading: “The Rust Programming Language” Ch. 4–6; “Programming Rust” Ch. 11 (Traits & Generics).
Helpful But Not Required
- Prior embedded work (no_std, linker scripts).
- Familiarity with CPU cache behavior and memory ordering (learned in Projects 6, 8, and 11).
Self-Assessment Questions
- Can you explain why &mut T implies uniqueness and how aliasing is prevented?
- Can you draw the async state machine produced by an async fn with two await points?
- Can you describe the difference between SeqCst and Acquire/Release ordering?
Development Environment Setup
- Required Tools: rustup (stable + nightly), cargo, llvm-tools-preview, perf or dtrace, cargo-asm, cargo-expand.
- Recommended Tools: probe-rs (embedded), cargo-flamegraph, the just task runner, cargo-criterion.
Testing Your Setup
$ rustup show
$ cargo new sanity && cd sanity && cargo test
Expected: tests pass, toolchain reports stable + nightly installed.
Time Investment
- Simple projects: 4–8 hours
- Moderate projects: 10–20 hours
- Complex projects: 20–40 hours
- Total sprint: 6–10 weeks
Important Reality Check
- Compiler errors are guide rails; expect to iterate. Performance tuning requires measurement; do not trust intuition.
Big Picture / Mental Model
Memory & execution pipeline
┌───────────────────────────────────────────────┐
│ CPU Registers (Now, 0–1 cycles) │
├───────────────────────────────────────────────┤
│ L1/L2/L3 Caches (Near, 4–40 cycles) │
├───────────────────────────────────────────────┤
│ RAM: Stack (LIFO, movable) / Heap (flexible) │
├───────────────────────────────────────────────┤
│ NVMe/SSD (Eternal, 10k+ cycles) │
└───────────────────────────────────────────────┘
Rust toolchain flow
Source ──► HIR ──► MIR ──► LLVM IR ──► Machine Code
│ │
│ ├─ Borrow checker (lifetimes, variance)
└─ Type system (traits, const generics, GATs)
Theory Primer
This section provides deep, chapter-style coverage of each core concept. Read these before starting the projects—they form the theoretical foundation you’ll apply hands-on.
Concept 1: Pinning & Self-Referential Safety
Fundamentals
Pinning is Rust’s mechanism for guaranteeing that a value will not move in memory after it has been “pinned.” This guarantee is critical for self-referential data structures—types that contain pointers to their own fields. In Rust, values are implicitly movable: assigning a struct to a new variable, passing it to a function, or pushing it into a Vec all perform bitwise copies that relocate the data to a new address. For most types, this is harmless. But when a struct contains a raw pointer that points back to one of its own fields, moving the struct invalidates that pointer, creating a dangling reference and undefined behavior.
The Pin<P> wrapper type, where P is a pointer type like &mut T or Box<T>, provides a compile-time guarantee that the pointee will not be moved. The Unpin auto-trait acts as an escape hatch: types that implement Unpin are safe to move even when wrapped in Pin, because they contain no self-references. Most types in Rust are Unpin by default. Types that contain self-references (like futures generated by async/await) are marked !Unpin using PhantomPinned, which opts them out of the Unpin trait and enforces the pinning contract.
Deep Dive
Why Moves Are Dangerous for Self-Referential Types
In C++, move semantics require explicit move constructors to update internal pointers. Rust takes a different approach: moves are always bitwise copies, and there are no move constructors. This simplicity has a cost—if a struct holds a pointer to its own field, a bitwise copy creates a new struct at a new address, but the internal pointer still points to the old (now invalid) location.
Consider a self-referential buffer:
struct SelfRef {
data: [u8; 64],
ptr: *const u8, // Points to data[0]
}
If we move this struct from address 0x1000 to 0x2000, the ptr field still contains 0x1000, which is now stale. Dereferencing it is undefined behavior.
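To see the hazard concretely, here is a small sketch reusing the SelfRef struct above; the exact addresses are build- and run-dependent, but the second line typically shows ptr still holding the old address:
fn main() {
    let mut a = SelfRef { data: [0; 64], ptr: std::ptr::null() };
    a.ptr = a.data.as_ptr(); // self-reference established at the current address
    println!("before move: data at {:p}, ptr = {:p}", a.data.as_ptr(), a.ptr);
    let b = a; // bitwise move: the array now lives at a new address
    println!("after move:  data at {:p}, ptr = {:p}", b.data.as_ptr(), b.ptr);
    // b.ptr still points at the old location, not at b.data; dereferencing it is exactly the UB described above.
}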
How Async/Await Creates Self-References
Rust’s async functions are transformed by the compiler into state machines—enums where each variant represents a suspension point. When you write:
async fn example(data: &str) {
let buffer = data.to_string();
some_future().await;
println!("{}", buffer);
}
The compiler generates a struct that stores both buffer and any references to it that span .await points. If buffer is referenced after the .await, the generated future holds both the owned String and a reference to it—a self-referential structure. The Comprehensive Rust documentation explains that the compiler transforms this into an enum with variants like Start, Waiting, and Done, each storing the variables needed for that state.
The Pin Contract
Pin<P> makes a promise: the value pointed to by P will not move until it is dropped. This is enforced through careful API design:
- Getting a Pin<P>: You can create Pin<Box<T>> using Box::pin(value), or Pin<&mut T> using the pin! macro (which pins to the stack). Once pinned, the original value is consumed and cannot be accessed except through the pinned reference.
- The Unpin Escape Hatch: If T: Unpin, then Pin<&mut T> can be freely dereferenced to &mut T, allowing movement. This is safe because Unpin types don't have self-references.
- Projection: Accessing fields of a pinned struct requires "projection"—transforming Pin<&mut Struct> into either Pin<&mut Field> (for pinned fields) or &mut Field (for unpinned fields). This must be done carefully, as projecting incorrectly can violate the pin invariant.
Structural vs. Non-Structural Pinning
When projecting fields, you must decide whether each field is “structurally pinned” (the field shares the struct’s pin guarantee) or “not structurally pinned” (the field can be moved freely). The rules are:
- If a field is structurally pinned, you can only project to Pin<&mut Field>, and Field must be treated as if it's always pinned.
- If a field is not structurally pinned, you can project to &mut Field, but you cannot ever return a Pin<&mut Field> to it.
The pin-project crate automates safe projection with a derive macro. Manual projection requires unsafe code with careful invariant maintenance.
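For comparison, a minimal sketch of projection with the pin-project derive macro (assumes the pin-project crate in Cargo.toml; the Timed wrapper is purely illustrative):
use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll};
use pin_project::pin_project;
#[pin_project]
struct Timed<F> {
    #[pin]
    inner: F,      // structurally pinned: projected as Pin<&mut F>
    label: String, // not structurally pinned: projected as &mut String
}
impl<F: Future> Future for Timed<F> {
    type Output = F::Output;
    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<F::Output> {
        let this = self.project(); // safe projection, no unsafe needed
        println!("polling {}", this.label);
        this.inner.poll(cx)
    }
}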
Drop and Pin Interactions
The Drop trait has special interactions with pinning. When implementing Drop for a type that uses structural pinning, you receive &mut self, not Pin<&mut Self>. This means you must not move out of any structurally pinned fields in your Drop implementation. The phil-opp OS tutorial notes that violating this invariant leads to undefined behavior.
Moving Futures Before First Poll
It’s worth noting that futures can be safely moved before the first poll() call. The state machine starts in an initial state that contains only the function arguments—no internal references exist yet. Self-references are only created when execution reaches an .await point that requires storing a reference across the suspension. This is why Box::pin(async { ... }) works: the future is moved into the box, then pinned, before any self-references are created.
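A tiny sketch of that ordering, where the future is moved first, pinned once, and then only ever polled through the pinned handle:
use std::future::Future;
use std::pin::Pin;
fn make_task() -> Pin<Box<dyn Future<Output = i32>>> {
    let fut = async { 40 + 2 }; // nothing has run yet, so no self-references exist
    let moved = fut;            // a plain move is fine before the first poll
    Box::pin(moved)             // pin once at its final heap address; poll only via this handle
}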
How This Fits in Projects
You’ll directly implement and understand pinning in:
- Project 1 (Manual Pin Projector): Implement projection by hand without macros
- Project 2 (Box-less Async Trait): Pin futures without heap allocation
- Project 8 (Custom Runtime): Store and pin tasks in your executor
Definitions & Key Terms
| Term | Definition |
|---|---|
| Pin<P> | A wrapper that guarantees the pointee won’t move until dropped |
| Unpin | Auto trait marking types safe to move even when pinned |
| PhantomPinned | Zero-sized type that opts a struct out of Unpin |
| Projection | Accessing fields through a Pin<&mut Self> reference |
| Structural Pinning | When a field shares its parent’s pinning guarantee |
| Self-Referential | A type containing pointers to its own fields |
Mental Model Diagram
BEFORE PINNING (move allowed) AFTER PINNING (move forbidden)
┌─────────────────────────────┐ ┌─────────────────────────────────┐
│ Stack Frame / Heap Block │ │ Pinned Location (fixed addr) │
│ ┌───────────────────────┐ │ │ ┌───────────────────────────┐ │
│ │ SelfRef @ 0x1000 │ │ pin │ │ SelfRef @ 0x2000 │ │
│ │ ├─ data: [u8; 64] │ │ ─────► │ │ ├─ data: [u8; 64] │ │
│ │ └─ ptr: 0x1000 ───────┼──┤ │ │ └─ ptr: 0x2000 ───────────┼─┤
│ │ (valid!) │ │ │ │ (still valid!) │ │
│ └───────────────────────┘ │ │ └───────────────────────────┘ │
└─────────────────────────────┘ └─────────────────────────────────┘
│ │
│ After move to 0x3000: │ Cannot move! Pin enforces:
▼ ▼
┌─────────────────────────────┐ ┌─────────────────────────────────┐
│ SelfRef @ 0x3000 │ │ Value stays at 0x2000 │
│ ├─ data: [u8; 64] │ │ ptr → 0x2000 remains valid │
│ └─ ptr: 0x1000 (DANGLING!) │ │ No UB possible │
└─────────────────────────────┘ └─────────────────────────────────┘
How It Works (Step-by-Step)
- Create the value: Construct your self-referential struct, but do NOT initialize the self-reference yet.
- Pin the value: Use Box::pin() for heap pinning or the pin!() macro for stack pinning. This moves the value to its final location and returns a Pin<P> wrapper.
- Initialize self-references: Only after pinning, initialize any fields that point to other fields. Use unsafe code with Pin::get_unchecked_mut() to get mutable access.
- Use via pinned reference: All subsequent access goes through Pin<&mut T>, ensuring no moves occur.
- Project fields safely: For each field access:
  - Structurally pinned: unsafe { self.map_unchecked_mut(|s| &mut s.field) }
  - Not pinned: unsafe { &mut self.get_unchecked_mut().field }
- Drop cleanup: When the value is dropped, the Pin wrapper is gone, but the drop code runs with the value still in place.
Invariants to maintain:
- Never call mem::swap or mem::replace on pinned values
- Never move out of Pin<&mut T> when T: !Unpin
- Projection must be consistent (a field is always structurally pinned or never)
Failure modes:
- Dangling pointer if value moves after self-reference created
- Double-free if custom Drop moves pinned fields
- Undefined behavior if Unpin is implemented incorrectly
Minimal Concrete Example
// Pseudocode demonstrating manual pin projection
struct SelfRef {
data: String,
data_ptr: *const u8, // Points to data's buffer
_pin: PhantomPinned, // Opt out of Unpin
}
impl SelfRef {
fn new(s: &str) -> Pin<Box<Self>> {
let res = SelfRef {
data: s.to_string(),
data_ptr: std::ptr::null(), // Not yet initialized
_pin: PhantomPinned,
};
let mut boxed = Box::pin(res);
// SAFETY: We don't move the struct after this
unsafe {
let self_ptr = boxed.as_mut().get_unchecked_mut();
self_ptr.data_ptr = self_ptr.data.as_ptr();
}
boxed
}
// Projection to get &str from pinned self
fn data(self: Pin<&Self>) -> &str {
// Safe: we're only reading
&self.get_ref().data
}
}
Common Misconceptions
| Misconception | Reality |
|---|---|
| “Pin prevents all mutation” | Pin only prevents moving; Pin<&mut T> still allows field mutation |
| “I need Pin for all async code” | Most async code doesn’t need manual pinning—executors handle it |
| “Pin<&T> freezes the value” | Pin<&T> just means you have a pinned shared reference; interior mutability still works |
| “Pinning a Copy type is useful” | Copy types are always Unpin and can be freely copied out of a Pin |
| “Box::pin prevents deallocation” | Box::pin just ensures the heap location is stable; the Box can still be dropped |
Check-Your-Understanding Questions
- Why can Pin<Box<T>> be moved between variables, but the T inside cannot move?
- What happens if you implement Unpin for a type that is actually self-referential?
- Why is it safe to move a Future before the first poll() call?
- How does PhantomPinned actually prevent implementing Unpin?
- What's the difference between stack pinning with pin!() and heap pinning with Box::pin()?
Check-Your-Understanding Answers
- Box pointer vs. contents: When you move a Pin<Box<T>>, you're moving the box pointer (the address stored in the Box), not the heap-allocated T. The T remains at its heap address. The Pin contract applies to the pointee, not the pointer.
- Unsound Unpin impl: If you implement Unpin for a self-referential type, you're lying to the compiler. Code can then call Pin::into_inner() to get the value back and move it freely, invalidating self-references. This is undefined behavior.
- Moving before poll: Futures are lazy—they don't execute until polled. The initial state of a generated future contains only function arguments, not internal references. Self-references are only created when execution hits an .await and needs to store references across the suspension.
- PhantomPinned mechanism: Unpin is an auto trait—it's automatically implemented for types whose fields are all Unpin. PhantomPinned is !Unpin, so including it in your struct prevents the auto-impl from firing.
- Stack vs. heap pinning: Box::pin() allocates on the heap, returning Pin<Box<T>> that owns the value. The pin!() macro creates a pinned reference to a stack value, returning Pin<&mut T>. Stack-pinned values cannot outlive their stack frame.
Real-World Applications
- Async runtimes (Tokio, async-std): All spawned tasks are pinned futures
- Intrusive linked lists: Node pointers remain valid as nodes are pinned
- Self-referential parsers: Buffer + reference to current position
- Generator/coroutine implementations: Yield points create self-references
- FFI callbacks: Passing a self pointer to C code requires a stable address
Where You’ll Apply It
| Project | How Pinning Is Used |
|---|---|
| Project 1: Manual Pin Projector | Implement projection by hand, prove address stability |
| Project 2: Box-less Async Trait | Pin futures in trait objects without heap allocation |
| Project 8: Custom Future Runtime | Store and poll pinned tasks in executor queues |
References
- “Rust for Rustaceans” by Jon Gjengset — Ch. 8: Explains async internals and pinning contracts
- Rust RFC 2349 (Pin) — Original RFC defining the pinning API
- Understanding Pin in Rust Async — Practical walkthrough
- Async/Await in Rust OS — Deep dive into async state machines
- pin-project crate docs — Safe projection derive macro
Key Insight
Pinning freezes location in memory, not mutability. It’s a spatial guarantee (address stability), not a temporal one (immutability).
Summary
Pinning (Pin<P>) guarantees that a value won’t move after being pinned, enabling safe self-referential types. The Unpin trait marks types that can be freely moved even when pinned. Async futures are the primary use case—the compiler generates self-referential state machines that require pinning. Manual projection lets you access fields through Pin<&mut Self> safely. Most application code never touches Pin directly; it’s handled by executors and macros. But understanding pinning is essential for low-level async work, custom futures, and intrusive data structures.
Homework/Exercises
- Basic projection: Write a struct with two fields—one String (not structurally pinned) and one self-referential pointer (structurally pinned). Implement safe getters for both.
- Stack vs. heap: Create the same self-referential struct using both Box::pin() and the pin!() macro. Print the addresses before and after to verify stability.
- Unpin exploration: Create a struct containing only Unpin types. Verify you can call Pin::into_inner() on it. Then add PhantomPinned and observe the compile error.
- Future anatomy: Write an async function with two .await points and a local variable used after both. Use cargo expand to see the generated state machine. Identify the self-reference.
Solutions to Exercises
Exercise 1 Solution:
struct TwoFields {
name: String, // Not structurally pinned
name_ptr: *const u8, // Structurally pinned (self-ref)
_pin: PhantomPinned,
}
impl TwoFields {
fn name(self: Pin<&Self>) -> &str {
// Safe: String is not structurally pinned, we can access it normally
&self.get_ref().name
}
fn name_ptr(self: Pin<&Self>) -> *const u8 {
// The pointer value itself can be read through shared ref
self.get_ref().name_ptr
}
}
Exercise 2 Solution:
// Heap pinning: SelfRef::new already returns Pin<Box<Self>>
let heap_pinned = SelfRef::new("hello");
println!("Heap addr: {:p}", &*heap_pinned); // Stable
// Stack pinning: std::pin::pin! pins a value to the current stack frame
let stack_pinned = std::pin::pin!(SelfRef::uninit()); // uninit(): constructor with no self-reference yet
println!("Stack addr: {:p}", &*stack_pinned); // Stable until the scope ends
Exercise 3 Solution:
struct AllUnpin { x: i32, y: String }
let pinned = Box::pin(AllUnpin { x: 42, y: "hi".into() });
let _unpinned = Pin::into_inner(pinned); // Compiles: AllUnpin: Unpin
struct NotUnpin { x: i32, _pin: PhantomPinned }
let pinned = Box::pin(NotUnpin { x: 42, _pin: PhantomPinned });
// let _unpinned = Pin::into_inner(pinned); // ERROR: NotUnpin: !Unpin
Exercise 4 Solution:
// Original async fn
async fn two_awaits() {
let local = String::from("saved");
first_await().await;
second_await().await;
println!("{}", local); // local used after both awaits
}
// Expanded (simplified) - run `cargo expand` for actual output:
enum TwoAwaitsFuture {
Start { local: String },
AfterFirst { local: String, fut: SecondFuture },
Done,
}
// The `local` field appears in multiple variants -> self-referential
// if any variant held &local across states
Concept 2: Async State Machines & Executors
Fundamentals
Rust’s async/await system is built on a fundamental insight: asynchronous operations can be modeled as state machines that are cooperatively scheduled. When you write async fn, the compiler transforms your sequential-looking code into a state machine enum that implements the Future trait. Each .await point becomes a state transition, and local variables that span suspension points are stored in the state machine’s fields. Unlike thread-based concurrency (where the OS preemptively switches contexts), async Rust uses cooperative scheduling: futures voluntarily yield control by returning Poll::Pending, and an executor decides which future to poll next.
The three core components are: Futures (state machines that represent in-progress computations), Executors (schedulers that poll futures to completion), and Wakers (handles that signal when a future should be polled again). The Future trait has a single method: poll(self: Pin<&mut Self>, cx: &mut Context) -> Poll<Self::Output>. The Pin requirement ensures self-referential futures stay valid (see Concept 1). The Context provides access to the Waker, which the future uses to request re-polling when an external event occurs.
Deep Dive
Async Function Desugaring
When the Rust compiler processes an async function, it performs a transformation called “desugaring.” According to EventHelix’s analysis, the compiler generates an enum where each variant corresponds to a suspension point. Consider:
async fn fetch_and_process(url: &str) -> Result<Data> {
let response = http_get(url).await?;
let parsed = parse(response).await?;
Ok(parsed)
}
This becomes roughly:
enum FetchAndProcessFuture<'a> {
Start { url: &'a str },
WaitingForGet { url: &'a str, get_future: HttpGetFuture },
WaitingForParse { parse_future: ParseFuture },
Done,
}
All local variables that live across .await points are stored in the enum variants. The Google Comprehensive Rust guide emphasizes that the compiler can optimize storage by analyzing which variables are needed in each state.
The Stackless Coroutine Model
Unlike stackful coroutines (Go goroutines, Lua coroutines) that allocate a separate stack per coroutine, Rust futures are stackless—they store only what’s needed for the current state. This has key implications:
- Memory efficiency: A future’s size is known at compile time (the largest variant)
- No stack switching: Yielding is just returning Pending; resuming is calling poll again
- Recursive async functions are tricky: Each recursive call adds to the future's size, potentially unboundedly. The fix is Box::pin to add indirection (see the sketch below).
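For example, a recursive countdown future can be written by boxing the recursive call (a minimal sketch, not tied to any project in this guide):
use std::future::Future;
use std::pin::Pin;
// Boxing the recursive call keeps the outer future's size finite and known at compile time.
fn countdown(n: u32) -> Pin<Box<dyn Future<Output = ()> + Send>> {
    Box::pin(async move {
        if n > 0 {
            countdown(n - 1).await;
        }
    })
}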
The Executor-Reactor Architecture
A typical async runtime has two cooperating components:
┌────────────────────────────────────────────────────────────┐
│ EXECUTOR │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Task Queue: [Task1, Task2, Task3, ...] │ │
│ └───────────────────────┬─────────────────────────────┘ │
│ │ poll() │
│ ▼ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Active Task: poll(Pin<&mut Future>, &mut Context) │ │
│ │ Returns Poll::Ready(T) or Pending │ │
│ └───────────────────────┬─────────────────────────────┘ │
│ │ │
│ ┌──────────────┴──────────────┐ │
│ ▼ ▼ │
│ Poll::Ready(T) Poll::Pending │
│ ├─ Complete task ├─ Store waker with reactor │
│ └─ Return result └─ Move to next task │
└────────────────────────────────────────────────────────────┘
▲
│ wake() callback
┌─────────────────────────┴──────────────────────────────────┐
│ REACTOR │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ epoll/kqueue/IOCP: monitors file descriptors │ │
│ │ Timer wheel: manages timeouts │ │
│ │ Event sources: signals, channels, etc. │ │
│ └─────────────────────────────────────────────────────┘ │
│ When event fires → call waker.wake() → reschedule task │
└────────────────────────────────────────────────────────────┘
The executor owns the task queue and drives futures by calling poll(). When a future returns Pending, the executor moves on to another task. The reactor monitors external event sources (network sockets, timers, signals) and calls waker.wake() when events occur, signaling the executor to re-poll the associated task.
Reactor vs. Proactor Models
- Reactor pattern (epoll/kqueue): Wait for "readiness" events, then perform I/O. Common on Linux/macOS. The application calls non-blocking read()/write() when the socket is ready.
- Proactor pattern (io_uring, IOCP): Submit I/O operations, get completion notifications. The kernel performs the I/O and notifies when done.
Tokio primarily uses the reactor pattern, but the tokio-uring and io-uring crates support the proactor model on Linux 5.1+.
The Waker Contract
A Waker is a handle that tells the executor “this task needs re-polling.” The Tokio documentation explains the contract:
- Calling waker.wake() must be safe from any thread (Waker is Send + Sync)
- Waking a completed task is harmless (executor ignores it)
- The waker may be cloned and stored for later
Internally, a Waker is a fat pointer: a data pointer and a vtable with wake, clone, and drop functions. This allows different executor implementations to use different scheduling strategies.
Cooperative Scheduling and Fairness
Async Rust is cooperatively scheduled—futures must voluntarily yield by returning Pending. A future that runs a long computation without awaiting will block the entire executor thread (“task starvation”). Tokio addresses this with:
- Task budgeting: Each task gets a “budget” of operations before being forced to yield
- tokio::task::yield_now(): Explicit yield point for CPU-bound work (sketched below)
- tokio::task::spawn_blocking(): Move blocking work to a dedicated thread pool
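A minimal sketch of the yield_now pattern on the Tokio runtime; expensive() here is a placeholder for real CPU-bound work:
async fn crunch(data: Vec<u64>) -> u64 {
    let mut sum = 0u64;
    for (i, x) in data.iter().enumerate() {
        sum = sum.wrapping_add(expensive(*x));
        if i % 1024 == 0 {
            tokio::task::yield_now().await; // give the executor a chance to run other tasks
        }
    }
    sum
}
fn expensive(x: u64) -> u64 { x.rotate_left(7) ^ x } // placeholder computation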
Cancellation
Dropping a future cancels it. Unlike other languages (C# CancellationToken, Go context.Context), Rust’s cancellation is implicit: when you drop a future, it stops being polled. However, this has implications:
- Drop runs destructors for all state machine fields
- Resources held by dropped futures are released
- If a future is dropped mid-operation, partial work may be lost
- Async drop is not yet supported—cleanup runs synchronously
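A common place to see implicit cancellation is tokio::select!: the branch that loses the race is simply dropped. A hedged sketch (assumes Tokio with the time and macros features; fetch() is a stand-in for real I/O):
use std::time::Duration;
async fn fetch() -> String {
    String::from("response") // placeholder for real I/O
}
async fn fetch_with_timeout() -> Option<String> {
    tokio::select! {
        body = fetch() => Some(body),
        _ = tokio::time::sleep(Duration::from_secs(1)) => None, // the fetch() future is dropped (cancelled) here
    }
}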
Send + ‘static Bounds
When spawning a task with tokio::spawn(), the future must be Send + 'static:
- Send: The future can be moved to another thread (work-stealing)
- ‘static: No borrowed references that might dangle
This explains common errors like "future cannot be sent between threads safely"—usually caused by holding a non-Send value (such as an Rc, or a RefCell borrow guard) across an .await, as the sketch below shows.
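A minimal sketch of the error and its usual fix, with tick() standing in for any await point:
use std::rc::Rc;
use std::sync::Arc;
async fn tick() {} // placeholder await point (a timer or socket read in real code)
async fn not_send() {
    let shared = Rc::new(1);
    tick().await;             // `shared` is live across the await, so this future is !Send
    println!("{shared}");
}
async fn send_ok() {
    let shared = Arc::new(1); // Arc is Send + Sync, so the future stays Send
    tick().await;
    println!("{shared}");
}
// tokio::spawn(not_send()); // error: future cannot be sent between threads safely
// tokio::spawn(send_ok());  // compiles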
How This Fits in Projects
You’ll build and understand async internals in:
- Project 2 (Box-less Async Trait): Implement async traits without heap allocation
- Project 8 (Custom Future Runtime): Build a complete executor with reactor integration
- Project 11 (no_std Game Boy Core): Simple async scheduler for embedded
Definitions & Key Terms
| Term | Definition |
|---|---|
| Future | A trait representing an asynchronous computation that yields a value |
| Executor | A scheduler that polls futures to completion |
| Waker | A handle to signal that a future should be re-polled |
| Reactor | Component that monitors external events and wakes futures |
| Poll::Pending | Return value indicating “not ready yet, I’ll wake you” |
| Poll::Ready(T) | Return value indicating completion with result T |
| Context | Provides access to the current task’s Waker |
| Task | A future wrapped with scheduling metadata (waker, queue position) |
| Cooperative Scheduling | Futures voluntarily yield; no preemption |
Mental Model Diagram
ASYNC FUNCTION TRANSFORMATION
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Source Code: Compiled State Machine:
┌──────────────────────┐ ┌────────────────────────────────────┐
│ async fn example() { │ │ enum ExampleFuture { │
│ let a = foo().await│ ───► │ State0 { }, │
│ let b = bar().await│ │ State1 { a: A, bar_fut: BarFut },│
│ a + b │ │ State2 { a: A, b: B }, │
│ } │ │ Complete, │
└──────────────────────┘ │ } │
└────────────────────────────────────┘
EXECUTION FLOW
━━━━━━━━━━━━━━
Executor Future Reactor
│ │ │
│──── poll() ──────────►│ │
│ │ │
│◄─── Pending ──────────│── register waker ────►│
│ (waiting for IO) │ │
│ │ │
│ (poll other tasks) │ │
│ │ │
│ │◄──── wake() ──────────│ (IO ready)
│◄─────────────────────────────────────────────│
│ │ │
│──── poll() ──────────►│ │
│ │ │
│◄─── Ready(value) ─────│ │
│ │ │
▼ ▼ ▼
Task complete Future dropped Event consumed
How It Works (Step-by-Step)
- Create future: Call an async function; this returns a Future (it doesn't execute yet)
- Spawn task: Pass the future to an executor (e.g., tokio::spawn()). The executor wraps it with metadata (waker, queue position)
- Initial poll: The executor calls poll() on the future with a Context containing the waker
- State machine executes: The future's poll impl runs until it hits an .await:
  - If the awaited future is Ready: consume the result, continue to the next statement
  - If the awaited future is Pending: register the waker with the reactor, return Pending
- Wait for event: The executor runs other tasks. The reactor monitors registered events (sockets, timers)
- Event fires: The reactor calls waker.wake(), signaling the executor
- Re-poll: The executor moves the task back to the ready queue and polls it again
- Complete: When the outermost future returns Ready(T), the task is done
Invariants:
- A future must eventually return Ready or be dropped (no infinite Pending)
- Wakers must be registered before returning Pending
- A completed future must not be polled again (it may panic or otherwise misbehave)
- Futures must be pinned before polling (see Concept 1)
Failure modes:
- Starvation: CPU-bound future never yields, blocks other tasks
- Lost wake: Waker not registered before Pending; task never re-polled
- Deadlock: Awaiting on yourself or circular dependencies
- Resource leaks: Future dropped without cleanup (async Drop not yet stable)
Minimal Concrete Example
// Pseudocode: A minimal single-threaded executor
struct Executor {
ready_queue: VecDeque<Task>,
}
struct Task {
future: Pin<Box<dyn Future<Output = ()>>>,
waker: Waker,
}
impl Executor {
fn run(&mut self) {
loop {
while let Some(mut task) = self.ready_queue.pop_front() {
let waker = task.waker.clone();
let mut cx = Context::from_waker(&waker);
match task.future.as_mut().poll(&mut cx) {
Poll::Ready(()) => {
// Task complete, drop it
}
Poll::Pending => {
// Task will be re-queued when waker.wake() is called
}
}
}
// Block on reactor (epoll_wait, etc.) until wakers fire
self.reactor.wait();
}
}
}
// Waker implementation pushes task back to ready_queue
fn wake(task: Arc<Task>) {
executor.ready_queue.push_back(task);
}
Common Misconceptions
| Misconception | Reality |
|---|---|
| “await spawns a thread” | await is just a yield point; no threads are created |
| “Pending means the future failed” | Pending means “not ready yet”; it will be polled again |
| “Wakers are callbacks” | Wakers signal “re-poll me”; they don’t execute code |
| “Async is always faster” | Async adds overhead; it’s better for I/O-bound, not CPU-bound work |
| “Blocking in async is fine” | Blocking stalls the executor thread; use spawn_blocking |
| “Futures run when created” | Futures are lazy; they only run when polled |
Check-Your-Understanding Questions
- Why does poll() take Pin<&mut Self> instead of &mut self?
- What happens if a future returns Pending without registering a waker?
- Why must wakers be Send + Sync in multi-threaded executors?
- How does cooperative scheduling differ from preemptive scheduling?
- What's the difference between a reactor and a proactor?
- Why does tokio::spawn() require Send + 'static bounds?
Check-Your-Understanding Answers
- Pin requirement: Async futures may be self-referential (storing references to their own fields across await points). Pin guarantees the future won't move in memory, keeping these internal references valid.
- Lost wake: If no waker is registered and the future returns Pending, the executor has no way to know when to re-poll it. The task is effectively "lost" and will never complete. This is a serious bug.
- Send + Sync wakers: In work-stealing executors, a task might be polled on thread A, return Pending, then the waker is called from thread B (e.g., by the reactor). The waker must be safe to invoke from any thread.
- Cooperative vs. preemptive: Cooperative scheduling relies on tasks voluntarily yielding (returning Pending). Preemptive scheduling (like OS threads) forcibly interrupts tasks at arbitrary points. Cooperative is more efficient (no context switch overhead) but requires well-behaved tasks.
- Reactor vs. proactor: A reactor waits for "readiness" events (socket is readable) then the application does I/O. A proactor submits I/O operations to the kernel and gets "completion" events. Proactor offloads more work to the kernel.
- Send + 'static bounds: Send allows the executor to move the future to any thread (for work-stealing). 'static ensures no borrowed references that might dangle if the spawning scope exits before the task completes.
Real-World Applications
- Web servers (Axum, Actix, Warp): Handle thousands of concurrent connections efficiently
- Database clients: Non-blocking queries with connection pooling
- Message queues: Async consumers processing events
- Microservices: Concurrent HTTP/gRPC calls without thread-per-request
- Embedded systems: Event-driven firmware on single-core MCUs
Where You’ll Apply It
| Project | How Async Internals Are Used |
|---|---|
| Project 2: Box-less Async Trait | Implement Future trait manually, understand state machine |
| Project 8: Custom Future Runtime | Build executor, reactor, waker from scratch |
| Project 11: no_std Game Boy Core | Simple timer-based scheduler for embedded |
References
- “Rust for Rustaceans” by Jon Gjengset — Ch. 8: Deep dive into async internals
- Tokio Tutorial: Async in Depth — Official runtime documentation
- Async Book — Comprehensive async programming guide
- Understanding Async/Await: State Machines to Assembly — Low-level compilation analysis
- Comprehensive Rust: State Machine — Google’s async teaching materials
- How to Think About Async/Await — Mental model for async programming
Key Insight
Async is not about parallelism—it’s about efficient multiplexing of I/O-bound tasks on a limited number of threads through cooperative state machine transformations.
Summary
Rust’s async system transforms async functions into state machine enums at compile time. Executors drive these futures by calling poll() until they return Ready. Wakers provide the callback mechanism for reactors to signal when I/O is available. The cooperative scheduling model requires futures to voluntarily yield—blocking operations starve other tasks. Understanding this machinery is essential for building custom runtimes, debugging performance issues, and implementing efficient async abstractions.
Homework/Exercises
- State machine visualization: Write an async function with 3 await points. Use cargo expand to see the generated enum. Calculate the maximum size of the future.
- Simple waker: Implement a Waker that sets an AtomicBool flag. Write a test that polls a future, checks the flag after Pending, and re-polls.
- Timer future: Build a Sleep future that completes after a duration. Use std::thread::sleep in a spawned thread to wake the future.
- Minimal executor: Write a single-threaded executor that can run multiple futures to completion. Use a VecDeque as the task queue.
Solutions to Exercises
Exercise 1 Solution:
// Run: cargo expand --example three_awaits
async fn three_awaits() -> i32 {
let a = first().await; // Await point 1
let b = second().await; // Await point 2
let c = third().await; // Await point 3
a + b + c
}
// Expanded (simplified):
enum ThreeAwaitsFuture {
State0, // Before any await
State1 { first_fut: FirstFuture }, // Waiting for first()
State2 { a: i32, second_fut: SecondFuture }, // Waiting for second()
State3 { a: i32, b: i32, third_fut: ThirdFuture }, // Waiting for third()
Complete,
}
// Size = max(size of each variant) + discriminant
// Typically the last variant (with most accumulated state) is largest
Exercise 2 Solution:
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::task::{Waker, RawWaker, RawWakerVTable};
fn flag_waker(flag: Arc<AtomicBool>) -> Waker {
    unsafe fn clone(data: *const ()) -> RawWaker {
        Arc::increment_strong_count(data as *const AtomicBool); // keep the refcount in sync
        RawWaker::new(data, &VTABLE)
    }
    unsafe fn wake(data: *const ()) {
        Arc::from_raw(data as *const AtomicBool).store(true, Ordering::SeqCst); // consumes one Arc
    }
    unsafe fn wake_by_ref(data: *const ()) {
        (*(data as *const AtomicBool)).store(true, Ordering::SeqCst); // borrows, does not consume
    }
    unsafe fn drop_waker(data: *const ()) {
        drop(Arc::from_raw(data as *const AtomicBool));
    }
    static VTABLE: RawWakerVTable = RawWakerVTable::new(clone, wake, wake_by_ref, drop_waker);
    let raw = RawWaker::new(Arc::into_raw(flag) as *const (), &VTABLE);
    unsafe { Waker::from_raw(raw) }
}
Exercise 3 Solution:
use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll};
use std::time::Instant;

struct Sleep {
    deadline: Instant,
}
impl Future for Sleep {
    type Output = ();
    fn poll(self: Pin<&mut Self>, cx: &mut Context) -> Poll<()> {
        if Instant::now() >= self.deadline {
            return Poll::Ready(());
        }
        // Clone the waker and wake it from a helper thread once the deadline passes.
        // (A real timer would register with a reactor instead of spawning a thread per poll.)
        let waker = cx.waker().clone();
        let deadline = self.deadline;
        std::thread::spawn(move || {
            std::thread::sleep(deadline.saturating_duration_since(Instant::now()));
            waker.wake();
        });
        Poll::Pending
    }
}
Exercise 4 Solution:
struct MiniExecutor {
queue: VecDeque<Pin<Box<dyn Future<Output = ()>>>>,
}
impl MiniExecutor {
fn spawn(&mut self, fut: impl Future<Output = ()> + 'static) {
self.queue.push_back(Box::pin(fut));
}
fn run(&mut self) {
while let Some(mut fut) = self.queue.pop_front() {
// A Context needs some waker; a no-op waker is enough here because the run
// loop below re-queues Pending futures itself (busy-polling for simplicity).
let waker = futures::task::noop_waker(); // e.g. from the `futures` crate
let mut cx = Context::from_waker(&waker);
match fut.as_mut().poll(&mut cx) {
Poll::Ready(()) => { /* done */ }
Poll::Pending => self.queue.push_back(fut),
}
}
}
}
Concept 3: Memory Control & Custom Allocators
Fundamentals
Memory allocation is one of the most impactful performance decisions in systems programming. The default allocator (malloc/free or Rust’s System) is designed for general-purpose use—it handles arbitrary sizes, arbitrary lifetimes, and thread safety. But this generality comes at a cost: fragmentation, lock contention, and cache misses. Custom allocators let you trade generality for speed by exploiting known patterns in your workload.
Rust provides two allocator traits: GlobalAlloc (the low-level trait for replacing the system allocator) and Allocator (the newer, more flexible trait for per-collection allocators). The GlobalAlloc trait requires implementing alloc and dealloc methods that work with raw pointers and Layout descriptors. The Layout type encodes size and alignment requirements—the two fundamental properties of any allocation.
Different allocator strategies optimize for different patterns: bump allocators are fast for short-lived, sequential allocations that reset together; slab allocators eliminate fragmentation for fixed-size objects; buddy allocators balance flexibility and fragmentation; pool allocators recycle objects without calling the underlying allocator.
Deep Dive
The GlobalAlloc Trait
According to the Rust documentation, GlobalAlloc is an unsafe trait because implementations must uphold strict invariants:
pub unsafe trait GlobalAlloc {
unsafe fn alloc(&self, layout: Layout) -> *mut u8;
unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout);
// Optional, with default implementations
unsafe fn alloc_zeroed(&self, layout: Layout) -> *mut u8;
unsafe fn realloc(&self, ptr: *mut u8, layout: Layout, new_size: usize) -> *mut u8;
}
The caller must ensure:
- layout.size() > 0 (zero-sized allocations are undefined behavior)
- dealloc is called with the exact same layout used for alloc
- The pointer passed to dealloc came from alloc on this allocator
- The allocator must not panic (unwinding is UB for global allocators)
Memory Layout and Alignment
Every allocation has two properties: size (how many bytes) and alignment (what addresses are valid). Alignment must be a power of two, and the returned pointer must be a multiple of the alignment. Common alignments:
- u8: 1 byte
- u32: 4 bytes
- u64/pointers: 8 bytes on 64-bit
- SIMD types: 16, 32, or 64 bytes
- Cache line: typically 64 bytes
The Layout type encodes both:
let layout = Layout::from_size_align(size, align).unwrap();
// or
let layout = Layout::new::<MyStruct>(); // Inferred from type
Misaligned access causes crashes on some architectures (ARM) and performance penalties on others (x86).
Allocator Strategies
1. Bump Allocator (Arena)
The simplest allocator: maintain a pointer that “bumps” forward on each allocation. Deallocation is a no-op—you reset the entire arena at once.
┌──────────────────────────────────────────────────────────┐
│ Arena Memory Region │
│ │
│ [Alloc 1][Alloc 2][Alloc 3][ Free space ... ] │
│ ▲ │
│ │ bump_ptr │
│ │ │
│ alloc(16) → bump_ptr += 16 + padding │
│ reset() → bump_ptr = start │
└──────────────────────────────────────────────────────────┘
Advantages:
- O(1) allocation (just pointer increment)
- Zero fragmentation (everything freed together)
- Excellent cache locality (sequential access)
Disadvantages:
- No individual deallocation
- Fixed maximum size
Use cases: Per-request allocations in web servers, compiler AST nodes, game frame allocations.
The Bumpalo crate provides a production-ready implementation.
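A short usage sketch with bumpalo (assumes the bumpalo crate is added as a dependency):
use bumpalo::Bump;
fn per_request_work() {
    let arena = Bump::new();
    let n: &mut u64 = arena.alloc(42);          // O(1): just a pointer bump
    let s: &mut str = arena.alloc_str("hello"); // strings and slices live in the arena too
    println!("{n} {s}");
} // arena dropped here: every allocation is freed at once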
2. Slab Allocator
Pre-allocates blocks of fixed size. Great for allocating many objects of the same type.
┌─────────────────────────────────────────────────────────┐
│ Slab (fixed-size slots) │
│ │
│ [Used][Free][Used][Free][Used][Free][Free][Free] │
│ ▲ ▲ │
│ │ │ │
│ └──────────┴──── Free list (linked via slots) │
│ │
│ alloc() → pop from free list │
│ dealloc() → push to free list │
└─────────────────────────────────────────────────────────┘
Advantages:
- O(1) allocation and deallocation
- Zero external fragmentation
- Objects are reusable (reduced syscalls)
Disadvantages:
- Internal fragmentation if object < slot size
- Fixed object size per slab
Use cases: Network packet buffers, connection objects, tree/graph nodes.
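To make the free-list idea concrete, here is a minimal index-based slab sketch (a simplified cousin of the slab crate, not production code):
// Each slot is either occupied or a link in the free list (by index).
enum Slot<T> {
    Free { next: Option<usize> },
    Used(T),
}
struct Slab<T> {
    slots: Vec<Slot<T>>,
    free_head: Option<usize>,
}
impl<T> Slab<T> {
    fn new() -> Self {
        Slab { slots: Vec::new(), free_head: None }
    }
    // O(1): pop a slot off the free list, or grow the backing Vec.
    fn insert(&mut self, value: T) -> usize {
        match self.free_head {
            Some(i) => {
                if let Slot::Free { next } = &self.slots[i] {
                    self.free_head = *next;
                }
                self.slots[i] = Slot::Used(value);
                i
            }
            None => {
                self.slots.push(Slot::Used(value));
                self.slots.len() - 1
            }
        }
    }
    // O(1): return the value and push the slot onto the free list.
    fn remove(&mut self, i: usize) -> Option<T> {
        if matches!(self.slots[i], Slot::Free { .. }) {
            return None; // already free
        }
        let old = std::mem::replace(&mut self.slots[i], Slot::Free { next: self.free_head });
        self.free_head = Some(i);
        match old {
            Slot::Used(v) => Some(v),
            Slot::Free { .. } => unreachable!(),
        }
    }
}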
3. Buddy Allocator
Divides memory into power-of-two sized blocks. When allocating, find the smallest block that fits; when freeing, merge adjacent “buddy” blocks.
Initial: [───────────────────── 1024 ─────────────────────]
After alloc(100): Split recursively until 128-byte block
[128*][128 ][────── 256 ──────][────────── 512 ──────────]
▲ allocated
After alloc(200): Need 256-byte block
[128*][128 ][256*──────────────][────────── 512 ──────────]
▲ allocated
After free(first): Check if buddy is free → merge if possible
[128 ][128 ][256*──────────────][────────── 512 ──────────]
Advantages:
- Handles variable sizes efficiently
- Automatic coalescing reduces fragmentation
- O(log n) allocation/deallocation
Disadvantages:
- Internal fragmentation (next power of two)
- More complex than bump/slab
Use cases: Kernel page allocation, general-purpose with bounded fragmentation.
Cache Lines and False Sharing
Modern CPUs load memory in cache lines (typically 64 bytes). When two threads access different variables on the same cache line, they cause “false sharing”—the cache line bounces between CPU caches even though there’s no actual data sharing.
Thread A writes x Thread B writes y
│ │
▼ ▼
┌───────────────────────────────────────────┐
│ Cache Line (64 bytes) │
│ [x: 8 bytes][padding...][y: 8 bytes][...] │
└───────────────────────────────────────────┘
│ │
└──── INVALIDATES ────────┘
each other!
Solution: Align to cache lines
┌─────────────────────────────────────┐ ┌─────────────────────────────────────┐
│ Cache Line 1 │ │ Cache Line 2 │
│ [x: 8 bytes][padding to 64 bytes ] │ │ [y: 8 bytes][padding to 64 bytes ] │
└─────────────────────────────────────┘ └─────────────────────────────────────┘
In Rust:
#[repr(align(64))]
struct CacheAligned<T>(T);
Measuring Allocator Impact
Tools for profiling allocation:
- perf stat -e cache-misses — Cache miss rates
- heaptrack — Allocation profiling
- jemalloc + MALLOC_CONF=prof:true — Detailed heap profiles
- valgrind --tool=massif — Heap snapshots over time
The blog post by nical on custom allocators demonstrates real-world performance gains from strategic allocator use.
Thread Safety Considerations
Global allocators must be thread-safe. Since GlobalAlloc methods take &self (not &mut self), you need interior mutability. Options:
- Spinlock: Simple but can spin under contention
- Thread-local caches: Each thread has its own bump pointer; refill from shared pool
- Lock-free: Complex but highest throughput
The os.phil-opp allocator-designs tutorial walks through implementing several allocator types for a kernel.
How This Fits in Projects
You’ll implement and use custom allocators in:
- Project 3 (Custom Arena Allocator): Build a bump allocator with reset semantics
- Project 6 (Lock-Free Queue): Custom node allocation with epoch-based reclamation
- Project 12 (High-Performance KV Store): Arena for index, slab for value buffers
Definitions & Key Terms
| Term | Definition |
|---|---|
| Layout | Size and alignment requirements for an allocation |
| Alignment | Memory address must be divisible by this power of two |
| Fragmentation | Wasted memory from allocation patterns (internal or external) |
| Arena/Bump | Allocator that only moves pointer forward; bulk reset |
| Slab | Fixed-size slot allocator with free list |
| Buddy | Power-of-two block allocator with merging |
| False Sharing | Cache invalidation from unrelated variables on same cache line |
| GlobalAlloc | Trait for replacing Rust’s global memory allocator |
Mental Model Diagram
ALLOCATOR STRATEGY COMPARISON
━━━━━━━━━━━━━━━━━━━━━━━━━━━━
BUMP/ARENA SLAB
┌───────────────────────────┐ ┌───────────────────────────┐
│ Memory Region │ │ Fixed-Size Slots │
│ │ │ │
│ [A][B][C][ free... ] │ │ [U][F][U][F][U][F][F][F] │
│ ▲ bump_ptr │ │ │ │ │
│ │ │ └──┬──┘ free list │
│ + Fast (ptr increment) │ │ │ │
│ + No fragmentation │ │ + Fast (pop/push) │
│ - No individual free │ │ + No external frag │
│ │ │ - Only one size │
└───────────────────────────┘ └───────────────────────────┘
BUDDY GENERAL PURPOSE (malloc)
┌───────────────────────────┐ ┌───────────────────────────┐
│ Power-of-Two Blocks │ │ Mixed Sizes │
│ │ │ │
│ [128][128][256 ][ 512 ]│ │ [A][B][ ][C][ ][D][E][] │
│ │ │ │ │ ▲ ▲ │
│ └─┘ buddy pair │ │ fragmentation │
│ │ │ │
│ + Variable sizes │ │ + Fully general │
│ + Coalescing │ │ - Fragmentation │
│ - Internal frag (round up)│ │ - Lock contention │
└───────────────────────────┘ └───────────────────────────┘
How It Works (Step-by-Step)
Building a Bump Allocator:
- Reserve memory: mmap or allocate a large byte array
- Track state: Store start, end, and current pointers
- Allocate: Round current up for alignment, bump forward by the size, return the old pointer
- Handle overflow: Return null or grow if the allocation exceeds end
- Reset: Set current = start (optionally zero the memory)
- Register: Use #[global_allocator] to make it the default
fn alloc(&self, layout: Layout) -> *mut u8 {
let current = self.current.load(Ordering::Relaxed);
let aligned = align_up(current, layout.align());
let new_current = aligned + layout.size();
if new_current > self.end {
return null_mut(); // Out of memory
}
// CAS loop for thread safety
match self.current.compare_exchange_weak(...) {
Ok(_) => aligned as *mut u8,
Err(_) => self.alloc(layout), // Retry
}
}
Invariants:
- alloc must return an aligned pointer or null
- dealloc must be called with a matching layout
- Thread safety via atomics or locks
- Never read from uninitialized allocator state
Failure modes:
- Out of memory (arena exhausted)
- Use-after-reset (dangling pointer)
- Double-free (dealloc called twice)
- Alignment violation (crash on ARM, slow on x86)
Minimal Concrete Example
// Pseudocode: A simple thread-local bump allocator
use std::cell::{Cell, UnsafeCell};

struct BumpAllocator {
    arena: UnsafeCell<[u8; 1024 * 1024]>, // 1 MB arena; UnsafeCell allows writes through &self
    offset: Cell<usize>,
}
impl BumpAllocator {
    fn alloc(&self, size: usize, align: usize) -> *mut u8 {
        let offset = self.offset.get();
        let aligned = (offset + align - 1) & !(align - 1); // round up to the requested alignment
        let new_offset = aligned + size;
        if new_offset > 1024 * 1024 {
            return std::ptr::null_mut(); // arena exhausted
        }
        self.offset.set(new_offset);
        unsafe { (*self.arena.get()).as_mut_ptr().add(aligned) }
    }
    fn reset(&self) {
        self.offset.set(0);
    }
}
// Usage pattern
let allocator = BumpAllocator::new();
let ptr1 = allocator.alloc(100, 8);
let ptr2 = allocator.alloc(200, 16);
// ... use allocations ...
allocator.reset(); // All allocations freed at once
Common Misconceptions
| Misconception | Reality |
|---|---|
| “Custom allocators are always faster” | They’re faster only when the workload matches the strategy |
| “Vec::with_capacity is a custom allocator” | It pre-allocates but still uses the global allocator |
| “Rust has no malloc” | Rust uses the system allocator by default (malloc/jemalloc) |
| “Arenas leak memory” | Arenas defer freeing; they don’t leak if reset/dropped properly |
| “Zero-cost abstractions never allocate” | Many abstractions (Box, Vec, String) allocate by default |
Check-Your-Understanding Questions
- Why must GlobalAlloc::alloc never panic?
- What happens if you call dealloc with a different Layout than alloc?
- Why does a bump allocator have O(1) allocation?
- When would you prefer a slab allocator over an arena?
- How does false sharing affect multi-threaded performance?
- Why do allocator methods take &self instead of &mut self?
Check-Your-Understanding Answers
- No panic: Global allocators can be called from any context, including unwinding code. Panicking during a panic is undefined behavior. Allocation failure must return null or call handle_alloc_error.
- Wrong layout: Undefined behavior. The allocator may corrupt its internal state, return wrong memory, or crash. Always use Layout::new::<T>() for type-safe layouts.
- O(1) bump: Allocation is just pointer arithmetic (align + increment). No free list traversal, no block splitting, no locks in the single-threaded case.
- Slab vs arena: Use slab when objects have independent lifetimes (created/destroyed individually). Use arena when objects share a lifecycle (all freed together).
- False sharing: When threads modify different variables on the same cache line, the line bounces between CPU caches. This causes massive slowdowns—up to 100x in extreme cases.
- &self interior mutability: Allocators must work from immutable references because Box::new() and other allocating APIs don't have mutable access to the allocator. Use Mutex, AtomicPtr, or other interior mutability patterns.
Real-World Applications
- Game engines: Frame allocators reset each frame; object pools for entities
- Web servers: Per-request arenas for parsing and response building
- Compilers: Arenas for AST nodes, type tables, symbol tables
- Databases: Buffer pools for pages, slab for connection objects
- Embedded: Static allocators for deterministic memory use
Where You’ll Apply It
| Project | How Allocators Are Used |
|---|---|
| Project 3: Custom Arena Allocator | Build bump allocator from scratch |
| Project 6: Lock-Free Queue | Custom node recycling with epoch GC |
| Project 12: High-Performance KV Store | Arena for index structures |
References
- std::alloc module — Official allocator documentation
- Allocator Designs (os.phil-opp) — Tutorial implementing kernel allocators
- Custom Allocators in Rust — Real-world performance analysis
- Global Allocator Interception — Wrapping allocators for instrumentation
- “The Linux Programming Interface” by Kerrisk — Ch. 7: Memory Allocation
- Bumpalo crate docs — Production bump allocator implementation
Key Insight
Allocation strategy is a performance multiplier—match the allocator to your object lifetimes and size distribution.
Summary
Custom allocators trade generality for performance by exploiting known allocation patterns. Bump/arena allocators excel for short-lived, bulk-freed allocations. Slab allocators eliminate fragmentation for fixed-size objects. The GlobalAlloc trait enables replacing the system allocator, while the Allocator trait allows per-collection customization. Understanding Layout, alignment, and cache behavior is essential for high-performance systems. Always measure—custom allocators add complexity that’s only worthwhile when the workload matches.
Homework/Exercises
- Counting wrapper: Implement a GlobalAlloc that wraps the system allocator and counts total bytes allocated/deallocated. Print stats on drop.
- Bump with reset: Build a bump allocator with a reset() method. Allocate 1000 objects, reset, allocate 1000 more. Verify memory reuse.
- Alignment exploration: Create allocations with different alignments (1, 8, 64). Print the returned pointers and verify alignment.
- Cache line padding: Benchmark a counter incremented by 8 threads. First without padding, then with cache-line-aligned counters. Measure the speedup.
Solutions to Exercises
Exercise 1 Solution:
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};
struct CountingAlloc {
allocated: AtomicUsize,
deallocated: AtomicUsize,
}
unsafe impl GlobalAlloc for CountingAlloc {
unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
self.allocated.fetch_add(layout.size(), Ordering::Relaxed);
System.alloc(layout)
}
unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
self.deallocated.fetch_add(layout.size(), Ordering::Relaxed);
System.dealloc(ptr, layout)
}
}
#[global_allocator]
static ALLOC: CountingAlloc = CountingAlloc {
allocated: AtomicUsize::new(0),
deallocated: AtomicUsize::new(0),
};
// At end: print ALLOC.allocated.load() and ALLOC.deallocated.load()
Exercise 2 Solution:
struct BumpAlloc {
memory: UnsafeCell<Vec<u8>>,
offset: AtomicUsize,
}
impl BumpAlloc {
fn alloc(&self, layout: Layout) -> *mut u8 {
let offset = self.offset.load(Ordering::Acquire);
let aligned = (offset + layout.align() - 1) & !(layout.align() - 1);
let new_offset = aligned + layout.size();
if self.offset.compare_exchange(offset, new_offset, ...).is_ok() {
unsafe { (*self.memory.get()).as_mut_ptr().add(aligned) }
} else {
self.alloc(layout) // Retry
}
}
fn reset(&self) {
self.offset.store(0, Ordering::Release);
// Optional: zero memory for security
}
}
Exercise 3 Solution:
// A 64-byte alignment has to be requested explicitly; a plain [u8; 64] is only 1-aligned.
#[repr(align(64))]
struct Aligned64([u8; 64]);

fn check_alignment() {
    let align1 = Box::new(1u8);
    let align8 = Box::new(1u64);
    let align64 = Box::new(Aligned64([0; 64]));
    println!("u8 ptr:        {:p} (align 1)", &*align1);
    println!("u64 ptr:       {:p} (align 8)", &*align8);
    println!("Aligned64 ptr: {:p} (align 64)", &*align64);
    // Verify: the address must be a multiple of the alignment
    assert_eq!(&*align8 as *const _ as usize % 8, 0);
    assert_eq!(&*align64 as *const _ as usize % 64, 0);
}
Exercise 4 Solution:
// Without padding (false sharing)
struct Counters {
c: [AtomicUsize; 8], // All on ~1 cache line
}
// With cache line padding
#[repr(align(64))]
struct PaddedCounter(AtomicUsize);
struct PaddedCounters {
c: [PaddedCounter; 8], // Each on separate cache line
}
// Benchmark: 8 threads increment their counter 10M times
// Result: PaddedCounters is typically 10-100x faster
Concept 4: Type System Power (Const Generics, GATs, Lifetimes)
Fundamentals
Rust’s type system is one of the most expressive in mainstream programming languages. It goes far beyond simple type checking—it can encode invariants, state machines, dimensional analysis, and ownership patterns directly in types, catching errors at compile time that would be runtime bugs in other languages. The three major advanced features are: const generics (compile-time integer/bool parameters), GATs (Generic Associated Types) (associated types with their own generic parameters), and advanced lifetime bounds (including higher-ranked trait bounds for universal quantification over lifetimes).
Const generics allow types to be parameterized by values, not just types. Array<T, const N: usize> is an array of exactly N elements—the compiler knows the size at compile time and can reject operations that would violate bounds. GATs enable associated types that vary with lifetime or type parameters, crucial for expressing lending iterators and async traits. Higher-ranked trait bounds (for<'a> Fn(&'a T)) express “this works for any lifetime,” enabling flexible APIs that work with borrowed data.
Together, these features let you move invariant checking from runtime to compile time—type-safe matrices that refuse incompatible dimensions, protocol state machines that prevent illegal transitions, and zero-cost unit systems that catch Mars Climate Orbiter-style bugs.
Deep Dive
Const Generics: Values in Types
Rust 1.51+ supports const generics, allowing types to be parameterized by compile-time constants:
struct Array<T, const N: usize> {
data: [T; N],
}
impl<T, const N: usize> Array<T, N> {
fn len(&self) -> usize { N } // Known at compile time
}
// Usage
let a: Array<i32, 5> = Array { data: [0; 5] };
let b: Array<i32, 10> = Array { data: [0; 10] };
// a and b are DIFFERENT types
This enables:
- Stack-allocated arrays of any size (replacing Vec when the size is known)
- Dimension-checked matrices (see Project 5)
- Fixed-capacity containers (no heap allocation)
- Compile-time bounds checking
Const generic bounds can express relationships:
fn concat<T, const A: usize, const B: usize>(
a: Array<T, A>,
b: Array<T, B>,
) -> Array<T, { A + B }> // Computed size!
Current limitations: const generic expressions ({ A + B }) require nightly and #![feature(generic_const_exprs)].
GATs (Generic Associated Types)
Stabilized in Rust 1.65, GATs allow associated types to have their own generic parameters:
trait LendingIterator {
type Item<'a> where Self: 'a; // Associated type with lifetime parameter
fn next(&mut self) -> Option<Self::Item<'_>>;
}
This solves the “lending iterator” problem—returning borrowed data from an iterator without requiring the data to outlive the iterator itself.
GATs are essential for:
- Async traits without Box: type Future<'a>: Future<Output = T> where Self: 'a
- Self-borrowing patterns: Returning references to self
- Database query builders: Rows that borrow from the statement
// Before GATs: Impossible without allocation
// After GATs: Works!
struct Windows<'data, T> {
data: &'data [T],
pos: usize,
size: usize,
}
impl<'data, T> LendingIterator for Windows<'data, T> {
type Item<'a> = &'a [T] where Self: 'a;
fn next(&mut self) -> Option<Self::Item<'_>> {
if self.pos + self.size <= self.data.len() {
let window = &self.data[self.pos..self.pos + self.size];
self.pos += 1;
Some(window)
} else {
None
}
}
}
Variance and Subtyping
Rust has subtyping for lifetimes: 'long is a subtype of 'short if 'long outlives 'short. Variance describes how this subtyping extends to generic types:
Covariant (+): If 'a: 'b, then Container<'a> <: Container<'b>
Example: &'a T (can use longer-lived reference where shorter expected)
Contravariant (-): If 'a: 'b, then Container<'b> <: Container<'a>
Example: fn(&'a T) (can use fn accepting shorter where longer expected)
Invariant: No subtyping relationship
Example: &'a mut T is invariant in T (the pointee type must match exactly); the lifetime 'a itself is still covariant
The Rustonomicon details how variance affects generic code:
struct Holder<'a, T: 'a> {
data: &'a T, // Covariant in 'a, covariant in T
}
struct MutHolder<'a, T: 'a> {
data: &'a mut T, // Covariant in 'a, invariant in T
}
Understanding variance is crucial when designing APIs that accept references or closures.
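A minimal sketch of variance in action: shrink compiles because a reference is covariant in its lifetime, so a longer borrow coerces to a shorter one, while the commented-out bad function would be rejected because &mut T is invariant in T. (Function names here are my own.)
fn shrink<'short, 'long: 'short>(r: &'long str) -> &'short str {
    r // OK: covariance lets the longer lifetime stand in for the shorter one
}
// fn bad<'a>(r: &mut &'static str, s: &'a str) {
//     let r2: &mut &'a str = r; // ERROR: invariance in T forbids this coercion
//     *r2 = s;                  // it would let a short-lived &str escape as 'static
// }
fn main() {
    let owned = String::from("hello");
    let shorter: &str = shrink(&owned);
    println!("{shorter}");
}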
Higher-Ranked Trait Bounds (HRTBs)
HRTBs express “for all lifetimes”:
fn call_with_ref<F>(f: F)
where
F: for<'a> Fn(&'a str) -> &'a str, // Works for ANY lifetime
{
let s = String::from("hello");
let result = f(&s); // 'a is determined here
println!("{}", result);
}
Without for<'a>, you’d have to pick a specific lifetime at the call site, limiting flexibility. HRTBs are common in:
- Closure-accepting APIs
- Parser combinators
- Async callback patterns
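As a quick check, a plain fn item satisfies the for<'a> bound of call_with_ref from the snippet above, because fn items automatically implement for<'a> Fn(&'a str) -> &'a str (first_word is a name invented for this sketch).
fn first_word(s: &str) -> &str {
    s.split_whitespace().next().unwrap_or("")
}
fn main() {
    call_with_ref(first_word); // prints "hello"
}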
PhantomData: Encoding Without Storing
PhantomData<T> is a zero-sized type that tells the compiler “this type logically contains T” without actually storing it:
struct Id<T> {
value: u64,
_marker: PhantomData<T>, // Type-safe IDs
}
let user_id: Id<User> = Id { value: 1, _marker: PhantomData };
let post_id: Id<Post> = Id { value: 1, _marker: PhantomData };
// user_id != post_id at type level, even though both are u64
PhantomData uses:
- Ownership encoding: PhantomData<T> implies ownership of T (affects drop check)
- Variance control: PhantomData<fn(T)> makes the type contravariant in T
- Type-safe handles: IDs, indices, foreign keys
- FFI lifetime tracking: Associate lifetimes with raw pointers
Typestates: Encoding Protocol States
Typestate programming uses types to represent states, with transitions as methods that consume self and return the new state:
// TCP Connection typestate
struct Closed;
struct SynSent;
struct Established;
struct TcpConnection<State> {
// ... connection data ...
_state: PhantomData<State>,
}
impl TcpConnection<Closed> {
fn connect(self) -> TcpConnection<SynSent> {
// Send SYN
TcpConnection { _state: PhantomData }
}
}
impl TcpConnection<SynSent> {
fn ack_received(self) -> TcpConnection<Established> {
TcpConnection { _state: PhantomData }
}
}
impl TcpConnection<Established> {
fn send(&mut self, data: &[u8]) { /* ... */ }
fn close(self) -> TcpConnection<Closed> {
TcpConnection { _state: PhantomData }
}
}
// Compile error: send() not available on Closed state!
// let conn = TcpConnection::<Closed>::new();
// conn.send(b"data"); // ERROR: method not found
This pattern is used in embedded drivers (GPIO pin modes), file handles (read/write modes), and protocol implementations.
Sealed Traits: Preventing External Implementation
Sealed traits can only be implemented within your crate:
mod private {
pub trait Sealed {}
}
pub trait MyTrait: private::Sealed {
fn method(&self);
}
// External crates cannot implement Sealed, so they cannot implement MyTrait
This enables:
- Adding methods to traits without breaking compatibility
- Exhaustive matching on implementors
- API stability guarantees
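Continuing the MyTrait/Sealed sketch above: inside the defining crate you implement both traits, while downstream crates can only call the methods. The Token type below is illustrative.
pub struct Token;
impl private::Sealed for Token {}
impl MyTrait for Token {
    fn method(&self) {
        // Only this crate can add implementors, because private::Sealed
        // is not nameable outside it.
        println!("sealed implementor");
    }
}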
How This Fits in Projects
You’ll leverage advanced type features in:
- Project 5 (Const Generic Matrix): Compile-time dimension checking
- Project 9 (Physical Units Lib): Type-safe dimensional analysis
- Project 10 (Procedural Macro): Generate type-safe reflection code
Definitions & Key Terms
| Term | Definition |
|---|---|
| Const Generic | A compile-time constant (integer, bool) as a type parameter |
| GAT | Generic Associated Type—associated type with its own generics |
| HRTB | Higher-Ranked Trait Bound—for<'a> universal lifetime quantification |
| Variance | How subtyping of parameters affects subtyping of the whole type |
| PhantomData | Zero-sized marker encoding type relationships |
| Typestate | Using distinct types to represent states, enforcing valid transitions |
| Sealed Trait | Trait that cannot be implemented outside its defining crate |
Mental Model Diagram
TYPE SYSTEM AS COMPILE-TIME ENFORCER
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
CONST GENERICS: Values become types
┌─────────────────────────────────────────────────────┐
│ Array<i32, 5> Array<i32, 10> │
│ │ │ │
│ └───── DIFFERENT TYPES ─┘ │
│ │
│ Matrix<3, 4> * Matrix<4, 5> → Matrix<3, 5> ✓ │
│ Matrix<3, 4> * Matrix<2, 5> → COMPILE ERROR ✗ │
└─────────────────────────────────────────────────────┘
TYPESTATE: Valid transitions only
┌─────────────────────────────────────────────────────┐
│ │
│ Connection<Closed> ──connect()──► Connection<SynSent>
│ │ │
│ ack_received()│
│ │ │
│ Connection<Closed> ◄──close()── Connection<Established>
│ │ │
│ send() ←───┘│
│ │
│ send() on Closed → COMPILE ERROR │
└─────────────────────────────────────────────────────┘
GATs: Associated types with lifetimes
┌─────────────────────────────────────────────────────┐
│ trait LendingIterator { │
│ type Item<'a> where Self: 'a; │
│ fn next(&mut self) -> Option<Self::Item<'_>>; │
│ } │
│ │
│ // Item borrows from self—impossible before GATs │
└─────────────────────────────────────────────────────┘
How It Works (Step-by-Step)
Building a Type-Safe Matrix:
- Define with const generics: struct Matrix<T, const R: usize, const C: usize>
- Implement indexing: impl Index<(usize, usize)> with bounds checks
- Constrain operations: multiplication requires Matrix<R, K> * Matrix<K, C> = Matrix<R, C>
- Use type aliases: type Vec3 = Matrix<f64, 3, 1> for ergonomics
- Add const operations: fn transpose(self) -> Matrix<C, R> swaps dimensions
impl<T, const R: usize, const K: usize> Matrix<T, R, K> {
fn mul<const C: usize>(self, other: Matrix<T, K, C>) -> Matrix<T, R, C>
where
T: Mul<Output = T> + AddAssign + Default + Copy // AddAssign because the loop below uses +=
{
// Inner dimensions (K) must match—enforced by signature!
let mut result = Matrix::default();
for i in 0..R {
for j in 0..C {
for k in 0..K {
result[(i, j)] += self[(i, k)] * other[(k, j)];
}
}
}
result
}
}
Invariants:
- Types must be consistent (compiler enforces)
- Lifetimes must be valid (borrow checker enforces)
- Transitions must be defined (methods only exist on valid states)
Failure modes:
- Compile errors for invalid operations (intended!)
- Complex error messages with many generics
- Monomorphization bloat for many const values
Minimal Concrete Example
// Typestate: File that tracks open/closed at compile time
mod file {
use std::marker::PhantomData;
use std::fs::File as StdFile;
use std::io::Read; // brings StdFile::read into scope for the Open state
pub struct Closed;
pub struct Open;
pub struct TypedFile<State> {
inner: Option<StdFile>,
_state: PhantomData<State>,
}
impl TypedFile<Closed> {
pub fn new() -> Self {
TypedFile { inner: None, _state: PhantomData }
}
pub fn open(self, path: &str) -> Result<TypedFile<Open>, std::io::Error> {
let file = StdFile::open(path)?;
Ok(TypedFile { inner: Some(file), _state: PhantomData })
}
}
impl TypedFile<Open> {
pub fn read(&mut self, buf: &mut [u8]) -> std::io::Result<usize> {
// read() only available when Open
self.inner.as_mut().unwrap().read(buf)
}
pub fn close(self) -> TypedFile<Closed> {
// inner is dropped here
TypedFile { inner: None, _state: PhantomData }
}
}
}
// Usage (inside a fn returning std::io::Result<()>):
use file::{Closed, TypedFile};
let mut buf = [0u8; 16];
let file = TypedFile::<Closed>::new();
// file.read(&mut buf); // ERROR: method not found for Closed
let mut file = file.open("test.txt")?;
file.read(&mut buf)?; // OK: Open state has read()
let file = file.close(); // Back to Closed
// file.read(&mut buf); // ERROR again
Common Misconceptions
| Misconception | Reality |
|---|---|
| “Const generics are like macros” | Macros operate on syntax; const generics are first-class type parameters |
| “Lifetimes represent time” | Lifetimes represent borrow scopes—lexical regions, not duration |
| “PhantomData allocates memory” | PhantomData is zero-sized; it exists only for the type system |
| “GATs are only for async” | GATs enable many patterns: lending iterators, self-borrowing, etc. |
| “More types = slower code” | Monomorphization can enable better optimization (inlining, const propagation) |
Check-Your-Understanding Questions
- What does for<'a> Fn(&'a T) -> &'a T mean in plain English?
- Why is &'a mut T invariant in T, even though it is covariant in 'a?
- How do you make a type covariant in a lifetime?
- What problem do GATs solve that regular associated types can’t?
- When would you use PhantomData<fn(T)> instead of PhantomData<T>?
- How does typestate prevent runtime errors?
Check-Your-Understanding Answers
- HRTB meaning: “A function that, for any lifetime 'a you give it, takes a reference with that lifetime and returns a reference with the same lifetime.” It works for all lifetimes, determined at the call site.
- Variance of &mut T: The lifetime 'a is actually covariant: a longer-lived mutable borrow can be used where a shorter one is expected. T, however, is invariant, because &mut allows both reading and writing; substituting a subtype or supertype would let you write a value of the wrong type (or a too-short lifetime) through the reference.
- Covariance in lifetime: Use &'a T or hold PhantomData<&'a T>. The reference is naturally covariant: longer lifetimes can be used where shorter ones are expected.
- GAT advantage: Regular associated types are fixed per impl. GATs can vary with call-site lifetimes. For fn next(&mut self) -> Option<Self::Item>, GATs allow Item<'_> to borrow from self, which regular associated types cannot express.
- PhantomData<fn(T)>: To make a type contravariant in T. PhantomData<T> is covariant; PhantomData<fn(T)> is contravariant. Use it when your type consumes T rather than produces it.
- Typestate prevention: Invalid operations simply don’t compile. If send() only exists on Connection<Established>, calling send() on a closed connection is a type error, not a runtime check that might be forgotten.
Real-World Applications
- Embedded Rust: GPIO pins with typestate (Input/Output modes)
- Web frameworks (Axum, Actix): Request extractors with type-safe parameters
- Cryptography: Secret vs. Public key types preventing accidental exposure
- Database ORMs: Query builders that ensure valid SQL
- Protocol implementations: State machines for TCP, TLS, HTTP/2
Where You’ll Apply It
| Project | How Type System Is Used |
|---|---|
| Project 5: Const Generic Matrix | Compile-time dimension checking |
| Project 9: Physical Units Lib | Type-safe dimensional analysis (meters, seconds, etc.) |
| Project 10: Procedural Macro | Generate type-safe reflection tables |
References
- “Programming Rust” by Blandy & Orendorff — Ch. 11: Traits and Generics
- Rust RFC 2000 (Const Generics) — Original RFC
- Rustonomicon: Subtyping and Variance — Deep dive on variance
- GATs Stabilization Announcement — What GATs enable
- “Type-Driven Development” — General approach to leveraging types
Key Insight
Types are executable specifications—push correctness checks to compile time to eliminate entire classes of runtime bugs.
Summary
Rust’s advanced type system features—const generics, GATs, HRTBs, variance, and typestate—enable encoding complex invariants directly in types. Const generics parameterize types by values, enabling dimension-checked arrays and matrices. GATs allow associated types to vary with lifetimes, solving the lending iterator problem. PhantomData and typestate patterns encode ownership, capability, and protocol state. The result is code where invalid operations don’t compile, eliminating runtime checks and entire bug categories.
Homework/Exercises
- Const generic array: Implement a StaticVec<T, const N: usize> that tracks length at runtime but has a compile-time maximum capacity.
- Typestate builder: Create an HttpRequest builder where headers can only be set before body(), and send() only works after body().
- PhantomData ownership: Create a Handle<T> type that doesn’t own T but affects drop order as if it did.
- GAT exploration: Implement a LendingIterator that yields overlapping windows of a slice.
Solutions to Exercises
Exercise 1 Solution:
use std::mem::MaybeUninit;
struct StaticVec<T, const N: usize> {
data: [MaybeUninit<T>; N],
len: usize,
}
impl<T, const N: usize> StaticVec<T, N> {
fn new() -> Self {
StaticVec {
data: unsafe { MaybeUninit::uninit().assume_init() },
len: 0,
}
}
fn push(&mut self, value: T) -> Result<(), T> {
if self.len >= N {
return Err(value); // Capacity exceeded
}
self.data[self.len].write(value);
self.len += 1;
Ok(())
}
fn capacity(&self) -> usize { N } // Compile-time constant
}
Exercise 2 Solution:
struct NoHeaders;
struct HasHeaders;
struct HasBody;
struct RequestBuilder<HeaderState, BodyState> {
url: String,
headers: Vec<(String, String)>,
body: Option<Vec<u8>>,
_h: PhantomData<HeaderState>,
_b: PhantomData<BodyState>,
}
impl RequestBuilder<NoHeaders, ()> {
fn new(url: &str) -> RequestBuilder<NoHeaders, ()> { ... }
fn header(self, k: &str, v: &str) -> RequestBuilder<HasHeaders, ()> { ... }
}
impl<H> RequestBuilder<H, ()> {
fn body(self, data: &[u8]) -> RequestBuilder<H, HasBody> { ... }
}
impl<H> RequestBuilder<H, HasBody> {
fn send(self) -> Response { ... } // Only available after body()
}
Exercise 3 Solution:
struct Handle<T> {
id: u64,
_owns: PhantomData<T>, // Tells drop checker we "own" T
}
impl<T> Handle<T> {
fn new(id: u64) -> Self {
Handle { id, _owns: PhantomData }
}
}
// Drop check now considers T's drop behavior when checking Handle
// This affects code like:
// let x = SomeType::new();
// let h = Handle::<SomeType>::new(42);
// // Compiler considers h as if it might access x during drop
Exercise 4 Solution:
trait LendingIterator {
type Item<'a> where Self: 'a;
fn next(&mut self) -> Option<Self::Item<'_>>;
}
struct OverlappingWindows<'data, T> {
data: &'data [T],
pos: usize,
size: usize,
}
impl<'data, T> LendingIterator for OverlappingWindows<'data, T> {
type Item<'a> = &'a [T] where Self: 'a;
fn next(&mut self) -> Option<Self::Item<'_>> {
if self.pos + self.size <= self.data.len() {
let window = &self.data[self.pos..self.pos + self.size];
self.pos += 1; // Overlapping: advance by 1, not size
Some(window)
} else {
None
}
}
}
Concept 5: Concurrency, Atomics, and Lock-Free Design
Fundamentals
Lock-free programming enables multiple threads to make progress without traditional locks, eliminating deadlock and reducing contention. The foundation is atomic operations—CPU instructions that appear indivisible to other threads. Rust’s std::sync::atomic module exposes these primitives: AtomicBool, AtomicUsize, AtomicPtr<T>, etc. Each atomic operation takes a memory ordering parameter that specifies synchronization guarantees with other operations.
The key orderings are:
- Relaxed: No synchronization; only guarantees atomicity
- Acquire: Subsequent reads see writes that happened before matching Release
- Release: Previous writes are visible to threads doing Acquire loads
- SeqCst: Total order across all SeqCst operations (strongest, slowest)
Lock-free data structures use compare-and-swap (CAS) loops: read current value, compute new value, atomically swap if unchanged. If another thread modified it first, retry. The challenge is ensuring correctness across all interleavings.
Deep Dive
Memory Ordering and Happens-Before
Modern CPUs and compilers reorder memory operations for performance. Without synchronization, one thread’s writes may appear in a different order to another thread. Memory orderings establish “happens-before” relationships:
Thread 1 Thread 2
──────── ────────
data = 42;
ready.store(true, Release); ───► if ready.load(Acquire) {
assert_eq!(data, 42); // Guaranteed!
}
The Release-Acquire pair ensures that all writes before the Release are visible after the Acquire. Without this, data = 42 might not be visible when ready is true.
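Here is a runnable version of that hand-off using only the standard library; the variable names mirror the diagram, and the Arc-based setup is a sketch rather than project code.
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;
fn main() {
    let data = Arc::new(AtomicU64::new(0));
    let ready = Arc::new(AtomicBool::new(false));
    let producer = {
        let (data, ready) = (Arc::clone(&data), Arc::clone(&ready));
        thread::spawn(move || {
            data.store(42, Ordering::Relaxed);    // plain write, ordered by the Release below
            ready.store(true, Ordering::Release); // publish
        })
    };
    let consumer = {
        let (data, ready) = (Arc::clone(&data), Arc::clone(&ready));
        thread::spawn(move || {
            while !ready.load(Ordering::Acquire) {
                std::hint::spin_loop(); // busy-wait until the flag is published
            }
            assert_eq!(data.load(Ordering::Relaxed), 42); // guaranteed visible
        })
    };
    producer.join().unwrap();
    consumer.join().unwrap();
}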
Ordering Strength Hierarchy:
SeqCst > AcqRel > Acquire/Release > Relaxed
SeqCst: Total order + Acquire + Release (slowest)
AcqRel: Acquire + Release combined
Acquire: Prevents reordering of subsequent reads
Release: Prevents reordering of previous writes
Relaxed: Only guarantees atomicity (fastest)
Compare-and-Swap (CAS) Loops
CAS is the fundamental lock-free primitive:
fn increment(counter: &AtomicUsize) {
loop {
let current = counter.load(Ordering::Relaxed);
let new = current + 1;
match counter.compare_exchange_weak(
current, new, Ordering::SeqCst, Ordering::Relaxed
) {
Ok(_) => break,
Err(_) => continue, // Another thread changed it; retry
}
}
}
compare_exchange_weak may spuriously fail (even if value matches) but is faster on some platforms. Use compare_exchange when you can’t tolerate spurious failure.
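The standard library also packages this retry loop as fetch_update, which is often clearer than writing the loop by hand; the snippet below is equivalent to the increment above.
use std::sync::atomic::{AtomicUsize, Ordering};
fn increment(counter: &AtomicUsize) {
    counter
        .fetch_update(Ordering::SeqCst, Ordering::Relaxed, |current| Some(current + 1))
        .expect("the closure never returns None, so this cannot fail");
}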
The ABA Problem
ABA occurs when a pointer is reused, causing CAS to succeed incorrectly:
Initial: stack = [A] -> [B] -> null
Thread 1: pop() - reads A, prepares to CAS(A -> B)
<preempted>
Thread 2: pop() - removes A
pop() - removes B
push(A) - reuses A (from free list)
Stack now: [A] -> [C] -> null (different A!)
Thread 1 resumes: CAS(A -> B) succeeds!
Stack becomes: [B] -> null
But B was already freed! Corruption.
Solutions:
- Tagged pointers: Include a counter that increments on each CAS
- Epoch-based reclamation: Defer freeing until all threads advance
- Hazard pointers: Threads publish pointers they’re accessing
Epoch-Based Reclamation (EBR)
The Crossbeam crate implements EBR for safe memory reclamation:
fn remove(&self, key: K) {
let guard = epoch::pin(); // Enter critical section
// Find and unlink node...
let node = ...;
// Defer deallocation until safe
guard.defer_destroy(node); // Freed when all threads advance
}
Threads periodically “advance” the global epoch. Garbage is freed only when all threads have passed the epoch when it was retired.
Lock-Free Queue: Michael-Scott Algorithm
The classic lock-free MPMC queue:
struct Node<T> {
data: MaybeUninit<T>,
next: AtomicPtr<Node<T>>,
}
struct Queue<T> {
head: AtomicPtr<Node<T>>, // Dummy node
tail: AtomicPtr<Node<T>>,
}
// Enqueue: CAS the new node onto tail (simplified; raw-pointer derefs need unsafe)
fn enqueue(&self, value: T) {
    let node = Box::into_raw(Box::new(Node::new(value)));
    loop {
        let tail = self.tail.load(Acquire);
        let tail_next = unsafe { &(*tail).next };
        let next = tail_next.load(Acquire);
        if next.is_null() {
            // Try to link the new node after the current tail
            if tail_next.compare_exchange(ptr::null_mut(), node, Release, Relaxed).is_ok() {
                // Swing tail forward; if this fails, another thread already helped
                let _ = self.tail.compare_exchange(tail, node, Release, Relaxed);
                return;
            }
        } else {
            // Tail is lagging behind; help advance it, then retry
            let _ = self.tail.compare_exchange(tail, next, Release, Relaxed);
        }
    }
}
Backoff Strategies
Under contention, spinning wastes CPU. Backoff strategies:
- Spin: Tight loop (high contention = wasted cycles)
- Yield: std::hint::spin_loop() or thread::yield_now()
- Exponential backoff: Double wait time on each failure (see the sketch after this list)
- Parking: Block thread until notified (for high contention)
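A minimal sketch of the exponential-backoff idea follows; crossbeam_utils ships a polished Backoff type, and the struct and constants here are my own illustration.
use std::hint::spin_loop;
use std::thread;
struct Backoff {
    step: u32,
}
impl Backoff {
    const SPIN_LIMIT: u32 = 6; // up to 2^6 = 64 spins before yielding
    fn new() -> Self {
        Backoff { step: 0 }
    }
    fn snooze(&mut self) {
        if self.step <= Self::SPIN_LIMIT {
            for _ in 0..(1u32 << self.step) {
                spin_loop(); // hint to the CPU that we are spinning
            }
            self.step += 1; // double the spin count next time
        } else {
            thread::yield_now(); // give up the time slice under sustained contention
        }
    }
}
// Typical use inside a CAS retry loop (try_cas is a placeholder):
// let mut backoff = Backoff::new();
// while try_cas().is_err() { backoff.snooze(); }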
Contention Profiling
Tools for measuring lock-free performance:
- perf stat -e cache-misses,bus-cycles — Cache behavior
- cargo bench with criterion — Throughput/latency
- Loom — Model checker for concurrency bugs
- perf c2c — False sharing detection
How This Fits in Projects
You’ll implement lock-free structures in:
- Project 6 (Lock-Free Queue): Build Michael-Scott queue with EBR
- Project 8 (Custom Runtime): Atomic task queues for scheduling
- Project 12 (KV Store): Concurrent index with atomic updates
Definitions & Key Terms
| Term | Definition |
|---|---|
| Atomic | Operation that appears indivisible to other threads |
| CAS | Compare-And-Swap: conditional atomic update |
| Memory Ordering | Synchronization guarantee (Relaxed, Acquire, Release, SeqCst) |
| ABA Problem | CAS succeeds with reused pointer, causing corruption |
| Epoch-Based Reclamation | Defer freeing until all threads advance past retirement epoch |
| Lock-Free | At least one thread makes progress; no deadlock |
| Wait-Free | Every thread makes progress in bounded steps |
| Linearizability | Operations appear to occur atomically at some point |
Mental Model Diagram
MEMORY ORDERING RELATIONSHIPS
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Thread 1 Thread 2
──────── ────────
writes X
writes Y
RELEASE ─────────────────────► ACQUIRE
reads Y (sees Thread 1's Y)
reads X (sees Thread 1's X)
All writes before RELEASE are visible after ACQUIRE
CAS LOOP ANATOMY
━━━━━━━━━━━━━━━━
┌────────────────────────────┐
│ Load current value │
└──────────────┬─────────────┘
▼
┌────────────────────────────┐
│ Compute new value locally │
└──────────────┬─────────────┘
▼
┌────────────────────────────┐
│ CAS(expected, new) │◄─────┐
└──────────────┬─────────────┘ │
┌───────┴───────┐ │
▼ ▼ │
Success Failure ─────────┘
(done) (retry)
How It Works (Step-by-Step)
Building a Lock-Free Stack:
- Define node: struct Node<T> { data: T, next: AtomicPtr<Node<T>> }
- Push: Create node, CAS onto head
- Pop: Load head, CAS to head.next, return data
- Handle ABA: Use tagged pointer or epoch
- Test: Stress test with multiple threads; verify with Loom
fn push(&self, value: T) {
    let node = Box::into_raw(Box::new(Node { data: value, next: AtomicPtr::new(ptr::null_mut()) }));
    loop {
        let head = self.head.load(Acquire);
        unsafe { (*node).next.store(head, Relaxed) };
        if self.head.compare_exchange_weak(head, node, Release, Relaxed).is_ok() {
            return;
        }
    }
}
fn pop(&self) -> Option<T> {
    loop {
        let head = self.head.load(Acquire);
        if head.is_null() { return None; }
        let next = unsafe { (*head).next.load(Relaxed) };
        if self.head.compare_exchange_weak(head, next, Release, Relaxed).is_ok() {
            // Read the payload out of the unlinked node
            let data = unsafe { ptr::read(&(*head).data) };
            // MUST defer deallocation (ABA!)—schematic call; with crossbeam-epoch,
            // pin a Guard and call guard.defer_destroy on the retired node
            epoch::defer_destroy(head);
            return Some(data);
        }
    }
}
Invariants:
- CAS must use correct orderings for visibility
- Deallocation must be deferred until safe
- Contention must be bounded (backoff if needed)
Failure modes:
- ABA corruption (use tags/epochs)
- Lost updates (wrong CAS ordering)
- Livelock (threads repeatedly invalidate each other)
- Memory leaks (forgot to free retired nodes)
Minimal Concrete Example
// Simple lock-free counter with fetch_add
use std::sync::atomic::{AtomicU64, Ordering};
struct Counter(AtomicU64);
impl Counter {
fn increment(&self) -> u64 {
self.0.fetch_add(1, Ordering::Relaxed)
}
fn get(&self) -> u64 {
self.0.load(Ordering::Relaxed)
}
}
// Thread-safe without locks!
// Relaxed is fine because we don't need ordering with other operations
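Assuming the Counter type above is in scope (same module), a quick multi-threaded check looks like this; the thread and iteration counts are arbitrary.
use std::sync::Arc;
use std::thread;
fn main() {
    let counter = Arc::new(Counter(AtomicU64::new(0)));
    let handles: Vec<_> = (0..8)
        .map(|_| {
            let counter = Arc::clone(&counter);
            thread::spawn(move || {
                for _ in 0..1_000_000 {
                    counter.increment();
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    assert_eq!(counter.get(), 8_000_000); // no updates lost, even with Relaxed
}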
Common Misconceptions
| Misconception | Reality |
|---|---|
| “Lock-free is always faster” | Under high contention, locks can outperform CAS spinning |
| “SeqCst is always safe” | SeqCst is correct but often overkill; weaker orderings are faster |
| “Atomics prevent data races” | Atomics on the pointer don’t protect the pointed-to data |
| “CAS success means correctness” | ABA can cause CAS to succeed incorrectly |
| “Relaxed is dangerous” | Relaxed is correct for independent counters/flags |
Check-Your-Understanding Questions
- Why can’t you use Relaxed for a spinlock?
- What’s the difference between compare_exchange and compare_exchange_weak?
- How does epoch-based reclamation solve ABA?
- When would you prefer a lock over lock-free?
- Why do lock-free structures often use dummy nodes?
Check-Your-Understanding Answers
- Relaxed spinlock: A spinlock needs Release when unlocking (so protected data is visible) and Acquire when locking (so you see previous writes). Relaxed provides neither—threads might see stale data.
- CAS weak vs strong: compare_exchange_weak may fail spuriously even if the value matches, but is faster on LL/SC architectures (ARM). Use weak in loops, strong when spurious failure isn’t acceptable.
- EBR and ABA: EBR delays freeing until all threads have advanced. A pointer can’t be reused while any thread might still access it, preventing ABA.
- Lock over lock-free: When contention is high (many threads fighting for the same location), locks with parking outperform spinning. Also when correctness is more important than performance—locks are easier to reason about.
- Dummy nodes: A dummy head/tail node means head and tail are never null, simplifying edge cases. You never CAS the dummy itself, only what it points to.
Real-World Applications
- Tokio runtime: Lock-free work-stealing scheduler
- RocksDB: Lock-free skiplist for memtable
- Linux kernel: RCU (Read-Copy-Update) for reader-heavy workloads
- HFT systems: Sub-microsecond messaging queues
- Arc reference counting: fetch_add/fetch_sub for thread-safe refcounts
Where You’ll Apply It
| Project | How Atomics Are Used |
|---|---|
| Project 6: Lock-Free Queue | Full Michael-Scott implementation with EBR |
| Project 8: Custom Runtime | Atomic task queues, waker registration |
| Project 12: KV Store | Concurrent hash index with atomic buckets |
References
- “Rust Atomics and Locks” by Mara Bos — The definitive Rust concurrency book
- “The Art of Multiprocessor Programming” by Herlihy & Shavit — Theory of lock-free design
- Crossbeam crate docs — Production-ready lock-free primitives
- Loom crate — Model checker for concurrency testing
- Memory Ordering at Compile Time — Jeff Preshing’s blog
Key Insight
Pick the weakest memory ordering that maintains correctness—stronger orderings are slower and rarely necessary.
Summary
Lock-free programming uses atomic operations and memory orderings to synchronize threads without locks. CAS loops are the fundamental building block. The ABA problem requires careful memory management (epochs, hazard pointers). Memory orderings (Relaxed/Acquire/Release/SeqCst) control visibility between threads. Lock-free structures excel under low-to-moderate contention; under high contention, consider locks. Always test with tools like Loom to catch subtle bugs.
Homework/Exercises
- Atomic counter: Implement a counter with fetch_add. Benchmark with 8 threads incrementing 1M times each.
- Spinlock: Build a spinlock using AtomicBool. Add exponential backoff and measure the impact.
- Tagged pointer: Pack a 16-bit tag with a 48-bit pointer into an AtomicU64. Implement tagged CAS.
- Loom verification: Take your spinlock and verify correctness with Loom.
Solutions to Exercises
Exercise 1 Solution:
struct Counter(AtomicU64);
impl Counter {
fn inc(&self) { self.0.fetch_add(1, Ordering::Relaxed); }
fn get(&self) -> u64 { self.0.load(Ordering::Relaxed) }
}
// Benchmark: spawn 8 threads, each calling inc() 1M times
// Final value should be exactly 8_000_000
Exercise 2 Solution:
struct SpinLock {
locked: AtomicBool,
}
impl SpinLock {
fn lock(&self) {
let mut backoff = 1;
while self.locked.compare_exchange_weak(
false, true, Ordering::Acquire, Ordering::Relaxed
).is_err() {
for _ in 0..backoff {
std::hint::spin_loop();
}
backoff = (backoff * 2).min(1024); // Cap backoff
}
}
fn unlock(&self) {
self.locked.store(false, Ordering::Release);
}
}
Exercise 3 Solution:
const TAG_BITS: u64 = 16;
const PTR_MASK: u64 = (1 << 48) - 1;
fn pack(ptr: *mut Node, tag: u16) -> u64 {
(ptr as u64 & PTR_MASK) | ((tag as u64) << 48)
}
fn unpack(val: u64) -> (*mut Node, u16) {
let ptr = (val & PTR_MASK) as *mut Node;
let tag = (val >> 48) as u16;
(ptr, tag)
}
// CAS with automatic tag increment
fn tagged_cas(atomic: &AtomicU64, old_ptr: *mut Node, old_tag: u16, new_ptr: *mut Node) -> bool {
let old = pack(old_ptr, old_tag);
let new = pack(new_ptr, old_tag.wrapping_add(1));
atomic.compare_exchange(old, new, Ordering::SeqCst, Ordering::Relaxed).is_ok()
}
Exercise 4 Solution:
// Build and test with RUSTFLAGS="--cfg loom" so the loom versions are used
#[cfg(loom)]
use loom::sync::{atomic::{AtomicBool, AtomicU64, Ordering}, Arc};
#[cfg(not(loom))]
use std::sync::{atomic::{AtomicBool, AtomicU64, Ordering}, Arc};
#[test]
#[cfg(loom)] // loom::model only exists when compiled under the loom cfg
fn test_spinlock() {
    loom::model(|| {
let lock = Arc::new(SpinLock::new());
let counter = Arc::new(AtomicU64::new(0));
let handles: Vec<_> = (0..2).map(|_| {
let lock = lock.clone();
let counter = counter.clone();
loom::thread::spawn(move || {
lock.lock();
counter.fetch_add(1, Ordering::Relaxed);
lock.unlock();
})
}).collect();
for h in handles { h.join().unwrap(); }
assert_eq!(counter.load(Ordering::Relaxed), 2);
});
}
Concept 6: Metaprogramming & Macros
Fundamentals
Rust macros are compile-time code generators—they transform source code before type checking. Unlike C preprocessor macros (text substitution), Rust macros operate on structured tokens and are hygienic (they don’t accidentally capture or shadow variables). There are two macro systems: declarative macros (macro_rules!) for pattern-based expansion, and procedural macros for arbitrary code manipulation using the proc_macro API.
Declarative macros match patterns in the input and expand to templated output. They’re great for simple code generation: vec![1, 2, 3] expands to Vec allocation code. Procedural macros receive a TokenStream (parsed tokens), manipulate it programmatically, and return a new TokenStream. They power #[derive(Debug)], custom attributes, and function-like macros that do complex transformations.
Macros enable: boilerplate elimination, DSL creation, compile-time validation, and code generation that would otherwise require external build tools.
Deep Dive
Declarative Macros: Pattern Matching
macro_rules! defines macros with pattern arms:
macro_rules! make_fn {
($name:ident) => {
fn $name() {
println!(stringify!($name));
}
};
}
make_fn!(hello); // Expands to: fn hello() { println!("hello"); }
Fragment specifiers control what can be matched:
- $x:ident — identifier (variable/function name)
- $x:expr — expression
- $x:ty — type
- $x:tt — single token tree
- $x:item — item (fn, struct, impl, etc.)
Repetitions handle variadic input:
macro_rules! vec_of {
($($x:expr),*) => {
{
let mut v = Vec::new();
$(v.push($x);)*
v
}
};
}
vec_of![1, 2, 3] // Creates Vec with 1, 2, 3
Procedural Macros: Token Manipulation
Proc macros are functions that transform TokenStream. Three types:
- Derive macros: #[derive(MyTrait)] — add impl blocks
- Attribute macros: #[my_attr] — transform items
- Function-like macros: my_macro!(...) — arbitrary transformation
// In proc macro crate (must be separate crate!)
use proc_macro::TokenStream;
use quote::quote;
use syn::{parse_macro_input, DeriveInput};
#[proc_macro_derive(MyDebug)]
pub fn derive_my_debug(input: TokenStream) -> TokenStream {
let ast = parse_macro_input!(input as DeriveInput);
let name = &ast.ident;
let expanded = quote! {
impl std::fmt::Debug for #name {
fn fmt(&self, f: &mut std::fmt::Formatter) -> std::fmt::Result {
write!(f, stringify!(#name))
}
}
};
expanded.into()
}
The syn and quote Crates
- syn: Parses TokenStream into Rust AST structures (DeriveInput, ItemFn, Expr, etc.)
- quote: Generates TokenStream from quasi-quoted Rust code
// Parse struct fields
let fields = match &ast.data {
syn::Data::Struct(data) => &data.fields,
_ => panic!("Only structs supported"),
};
// Generate code for each field
let field_impls = fields.iter().map(|f| {
let name = &f.ident;
quote! { self.#name.fmt(f)?; }
});
Macro Hygiene
Rust macros are hygienic—they don’t accidentally capture variables from the call site:
macro_rules! make_x {
() => {
let x = 42; // This 'x' is in the macro's scope
};
}
let x = 10;
make_x!();
println!("{}", x); // Prints 10, not 42!
The macro’s x is distinct from the caller’s x. This prevents subtle bugs common in C macros.
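A companion sketch: when you do want the caller to see a binding, have the caller pass the identifier in, so the name belongs to the call site’s syntax context. (make_named is an invented example.)
macro_rules! make_named {
    ($name:ident) => {
        let $name = 42; // the identifier came from the caller, so the caller can see it
    };
}
fn main() {
    make_named!(answer);
    println!("{answer}"); // prints 42
}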
Spans and Diagnostics
Spans track source location for error messages. Proc macros should preserve spans:
use proc_macro2::Span;
use syn::Error;
// Create error pointing to specific token
let error = Error::new(ident.span(), "invalid identifier");
return error.to_compile_error().into();
This produces errors like:
error: invalid identifier
--> src/main.rs:5:10
|
5 | #[derive(MyMacro)]
| ^^^^^^^
Attribute Macros: Item Transformation
Attribute macros receive the item they’re attached to:
#[proc_macro_attribute]
pub fn log_calls(_attr: TokenStream, item: TokenStream) -> TokenStream {
let input = parse_macro_input!(item as ItemFn);
let name = &input.sig.ident;
let block = &input.block;
quote! {
fn #name() {
println!("Entering {}", stringify!(#name));
#block
println!("Leaving {}", stringify!(#name));
}
}.into()
}
// Usage:
#[log_calls]
fn my_function() {
// ...
}
Debugging Macros
Tools for macro development:
- cargo expand — Show expanded code
- cargo expand --lib | rustfmt — Formatted expansion
- compile_error!("message") — Emit a custom compile error
- eprintln! in proc macros — Print during compilation
Macro Limitations
What macros cannot do:
- Access type information (no type checker at macro time)
- Create hygiene-breaking name bindings (by design)
- Run arbitrary runtime code (they run at compile time)
- Access files or network (in most configurations)
How This Fits in Projects
You’ll write macros in:
- Project 2 (Box-less Async Trait): Generate state machine boilerplate
- Project 7 (Zero-Copy Parser): DSL for protocol definitions
- Project 10 (Procedural Macro): Full reflection derive macro
Definitions & Key Terms
| Term | Definition |
|---|---|
| Declarative Macro | macro_rules! pattern-based macro |
| Procedural Macro | Function transforming TokenStream |
| TokenStream | Sequence of parsed tokens (input/output of proc macros) |
| Hygiene | Macros don’t capture/shadow caller’s variables |
| Span | Source location for error reporting |
| syn | Crate for parsing TokenStream into AST |
| quote | Crate for generating TokenStream from templates |
| Derive Macro | #[derive(Trait)] that generates impl |
Mental Model Diagram
MACRO EXPANSION PIPELINE
━━━━━━━━━━━━━━━━━━━━━━━━
Source Code Tokenization Macro Expansion Type Check
┌────────────────┐ ┌────────────────┐ ┌────────────────┐ ┌────────────────┐
│ #[derive(Debug)]│ │ TokenStream: │ │ Expanded Code: │ │ Validated AST: │
│ struct Foo { │──►│ struct, Foo, │──►│ impl Debug │──►│ Types checked, │
│ x: i32, │ │ x, :, i32 ... │ │ for Foo { } │ │ borrow checked │
│ } │ │ │ │ │ │ │
└────────────────┘ └────────────────┘ └────────────────┘ └────────────────┘
PROC MACRO ARCHITECTURE
━━━━━━━━━━━━━━━━━━━━━━━
┌─────────────────────────────────┐
│ Your proc-macro crate │
│ │
TokenStream ───────►│ syn::parse() ──► Transform ──► │───────► TokenStream
│ ↓ │
│ Rust AST structures │
│ (DeriveInput, etc.) │
│ ↓ │
│ quote! { ... } │
└─────────────────────────────────┘
How It Works (Step-by-Step)
Creating a Derive Macro:
- Create proc-macro crate: cargo new my_macro --lib, add proc-macro = true
- Add dependencies: syn, quote, proc-macro2
- Parse input: parse_macro_input!(input as DeriveInput)
- Extract information: Get struct name, fields, generics
- Generate code: Use quote! to template the impl
- Handle errors: Use syn::Error for good diagnostics
- Test: Use cargo expand and unit tests with trybuild (see the sketch after the Cargo.toml below)
// Cargo.toml for proc-macro crate
[lib]
proc-macro = true
[dependencies]
syn = { version = "2", features = ["full"] }
quote = "1"
proc-macro2 = "1"
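For the testing step, a typical trybuild harness lives in the tests directory of a consuming crate; the file paths below are placeholders.
// tests/ui.rs
#[test]
fn ui() {
    let t = trybuild::TestCases::new();
    t.pass("tests/ui/derive_struct_ok.rs");       // expected to compile
    t.compile_fail("tests/ui/derive_on_enum.rs"); // expected to fail with a helpful error
}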
Invariants:
- Proc macros must be in separate crates
- Output must be valid Rust tokens
- Spans should point to relevant input for errors
Failure modes:
- Infinite recursion in recursive macros
- Unhelpful error spans (use input spans)
- Breaking hygiene accidentally (rare but possible)
- Performance issues with very large expansions
Minimal Concrete Example
// Procedural macro that generates getters for all fields
use proc_macro::TokenStream;
use quote::quote;
use syn::{parse_macro_input, DeriveInput};
#[proc_macro_derive(Getters)]
pub fn derive_getters(input: TokenStream) -> TokenStream {
let ast = parse_macro_input!(input as DeriveInput);
let name = &ast.ident;
let fields = match &ast.data {
syn::Data::Struct(data) => match &data.fields {
syn::Fields::Named(fields) => &fields.named,
_ => panic!("Only named fields supported"),
},
_ => panic!("Only structs supported"),
};
let getters = fields.iter().map(|f| {
let field_name = f.ident.as_ref().unwrap();
let field_type = &f.ty;
let getter_name = syn::Ident::new(
&format!("get_{}", field_name),
field_name.span()
);
quote! {
pub fn #getter_name(&self) -> &#field_type {
&self.#field_name
}
}
});
quote! {
impl #name {
#(#getters)*
}
}.into()
}
// Usage:
#[derive(Getters)]
struct User {
name: String,
age: u32,
}
// Generates: get_name() -> &String, get_age() -> &u32
Common Misconceptions
| Misconception | Reality |
|---|---|
| “Macros run at runtime” | Macros run at compile time, generating code |
| “Macros can access types” | Macros only see tokens; type info comes later |
| “All macros need proc-macro crate” | macro_rules! works in any crate |
| “Macros are unhygienic like C” | Rust macros are hygienic by default |
| “quote! evaluates expressions” | quote! just generates tokens; evaluation is later |
Check-Your-Understanding Questions
- Why must proc macros be in a separate crate?
- What’s the difference between $x:expr and $x:tt?
- How does macro hygiene prevent name collisions?
- Why can’t a macro check if a type implements a trait?
- How do you debug a macro that generates invalid code?
Check-Your-Understanding Answers
- Separate crate: Proc macros are dynamically loaded compiler plugins. They must be compiled before the code that uses them, requiring a separate crate. This also ensures the proc macro code itself doesn’t become part of the final binary.
- expr vs tt: $x:expr matches a complete expression and can only be followed by specific tokens. $x:tt matches a single token tree (identifier, literal, or balanced ()/{}/[] group) and can be followed by anything. Use tt for maximum flexibility, expr when you need expression semantics.
- Hygiene mechanism: Each macro expansion gets a unique “syntax context.” Variables defined inside the macro live in that context and don’t clash with identically named variables outside. The compiler tracks contexts and resolves names accordingly.
- No type info: Macros run during parsing, before type checking. The compiler hasn’t yet determined what traits each type implements. That’s why #[derive(Debug)] generates the code unconditionally—it can’t verify Debug is implementable.
- Debugging: (1) Use cargo expand to see the generated code. (2) Add compile_error!("...") at strategic points. (3) Write the expected output manually, then compare. (4) Use trybuild for testing expected compile errors.
Real-World Applications
- Serde: #[derive(Serialize, Deserialize)] for automatic serialization
- Clap: #[derive(Parser)] for CLI argument parsing
- Async-trait: Makes async functions work in traits
- Thiserror/Anyhow: Error handling derive macros
- Diesel: Database ORM with compile-time SQL checking
Where You’ll Apply It
| Project | How Macros Are Used |
|---|---|
| Project 2: Box-less Async Trait | Generate Future enum variants |
| Project 7: Zero-Copy Parser | Protocol definition DSL |
| Project 10: Procedural Macro | Full reflection derive implementation |
References
- “Programming Rust” by Blandy & Orendorff — Ch. 20: Macros
- The Little Book of Rust Macros — Comprehensive macro guide
- Proc Macro Workshop — Hands-on exercises
- syn crate docs — Parsing API reference
- quote crate docs — Code generation API reference
Key Insight
Macros trade writing code for writing code that writes code—they must be tested as carefully as the code they generate.
Summary
Rust macros are compile-time code generators: declarative macros (macro_rules!) for pattern-based transformations, and procedural macros for arbitrary manipulation via the syn/quote ecosystem. Macros are hygienic (no accidental name capture), operate on tokens (not types), and enable powerful patterns like derive macros, DSLs, and reflection. Debug with cargo expand, test with trybuild, and preserve spans for good error messages.
Homework/Exercises
- Declarative macro: Write a hashmap! macro that creates a HashMap from key-value pairs.
- Derive macro: Create #[derive(Builder)] that generates a builder pattern for structs.
- Attribute macro: Write #[time_it] that wraps a function to print execution time.
- Error handling: Modify your derive macro to emit helpful compile errors for unsupported inputs.
Solutions to Exercises
Exercise 1 Solution:
macro_rules! hashmap {
($($key:expr => $val:expr),* $(,)?) => {{
let mut map = std::collections::HashMap::new();
$(map.insert($key, $val);)*
map
}};
}
let m = hashmap! {
"a" => 1,
"b" => 2,
};
Exercise 2 Solution:
#[proc_macro_derive(Builder)]
pub fn derive_builder(input: TokenStream) -> TokenStream {
let ast = parse_macro_input!(input as DeriveInput);
let name = &ast.ident;
let builder_name = format_ident!("{}Builder", name);
let fields = /* extract fields */;
let builder_fields = fields.iter().map(|f| {
let name = &f.ident;
let ty = &f.ty;
quote! { #name: Option<#ty> }
});
let builder_methods = fields.iter().map(|f| {
let name = &f.ident;
let ty = &f.ty;
quote! {
fn #name(mut self, val: #ty) -> Self {
self.#name = Some(val);
self
}
}
});
quote! {
struct #builder_name { #(#builder_fields),* }
impl #builder_name { #(#builder_methods)* }
impl #name {
fn builder() -> #builder_name { ... }
}
}.into()
}
Exercise 3 Solution:
#[proc_macro_attribute]
pub fn time_it(_attr: TokenStream, item: TokenStream) -> TokenStream {
let input = parse_macro_input!(item as ItemFn);
let name = &input.sig.ident;
let block = &input.block;
let sig = &input.sig;
quote! {
#sig {
let start = std::time::Instant::now();
let result = (|| #block)();
println!("{} took {:?}", stringify!(#name), start.elapsed());
result
}
}.into()
}
Exercise 4 Solution:
// In derive macro:
let fields = match &ast.data {
syn::Data::Struct(data) => &data.fields,
syn::Data::Enum(_) => {
return syn::Error::new(
ast.ident.span(),
"Builder derive only supports structs, not enums"
).to_compile_error().into();
}
syn::Data::Union(_) => {
return syn::Error::new(
ast.ident.span(),
"Builder derive does not support unions"
).to_compile_error().into();
}
};
Concept 7: no_std, FFI, and Hardware Boundaries
Fundamentals
When you compile Rust with #![no_std], you are telling the compiler: “I have no operating system, no heap allocator by default, and no I/O primitives.” This fundamentally changes what code can do. You lose access to the std crate and gain only core—a library of pure functions with zero platform assumptions. Optionally, you can enable alloc for heap-based collections if you provide a global allocator.
no_std Rust is used for embedded systems, OS kernels, bootloaders, hypervisors, WebAssembly modules, and any context where the standard library’s OS dependencies are unavailable or undesirable. You must explicitly provide:
- A panic handler (#[panic_handler]) because the default unwind mechanism requires OS support
- An entry point (not main()) because _start or the reset vector is your true entry
- Memory initialization for the .data and .bss sections
- An allocator if you want heap allocations
- I/O mechanisms because println! relies on std::io
FFI (Foreign Function Interface) is the boundary between Rust and other languages, typically C. This requires understanding calling conventions (extern "C"), memory layout (repr(C)), and the fundamental truth that FFI boundaries are always unsafe—Rust cannot verify invariants across language boundaries.
Together, no_std and FFI represent the deepest level of systems programming in Rust: you own everything, nothing is hidden, and every assumption must be explicit.
Deep Dive
The no_std Hierarchy
Rust’s standard library is actually three crates:
┌─────────────────────────────────────────────────────────────────┐
│ std │
│ ┌───────────────────────────────────────────────────────────┐ │
│ │ alloc │ │
│ │ ┌─────────────────────────────────────────────────────┐ │ │
│ │ │ core │ │ │
│ │ │ │ │ │
│ │ │ - Primitive types (bool, i32, f64, etc.) │ │ │
│ │ │ - Basic traits (Copy, Clone, Debug, Eq, Ord) │ │ │
│ │ │ - Option, Result, iterators │ │ │
│ │ │ - Slices and arrays │ │ │
│ │ │ - Atomic types and core::sync::atomic │ │ │
│ │ │ - ptr, mem, cell, marker modules │ │ │
│ │ │ - fmt (but no I/O) │ │ │
│ │ │ - panic infrastructure (no default handler) │ │ │
│ │ └─────────────────────────────────────────────────────┘ │ │
│ │ │ │
│ │ + Vec, String, Box, Rc, Arc │ │
│ │ + BTreeMap, BTreeSet, BinaryHeap │ │
│ │ + Requires global allocator │ │
│ └───────────────────────────────────────────────────────────┘ │
│ │
│ + std::io, std::fs, std::net, std::thread │
│ + std::sync (Mutex, RwLock, Condvar) │
│ + std::env, std::process, std::os │
│ + Default panic handler with backtraces │
│ + Default global allocator (system malloc) │
└─────────────────────────────────────────────────────────────────┘
Startup Sequence on Bare Metal
When power is applied to a microcontroller, this sequence occurs:
┌──────────────────────────────────────────────────────────────────────────┐
│ BARE METAL BOOT SEQUENCE │
├──────────────────────────────────────────────────────────────────────────┤
│ │
│ 1. RESET VECTOR │
│ ├── Hardware fetches address from vector table entry 0 │
│ ├── CPU jumps to that address │
│ └── Typically points to _start in your binary │
│ │
│ 2. _start (Assembly/Linker-provided) │
│ ├── Zero the .bss section (uninitialized globals) │
│ ├── Copy .data from FLASH to RAM (initialized globals) │
│ ├── Set up the stack pointer (SP register) │
│ └── Call the Rust entry point │
│ │
│ 3. RUST ENTRY (your #[entry] function) │
│ ├── Initialize hardware peripherals │
│ ├── Set up interrupt handlers │
│ ├── Configure clocks and power │
│ └── Enter main loop or scheduler │
│ │
│ 4. MAIN LOOP / SCHEDULER │
│ ├── Poll peripherals or wait for interrupts │
│ ├── Handle events (timers, UART, GPIO) │
│ └── Never return (embedded systems run forever) │
│ │
│ INTERRUPT PATH (parallel to main loop): │
│ ├── Hardware triggers IRQ │
│ ├── CPU saves context to stack │
│ ├── Fetches handler address from vector table │
│ ├── Executes ISR (Interrupt Service Routine) │
│ └── Restores context and returns │
│ │
└──────────────────────────────────────────────────────────────────────────┘
Memory-Mapped I/O (MMIO)
On embedded systems, hardware registers are mapped to memory addresses. Reading or writing these addresses controls hardware:
┌─────────────────────────────────────────────────────────────────────────┐
│ MEMORY MAP (ARM Cortex-M example) │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ 0xFFFF_FFFF ┌───────────────────────────────────────────────────────┐ │
│ │ System Control Block │ │
│ 0xE000_0000 ├───────────────────────────────────────────────────────┤ │
│ │ Private Peripheral Bus │ │
│ 0x4000_0000 ├───────────────────────────────────────────────────────┤ │
│ │ Peripheral registers (UART, GPIO, SPI, I2C) │ │
│ │ │ │
│ │ Example: GPIOA_ODR at 0x4800_0014 │ │
│ │ - Writing 0x0001 turns on LED at PA0 │ │
│ │ - Reading returns current pin states │ │
│ │ │ │
│ 0x2000_0000 ├───────────────────────────────────────────────────────┤ │
│ │ SRAM (Stack, Heap, .bss, .data) │ │
│ 0x0800_0000 ├───────────────────────────────────────────────────────┤ │
│ │ FLASH (Code, .rodata, vector table) │ │
│ 0x0000_0000 └───────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Critical MMIO Rules:
- Use volatile reads/writes - The compiler must not optimize away or reorder hardware accesses
- Correct width matters - Some registers require exactly 32-bit access; byte access may fail or trigger undefined behavior
- Read-modify-write hazards - Registers with write-only or read-clear bits need special handling (see the sketch after this list)
- Alignment - Most architectures require natural alignment for register access
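A hedged sketch of a volatile read-modify-write on a memory-mapped register: the address and bit positions are illustrative only; on real hardware take them from the reference manual, or use a PAC/HAL as described later.
use core::ptr::{read_volatile, write_volatile};
const GPIO_MODER: *mut u32 = 0x4002_0000 as *mut u32;
/// Set pin 5 to output mode (01) without disturbing the other pins.
fn set_pin5_output() {
    // SAFETY: illustrative only; the caller must ensure this address maps a real GPIO block
    unsafe {
        let current = read_volatile(GPIO_MODER);            // volatile read: never optimized away
        let cleared = current & !(0b11 << 10);               // clear the two mode bits for pin 5
        write_volatile(GPIO_MODER, cleared | (0b01 << 10));  // volatile write-back
    }
}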
Linker Scripts
The linker script tells the linker where to place code and data in memory:
MEMORY {
FLASH : ORIGIN = 0x08000000, LENGTH = 256K
RAM : ORIGIN = 0x20000000, LENGTH = 64K
}
SECTIONS {
.text : {
*(.vector_table) /* Vector table must be first */
*(.text .text.*) /* Code */
*(.rodata .rodata.*) /* Read-only data */
} > FLASH
.data : {
_sdata = .;
*(.data .data.*) /* Initialized globals */
_edata = .;
} > RAM AT > FLASH /* Load from FLASH, run from RAM */
.bss : {
_sbss = .;
*(.bss .bss.*) /* Zero-initialized globals */
_ebss = .;
} > RAM
_stack_top = ORIGIN(RAM) + LENGTH(RAM);
}
FFI: The Unsafe Boundary
FFI with C requires three key elements:
- extern "C" ABI - Use the C calling convention (argument passing, return values, name mangling)
- repr(C) layout - Match C’s struct layout rules (field order, padding, alignment)
- unsafe blocks - Rust cannot verify C code’s safety invariants
┌─────────────────────────────────────────────────────────────────────────┐
│ FFI SAFETY BOUNDARY │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ RUST SIDE (Safe by default) C SIDE (Unsafe assumptions) │
│ ┌─────────────────────────┐ ┌─────────────────────────┐ │
│ │ fn process(data: &[u8]) │ │ void process( │ │
│ │ -> Result<Output, E> │ │ uint8_t* ptr, │ │
│ │ { │ │ size_t len │ │
│ │ // Bounds checked │ │ ); │ │
│ │ // Null-safe │ │ // Trusts ptr != NULL │ │
│ │ // Lifetime tracked │ │ // Trusts len is valid │ │
│ │ } │ │ // No ownership model │ │
│ └──────────┬──────────────┘ └──────────┬──────────────┘ │
│ │ │ │
│ │ ┌─────────────────────┐ │ │
│ └─────► FFI BOUNDARY ◄───────┘ │
│ │ │ │
│ │ extern "C" fn │ │
│ │ #[repr(C)] struct │ │
│ │ unsafe { ... } │ │
│ │ │ │
│ │ YOU MUST VERIFY: │ │
│ │ - Pointer validity │ │
│ │ - Length bounds │ │
│ │ - Aliasing rules │ │
│ │ - Thread safety │ │
│ │ - Lifetime safety │ │
│ └─────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Panic Handling in no_std
Without std, you must provide a panic handler:
#[panic_handler]
fn panic(info: &PanicInfo) -> ! {
    // Option 1: log via a debug probe (RTT), UART, or LED pattern, then halt
    if let Some(location) = info.location() {
        rprintln!("Panic at {}:{}", location.file(), location.line());
    }
    // Option 2: reset the chip instead of halting
    // cortex_m::peripheral::SCB::sys_reset();
    // Option 3: just halt
    loop {}
}
The -> ! (never type) indicates this function never returns—panics are terminal in embedded systems.
Cross-Compilation Toolchain
Building for embedded targets requires:
┌─────────────────────────────────────────────────────────────────────────┐
│ CROSS-COMPILATION PIPELINE │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ 1. Install target: │
│ rustup target add thumbv7em-none-eabihf │
│ │
│ 2. Configure .cargo/config.toml: │
│ [build] │
│ target = "thumbv7em-none-eabihf" │
│ │
│ [target.thumbv7em-none-eabihf] │
│ runner = "probe-run --chip STM32F411CEUx" │
│ rustflags = ["-C", "link-arg=-Tlink.x"] │
│ │
│ 3. Build: │
│ cargo build --release │
│ │
│ 4. Flash: │
│ cargo run --release # Uses probe-run to flash and run │
│ │
│ Common targets: │
│ ├── thumbv6m-none-eabi (Cortex-M0/M0+) │
│ ├── thumbv7m-none-eabi (Cortex-M3) │
│ ├── thumbv7em-none-eabi (Cortex-M4/M7, no FPU) │
│ ├── thumbv7em-none-eabihf (Cortex-M4F/M7F, with FPU) │
│ ├── riscv32i-unknown-none-elf (RISC-V 32-bit) │
│ └── x86_64-unknown-none (x86-64 bare metal) │
│ │
└─────────────────────────────────────────────────────────────────────────┘
How This Fits in Projects
no_std and FFI concepts are central to:
- Project 4 (no_std Kernel Core): Building a minimal kernel requires understanding boot sequences, linker scripts, and panic handlers
- Project 11 (no_std Game Boy Core): Game Boy emulation without OS dependencies requires MMIO simulation and custom memory regions
- Project 12 (High-Performance KV Store): FFI may be needed for integrating with C libraries or system calls
- Final Overall Project: Combines kernel, emulation, and low-level memory management
Definitions & Key Terms
| Term | Definition |
|---|---|
| core | Rust’s platform-agnostic standard library subset; no heap, no I/O, no OS dependencies |
| alloc | Heap allocation primitives (Vec, Box, String) requiring a global allocator |
| no_std | Crate attribute opting out of the standard library; enables core only |
| MMIO | Memory-Mapped I/O; accessing hardware registers through memory addresses |
| BSP | Board Support Package; hardware abstraction layer for a specific board |
| PAC | Peripheral Access Crate; auto-generated register definitions from SVD files |
| HAL | Hardware Abstraction Layer; safe, higher-level APIs over PAC |
| Linker Script | Configuration file defining memory layout and section placement |
| Vector Table | Array of function pointers for reset and interrupt handlers |
| repr(C) | Attribute ensuring C-compatible struct layout |
| extern "C" | Calling convention matching the C ABI |
| volatile | Access semantics preventing compiler optimization of memory operations |
| ISR | Interrupt Service Routine; handler executed when hardware triggers an interrupt |
| .bss | Section for zero-initialized static variables |
| .data | Section for initialized static variables (copied from FLASH to RAM) |
| .text | Section containing executable code |
Mental Model Diagram
┌─────────────────────────────────────────────────────────────────────────────────┐
│ no_std RUST MENTAL MODEL │
├─────────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────────────────────────────────────────┐ │
│ │ APPLICATION LAYER │ │
│ │ │ │
│ │ ┌─────────────────────────────────────────────────────────────────┐ │ │
│ │ │ Your Rust Code (#![no_std]) │ │ │
│ │ │ ├── Uses: core, alloc (optional) │ │ │
│ │ │ ├── Provides: #[panic_handler], #[entry], allocator │ │ │
│ │ │ └── Cannot use: std::io, std::fs, std::thread, println! │ │ │
│ │ └─────────────────────────────────────────────────────────────────┘ │ │
│ │ │ │ │
│ │ ▼ │ │
│ │ ┌─────────────────────────────────────────────────────────────────┐ │ │
│ │ │ HAL (Hardware Abstraction Layer) │ │ │
│ │ │ - Safe APIs: gpio.set_high(), uart.write(data) │ │ │
│ │ │ - Owns peripherals via singleton pattern │ │ │
│ │ │ - Examples: stm32f4xx-hal, nrf52-hal, esp-hal │ │ │
│ │ └─────────────────────────────────────────────────────────────────┘ │ │
│ │ │ │ │
│ │ ▼ │ │
│ │ ┌─────────────────────────────────────────────────────────────────┐ │ │
│ │ │ PAC (Peripheral Access Crate) │ │ │
│ │ │ - Auto-generated from SVD files │ │ │
│ │ │ - Raw register access: GPIOA.odr.write(|w| w.odr0().set_bit())│ │ │
│ │ │ - Unsafe in many places; HAL wraps safely │ │ │
│ │ └─────────────────────────────────────────────────────────────────┘ │ │
│ │ │ │ │
│ └──────────────────────────────┼──────────────────────────────────────────┘ │
│ │ │
│ ┌──────────────────────────────┼──────────────────────────────────────────┐ │
│ │ RUNTIME LAYER │ │
│ │ │ │ │
│ │ ┌─────────────────────────────────────────────────────────────────┐ │ │
│ │ │ cortex-m-rt / riscv-rt (Runtime Crate) │ │ │
│ │ │ - Provides: vector table, _start, memory initialization │ │ │
│ │ │ - Calls your #[entry] function │ │ │
│ │ │ - Handles stack setup, .bss zeroing, .data copying │ │ │
│ │ └─────────────────────────────────────────────────────────────────┘ │ │
│ │ │ │ │
│ │ ┌─────────────────────────────────────────────────────────────────┐ │ │
│ │ │ Linker Script (memory.x) │ │ │
│ │ │ - Defines FLASH and RAM regions │ │ │
│ │ │ - Places .text, .data, .bss, .stack │ │ │
│ │ │ - Exports symbols: _sdata, _edata, _sbss, _ebss, _stack_top │ │ │
│ │ └─────────────────────────────────────────────────────────────────┘ │ │
│ │ │ │ │
│ └──────────────────────────────┼──────────────────────────────────────────┘ │
│ │ │
│ ┌──────────────────────────────┼──────────────────────────────────────────┐ │
│ │ HARDWARE LAYER │ │
│ │ │ │ │
│ │ ┌──────────────────────────▼──────────────────────────────────────┐ │ │
│ │ │ CPU + Peripherals │ │ │
│ │ │ │ │ │
│ │ │ Registers Memory Map Interrupts │ │ │
│ │ │ ┌──────┐ ┌─────────────┐ ┌──────────┐ │ │ │
│ │ │ │ R0 │ │ 0x0800_0000 │──────│ NVIC │ │ │ │
│ │ │ │ R1 │ │ FLASH │ │ Vector │ │ │ │
│ │ │ │ ... │ │ 0x2000_0000 │ │ Table │ │ │ │
│ │ │ │ SP │ │ SRAM │ │ IRQ 0-N │ │ │ │
│ │ │ │ PC │ │ 0x4000_0000 │ └──────────┘ │ │ │
│ │ │ └──────┘ │ Periph │ │ │ │
│ │ │ └─────────────┘ │ │ │
│ │ └─────────────────────────────────────────────────────────────────┘ │ │
│ │ │ │
│ └─────────────────────────────────────────────────────────────────────────┘ │
│ │
│ FFI BOUNDARY (when interacting with C code): │
│ ┌─────────────────────────────────────────────────────────────────────────┐ │
│ │ extern "C" { fn c_function(ptr: *const u8, len: usize); } │ │
│ │ #[repr(C)] struct CCompatibleStruct { field1: u32, field2: u32 } │ │
│ │ unsafe { c_function(data.as_ptr(), data.len()); } │ │
│ └─────────────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────────┘
How It Works (Step-by-Step)
Setting up a no_std project:
Step 1: Create the project
$ cargo new --lib my_embedded
$ cd my_embedded
Step 2: Add #![no_std] to lib.rs
#![no_std]
// Required: panic handler
use core::panic::PanicInfo;
#[panic_handler]
fn panic(_info: &PanicInfo) -> ! {
loop {}
}
Step 3: Configure Cargo.toml
[dependencies]
cortex-m = "0.7"
cortex-m-rt = "0.7"
stm32f4xx-hal = { version = "0.17", features = ["stm32f411"] }
[profile.release]
opt-level = "s" # Optimize for size
lto = true # Link-time optimization
codegen-units = 1 # Better optimization
Step 4: Add .cargo/config.toml
[build]
target = "thumbv7em-none-eabihf"
[target.thumbv7em-none-eabihf]
runner = "probe-run --chip STM32F411CEUx"
rustflags = ["-C", "link-arg=-Tlink.x"]
Step 5: Create memory.x (linker script)
MEMORY {
FLASH : ORIGIN = 0x08000000, LENGTH = 512K
RAM : ORIGIN = 0x20000000, LENGTH = 128K
}
Step 6: Write your #[entry] function
#[entry]
fn main() -> ! {
// Initialize peripherals
let dp = stm32f4xx_hal::pac::Peripherals::take().unwrap();
// ... setup GPIO, timers, etc.
loop {
// Main loop
}
}
Step 7: Build and flash
$ cargo run --release
Implementing FFI safely:
Step 1: Define the C interface
// lib.h (C header)
typedef struct {
uint32_t x;
uint32_t y;
} Point;
int process_points(const Point* points, size_t count);
Step 2: Mirror in Rust with repr(C)
#[repr(C)]
pub struct Point {
x: u32,
y: u32,
}
Step 3: Declare the extern function
extern "C" {
fn process_points(points: *const Point, count: usize) -> i32;
}
Step 4: Create a safe wrapper
pub fn process_points_safe(points: &[Point]) -> Result<i32, Error> {
if points.is_empty() {
return Err(Error::EmptyInput);
}
let result = unsafe {
process_points(points.as_ptr(), points.len())
};
if result < 0 {
Err(Error::CError(result))
} else {
Ok(result)
}
}
Step 5: Link the C library
// build.rs
fn main() {
println!("cargo:rustc-link-lib=mylib");
println!("cargo:rustc-link-search=/path/to/lib");
}
MMIO register access:
Step 1: Define the register address
const GPIOA_ODR: *mut u32 = 0x4002_0014 as *mut u32;
Step 2: Use volatile access
// WRONG - optimizer may eliminate
unsafe { *GPIOA_ODR = 0x0001; }
// CORRECT - volatile prevents optimization
unsafe { core::ptr::write_volatile(GPIOA_ODR, 0x0001); }
Step 3: Prefer type-safe wrappers
use volatile_register::RW;
#[repr(C)]
struct GpioRegisters {
moder: RW<u32>,
otyper: RW<u32>,
ospeedr: RW<u32>,
pupdr: RW<u32>,
idr: RW<u32>,
odr: RW<u32>,
// ...
}
let gpio = unsafe { &*(0x4002_0000 as *const GpioRegisters) };
unsafe { gpio.odr.write(0x0001); }
Step 4: Use PAC/HAL for real projects
// PAC level (auto-generated, still unsafe)
let dp = pac::Peripherals::take().unwrap();
dp.GPIOA.odr.write(|w| unsafe { w.bits(0x0001) });
// HAL level (safe, ergonomic)
let gpioa = dp.GPIOA.split();
let mut led = gpioa.pa0.into_push_pull_output();
led.set_high();
Minimal Concrete Example
Blinking LED on STM32 (no_std):
#![no_std]
#![no_main]
use cortex_m_rt::entry;
use panic_halt as _;
use stm32f4xx_hal::{
pac,
prelude::*,
};
#[entry]
fn main() -> ! {
// Take ownership of peripherals
let dp = pac::Peripherals::take().unwrap();
let cp = cortex_m::Peripherals::take().unwrap();
    // Configure clocks (internal oscillator -> 84 MHz system clock)
let rcc = dp.RCC.constrain();
let clocks = rcc.cfgr.sysclk(84.MHz()).freeze();
// Configure GPIO
let gpioa = dp.GPIOA.split();
let mut led = gpioa.pa5.into_push_pull_output();
// Configure delay provider
let mut delay = cp.SYST.delay(&clocks);
// Blink forever
loop {
led.set_high();
delay.delay_ms(500u32);
led.set_low();
delay.delay_ms(500u32);
}
}
FFI with C library:
// Rust side
#[repr(C)]
pub struct Buffer {
data: *mut u8,
len: usize,
capacity: usize,
}
extern "C" {
fn compress(input: *const u8, input_len: usize, output: *mut Buffer) -> i32;
fn free_buffer(buf: *mut Buffer);
}
pub fn compress_data(input: &[u8]) -> Result<Vec<u8>, i32> {
let mut output = Buffer {
data: core::ptr::null_mut(),
len: 0,
capacity: 0,
};
let result = unsafe {
compress(input.as_ptr(), input.len(), &mut output)
};
if result != 0 {
return Err(result);
}
    // Copy the data into a Rust-owned Vec. Vec::from_raw_parts would be unsound
    // here because the buffer came from the C allocator, not Rust's global allocator.
    let vec = unsafe {
        core::slice::from_raw_parts(output.data, output.len).to_vec()
    };
    // Hand the buffer back to C for deallocation
    unsafe { free_buffer(&mut output) };
    Ok(vec)
}
Common Misconceptions
| Misconception | Reality |
|---|---|
| “no_std means unsafe Rust” | no_std only removes OS dependencies; you can still write safe Rust with core |
| “You can’t use async without std” | core::future::Future exists; you need a custom executor, but async works |
| “repr(C) guarantees no padding” | It guarantees C-compatible layout WITH padding; use #[repr(packed)] for no padding (but beware alignment) |
| “All FFI is inherently slow” | FFI overhead is minimal; the extern "C" call itself is just a few instructions |
| “You need assembly for embedded” | Rust covers most cases; inline assembly (asm!) handles the rest |
| “volatile in Rust is like C” | Rust uses read_volatile/write_volatile functions, not a type qualifier |
| “Embedded Rust is less safe than C” | Embedded Rust maintains safety guarantees; only MMIO and FFI require unsafe |
| “You can’t debug no_std code” | RTT, semihosting, and debug probes provide excellent debugging |
Check-Your-Understanding Questions
- What happens if you forget to provide a #[panic_handler] in a no_std crate?
- Why is volatile access required for MMIO registers?
- What’s the difference between the .data and .bss sections?
- When would you use repr(C) vs repr(transparent)?
- Why can’t you call Vec::new() in no_std without enabling alloc?
- What does the linker script symbol _sdata represent?
- Why does the entry point return ! (the never type)?
- How does Rust ensure thread safety when a global allocator is used in no_std?
Check-Your-Understanding Answers
- Missing #[panic_handler]: The build of the final binary fails; rustc reports that a #[panic_handler] function is required but not found (older toolchains surfaced this as a link error). The handler is required because core defines the panic infrastructure but not the handler itself.
- Why volatile for MMIO: The compiler doesn’t know that reading or writing a memory address has side effects. Without volatile, it may:
  - Eliminate “redundant” reads (but the register value may change from hardware)
  - Reorder writes (but peripheral timing may depend on order)
  - Eliminate “dead” writes (but the write triggers hardware actions)
- .data vs .bss: .data holds initialized static variables (e.g., static X: u32 = 42); it is stored in FLASH and copied to RAM at startup. .bss holds zero-initialized static variables (e.g., static mut Y: u32 = 0); it is not stored in FLASH and is simply zeroed in RAM at startup.
- repr(C) vs repr(transparent): repr(C) is for FFI; it guarantees C-compatible layout with potential padding. repr(transparent) is for single-field wrappers; it guarantees identical layout to the wrapped type.
- Vec::new() in no_std: Vec lives in the alloc crate, which requires heap allocation. Without a global allocator, there is no heap. Enable alloc and provide a #[global_allocator].
- _sdata symbol: The linker script exports this as the start address of the .data section in RAM. Startup code uses _sdata and _edata to know where to copy initialized data from FLASH (a sketch of this startup copy follows this list).
- Entry returns !: Embedded systems run forever; there is no caller to return to, so returning from the entry point would be undefined behavior. The ! type enforces this at compile time.
- Thread safety with global allocator: The GlobalAlloc trait takes &self, not &mut self, so implementations must be internally synchronized. Most embedded allocators are single-threaded and use critical sections to prevent interrupt interference.
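As referenced in the _sdata answer above, here is a minimal sketch of the copy/zero loop a runtime crate performs before calling your entry point. It assumes the linker script exports the symbols shown; exact names vary between runtime crates, and the _sidata load-address symbol is an assumed name used here for illustration.
use core::ptr;
extern "C" {
    static mut _sbss: u32;   // start of .bss in RAM
    static mut _ebss: u32;   // end of .bss in RAM
    static mut _sdata: u32;  // start of .data in RAM
    static mut _edata: u32;  // end of .data in RAM
    static _sidata: u32;     // load address of .data in FLASH (assumed symbol name)
}
/// Runs once before main: zero .bss, then copy .data from FLASH to RAM.
unsafe fn init_ram() {
    let mut bss = ptr::addr_of_mut!(_sbss);
    while bss < ptr::addr_of_mut!(_ebss) {
        ptr::write_volatile(bss, 0);
        bss = bss.add(1);
    }
    let mut src = ptr::addr_of!(_sidata);
    let mut dst = ptr::addr_of_mut!(_sdata);
    while dst < ptr::addr_of_mut!(_edata) {
        ptr::write_volatile(dst, ptr::read_volatile(src));
        src = src.add(1);
        dst = dst.add(1);
    }
}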
Real-World Applications
| Domain | Example | Why no_std/FFI |
|---|---|---|
| Microcontrollers | STM32 firmware | No OS, direct hardware access |
| Operating Systems | Redox OS, Tock OS | Kernel cannot depend on itself |
| Bootloaders | oreboot (Rust coreboot) | Runs before any OS exists |
| Hypervisors | Firecracker, cloud-hypervisor | Must manage VMs at bare-metal level |
| Real-time Systems | Drone flight controllers | Deterministic timing, no allocator jitter |
| Cryptography | ring, rustls | FFI to constant-time assembly routines |
| Game Engines | Bevy (plugins) | FFI to platform graphics APIs |
| Databases | TiKV (RocksDB) | FFI to C++ storage engine |
| WebAssembly | WASM modules | No OS in browser sandbox |
| Hardware Security Modules | TPM firmware | Isolated, minimal trusted computing base |
Where You’ll Apply It
| Project | How no_std/FFI Is Used |
|---|---|
| Project 4 | Full no_std kernel with panic handler, linker script, boot sequence |
| Project 11 | no_std Game Boy emulator simulating hardware without OS |
| Project 12 | FFI with system libraries for efficient I/O and memory management |
| Final Overall Project | Combines all techniques in a no_std kernel with emulation |
References
Books:
- “The Rust Programming Language” Ch. 19 (Unsafe Rust) - Klabnik & Nichols
- “Rust for Rustaceans” Ch. 10 (Concurrency and FFI) - Jon Gjengset
- “The Secret Life of Programs” Ch. 5 (Hardware foundations) - Jonathan E. Steinhart
- “Programming Rust, 3rd Edition” Ch. 22 (Unsafe Code) - Blandy, Orendorff, Tindall
- “Making Embedded Systems, 2nd Edition” - Elecia White (general embedded concepts)
Documentation:
- The Embedded Rust Book - Official embedded guide
- cortex-m-rt documentation - ARM Cortex-M runtime
- The Rustonomicon - Unsafe Rust deep dive
- The FFI Omnibus - FFI patterns
Specifications:
- ARM Architecture Reference Manual (for Cortex-M specifics)
- SVD files (System View Description) for peripheral register definitions
- ELF specification for understanding linker script output
Key Insights
“With no_std, you own the platform contract: nothing is implicit. Every assumption about memory, panics, and I/O must be explicitly provided.”
This is the essence of bare-metal Rust: maximum control, maximum responsibility. The compiler still helps you with memory safety, but you must tell it everything about your platform.
Summary
no_std and FFI represent the deepest layer of Rust systems programming:
- no_std removes OS dependencies, giving you only core (and optionally alloc)
- You must provide: panic handler, entry point, memory initialization, and optionally an allocator
- Linker scripts define where code and data live in memory
- MMIO requires volatile access to prevent compiler optimization
- FFI requires extern "C", repr(C), and careful safety reasoning
- The embedded ecosystem (PAC, HAL, BSP) provides safe abstractions over unsafe hardware access
- Testing requires both host (mocking) and target (probe) strategies
- Despite the complexity, Rust’s safety guarantees still apply; only the hardware boundary is unsafe
Homework/Exercises
Exercise 1: Panic Handler with LED Pattern
Write a #[panic_handler] that blinks an LED in a pattern (e.g., SOS in Morse code) before halting. You’ll need to use MMIO to toggle GPIO.
Exercise 2: Safe MMIO Wrapper
Create a safe wrapper type for a 32-bit read-write register that:
- Takes the register address at construction time
- Provides read() and write() methods using volatile access
- Prevents misuse through the type system
Exercise 3: FFI Struct Alignment
Given this C header, write the matching Rust definitions:
// C code
typedef struct {
uint8_t flags;
uint32_t value;
uint16_t count;
} DataPacket; // What size is this?
int process_packet(const DataPacket* packet);
Calculate the struct size considering alignment rules and verify with core::mem::size_of.
Exercise 4: Custom Global Allocator
Implement a simple bump allocator for no_std:
- Fixed-size static buffer (e.g., 4KB)
- Never frees (bump pointer only advances)
- Implement the GlobalAlloc trait
- Handle alignment correctly
Homework Solutions
Exercise 1 Solution:
#[panic_handler]
fn panic(_info: &PanicInfo) -> ! {
// Assume LED is on PA5, already configured as output
const GPIOA_ODR: *mut u32 = 0x4002_0014 as *mut u32;
const LED_BIT: u32 = 1 << 5;
// Simple delay (not accurate, depends on clock speed)
fn delay(count: u32) {
for _ in 0..count {
core::hint::spin_loop();
}
}
// SOS pattern: ... --- ...
let dot = 100_000;
let dash = 300_000;
let gap = 100_000;
let letter_gap = 300_000;
loop {
// S: ...
for _ in 0..3 {
unsafe { core::ptr::write_volatile(GPIOA_ODR, LED_BIT); }
delay(dot);
unsafe { core::ptr::write_volatile(GPIOA_ODR, 0); }
delay(gap);
}
delay(letter_gap);
// O: ---
for _ in 0..3 {
unsafe { core::ptr::write_volatile(GPIOA_ODR, LED_BIT); }
delay(dash);
unsafe { core::ptr::write_volatile(GPIOA_ODR, 0); }
delay(gap);
}
delay(letter_gap);
// S: ...
for _ in 0..3 {
unsafe { core::ptr::write_volatile(GPIOA_ODR, LED_BIT); }
delay(dot);
unsafe { core::ptr::write_volatile(GPIOA_ODR, 0); }
delay(gap);
}
delay(letter_gap * 2);
}
}
Exercise 2 Solution:
use core::ptr;
pub struct Register<T> {
addr: *mut T,
}
impl<T: Copy> Register<T> {
/// Safety: addr must be a valid MMIO register address
pub const unsafe fn new(addr: usize) -> Self {
Self { addr: addr as *mut T }
}
pub fn read(&self) -> T {
unsafe { ptr::read_volatile(self.addr) }
}
pub fn write(&self, value: T) {
unsafe { ptr::write_volatile(self.addr, value) }
}
pub fn modify<F>(&self, f: F)
where
F: FnOnce(T) -> T,
{
let val = self.read();
self.write(f(val));
}
}
// Usage:
// const GPIO_ODR: Register<u32> = unsafe { Register::new(0x4002_0014) };
// GPIO_ODR.write(0x0001);
// GPIO_ODR.modify(|v| v | 0x0002);
Exercise 3 Solution:
#[repr(C)]
pub struct DataPacket {
flags: u8,
// 3 bytes padding here (value must be 4-byte aligned)
value: u32,
count: u16,
// 2 bytes padding here (struct must be aligned to largest member)
}
// Size: 1 + 3 (padding) + 4 + 2 + 2 (padding) = 12 bytes
// Alignment: 4 bytes (due to u32)
extern "C" {
fn process_packet(packet: *const DataPacket) -> i32;
}
// Verification:
// assert_eq!(core::mem::size_of::<DataPacket>(), 12);
// assert_eq!(core::mem::align_of::<DataPacket>(), 4);
// If you need packed (no padding):
#[repr(C, packed)]
pub struct PackedDataPacket {
flags: u8, // offset 0
value: u32, // offset 1 (UNALIGNED - may be slow or illegal on some architectures)
count: u16, // offset 5
}
// Size: 7 bytes, Alignment: 1 byte
// Warning: Taking references to unaligned fields is undefined behavior!
Exercise 4 Solution:
use core::alloc::{GlobalAlloc, Layout};
use core::cell::UnsafeCell;
use core::ptr;
const HEAP_SIZE: usize = 4096;
struct BumpAllocator {
heap: UnsafeCell<[u8; HEAP_SIZE]>,
next: UnsafeCell<usize>,
}
unsafe impl Sync for BumpAllocator {}
impl BumpAllocator {
const fn new() -> Self {
Self {
heap: UnsafeCell::new([0; HEAP_SIZE]),
next: UnsafeCell::new(0),
}
}
}
unsafe impl GlobalAlloc for BumpAllocator {
unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
let next = &mut *self.next.get();
let heap = &mut *self.heap.get();
// Align up to required alignment
let align = layout.align();
let aligned = (*next + align - 1) & !(align - 1);
let end = aligned + layout.size();
if end > HEAP_SIZE {
ptr::null_mut() // Out of memory
} else {
*next = end;
heap.as_mut_ptr().add(aligned)
}
}
unsafe fn dealloc(&self, _ptr: *mut u8, _layout: Layout) {
// Bump allocator never frees
// In production, you'd track allocations or use a different strategy
}
}
#[global_allocator]
static ALLOCATOR: BumpAllocator = BumpAllocator::new();
// Note: This simple implementation is NOT interrupt-safe.
// For real use, wrap critical sections:
// cortex_m::interrupt::free(|_| { ... })
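To make that note concrete, here is one hedged way to make the same allocator usable when interrupt handlers also allocate: do the bump inside a critical section. This sketch assumes the cortex-m crate, reuses the imports and fields from the solution above, and replaces the GlobalAlloc impl shown there; multi-core or non-Cortex-M targets would need a different synchronization primitive.
// Alternative impl: same BumpAllocator fields, but the bump happens with
// interrupts disabled so an allocating ISR cannot observe a torn `next`.
unsafe impl GlobalAlloc for BumpAllocator {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        cortex_m::interrupt::free(|_| unsafe {
            let next = &mut *self.next.get();
            let heap = &mut *self.heap.get();
            let aligned = (*next + layout.align() - 1) & !(layout.align() - 1);
            let end = aligned + layout.size();
            if end > HEAP_SIZE {
                ptr::null_mut()
            } else {
                *next = end;
                heap.as_mut_ptr().add(aligned)
            }
        })
    }
    unsafe fn dealloc(&self, _ptr: *mut u8, _layout: Layout) {
        // Still a bump allocator: nothing to free.
    }
}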
Glossary
- Pin: Wrapper guaranteeing no further moves of a value.
- Unpin: Marker trait allowing safe moves even when pinned.
- Waker: Handle used by executors to reschedule futures.
- Allocator: Strategy that governs heap layout and lifetimes.
- Const Generic: Compile-time constant parameter in a type or function.
- ABA Problem: CAS hazard where pointer value is reused.
- Proc Macro: Compile-time function transforming token streams.
- no_std: Rust mode without the standard library.
Why Advanced Rust Matters
- Memory safety removes the majority of C/C++-style bugs (Microsoft estimates ~70% of vulns stem from unsafe memory).
- 2024 Rust survey: ~45% of organizations report production Rust use; 38% of developers use Rust for most work; 93.4% are active users.
- Developers cite precise control (47%) and productivity gains (53% self-reported productivity).
- Industry proof: AWS Firecracker, Linux kernel modules, Cloudflare workers, Shopify YJIT, and HFT engines rely on Rust for safety + speed.
- Sources: Rust Survey 2024 results, Enterprise adoption ~45%, Rust popularity stats.
Old vs new (safety model)
C/C++: Raw pointers ──► UB risk ──► runtime crashes
Rust: Borrow checker ──► compile-time fences ──► predictable runtime
Concept Summary Table
| Concept Cluster | What You Need to Internalize |
|---|---|
| Pinning & Self-Refs | Location invariants, projection rules, Unpin. |
| Async Internals | Futures as enums, wakers, executors. |
| Memory Control | Layout, alignment, custom allocators, fragmentation. |
| Type System Power | Const generics, GATs, variance, typestates. |
| Concurrency & Atomics | Ordering, CAS loops, lock-free patterns. |
| Macros & Metaprogramming | Token streams, hygiene, diagnostics. |
| no_std & FFI | core/alloc, linker scripts, MMIO safety. |
Project-to-Concept Map
| Project | Concepts Applied |
|---|---|
| 1. Manual Pin Projector | Pinning & Self-Refs |
| 2. Box-less Async Trait | Pinning, Async Internals, Macros |
| 3. Custom Arena Allocator | Memory Control |
| 4. no_std Kernel Core | no_std, Memory Control |
| 5. Const Generic Matrix | Type System Power |
| 6. Atomic Lock-Free Queue | Concurrency & Atomics |
| 7. Zero-Copy Protocol Parser | Lifetimes, Type System, Macros |
| 8. Custom Future Runtime | Async Internals, Concurrency |
| 9. Physical Units Lib | Type System Power |
| 10. Procedural Macro Reflection | Macros & Metaprogramming |
| 11. no_std Game Boy Core | no_std, Concurrency, Memory Control |
| 12. High-Performance KV Store | Memory Control, Concurrency |
Deep Dive Reading by Concept
| Concept | Book and Chapter | Why This Matters |
|---|---|---|
| Pinning & Self-Refs | “Rust for Rustaceans” Ch. 8 | Explains futures and pinning contracts |
| Async Internals | “Rust for Rustaceans” Ch. 8 | Executor and waker design |
| Memory Control | “The Linux Programming Interface” Ch. 7 | Allocation behavior and system calls |
| Type System | “Programming Rust” Ch. 11 | Traits, generics, const parameters |
| Concurrency | “Rust Atomics and Locks” Ch. 3, 9 | Ordering and lock-free design |
| Macros | “Programming Rust” Ch. 20 | Proc macro patterns |
| no_std | “The Secret Life of Programs” Ch. 5 | Hardware foundations for bare-metal |
Quick Start: Your First 48 Hours
Day 1:
- Read Theory Primer: Concept 1 (Pinning) and Concept 2 (Async).
- Build Project 1 scaffold; run address-stability check.
Day 2:
- Finish Project 1 Definition of Done.
- Skim Project 2 core question and pitfalls; run cargo expand on async code.
Recommended Learning Paths
Path 1: Systems Newcomer
Projects 1 -> 3 -> 5 -> 7 -> 9.
Path 2: Runtime Engineer
Projects 2 -> 6 -> 8 -> 12.
Path 3: Embedded / Kernel
Projects 3 -> 4 -> 11 -> Final Overall Project.
Success Metrics
- Able to explain pinning invariants and show a manual projection without UB.
- Can draw the async state machine for any async fn with N awaits.
- Can implement or swap a global allocator and measure impact on cache misses.
- Can justify memory ordering choices for a lock-free queue.
- Can ship a no_std binary that boots on hardware or emulator.
Optional Domain Appendices
- Debugging checklist: RUSTFLAGS="-Zsanitizer=address" (nightly), cargo miri test, cargo-asm, perf record -g, cargo flamegraph.
- Common failure signatures: stuck futures (never woken), double-free in custom allocator, ABA in queues, misaligned MMIO registers.
Project Overview Table
| # | Project Name | Main Language | Difficulty | Time Estimate |
|---|---|---|---|---|
| 1 | Manual Pin Projector | Rust | Advanced | 3-5 days |
| 2 | Box-less Async Trait | Rust | Expert | 1 week |
| 3 | Custom Arena Allocator | Rust | Advanced | 1 week |
| 4 | no_std Kernel Core | Rust | Expert | 2 weeks |
| 5 | Const Generic Matrix | Rust | Intermediate | 1-2 weeks |
| 6 | Atomic Lock-Free Queue | Rust | Master | 2-3 weeks |
| 7 | Zero-Copy Parser | Rust | Advanced | 1 week |
| 8 | Custom Future Runtime | Rust | Master | 2-3 weeks |
| 9 | Physical Units Lib | Rust | Advanced | 1 week |
| 10 | Reflect Derive Macro | Rust | Expert | 1 week |
| 11 | no_std Game Boy Core | Rust | Master | 1 month+ |
| 12 | High-Performance KV Store | Rust | Master | 1-2 months |
Project List
Project 1: The Manual Pin Projector (Understanding the Pin Contract)
- File: P01-manual-pin-projector.md
- Main Programming Language: Rust
- Alternative Programming Languages: C++20 (for contrast), Zig
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: Level 2: Foundation, not a product
- Difficulty: Level 3: Advanced
- Knowledge Area: Memory management / safety contracts
- Software or Tool: pin_utils, cargo-expand
- Main Book: “Rust for Rustaceans” Ch. 8
What you will build: A self-referential future with manual pin projection (no macros) and an address-stability verifier.
Why it teaches pinning: Forces you to uphold the pin contract without macro scaffolding.
Core challenges you will face:
- Structural vs non-structural pinning -> Pinning & Self-Refs
- Unsafe projection correctness -> Pinning & Self-Refs
- Address stability tests -> Memory Control
Real World Outcome
CLI that prints before/after addresses of a pinned self-referential struct, polls it to completion, and proves invariants via tests.
$ cargo run --example self_ref
[addr] before pin: 0x7ffd_1234
[addr] after pin: 0x6000_abcd
poll #1 -> Pending (waker registered)
poll #2 -> Ready("done")
address stable: true
The Core Question You’re Answering
“Why can’t I move this value after creating internal self-references?”
Concepts You Must Understand First
- Unpin vs !Unpin – Book: “Rust for Rustaceans” Ch. 8
- Pointer aliasing rules – Book: “The Rust Programming Language” Ch. 19
- Drop + pin interaction – Book: “Programming Rust” Ch. 21
Questions to Guide Your Design
- Safety: How will you ensure projection never moves the pinned field?
- Testing: How will you observe address stability across heap/stack moves?
Thinking Exercise
Draw the state diagram for your future: Start -> Waiting -> Ready. Mark where the self-reference is created and when it could be invalidated.
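If you want code next to that diagram, here is a minimal sketch of a self-referential value pinned with Box::pin and projected by hand; the type and method names are illustrative, not the API the project requires.
use std::marker::PhantomPinned;
use std::pin::Pin;
use std::ptr;
// A value holding a raw pointer into itself. PhantomPinned opts out of
// Unpin, so Pin really does forbid moving it.
struct SelfRef {
    data: String,
    ptr: *const String, // points at `data` once new() has finished
    _pin: PhantomPinned,
}
impl SelfRef {
    fn new(data: String) -> Pin<Box<Self>> {
        let mut boxed = Box::pin(SelfRef {
            data,
            ptr: ptr::null(),
            _pin: PhantomPinned,
        });
        // The heap location is now fixed, so taking its address is sound.
        let data_ptr: *const String = &boxed.data;
        unsafe {
            // Manual pin projection: we may mutate a field, but we must
            // never move the value out of the pin.
            Pin::get_unchecked_mut(boxed.as_mut()).ptr = data_ptr;
        }
        boxed
    }
    fn data_via_ptr<'a>(self: Pin<&'a Self>) -> &'a str {
        // Sound because the pin guarantees `data` has not moved.
        unsafe { &*self.ptr }
    }
}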
The Interview Questions They’ll Ask
- How does pinning prevent UB in async futures?
- When is it safe to implement Unpin manually?
- Why does Pin<&T> allow shared access but not movement?
Hints in Layers
- Hint 1: Use Box::pin to freeze location.
- Hint 2: Use Pin::map_unchecked_mut only for pinned fields.
- Hint 3: Add an address logger before/after heap move.
- Hint 4: Use cargo-expand to inspect the desugared projection.
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Pinning | “Rust for Rustaceans” | Ch. 8 |
| Unsafe | “Programming Rust” | Ch. 19 |
Common Pitfalls and Debugging
Problem: Dangling pointer after move.
- Fix: Ensure self-reference created after pinning; never move pinned value.
- Quick test: Log addresses pre/post move; assert equality.
Definition of Done
- Address stability proven by tests.
- Manual projection compiles without macros.
- UB-free under miri test.
- Documentation of invariants is checked in CI.
Project 2: The Box-less Async Trait (Zero-Cost Async)
- File: P02-boxless-async-trait.md
- Language: Rust
- Coolness Level: Level 4
- Business Potential: Level 3
- Difficulty: Level 4
- Knowledge Area: Async runtimes, macros
- Main Book: “Rust for Rustaceans” Ch. 8
What you will build: An async trait without boxing, using GATs and async-trait-like ergonomics but zero heap allocation for leaf futures.
Why it teaches async internals: Forces you to manage lifetimes, projection, and state machine size manually.
Core challenges:
- GAT bounds for lifetimes -> Type System Power
- Pinning futures inside trait objects -> Pinning & Async
- Macro hygiene for ergonomic API -> Macros
Real World Outcome
Library crate exposing #[async_trait_no_box] derive; example service runs with zero heap allocations (validated by cargo -Zbuild-std --profile=release + perf).
The Core Question You’re Answering
“How can I avoid boxed futures while keeping trait ergonomics?”
Concepts You Must Understand First
- GATs – Book: “Programming Rust” Ch. 11
- Future poll contract – Book: “Rust for Rustaceans” Ch. 8
- Macro hygiene – Book: “Programming Rust” Ch. 20
Questions to Guide Your Design
- How will you pin generated futures without heap boxing?
- How will you ensure lifetimes of borrows outlive the future?
Thinking Exercise
Sketch the generated enum for a two-method async trait. Label captured borrows and which states store them.
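A minimal sketch of the no-box shape this project works toward, assuming stable GATs; the trait and type names are illustrative. Each implementor names a concrete future type, so nothing is boxed.
use core::future::{ready, Future, Ready};
pub trait Fetch {
    // GAT: the returned future may borrow from &self for 'a.
    type FetchFut<'a>: Future<Output = u32> + 'a
    where
        Self: 'a;
    fn fetch<'a>(&'a self) -> Self::FetchFut<'a>;
}
struct Constant(u32);
impl Fetch for Constant {
    // `async` blocks have unnameable types, so a nameable stand-in future
    // (core::future::Ready) is used here; a macro would emit a real state enum.
    type FetchFut<'a> = Ready<u32> where Self: 'a;
    fn fetch<'a>(&'a self) -> Self::FetchFut<'a> {
        ready(self.0)
    }
}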
The Interview Questions They’ll Ask
- Why does async fn in traits require special handling?
- How does Pin<&mut T> relate to trait objects?
- How do you minimize state size in async futures?
Hints in Layers
- Hint 1: Start with a macro that emits a manual Future enum.
- Hint 2: Use #[must_use] to catch forgotten awaits.
- Hint 3: Measure allocation count with jemalloc profiling.
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Async | “Rust for Rustaceans” | Ch. 8 |
| Macros | “Programming Rust” | Ch. 20 |
Common Pitfalls and Debugging
Problem: Lifetime mismatch in generated futures.
- Fix: Add explicit HRTBs (for<'a>) and GATs on trait methods.
- Quick test: Run cargo expand and check the lifetimes of the state fields.
Definition of Done
- No heap allocations in microbenchmarks.
- Generated futures are Send when inputs are Send.
- Works with at least two executors (Tokio, async-std).
- Derive macro documented with examples.
Project 3: Custom Arena Allocator (Memory Locality)
- File: P03-custom-arena-allocator.md
- Language: Rust
- Coolness Level: Level 3
- Business Potential: Level 3
- Difficulty: Level 3
- Knowledge Area: Allocators
- Software: cargo-criterion
- Main Book: “The Linux Programming Interface” Ch. 7
What you will build: A bump/arena allocator with #[global_allocator] option and per-request reset semantics.
Why it teaches memory control: Demonstrates lifetime-based allocation and locality benefits.
Core challenges:
- Layout math -> Memory Control
- Safe vs unsafe deallocation -> Memory Control
- Measuring cache hits -> Concurrency & Performance
Real World Outcome
Benchmarks comparing system allocator vs arena for parsing workloads; shows >2x throughput for short-lived allocations.
The Core Question You’re Answering
“When is a general-purpose allocator too expensive?”
Concepts You Must Understand First
- Layout/Alignment – Book: “The Linux Programming Interface” Ch. 7
- Ownership of raw pointers – Book: “Programming Rust” Ch. 19
- Cache behavior – Book: “Computer Systems: A Programmer’s Perspective” Ch. 6
Questions to Guide Your Design
- How will you prevent double-free in a reset-only arena?
- How will you expose safe APIs with lifetimes tied to the arena?
Thinking Exercise
Draw memory after sequential allocations and after a reset. Mark fragmentation (or absence).
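A minimal sketch of the reset-safety idea, with illustrative names: allocations borrow the arena, and reset takes &mut self, so use-after-reset becomes a borrow-check error rather than a runtime bug.
use std::cell::Cell;
pub struct Arena {
    buf: Vec<u8>,
    next: Cell<usize>,
}
impl Arena {
    pub fn with_capacity(cap: usize) -> Self {
        Self { buf: vec![0; cap], next: Cell::new(0) }
    }
    /// Allocate `len` bytes aligned to `align` (power of two); None when full.
    pub fn alloc(&self, len: usize, align: usize) -> Option<&[u8]> {
        let start = (self.next.get() + align - 1) & !(align - 1);
        let end = start.checked_add(len)?;
        if end > self.buf.len() {
            return None;
        }
        self.next.set(end);
        Some(&self.buf[start..end])
    }
    /// Resetting needs &mut self, which the borrow checker only grants once
    /// every slice handed out by alloc() has been dropped.
    pub fn reset(&mut self) {
        self.next.set(0);
    }
}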
The Interview Questions They’ll Ask
- Difference between bump, slab, buddy allocators?
- Why is Drop order tricky with arenas?
- How do you benchmark allocator impact?
Hints in Layers
- Hint 1: Use std::alloc::System as baseline.
- Hint 2: Track high-water mark; reset by moving bump pointer.
- Hint 3: Guard with PhantomData<&'a mut T> to enforce lifetimes.
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Allocation | “The Linux Programming Interface” | Ch. 7 |
| Pointers | “Programming Rust” | Ch. 19 |
Common Pitfalls and Debugging
Problem: Use-after-reset.
- Fix: Tie allocations to lifetime 'arena; invalidate after reset.
- Quick test: Miri; add debug asserts on bump pointer bounds.
Definition of Done
- Benchmarks show improvement vs system allocator.
- miri test passes (no UB).
- Optional global allocator flag documented.
- Reset API proven safe via lifetimes.
Project 4: The no_std Kernel Core (The Naked Machine)
- File: P04-nostd-kernel-core.md
- Language: Rust (no_std)
- Coolness Level: Level 5
- Business Potential: Level 2
- Difficulty: Level 4
- Knowledge Area: Bare metal, boot
- Software: qemu, bootimage, probe-rs
- Main Book: “The Secret Life of Programs” Ch. 5
What you will build: A minimal kernel core that boots, sets up page tables, prints via serial, and schedules a tiny async task.
Why it teaches no_std: You own startup, memory, and I/O contracts.
Core challenges:
- Linker script + memory map -> no_std
- Panic/allocator strategy -> Memory Control
- Interrupts + async executor -> Async Internals
Real World Outcome
Bootable image under QEMU; serial output shows page table init and async task completion.
The Core Question You’re Answering
“What does it take to run Rust when no OS exists?”
Concepts You Must Understand First
- Linker scripts – Book: “Computer Systems: A Programmer’s Perspective” Ch. 7
- MMIO & volatile – Book: “Programming Rust” Ch. 21
- Panic handlers – Rust core::panic::PanicInfo docs
Questions to Guide Your Design
- How do you set up the stack and .bss/.data without std?
- How do you handle interrupts and wakers in no_std?
Thinking Exercise
Draw the boot flow: reset vector -> init -> jump to Rust main -> run executor -> idle loop.
The Interview Questions They’ll Ask
- What is no_std and when do you need it?
- Why use #[repr(C)] for device registers?
- How do you debug without println!?
Hints in Layers
- Hint 1: Start from cargo new --bin, then build against a custom target spec (e.g., --target x86_64-blog_os.json).
- Hint 2: Use panic = "abort" for a smaller binary.
panic = "abort"for smaller binary. - Hint 3: Implement a basic waker tied to a timer interrupt.
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Boot | “The Secret Life of Programs” | Ch. 5 |
| Linking | “Computer Systems: A Programmer’s Perspective” | Ch. 7 |
Common Pitfalls and Debugging
Problem: Triple fault on boot.
- Fix: Validate GDT/IDT addresses; use qemu -d int to inspect faults.
- Quick test: Add serial logging early; verify the stack pointer.
Definition of Done
- Boots in QEMU with serial log.
- Async task runs and completes.
- No unexpected page faults.
- Binary size + map file documented.
Project 5: Const Generic Matrix (Type-Level Math)
- File: P05-const-generic-matrix.md
- Language: Rust
- Coolness Level: Level 3
- Business Potential: Level 3
- Difficulty: Level 3
- Knowledge Area: Type system, numerics
- Main Book: “Programming Rust” Ch. 11
What you will build: A fixed-size matrix library using const generics to enforce dimensions at compile time, with SIMD-friendly layout.
Why it teaches type power: Demonstrates compile-time shape guarantees and specialization for performance.
Core challenges:
- Const generic bounds -> Type System Power
- repr(C) layout for FFI -> Memory Control
- SIMD alignment -> Performance
Real World Outcome
API that refuses to multiply incompatible shapes; bench shows speedup from stack allocation and SIMD-friendly layout.
The Core Question You’re Answering
“How can the type system eliminate shape errors?”
Concepts You Must Understand First
- Const generics basics – Book: “Programming Rust” Ch. 11
- Variance & lifetimes – Rustonomicon
- SIMD alignment – CPU docs, std::arch
Questions to Guide Your Design
- How do you encode compatible dimensions in traits?
- How do you store data to maximize cache hits?
Thinking Exercise
Write the signature for fn mul<const R: usize, const C: usize, const K: usize>(a: Matrix<R, K>, b: Matrix<K, C>) -> Matrix<R, C>.
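As a starting point, a sketch of the storage plus that multiply signature on stable Rust; names are illustrative and there is no SIMD yet, just the compile-time shape check.
// Dimensions live in the type: multiplying incompatible shapes fails to type-check.
#[derive(Clone, Copy)]
pub struct Matrix<const R: usize, const C: usize> {
    data: [[f64; C]; R],
}
pub fn mul<const R: usize, const C: usize, const K: usize>(
    a: &Matrix<R, K>,
    b: &Matrix<K, C>,
) -> Matrix<R, C> {
    let mut out = Matrix { data: [[0.0; C]; R] };
    for i in 0..R {
        for k in 0..K {
            for j in 0..C {
                out.data[i][j] += a.data[i][k] * b.data[k][j];
            }
        }
    }
    out
}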
The Interview Questions They’ll Ask
- How do const generics differ from macros?
- How do you avoid heap allocations for small matrices?
- How do you align for SIMD loads?
Hints in Layers
- Hint 1: Use [[T; C]; R] storage first; optimize later.
- Hint 2: Implement Index/IndexMut for ergonomics.
- Hint 3: Add assert!(K % 4 == 0) for the SIMD path.
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Const Generics | “Programming Rust” | Ch. 11 |
| SIMD | Intel/ARM manuals | N/A |
Common Pitfalls and Debugging
Problem: Large stack frames.
- Fix: Add #[inline(always)] to hot paths; consider heap allocation (Box) for huge matrices.
- Quick test: cargo asm to inspect loops.
Definition of Done
- Shape errors fail at compile time.
- Benchmarks show SIMD speedup where available.
- Optional repr(C) layout for FFI tested.
- Docs include dimension rules.
Project 6: Atomic Lock-Free Queue (The Concurrency Beast)
- File: P06-atomic-lock-free-queue.md
- Language: Rust
- Coolness Level: Level 5
- Business Potential: Level 3
- Difficulty: Level 5
- Knowledge Area: Concurrency
- Main Book: “Rust Atomics and Locks” Ch. 9
What you will build: Michael–Scott lock-free queue with tagged pointers to avoid ABA, plus benchmarks.
Why it teaches concurrency: Demands correct ordering, reclamation, and ABA mitigation.
Core challenges:
- CAS loops -> Concurrency & Atomics
- Memory reclamation -> Concurrency & Atomics
- False sharing mitigation -> Performance
Real World Outcome
Queue sustaining millions of ops/sec on 8 cores; verified linearizable via stress tests.
The Core Question You’re Answering
“How do you prove correctness of a lock-free queue?”
Concepts You Must Understand First
- Memory ordering – “Rust Atomics and Locks” Ch. 3
- ABA problem – Herlihy & Shavit
- Epoch GC – Crossbeam docs
Questions to Guide Your Design
- Which operations need SeqCst vs Acquire/Release?
- How will you reclaim nodes safely?
Thinking Exercise
Draw enqueue/dequeue pointer transitions with tags; mark happens-before edges.
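A small sketch of the CAS retry shape with a version tag packed beside an index; the layout and function are illustrative (and assume a 64-bit target), not the Michael–Scott algorithm itself.
use std::sync::atomic::{AtomicUsize, Ordering};
// Low 32 bits: index; high 32 bits: tag that only ever grows, so a slot that
// was freed and reused no longer compares equal (the usual ABA defence).
fn bump_index(state: &AtomicUsize) -> usize {
    let mut cur = state.load(Ordering::Acquire);
    loop {
        let index = cur & 0xFFFF_FFFF;
        let tag = cur >> 32;
        let next = ((tag + 1) << 32) | ((index + 1) & 0xFFFF_FFFF);
        match state.compare_exchange_weak(cur, next, Ordering::AcqRel, Ordering::Acquire) {
            Ok(_) => return index,
            Err(observed) => cur = observed, // lost the race; retry with the fresh value
        }
    }
}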
The Interview Questions They’ll Ask
- What causes ABA and how do you prevent it?
- Why can relaxed ordering be safe for counters?
- How do you test lock-free structures?
Hints in Layers
- Hint 1: Start with Crossbeam’s epoch for reclamation.
- Hint 2: Use cacheline padding on head/tail.
- Hint 3: Stress test with Loom.
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Atomics | “Rust Atomics and Locks” | Ch. 3, 9 |
| Lock-free | “Art of Multiprocessor Programming” | Ch. 10 |
Common Pitfalls and Debugging
Problem: Throughput collapse under contention.
- Fix: Add backoff or batching; reduce false sharing.
- Quick test: perf stat -e cache-misses during the benchmark.
Definition of Done
- Linearizability tests pass.
- ABA prevented via tagging/epoch.
- Benchmarks on multi-core documented.
- Loom model passes for small scenarios.
Project 7: The Zero-Copy Protocol Parser (Lifetime Mastery)
- File: P07-zero-copy-protocol-parser.md
- Language: Rust
- Coolness Level: Level 4
- Business Potential: Level 3
- Difficulty: Level 4
- Knowledge Area: Parsing, lifetimes
- Main Book: “Programming Rust” Ch. 19
What you will build: A zero-copy binary protocol parser that returns borrowed views, with fallible parsing and error spans.
Why it teaches lifetimes: Enforces borrow-scope correctness without allocations.
Core challenges:
- Slicing & lifetimes -> Type System Power
- Alignment & layout -> Memory Control
- Error reporting -> Developer Experience
Real World Outcome
Parser that can process MBs/s without allocations; includes property tests with arbitrary byte streams.
The Core Question You’re Answering
“How do I parse without owning the data?”
Concepts You Must Understand First
- Borrowing rules – “The Rust Programming Language” Ch. 4
- Nom-style parser combinators – Nom docs
- Endian + alignment – CPU manuals
Questions to Guide Your Design
- How will you ensure slices remain valid after parsing?
- How do you signal partial/invalid frames?
Thinking Exercise
Sketch a lifetime diagram: input buffer 'a borrowed by parsed views 'a; ensure no escaped references.
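A dependency-free sketch of the zero-copy idea for a hypothetical length-prefixed wire format: the returned payload borrows from the input buffer, so it can never outlive it.
#[derive(Debug, PartialEq)]
pub enum ParseError { Truncated }
/// Parse one frame: 2-byte little-endian length prefix, then the payload.
/// Returns (payload, remaining bytes) without copying either.
pub fn parse_frame<'a>(input: &'a [u8]) -> Result<(&'a [u8], &'a [u8]), ParseError> {
    if input.len() < 2 {
        return Err(ParseError::Truncated);
    }
    let len = u16::from_le_bytes([input[0], input[1]]) as usize;
    let rest = &input[2..];
    if rest.len() < len {
        return Err(ParseError::Truncated);
    }
    Ok((&rest[..len], &rest[len..]))
}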
The Interview Questions They’ll Ask
- Why prefer zero-copy for networking?
- How do you prevent out-of-bounds reads?
- How do you fuzz parsers?
Hints in Layers
- Hint 1: Use nom::bytes::complete::take with lifetimes.
- Hint 2: Validate lengths before transmute; prefer from_le_bytes.
- Hint 3: Fuzz with cargo fuzz.
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Lifetimes | “Programming Rust” | Ch. 19 |
| Parsing | “The Rust Programming Language” | Ch. 9 |
Common Pitfalls and Debugging
Problem: Dangling reference after buffer reuse.
- Fix: Tie output lifetime to input; forbid buffer reuse while views live.
- Quick test: Compile-time borrow check + Miri.
Definition of Done
- No allocations in happy-path parse.
- Property tests cover malformed inputs.
- Benchmarks exceed baseline by X%.
- Error spans include byte offsets.
Project 8: Building a Custom Runtime (Waker/Executor)
- File: P08-custom-async-runtime.md
- Language: Rust
- Coolness Level: Level 5
- Business Potential: Level 3
- Difficulty: Level 5
- Knowledge Area: Async runtime
- Main Book: “Rust for Rustaceans” Ch. 8
What you will build: A single-threaded then multi-threaded executor with timers, IO polling (via epoll/kqueue), and cooperative task budgeting.
Why it teaches async internals: You implement the entire poll/wake loop and fairness controls.
Core challenges:
- Waker design -> Async Internals
- Reactor integration -> OS APIs
- Budgeting/backpressure -> Concurrency
Real World Outcome
runtime run --bench outputs latency/throughput metrics; tasks are fair (no starvation) and timers accurate.
The Core Question You’re Answering
“What does an executor actually do beyond polling?”
Concepts You Must Understand First
- Epoll/kqueue – OS man pages
- Waker contract – Rust docs
- Task budgeting – Runtime design articles
Questions to Guide Your Design
- How will you store tasks and wakers?
- How do you avoid blocking the reactor thread?
Thinking Exercise
Draw queues: ready queue, timer wheel, IO readiness list; show transitions on wake.
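To anchor that diagram, a minimal sketch of the smallest possible executor: park the thread until the waker fires, then poll again. A real runtime replaces park/unpark with a reactor and ready queue; names are illustrative.
use std::future::Future;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};
use std::thread::{self, Thread};
// Waking simply unparks the thread that is blocked in block_on.
struct ThreadWaker(Thread);
impl Wake for ThreadWaker {
    fn wake(self: Arc<Self>) {
        self.0.unpark();
    }
}
pub fn block_on<F: Future>(mut fut: F) -> F::Output {
    // The future stays in this stack slot and is never moved again,
    // so new_unchecked is sound here.
    let mut fut = unsafe { Pin::new_unchecked(&mut fut) };
    let waker = Waker::from(Arc::new(ThreadWaker(thread::current())));
    let mut cx = Context::from_waker(&waker);
    loop {
        match fut.as_mut().poll(&mut cx) {
            Poll::Ready(out) => return out,
            Poll::Pending => thread::park(), // sleep until wake() unparks us
        }
    }
}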
The Interview Questions They’ll Ask
- Why is Send + 'static often required?
- How do you implement cancellation?
- How do you integrate timers efficiently?
Hints in Layers
- Hint 1: Start single-threaded; add multi-thread later.
- Hint 2: Use mio for portability.
- Hint 3: Instrument with tracing spans.
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Async | “Rust for Rustaceans” | Ch. 8 |
| Concurrency | “Rust Atomics and Locks” | Ch. 9 |
Common Pitfalls and Debugging
Problem: Tasks never wake.
- Fix: Ensure cloned wakers push the task’s Arc back onto the ready queue.
- Quick test: Add counter of pending tasks; ensure it hits zero in tests.
Definition of Done
- Timers accurate within threshold.
- IO tasks progress without starvation.
- Benchmarks vs Tokio (small workload) documented.
- Cancellation API works.
Project 9: Physical Units Library (Type-Safe Engineering)
- File: P09-physical-units-lib.md
- Language: Rust
- Coolness Level: Level 3
- Business Potential: Level 3
- Difficulty: Level 3
- Knowledge Area: Type system, DSL
- Main Book: “Programming Rust” Ch. 11
What you will build: A units-of-measure library enforcing dimensional correctness at compile time (meters, seconds, newtons).
Why it teaches type power: Typestate + const generics eliminate unit bugs.
Core challenges:
- Dimensional exponents in types -> Type System Power
- Operator overloading -> Traits
- Display/serialization -> Ergonomics
Real World Outcome
force = mass * acceleration compiles; mass + time fails at compile time. Docs show SI conversions and serialization.
The Core Question You’re Answering
“Can the compiler prevent unit mistakes?”
Concepts You Must Understand First
- PhantomData – to encode dimensions
- Operator traits – Add, Mul, etc.
- Const generics arithmetic – nightly features (optional)
Questions to Guide Your Design
- How to encode exponents for units?
- How to integrate with serde for APIs?
Thinking Exercise
Design a type Quantity<Unit<M, Kg, S>> where exponents are consts.
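A stable-Rust sketch of one way to encode that idea with const generic exponents (metre, kilogram, second). Only same-dimension addition is shown: multiplying quantities requires exponent arithmetic in the type, which currently needs the nightly generic_const_exprs feature. Names are illustrative.
use core::ops::Add;
// Exponents of metre, kilogram, second carried in the type.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Quantity<const M: i32, const KG: i32, const S: i32>(pub f64);
pub type Metres = Quantity<1, 0, 0>;
pub type Seconds = Quantity<0, 0, 1>;
// Addition is only defined when all exponents match, so `mass + time`
// is rejected by the type checker.
impl<const M: i32, const KG: i32, const S: i32> Add for Quantity<M, KG, S> {
    type Output = Self;
    fn add(self, rhs: Self) -> Self {
        Quantity(self.0 + rhs.0)
    }
}
// let d: Metres = Quantity(3.0) + Quantity(4.0);                  // OK
// let bad = Quantity::<1, 0, 0>(3.0) + Quantity::<0, 0, 1>(1.0);  // compile error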
The Interview Questions They’ll Ask
- Why typestates?
- How do you prevent users from constructing invalid units?
- How do you keep ergonomics (type inference)?
Hints in Layers
- Hint 1: Start with base dimensions as const tuple.
- Hint 2: Implement macros for derived units.
- Hint 3: Provide const fn constructors for literals.
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Traits | “Programming Rust” | Ch. 11 |
| DSLs | “Programming Rust” | Ch. 20 |
Common Pitfalls and Debugging
Problem: Type explosions in error messages.
- Fix: Type aliases and helper traits; good docs.
- Quick test: Compile invalid expressions to ensure errors fire.
Definition of Done
- Invalid unit arithmetic fails to compile.
- Serde integration demonstrated.
- Benchmarks show zero runtime overhead vs raw numbers.
- Docs include unit table.
Project 10: Procedural Macro for Trait Reflection (Metaprogramming)
- File: P10-procedural-macro-reflection.md
- Language: Rust
- Coolness Level: Level 4
- Business Potential: Level 3
- Difficulty: Level 4
- Knowledge Area: Macros
- Main Book: “Programming Rust” Ch. 20
What you will build: A derive macro that generates a reflection table for trait implementors (method names, types) for diagnostics.
Why it teaches macros: Requires parsing token streams, handling hygiene, and emitting friendly errors.
Core challenges:
- Token parsing -> Macros
- Span-aware diagnostics -> Developer UX
- Generated code ergonomics -> Type System
Real World Outcome
#[derive(Reflect)] adds a fn describe() returning metadata. Example binary prints reflection table.
The Core Question You’re Answering
“How do I turn Rust syntax into data at compile time?”
Concepts You Must Understand First
- TokenStream & syn – crate docs
- quote! macro – crate docs
- Span hygiene – Rust reference
Questions to Guide Your Design
- How will you handle generics in reflected types?
- How will you surface compile errors clearly?
Thinking Exercise
Draw pipeline: AST -> syn -> transform -> quote -> generated impl.
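A skeleton of that pipeline, assuming the usual syn/quote setup (a crate with proc-macro = true and dependencies on syn and quote); it only handles non-generic structs with named fields, and the reflection output is deliberately tiny.
use proc_macro::TokenStream;
use quote::quote;
use syn::{parse_macro_input, Data, DeriveInput};
#[proc_macro_derive(Reflect)]
pub fn derive_reflect(input: TokenStream) -> TokenStream {
    // Parse the item the derive is attached to.
    let input = parse_macro_input!(input as DeriveInput);
    let name = &input.ident;
    // Collect named field identifiers; enums/unions are left as an exercise.
    let fields: Vec<String> = match &input.data {
        Data::Struct(s) => s
            .fields
            .iter()
            .filter_map(|f| f.ident.as_ref().map(|id| id.to_string()))
            .collect(),
        _ => Vec::new(),
    };
    // Emit an inherent method returning (type name, field names).
    let expanded = quote! {
        impl #name {
            pub fn describe() -> (&'static str, &'static [&'static str]) {
                (stringify!(#name), &[#(#fields),*])
            }
        }
    };
    expanded.into()
}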
The Interview Questions They’ll Ask
- Difference between declarative and procedural macros?
- How to debug macro expansions?
- How to avoid name collisions?
Hints in Layers
- Hint 1: Start with structs only; add enums later.
- Hint 2: Use Span::call_site for generated identifiers.
- Hint 3: Provide --features debug-expand to log output.
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Macros | “Programming Rust” | Ch. 20 |
Common Pitfalls and Debugging
Problem: Infinite recursion in macro expansion.
- Fix: Use distinct helper names; guard re-entry.
- Quick test: cargo expand to inspect output.
Definition of Done
- Reflection table generated for structs/enums.
- Friendly compile-time errors with spans.
- Works with generics and lifetimes.
- Tests cover expansion snapshots.
Project 11: The no_std Game Boy Core (CPU Simulation)
- File: P11-nostd-gameboy-core.md
- Language: Rust (no_std)
- Coolness Level: Level 5
- Business Potential: Level 3
- Difficulty: Level 5
- Knowledge Area: Emulation, hardware
- Main Book: “Computer Organization and Design” Ch. 4
What you will build: A Game Boy CPU/PPU/APU core targeting no_std, with MMU and instruction decoding, plus debug CLI on host.
Why it teaches boundaries: Combines no_std, memory mapping, instruction decoding, and timing.
Core challenges:
- Instruction decode/execute loop -> Concurrency & no_std
- Memory bank controller -> Memory Control
- Timing/interrupts -> Async/Hardware
Real World Outcome
Emulator passes Blargg test ROMs; can be cross-compiled to ARM Cortex-M.
The Core Question You’re Answering
“How does a CPU actually process instructions?”
Concepts You Must Understand First
- Binary/hex math – TAOCP Vol. 1
- Instruction sets – CPU manuals
- Memory banking (MBC) – Game Boy docs
Questions to Guide Your Design
- How will you model registers (8-bit pairs, flags)?
- How will you schedule PPU timings vs CPU cycles?
Thinking Exercise
Trace fetch-decode-execute for one opcode; mark flag mutations.
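For orientation, a heavily simplified sketch of that loop with two illustrative opcodes (NOP and INC A); a real core tracks the full register file, all four flags, and per-instruction cycle counts.
struct Cpu {
    pc: u16,
    a: u8,
    f_zero: bool,
}
impl Cpu {
    /// Execute one instruction; returns the T-cycles consumed.
    fn step(&mut self, mem: &[u8; 0x10000]) -> u32 {
        // Fetch
        let opcode = mem[self.pc as usize];
        self.pc = self.pc.wrapping_add(1);
        // Decode + execute
        match opcode {
            0x00 => 4, // NOP
            0x3C => {  // INC A (half-carry flag omitted in this sketch)
                self.a = self.a.wrapping_add(1);
                self.f_zero = self.a == 0;
                4
            }
            _ => unimplemented!("opcode {:#04x}", opcode),
        }
    }
}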
The Interview Questions They’ll Ask
- How to implement an emulator main loop?
- Why is no_std useful for cores?
- How to validate timing correctness?
Hints in Layers
- Hint 1: Use a jump-table-style match on the opcode.
- Hint 2: Keep the MMU abstracted for host vs embedded.
- Hint 3: Use cycle counters to gate PPU steps.
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| CPU Arch | “Computer Organization and Design” | Ch. 4 |
| Emulation | “Game Boy Coding Adventure” | Full |
Common Pitfalls and Debugging
Problem: Incorrect flag handling (DAA).
- Fix: Add exhaustive tests for flags; compare against known-good traces.
- Quick test: Run Blargg ROMs and assert pass.
Definition of Done
- Passes Blargg test ROMs.
- Cross-builds for ARM thumbv7em.
- Cycle timing within tolerance.
- Memory banking works for multiple cartridges.
Project 12: High-Performance KV Store (Custom Everything)
- File: P12-high-performance-kv-store.md
- Language: Rust
- Coolness Level: Level 5
- Business Potential: Level 4
- Difficulty: Level 5
- Knowledge Area: Databases / systems
- Main Book: “Designing Data-Intensive Applications” Ch. 3
What you will build: A log-structured KV store with custom arena index, zero-copy reads, atomics for concurrency, and WAL for durability.
Why it teaches mastery: Integrates allocators, atomics, async IO, and type-level safety into a system component.
Core challenges:
- Log-structured storage + compaction -> Memory Control
- Concurrency on index -> Concurrency & Atomics
- Crash safety -> Systems design
Real World Outcome
CLI benchmark shows millions of ops/sec on single core; durability validated via crash-recovery test.
$ ./kv --bench --ops 1000000
Throughput: 3.0M ops/s
P99 latency: 0.8 ms
Crash-recovery: PASS
The Core Question You’re Answering
“How do I build a production-grade system component?”
Concepts You Must Understand First
- LSM vs B-Tree – “DDIA” Ch. 3
- mmap & fsync – “The Linux Programming Interface” Ch. 49
- Atomic indexing – “Rust Atomics and Locks” Ch. 9
Questions to Guide Your Design
- How will you bound tail latency under compaction?
- How will you ensure WAL durability without stalling writers?
Thinking Exercise
Design compaction trigger thresholds; map write amplification vs space.
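To ground the write path, a sketch of one possible append-only record layout (length-prefixed key and value, little-endian); real stores add checksums, sequence numbers, and tombstones. Names are illustrative.
/// Append one record to the in-memory log buffer and return its offset,
/// which the index can later map the key to.
/// Layout: [key_len u32][val_len u32][key bytes][value bytes]
fn encode_record(key: &[u8], value: &[u8], log: &mut Vec<u8>) -> u64 {
    let offset = log.len() as u64;
    log.extend_from_slice(&(key.len() as u32).to_le_bytes());
    log.extend_from_slice(&(value.len() as u32).to_le_bytes());
    log.extend_from_slice(key);
    log.extend_from_slice(value);
    offset
}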
The Interview Questions They’ll Ask
- Difference between LSM and B+Tree?
- Why use mmap for reads?
- How to recover after crash?
Hints in Layers
- Hint 1: Append-only log for writes; index in arena.
- Hint 2: Readers use mmap + zero-copy parsing.
- Hint 3: Background compaction with epoch GC.
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Storage | “Designing Data-Intensive Applications” | Ch. 3 |
| Systems | “The Linux Programming Interface” | Ch. 49 |
| Atomics | “Rust Atomics and Locks” | Ch. 9 |
Common Pitfalls and Debugging
Problem: Write amplification explodes.
- Fix: Tune compaction thresholds; use tiered levels.
- Quick test: Measure WAL size vs data size after 100k updates.
Definition of Done
- Benchmarks meet target throughput/latency.
- Crash-recovery test passes.
- Compaction keeps disk usage bounded.
- Documentation explains allocator + index design.
Project Comparison Table
| Project | Difficulty | Time | Depth of Understanding | Fun Factor |
|---|---|---|---|---|
| 1. Manual Pin Projector | Advanced | 3-5 days | High | ★★★★☆ |
| 2. Box-less Async Trait | Expert | 1 week | High | ★★★★☆ |
| 3. Custom Arena Allocator | Advanced | 1 week | Medium | ★★★☆☆ |
| 4. no_std Kernel Core | Expert | 2 weeks | Very High | ★★★★★ |
| 5. Const Generic Matrix | Intermediate | 1-2 weeks | Medium | ★★★☆☆ |
| 6. Atomic Lock-Free Queue | Master | 2-3 weeks | Very High | ★★★★★ |
| 7. Zero-Copy Parser | Advanced | 1 week | High | ★★★★☆ |
| 8. Custom Future Runtime | Master | 2-3 weeks | Very High | ★★★★★ |
| 9. Physical Units Lib | Advanced | 1 week | Medium | ★★★☆☆ |
| 10. Reflect Derive Macro | Expert | 1 week | High | ★★★★☆ |
| 11. no_std Game Boy Core | Master | 1 month+ | Very High | ★★★★★ |
| 12. High-Performance KV Store | Master | 1-2 months | Very High | ★★★★★ |
Recommendation
- If you are new to advanced Rust: Start with Projects 1, 3, 5.
- If you are targeting runtimes: Do Projects 2, 6, 8, then 12.
- If you are embedded-focused: Do Projects 3, 4, 11, then Final Overall Project.
Final Overall Project: A Self-Hosting, no_std, Async OS Core
The Goal: Combine allocators, async runtime, and device drivers into a bootable micro-kernel that runs its own shell.
- Bootloader + panic handler + allocator.
- Async executor for interrupts and timers.
- Syscall-like interface using macros for user tasks.
Success Criteria: Boots in QEMU/real hardware, runs shell, handles timer + serial IO, and can reload tasks.
From Learning to Production: What Is Next
| Your Project | Production Equivalent | Gap to Fill |
|---|---|---|
| Custom Runtime | Tokio/Glommio | Cross-platform IO drivers, observability |
| Lock-Free Queue | Crossbeam/SegQueue | Production benchmarking, GC tuning |
| KV Store | RocksDB/Redis modules | Replication, clustering, durability SLAs |
| no_std Core | RTOS firmware | Certification, power management, drivers |
Summary
This sprint covers Advanced Rust through 12 hands-on projects that span pinning, async, allocators, type-level programming, atomics, macros, and no_std.
| # | Project Name | Main Language | Difficulty | Time Estimate |
|---|---|---|---|---|
| 1 | Manual Pin Projector | Rust | Advanced | 3-5 days |
| 2 | Box-less Async Trait | Rust | Expert | 1 week |
| 3 | Custom Arena Allocator | Rust | Advanced | 1 week |
| 4 | no_std Kernel Core | Rust | Expert | 2 weeks |
| 5 | Const Generic Matrix | Rust | Intermediate | 1-2 weeks |
| 6 | Atomic Lock-Free Queue | Rust | Master | 2-3 weeks |
| 7 | Zero-Copy Parser | Rust | Advanced | 1 week |
| 8 | Custom Future Runtime | Rust | Master | 2-3 weeks |
| 9 | Physical Units Lib | Rust | Advanced | 1 week |
| 10 | Reflect Derive Macro | Rust | Expert | 1 week |
| 11 | no_std Game Boy Core | Rust | Master | 1 month+ |
| 12 | High-Performance KV Store | Rust | Master | 1-2 months |
Expected Outcomes
- Understand pinning invariants and async state machines.
- Swap and tune allocators for specific workloads.
- Build lock-free structures with correct ordering.
- Ship no_std binaries and embedded cores.
- Integrate macros and type-level guarantees into real systems.
Additional Resources and References
- Rust RFC 2349 (Pin) and Rustonomicon (unsafe guidelines).
- Tokio runtime internals docs; async-std design notes.
- Crossbeam + Loom for concurrency testing.
- cortex-m-rt, probe-rs for embedded targets.
- “Designing Data-Intensive Applications” for storage architectures.