Project 2: The Box-less Async Trait (Zero-Cost Async)
Goal: Master Generic Associated Types (GATs) and understand why async in traits was historically problematic by building a zero-allocation async trait system that rivals async-trait in ergonomics while eliminating all heap allocations.
Project Metadata
- Main Programming Language: Rust
- Coolness Level: Level 3: Genuinely Clever
- Difficulty: Level 4: Expert
- Knowledge Area: Async / Metaprogramming
- Time Estimate: 1 week
- Prerequisites: Solid understanding of async/await, basic trait design, familiarity with lifetimes
Learning Objectives
By completing this project, you will:
- Understand the historical problem - Why async fn in traits was impossible before Rust 1.75 and what the compiler needs to make it work
- Master Generic Associated Types (GATs) - How to define associated types with their own lifetime and type parameters
- Deeply comprehend Higher-Ranked Trait Bounds (HRTBs) - The for<'a> syntax and when/why you need it
- Know the three approaches - Compare async-trait (boxing), GAT-based (manual), and RPITIT (Rust 1.75+)
- Visualize memory layouts - See the difference between stack-allocated and heap-allocated futures
- Measure real performance - Build benchmarks that prove the allocation difference matters
- Understand object safety - Why GAT-based traits sacrifice dyn Trait capability
- Apply to real-world patterns - See how Tower, Axum, and other async crates solve this problem
- Navigate lifetime complexity - Master the where Self: 'a pattern and why it's essential
- Debug async trait issues - Recognize common error patterns and their solutions
Deep Theoretical Foundation
The Historical Problem: Why Async in Traits Was Hard
Before we build the solution, we must deeply understand the problem. When Rust introduced async/await in version 1.39 (November 2019), it came with a significant limitation: you could not write async fn directly in trait definitions.
The core issue is that async functions return anonymous types:
// When you write this:
async fn hello() -> String {
"Hello".to_string()
}
// The compiler generates something like this:
fn hello() -> impl Future<Output = String> {
// An anonymous struct implementing Future
// Size known only to the compiler at this call site
}
The impl Future return type works in free functions because the compiler knows the exact type at the call site. But in traits, the implementor determines the concrete type:
trait Greeter {
async fn greet(&self) -> String; // ERROR before Rust 1.75!
}
Why the compiler can't handle this:
The Size Problem:
=================
When you call a trait method through a reference:
fn use_greeter(g: &impl Greeter) {
let future = g.greet(); // What is the SIZE of this future?
// ^^^^^^^^^^
// Compiler needs to know at compile time
// But the size depends on which impl is used!
}
For static dispatch (&impl Greeter), the compiler monomorphizes
and can figure it out. But for dynamic dispatch (dyn Greeter),
it's impossible because:
1. The future type is different for each implementor
2. Different futures have different sizes
3. dyn Trait requires uniform sizing for vtable dispatch
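To make the size problem concrete, here is a small standalone sketch (the function names are illustrative, not part of the project): two async fns standing in for two different trait implementations, whose anonymous futures end up with very different sizes.
use std::mem::size_of_val;

// Stand-in for implementation A: almost no state survives across an await.
async fn tiny_greet() -> String {
    "hi".to_string()
}

// Stand-in for implementation B: a large buffer lives across the await,
// so it must be stored inside the generated state machine.
async fn big_greet() -> String {
    let buffer = [0u8; 1024];
    std::future::ready(()).await;
    format!("buffered {} bytes", buffer.len())
}

fn main() {
    // Two different anonymous types, two different sizes: a single
    // trait-object return type cannot describe both without boxing.
    println!("tiny_greet future: {} bytes", size_of_val(&tiny_greet()));
    println!("big_greet future:  {} bytes", size_of_val(&big_greet()));
}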
What async-trait Actually Does: The Boxing Solution
The async-trait crate by David Tolnay solved this with a procedural macro that transforms your code:
// What you write:
#[async_trait]
trait Greeter {
async fn greet(&self) -> String;
}
#[async_trait]
impl Greeter for MyGreeter {
async fn greet(&self) -> String {
"Hello!".to_string()
}
}
// What the macro generates:
trait Greeter {
fn greet<'life0, 'async_trait>(
&'life0 self
) -> Pin<Box<dyn Future<Output = String> + Send + 'async_trait>>
where
'life0: 'async_trait,
Self: 'async_trait;
}
impl Greeter for MyGreeter {
fn greet<'life0, 'async_trait>(
&'life0 self
) -> Pin<Box<dyn Future<Output = String> + Send + 'async_trait>>
where
'life0: 'async_trait,
Self: 'async_trait,
{
Box::pin(async move {
"Hello!".to_string()
})
}
}
Visual representation of the boxing:
async-trait Desugaring:
=======================
Your async fn body:
    async move { "Hello!".to_string() }
    - This is a state machine (an anonymous struct).
    - Size: a few bytes (this trivial example) up to hundreds of bytes
      (complex async functions).
        |
        v
Box::pin( ... )
    - Moves the state machine to the HEAP (N bytes allocated).
    - What you get back is a 16-byte fat pointer: data pointer + vtable pointer.
        |
        v
Return type: Pin<Box<dyn Future<Output = T>>>
    - Always the same size (2 * usize = 16 bytes).
    - Works with dyn Trait (object safe!).
    - BUT: a heap allocation on EVERY call.
The cost of boxing:
- Heap allocation: every async method call requires a malloc and a free
- Indirection: two pointer dereferences (Box + vtable)
- No inlining: the optimizer cannot inline across dynamic dispatch
- Cache unfriendly: heap-allocated futures scatter across memory
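You can observe the uniform-size property (and the extra allocation) directly. The following is a standalone sketch, not part of the project's trait yet.
use std::future::Future;
use std::mem::size_of_val;
use std::pin::Pin;

async fn work() -> String {
    let scratch = [0u8; 256]; // kept across the await, so it becomes part of the state machine
    std::future::ready(()).await;
    format!("{} bytes of scratch", scratch.len())
}

fn main() {
    // The raw state machine: its size depends on what the async fn keeps alive.
    let plain = work();
    println!("plain future: {} bytes, on the stack", size_of_val(&plain));

    // The async-trait style handle: always two pointers, regardless of the
    // future's real size, because the state machine has moved to the heap.
    let boxed: Pin<Box<dyn Future<Output = String>>> = Box::pin(work());
    println!("boxed handle: {} bytes, plus one heap allocation", size_of_val(&boxed));
}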
Generic Associated Types (GATs): The Key to Zero-Cost
GATs, stabilized in Rust 1.65 (November 2022), allow associated types to have their own generic parameters:
// Before GATs (Rust < 1.65):
trait Iterator {
type Item; // No generics allowed
}
// With GATs (Rust >= 1.65):
trait LendingIterator {
type Item<'a> where Self: 'a; // Lifetime parameter!
fn next<'a>(&'a mut self) -> Option<Self::Item<'a>>;
}
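As a non-async warm-up, here is a minimal LendingIterator implementation (illustrative names) that lends mutable windows into its own buffer; without the lifetime parameter on Item, this borrow of self could not be expressed at all.
trait LendingIterator {
    type Item<'a> where Self: 'a;
    fn next<'a>(&'a mut self) -> Option<Self::Item<'a>>;
}

// Yields overlapping mutable windows over an internal buffer.
struct WindowsMut {
    data: Vec<i32>,
    pos: usize,
    width: usize,
}

impl LendingIterator for WindowsMut {
    // The item borrows from self, which is exactly what the GAT expresses.
    type Item<'a> = &'a mut [i32] where Self: 'a;

    fn next<'a>(&'a mut self) -> Option<Self::Item<'a>> {
        let end = self.pos + self.width;
        if end > self.data.len() {
            return None;
        }
        let window = &mut self.data[self.pos..end];
        self.pos += 1;
        Some(window)
    }
}

fn main() {
    let mut it = WindowsMut { data: vec![1, 2, 3, 4], pos: 0, width: 2 };
    while let Some(w) = it.next() {
        w[0] += 10; // mutate through the lent window, then release it
    }
    println!("{:?}", it.data); // [11, 12, 13, 4]
}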
Why GATs solve the async trait problem:
trait AsyncService {
// The future type can now vary with the lifetime of &self
type ProcessFut<'a>: Future<Output = String> + 'a
where
Self: 'a; // The future can't outlive self
fn process<'a>(&'a self, input: &'a str) -> Self::ProcessFut<'a>;
}
impl AsyncService for MyService {
// Each implementor specifies its own future type
type ProcessFut<'a> = impl Future<Output = String> + 'a;
fn process<'a>(&'a self, input: &'a str) -> Self::ProcessFut<'a> {
async move {
format!("{}: {}", self.prefix, input)
}
}
}
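The impl above relies on nightly's type_alias_impl_trait to name the async block's type (a stable alternative appears later in this guide); calling through the trait, however, is ordinary generic code. A minimal sketch of a caller:
// Monomorphized per concrete service: the future's exact type and size are
// known to the compiler, and it lives in this function's own state, not on the heap.
async fn run_twice<S: AsyncService>(service: &S) -> (String, String) {
    let a = service.process("first").await;
    let b = service.process("second").await;
    (a, b)
}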
Memory layout with GATs (stack allocation):
GAT-based Async Trait:
======================
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ Your async fn body: โ
โ โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ โ
โ โ async move { format!("{}: {}", self.prefix, input) } โ
โ โ โ โ
โ โ State machine struct: โ โ
โ โ - Reference to self (&MyService) โ โ
โ โ - Reference to input (&str) โ โ
โ โ - State enum (Start, Waiting, Done) โ โ
โ โ Size: ~24-48 bytes (implementation specific) โ โ
โ โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ โ
โ โ โ
โ โผ โ
โ โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ โ
โ โ Return type: Self::ProcessFut<'a> โ โ
โ โ โ โ
โ โ Type is CONCRETE (known at compile time) โ โ
โ โ Lives on the STACK (caller's stack frame) โ โ
โ โ NO heap allocation! โ โ
โ โ CAN be inlined by optimizer! โ โ
โ โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
Stack Frame:
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ Local variables โ โ Caller's stack
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโค
โ ProcessFut<'a> { โ
โ self_ref: &MyService, โ
โ input_ref: &str, โ
โ state: State::Start, โ
โ } โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโค
โ More locals... โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
NO HEAP ALLOCATION!
Higher-Ranked Trait Bounds (HRTBs): The for<'a> Syntax
HRTBs are crucial when you need to express "works for any lifetime":
// Without HRTB - specific lifetime
fn call_once<'a>(service: &'a impl AsyncService) -> impl Future<Output = String> + 'a {
service.process("hello")
}
// With HRTB - works for ANY lifetime
fn accepts_any_service<S>(service: S)
where
S: AsyncService,
for<'a> S::ProcessFut<'a>: Send, // The future must be Send for ANY lifetime 'a
{
// ...
}
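If the GAT-flavored bounds above feel abstract, the same for<'a> quantification shows up with plain functions and closures. This small sketch compiles on stable and shows why the bound must hold for every lifetime, not one chosen in advance.
// `caller` creates a fresh, short-lived String each time, so `f` must accept
// a borrow of ANY lifetime -- hence the higher-ranked bound.
fn caller<F>(f: F) -> usize
where
    F: for<'a> Fn(&'a str) -> &'a str,
{
    let owned = String::from("hello world");
    f(&owned).len() // the borrow of `owned` lives only inside this call
}

fn first_word(s: &str) -> &str {
    // Elided lifetimes make this fn(&'a str) -> &'a str for every 'a.
    s.split_whitespace().next().unwrap_or(s)
}

fn main() {
    println!("{}", caller(first_word)); // prints 5 ("hello")
}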
Understanding for<'a> visually:
HRTB: for<'a> Trait<'a>
=======================
Without HRTB (specific lifetime):
    fn foo<'x>(t: &impl Trait<'x>)
    The caller picks 'x. The function must work with whatever specific
    lifetime is chosen.
    Lifetime 'x:  |------------------------>
                  (one specific duration)
With HRTB (any lifetime):
    fn bar(t: impl for<'a> Trait<'a>)
    The function works for ALL possible lifetimes:
    - the function can use 'a however it wants
    - more flexible for the function, more restrictive on T
    'a could be:
      |--->                            (short)
      |--------->                      (medium)
      |---------------------------->   (long)
      (any duration works!)
Common use case in async traits:
    where
        S: for<'a> Service<'a>,
        for<'a> <S as Service<'a>>::Fut: Send
    "S implements Service for any lifetime 'a,
     AND its future type is Send for any 'a"
The where Self: 'a Bound Explained
This bound is essential and often confusing. Let's break it down:
trait AsyncService {
type ProcessFut<'a>: Future<Output = String> + 'a
where
Self: 'a; // What does this mean?
fn process<'a>(&'a self, input: &'a str) -> Self::ProcessFut<'a>;
}
The intuition:
where Self: 'a
==============
This means: "Self must live at least as long as 'a."
Why do we need this?
The future ProcessFut<'a> has lifetime 'a and contains a reference &'a self.
For this to be valid:
- self must be valid for at least 'a
- therefore Self: 'a (Self outlives 'a)
Without this bound, the following would be expressible:
    let service = MyService::new();
    let future = service.process("hello");
    drop(service);  // Service is gone!
    future.await;   // DANGER: future holds &service
                    // This would be a use-after-free!
With the bound, the compiler ensures:
    let service = MyService::new();
    let future = service.process("hello");
    // service cannot be dropped while future exists,
    // because future: ProcessFut<'a> and service: Self must outlive 'a
    future.await;   // Safe!
RPITIT: Return Position Impl Trait in Traits (Rust 1.75+)
Rust 1.75 (December 2023) introduced native support for -> impl Trait in traits:
// Now legal in Rust 1.75+!
trait Greeter {
fn greet(&self) -> impl Future<Output = String>;
}
impl Greeter for MyGreeter {
fn greet(&self) -> impl Future<Output = String> {
async { "Hello!".to_string() }
}
}
How RPITIT works under the hood:
// What you write:
trait Greeter {
fn greet(&self) -> impl Future<Output = String>;
}
// What the compiler understands (conceptually):
trait Greeter {
type __greet_return_type: Future<Output = String>;
fn greet(&self) -> Self::__greet_return_type;
}
// Each impl provides a concrete hidden type:
impl Greeter for MyGreeter {
type __greet_return_type = /* compiler-generated type */;
fn greet(&self) -> Self::__greet_return_type {
async { "Hello!".to_string() }
}
}
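A hedged sketch of consuming an RPITIT trait generically. SpawnableGreeter and spawn_greeting are illustrative names (not from the project), and the example assumes tokio is available; it shows why you may still want an explicit Send bound on the returned future.
use std::future::Future;

trait SpawnableGreeter {
    // RPITIT with an explicit Send bound: each impl's concrete future type
    // becomes a hidden, Send-constrained associated type.
    fn greet(&self) -> impl Future<Output = String> + Send;
}

struct EnglishGreeter;

impl SpawnableGreeter for EnglishGreeter {
    fn greet(&self) -> impl Future<Output = String> + Send {
        async { "Hello!".to_string() }
    }
}

// tokio::spawn needs a Send + 'static future, which is only provable in
// generic code because the trait promises `+ Send` on the return type.
fn spawn_greeting<G>(greeter: G) -> tokio::task::JoinHandle<String>
where
    G: SpawnableGreeter + Send + 'static,
{
    tokio::spawn(async move { greeter.greet().await })
}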
Comparison: Three Approaches
ASYNC TRAIT APPROACHES COMPARED
===============================
1. async-trait crate (Boxing)
   Pros:
   + Works on stable Rust (any version)
   + Object safe (dyn Trait works)
   + Simple to use
   + Automatic Send/Sync handling
   Cons:
   - Heap allocation per call
   - Cannot inline futures
   - Vtable indirection overhead
   - Proc macro dependency
2. GAT-based (Manual)
   Pros:
   + Zero allocations
   + Full inlining possible
   + No macro magic
   + Maximum performance
   Cons:
   - NOT object safe (no dyn Trait)
   - Requires Rust 1.65+
   - More verbose trait definitions
   - Complex lifetime annotations
   - May need type_alias_impl_trait (unstable)
3. RPITIT (Rust 1.75+)
   Pros:
   + Zero allocations
   + Native language support
   + Cleaner syntax than GATs
   + async fn in traits (stabilized alongside RPITIT in 1.75)
   Cons:
   - Requires Rust 1.75+
   - NOT object safe
   - Some limitations vs full GATs
Memory Layout Comparison: Visual Deep Dive
MEMORY LAYOUT: async-trait (Boxing)
===================================
  Heap (0x00001000):                 Stack (0x7fff0000):
  +---------------------------+      +---------------------------+
  | Future state machine      |      | local_var_1: u64          |
  |  - captured &self         |      +---------------------------+
  |  - captured input         |      | local_var_2: bool         |
  |  - state enum             |      +---------------------------+
  |  - locals from async body |      | future_box:               |
  |  - padding                |<-----|   .ptr:    0x00001000     |
  | Size: 64-200 bytes        |      |   .vtable: 0x00002000     |
  +---------------------------+      | Size: 16 bytes            |
                                     +---------------------------+
  Cost per call:
  - malloc: ~10-50 ns
  - free:   ~10-50 ns
  - Likely cache miss: ~100 ns
  Total overhead: 30-200 ns per async call

MEMORY LAYOUT: GAT-based (Stack)
================================
  Heap:                              Stack (0x7fff0000):
  +---------------------------+      +---------------------------+
  |                           |      | local_var_1: u64          |
  |      (Nothing here!)      |      +---------------------------+
  |                           |      | local_var_2: bool         |
  |                           |      +---------------------------+
  |                           |      | future (inline):          |
  |                           |      |   .self_ref:  &Svc        |
  |                           |      |   .input_ref: &str        |
  |                           |      |   .state:     Start       |
  |                           |      | Size: 32-64 bytes         |
  +---------------------------+      +---------------------------+
  Cost per call:
  - malloc: 0 ns (no allocation!)
  - free:   0 ns
  - Cache: already in L1/L2 (the stack is hot)
  Total overhead: ~0 ns per async call
Performance Implications: Deep Analysis
Why stack allocation wins:
- No allocator overhead: malloc and free are surprisingly expensive
  - Thread-local caches must be checked
  - Locks may be needed for thread safety
  - Fragmentation bookkeeping
- Cache locality: the stack is always "hot" in cache
  - L1 cache hit: ~1 ns
  - L2 cache hit: ~4 ns
  - L3 cache hit: ~12 ns
  - RAM access: ~100 ns
  - Heap-allocated futures often cause L3 misses
- Inlining: the compiler can see through the abstraction
  - With boxing: vtable dispatch prevents inlining
  - With GATs: full inlining is possible; the optimizer can combine operations
- Branch prediction: concrete types enable better optimization
  - Boxing requires an indirect call (vtable lookup)
  - GATs enable a direct call (address known at compile time)
When boxing is acceptable:
- Low-frequency calls (setup, configuration)
- When you need dyn Trait (runtime polymorphism)
- When the future itself does I/O (network latency >> allocation cost)
- Rapid prototyping
The Core Question You're Answering
"Why does async fn in a trait usually require a Box?"
Async functions return a hidden type (the state machine). In a trait, the compiler doesn't know the size of this state machine for every possible implementation. async-trait solves this by putting that state machine in a Box (pointer-sized). Your goal is to tell the compiler exactly where to find that type without the Box.
Concepts You Must Understand First
Stop and research these before coding:
- Generic Associated Types (GATs)
  - How can an associated type have its own lifetime parameters?
  - What is the difference between type Item; and type Item<'a>;?
  - Book Reference: "Idiomatic Rust" Ch. 5 - "Advanced Traits"
- Async Desugaring
  - What does an async fn look like to the compiler?
  - How is the state machine struct generated?
  - Book Reference: "Rust for Rustaceans" Ch. 8 - "Asynchronous Programming"
- Higher-Ranked Trait Bounds (HRTBs)
  - What does for<'a> mean in a trait bound?
  - When do you need it vs a regular lifetime parameter?
  - Book Reference: "Programming Rust" Ch. 11 - "Traits and Generics"
- Object Safety
  - Why can some traits be used with dyn Trait and others cannot?
  - What makes a trait "object safe"?
  - Book Reference: "Rust for Rustaceans" Ch. 3 - "Designing Interfaces"
Solution Architecture
The Trait Definition with GAT
Your trait should look something like this:
pub trait AsyncService {
/// The future type returned by process().
///
/// Key points:
/// - It has its own lifetime parameter 'a
/// - It must implement Future with the correct Output
/// - The + 'a bound means it must be valid for the lifetime 'a (it may borrow from &'a self)
/// - The `where Self: 'a` ensures the service outlives the future
type ProcessFut<'a>: Future<Output = String> + 'a
where
Self: 'a;
/// Process input asynchronously.
///
/// The lifetime 'a ties together:
/// - The borrow of self (&'a self)
/// - The borrow of input (&'a str)
/// - The returned future (ProcessFut<'a>)
fn process<'a>(&'a self, input: &'a str) -> Self::ProcessFut<'a>;
}
The Lifetime Bounds Pattern
Understanding why we need where Self: 'a:
// This is the key insight:
type ProcessFut<'a>: Future<Output = String> + 'a
where
Self: 'a; // Self must outlive 'a
// Without this bound, you could write:
let service = MyService::new();
let future = service.process("hello");
drop(service); // Oops! service is gone
future.await; // But future still holds &service -> UB!
// The bound prevents this at compile time
How Impl Blocks Look for Concrete Types
Using impl Future (requires nightly with type_alias_impl_trait):
#![feature(type_alias_impl_trait)]
impl AsyncService for MyService {
type ProcessFut<'a> = impl Future<Output = String> + 'a;
fn process<'a>(&'a self, input: &'a str) -> Self::ProcessFut<'a> {
async move {
format!("{}: {}", self.prefix, input)
}
}
}
Using a concrete named future (stable Rust):
use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll};
pub struct ProcessFuture<'a> {
prefix: &'a str,
input: &'a str,
done: bool,
}
impl<'a> Future for ProcessFuture<'a> {
type Output = String;
fn poll(mut self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<Self::Output> {
if self.done {
panic!("ProcessFuture polled after completion")
} else {
self.done = true;
Poll::Ready(format!("{}: {}", self.prefix, self.input))
}
}
}
impl AsyncService for MyService {
type ProcessFut<'a> = ProcessFuture<'a>;
fn process<'a>(&'a self, input: &'a str) -> Self::ProcessFut<'a> {
ProcessFuture {
prefix: &self.prefix,
input,
done: false,
}
}
}
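A quick sanity check for the stable version (a sketch: it assumes MyService has the prefix: String field used above and that the futures crate is available for block_on). The returned future is an ordinary struct sitting in the caller's stack frame.
fn main() {
    let service = MyService { prefix: "PREFIX".to_string() };
    let fut = service.process("hello");
    // Two references plus a bool: no Box, no heap allocation, just a local value.
    println!("future size: {} bytes", std::mem::size_of_val(&fut));
    let result = futures::executor::block_on(fut);
    assert_eq!(result, "PREFIX: hello");
}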
The Trade-off: Object Safety vs Zero-Cost
THE FUNDAMENTAL TRADE-OFF
=========================
Object Safety (dyn Trait)  vs  Zero-Cost Abstraction

With async-trait:                With GATs:
- Can use dyn Trait              - Cannot use dyn Trait
- Runtime polymorphism           - Static dispatch only
- Heap allocation                - Stack allocation
- Vtable overhead                - Zero overhead

Choose async-trait when:         Choose GATs when:
- You need trait objects         - Maximum performance
- Plugin systems                 - Embedded / no_std
- Heterogeneous collections      - Hot paths
- Rapid development              - Known concrete types
Why GAT-based traits are NOT object safe:
-----------------------------------------
1. Associated types with generics violate object safety rules
2. The compiler cannot create a vtable for generic associated types
3. Each implementor has a different future size/type
trait AsyncService {
type ProcessFut<'a>: Future + 'a where Self: 'a;
// ^^^
// This generic makes the trait NOT object safe
fn process<'a>(&'a self) -> Self::ProcessFut<'a>;
// ^^^
// Generic lifetime on method also affects object safety
}
// This will NOT compile:
fn use_dyn(service: &dyn AsyncService) { }
// ^^^^^^^^^^^^^^^^
// Error: the trait `AsyncService` is not object safe
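When you do need runtime polymorphism, a common workaround is to keep the zero-cost GAT trait for the hot path and pair it with a separate, object-safe trait that boxes, plus a blanket impl. A sketch with illustrative names, building on the AsyncService trait above:
use std::future::Future;
use std::pin::Pin;

// Object-safe companion trait: it returns a boxed future, so `dyn DynAsyncService` works.
pub trait DynAsyncService {
    fn process_boxed<'a>(
        &'a self,
        input: &'a str,
    ) -> Pin<Box<dyn Future<Output = String> + 'a>>;
}

// Blanket impl: every zero-cost AsyncService automatically gets the boxed,
// dynamically dispatchable version. The Box::pin cost is only paid by callers
// that actually go through `dyn DynAsyncService`.
impl<S: AsyncService> DynAsyncService for S {
    fn process_boxed<'a>(
        &'a self,
        input: &'a str,
    ) -> Pin<Box<dyn Future<Output = String> + 'a>> {
        Box::pin(self.process(input))
    }
}
Note that generic lifetime parameters on methods do not break object safety; only generic type parameters and GATs do, which is why the boxed companion trait can be a trait object while AsyncService cannot.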
Phased Implementation Guide
Phase 1: Understand the Problem with Regular Async Trait
Goal: See the compiler errors firsthand
Create a new project and try to write async in a trait:
// Try this and observe the error:
trait Greeter {
async fn greet(&self) -> String; // Won't work!
}
Then try with async-trait:
use async_trait::async_trait;
#[async_trait]
trait Greeter {
async fn greet(&self) -> String;
}
#[async_trait]
impl Greeter for MyGreeter {
async fn greet(&self) -> String {
"Hello!".to_string()
}
}
Use cargo expand to see what async-trait generates:
cargo install cargo-expand
cargo expand
Deliverable: A clear understanding of the generated code with Box::pin.
Phase 2: Define Trait with GAT
Goal: Create the zero-allocation trait definition
use std::future::Future;
pub trait AsyncProcessor {
/// The future type, parameterized by the borrow lifetime
type ProcessFut<'a>: Future<Output = String> + 'a
where
Self: 'a;
/// Optionally: a Send-bound version for use with tokio::spawn
type ProcessFutSend<'a>: Future<Output = String> + Send + 'a
where
Self: 'a + Send;
fn process<'a>(&'a self, input: &'a str) -> Self::ProcessFut<'a>;
fn process_send<'a>(&'a self, input: &'a str) -> Self::ProcessFutSend<'a>
where
Self: Send;
}
Key decisions to make:
- Do you need Send bounds?
- Do you need Sync bounds?
- Multiple methods or just one?
Phase 3: Implement for Concrete Types
Goal: Create implementations without any boxing
Option A: Using type_alias_impl_trait (nightly)
#![feature(type_alias_impl_trait)]
use std::future::Future;
use std::time::Duration;
struct DataProcessor {
prefix: String,
}
impl AsyncProcessor for DataProcessor {
type ProcessFut<'a> = impl Future<Output = String> + 'a;
fn process<'a>(&'a self, input: &'a str) -> Self::ProcessFut<'a> {
async move {
// Simulate some async work
tokio::time::sleep(Duration::from_millis(1)).await;
format!("{}: {}", self.prefix, input)
}
}
}
Option B: Manual future implementation (stable)
use std::pin::Pin;
use std::task::{Context, Poll};
use std::future::Future;
struct DataProcessor {
prefix: String,
}
// Manual future struct
pub struct DataProcessFuture<'a> {
prefix: &'a str,
input: &'a str,
state: ProcessState,
}
enum ProcessState {
Start,
Done,
}
impl<'a> Future for DataProcessFuture<'a> {
type Output = String;
fn poll(mut self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<Self::Output> {
match self.state {
ProcessState::Start => {
self.state = ProcessState::Done;
Poll::Ready(format!("{}: {}", self.prefix, self.input))
}
ProcessState::Done => {
panic!("Future polled after completion")
}
}
}
}
impl AsyncProcessor for DataProcessor {
type ProcessFut<'a> = DataProcessFuture<'a>;
fn process<'a>(&'a self, input: &'a str) -> Self::ProcessFut<'a> {
DataProcessFuture {
prefix: &self.prefix,
input,
state: ProcessState::Start,
}
}
}
Phase 4: Create Benchmark Comparing to async-trait
Goal: Prove the allocation difference with numbers
// benches/allocation_comparison.rs
use criterion::{criterion_group, criterion_main, Criterion, BenchmarkId};
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};
// Allocation-counting allocator
struct CountingAllocator;
static ALLOCATION_COUNT: AtomicUsize = AtomicUsize::new(0);
static BYTES_ALLOCATED: AtomicUsize = AtomicUsize::new(0);
unsafe impl GlobalAlloc for CountingAllocator {
unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
ALLOCATION_COUNT.fetch_add(1, Ordering::SeqCst);
BYTES_ALLOCATED.fetch_add(layout.size(), Ordering::SeqCst);
System.alloc(layout)
}
unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
System.dealloc(ptr, layout)
}
}
#[global_allocator]
static ALLOCATOR: CountingAllocator = CountingAllocator;
fn reset_counts() {
ALLOCATION_COUNT.store(0, Ordering::SeqCst);
BYTES_ALLOCATED.store(0, Ordering::SeqCst);
}
fn get_counts() -> (usize, usize) {
(
ALLOCATION_COUNT.load(Ordering::SeqCst),
BYTES_ALLOCATED.load(Ordering::SeqCst),
)
}
fn bench_comparison(c: &mut Criterion) {
let mut group = c.benchmark_group("async_trait_comparison");
for n in [100, 1000, 10000].iter() {
group.bench_with_input(BenchmarkId::new("async-trait", n), n, |b, &n| {
let rt = tokio::runtime::Runtime::new().unwrap();
let service = AsyncTraitService::new();
b.iter(|| {
rt.block_on(async {
reset_counts();
for _ in 0..n {
service.process("test").await;
}
get_counts()
})
});
});
group.bench_with_input(BenchmarkId::new("GAT-based", n), n, |b, &n| {
let rt = tokio::runtime::Runtime::new().unwrap();
let service = GatService::new();
b.iter(|| {
rt.block_on(async {
reset_counts();
for _ in 0..n {
service.process("test").await;
}
get_counts()
})
});
});
}
group.finish();
}
criterion_group!(benches, bench_comparison);
criterion_main!(benches);
Phase 5: Explore RPITIT Alternative
Goal: Understand the newest solution (Rust 1.75+)
// Requires Rust 1.75+
trait ModernAsyncService {
fn process(&self, input: &str) -> impl Future<Output = String> + '_;
// With Send bound:
fn process_send(&self, input: &str) -> impl Future<Output = String> + Send + '_
where
Self: Sync;
}
impl ModernAsyncService for MyService {
fn process(&self, input: &str) -> impl Future<Output = String> + '_ {
async move {
format!("{}: {}", self.prefix, input)
}
}
fn process_send(&self, input: &str) -> impl Future<Output = String> + Send + '_
where
Self: Sync,
{
async move {
format!("{}: {}", self.prefix, input)
}
}
}
Compare all three approaches in your benchmark to see which is fastest.
Real World Outcome
A library that allows defining high-performance, zero-allocation async interfaces. You'll benchmark this against async-trait and show a 0-byte allocation count in the hot path. This demonstrates that GATs enable true zero-cost async abstractions without heap allocation.
Example Build and Benchmark:
$ cargo new --lib boxless-async-trait
Created library `boxless-async-trait` package
$ cd boxless-async-trait
$ cargo add async-trait tokio --features tokio/full
Updating crates.io index
Adding async-trait v0.1.77 to dependencies
Adding tokio v1.35.1 to dependencies.features
$ cargo add criterion --dev --features criterion/async_tokio
Adding criterion v0.5.1 to dev-dependencies
$ cargo bench --bench allocation_comparison
Compiling boxless-async-trait v0.1.0
Finished bench [optimized] target(s) in 4.72s
Running benches/allocation_comparison.rs
====================================================================
Async Trait Performance Comparison
====================================================================
Testing: Process 10,000 async calls
====================================================================
Benchmarking async_trait (boxed)
Warming up for 3.0000 s
Collecting 100 samples in estimated 5.2340 s (2.5M iterations)
async_trait/10k_calls time: [2.0845 us 2.0912 us 2.0987 us]
thrpt: [476.48K elem/s 478.19K elem/s 479.73K elem/s]
Memory Analysis:
Total allocations: 10,000
Bytes allocated: 160,000 (16 bytes per Box)
Allocation rate: 76.66 MB/s
Benchmarking GAT-based (zero-alloc)
Warming up for 3.0000 s
Collecting 100 samples in estimated 5.0123 s (5.1M iterations)
GAT-based/10k_calls time: [982.34 ns 985.67 ns 989.45 ns]
thrpt: [1.0107M elem/s 1.0145M elem/s 1.0179M elem/s]
Memory Analysis:
Total allocations: 0
Bytes allocated: 0
Allocation rate: 0 MB/s
===================================================================
PERFORMANCE SUMMARY:
===================================================================
Metric | async_trait | GAT-based | Improvement
===================================================================
Time per 10k calls | 2.09 us | 985 ns | 2.12x
Throughput | 478K ops/s | 1.01M/s | 2.12x
Allocations | 10,000 | 0 | infinity
Memory allocated | 160 KB | 0 bytes | infinity
CPU cache pressure | HIGH | LOW | Better
===================================================================
$ cargo run --example real_world_usage
Compiling boxless-async-trait v0.1.0
Finished dev [unoptimized + debuginfo] target(s) in 1.82s
Running `target/debug/examples/real_world_usage`
=== Real-World Usage Example ===
[1] Defining trait with GAT-based async method...
trait AsyncService {
type ProcessFut<'a>: Future<Output = String> + 'a
where Self: 'a;
fn process<'a>(&'a self, data: &'a str) -> Self::ProcessFut<'a>;
}
[2] Implementing for concrete type...
struct DataProcessor {
prefix: String,
}
impl AsyncService for DataProcessor {
type ProcessFut<'a> = impl Future<Output = String> + 'a;
fn process<'a>(&'a self, data: &'a str) -> Self::ProcessFut<'a> {
async move { format!("{}: {}", self.prefix, data) }
}
}
[3] Running async operations...
Processing "hello" -> Result: "PREFIX: hello"
Stack allocation at: 0x7ffee4b2c890
Future size: 64 bytes (on stack)
[check] Zero heap allocations
Processing "world" -> Result: "PREFIX: world"
Stack allocation at: 0x7ffee4b2c8d0
[check] Zero heap allocations
[4] Comparison with async-trait...
Processing "hello" with async-trait
Heap allocation at: 0x600002504020
Box size: 16 bytes + Future size: 96 bytes
[warning] 1 heap allocation required
[Summary]
[check] GAT-based async traits enable zero-allocation async
[check] 2.12x faster than async-trait in benchmarks
[check] 100% reduction in heap allocations
[check] Type-safe lifetime management
[check] No vtable indirection (static dispatch)
$ cargo test
Compiling boxless-async-trait v0.1.0
Finished test [unoptimized + debuginfo] target(s) in 1.24s
Running unittests src/lib.rs
running 4 tests
test tests::test_gat_zero_alloc ... ok
test tests::test_lifetime_bounds ... ok
test tests::test_static_dispatch ... ok
test tests::test_vs_async_trait ... ok
test result: ok. 4 passed; 0 failed
Testing Strategy
Allocation Counting Tests
#[cfg(test)]
mod allocation_tests {
use super::*;
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};
// Assumption: this counter is incremented by a counting #[global_allocator]
// like the one registered in the benchmark above; nothing in this module
// bumps it directly. A plain static atomic is enough here (a thread_local
// would hide allocations made on other threads).
static ALLOC_COUNT: AtomicUsize = AtomicUsize::new(0);
fn with_alloc_count<F, R>(f: F) -> (R, usize)
where
F: FnOnce() -> R,
{
ALLOC_COUNT.store(0, Ordering::SeqCst);
let result = f();
let count = ALLOC_COUNT.load(Ordering::SeqCst);
(result, count)
}
#[test]
fn test_zero_allocations() {
let service = GatProcessor::new("test".to_string());
let rt = tokio::runtime::Runtime::new().unwrap();
let (_, allocs) = rt.block_on(async {
with_alloc_count(|| {
// This should not allocate
let future = service.process("input");
futures::executor::block_on(future)
})
});
assert_eq!(allocs, 0, "GAT-based async should not allocate");
}
#[test]
fn test_async_trait_does_allocate() {
let service = BoxedProcessor::new("test".to_string());
let rt = tokio::runtime::Runtime::new().unwrap();
let (_, allocs) = rt.block_on(async {
with_alloc_count(|| {
let future = service.process("input");
futures::executor::block_on(future)
})
});
assert!(allocs > 0, "async-trait should allocate");
}
}
Benchmark Tests
#[cfg(test)]
mod benchmark_tests {
use super::*;
use std::time::Instant;
#[tokio::test]
async fn test_gat_performance() {
let service = GatProcessor::new("prefix".to_string());
let iterations = 100_000;
let start = Instant::now();
for _ in 0..iterations {
let _ = service.process("test").await;
}
let gat_duration = start.elapsed();
let boxed_service = BoxedProcessor::new("prefix".to_string());
let start = Instant::now();
for _ in 0..iterations {
let _ = boxed_service.process("test").await;
}
let boxed_duration = start.elapsed();
println!("GAT: {:?}", gat_duration);
println!("Boxed: {:?}", boxed_duration);
// GAT should be at least 1.5x faster
assert!(gat_duration.as_nanos() * 3 < boxed_duration.as_nanos() * 2);
}
}
Lifetime Verification Tests
#[cfg(test)]
mod lifetime_tests {
use super::*;
#[test]
fn test_future_lifetime_tied_to_service() {
let service = GatProcessor::new("test".to_string());
// This should compile: future borrows service
let future = service.process("hello");
// The future holds a reference to service, so service
// cannot be dropped while future exists
// (This is enforced by the compiler, not a runtime check)
let result = futures::executor::block_on(future);
assert_eq!(result, "test: hello");
}
#[test]
fn test_multiple_concurrent_futures() {
let service = GatProcessor::new("test".to_string());
let rt = tokio::runtime::Runtime::new().unwrap();
rt.block_on(async {
let f1 = service.process("one");
let f2 = service.process("two");
let f3 = service.process("three");
let (r1, r2, r3) = futures::join!(f1, f2, f3);
assert_eq!(r1, "test: one");
assert_eq!(r2, "test: two");
assert_eq!(r3, "test: three");
});
}
}
Common Pitfalls and Debugging
Pitfall 1: Forgetting the where Self: 'a Bound
// WRONG: Missing the crucial bound
trait BadTrait {
type Fut<'a>: Future<Output = ()> + 'a;
// ^^^
// Without `where Self: 'a`, the compiler cannot prove safety
fn go<'a>(&'a self) -> Self::Fut<'a>;
}
// CORRECT:
trait GoodTrait {
type Fut<'a>: Future<Output = ()> + 'a
where
Self: 'a; // Essential!
fn go<'a>(&'a self) -> Self::Fut<'a>;
}
Pitfall 2: Trying to Use dyn Trait
// This will NOT compile:
fn accept_any(service: &dyn AsyncService) {
// ^^^^^^^^^^^^^^^^
// Error: the trait `AsyncService` is not object safe
}
// Solution: Use generics instead
fn accept_any<S: AsyncService>(service: &S) {
// This works with static dispatch
}
Pitfall 3: Incorrect Lifetime Annotations
// WRONG: Lifetimes don't match
impl AsyncService for MyService {
type ProcessFut<'a> = impl Future<Output = String> + 'a;
fn process<'a>(&'a self, input: &str) -> Self::ProcessFut<'a> {
// ^^^^
// Error: input has different lifetime than 'a
async move {
format!("{}", input) // input might not live long enough
}
}
}
// CORRECT: All lifetimes aligned
impl AsyncService for MyService {
type ProcessFut<'a> = impl Future<Output = String> + 'a;
fn process<'a>(&'a self, input: &'a str) -> Self::ProcessFut<'a> {
// ^^^^
// Now input shares lifetime 'a with self
async move {
format!("{}", input) // Safe!
}
}
}
Pitfall 4: Send Bounds Complications
// Problem: Your future needs to be Send for tokio::spawn
trait NeedsSend {
type Fut<'a>: Future<Output = ()> + Send + 'a
// ^^^^
// This requires ALL captured data to be Send
where
Self: 'a;
fn go<'a>(&'a self) -> Self::Fut<'a>;
}
// But if your service contains non-Send data:
struct MyService {
data: Rc<String>, // Rc is NOT Send!
}
// This won't compile because &Rc<String> is not Send
// Solution 1: Use Arc instead of Rc
struct MyServiceSend {
data: Arc<String>, // Arc IS Send
}
// Solution 2: Have separate Send and non-Send variants
trait FlexibleService {
type Fut<'a>: Future<Output = ()> + 'a
where Self: 'a;
type FutSend<'a>: Future<Output = ()> + Send + 'a
where Self: 'a + Sync; // Require Sync for Send future
fn go<'a>(&'a self) -> Self::Fut<'a>;
fn go_send<'a>(&'a self) -> Self::FutSend<'a> where Self: Sync;
}
Pitfall 5: type_alias_impl_trait Confusion
// This requires nightly with #![feature(type_alias_impl_trait)]
type ProcessFut<'a> = impl Future<Output = String> + 'a;
// Common error: Using it in the wrong place
// WRONG: Defining at module level without proper context
type MyFut = impl Future<Output = ()>; // Error!
// CORRECT: Define in impl block context
impl AsyncService for MyService {
type ProcessFut<'a> = impl Future<Output = String> + 'a;
// The compiler infers the concrete type from the method body
fn process<'a>(&'a self) -> Self::ProcessFut<'a> {
async move { /* ... */ } // This defines the concrete type
}
}
Pitfall 6: Borrowing Across Await Points
// Problem: Borrowing something that doesn't live long enough
impl AsyncService for MyService {
type Fut<'a> = impl Future<Output = String> + 'a;
fn process<'a>(&'a self) -> Self::Fut<'a> {
async move {
let temp = self.create_temp(); // Returns a String
some_async_operation().await; // await point!
// ^^^^^
// If temp is borrowed across this await, it must live long enough
temp // This is fine because we own temp
}
}
}
// Problematic case: holding a !Send value across an await point.
// (The borrow itself is fine -- self-referential state is exactly what
// Pin exists for -- but the resulting future stops being Send.)
// Assumes a hypothetical field `self.state: std::sync::Mutex<String>`.
fn bad_process<'a>(&'a self) -> impl Future<Output = String> + 'a {
async move {
let guard = self.state.lock().unwrap(); // std MutexGuard is !Send
some_async_operation().await;           // guard is held across the await...
guard.to_string()                       // ...so the whole future is !Send
// This compiles, but the future can no longer be passed to
// tokio::spawn or anything else that requires Send.
}
}
Questions to Guide Your Design
- Lifetime Elision
  - How do you capture the lifetime of &self in the returned future?
  - What happens when the method takes multiple references with different lifetimes?
- Trait Objects
  - Why does this approach make the trait no longer "object safe"?
  - Can you use dyn MyAsyncService with this GAT approach?
  - What alternatives exist when you need runtime polymorphism?
- Send and Sync
  - When does your future need to be Send?
  - How do you handle !Send types captured in the async block?
Thinking Exercise
Desugaring the Sugar
Take a standard async fn:
async fn hello(s: &str) -> usize { s.len() }
Now, try to write the same thing without the async keyword:
fn hello(s: &str) -> impl Future<Output = usize> + '_ {
// How do you implement this?
}
Notice the lifetime issues when the input s is used inside the future. The '_ lifetime (elided) is tied to s. Now consider: how does a GAT solve the "named return type" problem?
Key insight: GATs allow the trait to express "the future type varies based on the lifetime of the borrow" - something that was impossible to express before.
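One possible answer to the exercise, written as a hand-named future type (a sketch; the compiler-generated state machine looks different but plays the same role):
use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll};

// The "named return type": a struct that owns the borrow of the input.
struct HelloFut<'a> {
    s: &'a str,
}

impl<'a> Future for HelloFut<'a> {
    type Output = usize;
    fn poll(self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<usize> {
        // No await points inside, so the future completes on the first poll.
        Poll::Ready(self.s.len())
    }
}

// Same signature as the async fn, but now the future type has a name
// that a GAT (or RPITIT) could refer to.
fn hello(s: &str) -> impl Future<Output = usize> + '_ {
    HelloFut { s }
}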
The Interview Questions They'll Ask
- "Why can't we have async fn in traits natively (before Rust 1.75)?"
  Answer: Because async functions return anonymous types whose size varies per implementation. Traits need to specify return types that work for all implementors, but you can't name the anonymous future type. Boxing (as async-trait does) provides a uniform size (one pointer), while GATs allow each impl to specify its own concrete type.
- "What is a GAT and how does it solve the lifetime-in-traits problem?"
  Answer: A Generic Associated Type is an associated type that has its own generic parameters (lifetimes or types). For async traits, it solves the problem by allowing the future type to be parameterized by the lifetime of the borrow: type Fut<'a>: Future + 'a where Self: 'a. This lets each implementor provide a different concrete future type while maintaining lifetime safety.
- "What are the performance implications of using #[async_trait]?"
  Answer: async-trait causes a heap allocation (via Box) for every async method call. This has several costs: the allocation itself (~20-50 ns), potential cache misses when accessing the boxed future, inability to inline across the vtable dispatch, and increased memory fragmentation. For hot paths called millions of times, this overhead is significant.
- "How does the compiler determine the size of an async future?"
  Answer: The compiler analyzes all variables that live across await points (variables created before an await and used after it). These become fields in the generated state machine struct, along with an enum discriminant for the current state. The total size is the sum of these fields plus alignment padding, so different async functions produce different-sized futures.
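A small experiment (illustrative; exact numbers vary by compiler version) that backs up the last answer: only values whose scope crosses an await point need to be stored in the state machine, and size_of_val makes the difference visible.
use std::mem::size_of_val;

async fn pause() {} // trivial await point

async fn scoped_before_await() -> usize {
    let len = {
        let big = [0u8; 4096];
        big.len()
    }; // `big` goes out of scope here, before the await, so it is not stored
    pause().await;
    len
}

async fn lives_across_await() -> usize {
    let big = [0u8; 4096];
    pause().await;
    big.len() // `big` is used after the await, so it becomes a state-machine field
}

fn main() {
    println!("scoped_before_await: {} bytes", size_of_val(&scoped_before_await()));
    println!("lives_across_await:  {} bytes", size_of_val(&lives_across_await()));
}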
Extensions and Challenges
Extension 1: Multi-Method Trait with Different Future Types
trait ComplexService {
type FetchFut<'a>: Future<Output = Vec<u8>> + 'a where Self: 'a;
type ProcessFut<'a>: Future<Output = String> + 'a where Self: 'a;
type StoreFut<'a>: Future<Output = ()> + 'a where Self: 'a;
fn fetch<'a>(&'a self, url: &'a str) -> Self::FetchFut<'a>;
fn process<'a>(&'a self, data: &'a [u8]) -> Self::ProcessFut<'a>;
fn store<'a>(&'a self, key: &'a str, value: &'a str) -> Self::StoreFut<'a>;
}
Extension 2: Combine with Tower's Service Trait Pattern
// Tower-style service with GAT
trait GatService<Request> {
type Response;
type Error;
type Future<'a>: Future<Output = Result<Self::Response, Self::Error>> + 'a
where
Self: 'a;
fn call<'a>(&'a self, req: Request) -> Self::Future<'a>;
}
// Implement middleware that wraps any GatService
struct Logging<S> {
inner: S,
}
impl<S, R> GatService<R> for Logging<S>
where
S: GatService<R>,
R: std::fmt::Debug,
{
type Response = S::Response;
type Error = S::Error;
type Future<'a> = impl Future<Output = Result<S::Response, S::Error>> + 'a
where
Self: 'a;
fn call<'a>(&'a self, req: R) -> Self::Future<'a> {
async move {
println!("Request: {:?}", req);
let result = self.inner.call(req).await;
println!("Response received");
result
}
}
}
Extension 3: Stream-based GAT Trait
use futures::Stream;
use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll};
trait AsyncIterator {
type Item;
type NextFut<'a>: Future<Output = Option<Self::Item>> + 'a
where
Self: 'a;
fn next<'a>(&'a mut self) -> Self::NextFut<'a>;
}
// Convert to a Stream
struct GatStream<I: AsyncIterator> {
iter: I,
}
impl<I: AsyncIterator> Stream for GatStream<I> {
type Item = I::Item;
fn poll_next(
self: Pin<&mut Self>,
cx: &mut Context<'_>
) -> Poll<Option<Self::Item>> {
// Implementation would require pinning the future
todo!()
}
}
Real-World Connections
Tower Service Trait
The Tower crate (used by Axum, Tonic, etc.) uses a polling-based approach:
// Tower's approach (simplified)
pub trait Service<Request> {
type Response;
type Error;
type Future: Future<Output = Result<Self::Response, Self::Error>>;
fn poll_ready(&mut self, cx: &mut Context<'_>) -> Poll<Result<(), Self::Error>>;
fn call(&mut self, req: Request) -> Self::Future;
}
Tower uses &mut self (not &self) to manage backpressure, which changes the lifetime requirements. This is a design trade-off worth studying.
Axum Handlers
Axum's handler traits internally use complex type machinery to avoid boxing where possible:
// Axum uses FromRequest extractors with associated futures
pub trait FromRequest<S>: Sized {
type Rejection: IntoResponse;
fn from_request(
req: Request,
state: &S
) -> impl Future<Output = Result<Self, Self::Rejection>> + Send;
}
Database Drivers
Many async database drivers (like sqlx) must balance ergonomics with performance:
// sqlx uses RPITIT in newer versions
pub trait Executor<'c>: Send {
fn execute<'e, 'q: 'e, E>(
self,
query: E
) -> impl Future<Output = Result<QueryResult, Error>> + Send + 'e
where
'c: 'e,
E: Execute<'q, Self::Database>;
}
Hints in Layers
Hint 1: The GAT definition
Start by defining the associated type with a lifetime:
type Fut<'a>: Future<Output = ()> + 'a where Self: 'a;
Hint 2: Implementation
In the implementation, youโll need to use impl Future or a concrete type. Since you canโt use impl Future in associated types easily yet, you might need to use a crate like real-async-trait for inspiration or use Box only during the development phase to see where it hurts.
Hint 3: Capturing Lifetimes
The where Self: 'a bound is crucial. It tells the compiler that the future canโt outlive the service itself.
Hint 4: Testing Zero-Cost
Use a custom allocator that counts allocations to verify your implementation truly allocates nothing in the async path.
Hint 5: Stable Rust Alternative
If you canโt use nightly, implement the Future trait manually for a named struct instead of relying on impl Future.
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| GAT Mastery | โIdiomatic Rustโ | Ch. 5 - Advanced Traits |
| Async Internals | โRust for Rustaceansโ | Ch. 8 - Asynchronous Programming |
| Dispatch Performance | โEffective Rustโ | Item 12 - Prefer Generics to Trait Objects |
| Trait Design | โProgramming Rustโ | Ch. 11 - Traits and Generics |
| Future Implementation | โRust for Rustaceansโ | Ch. 8 - Building Your Own Futures |
Summary
This project teaches you one of the most advanced patterns in the Rust async ecosystem. By building a box-less async trait system, you will:
- Understand why async in traits was historically problematic
- Master GATs and their role in expressing complex lifetime relationships
- Learn to reason about memory allocation in async code
- Build benchmarks that prove performance differences
- Understand the trade-off between object safety and zero-cost abstractions
The skills learned here directly apply to understanding and contributing to major async crates like Tower, Axum, and Tokio.
What's Next?
After completing this project, consider:
- Project 8: Building a Custom Runtime - Understand the executor side of async
- Project 1: Manual Pin Projector - Deep dive into why futures need pinning
- Exploring the Tower crate's service pattern in depth
- Reading the async-trait source code to understand its macro implementation