Project 4: Home Assistant Voice Satellite (ESPHome)

Build a Home Assistant Assist Satellite on ESP32-S3 using ESPHome, with correct pipeline state handling and OTA stability.

Quick Reference

Attribute	Value
Difficulty	Level 3: Intermediate
Time Estimate	2-4 days
Main Programming Language	YAML (ESPHome)
Alternative Programming Languages	C++ (custom ESPHome components), C (ESP-IDF)
Coolness Level	High
Business Potential	Medium
Prerequisites	Basic Home Assistant setup, ESPHome flashing, audio hardware wired
Key Topics	ESPHome voice assistant component, Assist pipeline states, OTA stability

1. Learning Objectives

By completing this project, you will:

Configure ESPHome to expose an Assist Satellite entity in Home Assistant.
Implement correct pipeline state transitions (IDLE, LISTENING, PROCESSING, RESPONDING).
Ensure stable audio capture and playback with limited RAM.
Build a maintainable configuration that supports OTA updates.

2. All Theory Needed (Per-Concept Breakdown)

2.1 ESPHome Architecture and the Voice Assistant Component

Fundamentals

ESPHome is a configuration-driven firmware system built on top of ESP-IDF/Arduino. You write YAML, and ESPHome generates C++ firmware that runs on the device. The Voice Assistant component provides a high-level pipeline for audio capture, streaming to Home Assistant, and playback of TTS responses. It exposes a standard Assist Satellite entity in Home Assistant, which means your device can integrate with the Assist pipeline without custom backend code. The key idea is that ESPHome handles most of the audio and networking details, but you still must configure pins, codecs, and state handling correctly.

Deep Dive into the concept

ESPHome abstracts much of the complexity of embedded development, but it is not magic. Understanding its architecture helps you debug and tune performance. ESPHome builds a component graph from your YAML. Each component can register callbacks and update loops. The Voice Assistant component ties together microphone capture, audio streaming, and speaker playback. It also integrates with Home Assistant via the native API and exposes a standardized entity with states like IDLE, LISTENING, PROCESSING, and RESPONDING.

The pipeline works like this: when the Assist Satellite entity receives a start command (either from a button or wake word), the Voice Assistant component begins streaming audio to Home Assistant. Home Assistant runs the Assist pipeline (ASR, intent, TTS) and sends audio back. The device plays back the response and signals when playback ends. The device must keep Home Assistant informed about these states. If you do not send the proper “end of playback” signal, Home Assistant can remain stuck in RESPONDING. This is why state signaling is critical.

ESPHome builds on ESP-IDF, but it wraps driver initialization. For audio, you must configure I2S pins, sample rate, and DMA buffer sizes. ESPHome’s defaults may not be optimal for ESP32-S3 with PSRAM, so you need to tune them. Voice assistant components are memory-heavy; they allocate buffers for audio capture and playback, plus protocol overhead. This is why ESPHome warns that BLE components can cause instability. BLE consumes RAM and adds additional tasks, which can starve the audio pipeline. In practice, you should disable BLE and unnecessary components when building a voice satellite.

One advantage of ESPHome is OTA updates. Because ESPHome manages partitions and firmware upgrades, you can update without serial cables. However, OTA is not free: it requires extra flash space and can fail if the device runs low on memory during update. You should monitor free heap during streaming and keep a safety margin. Another important consideration is logging. ESPHome logs are convenient but can be verbose. High log levels can cause audio dropouts. Use INFO for normal operation and DEBUG only when needed.

Finally, ESPHome’s strength is maintainability. By using YAML, you can quickly modify pins, change devices, or update pipelines. This is ideal for smart-home integration, where users might not be firmware engineers. The tradeoff is less fine-grained control. If you need extremely low latency or custom codecs, you may need to drop down to ESP-IDF. For this project, the goal is to master ESPHome’s capabilities and constraints, not to replace them.

How this fit on projects

This concept is the core of Project 4. It also shows how a full-stack firmware can be simplified with a high-level framework.

Definitions & key terms

ESPHome: YAML-based firmware generator for ESP devices.
Voice Assistant component: ESPHome module for Assist pipelines.
Assist Satellite entity: Home Assistant entity that represents a voice device.
OTA: Over-the-air firmware updates.

Mental model diagram (ASCII)

[ESPHome YAML] -> [ESPHome Build] -> [Firmware]
       |                               |
       v                               v
 [Voice Assistant] <-> [Home Assistant Assist Pipeline]

How it works (step-by-step)

ESPHome compiles YAML into firmware.
Device boots and registers with Home Assistant.
Assist pipeline triggers voice session.
Device streams audio and receives TTS.
Device signals playback completion and returns to IDLE.

Minimal concrete example

voice_assistant:
  microphone: i2s_mic
  speaker: i2s_speaker
  on_listening:
    - logger.log: "listening"

Common misconceptions

“ESPHome handles everything.” You still need correct pins and buffer sizes.
“BLE can stay enabled.” It often causes memory pressure and crashes.

Check-your-understanding questions

Why does the Assist Satellite entity require playback completion signals?
What happens if you enable too many components in ESPHome?
Why might you need to tune DMA buffers in ESPHome?

Check-your-understanding answers

Home Assistant needs to know when to return to IDLE state.
Memory pressure and task contention can destabilize audio.
Default buffers may be too small or too large for your hardware.

Real-world applications

Smart-home voice satellites
Intercom devices
Room-based assist terminals

Where you will apply it

In this project: Section 3.2 and Section 5.2.
Also used in: P05-the-full-stack-xiaozhi-clone.md

References

ESPHome Voice Assistant documentation.
Home Assistant Assist Satellite documentation.

Key insights

ESPHome simplifies integration, but performance still depends on correct configuration.

Summary

ESPHome provides a high-level voice pipeline, but you must tune it for stability.

Homework/exercises to practice the concept

Enable and disable BLE and observe heap usage differences.
Change DMA buffer size and log stability.
Add a custom LED indicator for each pipeline state.

Solutions to the homework/exercises

Heap usage decreases when BLE is disabled.
Larger buffers reduce dropouts but increase latency.
Use on_listening, on_tts_start, on_end triggers.

2.2 Assist Pipeline State Machine and UX Contracts

Fundamentals

The Assist pipeline in Home Assistant defines a standardized set of states: IDLE, LISTENING, PROCESSING, RESPONDING. The satellite device must reflect these states accurately. LISTENING means audio capture is active. PROCESSING means the server is doing ASR/LLM work. RESPONDING means TTS is playing back on the device. The device must signal when playback ends so the pipeline can return to IDLE. This state machine is a contract between the device and Home Assistant, and breaking it causes stuck sessions or confusing UX.

Deep Dive into the concept

In a smart-home environment, consistency is everything. Users expect their voice device to behave like every other device. The Assist pipeline provides this consistency by defining states and transitions. When the pipeline enters LISTENING, the device should show a clear visual indication (LED or UI). When it enters PROCESSING, the user needs feedback that the system is working. When it enters RESPONDING, the device must play audio and then explicitly tell Home Assistant that playback is complete. This explicit signal is important because Home Assistant may send multiple audio chunks, and it cannot assume when playback ends without confirmation.

In ESPHome, you can bind these states to actions in the YAML configuration. For example, on_listening can turn on a blue LED, on_tts_start can turn on a green LED, and on_end can return to idle. This is not just UI; it is a synchronization mechanism. If your device does not send on_end, the pipeline stays in RESPONDING and new commands may be ignored. This is a common failure mode when people first use ESPHome voice assistant components.

The state machine must also handle errors. If the network disconnects during LISTENING, the device should transition to ERROR or IDLE and tell Home Assistant that the session ended. If the server fails to send a response, the device should timeout and return to IDLE. You should implement a watchdog timer for each pipeline state. For example, if LISTENING lasts more than 20 seconds without a stop trigger, you can auto-stop and return to IDLE. This prevents the device from getting stuck. The goal is a stable and predictable user experience.

The state machine also guides performance decisions. If you are in LISTENING, you should disable heavy animations and prioritize audio capture. In PROCESSING, you can lower CPU load and allow the network to work. In RESPONDING, you should ensure playback has priority. These state-based policies keep the device stable and align resource usage with user perception.

Finally, the state machine is a testing tool. By simulating state transitions, you can verify that your device responds correctly without live audio. This is useful when debugging Home Assistant integration or network issues. The more deterministic your state handling, the easier it is to debug.

How this fit on projects

This concept defines the behavior of the ESPHome satellite and prepares you for building your own full-stack state machine.

Definitions & key terms

IDLE: No active voice session.
LISTENING: Audio capture in progress.
PROCESSING: Server is processing audio.
RESPONDING: Device is playing TTS.

Mental model diagram (ASCII)

IDLE -> LISTENING -> PROCESSING -> RESPONDING -> IDLE
   ^         |            |             |
   |         v            v             v
 ERROR <---- timeout   network fail   playback end

How it works (step-by-step)

Trigger voice session.
Enter LISTENING, capture audio.
Stop capture, enter PROCESSING.
Receive TTS, enter RESPONDING.
Playback ends, signal completion, return to IDLE.

Minimal concrete example

on_listening:
  - light.turn_on: led_listen
on_tts_start:
  - light.turn_on: led_speak
on_end:
  - light.turn_off: led_listen
  - light.turn_off: led_speak

Common misconceptions

“Home Assistant will detect playback end automatically.” It needs explicit signaling.
“States are just UI labels.” They drive pipeline behavior.

Check-your-understanding questions

Why must the device signal end of playback?
What happens if LISTENING never ends?
How can you test state transitions without audio?

Check-your-understanding answers

Home Assistant cannot infer playback completion reliably.
The pipeline stays stuck and new commands may not work.
Use button triggers or scripted state changes.

Real-world applications

Voice assistants in smart homes
Call center intercoms
Smart kiosks

Where you will apply it

In this project: Section 3.2 and Section 5.5.
Also used in: P05-the-full-stack-xiaozhi-clone.md

References

Home Assistant Assist documentation.
ESPHome voice assistant component docs.

Key insights

State transitions are a protocol, not just UX.

Summary

Correct state signaling ensures Home Assistant remains in sync with the device.

Homework/exercises to practice the concept

Add a timeout to LISTENING and log when it triggers.
Simulate a server error and verify UI returns to IDLE.
Add LED patterns for each state.

Solutions to the homework/exercises

Use on_listening with a timer to call voice_assistant.stop.
Trigger on_error and ensure LEDs reset.
Use different colors or blink patterns.

2.3 OTA Reliability and Resource Constraints

Fundamentals

OTA updates let you flash new firmware over the network. ESPHome uses OTA by default, but it requires memory and flash space. Voice assistant components are RAM heavy, so OTA can fail if the device is already close to memory limits. You must ensure enough free heap and avoid running heavy tasks during OTA. Resource constraints also apply during normal operation: audio buffers, network stacks, and UI all consume RAM. You should measure free heap and log minimum values to avoid instability.

Deep Dive into the concept

OTA reliability is often overlooked until the first device update fails. ESPHome uses a dual-partition scheme, where the new firmware is downloaded into an inactive partition. This requires sufficient flash space and a stable network. If your firmware is too large, OTA will fail before flashing. You can check firmware size in ESPHome build logs. For voice assistants, the firmware size is larger because of audio components, codecs, and TLS libraries. This is why you should minimize unused components and avoid large assets.

RAM constraints are more subtle. During OTA, the device must allocate buffers for the download and for verification. If your voice assistant is running at the same time, it may consume most of the heap. The result is failed OTA updates. The safe approach is to stop voice assistant activity during OTA or to trigger OTA only when idle. ESPHome has built-in OTA handling, but you can still design your workflow to minimize risks. For example, you can disable wake word detection while updating.

Resource constraints also influence stability during voice sessions. The voice assistant component uses buffers for audio capture and playback. If you enable BLE or other heavy components, you may reduce available heap below safe margins. ESPHome’s warnings about BLE are serious: BLE uses a large stack and memory pool. On ESP32-S3, PSRAM helps, but not all data can live in PSRAM. DMA buffers must be in internal SRAM. This means even if you have plenty of PSRAM, you can still run out of internal SRAM for DMA.

The right approach is to measure. ESPHome provides logs for free heap. You can also add custom sensors that report minimum heap. Track this value during a voice session and during OTA. If it drops below a threshold (for example 50 KB), you should consider reducing features or buffer sizes. The goal is not just to run once, but to run reliably for days. Resource constraints are a reality of embedded systems, and ESPHome does not change that.

How this fit on projects

This concept informs how you configure and test ESPHome for stability.

Definitions & key terms

OTA partition: Flash region used to store new firmware.
Heap: Dynamic memory available for allocations.
Free heap minimum: Lowest heap value observed during runtime.
Internal SRAM: Fast memory needed for DMA.

Mental model diagram (ASCII)

Flash: [Active FW] [OTA Slot]
RAM:   [Audio Buffers] [Network] [UI] [Free]

How it works (step-by-step)

OTA request triggers firmware download.
Firmware image stored in OTA slot.
Device verifies image and reboots.
New firmware becomes active.

Minimal concrete example

logger:
  level: INFO

sensor:
  - platform: uptime
    name: "Uptime"

Common misconceptions

“OTA always works if Wi-Fi is strong.” Memory can still fail the update.
“PSRAM solves everything.” DMA buffers still require internal SRAM.

Check-your-understanding questions

Why might OTA fail even with good Wi-Fi?
Why is internal SRAM still a bottleneck?
How can you monitor memory usage in ESPHome?

Check-your-understanding answers

Insufficient free heap or flash space.
DMA buffers must live in internal SRAM.
Use logging and sensors for free heap metrics.

Real-world applications

Fleet updates of IoT devices
Smart home device maintenance
Remote deployments

Where you will apply it

In this project: Section 3.3 and Section 7.1.
Also used in: P05-the-full-stack-xiaozhi-clone.md

References

ESPHome OTA documentation.
ESP-IDF memory management guides.

Key insights

OTA success depends on memory headroom, not just connectivity.

Summary

Resource constraints remain critical even with high-level frameworks like ESPHome.

Homework/exercises to practice the concept

Measure minimum free heap during a voice session.
Trigger OTA during idle and during active streaming and compare results.
Disable BLE and compare stability.

Solutions to the homework/exercises

Log free_heap periodically and record minimum value.
OTA during streaming is more likely to fail.
Disabling BLE increases free heap and stability.

3. Project Specification

3.1 What You Will Build

An ESPHome-based voice satellite that appears in Home Assistant as an Assist Satellite entity. It supports push-to-talk or wake word, streams audio to Home Assistant, plays responses, and updates LEDs or a screen for pipeline states. OTA updates must work reliably.

3.2 Functional Requirements

ESPHome Config: YAML defines microphone, speaker, and voice assistant.
Assist Integration: Device registers as Assist Satellite entity.
Pipeline States: LEDs or UI reflect LISTENING, PROCESSING, RESPONDING.
Playback Completion: Device signals end of TTS playback.
OTA Updates: OTA update works while device is idle.

3.3 Non-Functional Requirements

Performance: Audio capture stable for 5 minutes.
Reliability: OTA succeeds 3 times in a row while idle.
Usability: State transitions visible within 200 ms.

3.4 Example Usage / Output

I (000500) esphome: connected to HA
I (001000) voice: state=LISTENING
I (003000) voice: state=PROCESSING
I (004500) voice: state=RESPONDING
I (006000) voice: state=IDLE

3.5 Data Formats / Schemas / Protocols

ESPHome uses Home Assistant native API. No custom protocol required.

3.6 Edge Cases

Device not discovered: mDNS or VLAN issues.
OTA failure due to low heap: log and retry when idle.
Audio garbled: sample rate mismatch or incorrect I2S pins.

3.7 Real World Outcome

3.7.1 How to Run (Copy/Paste)

esphome run voice_satellite.yaml

3.7.2 Golden Path Demo (Deterministic)

Use fixed test command “Turn on the light” and fixed timestamp.

Expected serial output:

I (000000) esphome: clock=2026-01-01T00:00:00Z
I (000500) voice: state=LISTENING
I (001500) voice: state=PROCESSING
I (002500) voice: state=RESPONDING
I (004000) voice: playback complete
I (004100) voice: state=IDLE

3.7.3 Failure Demo (Deterministic)

Simulate low memory by enabling BLE and forcing a long voice session.

Expected serial output:

E (003000) voice: heap_low, abort session
I (003100) voice: state=IDLE

Expected behavior:

Device stops streaming and returns to IDLE.
UI shows error briefly.

4. Solution Architecture

4.1 High-Level Design

[ESPHome Voice Assistant] <-> [Home Assistant Assist Pipeline]
     |          |                  |
     v          v                  v
 [Mic I2S]  [Speaker I2S]       [TTS/ASR]

4.2 Key Components

Component	Responsibility	Key Decisions
ESPHome YAML	Configure hardware and pipeline	Minimal vs full config
Voice Assistant	Stream audio and manage states	Buffer size and event hooks
OTA	Firmware updates	Trigger only when idle
LEDs/UI	State feedback	Simple LED vs display

4.3 Data Structures (No Full Code)

ESPHome is declarative; no custom data structures required for base use.

4.4 Algorithm Overview

Key Algorithm: Assist Session Lifecycle

Trigger session (button or wake word).
Stream audio to Home Assistant.
Receive response and play TTS.
Signal completion and return to idle.

Complexity Analysis:

Time: O(n) per audio chunk
Space: O(buffer size)

5. Implementation Guide

5.1 Development Environment Setup

pip install esphome
esphome wizard voice_satellite.yaml

5.2 Project Structure

project-root/
├── voice_satellite.yaml
└── secrets.yaml

5.3 The Core Question You’re Answering

“How can I integrate a voice device into Home Assistant without custom firmware?”

5.4 Concepts You Must Understand First

ESPHome component model.
Assist pipeline states and signaling.
OTA update limitations.

5.5 Questions to Guide Your Design

Which ESPHome components are essential and which can be removed?
How will you indicate pipeline state to the user?
When should OTA updates be allowed?

5.6 Thinking Exercise

List all components in your YAML and mark which ones are required for voice. Remove the rest.

5.7 The Interview Questions They’ll Ask

What is the advantage of a standardized Assist Satellite entity?
How do you ensure OTA reliability on constrained devices?
Why can BLE destabilize audio pipelines?

5.8 Hints in Layers

Hint 1: Start from the ESPHome voice assistant example config.

Hint 2: Disable BLE and unnecessary sensors.

Hint 3: Use LED feedback first; add display later.

Hint 4: Log free heap and watch for dips.

5.9 Books That Will Help

| Topic | Book | Chapter | |——-|——|———| | Embedded system integration | Making Embedded Systems | Ch. 9 | | Architecture principles | Clean Architecture | Ch. 1-4 |

5.10 Implementation Phases

Phase 1: Basic ESPHome Bring-up (4-6 hours)

Goals:

Device connects to Home Assistant.

Tasks:

Create ESPHome YAML.
Flash firmware and pair with HA.

Checkpoint: Device appears in HA.

Phase 2: Voice Pipeline (1-2 days)

Goals:

Capture audio and play TTS.

Tasks:

Configure mic and speaker.
Test voice session.

Checkpoint: Device responds to a voice command.

Phase 3: Stability and OTA (1 day)

Goals:

Ensure OTA updates work reliably.

Tasks:

Run OTA update while idle.
Log free heap during sessions.

Checkpoint: OTA succeeds three times in a row.

5.11 Key Implementation Decisions

6. Testing Strategy

6.1 Test Categories

6.2 Critical Test Cases

Discovery: Device appears on HA within 60 seconds.
Playback End: Device signals end of TTS playback.
Low Heap: Voice session aborts safely when heap low.

6.3 Test Data

Test command: "Turn on the light"

7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

7.2 Debugging Strategies

Use HA logs to verify state changes.
Log free heap and minimum heap during sessions.

7.3 Performance Traps

High log level can cause audio glitches. Keep logging at INFO.

8. Extensions & Challenges

8.1 Beginner Extensions

Add a mute switch in ESPHome.
Add a Wi-Fi signal strength sensor.

8.2 Intermediate Extensions

Add a display for pipeline states.
Add a local wake word button.

8.3 Advanced Extensions

Create a custom ESPHome component for advanced UI.
Add per-room context metadata to Assist sessions.

9. Real-World Connections

9.1 Industry Applications

Smart-home voice satellites.
Multi-room intercom systems.

ESPHome voice assistant example configs.
Home Assistant Assist pipeline.

9.3 Interview Relevance

System integration and OTA strategies.
Resource constraint management.

10. Resources

10.1 Essential Reading

ESPHome Voice Assistant docs.
Home Assistant Assist Satellite docs.

10.2 Video Resources

ESPHome community tutorials.
Home Assistant Assist feature overviews.

10.3 Tools & Documentation

ESPHome CLI
Home Assistant logs

P03-the-dumb-chatbot-streaming-audio-api.md - Custom streaming.
P05-the-full-stack-xiaozhi-clone.md - Full custom firmware.

11. Self-Assessment Checklist

11.1 Understanding

I can explain the Assist pipeline states.
I can describe why OTA needs memory headroom.
I can explain why BLE impacts stability.

11.2 Implementation

Device appears as Assist Satellite.
Voice sessions complete with correct states.
OTA updates succeed while idle.

11.3 Growth

I can tune ESPHome for audio stability.
I can debug HA integration issues.

12. Submission / Completion Criteria

Minimum Viable Completion:

Device connects to HA and runs a voice session.
UI/LED shows state changes.

Full Completion:

OTA updates reliable.
10-minute voice session without reset.

Excellence (Going Above & Beyond):

Added display-based UI and custom animations.
Logged and documented minimum free heap.

13 Additional Content Rules (Hard Requirements)

13.1 Determinism

Golden demo uses fixed command and fixed timestamp.

13.2 Outcome Completeness

Golden path demo in Section 3.7.2.
Failure demo in Section 3.7.3.

13.3 Cross-Linking

Cross-links included in Section 2 and Section 10.4.

13.4 No Placeholder Text

All sections are fully filled with specific content.