Project 4: Home Assistant Voice Satellite (ESPHome)
Build a Home Assistant Assist Satellite on ESP32-S3 using ESPHome, with correct pipeline state handling and OTA stability.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Level 3: Intermediate |
| Time Estimate | 2-4 days |
| Main Programming Language | YAML (ESPHome) |
| Alternative Programming Languages | C++ (custom ESPHome components), C (ESP-IDF) |
| Coolness Level | High |
| Business Potential | Medium |
| Prerequisites | Basic Home Assistant setup, ESPHome flashing, audio hardware wired |
| Key Topics | ESPHome voice assistant component, Assist pipeline states, OTA stability |
1. Learning Objectives
By completing this project, you will:
- Configure ESPHome to expose an Assist Satellite entity in Home Assistant.
- Implement correct pipeline state transitions (IDLE, LISTENING, PROCESSING, RESPONDING).
- Ensure stable audio capture and playback with limited RAM.
- Build a maintainable configuration that supports OTA updates.
2. All Theory Needed (Per-Concept Breakdown)
2.1 ESPHome Architecture and the Voice Assistant Component
Fundamentals
ESPHome is a configuration-driven firmware system built on top of ESP-IDF/Arduino. You write YAML, and ESPHome generates C++ firmware that runs on the device. The Voice Assistant component provides a high-level pipeline for audio capture, streaming to Home Assistant, and playback of TTS responses. It exposes a standard Assist Satellite entity in Home Assistant, which means your device can integrate with the Assist pipeline without custom backend code. The key idea is that ESPHome handles most of the audio and networking details, but you still must configure pins, codecs, and state handling correctly.
Deep Dive into the concept
ESPHome abstracts much of the complexity of embedded development, but it is not magic. Understanding its architecture helps you debug and tune performance. ESPHome builds a component graph from your YAML. Each component can register callbacks and update loops. The Voice Assistant component ties together microphone capture, audio streaming, and speaker playback. It also integrates with Home Assistant via the native API and exposes a standardized entity with states like IDLE, LISTENING, PROCESSING, and RESPONDING.
The pipeline works like this: when the Assist Satellite entity receives a start command (either from a button or wake word), the Voice Assistant component begins streaming audio to Home Assistant. Home Assistant runs the Assist pipeline (ASR, intent, TTS) and sends audio back. The device plays back the response and signals when playback ends. The device must keep Home Assistant informed about these states. If you do not send the proper “end of playback” signal, Home Assistant can remain stuck in RESPONDING. This is why state signaling is critical.
ESPHome builds on ESP-IDF, but it wraps driver initialization. For audio, you must configure I2S pins, sample rate, and DMA buffer sizes. ESPHome’s defaults may not be optimal for ESP32-S3 with PSRAM, so you need to tune them. Voice assistant components are memory-heavy; they allocate buffers for audio capture and playback, plus protocol overhead. This is why ESPHome warns that BLE components can cause instability. BLE consumes RAM and adds additional tasks, which can starve the audio pipeline. In practice, you should disable BLE and unnecessary components when building a voice satellite.
One advantage of ESPHome is OTA updates. Because ESPHome manages partitions and firmware upgrades, you can update without serial cables. However, OTA is not free: it requires extra flash space and can fail if the device runs low on memory during update. You should monitor free heap during streaming and keep a safety margin. Another important consideration is logging. ESPHome logs are convenient but can be verbose. High log levels can cause audio dropouts. Use INFO for normal operation and DEBUG only when needed.
Finally, ESPHome’s strength is maintainability. By using YAML, you can quickly modify pins, change devices, or update pipelines. This is ideal for smart-home integration, where users might not be firmware engineers. The tradeoff is less fine-grained control. If you need extremely low latency or custom codecs, you may need to drop down to ESP-IDF. For this project, the goal is to master ESPHome’s capabilities and constraints, not to replace them.
How this fit on projects
This concept is the core of Project 4. It also shows how a full-stack firmware can be simplified with a high-level framework.
Definitions & key terms
- ESPHome: YAML-based firmware generator for ESP devices.
- Voice Assistant component: ESPHome module for Assist pipelines.
- Assist Satellite entity: Home Assistant entity that represents a voice device.
- OTA: Over-the-air firmware updates.
Mental model diagram (ASCII)
[ESPHome YAML] -> [ESPHome Build] -> [Firmware]
| |
v v
[Voice Assistant] <-> [Home Assistant Assist Pipeline]
How it works (step-by-step)
- ESPHome compiles YAML into firmware.
- Device boots and registers with Home Assistant.
- Assist pipeline triggers voice session.
- Device streams audio and receives TTS.
- Device signals playback completion and returns to IDLE.
Minimal concrete example
voice_assistant:
microphone: i2s_mic
speaker: i2s_speaker
on_listening:
- logger.log: "listening"
Common misconceptions
- “ESPHome handles everything.” You still need correct pins and buffer sizes.
- “BLE can stay enabled.” It often causes memory pressure and crashes.
Check-your-understanding questions
- Why does the Assist Satellite entity require playback completion signals?
- What happens if you enable too many components in ESPHome?
- Why might you need to tune DMA buffers in ESPHome?
Check-your-understanding answers
- Home Assistant needs to know when to return to IDLE state.
- Memory pressure and task contention can destabilize audio.
- Default buffers may be too small or too large for your hardware.
Real-world applications
- Smart-home voice satellites
- Intercom devices
- Room-based assist terminals
Where you will apply it
- In this project: Section 3.2 and Section 5.2.
- Also used in: P05-the-full-stack-xiaozhi-clone.md
References
- ESPHome Voice Assistant documentation.
- Home Assistant Assist Satellite documentation.
Key insights
ESPHome simplifies integration, but performance still depends on correct configuration.
Summary
ESPHome provides a high-level voice pipeline, but you must tune it for stability.
Homework/exercises to practice the concept
- Enable and disable BLE and observe heap usage differences.
- Change DMA buffer size and log stability.
- Add a custom LED indicator for each pipeline state.
Solutions to the homework/exercises
- Heap usage decreases when BLE is disabled.
- Larger buffers reduce dropouts but increase latency.
- Use
on_listening,on_tts_start,on_endtriggers.
2.2 Assist Pipeline State Machine and UX Contracts
Fundamentals
The Assist pipeline in Home Assistant defines a standardized set of states: IDLE, LISTENING, PROCESSING, RESPONDING. The satellite device must reflect these states accurately. LISTENING means audio capture is active. PROCESSING means the server is doing ASR/LLM work. RESPONDING means TTS is playing back on the device. The device must signal when playback ends so the pipeline can return to IDLE. This state machine is a contract between the device and Home Assistant, and breaking it causes stuck sessions or confusing UX.
Deep Dive into the concept
In a smart-home environment, consistency is everything. Users expect their voice device to behave like every other device. The Assist pipeline provides this consistency by defining states and transitions. When the pipeline enters LISTENING, the device should show a clear visual indication (LED or UI). When it enters PROCESSING, the user needs feedback that the system is working. When it enters RESPONDING, the device must play audio and then explicitly tell Home Assistant that playback is complete. This explicit signal is important because Home Assistant may send multiple audio chunks, and it cannot assume when playback ends without confirmation.
In ESPHome, you can bind these states to actions in the YAML configuration. For example, on_listening can turn on a blue LED, on_tts_start can turn on a green LED, and on_end can return to idle. This is not just UI; it is a synchronization mechanism. If your device does not send on_end, the pipeline stays in RESPONDING and new commands may be ignored. This is a common failure mode when people first use ESPHome voice assistant components.
The state machine must also handle errors. If the network disconnects during LISTENING, the device should transition to ERROR or IDLE and tell Home Assistant that the session ended. If the server fails to send a response, the device should timeout and return to IDLE. You should implement a watchdog timer for each pipeline state. For example, if LISTENING lasts more than 20 seconds without a stop trigger, you can auto-stop and return to IDLE. This prevents the device from getting stuck. The goal is a stable and predictable user experience.
The state machine also guides performance decisions. If you are in LISTENING, you should disable heavy animations and prioritize audio capture. In PROCESSING, you can lower CPU load and allow the network to work. In RESPONDING, you should ensure playback has priority. These state-based policies keep the device stable and align resource usage with user perception.
Finally, the state machine is a testing tool. By simulating state transitions, you can verify that your device responds correctly without live audio. This is useful when debugging Home Assistant integration or network issues. The more deterministic your state handling, the easier it is to debug.
How this fit on projects
This concept defines the behavior of the ESPHome satellite and prepares you for building your own full-stack state machine.
Definitions & key terms
- IDLE: No active voice session.
- LISTENING: Audio capture in progress.
- PROCESSING: Server is processing audio.
- RESPONDING: Device is playing TTS.
Mental model diagram (ASCII)
IDLE -> LISTENING -> PROCESSING -> RESPONDING -> IDLE
^ | | |
| v v v
ERROR <---- timeout network fail playback end
How it works (step-by-step)
- Trigger voice session.
- Enter LISTENING, capture audio.
- Stop capture, enter PROCESSING.
- Receive TTS, enter RESPONDING.
- Playback ends, signal completion, return to IDLE.
Minimal concrete example
on_listening:
- light.turn_on: led_listen
on_tts_start:
- light.turn_on: led_speak
on_end:
- light.turn_off: led_listen
- light.turn_off: led_speak
Common misconceptions
- “Home Assistant will detect playback end automatically.” It needs explicit signaling.
- “States are just UI labels.” They drive pipeline behavior.
Check-your-understanding questions
- Why must the device signal end of playback?
- What happens if LISTENING never ends?
- How can you test state transitions without audio?
Check-your-understanding answers
- Home Assistant cannot infer playback completion reliably.
- The pipeline stays stuck and new commands may not work.
- Use button triggers or scripted state changes.
Real-world applications
- Voice assistants in smart homes
- Call center intercoms
- Smart kiosks
Where you will apply it
- In this project: Section 3.2 and Section 5.5.
- Also used in: P05-the-full-stack-xiaozhi-clone.md
References
- Home Assistant Assist documentation.
- ESPHome voice assistant component docs.
Key insights
State transitions are a protocol, not just UX.
Summary
Correct state signaling ensures Home Assistant remains in sync with the device.
Homework/exercises to practice the concept
- Add a timeout to LISTENING and log when it triggers.
- Simulate a server error and verify UI returns to IDLE.
- Add LED patterns for each state.
Solutions to the homework/exercises
- Use
on_listeningwith a timer to callvoice_assistant.stop. - Trigger
on_errorand ensure LEDs reset. - Use different colors or blink patterns.
2.3 OTA Reliability and Resource Constraints
Fundamentals
OTA updates let you flash new firmware over the network. ESPHome uses OTA by default, but it requires memory and flash space. Voice assistant components are RAM heavy, so OTA can fail if the device is already close to memory limits. You must ensure enough free heap and avoid running heavy tasks during OTA. Resource constraints also apply during normal operation: audio buffers, network stacks, and UI all consume RAM. You should measure free heap and log minimum values to avoid instability.
Deep Dive into the concept
OTA reliability is often overlooked until the first device update fails. ESPHome uses a dual-partition scheme, where the new firmware is downloaded into an inactive partition. This requires sufficient flash space and a stable network. If your firmware is too large, OTA will fail before flashing. You can check firmware size in ESPHome build logs. For voice assistants, the firmware size is larger because of audio components, codecs, and TLS libraries. This is why you should minimize unused components and avoid large assets.
RAM constraints are more subtle. During OTA, the device must allocate buffers for the download and for verification. If your voice assistant is running at the same time, it may consume most of the heap. The result is failed OTA updates. The safe approach is to stop voice assistant activity during OTA or to trigger OTA only when idle. ESPHome has built-in OTA handling, but you can still design your workflow to minimize risks. For example, you can disable wake word detection while updating.
Resource constraints also influence stability during voice sessions. The voice assistant component uses buffers for audio capture and playback. If you enable BLE or other heavy components, you may reduce available heap below safe margins. ESPHome’s warnings about BLE are serious: BLE uses a large stack and memory pool. On ESP32-S3, PSRAM helps, but not all data can live in PSRAM. DMA buffers must be in internal SRAM. This means even if you have plenty of PSRAM, you can still run out of internal SRAM for DMA.
The right approach is to measure. ESPHome provides logs for free heap. You can also add custom sensors that report minimum heap. Track this value during a voice session and during OTA. If it drops below a threshold (for example 50 KB), you should consider reducing features or buffer sizes. The goal is not just to run once, but to run reliably for days. Resource constraints are a reality of embedded systems, and ESPHome does not change that.
How this fit on projects
This concept informs how you configure and test ESPHome for stability.
Definitions & key terms
- OTA partition: Flash region used to store new firmware.
- Heap: Dynamic memory available for allocations.
- Free heap minimum: Lowest heap value observed during runtime.
- Internal SRAM: Fast memory needed for DMA.
Mental model diagram (ASCII)
Flash: [Active FW] [OTA Slot]
RAM: [Audio Buffers] [Network] [UI] [Free]
How it works (step-by-step)
- OTA request triggers firmware download.
- Firmware image stored in OTA slot.
- Device verifies image and reboots.
- New firmware becomes active.
Minimal concrete example
logger:
level: INFO
sensor:
- platform: uptime
name: "Uptime"
Common misconceptions
- “OTA always works if Wi-Fi is strong.” Memory can still fail the update.
- “PSRAM solves everything.” DMA buffers still require internal SRAM.
Check-your-understanding questions
- Why might OTA fail even with good Wi-Fi?
- Why is internal SRAM still a bottleneck?
- How can you monitor memory usage in ESPHome?
Check-your-understanding answers
- Insufficient free heap or flash space.
- DMA buffers must live in internal SRAM.
- Use logging and sensors for free heap metrics.
Real-world applications
- Fleet updates of IoT devices
- Smart home device maintenance
- Remote deployments
Where you will apply it
- In this project: Section 3.3 and Section 7.1.
- Also used in: P05-the-full-stack-xiaozhi-clone.md
References
- ESPHome OTA documentation.
- ESP-IDF memory management guides.
Key insights
OTA success depends on memory headroom, not just connectivity.
Summary
Resource constraints remain critical even with high-level frameworks like ESPHome.
Homework/exercises to practice the concept
- Measure minimum free heap during a voice session.
- Trigger OTA during idle and during active streaming and compare results.
- Disable BLE and compare stability.
Solutions to the homework/exercises
- Log
free_heapperiodically and record minimum value. - OTA during streaming is more likely to fail.
- Disabling BLE increases free heap and stability.
3. Project Specification
3.1 What You Will Build
An ESPHome-based voice satellite that appears in Home Assistant as an Assist Satellite entity. It supports push-to-talk or wake word, streams audio to Home Assistant, plays responses, and updates LEDs or a screen for pipeline states. OTA updates must work reliably.
3.2 Functional Requirements
- ESPHome Config: YAML defines microphone, speaker, and voice assistant.
- Assist Integration: Device registers as Assist Satellite entity.
- Pipeline States: LEDs or UI reflect LISTENING, PROCESSING, RESPONDING.
- Playback Completion: Device signals end of TTS playback.
- OTA Updates: OTA update works while device is idle.
3.3 Non-Functional Requirements
- Performance: Audio capture stable for 5 minutes.
- Reliability: OTA succeeds 3 times in a row while idle.
- Usability: State transitions visible within 200 ms.
3.4 Example Usage / Output
I (000500) esphome: connected to HA
I (001000) voice: state=LISTENING
I (003000) voice: state=PROCESSING
I (004500) voice: state=RESPONDING
I (006000) voice: state=IDLE
3.5 Data Formats / Schemas / Protocols
ESPHome uses Home Assistant native API. No custom protocol required.
3.6 Edge Cases
- Device not discovered: mDNS or VLAN issues.
- OTA failure due to low heap: log and retry when idle.
- Audio garbled: sample rate mismatch or incorrect I2S pins.
3.7 Real World Outcome
3.7.1 How to Run (Copy/Paste)
esphome run voice_satellite.yaml
3.7.2 Golden Path Demo (Deterministic)
Use fixed test command “Turn on the light” and fixed timestamp.
Expected serial output:
I (000000) esphome: clock=2026-01-01T00:00:00Z
I (000500) voice: state=LISTENING
I (001500) voice: state=PROCESSING
I (002500) voice: state=RESPONDING
I (004000) voice: playback complete
I (004100) voice: state=IDLE
3.7.3 Failure Demo (Deterministic)
Simulate low memory by enabling BLE and forcing a long voice session.
Expected serial output:
E (003000) voice: heap_low, abort session
I (003100) voice: state=IDLE
Expected behavior:
- Device stops streaming and returns to IDLE.
- UI shows error briefly.
4. Solution Architecture
4.1 High-Level Design
[ESPHome Voice Assistant] <-> [Home Assistant Assist Pipeline]
| | |
v v v
[Mic I2S] [Speaker I2S] [TTS/ASR]
4.2 Key Components
| Component | Responsibility | Key Decisions |
|---|---|---|
| ESPHome YAML | Configure hardware and pipeline | Minimal vs full config |
| Voice Assistant | Stream audio and manage states | Buffer size and event hooks |
| OTA | Firmware updates | Trigger only when idle |
| LEDs/UI | State feedback | Simple LED vs display |
4.3 Data Structures (No Full Code)
ESPHome is declarative; no custom data structures required for base use.
4.4 Algorithm Overview
Key Algorithm: Assist Session Lifecycle
- Trigger session (button or wake word).
- Stream audio to Home Assistant.
- Receive response and play TTS.
- Signal completion and return to idle.
Complexity Analysis:
- Time: O(n) per audio chunk
- Space: O(buffer size)
5. Implementation Guide
5.1 Development Environment Setup
pip install esphome
esphome wizard voice_satellite.yaml
5.2 Project Structure
project-root/
├── voice_satellite.yaml
└── secrets.yaml
5.3 The Core Question You’re Answering
“How can I integrate a voice device into Home Assistant without custom firmware?”
5.4 Concepts You Must Understand First
- ESPHome component model.
- Assist pipeline states and signaling.
- OTA update limitations.
5.5 Questions to Guide Your Design
- Which ESPHome components are essential and which can be removed?
- How will you indicate pipeline state to the user?
- When should OTA updates be allowed?
5.6 Thinking Exercise
List all components in your YAML and mark which ones are required for voice. Remove the rest.
5.7 The Interview Questions They’ll Ask
- What is the advantage of a standardized Assist Satellite entity?
- How do you ensure OTA reliability on constrained devices?
- Why can BLE destabilize audio pipelines?
5.8 Hints in Layers
Hint 1: Start from the ESPHome voice assistant example config.
Hint 2: Disable BLE and unnecessary sensors.
Hint 3: Use LED feedback first; add display later.
Hint 4: Log free heap and watch for dips.
5.9 Books That Will Help
| Topic | Book | Chapter | |——-|——|———| | Embedded system integration | Making Embedded Systems | Ch. 9 | | Architecture principles | Clean Architecture | Ch. 1-4 |
5.10 Implementation Phases
Phase 1: Basic ESPHome Bring-up (4-6 hours)
Goals:
- Device connects to Home Assistant.
Tasks:
- Create ESPHome YAML.
- Flash firmware and pair with HA.
Checkpoint: Device appears in HA.
Phase 2: Voice Pipeline (1-2 days)
Goals:
- Capture audio and play TTS.
Tasks:
- Configure mic and speaker.
- Test voice session.
Checkpoint: Device responds to a voice command.
Phase 3: Stability and OTA (1 day)
Goals:
- Ensure OTA updates work reliably.
Tasks:
- Run OTA update while idle.
- Log free heap during sessions.
Checkpoint: OTA succeeds three times in a row.
5.11 Key Implementation Decisions
| Decision | Options | Recommendation | Rationale | |———-|———|—————-|———–| | BLE enabled | On, Off | Off | Reduces RAM pressure | | UI feedback | LEDs, Display | LEDs first | Lower complexity | | OTA timing | Any time, Idle only | Idle only | Avoid memory conflicts |
6. Testing Strategy
6.1 Test Categories
| Category | Purpose | Examples | |———-|———|———-| | Integration Tests | HA pipeline | Voice command test | | Stability Tests | Long session | 10-minute session | | OTA Tests | Firmware update | 3 updates in a row |
6.2 Critical Test Cases
- Discovery: Device appears on HA within 60 seconds.
- Playback End: Device signals end of TTS playback.
- Low Heap: Voice session aborts safely when heap low.
6.3 Test Data
Test command: "Turn on the light"
7. Common Pitfalls & Debugging
7.1 Frequent Mistakes
| Pitfall | Symptom | Solution |
|———|———|———-|
| BLE enabled | Random resets | Disable BLE |
| Wrong I2S pins | Silence | Verify wiring and pins |
| Missing playback end | HA stuck in RESPONDING | Ensure on_end triggers |
7.2 Debugging Strategies
- Use HA logs to verify state changes.
- Log free heap and minimum heap during sessions.
7.3 Performance Traps
High log level can cause audio glitches. Keep logging at INFO.
8. Extensions & Challenges
8.1 Beginner Extensions
- Add a mute switch in ESPHome.
- Add a Wi-Fi signal strength sensor.
8.2 Intermediate Extensions
- Add a display for pipeline states.
- Add a local wake word button.
8.3 Advanced Extensions
- Create a custom ESPHome component for advanced UI.
- Add per-room context metadata to Assist sessions.
9. Real-World Connections
9.1 Industry Applications
- Smart-home voice satellites.
- Multi-room intercom systems.
9.2 Related Open Source Projects
- ESPHome voice assistant example configs.
- Home Assistant Assist pipeline.
9.3 Interview Relevance
- System integration and OTA strategies.
- Resource constraint management.
10. Resources
10.1 Essential Reading
- ESPHome Voice Assistant docs.
- Home Assistant Assist Satellite docs.
10.2 Video Resources
- ESPHome community tutorials.
- Home Assistant Assist feature overviews.
10.3 Tools & Documentation
- ESPHome CLI
- Home Assistant logs
10.4 Related Projects in This Series
- P03-the-dumb-chatbot-streaming-audio-api.md - Custom streaming.
- P05-the-full-stack-xiaozhi-clone.md - Full custom firmware.
11. Self-Assessment Checklist
11.1 Understanding
- I can explain the Assist pipeline states.
- I can describe why OTA needs memory headroom.
- I can explain why BLE impacts stability.
11.2 Implementation
- Device appears as Assist Satellite.
- Voice sessions complete with correct states.
- OTA updates succeed while idle.
11.3 Growth
- I can tune ESPHome for audio stability.
- I can debug HA integration issues.
12. Submission / Completion Criteria
Minimum Viable Completion:
- Device connects to HA and runs a voice session.
- UI/LED shows state changes.
Full Completion:
- OTA updates reliable.
- 10-minute voice session without reset.
Excellence (Going Above & Beyond):
- Added display-based UI and custom animations.
- Logged and documented minimum free heap.
13 Additional Content Rules (Hard Requirements)
13.1 Determinism
- Golden demo uses fixed command and fixed timestamp.
13.2 Outcome Completeness
- Golden path demo in Section 3.7.2.
- Failure demo in Section 3.7.3.
13.3 Cross-Linking
- Cross-links included in Section 2 and Section 10.4.
13.4 No Placeholder Text
All sections are fully filled with specific content.