Project 11: The Voice-Activated “JARVIS” (Whisper & TTS)
Build a voice-to-voice assistant: microphone → speech-to-text → agent → text-to-speech, with streaming and interruption handling.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Level 3: Advanced |
| Time Estimate | 25–40 hours |
| Language | Python (Alternatives: Swift, JavaScript) |
| Prerequisites | Audio basics, async I/O, websockets or streaming APIs |
| Key Topics | VAD, streaming STT, streaming TTS, latency budgeting, full-duplex interruption |
1. Learning Objectives
By completing this project, you will:
- Build a voice pipeline with clear latency budgets per stage.
- Implement voice activity detection (VAD) to detect when speech starts/ends.
- Stream partial transcripts into the agent loop for responsiveness.
- Stream TTS audio while text is still being generated.
- Implement interruption and cancellation (user talks over the assistant).
2. Theoretical Foundation
2.1 Core Concepts
- Latency is UX: for the assistant to feel alive, the time from the end of user speech to the start of response audio should stay near one second or below (see the budget sketch after this list).
- VAD (Voice Activity Detection): Detects speech segments without push-to-talk; reduces accidental triggers and wasted tokens.
- Streaming:
  - Streaming STT emits partial transcripts, so downstream stages can start before the utterance is finalized.
  - A streaming LLM emits tokens as they are generated, so TTS can begin before the full reply exists.
  - Streaming TTS plays audio as it is synthesized.
- Interruption/cancellation: A voice assistant is effectively a real-time system; cancellation must propagate through all stages.
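For example, a per-stage budget that fits inside one second might look like the sketch below (the targets are illustrative assumptions, not measurements):

```python
# Illustrative per-stage latency budget (ms), measured from end of user
# speech to the first audible chunk of the reply. Tune per stack.
LATENCY_BUDGET_MS = {
    "vad_endpoint": 150,     # confirm end-of-utterance
    "stt_final": 300,        # finalize the transcript
    "llm_first_token": 350,  # first tokens of the agent's reply
    "tts_first_audio": 150,  # first audible TTS chunk
}
assert sum(LATENCY_BUDGET_MS.values()) <= 1000  # keep the total under ~1 s
```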
2.2 Why This Matters
Voice is the most “natural” assistant interface. If you can make it feel responsive and robust, you’ve mastered real-time orchestration.
2.3 Common Misconceptions
- “Voice is just STT + chat + TTS.” It’s also buffering, streaming, concurrency, and cancellation.
- “Higher accuracy STT always wins.” Latency and robustness often matter more for conversational UX.
3. Project Specification
3.1 What You Will Build
A local application that:
- Listens for speech (optionally a wake word)
- Transcribes speech
- Sends the text into your assistant (can reuse Project 6 routing)
- Speaks back with TTS
3.2 Functional Requirements
- Audio capture: microphone input with configurable device selection.
- VAD: segment speech and detect end-of-utterance.
- STT: transcribe (streaming if possible).
- Assistant: run a tool-using agent to generate a reply.
- TTS: synthesize voice (streaming if possible) and play it.
- Interruption: if the user speaks, stop TTS playback and cancel current response.
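For the audio-capture requirement, sounddevice handles device enumeration and recording; a minimal sketch (the device index, duration, and sample rate are placeholder values to adapt):

```python
import sounddevice as sd

SAMPLE_RATE = 16_000  # 16 kHz mono is a common STT input format

# List input/output devices so the user can pick one by index.
print(sd.query_devices())

# Record three seconds from a chosen device (index 0 is a placeholder).
audio = sd.rec(int(3 * SAMPLE_RATE), samplerate=SAMPLE_RATE,
               channels=1, dtype="int16", device=0)
sd.wait()  # block until the recording finishes
```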
3.3 Non-Functional Requirements
- Latency: measure and display timings per stage (STT, LLM, TTS).
- Robustness: handle background noise, silent input, and API failures.
- Privacy: document what audio/text is sent off-device (if using cloud STT/TTS).
3.4 Example Usage / Output
```text
[VAD] Speech detected…
[STT] "Add olive oil to my shopping list"
[Agent] tool call: add_to_list("olive oil")
[TTS] "Done. Olive oil is on your list."
```
4. Solution Architecture
4.1 High-Level Design
```text
┌────────────┐ audio ┌────────────┐ audio ┌───────────────┐ text ┌────────────┐ text ┌────────────┐ audio ┌─────────┐
│ Microphone │──────▶│ VAD/Buffer │──────▶│ STT (Whisper) │─────▶│ Agent Loop │─────▶│ TTS Engine │──────▶│ Speaker │
└────────────┘       └──────┬─────┘       └───────────────┘      └────────────┘      └──────▲─────┘       └─────────┘
                            │                                     (tool calls)              │
                            └──────── cancel/interrupt (user barge-in) ─────────────────────┘
```
4.2 Key Components
| Component | Responsibility | Key Decisions |
|---|---|---|
| VAD | detect speech boundaries | thresholding vs model VAD |
| STT | transcription | streaming vs batch |
| Agent | produce answer and tool calls | reuse Project 6 patterns |
| TTS | synthesize audio | streaming vs batch; voice selection |
| Cancellation | interrupt pipeline | cancellation token propagation |
4.3 Data Structures
```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Timing:
    """Per-stage latencies for a single turn, in milliseconds."""
    vad_ms: int  # end-of-utterance detection
    stt_ms: int  # transcription
    llm_ms: int  # agent response generation
    tts_ms: int  # synthesis up to first audio
```
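A quick usage sketch for turn-level logging (the numbers are illustrative):

```python
import logging

t = Timing(vad_ms=140, stt_ms=280, llm_ms=390, tts_ms=160)
total = t.vad_ms + t.stt_ms + t.llm_ms + t.tts_ms
logging.info("turn: %d ms total (vad=%d stt=%d llm=%d tts=%d)",
             total, t.vad_ms, t.stt_ms, t.llm_ms, t.tts_ms)
```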
4.4 Algorithm Overview
Key Algorithm: voice loop
- Continuously capture audio frames.
- VAD detects speech start; buffer until end-of-speech.
- STT transcribes (stream partials if available).
- Agent generates response (stream tokens if available).
- TTS converts response to audio (stream audio if available).
- If user speaks during TTS, cancel TTS and start a new turn.
Complexity Analysis:
- Time: real-time streaming; dominated by STT/LLM/TTS latencies
- Space: audio buffers + transcripts
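A minimal asyncio sketch of the loop above; `record_utterance`, `transcribe`, `stream_reply`, and `speak` are hypothetical stand-ins for whatever VAD/STT/LLM/TTS stack you choose:

```python
import asyncio

async def voice_loop(record_utterance, transcribe, stream_reply, speak):
    """One conversational turn per iteration; a new utterance cancels any
    reply that is still being generated or spoken (barge-in)."""
    reply_task = None
    while True:
        audio = await record_utterance()   # VAD-delimited audio buffer
        if reply_task and not reply_task.done():
            reply_task.cancel()            # user interrupted: cancel the reply
        text = await transcribe(audio)     # STT (partials omitted for brevity)
        reply_task = asyncio.create_task(respond(text, stream_reply, speak))

async def respond(text, stream_reply, speak):
    """Flush complete sentences to TTS so audio starts before the full reply."""
    sentence = ""
    async for token in stream_reply(text):   # streaming LLM tokens
        sentence += token
        if sentence.rstrip().endswith((".", "!", "?")):
            await speak(sentence)
            sentence = ""
    if sentence:                              # speak any trailing fragment
        await speak(sentence)
```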
5. Implementation Guide
5.1 Development Environment Setup
```bash
python -m venv .venv
source .venv/bin/activate
pip install pydantic sounddevice numpy
```
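Depending on your STT/VAD choices you will likely add more packages; for example (both optional, assuming a local Whisper model and a WebRTC-style VAD):

```bash
pip install openai-whisper   # local Whisper STT
pip install webrtcvad        # lightweight frame-based VAD
```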
5.2 Project Structure
```text
voice-jarvis/
├── src/
│   ├── main.py
│   ├── audio_capture.py
│   ├── vad.py
│   ├── stt.py
│   ├── agent.py
│   ├── tts.py
│   └── cancel.py
└── data/
    └── logs/
```
5.3 Implementation Phases
Phase 1: Push-to-talk baseline (6–10h)
Goals:
- Record audio → STT → text response → TTS playback.
Tasks:
- Add keypress to start/stop recording.
- Transcribe and speak back.
- Log timings.
Checkpoint: You can do a full voice round-trip reliably.
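A compressed sketch of the Phase 1 round-trip; `transcribe` and `speak` are hypothetical stand-ins for your STT/TTS calls, and the fixed duration replaces real key handling:

```python
import time
import sounddevice as sd

SAMPLE_RATE = 16_000

def push_to_talk_turn(seconds, transcribe, speak):
    # Record a fixed-length clip (stands in for hold-to-record key handling).
    audio = sd.rec(int(seconds * SAMPLE_RATE), samplerate=SAMPLE_RATE,
                   channels=1, dtype="int16")
    sd.wait()

    t0 = time.monotonic()
    text = transcribe(audio)        # STT stage
    t1 = time.monotonic()
    reply = f"You said: {text}"     # placeholder for the agent stage
    speak(reply)                    # TTS stage
    t2 = time.monotonic()
    print(f"[timing] stt={1000 * (t1 - t0):.0f} ms  tts={1000 * (t2 - t1):.0f} ms")
```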
Phase 2: VAD + streaming (8–14h)
Goals:
- Reduce latency and remove push-to-talk.
Tasks:
- Implement VAD segmentation.
- Add streaming STT and/or streaming LLM.
- Start TTS quickly using partial text chunks.
Checkpoint: Time-to-first-audio decreases substantially vs baseline.
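A minimal energy-threshold VAD for the segmentation task, with hangover to avoid chopping mid-sentence pauses (the threshold and hangover length are assumptions to tune; a model VAD such as webrtcvad has the same frame-in, decision-out shape):

```python
import numpy as np

class EnergyVAD:
    """Emits "speech_start"/"speech_end" events from fixed-size int16 frames."""

    def __init__(self, threshold=500.0, hangover=15):
        self.threshold = threshold  # RMS level for int16 audio; tune per mic
        self.hangover = hangover    # quiet frames tolerated before end-of-speech
        self.quiet_frames = 0
        self.in_speech = False

    def process(self, frame: np.ndarray):
        rms = float(np.sqrt(np.mean(frame.astype(np.float64) ** 2)))
        if rms >= self.threshold:
            self.quiet_frames = 0
            if not self.in_speech:
                self.in_speech = True
                return "speech_start"
        elif self.in_speech:
            self.quiet_frames += 1
            if self.quiet_frames >= self.hangover:  # hangover elapsed
                self.in_speech = False
                return "speech_end"
        return None
```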
Phase 3: Interruption + robustness (8–16h)
Goals:
- Make it feel natural.
Tasks:
- Cancellation token across STT/LLM/TTS.
- Stop playback when user speaks.
- Handle background noise and false triggers.
Checkpoint: You can interrupt the assistant mid-sentence and it recovers.
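One way to wire the cancellation, using asyncio (a sketch; `generate_and_speak` and `stop_playback` are hypothetical hooks into your pipeline and audio output):

```python
import asyncio

async def run_turn(generate_and_speak, barge_in: asyncio.Event, stop_playback):
    """Cancel the whole LLM+TTS pipeline as soon as the user starts speaking."""
    reply = asyncio.create_task(generate_and_speak())
    interrupt = asyncio.create_task(barge_in.wait())  # set by the VAD
    done, _ = await asyncio.wait(
        {reply, interrupt}, return_when=asyncio.FIRST_COMPLETED
    )
    if interrupt in done and not reply.done():
        reply.cancel()         # CancelledError propagates through LLM/TTS awaits
        stop_playback()        # flush/stop the audio device immediately
        try:
            await reply        # wait for clean teardown
        except asyncio.CancelledError:
            pass
    else:
        interrupt.cancel()     # reply finished normally; drop the watcher
```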
5.4 Key Implementation Decisions
| Decision | Options | Recommendation | Rationale |
|---|---|---|---|
| STT | cloud Whisper vs local Whisper | choose by privacy goal: cloud for speed, local for privacy | speed vs privacy |
| TTS | cloud vs local | start with the most reliable option available | voice UX matters most; swap the engine later |
| Wake word | simple keyword vs model | optional; start simple | reduces complexity early |
6. Testing Strategy
6.1 Test Categories
| Category | Purpose | Examples |
|---|---|---|
| Unit | VAD logic | silence vs speech segmentation |
| Replay | deterministic audio | run against saved WAV fixtures |
| Latency | budgets | fail if regression beyond threshold |
6.2 Critical Test Cases
- Noise robustness: background noise doesn’t trigger speech start constantly.
- Interruption: user starts speaking → TTS stops within 200ms.
- Streaming: partial response audio begins before full response completes.
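A replay-style test sketch in pytest form; the fixture path, expected phrase, latency bound, and the `src.stt.transcribe` import are all placeholders for your own project:

```python
import time
import wave

import numpy as np

from src.stt import transcribe  # hypothetical: your STT wrapper

def test_fixture_transcription_and_latency():
    # Load a saved utterance (e.g., the section 3.4 shopping-list request).
    with wave.open("data/fixtures/shopping_list.wav", "rb") as f:
        audio = np.frombuffer(f.readframes(f.getnframes()), dtype=np.int16)

    t0 = time.monotonic()
    text = transcribe(audio)
    elapsed_ms = 1000 * (time.monotonic() - t0)

    assert "shopping list" in text.lower()  # content: deterministic replay
    assert elapsed_ms < 1500                # latency: fail on regression
```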
7. Common Pitfalls & Debugging
| Pitfall | Symptom | Solution |
|---|---|---|
| VAD too sensitive | constant triggers | tune thresholds; add hangover time |
| Audio device issues | no sound | device selection and sample rate settings |
| Latency spikes | “feels dead” | instrument each stage; optimize bottleneck |
| Cancellation bugs | pipeline gets stuck | propagate cancel tokens and clean up tasks |
8. Extensions & Challenges
8.1 Beginner Extensions
- Add wake word toggle.
- Add a “repeat last answer” voice command.
8.2 Intermediate Extensions
- Add barge-in with partial transcript handling.
- Add per-user voice profiles and style preferences.
8.3 Advanced Extensions
- Full duplex (listen while speaking) with echo cancellation.
- Local-only voice stack (ties into Project 9).
9. Real-World Connections
9.1 Industry Applications
- Voice assistants, call-center copilots, accessibility tools.
9.2 Interview Relevance
- Streaming systems, latency budgets, cancellation propagation, and real-time UX engineering.
10. Resources
10.1 Essential Reading
- AI Engineering (Chip Huyen) — latency and production constraints
10.2 Tools & Documentation
- Whisper/STT docs (cloud or local)
- TTS provider docs (streaming output)
- WebRTC VAD references (segmentation patterns)
10.3 Related Projects in This Series
- Previous: Project 10 (monitoring) — instrument latency per stage
- Next: Project 12 (self-improving) — tool-making to extend voice assistant capabilities
11. Self-Assessment Checklist
- I can explain where latency comes from in voice-to-voice pipelines.
- I can show VAD segmentation behavior on noisy audio.
- I can interrupt responses and recover cleanly.
- I can stream output and measure time-to-first-audio.
12. Submission / Completion Criteria
Minimum Viable Completion:
- Voice capture → STT → agent → TTS round-trip
- Basic timing logs per stage
Full Completion:
- VAD segmentation and streaming for lower latency
- Interruption/cancellation handling
Excellence (Going Above & Beyond):
- Full duplex with echo cancellation and local-only voice stack
This guide was generated from project_based_ideas/AI_PERSONAL_ASSISTANTS_MASTERY.md. For the complete sprint overview, see project_based_ideas/AI_PERSONAL_ASSISTANTS_MASTERY/README.md.