Project 11: The Voice-Activated “JARVIS” (Whisper & TTS)

Build a voice-to-voice assistant: microphone → speech-to-text → agent → text-to-speech, with streaming and interruption handling.

Quick Reference

Attribute      | Value
Difficulty     | Level 3: Advanced
Time Estimate  | 25–40 hours
Language       | Python (Alternatives: Swift, JavaScript)
Prerequisites  | Audio basics, async I/O, websockets or streaming APIs
Key Topics     | VAD, streaming STT, streaming TTS, latency budgeting, full-duplex interruption

1. Learning Objectives

By completing this project, you will:

  1. Build a voice pipeline with clear latency budgets per stage.
  2. Implement voice activity detection (VAD) to detect when speech starts/ends.
  3. Stream partial transcripts into the agent loop for responsiveness.
  4. Stream TTS audio while text is still being generated.
  5. Implement interruption and cancellation (user talks over the assistant).

2. Theoretical Foundation

2.1 Core Concepts

  • Latency is UX: For a voice assistant to feel alive, the time from end of speech to start of response audio should be around one second or less (or as low as feasible).
  • VAD (Voice Activity Detection): Detects speech segments without push-to-talk; reduces accidental triggers and wasted tokens.
  • Streaming:
    • Streaming STT produces partial transcripts, so downstream stages can start before the utterance is finished.
    • A streaming LLM lets TTS start on early tokens instead of waiting for the full completion.
    • Streaming TTS lets you play audio as it’s synthesized.
  • Interruption/cancellation: A voice assistant is effectively a real-time system; cancellation must propagate through all stages.
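
As a rough illustration, a per-stage latency budget can be written down explicitly and checked against measurements. The numbers below are illustrative assumptions to tune for your stack, not benchmarks:

# Hypothetical latency budget (milliseconds) for one voice turn.
LATENCY_BUDGET_MS = {
    "vad_endpointing": 200,   # decide the user has stopped speaking
    "stt_finalize": 300,      # final transcript after end of speech
    "llm_first_token": 300,   # first streamed token from the agent
    "tts_first_audio": 200,   # first audible TTS chunk
}

def over_budget(timings_ms: dict[str, int]) -> list[str]:
    """Return the stages whose measured latency exceeds the budget."""
    return [stage for stage, budget in LATENCY_BUDGET_MS.items()
            if timings_ms.get(stage, 0) > budget]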

2.2 Why This Matters

Voice is the most “natural” assistant interface. If you can make it feel responsive and robust, you’ve mastered real-time orchestration.

2.3 Common Misconceptions

  • “Voice is just STT + chat + TTS.” It’s also buffering, streaming, concurrency, and cancellation.
  • “Higher accuracy STT always wins.” Latency and robustness often matter more for conversational UX.

3. Project Specification

3.1 What You Will Build

A local application that:

  • Listens for speech (optionally a wake word)
  • Transcribes speech
  • Sends the text into your assistant (can reuse Project 6 routing)
  • Speaks back with TTS

3.2 Functional Requirements

  1. Audio capture: microphone input with configurable device selection.
  2. VAD: segment speech and detect end-of-utterance.
  3. STT: transcribe (streaming if possible).
  4. Assistant: run a tool-using agent to generate a reply.
  5. TTS: synthesize voice (streaming if possible) and play it.
  6. Interruption: if the user speaks, stop TTS playback and cancel current response.

3.3 Non-Functional Requirements

  • Latency: measure and display timings per stage (STT, LLM, TTS).
  • Robustness: handle background noise, silent input, and API failures.
  • Privacy: document what audio/text is sent off-device (if using cloud STT/TTS).

3.4 Example Usage / Output

[VAD] Speech detected…
[STT] "Add olive oil to my shopping list"
[Agent] tool call: add_to_list("olive oil")
[TTS] "Done. Olive oil is on your list."

4. Solution Architecture

4.1 High-Level Design

┌────────────┐  audio  ┌────────────┐  audio  ┌───────────────┐  text   ┌────────────┐  text   ┌────────────┐  audio  ┌─────────┐
│ Microphone │────────▶│ VAD/Buffer │────────▶│ STT (Whisper) │────────▶│ Agent Loop │────────▶│ TTS Engine │────────▶│ Speaker │
└────────────┘         └────────────┘         └───────────────┘         └─────┬──────┘         └────────────┘         └─────────┘
                                                      ▲                       │
                                                      │ cancel/interrupt      │ tool calls
                                                      └───────────────────────┘

4.2 Key Components

Component    | Responsibility                 | Key Decisions
VAD          | detect speech boundaries       | thresholding vs model VAD
STT          | transcription                  | streaming vs batch
Agent        | produce answer and tool calls  | reuse Project 6 patterns
TTS          | synthesize audio               | streaming vs batch; voice selection
Cancellation | interrupt pipeline             | cancellation token propagation
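
To keep these components swappable (cloud vs local STT/TTS, threshold vs model VAD), it helps to define a minimal interface per stage. The Protocol names below are assumptions for illustration, not a specific library's API:

from typing import AsyncIterator, Protocol

class VAD(Protocol):
    """Classify a single audio frame as speech or silence."""
    def is_speech(self, frame: bytes, sample_rate: int) -> bool: ...

class STT(Protocol):
    """Consume an audio stream, yield partial then final transcripts."""
    def transcribe_stream(self, audio: AsyncIterator[bytes]) -> AsyncIterator[str]: ...

class TTS(Protocol):
    """Consume text chunks, yield playable audio chunks."""
    def synthesize_stream(self, text: AsyncIterator[str]) -> AsyncIterator[bytes]: ...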

4.3 Data Structures

from dataclasses import dataclass

@dataclass(frozen=True)
class Timing:
    """Per-stage latencies, in milliseconds, for a single voice turn."""
    vad_ms: int   # voice activity detection / end-of-utterance
    stt_ms: int   # speech-to-text
    llm_ms: int   # agent / LLM response
    tts_ms: int   # text-to-speech synthesis

4.4 Algorithm Overview

Key Algorithm: voice loop

  1. Continuously capture audio frames.
  2. VAD detects speech start; buffer until end-of-speech.
  3. STT transcribes (stream partials if available).
  4. Agent generates response (stream tokens if available).
  5. TTS converts response to audio (stream audio if available).
  6. If user speaks during TTS, cancel TTS and start a new turn.

Complexity Analysis:

  • Time: real-time streaming; dominated by STT/LLM/TTS latencies
  • Space: audio buffers + transcripts
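
A minimal asyncio sketch of this loop is shown below; `capture_frames`, `vad`, `stt`, `agent`, `tts`, and `play_stream` are hypothetical helpers standing in for your implementations, and interruption (step 6) is noted but not wired up here:

import asyncio

async def voice_loop(capture_frames, vad, stt, agent, tts, play_stream):
    # Steps 1-5 of the algorithm; step 6 (interruption) would cancel the
    # playback task from a concurrent listener and is omitted for brevity.
    while True:
        # 1-2. Capture frames; buffer while VAD reports speech, stop at end of utterance.
        utterance: list[bytes] = []
        async for frame in capture_frames():
            if vad.is_speech(frame):
                utterance.append(frame)
            elif utterance:
                break

        # 3. Transcribe the buffered audio (batch here; streaming STT is the upgrade path).
        text = await stt.transcribe(b"".join(utterance))
        if not text.strip():
            continue  # silence or noise: go back to listening

        # 4-5. Stream agent tokens through TTS and play audio as it arrives.
        playback = asyncio.create_task(
            play_stream(tts.synthesize_stream(agent.respond(text)))
        )
        await playback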

5. Implementation Guide

5.1 Development Environment Setup

python -m venv .venv
source .venv/bin/activate
pip install pydantic sounddevice numpy

5.2 Project Structure

voice-jarvis/
├── src/
│   ├── main.py
│   ├── audio_capture.py
│   ├── vad.py
│   ├── stt.py
│   ├── agent.py
│   ├── tts.py
│   └── cancel.py
└── data/
    └── logs/

5.3 Implementation Phases

Phase 1: Push-to-talk baseline (6–10h)

Goals:

  • Record audio → STT → text response → TTS playback.

Tasks:

  1. Add keypress to start/stop recording.
  2. Transcribe and speak back.
  3. Log timings.

Checkpoint: You can do a full voice round-trip reliably.
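
A push-to-talk round trip might look like the sketch below. It uses sounddevice (installed in 5.1); `transcribe`, `respond`, and `synthesize` are placeholders for whichever STT, agent, and TTS you choose, not real APIs:

import time
import numpy as np
import sounddevice as sd

SAMPLE_RATE = 16_000  # 16 kHz mono is a common STT input format

def record_clip(seconds: float = 5.0) -> np.ndarray:
    """Record a fixed-length clip; replace with keypress start/stop later."""
    audio = sd.rec(int(seconds * SAMPLE_RATE), samplerate=SAMPLE_RATE,
                   channels=1, dtype="float32")
    sd.wait()  # block until recording finishes
    return audio[:, 0]

def round_trip(transcribe, respond, synthesize) -> None:
    """One turn: record -> STT -> agent -> TTS, logging per-stage timings."""
    audio = record_clip()
    t0 = time.perf_counter()
    text = transcribe(audio, SAMPLE_RATE)
    t1 = time.perf_counter()
    reply = respond(text)
    t2 = time.perf_counter()
    speech, rate = synthesize(reply)  # assumed to return samples and a sample rate
    t3 = time.perf_counter()
    print(f"stt={1000*(t1-t0):.0f}ms  llm={1000*(t2-t1):.0f}ms  tts={1000*(t3-t2):.0f}ms")
    sd.play(speech, rate)
    sd.wait()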

Phase 2: VAD + streaming (8–14h)

Goals:

  • Reduce latency and remove push-to-talk.

Tasks:

  1. Implement VAD segmentation.
  2. Add streaming STT and/or streaming LLM.
  3. Start TTS quickly using partial text chunks.

Checkpoint: Time-to-first-audio decreases substantially vs baseline.
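
A simple energy-threshold VAD with hangover frames is enough to get started before swapping in a model-based VAD; the threshold and frame count below are guesses to tune against your microphone:

import numpy as np

class EnergyVAD:
    """Naive RMS-threshold VAD with hangover so word endings aren't clipped."""

    def __init__(self, threshold: float = 0.01, hangover_frames: int = 15):
        self.threshold = threshold              # RMS level treated as speech
        self.hangover_frames = hangover_frames  # quiet frames tolerated before end-of-utterance
        self._quiet = 0
        self.speaking = False

    def update(self, frame: np.ndarray) -> bool:
        """Feed one frame of float32 samples; return True while inside a speech segment."""
        rms = float(np.sqrt(np.mean(frame ** 2)))
        if rms >= self.threshold:
            self.speaking = True
            self._quiet = 0
        elif self.speaking:
            self._quiet += 1
            if self._quiet > self.hangover_frames:
                self.speaking = False
        return self.speaking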

Phase 3: Interruption + robustness (8–16h)

Goals:

  • Make it feel natural.

Tasks:

  1. Cancellation token across STT/LLM/TTS.
  2. Stop playback when user speaks.
  3. Handle background noise and false triggers.

Checkpoint: You can interrupt the assistant mid-sentence and it recovers.
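
One way to wire interruption is to race the current turn against a barge-in event and cancel the loser; this is an illustrative asyncio pattern, not the only option:

import asyncio

async def run_turn(pipeline, barge_in: asyncio.Event) -> None:
    """Run one assistant turn (a hypothetical STT -> agent -> TTS coroutine) and
    cancel it the moment `barge_in` is set by the listener."""
    turn = asyncio.create_task(pipeline())
    interrupt = asyncio.create_task(barge_in.wait())
    done, _ = await asyncio.wait({turn, interrupt}, return_when=asyncio.FIRST_COMPLETED)
    if interrupt in done:
        turn.cancel()               # propagates into STT/LLM/TTS awaits
        try:
            await turn              # let stages run their cleanup handlers
        except asyncio.CancelledError:
            pass
    else:
        interrupt.cancel()          # turn finished normally; drop the watcher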

5.4 Key Implementation Decisions

Decision  | Options                 | Recommendation                                  | Rationale
STT       | cloud Whisper vs local  | start cloud or local depending on privacy goal  | speed vs privacy
TTS       | cloud vs local          | start with reliable TTS                         | UX matters; swap later
Wake word | simple keyword vs model | optional; start simple                          | reduce complexity early

6. Testing Strategy

6.1 Test Categories

Category | Purpose             | Examples
Unit     | VAD logic           | silence vs speech segmentation
Replay   | deterministic audio | run against saved WAV fixtures
Latency  | budgets             | fail if regression beyond threshold

6.2 Critical Test Cases

  1. Noise robustness: background noise doesn’t trigger speech start constantly.
  2. Interruption: user starts speaking → TTS stops within 200ms.
  3. Streaming: partial response audio begins before full response completes.
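
The latency case can be enforced against replay fixtures. A pytest sketch is below; `run_turn_on_fixture` and the WAV path are hypothetical stand-ins for your own replay harness:

# test_latency.py — assumes a replay helper that runs a saved WAV through the
# pipeline and returns a Timing (see 4.3). Helper name and path are hypothetical.
from src.replay import run_turn_on_fixture

MAX_FIRST_AUDIO_MS = 1_000  # end of speech -> first response audio

def test_latency_budget_regression():
    timing = run_turn_on_fixture("data/fixtures/shopping_list.wav")
    first_audio_ms = timing.stt_ms + timing.llm_ms + timing.tts_ms
    assert first_audio_ms <= MAX_FIRST_AUDIO_MS, f"regression: {first_audio_ms}ms"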

7. Common Pitfalls & Debugging

Pitfall             | Symptom             | Solution
VAD too sensitive   | constant triggers   | tune thresholds; add hangover time
Audio device issues | no sound            | device selection and sample rate settings
Latency spikes      | “feels dead”        | instrument each stage; optimize bottleneck
Cancellation bugs   | pipeline gets stuck | propagate cancel tokens and clean up tasks

8. Extensions & Challenges

8.1 Beginner Extensions

  • Add wake word toggle.
  • Add a “repeat last answer” voice command.

8.2 Intermediate Extensions

  • Add barge-in with partial transcript handling.
  • Add per-user voice profiles and style preferences.

8.3 Advanced Extensions

  • Full duplex (listen while speaking) with echo cancellation.
  • Local-only voice stack (ties into Project 9).

9. Real-World Connections

9.1 Industry Applications

  • Voice assistants, call-center copilots, accessibility tools.

9.2 Interview Relevance

  • Streaming systems, latency budgets, cancellation propagation, and real-time UX engineering.

10. Resources

10.1 Essential Reading

  • AI Engineering (Chip Huyen) — latency and production constraints

10.2 Tools & Documentation

  • Whisper/STT docs (cloud or local)
  • TTS provider docs (streaming output)
  • WebRTC VAD references (segmentation patterns)
  • Previous: Project 10 (monitoring) — instrument latency per stage
  • Next: Project 12 (self-improving) — tool-making to extend voice assistant capabilities

11. Self-Assessment Checklist

  • I can explain where latency comes from in voice-to-voice pipelines.
  • I can show VAD segmentation behavior on noisy audio.
  • I can interrupt responses and recover cleanly.
  • I can stream output and measure time-to-first-audio.

12. Submission / Completion Criteria

Minimum Viable Completion:

  • Voice capture → STT → agent → TTS round-trip
  • Basic timing logs per stage

Full Completion:

  • VAD segmentation and streaming for lower latency
  • Interruption/cancellation handling

Excellence (Going Above & Beyond):

  • Full duplex with echo cancellation and local-only voice stack

This guide was generated from project_based_ideas/AI_PERSONAL_ASSISTANTS_MASTERY.md. For the complete sprint overview, see project_based_ideas/AI_PERSONAL_ASSISTANTS_MASTERY/README.md.