Project 11: The Voice-Activated “JARVIS” (Whisper & TTS)

Build a voice-to-voice assistant: microphone → speech-to-text → agent → text-to-speech, with streaming and interruption handling.

Quick Reference

Attribute      | Value
Difficulty     | Level 3: Advanced
Time Estimate  | 25–40 hours
Language       | Python (Alternatives: Swift, JavaScript)
Prerequisites  | Audio basics, async I/O, websockets or streaming APIs
Key Topics     | VAD, streaming STT, streaming TTS, latency budgeting, full-duplex interruption

1. Learning Objectives

By completing this project, you will:

  1. Build a voice pipeline with clear latency budgets per stage.
  2. Implement voice activity detection (VAD) to detect when speech starts/ends.
  3. Stream partial transcripts into the agent loop for responsiveness.
  4. Stream TTS audio while text is still being generated.
  5. Implement interruption and cancellation (user talks over the assistant).

2. Theoretical Foundation

2.1 Core Concepts

  • Latency is UX: For a voice assistant to feel alive, the time from end of speech to start of response audio should be around one second or less (or as low as feasible).
  • VAD (Voice Activity Detection): Detects speech segments without push-to-talk; reduces accidental triggers and wasted tokens.
  • Streaming:
    • Streaming STT produces partial transcripts, so downstream stages can start before the utterance is finished.
    • A streaming LLM lets TTS start on early tokens instead of waiting for the full completion.
    • Streaming TTS lets you play audio as it’s synthesized.
  • Interruption/cancellation: A voice assistant is effectively a real-time system; cancellation must propagate through all stages.
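
As a rough illustration, a per-stage latency budget can be written down explicitly and checked against measurements. The numbers below are illustrative assumptions to tune for your stack, not benchmarks:

# Hypothetical latency budget (milliseconds) for one voice turn.
LATENCY_BUDGET_MS = {
    "vad_endpointing": 200,   # decide the user has stopped speaking
    "stt_finalize": 300,      # final transcript after end of speech
    "llm_first_token": 300,   # first streamed token from the agent
    "tts_first_audio": 200,   # first audible TTS chunk
}

def over_budget(timings_ms: dict[str, int]) -> list[str]:
    """Return the stages whose measured latency exceeds the budget."""
    return [stage for stage, budget in LATENCY_BUDGET_MS.items()
            if timings_ms.get(stage, 0) > budget]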

2.2 Why This Matters

Voice is the most “natural” assistant interface. If you can make it feel responsive and robust, you’ve mastered real-time orchestration.

2.3 Common Misconceptions

  • “Voice is just STT + chat + TTS.” It’s also buffering, streaming, concurrency, and cancellation.
  • “Higher accuracy STT always wins.” Latency and robustness often matter more for conversational UX.

3. Project Specification

3.1 What You Will Build

A local application that:

  • Listens for speech (optionally a wake word)
  • Transcribes speech
  • Sends the text into your assistant (can reuse Project 6 routing)
  • Speaks back with TTS

3.2 Functional Requirements

  1. Audio capture: microphone input with configurable device selection.
  2. VAD: segment speech and detect end-of-utterance.
  3. STT: transcribe (streaming if possible).
  4. Assistant: run a tool-using agent to generate a reply.
  5. TTS: synthesize voice (streaming if possible) and play it.
  6. Interruption: if the user speaks, stop TTS playback and cancel current response.

3.3 Non-Functional Requirements

  • Latency: measure and display timings per stage (STT, LLM, TTS).
  • Robustness: handle background noise, silent input, and API failures.
  • Privacy: document what audio/text is sent off-device (if using cloud STT/TTS).

3.4 Example Usage / Output

[VAD] Speech detected…
[STT] "Add olive oil to my shopping list"
[Agent] tool call: add_to_list("olive oil")
[TTS] "Done. Olive oil is on your list."

4. Solution Architecture

4.1 High-Level Design

┌────────────┐  audio  ┌────────────┐  audio  ┌───────────────┐  text   ┌────────────┐  text   ┌────────────┐  audio  ┌─────────┐
│ Microphone │────────▶│ VAD/Buffer │────────▶│ STT (Whisper) │────────▶│ Agent Loop │────────▶│ TTS Engine │────────▶│ Speaker │
└────────────┘         └────────────┘         └───────────────┘         └─────┬──────┘         └────────────┘         └─────────┘
                                                      ▲                       │
                                                      │ cancel/interrupt      │ tool calls
                                                      └───────────────────────┘

4.2 Key Components

Component    | Responsibility                 | Key Decisions
VAD          | detect speech boundaries       | thresholding vs model VAD
STT          | transcription                  | streaming vs batch
Agent        | produce answer and tool calls  | reuse Project 6 patterns
TTS          | synthesize audio               | streaming vs batch; voice selection
Cancellation | interrupt pipeline             | cancellation token propagation
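
To keep these components swappable (cloud vs local STT/TTS, threshold vs model VAD), it helps to define a minimal interface per stage. The Protocol names below are assumptions for illustration, not a specific library's API:

from typing import AsyncIterator, Protocol

class VAD(Protocol):
    """Classify a single audio frame as speech or silence."""
    def is_speech(self, frame: bytes, sample_rate: int) -> bool: ...

class STT(Protocol):
    """Consume an audio stream, yield partial then final transcripts."""
    def transcribe_stream(self, audio: AsyncIterator[bytes]) -> AsyncIterator[str]: ...

class TTS(Protocol):
    """Consume text chunks, yield playable audio chunks."""
    def synthesize_stream(self, text: AsyncIterator[str]) -> AsyncIterator[bytes]: ...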

4.3 Data Structures

from dataclasses import dataclass

@dataclass(frozen=True)
class Timing:
    """Per-stage latencies, in milliseconds, for a single voice turn."""
    vad_ms: int   # voice activity detection / end-of-utterance
    stt_ms: int   # speech-to-text
    llm_ms: int   # agent / LLM response
    tts_ms: int   # text-to-speech synthesis

4.4 Algorithm Overview

Key Algorithm: voice loop

  1. Continuously capture audio frames.
  2. VAD detects speech start; buffer until end-of-speech.
  3. STT transcribes (stream partials if available).
  4. Agent generates response (stream tokens if available).
  5. TTS converts response to audio (stream audio if available).
  6. If user speaks during TTS, cancel TTS and start a new turn.

Complexity Analysis:

  • Time: real-time streaming; dominated by STT/LLM/TTS latencies
  • Space: audio buffers + transcripts
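
A minimal asyncio sketch of this loop is shown below; `capture_frames`, `vad`, `stt`, `agent`, `tts`, and `play_stream` are hypothetical helpers standing in for your implementations, and interruption (step 6) is noted but not wired up here:

import asyncio

async def voice_loop(capture_frames, vad, stt, agent, tts, play_stream):
    # Steps 1-5 of the algorithm; step 6 (interruption) would cancel the
    # playback task from a concurrent listener and is omitted for brevity.
    while True:
        # 1-2. Capture frames; buffer while VAD reports speech, stop at end of utterance.
        utterance: list[bytes] = []
        async for frame in capture_frames():
            if vad.is_speech(frame):
                utterance.append(frame)
            elif utterance:
                break

        # 3. Transcribe the buffered audio (batch here; streaming STT is the upgrade path).
        text = await stt.transcribe(b"".join(utterance))
        if not text.strip():
            continue  # silence or noise: go back to listening

        # 4-5. Stream agent tokens through TTS and play audio as it arrives.
        playback = asyncio.create_task(
            play_stream(tts.synthesize_stream(agent.respond(text)))
        )
        await playback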

5. Implementation Guide

5.1 Development Environment Setup

python -m venv .venv
source .venv/bin/activate
pip install pydantic sounddevice numpy

5.2 Project Structure

voice-jarvis/
├── src/
│   ├── main.py
│   ├── audio_capture.py
│   ├── vad.py
│   ├── stt.py
│   ├── agent.py
│   ├── tts.py
│   └── cancel.py
└── data/
    └── logs/

5.3 Implementation Phases

Phase 1: Push-to-talk baseline (6–10h)

Goals:

  • Record audio → STT → text response → TTS playback.

Tasks:

  1. Add keypress to start/stop recording.
  2. Transcribe and speak back.
  3. Log timings.

Checkpoint: You can do a full voice round-trip reliably.
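
A push-to-talk round trip might look like the sketch below. It uses sounddevice (installed in 5.1); `transcribe`, `respond`, and `synthesize` are placeholders for whichever STT, agent, and TTS you choose, not real APIs:

import time
import numpy as np
import sounddevice as sd

SAMPLE_RATE = 16_000  # 16 kHz mono is a common STT input format

def record_clip(seconds: float = 5.0) -> np.ndarray:
    """Record a fixed-length clip; replace with keypress start/stop later."""
    audio = sd.rec(int(seconds * SAMPLE_RATE), samplerate=SAMPLE_RATE,
                   channels=1, dtype="float32")
    sd.wait()  # block until recording finishes
    return audio[:, 0]

def round_trip(transcribe, respond, synthesize) -> None:
    """One turn: record -> STT -> agent -> TTS, logging per-stage timings."""
    audio = record_clip()
    t0 = time.perf_counter()
    text = transcribe(audio, SAMPLE_RATE)
    t1 = time.perf_counter()
    reply = respond(text)
    t2 = time.perf_counter()
    speech, rate = synthesize(reply)  # assumed to return samples and a sample rate
    t3 = time.perf_counter()
    print(f"stt={1000*(t1-t0):.0f}ms  llm={1000*(t2-t1):.0f}ms  tts={1000*(t3-t2):.0f}ms")
    sd.play(speech, rate)
    sd.wait()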

Phase 2: VAD + streaming (8–14h)

Goals:

  • Reduce latency and remove push-to-talk.

Tasks:

  1. Implement VAD segmentation.
  2. Add streaming STT and/or streaming LLM.
  3. Start TTS quickly using partial text chunks.

Checkpoint: Time-to-first-audio decreases substantially vs baseline.
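
A simple energy-threshold VAD with hangover frames is enough to get started before swapping in a model-based VAD; the threshold and frame count below are guesses to tune against your microphone:

import numpy as np

class EnergyVAD:
    """Naive RMS-threshold VAD with hangover so word endings aren't clipped."""

    def __init__(self, threshold: float = 0.01, hangover_frames: int = 15):
        self.threshold = threshold              # RMS level treated as speech
        self.hangover_frames = hangover_frames  # quiet frames tolerated before end-of-utterance
        self._quiet = 0
        self.speaking = False

    def update(self, frame: np.ndarray) -> bool:
        """Feed one frame of float32 samples; return True while inside a speech segment."""
        rms = float(np.sqrt(np.mean(frame ** 2)))
        if rms >= self.threshold:
            self.speaking = True
            self._quiet = 0
        elif self.speaking:
            self._quiet += 1
            if self._quiet > self.hangover_frames:
                self.speaking = False
        return self.speaking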

Phase 3: Interruption + robustness (8–16h)

Goals:

  • Make it feel natural.

Tasks:

  1. Cancellation token across STT/LLM/TTS.
  2. Stop playback when user speaks.
  3. Handle background noise and false triggers.

Checkpoint: You can interrupt the assistant mid-sentence and it recovers.
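
One way to wire interruption is to race the current turn against a barge-in event and cancel the loser; this is an illustrative asyncio pattern, not the only option:

import asyncio

async def run_turn(pipeline, barge_in: asyncio.Event) -> None:
    """Run one assistant turn (a hypothetical STT -> agent -> TTS coroutine) and
    cancel it the moment `barge_in` is set by the listener."""
    turn = asyncio.create_task(pipeline())
    interrupt = asyncio.create_task(barge_in.wait())
    done, _ = await asyncio.wait({turn, interrupt}, return_when=asyncio.FIRST_COMPLETED)
    if interrupt in done:
        turn.cancel()               # propagates into STT/LLM/TTS awaits
        try:
            await turn              # let stages run their cleanup handlers
        except asyncio.CancelledError:
            pass
    else:
        interrupt.cancel()          # turn finished normally; drop the watcher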

5.4 Key Implementation Decisions

Decision  | Options                 | Recommendation                                  | Rationale
STT       | cloud Whisper vs local  | start cloud or local depending on privacy goal  | speed vs privacy
TTS       | cloud vs local          | start with reliable TTS                         | UX matters; swap later
Wake word | simple keyword vs model | optional; start simple                          | reduce complexity early

6. Testing Strategy

6.1 Test Categories

Category | Purpose             | Examples
Unit     | VAD logic           | silence vs speech segmentation
Replay   | deterministic audio | run against saved WAV fixtures
Latency  | budgets             | fail if regression beyond threshold

6.2 Critical Test Cases

  1. Noise robustness: background noise doesn’t trigger speech start constantly.
  2. Interruption: user starts speaking → TTS stops within 200ms.
  3. Streaming: partial response audio begins before full response completes.
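
The latency case can be enforced against replay fixtures. A pytest sketch is below; `run_turn_on_fixture` and the WAV path are hypothetical stand-ins for your own replay harness:

# test_latency.py — assumes a replay helper that runs a saved WAV through the
# pipeline and returns a Timing (see 4.3). Helper name and path are hypothetical.
from src.replay import run_turn_on_fixture

MAX_FIRST_AUDIO_MS = 1_000  # end of speech -> first response audio

def test_latency_budget_regression():
    timing = run_turn_on_fixture("data/fixtures/shopping_list.wav")
    first_audio_ms = timing.stt_ms + timing.llm_ms + timing.tts_ms
    assert first_audio_ms <= MAX_FIRST_AUDIO_MS, f"regression: {first_audio_ms}ms"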

7. Common Pitfalls & Debugging

Pitfall             | Symptom             | Solution
VAD too sensitive   | constant triggers   | tune thresholds; add hangover time
Audio device issues | no sound            | device selection and sample rate settings
Latency spikes      | “feels dead”        | instrument each stage; optimize bottleneck
Cancellation bugs   | pipeline gets stuck | propagate cancel tokens and clean up tasks

8. Extensions & Challenges

8.1 Beginner Extensions

  • Add wake word toggle.
  • Add a “repeat last answer” voice command.

8.2 Intermediate Extensions

  • Add barge-in with partial transcript handling.
  • Add per-user voice profiles and style preferences.

8.3 Advanced Extensions

  • Full duplex (listen while speaking) with echo cancellation.
  • Local-only voice stack (ties into Project 9).

9. Real-World Connections

9.1 Industry Applications

  • Voice assistants, call-center copilots, accessibility tools.

9.2 Interview Relevance

  • Streaming systems, latency budgets, cancellation propagation, and real-time UX engineering.

10. Resources

10.1 Essential Reading

  • AI Engineering (Chip Huyen) — latency and production constraints

10.2 Tools & Documentation

  • Whisper/STT docs (cloud or local)
  • TTS provider docs (streaming output)
  • WebRTC VAD references (segmentation patterns)
  • Previous: Project 10 (monitoring) — instrument latency per stage
  • Next: Project 12 (self-improving) — tool-making to extend voice assistant capabilities

11. Self-Assessment Checklist

  • I can explain where latency comes from in voice-to-voice pipelines.
  • I can show VAD segmentation behavior on noisy audio.
  • I can interrupt responses and recover cleanly.
  • I can stream output and measure time-to-first-audio.

12. Submission / Completion Criteria

Minimum Viable Completion:

  • Voice capture → STT → agent → TTS round-trip
  • Basic timing logs per stage

Full Completion:

  • VAD segmentation and streaming for lower latency
  • Interruption/cancellation handling

Excellence (Going Above & Beyond):

  • Full duplex with echo cancellation and local-only voice stack

This guide was generated from project_based_ideas/AI_PERSONAL_ASSISTANTS_MASTERY.md. For the complete sprint overview, see project_based_ideas/AI_PERSONAL_ASSISTANTS_MASTERY/README.md.