Project 5: The “Ear” of Edge AI (XIAO ESP32S3 Sense)
Build a keyword-spotting device that listens through the on-board microphone and triggers an action locally using TinyML.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Advanced |
| Time Estimate | 1-2 weeks |
| Main Programming Language | C/C++ (ESP-IDF + TFLM) |
| Alternative Programming Languages | Arduino C++ (Edge Impulse export) |
| Coolness Level | Very High |
| Business Potential | Medium (voice triggers) |
| Prerequisites | DSP basics, sampling, C memory management |
| Key Topics | Audio sampling, MFCC/MFE, TinyML quantization |
1. Learning Objectives
By completing this project, you will:
- Capture microphone audio and buffer it reliably in real time.
- Extract MFCC or MFE features and feed them to a TinyML model.
- Deploy a quantized model using TensorFlow Lite Micro (TFLM).
- Tune thresholds to balance false positives and false negatives.
- Profile inference latency and memory usage on the ESP32-S3.
2. All Theory Needed (Per-Concept Breakdown)
2.1 Concept 1: Audio Sampling, Buffers, and DSP Features
Fundamentals
Audio sampling converts continuous sound into discrete digital samples. The sampling rate must be at least twice the highest frequency you care about (Nyquist). For keyword spotting, 16 kHz is common, which captures speech well. Samples are typically 16-bit signed integers. To feed a model, you must buffer samples into fixed-size windows and extract features such as MFCC (Mel-Frequency Cepstral Coefficients) or MFE (Mel-Frequency Energy). These features compress the audio into a lower-dimensional representation that captures phonetic content while ignoring irrelevant details. The buffer and feature pipeline must run in real time, or you will miss audio frames.
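These numbers translate directly into buffer sizes. The helper names below are illustrative, not part of any SDK; the sketch just shows how the 16 kHz / 16-bit choice determines RAM cost:

```c
#include <stdint.h>

/* Buffer sizing for 16 kHz, 16-bit mono audio (the usual KWS setup). */
#define SAMPLE_RATE_HZ   16000u
#define BYTES_PER_SAMPLE ((unsigned)sizeof(int16_t))  /* 16-bit signed PCM */

/* Samples needed to hold `ms` milliseconds of audio. */
static unsigned samples_for_ms(unsigned ms) {
    return SAMPLE_RATE_HZ * ms / 1000u;
}

/* RAM cost in bytes of buffering `ms` milliseconds of raw audio. */
static unsigned bytes_for_ms(unsigned ms) {
    return samples_for_ms(ms) * BYTES_PER_SAMPLE;
}
```

A 25 ms analysis frame is 400 samples (800 bytes), and a full 1 s context window costs 32 KB of RAM as raw int16, which is why the ring buffer must be sized deliberately.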
Deep Dive into the Concept
Sampling is not just taking numbers; it is a timing-critical pipeline. The microphone interface (I2S or PDM) produces samples at a fixed rate. The MCU must move those samples into memory using DMA or fast interrupt handlers. On the ESP32-S3 Sense, the microphone is typically connected through I2S. You configure the I2S peripheral for the sample rate, bit depth, and channel format. A DMA ring buffer receives blocks of samples. If you do not process them fast enough, the DMA buffer overruns and you lose data.
Feature extraction transforms raw audio into a representation that is more stable for machine learning. MFCC starts by framing the audio into short windows (e.g., 25 ms) with overlap (e.g., 10 ms stride). Each frame is windowed (Hamming window) to reduce spectral leakage, then transformed with an FFT to get the frequency spectrum. The spectrum is passed through a Mel filterbank to emphasize frequencies relevant to human perception. The log of the filterbank energies is then transformed via a DCT to yield MFCC coefficients. MFE is similar but uses the log filterbank energies directly. The resulting feature matrix is a time-frequency image that a neural network can classify.
Real-time constraints matter. If you compute MFCC on the CPU without optimization, you might not keep up with the audio stream. This is why many systems use fixed-point or optimized DSP libraries. The ESP32-S3 includes vector extensions and can use ESP-NN or other DSP-optimized kernels. You can also reduce workload by lowering sample rate, reducing the number of MFCC coefficients, or using shorter windows. However, reducing too much can hurt accuracy. The key is to balance compute cost and model performance.
Buffers are also critical. You typically maintain a ring buffer that stores the last N samples and a sliding window for feature extraction. Overlap means each new frame shares samples with the previous frame. Implementing this efficiently avoids copying large blocks of memory. A common approach is to store audio in a circular buffer and index into it when building frames. The latency of the system is the window size plus the time to compute features and run inference. If your window is 1 second and you run inference every 100 ms, you can detect keywords quickly while still using a long context window. This overlapping approach increases compute cost but improves responsiveness.
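The circular-buffer pattern described above can be sketched in a few lines. This is an illustrative single-producer design (names are not from any SDK); a real ISR/task split would also need memory barriers or a lock-free index scheme:

```c
#include <stdint.h>

/* Minimal power-of-two ring buffer for int16 audio samples. */
#define RB_SIZE 1024u                /* must be a power of two */
#define RB_MASK (RB_SIZE - 1u)

typedef struct {
    int16_t  data[RB_SIZE];
    uint32_t write_idx;              /* total samples ever written */
} ring_buffer_t;

/* Producer side (DMA callback): append a block, overwriting the oldest. */
static void rb_write(ring_buffer_t *rb, const int16_t *src, uint32_t n) {
    for (uint32_t i = 0; i < n; i++)
        rb->data[(rb->write_idx++) & RB_MASK] = src[i];
}

/* Consumer side: copy the most recent `n` samples into a linear frame,
 * oldest first, without copying or disturbing the rest of the buffer. */
static void rb_read_latest(const ring_buffer_t *rb, int16_t *dst, uint32_t n) {
    uint32_t start = rb->write_idx - n;
    for (uint32_t i = 0; i < n; i++)
        dst[i] = rb->data[(start + i) & RB_MASK];
}
```

Because frames overlap, each call to `rb_read_latest` simply re-reads the shared tail of the buffer; no large memmove is needed per hop.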
Finally, the audio front end includes pre-processing like DC offset removal, gain control, and noise filtering. Some microphones provide a raw stream with a DC bias that must be removed. You can implement a high-pass filter to reduce low-frequency noise. If you skip this, your MFCC features may be dominated by noise and your model will misclassify. That is why data collected with the actual microphone is essential; differences in frequency response and noise floor matter.
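A classic way to remove the DC bias is a one-pole DC-blocking filter. The sketch below is illustrative (struct and function names are my own); the coefficient 0.995 is a typical choice that passes speech while rejecting DC:

```c
/* One-pole DC-blocking filter: y[n] = x[n] - x[n-1] + a * y[n-1].
 * With `a` close to 1 this acts as a gentle high-pass that removes
 * the microphone's DC offset and very-low-frequency rumble. */
typedef struct { float prev_x, prev_y, a; } dc_blocker_t;

static void dc_blocker_init(dc_blocker_t *f) {
    f->prev_x = 0.0f;
    f->prev_y = 0.0f;
    f->a = 0.995f;   /* closer to 1.0 = lower cutoff frequency */
}

static float dc_blocker_step(dc_blocker_t *f, float x) {
    float y = x - f->prev_x + f->a * f->prev_y;
    f->prev_x = x;
    f->prev_y = y;
    return y;
}
```

Fed a constant offset, the output decays toward zero; fed speech, it passes the signal nearly untouched.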
How This Fits Into the Project
You will configure I2S, collect audio frames, extract MFCC/MFE, and feed a model. This pipeline is the core of the keyword spotting device.
Definitions & Key Terms
- Sample rate: Number of audio samples per second.
- Nyquist: Sampling rate must be at least twice the highest frequency.
- MFCC: Mel-Frequency Cepstral Coefficients, a compact spectral feature representation for speech.
- MFE: Mel-frequency energy features.
- FFT: Fast Fourier Transform, converts time to frequency.
- Windowing: Applying a window to reduce spectral leakage.
Mental Model Diagram (ASCII)
Mic -> I2S/DMA -> Ring buffer -> Frames -> FFT -> Mel filter -> Features -> Model
How It Works (Step-by-Step)
- Configure I2S with sample rate and bit depth.
- DMA fills a ring buffer with samples.
- Extract overlapping frames from the buffer.
- Apply window, FFT, and Mel filterbank.
- Produce MFCC/MFE features for inference.
Minimal Concrete Example
// Pseudocode: overlapping frame extraction (16 kHz audio)
for each new 10 ms hop (160 samples) {
    copy the latest 400 samples (25 ms) from the ring buffer
    apply a Hamming window
    run FFT -> Mel filterbank -> log -> DCT   // yields one MFCC column
}
Common Misconceptions
- “Higher sample rate always helps.” It increases compute cost and may not improve accuracy.
- “Raw audio is enough.” Models usually require feature extraction.
- “No overlap is fine.” Overlap improves detection latency and accuracy.
Check-Your-Understanding Questions
- Why is overlap used in audio frames?
- What does the Mel filterbank do?
- How does a ring buffer prevent data loss?
Check-Your-Understanding Answers
- It allows frequent inference while still using a longer context window.
- It compresses the spectrum to reflect human hearing sensitivity.
- It allows continuous sampling without blocking while you process data.
Real-World Applications
- Voice activation in smart speakers.
- Wake-word detection in wearables.
Where You’ll Apply It
- See Section 3.2 Functional Requirements and Section 5.10 Phase 2.
- Also used in: P10 Web Oscilloscope for streaming data pipelines.
References
- “TinyML” (DSP and audio chapters)
- “Understanding Digital Signal Processing” (audio features)
- ESP32-S3 I2S documentation
Key Insights
A good TinyML model starts with a reliable audio pipeline.
Summary
Audio sampling and feature extraction are the foundation of keyword spotting. If your buffer or features are wrong, the model will never perform well.
Homework/Exercises to Practice the Concept
- Calculate the number of samples in a 25 ms window at 16 kHz.
- Explain how MFCC compresses audio information.
Solutions to the Homework/Exercises
- 16,000 samples/sec * 0.025 sec = 400 samples.
- It reduces the spectrum into perceptually scaled coefficients.
2.2 Concept 2: TinyML Quantization, Memory Arena, and Inference
Fundamentals
TinyML models must fit into tiny memory. Quantization converts floating-point weights and activations into 8-bit integers, reducing memory and speeding inference. TensorFlow Lite Micro (TFLM) runs models on microcontrollers using a fixed memory arena for all tensors. You must size this arena carefully; too small and inference fails, too large and you waste RAM. The model must also be small enough to fit in flash. Understanding quantization and memory planning is essential for deploying models on the ESP32-S3.
Deep Dive into the Concept
Quantization is a technique that maps floating-point values into integers with a scale and zero-point. For an 8-bit model, each tensor has parameters scale and zero_point such that real_value = scale * (int8_value - zero_point). This allows integer arithmetic to approximate floating-point operations. The benefits are dramatic: smaller models, faster inference, and lower power. The tradeoff is reduced precision, which can lower accuracy. For keyword spotting, quantized models often work well because the input features are already compressed and the classification problem is relatively simple.
TFLM uses a static memory arena because dynamic allocation is risky on microcontrollers. The arena is a fixed chunk of RAM that holds input, output, and intermediate tensors. The interpreter plans the memory layout so that tensors can reuse memory as layers execute. You must choose an arena size that covers the peak memory usage. If the arena is too small, TFLM will fail at runtime with a memory error. Profiling tools can show the required size, but often you must experiment. The ESP32-S3 has more RAM than smaller MCUs, but you still need to budget memory for audio buffers, logs, and stacks.
Inference latency depends on model size, operator choice, and hardware acceleration. The ESP32-S3 includes vector extensions and can use optimized kernels from ESP-NN. If you enable these, you can cut inference time significantly. However, you must ensure the operators in your model are supported by the optimized kernel set. For example, some convolution or depthwise convolution operators are accelerated, while others fall back to reference implementations.
Another consideration is model input format. The model expects a fixed-size feature matrix, so you must align your feature extraction to that shape. If the model expects 49 frames of 10 MFCC coefficients, your pipeline must produce exactly that. If you change the pipeline, you must retrain the model. This tight coupling is why TinyML projects require careful documentation of preprocessing parameters.
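The coupling between preprocessing parameters and input shape is just arithmetic. The helper below (name illustrative) computes the frame count; for instance, 30 ms frames with a 20 ms stride over a 1 s window at 16 kHz yield exactly 49 frames, matching the example above, while 25 ms frames with a 10 ms stride yield 98:

```c
/* Number of analysis frames produced from `num_samples` of audio,
 * given a frame length and stride (all in samples). */
static unsigned num_frames(unsigned num_samples,
                           unsigned frame_len,
                           unsigned stride) {
    if (num_samples < frame_len) return 0;
    return (num_samples - frame_len) / stride + 1;
}
```

If you change frame length, stride, or window duration, this count changes, the feature matrix no longer matches the model's input tensor, and inference fails or misclassifies; that is why the preprocessing parameters must be documented alongside the model.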
You also need to manage latency. If feature extraction takes 50 ms and inference takes 30 ms, and you run inference every 100 ms, you are close to real-time. If it takes longer, you will fall behind and buffer overflows will occur. The practical strategy is to measure each stage and optimize the heaviest ones, or reduce the model size. Many TinyML projects start with a model from Edge Impulse or TensorFlow examples and then iterate on performance.
How This Fits Into the Project
You will load a quantized model into flash, allocate a TFLM arena, and run inference on MFCC features. You will profile inference time and adjust thresholds.
Definitions & Key Terms
- Quantization: Mapping float values to integers for compact models.
- TFLM: TensorFlow Lite Micro, runtime for MCUs.
- Memory arena: Fixed RAM buffer for all tensors.
- Zero-point: Integer offset used in quantization.
- Operator kernel: Implementation of a model layer.
Mental Model Diagram (ASCII)
Features -> [TFLM Interpreter]
                  |
                  +-- arena memory
                  +-- quantized ops
How It Works (Step-by-Step)
- Convert model to int8 quantized TFLM format.
- Allocate tensor arena in RAM.
- Load features into input tensor.
- Invoke interpreter.
- Read output scores and apply threshold.
Minimal Concrete Example
// Copy the quantized feature window into the model's input tensor.
TfLiteTensor* input = interpreter->input(0);
memcpy(input->data.int8, feature_buffer, input->bytes);
// Invoke can fail (e.g., unsupported op, arena exhaustion); check it.
if (interpreter->Invoke() != kTfLiteOk) { /* handle the error */ }
TfLiteTensor* output = interpreter->output(0);  // int8 class scores
Common Misconceptions
- “Quantization always hurts accuracy.” Often it is nearly identical for small models.
- “Arena size is arbitrary.” It must cover peak tensor usage.
- “Any model will run on MCU.” Many models are too large or use unsupported ops.
Check-Your-Understanding Questions
- Why is quantization important for microcontrollers?
- What happens if the tensor arena is too small?
- Why must preprocessing match the model’s expected input shape?
Check-Your-Understanding Answers
- It reduces memory and computation requirements.
- TFLM fails at runtime and inference cannot run.
- The model weights are trained for that exact input representation.
Real-World Applications
- Wake-word detection on earbuds.
- Offline voice commands in appliances.
Where You’ll Apply It
- See Section 3.2 Functional Requirements and Section 6.2 Critical Test Cases.
- Also used in: P01 Architectural Blink for memory discipline.
References
- “TinyML” (deployment chapters)
- TensorFlow Lite Micro documentation
- ESP-NN library docs
Key Insights
TinyML success is as much about memory planning as model accuracy.
Summary
Quantized models and the TFLM memory arena make neural networks feasible on MCUs. You must align preprocessing, memory, and kernels to achieve real-time inference.
Homework/Exercises to Practice the Concept
- Estimate the memory usage of a model with 30k parameters in int8.
- List two operator types that are commonly accelerated on ESP32-S3.
Solutions to the Homework/Exercises
- 30k params * 1 byte = 30 KB, plus activations and overhead.
- Convolution and depthwise convolution are common accelerated ops.
3. Project Specification
3.1 What You Will Build
A keyword spotting firmware that captures audio from the XIAO ESP32S3 Sense microphone, extracts MFCC features, runs a quantized TinyML model, and triggers an LED or buzzer when the keyword is detected.
3.2 Functional Requirements
- Audio Capture: Stream audio at 16 kHz using I2S with DMA.
- Feature Extraction: Compute MFCC or MFE features in real time.
- Model Inference: Run a quantized TFLM model locally.
- Decision Logic: Trigger an action when score exceeds threshold.
- Logging: Print inference scores and latency to serial.
3.3 Non-Functional Requirements
- Latency: Detect keyword within 300 ms of utterance.
- Accuracy: >90 percent detection accuracy in your environment.
- Stability: Runs continuously for 30 minutes without overflow.
3.4 Example Usage / Output
[Infer] score=0.97 label=XIAO latency=42ms
[Action] Keyword detected
3.5 Data Formats / Schemas / Protocols
- Feature tensor shape: [frames][coeffs] (e.g., 49x10)
- Model output: class probabilities (int8 or uint8)
3.6 Edge Cases
- Buffer overrun when feature extraction is too slow.
- Model output unstable in noisy environments.
- Memory arena too small for model.
3.7 Real World Outcome
A device that responds to a spoken keyword without cloud connectivity.
3.7.1 How to Run (Copy/Paste)
idf.py set-target esp32s3
idf.py build
idf.py -p /dev/ttyUSB0 flash monitor
3.7.2 Golden Path Demo (Deterministic)
- Fixed threshold 0.8
- Use recorded test audio for repeatability
Expected logs:
[Infer] score=0.95 label=XIAO latency=40ms
[Action] Keyword detected
3.7.3 Failure Demo (Buffer Overrun)
E (1234) AUDIO: DMA buffer overrun
E (1234) AUDIO: Dropping frame
3.7.4 If CLI
No standalone CLI. Exit codes not applicable.
3.7.5 If Web App
Not applicable.
3.7.6 If API
No API is exposed. Error JSON shape not applicable.
3.7.7 If GUI / Desktop / Mobile
Not applicable.
3.7.8 If TUI
Not applicable.
4. Solution Architecture
4.1 High-Level Design
Mic -> I2S/DMA -> Feature extraction -> TFLM -> Decision -> Action
4.2 Key Components
| Component | Responsibility | Key Decisions |
|---|---|---|
| Audio capture | Configure I2S and DMA | Sample rate 16 kHz |
| Feature pipeline | MFCC/MFE extraction | Frame size/stride |
| Model runtime | TFLM inference | Arena size, kernels |
| Decision logic | Threshold + debounce | Min confidence |
4.3 Data Structures (No Full Code)
struct inference_state {
int8_t features[49][10];
uint32_t last_trigger_ms;
};
4.4 Algorithm Overview
Key Algorithm: Keyword Detection
- Capture audio frame.
- Update rolling buffer and compute features.
- Run inference on latest window.
- If score > threshold, trigger action.
Complexity Analysis:
- Time: O(F) per frame (FFT + inference)
- Space: O(window size)
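The final decision step deserves more than a bare threshold: without a refractory period, one utterance spanning several overlapping windows fires repeatedly. A minimal sketch (names and values are illustrative, not from any SDK) combining threshold and debounce:

```c
#include <stdbool.h>
#include <stdint.h>

/* Trigger only when the score clears the threshold AND a debounce
 * (refractory) period has elapsed since the previous trigger. */
typedef struct {
    float    threshold;
    uint32_t debounce_ms;
    uint32_t last_trigger_ms;
} decision_t;

static void decision_init(decision_t *d, float threshold, uint32_t debounce_ms) {
    d->threshold = threshold;
    d->debounce_ms = debounce_ms;
    /* Start "pre-expired" (unsigned wraparound) so the first detection
     * is not suppressed by the debounce window. */
    d->last_trigger_ms = 0u - debounce_ms;
}

static bool decide(decision_t *d, float score, uint32_t now_ms) {
    if (score < d->threshold) return false;
    if (now_ms - d->last_trigger_ms < d->debounce_ms) return false;
    d->last_trigger_ms = now_ms;
    return true;
}
```

On the device, `now_ms` would come from a millisecond tick source such as the FreeRTOS tick count; the unsigned subtraction stays correct across timer wraparound.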
5. Implementation Guide
5.1 Development Environment Setup
idf.py set-target esp32s3
idf.py build
5.2 Project Structure
p05_tinyml_ear/
+-- main/
| +-- audio_capture.c
| +-- feature_mfcc.c
| +-- model_data.cc
| +-- app_main.c
+-- models/
+-- keyword_model.tflite
5.3 The Core Question You’re Answering
“Can a tiny microcontroller perform reliable speech recognition on-device?”
5.4 Concepts You Must Understand First
- Audio sampling and MFCC features
- Quantized inference and TFLM arena
- Real-time buffering and latency
5.5 Questions to Guide Your Design
- What window size and stride balance latency vs accuracy?
- How large is your model and arena?
- What threshold minimizes false positives?
5.6 Thinking Exercise
If you process 1-second windows every 100 ms, how many overlapping windows cover a 1.2 second keyword?
5.7 The Interview Questions They’ll Ask
- Why is quantization required for TinyML?
- What is the role of MFCC features?
- How do you handle false positives?
5.8 Hints in Layers
Hint 1: Start with a known model from Edge Impulse or TensorFlow example.
Hint 2: Use a ring buffer for audio samples.
Hint 3: Measure inference time and adjust model size.
Hint 4: Collect training data with the on-board microphone.
5.9 Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| TinyML deployment | TinyML | Deployment chapters |
| DSP basics | Understanding Digital Signal Processing | MFCC chapters |
5.10 Implementation Phases
Phase 1: Audio Capture (2-3 days)
Goals:
- Stream audio reliably.
- Log raw sample levels.
Tasks:
- Configure I2S and DMA buffers.
- Log RMS values to verify capture.
Checkpoint: Stable stream without overruns.
Phase 2: Feature + Model (3-4 days)
Goals:
- Extract features.
- Run inference with a small model.
Tasks:
- Implement MFCC pipeline.
- Integrate TFLM model.
Checkpoint: Inference runs and outputs scores.
Phase 3: Tuning and Robustness (2-3 days)
Goals:
- Tune threshold and debounce.
- Measure accuracy and latency.
Tasks:
- Test with real speech and noise.
- Adjust threshold and smoothing.
Checkpoint: >90 percent accuracy in your environment.
5.11 Key Implementation Decisions
| Decision | Options | Recommendation | Rationale |
|---|---|---|---|
| Feature type | MFCC vs MFE | MFCC | Strong speech features |
| Sample rate | 8k vs 16k | 16k | Better speech fidelity |
| Model size | Small vs large | Small + quantized | Real-time on MCU |
6. Testing Strategy
6.1 Test Categories
| Category | Purpose | Examples |
|---|---|---|
| Unit Tests | Validate MFCC output shapes | Feature vector length |
| Integration Tests | End-to-end detection | Speak keyword |
| Edge Case Tests | Noise-only input | False positive checks |
6.2 Critical Test Cases
- Known Keyword: Detect with >90 percent confidence.
- Noise: No trigger in silent/noise environment.
- Buffer Overrun: System recovers gracefully.
6.3 Test Data
keyword_samples: 20 recordings
noise_samples: 20 recordings
7. Common Pitfalls & Debugging
7.1 Frequent Mistakes
| Pitfall | Symptom | Solution |
|---|---|---|
| Overrun buffers | Audio glitches | Increase DMA size or reduce load |
| Wrong feature shape | Model errors | Match training pipeline |
| High false positives | Constant triggers | Increase threshold, add debounce |
7.2 Debugging Strategies
- Log feature statistics (mean, variance) to detect bad audio.
- Measure inference time and memory usage.
7.3 Performance Traps
- Large FFT sizes or high sample rates can exceed CPU budget.
8. Extensions & Challenges
8.1 Beginner Extensions
- Add a second keyword class.
- Blink LED with different patterns per label.
8.2 Intermediate Extensions
- Add a simple noise gate before inference.
- Stream features over UART for visualization.
8.3 Advanced Extensions
- Implement on-device training or personalization.
- Add a VAD (voice activity detector) to reduce compute.
9. Real-World Connections
9.1 Industry Applications
- Voice wake-word in appliances and wearables.
- Offline voice triggers in industrial equipment.
9.2 Related Open Source Projects
- TensorFlow Lite Micro - inference engine.
- Edge Impulse - TinyML deployment pipeline.
9.3 Interview Relevance
- TinyML deployment and quantization are common ML-on-device topics.
10. Resources
10.1 Essential Reading
- “TinyML” by Warden and Situnayake
- TensorFlow Lite Micro documentation
10.2 Video Resources
- “TinyML on Microcontrollers” (talk)
- “MFCC Explained” (tutorial)
10.3 Tools & Documentation
- ESP32-S3 I2S documentation
- ESP-NN optimized kernels
10.4 Related Projects in This Series
- P10 Web Oscilloscope - streaming data in real time
- P02 Deep Sleep Champion - power profiling for always-on listening
11. Self-Assessment Checklist
11.1 Understanding
- I can explain MFCC extraction steps.
- I can describe quantization and the TFLM arena.
- I can compute the latency budget.
11.2 Implementation
- Audio capture is stable with no overruns.
- Model runs in real time.
- Keyword detection is accurate and reliable.
11.3 Growth
- I documented at least one model tuning change.
- I can explain tradeoffs between accuracy and latency.
12. Submission / Completion Criteria
Minimum Viable Completion:
- Model runs on device and outputs scores.
- Keyword detection triggers an action.
Full Completion:
- >90 percent accuracy in your environment.
- Stable streaming and inference for 30 minutes.
Excellence (Going Above & Beyond):
- Two-keyword model with robust noise handling.
- Performance report with latency and memory metrics.