Project 3: Real-Time Audio Spectrum Analyzer
Build a handheld audio spectrum analyzer that captures microphone input over I2S, performs FFT analysis, and renders smooth frequency bars in real time.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Advanced |
| Time Estimate | 2–3 weeks |
| Main Programming Language | C/C++ (ESP-IDF) |
| Alternative Programming Languages | Arduino |
| Coolness Level | Very High |
| Business Potential | Medium (audio diagnostics, field meters) |
| Prerequisites | DSP basics, I2S familiarity, basic UI rendering |
| Key Topics | I2S DMA, FFT, windowing, real-time UI pipelines |
1. Learning Objectives
By completing this project, you will:
- Configure I2S audio capture with DMA buffers and stable sampling rates.
- Implement FFT analysis with windowing and magnitude scaling.
- Design a real-time UI that updates smoothly under CPU constraints.
- Calibrate microphone input and avoid clipping or DC bias.
- Build a reliable audio pipeline that tolerates load without glitches.
2. All Theory Needed (Per-Concept Breakdown)
2.1 I2S Audio Capture and DMA Buffering
Fundamentals
I2S is a digital audio bus that transmits PCM samples with a bit clock and word select signal. For an audio analyzer, you need stable sampling: each buffer must contain equally spaced samples, and the CPU must receive those buffers without loss. This is why DMA (Direct Memory Access) is essential. DMA moves audio data from the I2S peripheral into memory without CPU intervention, then raises an interrupt when a buffer is ready. Your task is to configure sample rate, bit depth, channel format, and DMA buffer size so that you get continuous, predictable streams of samples.
Deep Dive into the concept
The I2S peripheral on the ESP32-S3 can run as a receive master, where the MCU generates the bit clock and word select, or as a slave that accepts externally generated clocks. For the Cardputer’s MEMS microphone, you typically generate the clock and read 16-bit mono PCM at 16 kHz or 22.05 kHz. The sample rate is critical because, together with the FFT size, it sets your frequency resolution: at 16 kHz with a 1024-sample FFT, each bin is about 15.6 Hz wide. DMA buffer size must align with your FFT window. For example, if you want 1024 samples, you might configure two DMA buffers of 512 samples each and assemble them. If the buffers are too small, interrupt and CPU overhead increase; if they are too large, latency increases. The key is to balance latency, CPU load, and FFT update rate.
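The bin-width and buffer-duration arithmetic in this paragraph can be checked with a few lines of C. This is a minimal sketch; the helper names are illustrative and are not part of ESP-IDF.

```c
#include <stddef.h>

/* Width of one FFT bin in Hz: sample_rate / fft_size. */
static double bin_width_hz(double sample_rate_hz, size_t fft_size)
{
    return sample_rate_hz / (double)fft_size;
}

/* Time needed to collect `samples` samples, in milliseconds. */
static double block_duration_ms(double sample_rate_hz, size_t samples)
{
    return 1000.0 * (double)samples / sample_rate_hz;
}
```

At 16 kHz, a 1024-point FFT gives 15.625 Hz per bin, a 512-sample DMA buffer lasts 32 ms, and a full 1024-sample window takes 64 ms to fill. That 64 ms is also the time the DSP task has to finish one frame if windows do not overlap.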
DMA memory must be allocated in a region compatible with DMA. ESP-IDF provides heap_caps_malloc with flags like MALLOC_CAP_DMA. If you allocate buffers in the wrong region, transfers may fail or produce corrupted data. Another subtlety is sample format: some MEMS mics output left-justified data with a fixed offset. You may need to shift or sign-extend samples to produce correct signed PCM values. If you ignore this, your spectrum will look wrong or show a strong DC component.
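As a sketch of the shift and sign-extension step, assuming the microphone delivers left-justified samples in 32-bit slots: the word length (`valid_bits`) is an assumption that must be checked against the microphone datasheet, and both function names are hypothetical. The second helper removes a DC offset by subtracting the block mean, which is one simple option among several.

```c
#include <stddef.h>
#include <stdint.h>

/* Convert a left-justified sample in a 32-bit slot to a signed value.
 * valid_bits is the microphone word length (1..31), an assumption here. */
static int32_t slot_to_pcm(uint32_t raw, unsigned valid_bits)
{
    uint32_t v = raw >> (32u - valid_bits);      /* drop the padding bits */
    uint32_t sign = 1u << (valid_bits - 1u);     /* mask of the sign bit */
    return (int32_t)(v ^ sign) - (int32_t)sign;  /* portable sign extension */
}

/* Remove the DC offset of one block by subtracting its integer mean. */
static void remove_dc(int32_t *samples, size_t n)
{
    if (n == 0) return;
    int64_t sum = 0;
    for (size_t i = 0; i < n; i++) sum += samples[i];
    int32_t mean = (int32_t)(sum / (int64_t)n);
    for (size_t i = 0; i < n; i++) samples[i] -= mean;
}
```

For a 24-bit microphone, `slot_to_pcm(0xFFFFFF00u, 24)` yields -1. Running `remove_dc` on each block before windowing keeps the 0 Hz bin from dominating the display.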
The capture callback should not do FFTs. It should simply enqueue buffer pointers or copy samples into a circular buffer and return. A separate DSP task can process sample blocks at a fixed cadence. You also need to handle buffer overrun: if the DSP task falls behind, you should drop the oldest buffer rather than blocking I2S. This keeps your audio pipeline stable even when the UI task is busy. For debugging, log the rate of DMA buffer completions and the DSP processing time; if processing time exceeds buffer duration, you will inevitably fall behind.
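The drop-oldest policy can be modeled with a small bounded ring of buffer pointers. This is a single-threaded sketch for reasoning about the policy: the type, function names, and queue depth are hypothetical, and on the device the push and pop would need a critical section or a FreeRTOS queue to be safe between the driver callback and the DSP task.

```c
#include <stddef.h>
#include <stdint.h>

#define QUEUE_SLOTS 4u  /* hypothetical depth; size it from your latency budget */

typedef struct {
    int16_t *slots[QUEUE_SLOTS]; /* pointers to filled sample buffers */
    size_t head;                 /* index of the oldest entry */
    size_t count;                /* number of queued entries */
    uint32_t drops;              /* buffers discarded because the consumer was late */
} buf_queue_t;

/* Producer side: never blocks. If the queue is full, discard the oldest buffer. */
static void buf_queue_push(buf_queue_t *q, int16_t *buf)
{
    if (q->count == QUEUE_SLOTS) {
        q->head = (q->head + 1u) % QUEUE_SLOTS;
        q->count--;
        q->drops++;
    }
    q->slots[(q->head + q->count) % QUEUE_SLOTS] = buf;
    q->count++;
}

/* Consumer side: returns the oldest queued buffer, or NULL if empty. */
static int16_t *buf_queue_pop(buf_queue_t *q)
{
    if (q->count == 0) return NULL;
    int16_t *buf = q->slots[q->head];
    q->head = (q->head + 1u) % QUEUE_SLOTS;
    q->count--;
    return buf;
}
```

The `drops` counter is the number to surface in diagnostics: it should stay at zero in normal operation and rise only when the DSP task cannot keep up.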
How this fits in projects
This capture pipeline mirrors the WiFi sniffer pipeline in P01-wifi-packet-sniffer-network-analyzer.md and informs the concurrency discipline in P08-complete-cardputer-security-toolkit.md.
Definitions & key terms
- I2S → serial bus for audio PCM data
- DMA → hardware data transfer without CPU
- Sample rate → number of samples per second
- PCM → raw audio sample format
- Buffer overrun → data arrives faster than consumer can process
Mental model diagram (ASCII)
[Mic] -> [I2S] -> [DMA Buffers] -> [DSP Task] -> [FFT] -> [UI]
How it works (step-by-step, with invariants and failure modes)
- Configure I2S for desired sample rate and format.
- DMA fills buffer and signals completion.
- ISR or driver callback posts buffer to DSP queue.
- DSP task assembles window and runs FFT.
- UI task renders magnitude bars.
Invariants: buffers are DMA-capable; sample rate stable; queue bounded. Failure modes: buffer overrun drops samples; incorrect sample format causes distorted spectrum; blocking in callback stalls capture.
Minimal concrete example
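size_t bytes_read = 0;  /* i2s_read() belongs to the legacy ESP-IDF I2S driver (driver/i2s.h) */
ESP_ERROR_CHECK(i2s_read(I2S_NUM_0, dma_buf, buf_bytes, &bytes_read, portMAX_DELAY));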
Common misconceptions
- “I can read samples with polling.” → Per-sample polling cannot guarantee that every sample is serviced on time once FFT and UI work share the CPU; DMA can.
- “Any buffer size works.” → Buffer size determines latency and CPU load.
- “All mics output standard PCM.” → Some mics output left-justified or biased samples.
Check-your-understanding questions
- Why must DMA buffers be allocated in DMA-capable memory?
- How does sample rate affect FFT bin size?
- What happens if DSP processing time exceeds buffer duration?
Check-your-understanding answers
- DMA cannot access all memory regions; wrong region causes failure.
- For a fixed FFT size, a higher sample rate raises the maximum frequency and widens the bin spacing (bin_size = sample_rate / FFT_size).
- Buffers accumulate and you eventually drop samples.
Real-world applications
- Audio spectrum monitoring for sound engineers.
- Field diagnostics for noise and vibration analysis.
Where you’ll apply it
- This project: see §4.1 and §5.10 for buffer strategy.
- Also used in: P08-complete-cardputer-security-toolkit.md.
References
- ESP-IDF I2S driver documentation.
- The Scientist and Engineer’s Guide to DSP – sampling fundamentals.
Key insight
Real-time audio depends on a stable capture pipeline; if capture wobbles, everything else fails.
Summary
Configure I2S + DMA carefully, keep callbacks short, and match buffers to your FFT window.
Homework/Exercises to practice the concept
- Capture 1 second of audio and print min/max sample values.
- Change sample rate and observe FFT bin spacing.
Solutions to the homework/exercises
- Read samples, track max/min, and verify they are within signed 16-bit range.
- Compute bin_size = sample_rate / FFT_size and compare plots.
2.2 FFT, Windowing, and Magnitude Scaling
Fundamentals
The FFT converts time-domain samples into frequency-domain magnitudes. A naive FFT without windowing assumes the signal is periodic within the buffer. Real audio signals are not perfectly periodic, so you must apply a window function (Hann, Hamming) to reduce spectral leakage. After the FFT, you compute magnitudes, often convert to logarithmic (dB) scale, and map them to display bars. The key is to produce stable, readable output without jitter or aliasing.
Deep Dive into the concept
FFT size sets your frequency resolution and processing cost. A 1024-point FFT gives reasonable resolution at 16 kHz; a 2048-point FFT halves the bin width (doubling resolution) at the cost of more CPU time and twice the capture latency. You need to choose a size that fits the device’s processing budget. Windowing is critical: without it, a single tone spreads energy into neighboring bins (leakage). The Hann window is a good compromise because it suppresses sidelobes strongly at the cost of a somewhat wider main lobe. You multiply each sample by the window coefficient before the FFT. This costs CPU but improves clarity.
Magnitude computation requires converting complex FFT outputs into real magnitudes: sqrt(re^2 + im^2) or a faster approximation. Then you often apply logarithmic scaling because human perception of loudness is logarithmic. You can compute 20*log10(mag) and clamp it to a range. For embedded performance, you might precompute a log lookup table or use a fixed-point approximation. You also should average or smooth magnitudes across frames to reduce flicker. A simple exponential moving average (EMA) per bin can stabilize the display.
Aliasing and Nyquist are fundamental. Your maximum representable frequency is half the sample rate. For real input, the FFT bins above N/2 are mirror images of the lower bins and carry no new information, so display only the first N/2. At a 16 kHz sample rate the Nyquist limit is 8 kHz: a 10 kHz tone cannot be represented, and unless it is filtered out before sampling it folds back and appears at 6 kHz. The MEMS mic and I2S configuration should provide a low-pass response that attenuates content above Nyquist enough to make this acceptable. For display, you may want to group bins into bands (e.g., 8 or 16 bars) using a logarithmic frequency scale. This matches human hearing and makes the display more intuitive.
Finally, normalization matters. If your FFT magnitude is unscaled, loud signals saturate and quiet signals disappear. You can normalize by the FFT size and window gain. Use a calibrated reference tone (e.g., 1 kHz) to set display levels. Provide an on-screen “gain” setting to adjust sensitivity in the field.
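One common normalization convention, stated here as an assumption and not as the project's required scaling: for a real sinusoid of amplitude $A$ centered on bin $k$ (with $0 < k < N/2$) and a window $w_n$, the FFT magnitude is approximately $|X_k| = \tfrac{A}{2}\sum_n w_n$. Solving for the amplitude gives the estimate below; the Hann case uses the fact that the window coefficients sum to about $N/2$.

```latex
\hat{A}_k = \frac{2\,|X_k|}{\sum_{n=0}^{N-1} w_n},
\qquad
\text{Hann: } \sum_{n=0}^{N-1} w_n \approx \frac{N}{2}
\;\Rightarrow\;
\hat{A}_k \approx \frac{4\,|X_k|}{N}
```

Dividing by full scale (32768 for 16-bit samples) before taking $20\log_{10}$ then expresses each bar in dB relative to full scale, which makes the calibration against a reference tone repeatable.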
How this fits in projects
FFT analysis is specific to this project, but the idea of buffering, processing, and UI rendering in real time appears throughout the series, especially in P01-wifi-packet-sniffer-network-analyzer.md and P08-complete-cardputer-security-toolkit.md.
Definitions & key terms
- FFT → fast algorithm to compute frequency spectrum
- Windowing → weighting samples to reduce spectral leakage
- Spectral leakage → energy spread across bins due to non-periodic signals
- Nyquist frequency → half the sample rate
- EMA → exponential moving average for smoothing
Mental model diagram (ASCII)
[Samples] -> [Window] -> [FFT] -> [Magnitude] -> [Smoothing] -> [Bars]
How it works (step-by-step, with invariants and failure modes)
- Collect N samples in a buffer.
- Apply window function to each sample.
- Run FFT to produce complex frequency bins.
- Compute magnitudes and scale to dB.
- Smooth magnitudes and map to UI bars.
Invariants: FFT size matches buffer size; window coefficients correct; magnitude scaling stable. Failure modes: no window leads to smeared peaks; wrong scaling causes flicker or saturation; insufficient smoothing yields jitter.
Minimal concrete example
for (int i = 0; i < N; i++) win[i] = samples[i] * hann[i];
Common misconceptions
- “FFT output is directly audible volume.” → You must scale and interpret it.
- “Windowing is optional.” → Without it, spectral leakage dominates.
- “More bins always look better.” → More bins increase CPU load and make each bar noisier.
Check-your-understanding questions
- Why is windowing needed for FFT of real signals?
- What is the Nyquist limit for a 16 kHz sample rate?
- Why use logarithmic scaling for display?
Check-your-understanding answers
- Real signals are not periodic in the buffer; windowing reduces leakage.
- 8 kHz.
- Human perception of loudness is logarithmic.
Real-world applications
- Audio visualization and diagnostics.
- Vibration analysis in industrial monitoring.
Where you’ll apply it
- This project: see §4.4 algorithm overview and §5.10 Phase 2.
- Also used in: P08-complete-cardputer-security-toolkit.md for audio modules.
References
- The Scientist and Engineer’s Guide to DSP – FFT and windowing chapters.
- ARM CMSIS DSP FFT documentation (conceptual references).
Key insight
A spectrum display is only as good as its windowing, scaling, and smoothing choices.
Summary
FFT transforms time samples into frequency bins; windowing and smoothing are what make the visualization meaningful.
Homework/Exercises to practice the concept
- Generate a 1 kHz sine wave and verify its FFT bin.
- Compare FFT results with and without a window.
Solutions to the homework/exercises
- At 16 kHz and 1024 samples, the peak should be near bin 64.
- Without a window, you’ll see leakage into adjacent bins.
2.3 Real-Time UI Rendering and Frame Budgeting
Fundamentals
Rendering on a small SPI TFT is expensive. Each pixel update consumes SPI bandwidth and CPU time. To keep the spectrum smooth, you must control how often you draw, how much you redraw, and how you synchronize drawing with data updates. A real-time UI budget is a fixed time slice each frame; if you exceed it, the UI stutters and the capture pipeline suffers. The solution is to decouple rendering from DSP and use partial redraws or a dirty-rectangle approach.
Deep Dive into the concept
The ST7789 display is driven over SPI, so every pixel drawn costs bus bandwidth. At 240x135, a full-screen redraw at 16-bit color is about 65 KB (64,800 bytes); at 20 FPS, that is roughly 1.3 MB/s. The ESP32-S3 can handle this, but only if you avoid blocking other tasks. The better approach is to draw only what changes. For a spectrum analyzer, the bars are the main dynamic element; you can draw the background once and then update only the bar regions each frame. Use a fixed render cadence (e.g., 10–20 FPS) to balance smoothness and CPU load.
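The bandwidth figures above follow from simple arithmetic, sketched here so you can plug in your own region sizes. The helper names are illustrative, and the calculation ignores SPI command overhead.

```c
#include <stdint.h>

/* Bytes needed to redraw a w x h pixel region. */
static uint32_t redraw_bytes(uint32_t w, uint32_t h, uint32_t bytes_per_pixel)
{
    return w * h * bytes_per_pixel;
}

/* Sustained SPI payload rate for a given frame size and frame rate. */
static uint32_t redraw_bytes_per_sec(uint32_t bytes_per_frame, uint32_t fps)
{
    return bytes_per_frame * fps;
}
```

A full 240x135 frame at 2 bytes per pixel is 64,800 bytes, or 1,296,000 bytes per second at 20 FPS. As a hypothetical comparison, 16 bars that are 12 pixels wide and each change by 10 rows in a frame need only 16 × 12 × 10 × 2 = 3,840 bytes.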
Synchronization between DSP and UI matters. If you render while DSP is updating the same data structure, you can read inconsistent values. Use double-buffered magnitude arrays or a snapshot mechanism: the DSP task writes into a buffer, then flips a pointer or posts a message when ready. The UI task reads the latest complete buffer. This avoids partial updates and tearing. If you want smooth decay effects, you can implement a peak-hold or falling-bar algorithm: keep a separate “peak” array that decays slowly, and draw peak markers. This adds polish without heavy CPU cost.
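A double-buffered magnitude array with a published index might look like the sketch below. It uses C11 atomics for the index and assumes one writer (DSP) and one reader (UI); all names are hypothetical. It also assumes the UI finishes copying its snapshot before the DSP task publishes twice, which is plausible when a frame takes tens of milliseconds to capture, but if that does not hold you would need a mutex or a third buffer.

```c
#include <stdatomic.h>
#include <string.h>

#define NUM_BINS 512

typedef struct {
    float buf[2][NUM_BINS];
    atomic_int front;          /* index of the buffer the UI may read */
} mag_double_buffer_t;

/* DSP side: the buffer that is currently safe to write into. */
static float *mag_back(mag_double_buffer_t *db)
{
    return db->buf[1 - atomic_load(&db->front)];
}

/* DSP side: make the freshly written buffer visible to the UI. */
static void mag_publish(mag_double_buffer_t *db)
{
    atomic_store(&db->front, 1 - atomic_load(&db->front));
}

/* UI side: copy the latest complete frame into a private array. */
static void mag_snapshot(mag_double_buffer_t *db, float *out)
{
    memcpy(out, db->buf[atomic_load(&db->front)], sizeof db->buf[0]);
}
```

The UI draws from its private copy, so a slow SPI transfer never holds up the DSP task, and the DSP task never modifies data that is being drawn.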
Power and battery are also part of the UI equation. A bright display drains the Cardputer battery quickly. Provide a brightness control and an auto-dim feature when no audio is detected. This reduces power draw and prevents burn-in. If you render too frequently, you increase power consumption and risk overheating; a stable cadence helps.
How this fits in projects
The same UI rendering discipline is used in P01-wifi-packet-sniffer-network-analyzer.md and P06-custom-application-launcher-mini-os.md, where multiple apps share display resources.
Definitions & key terms
- Frame budget → time allowed for each UI update
- Dirty rectangle → only redraw changed areas
- Double-buffering → separate buffers for producer and renderer
- Tearing → visual artifact from partial updates
Mental model diagram (ASCII)
[DSP Task] -> [Magnitude Buffer A/B] -> [UI Task] -> [SPI TFT]
How it works (step-by-step, with invariants and failure modes)
- DSP task completes FFT and writes magnitudes to buffer A.
- DSP task swaps buffer pointer and signals UI task.
- UI task draws bars from the stable buffer.
- UI task throttles update rate to maintain frame budget.
Invariants: UI uses immutable snapshot; render cadence fixed. Failure modes: UI reads while DSP writes, causing flicker; full-screen redraw causes drops elsewhere.
Minimal concrete example
if (xQueueReceive(fft_ready_q, &buf, 0)) {
    draw_bars(buf);
}
Common misconceptions
- “Full-screen redraw is simplest.” → It wastes bandwidth and CPU.
- “UI doesn’t affect capture.” → It can starve DSP and I2S.
- “Higher FPS is always better.” → 10–20 FPS is sufficient for bars.
Check-your-understanding questions
- Why use double-buffering for FFT magnitudes?
- What is a dirty rectangle and why does it matter?
- How does UI update rate affect power usage?
Check-your-understanding answers
- To avoid reading data while DSP is writing, preventing tearing.
- It limits redraw to changed areas, reducing SPI bandwidth.
- Higher update rates increase CPU and display power consumption.
Real-world applications
- Dashboard displays in embedded instrumentation.
- Portable meters and oscilloscopes.
Where you’ll apply it
- This project: see §4.2 key components and §5.10 Phase 3.
- Also used in: P06-custom-application-launcher-mini-os.md and P08-complete-cardputer-security-toolkit.md.
References
- ST7789 display driver docs.
- Making Embedded Systems – UI timing and performance tradeoffs.
Key insight
A smooth UI is a scheduling problem, not just a drawing problem.
Summary
Render at a fixed cadence, use snapshots, and redraw only what changes to keep the spectrum stable.
Homework/Exercises to practice the concept
- Measure frame time for full-screen vs bar-only redraw.
- Implement a peak-hold decay effect.
Solutions to the homework/exercises
- Use a timer to log render time in microseconds and compare.
- Track a peak array and decrement it each frame until it matches magnitude.
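The peak-hold solution can be sketched as a per-frame update. The function name and the linear decay step are illustrative choices; an exponential decay would work equally well.

```c
#include <stddef.h>

/* Let each peak fall by decay_per_frame, but never below the current
 * magnitude; a louder magnitude lifts the peak immediately. */
static void peak_hold_update(const float *mag, float *peak,
                             size_t n, float decay_per_frame)
{
    for (size_t i = 0; i < n; i++) {
        float fallen = peak[i] - decay_per_frame;
        peak[i] = (mag[i] > fallen) ? mag[i] : fallen;
    }
}
```

Because the decay is applied once per rendered frame, the fall speed depends on the UI frame rate; choose `decay_per_frame` together with the render cadence.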
3. Project Specification
3.1 What You Will Build
A real-time spectrum analyzer that:
- captures mic audio via I2S,
- computes FFT magnitudes,
- renders a bar graph with smoothing and peak hold,
- provides a gain control and clipping indicator.
3.2 Functional Requirements
- Audio capture: continuous PCM sampling with DMA.
- FFT processing: windowed FFT on fixed-size frames.
- Visualization: bar graph with smoothing and optional peak hold.
- Controls: gain and sensitivity settings.
- Diagnostics: show FPS, buffer overruns, and clipping stats.
3.3 Non-Functional Requirements
- Performance: FFT and render within frame budget (no stutter).
- Reliability: no buffer overruns during 10-minute run.
- Usability: clear display and responsive controls.
3.4 Example Usage / Output
1) Boot analyzer
2) Play a 1 kHz tone nearby
3) Observe a strong bar at 1 kHz
3.5 Data Formats / Schemas / Protocols
- In-memory magnitude arrays of size N/2.
- Optional CSV export of averaged spectrum (timestamp, bin, magnitude).
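A CSV row in the (timestamp, bin, magnitude) layout could be formatted as below. The millisecond timestamp unit, the two-decimal dB precision, and the function name are assumptions; the source only fixes the column order.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Format one "timestamp,bin,magnitude" row. Returns the number of
 * characters written (excluding the terminator), or -1 if `out` is too small. */
static int spectrum_csv_row(char *out, size_t out_len, uint32_t timestamp_ms,
                            unsigned bin, float magnitude_db)
{
    int n = snprintf(out, out_len, "%lu,%u,%.2f\n",
                     (unsigned long)timestamp_ms, bin, (double)magnitude_db);
    return (n < 0 || (size_t)n >= out_len) ? -1 : n;
}
```

Writing averaged spectra at a low rate (for example once per second) keeps SD card traffic from competing with the capture pipeline.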
3.6 Edge Cases
- Mic clipping and saturation.
- DC bias causing a dominant 0 Hz bin.
- DSP task falls behind.
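For the clipping case, a per-block counter is enough to drive an on-screen indicator. The threshold just below 16-bit full scale and the function name are assumptions for illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define CLIP_THRESHOLD 32700  /* assumed margin just below int16 full scale */

/* Count samples at or beyond the clip threshold in either direction. */
static size_t count_clipped(const int16_t *samples, size_t n)
{
    size_t clipped = 0;
    for (size_t i = 0; i < n; i++) {
        if (samples[i] >= CLIP_THRESHOLD || samples[i] <= -CLIP_THRESHOLD) {
            clipped++;
        }
    }
    return clipped;
}
```

A nonzero count for a block can light the clipping indicator for a few frames, which prompts the user to lower the gain.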
3.7 Real World Outcome
A successful build shows a stable spectrum where tones appear at expected frequencies, with smooth motion and minimal flicker. The UI should report buffer drops as zero during normal use.
3.7.1 How to Run (Copy/Paste)
idf.py set-target esp32s3
idf.py build
idf.py -p /dev/ttyUSB0 flash monitor
3.7.2 Golden Path Demo (Deterministic)
- Use a phone app to generate a 1 kHz sine tone.
- Set gain to 0 dB.
- Expect strongest bar near 1 kHz with low noise elsewhere.
Failure demo (deterministic):
- Disconnect the microphone (or disable I2S) and start the analyzer. Expected: UI shows “I2S ERROR,” spectrum bars freeze at zero, and the system logs an error. Exit code: 2.
3.7.3 If CLI: exact terminal transcript
I (3200) audio: sr=16000 fft=1024
I (3201) dsp: frame=120 fps=20 drops=0
I (3202) ui: peak=1kHz -12dB
Exit codes: 0 = success, 2 = I2S init/capture error, 3 = DSP buffer overrun.
3.7.4 If Web App
Not applicable.
3.7.5 If API
Not applicable.
3.7.6 If Library
Not applicable.
3.7.7 If GUI / Desktop / Mobile
Not applicable.
3.7.8 If TUI
+----------------------------+
| Spectrum Analyzer |
| 1kHz: ████████████ |
| 2kHz: ██ |
| 4kHz: █ |
| Gain: 0dB Drops:0 |
+----------------------------+
4. Solution Architecture
4.1 High-Level Design
[I2S DMA] -> [Sample Buffer] -> [FFT Task] -> [Magnitude Buffer]
                                                      |
                                                      v
                                                  [UI Task]
4.2 Key Components
| Component | Responsibility | Key Decisions |
|---|---|---|
| I2S driver | Capture PCM samples | Sample rate and buffer sizes |
| FFT engine | Compute spectrum | Window size and scaling |
| UI renderer | Draw bars | Update cadence and smoothing |
| Gain control | Adjust sensitivity | Fixed-point gain scale |
4.3 Data Structures (No Full Code)
typedef struct {
    float mags[512];
    float peak[512];
} spectrum_frame_t;
4.4 Algorithm Overview
Key Algorithm: FFT with Windowing
- Apply window to samples.
- Run FFT.
- Compute magnitudes and smooth.
Complexity Analysis:
- Time: O(N log N)
- Space: O(N)
5. Implementation Guide
5.1 Development Environment Setup
idf.py set-target esp32s3
idf.py build
5.2 Project Structure
project-root/
├── main/
│ ├── audio_capture.c
│ ├── fft.c
│ ├── ui.c
│ └── config.c
└── README.md
5.3 The Core Question You’re Answering
“How can I analyze audio in real time without losing samples or freezing the UI?”
5.4 Concepts You Must Understand First
- I2S DMA buffering.
- FFT and windowing.
- UI frame budgeting.
5.5 Questions to Guide Your Design
- What FFT size balances resolution and speed?
- How will you smooth magnitudes to reduce flicker?
- What is the acceptable latency for updates?
5.6 Thinking Exercise
Sketch how many samples you need for 10 Hz resolution at 16 kHz, and compute the resulting latency.
5.7 The Interview Questions They Will Ask
- Why is DMA essential for audio capture?
- What is spectral leakage and how do you reduce it?
- How does FFT size affect latency?
5.8 Hints in Layers
Hint 1: Display raw waveform first.
Hint 2: Add FFT and show a single bar.
Hint 3: Add smoothing and UI cadence.
5.9 Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| DSP fundamentals | Scientist & Engineer’s Guide to DSP | FFT chapters |
| Embedded timing | Making Embedded Systems | Ch. 5–7 |
5.10 Implementation Phases
Phase 1: Audio Capture (4–5 days)
Phase 2: FFT + Visualization (5–7 days)
Phase 3: Calibration + UI polish (5–7 days)
5.11 Key Implementation Decisions
| Decision | Options | Recommendation | Rationale |
|---|---|---|---|
| FFT size | 512, 1024, 2048 | 1024 | Good resolution vs CPU |
| Window | Hann, Hamming | Hann | Balanced leakage reduction |
| UI FPS | 10, 20, 30 | 15 | Smooth enough, low CPU |
6. Testing Strategy
6.1 Test Categories
| Category | Purpose | Examples |
|---|---|---|
| Unit Tests | FFT correctness | sine wave peak |
| Integration Tests | capture -> FFT | stable bins |
| Edge Tests | clipping | saturated samples |
6.2 Critical Test Cases
- 1 kHz tone maps to correct bin.
- Silence shows low magnitudes.
- Overrun counters remain at zero during 10-min run.
6.3 Test Data
Generated 1 kHz sine samples
White noise input
7. Common Pitfalls & Debugging
7.1 Frequent Mistakes
| Pitfall | Symptom | Solution |
|---|---|---|
| Wrong sample format | DC offset peak | Shift/normalize PCM |
| No window | smeared peaks | Apply Hann window |
| Too frequent redraw | UI stutter | Reduce FPS |
7.2 Debugging Strategies
- Log DSP runtime vs buffer duration.
- Plot FFT output on serial for verification.
7.3 Performance Traps
- Large FFT sizes without enough CPU headroom.
8. Extensions & Challenges
8.1 Beginner Extensions
- Add peak-hold markers.
8.2 Intermediate Extensions
- Log spectrum to CSV on SD.
8.3 Advanced Extensions
- Add octave band aggregation.
9. Real-World Connections
9.1 Industry Applications
- Audio diagnostics and monitoring.
- Acoustic field measurements.
9.2 Related Open Source Projects
- Mini Spectrum Analyzer projects using ESP32 I2S.
9.3 Interview Relevance
- DSP pipelines, real-time constraints, and buffer management.
10. Resources
10.1 Essential Reading
- The Scientist and Engineer’s Guide to DSP – FFT, windowing.
10.2 Video Resources
- FFT visualization tutorials.
10.3 Tools & Documentation
- ESP-IDF I2S driver docs.
10.4 Related Projects in This Series
- P01-wifi-packet-sniffer-network-analyzer.md – pipeline similarities.
- P08-complete-cardputer-security-toolkit.md – integrated audio module.
11. Self-Assessment Checklist
11.1 Understanding
- I can explain I2S and DMA capture.
- I can explain FFT and windowing.
11.2 Implementation
- Spectrum updates smoothly without drops.
- Bin peaks match known tones.
11.3 Growth
- I can tune buffer sizes for latency vs CPU.
12. Submission / Completion Criteria
Minimum Viable Completion:
- Capture audio and render basic FFT bars.
Full Completion:
- Smooth UI, gain control, and no buffer overruns.
Excellence (Going Above & Beyond):
- Octave-band display and CSV export.