Project 3: Real-Time Audio Spectrum Analyzer

Build a handheld audio spectrum analyzer that captures microphone input over I2S, performs FFT analysis, and renders smooth frequency bars in real time.

Quick Reference

Attribute	Value
Difficulty	Advanced
Time Estimate	2–3 weeks
Main Programming Language	C/C++ (ESP-IDF)
Alternative Programming Languages	Arduino
Coolness Level	Very High
Business Potential	Medium (audio diagnostics, field meters)
Prerequisites	DSP basics, I2S familiarity, basic UI rendering
Key Topics	I2S DMA, FFT, windowing, real-time UI pipelines

1. Learning Objectives

By completing this project, you will:

Configure I2S audio capture with DMA buffers and stable sampling rates.
Implement FFT analysis with windowing and magnitude scaling.
Design a real-time UI that updates smoothly under CPU constraints.
Calibrate microphone input and avoid clipping or DC bias.
Build a reliable audio pipeline that tolerates load without glitches.

2. All Theory Needed (Per-Concept Breakdown)

2.1 I2S Audio Capture and DMA Buffering

Fundamentals

I2S is a digital audio bus that transmits PCM samples with a bit clock and word select signal. For an audio analyzer, you need stable sampling: each buffer must contain equally spaced samples, and the CPU must receive those buffers without loss. This is why DMA (Direct Memory Access) is essential. DMA moves audio data from the I2S peripheral into memory without CPU intervention, then raises an interrupt when a buffer is ready. Your task is to configure sample rate, bit depth, channel format, and DMA buffer size so that you get continuous, predictable streams of samples.

Deep Dive into the concept

The I2S peripheral on ESP32-S3 can be configured for master receive mode, with the MCU generating clocks or accepting an external clock. For the Cardputer’s MEMS microphone, you typically generate the clock and read 16-bit mono PCM at 16 kHz or 22.05 kHz. The sample rate is critical because it sets your FFT frequency resolution: with 16 kHz and a 1024-sample FFT, your bin size is about 15.6 Hz. DMA buffer size must align with your FFT window. For example, if you want 1024 samples, you might configure two DMA buffers of 512 samples each and assemble them. If you choose too small a buffer, CPU overhead increases; too large and latency increases. The key is to balance latency, CPU load, and FFT update rate.
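
The arithmetic in this paragraph can be checked with a few helper functions. This is a minimal sketch in portable C; the function names are illustrative and not part of ESP-IDF.

```c
#include <assert.h>

/* Width of one FFT bin in Hz: sample_rate / fft_size. */
double bin_width_hz(double sample_rate_hz, int fft_size)
{
    assert(fft_size > 0);
    return sample_rate_hz / (double)fft_size;
}

/* Time needed to fill a buffer of `samples` samples, in milliseconds. */
double buffer_duration_ms(double sample_rate_hz, int samples)
{
    assert(sample_rate_hz > 0.0);
    return 1000.0 * (double)samples / sample_rate_hz;
}

/* Index of the FFT bin closest to a given frequency. */
int nearest_bin(double freq_hz, double sample_rate_hz, int fft_size)
{
    return (int)(freq_hz / bin_width_hz(sample_rate_hz, fft_size) + 0.5);
}
```

At 16 kHz with a 1024-sample FFT, the bin width is 15.625 Hz, each 512-sample DMA buffer takes 32 ms to fill, and a full window takes 64 ms. That 64 ms is the deadline the DSP task has to meet for every frame.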

DMA memory must be allocated in a region compatible with DMA. ESP-IDF provides heap_caps_malloc with flags like MALLOC_CAP_DMA. If you allocate buffers in the wrong region, transfers may fail or produce corrupted data. Another subtlety is sample format: some MEMS mics output left-justified data with a fixed offset. You may need to shift or sign-extend samples to produce correct signed PCM values. If you ignore this, your spectrum will look wrong or show a strong DC component.
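
As an illustration of the shift and sign-extension step, the sketch below assumes a microphone that delivers 24 significant bits left-justified in a 32-bit slot. That layout is an assumption made for the example, not a statement about the Cardputer's microphone; check the datasheet for the real format. The function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed layout: 24 data bits left-justified in a 32-bit slot, with the
 * low 8 bits used as padding. Verify against the microphone datasheet. */
int16_t slot32_to_pcm16(int32_t slot)
{
    /* Keep the top 16 bits. Right-shifting a negative signed value is
     * implementation-defined in C; GCC and Clang perform an arithmetic
     * shift, which preserves the sign. */
    return (int16_t)(slot >> 16);
}

/* Convert a whole block of raw slots into signed 16-bit PCM. */
void convert_block(const int32_t *slots, int16_t *pcm, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        pcm[i] = slot32_to_pcm16(slots[i]);
    }
}
```

Keeping only the top 16 bits discards the least significant data bits, which is usually acceptable for a bar display. Keep more bits if you need the extra dynamic range.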

The capture callback should not do FFTs. It should simply enqueue buffer pointers or copy samples into a circular buffer and return. A separate DSP task can process sample blocks at a fixed cadence. You also need to handle buffer overrun: if the DSP task falls behind, you should drop the oldest buffer rather than blocking I2S. This keeps your audio pipeline stable even when the UI task is busy. For debugging, log the rate of DMA buffer completions and the DSP processing time; if processing time exceeds buffer duration, you will inevitably fall behind.
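
The drop-oldest policy can be modeled with a small bounded ring of buffer pointers. This is a single-threaded sketch with hypothetical names and a hypothetical depth. In firmware, the same idea would need a critical section or a FreeRTOS queue around it, because the producer and consumer run in different contexts.

```c
#include <stddef.h>
#include <stdint.h>

#define QUEUE_DEPTH 4  /* hypothetical depth; tune to your latency budget */

typedef struct {
    const int16_t *slots[QUEUE_DEPTH]; /* pointers to filled capture buffers */
    size_t head;                       /* index of the oldest entry */
    size_t count;                      /* number of queued entries */
    uint32_t dropped;                  /* overrun counter for diagnostics */
} block_queue_t;

/* Producer side: never blocks. When full, discards the oldest entry. */
void queue_push(block_queue_t *q, const int16_t *buf)
{
    if (q->count == QUEUE_DEPTH) {
        q->head = (q->head + 1) % QUEUE_DEPTH;
        q->count--;
        q->dropped++;
    }
    q->slots[(q->head + q->count) % QUEUE_DEPTH] = buf;
    q->count++;
}

/* Consumer side: returns NULL when nothing is queued. */
const int16_t *queue_pop(block_queue_t *q)
{
    if (q->count == 0) {
        return NULL;
    }
    const int16_t *buf = q->slots[q->head];
    q->head = (q->head + 1) % QUEUE_DEPTH;
    q->count--;
    return buf;
}
```

The `dropped` counter is the number to surface in the diagnostics overlay. A value that keeps climbing means DSP processing time exceeds buffer duration.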

How this fits in projects

This capture pipeline mirrors the WiFi sniffer pipeline in P01-wifi-packet-sniffer-network-analyzer.md and informs the concurrency discipline in P08-complete-cardputer-security-toolkit.md.

Definitions & key terms

I2S → serial bus for audio PCM data
DMA → hardware data transfer without CPU
Sample rate → number of samples per second
PCM → raw audio sample format
Buffer overrun → data arrives faster than consumer can process

Mental model diagram (ASCII)

[Mic] -> [I2S] -> [DMA Buffers] -> [DSP Task] -> [FFT] -> [UI]

How it works (step-by-step, with invariants and failure modes)

Configure I2S for desired sample rate and format.
DMA fills buffer and signals completion.
ISR or driver callback posts buffer to DSP queue.
DSP task assembles window and runs FFT.
UI task renders magnitude bars.

Invariants: buffers are DMA-capable; sample rate stable; queue bounded. Failure modes: buffer overrun drops samples; incorrect sample format causes distorted spectrum; blocking in callback stalls capture.

Minimal concrete example

/* Legacy driver call (driver/i2s.h); ESP-IDF 5.x offers i2s_channel_read() instead. */
i2s_read(I2S_NUM_0, dma_buf, buf_bytes, &bytes_read, portMAX_DELAY);

Common misconceptions

“I can read samples with polling.” → Polling ties up the CPU and misses samples whenever another task runs.
“Any buffer size works.” → Buffer size determines latency and CPU load.
“All mics output standard PCM.” → Some mics output left-justified or biased samples.

Check-your-understanding questions

Why must DMA buffers be allocated in DMA-capable memory?
How does sample rate affect FFT bin size?
What happens if DSP processing time exceeds buffer duration?

Check-your-understanding answers

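The DMA engine can reach only certain memory regions; a buffer outside them causes failed or corrupted transfers.
For a fixed FFT size, a higher sample rate raises the maximum representable frequency and widens bin spacing (bin_size = sample_rate / FFT_size).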
Buffers accumulate and you eventually drop samples.

Real-world applications

Audio spectrum monitoring for sound engineers.
Field diagnostics for noise and vibration analysis.

Where you’ll apply it

This project: see §4.1 and §5.10 for buffer strategy.
Also used in: P08-complete-cardputer-security-toolkit.md.

References

ESP-IDF I2S driver documentation.
The Scientist and Engineer’s Guide to DSP – sampling fundamentals.

Key insight

Real-time audio depends on a stable capture pipeline; if capture wobbles, everything else fails.

Summary

Configure I2S + DMA carefully, keep callbacks short, and match buffers to your FFT window.

Homework/Exercises to practice the concept

Capture 1 second of audio and print min/max sample values.
Change sample rate and observe FFT bin spacing.

Solutions to the homework/exercises

Read samples, track max/min, and verify they stay well inside the signed 16-bit range (-32768 to 32767); values pinned at either limit indicate clipping.
Compute bin_size = sample_rate / FFT_size and compare plots.

2.2 FFT, Windowing, and Magnitude Scaling

Fundamentals

The FFT converts time-domain samples into frequency-domain magnitudes. A naive FFT without windowing assumes the signal is periodic within the buffer. Real audio signals are not perfectly periodic, so you must apply a window function (Hann, Hamming) to reduce spectral leakage. After the FFT, you compute magnitudes, often convert to logarithmic (dB) scale, and map them to display bars. The key is to produce stable, readable output without jitter or aliasing.

Deep Dive into the concept

FFT size sets your frequency resolution and processing cost. A 1024-point FFT gives reasonable resolution at 16 kHz (about 15.6 Hz per bin). A 2048-point FFT halves the bin width to about 7.8 Hz, but the window takes twice as long to fill (128 ms instead of 64 ms) and the transform costs more CPU. You need to choose a size that fits the device’s processing budget. Windowing is critical: without it, a single tone spreads energy into neighboring bins (leakage). The Hann window is a good compromise because it suppresses sidelobes strongly at the cost of a somewhat wider main lobe. You multiply each sample by the window before the FFT. This costs CPU but improves clarity.

Magnitude computation requires converting complex FFT outputs into real magnitudes: sqrt(re^2 + im^2) or a faster approximation. Then you often apply logarithmic scaling because human perception of loudness is logarithmic. You can compute 20*log10(mag) and clamp it to a range. For embedded performance, you might precompute a log lookup table or use a fixed-point approximation. You also should average or smooth magnitudes across frames to reduce flicker. A simple exponential moving average (EMA) per bin can stabilize the display.

Aliasing and Nyquist are fundamental. Your maximum representable frequency is half the sample rate. For real-valued input, the FFT bins above N/2 mirror the bins below it and carry no new information, which is why the magnitude array holds only N/2 entries. At 16 kHz the Nyquist limit is 8 kHz, so a 10 kHz tone cannot be represented at all: unless it is filtered out before sampling, it folds back and appears at 6 kHz. The MEMS mic and I2S configuration should provide a low-pass response that attenuates content above Nyquist. For display, you may want to group bins into bands (e.g., 8 or 16 bars) using a logarithmic frequency scale. This matches human hearing and makes the display more intuitive.
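
One simple way to get a logarithmic frequency axis is to make each band one octave wide, so band k covers bins 2^k up to 2^(k+1). The sketch below takes the largest magnitude in each band; averaging is an equally valid choice. The function name is hypothetical.

```c
#include <stddef.h>

/* Collapse n_bins linear FFT bins into octave-wide bands.
 * Band k covers bins [2^k, 2^(k+1)), so bin 0 (DC) is skipped.
 * Each band reports the largest magnitude it contains.
 * Returns the number of bands written (at most max_bands). */
size_t group_octave_bands(const float *mags, size_t n_bins,
                          float *bands, size_t max_bands)
{
    size_t band = 0;
    for (size_t lo = 1; lo < n_bins && band < max_bands; lo *= 2, band++) {
        size_t hi = lo * 2;
        if (hi > n_bins) {
            hi = n_bins;
        }
        float peak = mags[lo];
        for (size_t i = lo + 1; i < hi; i++) {
            if (mags[i] > peak) {
                peak = mags[i];
            }
        }
        bands[band] = peak;
    }
    return band;
}
```

With 512 magnitude bins this yields 9 bands. At 16 kHz and a 1024-point FFT, a 1 kHz tone (bin 64) lands in band 6, which spans 1 kHz to 2 kHz. For 16 bars you would need narrower bands, such as half-octave spacing.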

Finally, normalization matters. If your FFT magnitude is unscaled, loud signals saturate and quiet signals disappear. You can normalize by the FFT size and window gain. Use a calibrated reference tone (e.g., 1 kHz) to set display levels. Provide an on-screen “gain” setting to adjust sensitivity in the field.
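
To make the normalization concrete: for a sinusoid of amplitude A centered on a bin, an unscaled FFT of the windowed signal produces a peak magnitude of roughly A * sum(w) / 2, where sum(w) is the sum of the window coefficients (N/2 for a Hann window). The sketch below inverts that relationship. It assumes the FFT routine applies no scaling of its own and does not apply to the DC bin; the names are illustrative.

```c
#include <assert.h>
#include <stddef.h>

/* Sum of window coefficients; dividing by this compensates for both
 * the FFT length and the attenuation introduced by the window. */
float window_sum(const float *w, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++) {
        sum += w[i];
    }
    return sum;
}

/* Estimate sinusoid amplitude from the raw peak-bin magnitude.
 * The factor 2 accounts for the energy split between the positive
 * and negative frequency halves of a real-input FFT. */
float normalized_amplitude(float raw_mag, float win_sum)
{
    assert(win_sum > 0.0f);
    return 2.0f * raw_mag / win_sum;
}
```

Compute `window_sum` once when the window table is built. The on-screen gain setting then becomes a simple multiplier applied after this normalization.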

How this fits in projects

FFT analysis is specific to this project, but the idea of buffering, processing, and UI rendering in real time appears throughout the series, especially in P01-wifi-packet-sniffer-network-analyzer.md and P08-complete-cardputer-security-toolkit.md.

Definitions & key terms

FFT → fast algorithm to compute frequency spectrum
Windowing → weighting samples to reduce spectral leakage
Spectral leakage → energy spread across bins due to non-periodic signals
Nyquist frequency → half the sample rate
EMA → exponential moving average for smoothing

Mental model diagram (ASCII)

[Samples] -> [Window] -> [FFT] -> [Magnitude] -> [Smoothing] -> [Bars]

How it works (step-by-step, with invariants and failure modes)

Collect N samples in a buffer.
Apply window function to each sample.
Run FFT to produce complex frequency bins.
Compute magnitudes and scale to dB.
Smooth magnitudes and map to UI bars.

Invariants: FFT size matches buffer size; window coefficients correct; magnitude scaling stable. Failure modes: no window leads to smeared peaks; wrong scaling causes flicker or saturation; insufficient smoothing yields jitter.

Minimal concrete example

for (int i = 0; i < N; i++) win[i] = samples[i] * hann[i];  /* window each sample before the FFT */

Common misconceptions

“FFT output is directly audible volume.” → You must scale and interpret it.
“Windowing is optional.” → Without it, spectral leakage dominates.
“More bins always look better.” → More bins make the display noisier and raise CPU load.

Check-your-understanding questions

Why is windowing needed for FFT of real signals?
What is the Nyquist limit for a 16 kHz sample rate?
Why use logarithmic scaling for display?

Check-your-understanding answers

Real signals are not periodic in the buffer; windowing reduces leakage.
8 kHz.
Human perception of loudness is logarithmic.

Real-world applications

Audio visualization and diagnostics.
Vibration analysis in industrial monitoring.

Where you’ll apply it

This project: see §4.4 algorithm overview and §5.10 Phase 2.
Also used in: P08-complete-cardputer-security-toolkit.md for audio modules.

References

The Scientist and Engineer’s Guide to DSP – FFT and windowing chapters.
ARM CMSIS DSP FFT documentation (conceptual references).

Key insight

A spectrum display is only as good as its windowing, scaling, and smoothing choices.

Summary

FFT transforms time samples into frequency bins; windowing and smoothing are what make the visualization meaningful.

Homework/Exercises to practice the concept

Generate a 1 kHz sine wave and verify its FFT bin.
Compare FFT results with and without a window.

Solutions to the homework/exercises

At 16 kHz and 1024 samples, the peak should be near bin 64.
Without a window, you’ll see leakage into adjacent bins.

2.3 Real-Time UI Rendering and Frame Budgeting

Fundamentals

Rendering on a small SPI TFT is expensive. Each pixel update consumes SPI bandwidth and CPU time. To keep the spectrum smooth, you must control how often you draw, how much you redraw, and how you synchronize drawing with data updates. A real-time UI budget is a fixed time slice each frame; if you exceed it, the UI stutters and the capture pipeline suffers. The solution is to decouple rendering from DSP and use partial redraws or a dirty-rectangle approach.

Deep Dive into the concept

The ST7789 display uses SPI, which means every pixel drawn travels over the bus. At 240x135, a full-screen redraw at 16-bit color is about 65 KB (240 × 135 × 2 = 64,800 bytes); at 20 FPS, that is roughly 1.3 MB/s. The ESP32-S3 can handle this, but only if you avoid blocking other tasks. The correct approach is to draw only what changes. For a spectrum analyzer, the bars are the main dynamic element; you can draw a background once and then update bar regions each frame. Use a fixed render cadence (e.g., 10–20 FPS) to balance smoothness and CPU load.
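
The bandwidth figures above are easy to recompute for other region sizes. In this sketch, the 40 MHz SPI clock used in the usage note is an assumption, not a measured Cardputer setting, and the transfer time ignores command overhead and gaps between transactions.

```c
#include <stdint.h>

/* Bytes needed to redraw a rectangular region. */
uint32_t region_bytes(uint32_t width, uint32_t height, uint32_t bytes_per_pixel)
{
    return width * height * bytes_per_pixel;
}

/* Sustained bus throughput required at a given frame rate. */
uint32_t bytes_per_second(uint32_t frame_bytes, uint32_t fps)
{
    return frame_bytes * fps;
}

/* Raw transfer time in microseconds at a given SPI clock.
 * Uses 64-bit math so the intermediate product cannot overflow. */
uint32_t spi_transfer_us(uint32_t bytes, uint32_t spi_clock_hz)
{
    return (uint32_t)(((uint64_t)bytes * 8u * 1000000u) / spi_clock_hz);
}
```

A full 240x135 frame is 64,800 bytes, which comes to 1,296,000 bytes per second at 20 FPS. At an assumed 40 MHz clock, that frame occupies the bus for about 13 ms out of each 50 ms frame period, before any driver overhead. Redrawing only the bar regions shrinks that share in proportion to the area touched.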

Synchronization between DSP and UI matters. If you render while DSP is updating the same data structure, you can read inconsistent values. Use double-buffered magnitude arrays or a snapshot mechanism: the DSP task writes into a buffer, then flips a pointer or posts a message when ready. The UI task reads the latest complete buffer. This avoids partial updates and tearing. If you want smooth decay effects, you can implement a peak-hold or falling-bar algorithm: keep a separate “peak” array that decays slowly, and draw peak markers. This adds polish without heavy CPU cost.
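
The pointer-flip scheme can be sketched as a pair of magnitude arrays with a front index. This is a single-threaded model with hypothetical names. In a two-task firmware, the flip must be atomic or guarded, and the DSP task must not publish twice while the UI is still reading, because the second write would land in the buffer being drawn.

```c
#include <stddef.h>

#define NUM_BINS 512

typedef struct {
    float frames[2][NUM_BINS]; /* two complete magnitude snapshots */
    int front;                 /* index of the snapshot the UI may read */
} mag_double_buffer_t;

/* DSP side: the buffer that is safe to write into. */
float *mag_back(mag_double_buffer_t *db)
{
    return db->frames[1 - db->front];
}

/* DSP side: call once the back buffer is fully written. */
void mag_publish(mag_double_buffer_t *db)
{
    db->front = 1 - db->front;
}

/* UI side: the latest complete snapshot. */
const float *mag_front(const mag_double_buffer_t *db)
{
    return db->frames[db->front];
}
```

Because the UI only ever sees a buffer after `mag_publish`, it never renders a frame that is half old and half new.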

Power and battery are also part of the UI equation. A bright display drains the Cardputer battery quickly. Provide a brightness control and an auto-dim feature when no audio is detected. This reduces power draw and prevents burn-in. If you render too frequently, you increase power consumption and risk overheating; a stable cadence helps.

How this fits in projects

The same UI rendering discipline is used in P01-wifi-packet-sniffer-network-analyzer.md and P06-custom-application-launcher-mini-os.md, where multiple apps share display resources.

Definitions & key terms

Frame budget → time allowed for each UI update
Dirty rectangle → only redraw changed areas
Double-buffering → separate buffers for producer and renderer
Tearing → visual artifact from partial updates

Mental model diagram (ASCII)

[DSP Task] -> [Magnitude Buffer A/B] -> [UI Task] -> [SPI TFT]

How it works (step-by-step, with invariants and failure modes)

DSP task completes FFT and writes magnitudes to buffer A.
DSP task swaps buffer pointer and signals UI task.
UI task draws bars from the stable buffer.
UI task throttles update rate to maintain frame budget.

Invariants: UI uses immutable snapshot; render cadence fixed. Failure modes: UI reads while DSP writes, causing flicker; full-screen redraw causes drops elsewhere.

Minimal concrete example

if (xQueueReceive(fft_ready_q, &buf, 0)) {
    draw_bars(buf);
}

Common misconceptions

“Full-screen redraw is simplest.” → It wastes bandwidth and CPU.
“UI doesn’t affect capture.” → It can starve DSP and I2S.
“Higher FPS is always better.” → 10–20 FPS is sufficient for bars.

Check-your-understanding questions

Why use double-buffering for FFT magnitudes?
What is a dirty rectangle and why does it matter?
How does UI update rate affect power usage?

Check-your-understanding answers

To avoid reading data while DSP is writing, preventing tearing.
It limits redraw to changed areas, reducing SPI bandwidth.
Higher update rates increase CPU and display power consumption.

Real-world applications

Dashboard displays in embedded instrumentation.
Portable meters and oscilloscopes.

Where you’ll apply it

This project: see §4.2 key components and §5.10 Phase 3.
Also used in: P06-custom-application-launcher-mini-os.md and P08-complete-cardputer-security-toolkit.md.

References

ST7789 display driver docs.
Making Embedded Systems – UI timing and performance tradeoffs.

Key insight

A smooth UI is a scheduling problem, not just a drawing problem.

Summary

Render at a fixed cadence, use snapshots, and redraw only what changes to keep the spectrum stable.

Homework/Exercises to practice the concept

Measure frame time for full-screen vs bar-only redraw.
Implement a peak-hold decay effect.

Solutions to the homework/exercises

Use a timer to log render time in microseconds and compare.
Track a peak array: raise an entry immediately when the magnitude exceeds it, otherwise decrement it each frame until it meets the magnitude.
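
One possible shape for the peak-hold solution is shown below. The linear decay step is an assumption; a multiplicative decay or a short hold time before falling would work as well. The function name is illustrative.

```c
#include <stddef.h>

/* Update peak markers: jump up to a louder magnitude at once,
 * otherwise fall by decay_per_frame without going below the bar. */
void peak_hold_update(float *peak, const float *mags, size_t n,
                      float decay_per_frame)
{
    for (size_t i = 0; i < n; i++) {
        if (mags[i] >= peak[i]) {
            peak[i] = mags[i];
        } else {
            peak[i] -= decay_per_frame;
            if (peak[i] < mags[i]) {
                peak[i] = mags[i];
            }
        }
    }
}
```

Call it once per rendered frame so that the fall speed is tied to the UI cadence rather than the FFT rate.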

3. Project Specification

3.1 What You Will Build

A real-time spectrum analyzer that:

captures mic audio via I2S,
computes FFT magnitudes,
renders a bar graph with smoothing and peak hold,
provides a gain control and clipping indicator.

3.2 Functional Requirements

Audio capture: continuous PCM sampling with DMA.
FFT processing: windowed FFT on fixed-size frames.
Visualization: bar graph with smoothing and optional peak hold.
Controls: gain and sensitivity settings.
Diagnostics: show FPS, buffer overruns, and clipping stats.

3.3 Non-Functional Requirements

Performance: FFT and render within frame budget (no stutter).
Reliability: no buffer overruns during 10-minute run.
Usability: clear display and responsive controls.

3.4 Example Usage / Output

1) Boot analyzer
2) Play a 1 kHz tone nearby
3) Observe a strong bar at 1 kHz

3.5 Data Formats / Schemas / Protocols

In-memory magnitude arrays of size N/2.
Optional CSV export of averaged spectrum (timestamp, bin, magnitude).

3.6 Edge Cases

Mic clipping and saturation.
DC bias causing a dominant 0 Hz bin.
DSP task falls behind.
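
The first two edge cases can be handled per block before windowing. The sketch below removes DC bias by subtracting the block mean and counts samples pinned at the 16-bit limits. Block-mean subtraction is one simple option, and a high-pass filter is a common alternative. The function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Subtract the block mean from every sample, writing floats to `out`.
 * Returns the offset that was removed. */
float remove_dc(const int16_t *in, float *out, size_t n)
{
    if (n == 0) {
        return 0.0f;
    }
    int64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        sum += in[i];
    }
    float mean = (float)sum / (float)n;
    for (size_t i = 0; i < n; i++) {
        out[i] = (float)in[i] - mean;
    }
    return mean;
}

/* Count samples sitting at either end of the signed 16-bit range. */
size_t count_clipped(const int16_t *in, size_t n)
{
    size_t clipped = 0;
    for (size_t i = 0; i < n; i++) {
        if (in[i] == INT16_MAX || in[i] == INT16_MIN) {
            clipped++;
        }
    }
    return clipped;
}
```

A nonzero clipped count is a reasonable trigger for the clipping indicator. A large returned mean suggests the sample format or bias handling needs attention.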

3.7 Real World Outcome

A successful build shows a stable spectrum where tones appear at expected frequencies, with smooth motion and minimal flicker. The UI should report buffer drops as zero during normal use.

3.7.1 How to Run (Copy/Paste)

idf.py set-target esp32s3
idf.py build
idf.py -p /dev/ttyUSB0 flash monitor

3.7.2 Golden Path Demo (Deterministic)

Use a phone app to generate a 1 kHz sine tone.
Set gain to 0 dB.
Expect strongest bar near 1 kHz with low noise elsewhere.

Failure demo (deterministic):

Disconnect the microphone (or disable I2S) and start the analyzer. Expected: UI shows “I2S ERROR,” spectrum bars freeze at zero, and the system logs an error. Exit code: 2.

3.7.3 If CLI: exact terminal transcript

I (3200) audio: sr=16000 fft=1024
I (3201) dsp: frame=120 fps=20 drops=0
I (3202) ui: peak=1kHz -12dB

Exit codes: 0 = success, 2 = I2S init/capture error, 3 = DSP buffer overrun.

3.7.4 If Web App

Not applicable.

3.7.5 If API

Not applicable.

3.7.6 If Library

Not applicable.

3.7.7 If GUI / Desktop / Mobile

Not applicable.

3.7.8 If TUI

+----------------------------+
| Spectrum Analyzer          |
| 1kHz: ████████████         |
| 2kHz: ██                   |
| 4kHz: █                    |
| Gain: 0dB  Drops:0         |
+----------------------------+

4. Solution Architecture

4.1 High-Level Design

[I2S DMA] -> [Sample Buffer] -> [FFT Task] -> [Magnitude Buffer]
                                                    |
                                                    v
                                                [UI Task]

4.2 Key Components

Component	Responsibility	Key Decisions
I2S driver	Capture PCM samples	Sample rate and buffer sizes
FFT engine	Compute spectrum	Window size and scaling
UI renderer	Draw bars	Update cadence and smoothing
Gain control	Adjust sensitivity	Fixed-point gain scale

4.3 Data Structures (No Full Code)

typedef struct {
    float mags[512];
    float peak[512];
} spectrum_frame_t;

4.4 Algorithm Overview

Key Algorithm: FFT with Windowing

Apply window to samples.
Run FFT.
Compute magnitudes and smooth.

Complexity Analysis:

Time: O(N log N)
Space: O(N)

5. Implementation Guide

5.1 Development Environment Setup

idf.py set-target esp32s3
idf.py build

5.2 Project Structure

project-root/
├── main/
│   ├── audio_capture.c
│   ├── fft.c
│   ├── ui.c
│   └── config.c
└── README.md

5.3 The Core Question You’re Answering

“How can I analyze audio in real time without losing samples or freezing the UI?”

5.4 Concepts You Must Understand First

I2S DMA buffering.
FFT and windowing.
UI frame budgeting.

5.5 Questions to Guide Your Design

What FFT size balances resolution and speed?
How will you smooth magnitudes to reduce flicker?
What is the acceptable latency for updates?

5.6 Thinking Exercise

Sketch how many samples you need for 10 Hz resolution at 16 kHz, and compute the resulting latency.

5.7 The Interview Questions They Will Ask

Why is DMA essential for audio capture?
What is spectral leakage and how do you reduce it?
How does FFT size affect latency?

5.8 Hints in Layers

Hint 1: Display raw waveform first.

Hint 2: Add FFT and show a single bar.

Hint 3: Add smoothing and UI cadence.

5.9 Books That Will Help

Topic	Book	Chapter
DSP fundamentals	Scientist & Engineer’s Guide to DSP	FFT chapters
Embedded timing	Making Embedded Systems	Ch. 5–7

5.10 Implementation Phases

Phase 1: Audio Capture (4–5 days)

Phase 2: FFT + Visualization (5–7 days)

Phase 3: Calibration + UI polish (5–7 days)

5.11 Key Implementation Decisions

Decision	Options	Recommendation	Rationale
FFT size	512, 1024, 2048	1024	Good resolution vs CPU
Window	Hann, Hamming	Hann	Balanced leakage reduction
UI FPS	10, 20, 30	15	Smooth enough, low CPU

6. Testing Strategy

6.1 Test Categories

Category	Purpose	Examples
Unit Tests	FFT correctness	sine wave peak
Integration Tests	capture -> FFT	stable bins
Edge Tests	clipping	saturated samples

6.2 Critical Test Cases

1 kHz tone maps to correct bin.
Silence shows low magnitudes.
Overrun counters remain at zero during 10-min run.

6.3 Test Data

Generated 1 kHz sine samples
White noise input

7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

Pitfall	Symptom	Solution
Wrong sample format	DC offset peak	Shift/normalize PCM
No window	smeared peaks	Apply Hann window
Too frequent redraw	UI stutter	Reduce FPS

7.2 Debugging Strategies

Log DSP runtime vs buffer duration.
Plot FFT output on serial for verification.

7.3 Performance Traps

Large FFT sizes without enough CPU headroom.

8. Extensions & Challenges

8.1 Beginner Extensions

Add peak-hold markers.

8.2 Intermediate Extensions

Log spectrum to CSV on SD.

8.3 Advanced Extensions

Add octave band aggregation.

9. Real-World Connections

9.1 Industry Applications

Audio diagnostics and monitoring.
Acoustic field measurements.

9.2 Related Open Source Projects

Mini Spectrum Analyzer projects using ESP32 I2S.

9.3 Interview Relevance

DSP pipelines, real-time constraints, and buffer management.

10. Resources

10.1 Essential Reading

The Scientist and Engineer’s Guide to DSP – FFT, windowing.

10.2 Video Resources

FFT visualization tutorials.

10.3 Tools & Documentation

ESP-IDF I2S driver docs.

10.4 Related Projects in This Series

P01-wifi-packet-sniffer-network-analyzer.md – pipeline similarities.
P08-complete-cardputer-security-toolkit.md – integrated audio module.

11. Self-Assessment Checklist

11.1 Understanding

I can explain I2S and DMA capture.
I can explain FFT and windowing.

11.2 Implementation

Spectrum updates smoothly without drops.
Bin peaks match known tones.

11.3 Growth

I can tune buffer sizes for latency vs CPU.

12. Submission / Completion Criteria

Minimum Viable Completion:

Capture audio and render basic FFT bars.

Full Completion:

Smooth UI, gain control, and no buffer overruns.

Excellence (Going Above & Beyond):

Octave-band display and CSV export.
