Project 4: Real-Time Audio Spectrum Analyzer

Project Overview

Attribute        Value
Difficulty       Advanced
Time Estimate    3-4 weeks
Main Language    C
Alternatives     Rust, Arduino C++, MicroPython
Primary Book     The Scientist and Engineer's Guide to DSP by Steven W. Smith
Knowledge Areas  DSP, Audio Processing, I2S, Dual-Core FreeRTOS, DMA

What You'll Build

A device that captures audio through a microphone, performs FFT analysis, and displays a real-time frequency spectrum on an LED matrix or OLED display.

Physical Setup:

  • ESP32 connected to INMP441 (I2S digital microphone) or MAX4466 (analog)
  • 8x32 or 16x16 WS2812B LED matrix, or SSD1306 OLED display
  • Music or voice near the microphone creates dancing visualizations

What You'll See:

LED Matrix (8 frequency bands):
     █
     █     █
█    █  █  █        █
█ █  █  █  █  █  █  █
▔ ▔  ▔  ▔  ▔  ▔  ▔  ▔
20 50 125 315 800 2k 5k 12k Hz

OLED Display (128x64):
┌────────────────────────────────┐
│ Real-Time Audio Spectrum       │
│                                │
│ ▂▃▄▅▆▇█▇▆▅▄▃▂▁▁▂▃▄▅▆▇█▇▆▅▄▃▂▁  │
│                                │
│ Peak: 2.4kHz    dB: -18        │
│ FPS: 45         Clipping: No   │
└────────────────────────────────┘

Learning Objectives

By completing this project, you will be able to:

  1. Sample audio using I2S with DMA for zero-copy continuous capture
  2. Implement the Fast Fourier Transform on resource-constrained hardware
  3. Leverage ESP32's dual-core architecture for parallel processing
  4. Apply digital signal processing concepts: windowing, magnitude calculation, dB scaling
  5. Drive WS2812B LED matrices using the RMT peripheral
  6. Profile and optimize real-time systems to achieve target frame rates
  7. Understand time-frequency domain trade-offs in spectrum analysis

Deep Theoretical Foundation

Digital Audio Fundamentals

Sound is a continuous pressure wave in air. To process it digitally, we must sample it at discrete points in time.

Sampling and Nyquist Theorem

Continuous Sound Wave:

Amplitude
    ^
    │    ╭───╮      ╭───╮      ╭───╮
    │   ╱     ╲    ╱     ╲    ╱     ╲
 0 ─┼──╯───────╲──╯───────╲──╯───────╲─→ Time
    │           ╰╮         ╰╮         ╰
    │             ╰─────────╰───────────
    │

Sampled at discrete points:

    │    •         •         •
    │   •           •         •
    │  •             •         •
 0 ─┼─•───────────────•─────────•──────→ Time
    │                   •
    │                    •

Nyquist Theorem: To accurately represent a frequency f, you must sample at least 2f times per second.

Sample Rate  Maximum Frequency  Common Use
8,000 Hz     4,000 Hz           Telephone
22,050 Hz    11,025 Hz          AM radio quality
44,100 Hz    22,050 Hz          CD quality
48,000 Hz    24,000 Hz          Professional audio

Why 44.1kHz for this project: Human hearing extends to ~20kHz. Sampling at 44.1kHz captures the full audible spectrum with headroom.

Aliasing: What Happens When Nyquist is Violated

If you sample a 15kHz tone at 22kHz (less than 2×15kHz):

Original signal (15kHz):
    ╭─╮ ╭─╮ ╭─╮ ╭─╮ ╭─╮
───╯  ╰╯  ╰╯  ╰╯  ╰╯  ╰───

Sample points at 22kHz:
    •   •   •   •   •   •

What we reconstruct (7kHz!):
      ╭────╮    ╭────╮
────╯      ╰───╯      ╰───

The 15kHz signal "aliases" to 7kHz (22-15=7)

Anti-aliasing filter: Hardware or software filter to remove frequencies above Nyquist before sampling. Many I2S microphones include this internally.

The Fast Fourier Transform (FFT)

The FFT transforms time-domain samples into frequency-domain components. It is the mathematical heart of a spectrum analyzer.

What FFT Computes

Given N real-valued time-domain samples, the FFT returns N complex values, but the upper half mirrors the lower half, so only N/2 frequency "bins" carry unique information:

Time Domain (1024 samples):              Frequency Domain (512 bins):

Amplitude                                Magnitude
    ^                                        ^
    │ ╭╮╭╮   ╭╮╭╮   ╭╮╭╮                    │        █
    │╭╯╰╯╰╮ ╭╯╰╯╰╮ ╭╯╰╯╰╮                   │     █  █  █
 0 ─┤      ╰╯    ╰╯    ╰╯                   │  █  █  █  █  █
    │                                    0 ─┼──────────────────→ Frequency
    │                            FFT         0Hz            22kHz
    └──────────────────────→ Time            │←  bin 0    bin 511 →│

Each bin represents a frequency range:

  • Bin width = Sample Rate / N = 44100 / 1024 = 43.07 Hz per bin
  • Bin 0 = 0 Hz (DC component)
  • Bin 1 is centered at 43 Hz (covering roughly 22-65 Hz)
  • Bin 100 is centered at 4307 Hz (covering roughly 4285-4328 Hz)

FFT Output: Complex Numbers

FFT output is complex: each bin has real and imaginary components.

FFT output for bin k: X[k] = a + bi

Where:
- a = real component (cosine amplitude)
- b = imaginary component (sine amplitude)

Magnitude: |X[k]| = √(a² + b²)
Phase:     ∠X[k] = atan2(b, a)  (not needed for visualization)

For spectrum display, we only care about magnitude: how strong each frequency is.

Why FFT is Fast

DFT (Discrete Fourier Transform):
- Direct calculation: O(N²)
- 1024 samples: ~1 million operations
- Too slow for real-time

FFT (Fast Fourier Transform):
- Divide and conquer: O(N log N)
- 1024 samples: ~10,000 operations
- 100x faster!

The FFT exploits symmetry in the DFT equations using the Cooley-Tukey algorithm (1965).

Windowing: Reducing Spectral Leakage

FFT assumes the input repeats infinitely. But our 1024-sample buffer doesn't perfectly align with signal periods.

Without windowing - discontinuity at edges:

Sample buffer:
│╭───╮   ╭───╮   ╭───│  Discontinuity!
││    ╲ ╱     ╲ ╱    │
││     ╳       ╳     │
│╰───╯ ╰───╯ ╰───╯   │
│←─── 1024 samples ──→│

This discontinuity creates artificial frequencies (spectral leakage)

Window functions taper the signal to zero at the edges:

Hann Window:
    ╭──────────────╮
   ╱                ╲
  ╱                  ╲
 ╱                    ╲
╱________________________╲
│←─── 1024 samples ────→│

Applied to signal:
- Multiplied sample-by-sample
- Edges become zero (no discontinuity)
- Center preserved (minimal signal distortion)

Window       Sidelobe Level  Frequency Resolution  Use Case
Rectangular  -13 dB          Excellent             Testing only
Hann         -31 dB          Good                  General purpose
Hamming      -43 dB          Good                  Speech analysis
Blackman     -58 dB          Poor                  High dynamic range

For spectrum visualizers, the Hann window is a sound default: it offers a good compromise between frequency resolution and sidelobe suppression.

I2S: Digital Audio Interface

I2S (Inter-IC Sound) is a standard for transmitting digital audio between chips.

I2S Signal Lines

BCLK (Bit Clock):     ┌┐┌┐┌┐┌┐┌┐┌┐┌┐┌┐┌┐┌┐┌┐┌┐┌┐┌┐┌┐┌┐
                      └┘└┘└┘└┘└┘└┘└┘└┘└┘└┘└┘└┘└┘└┘└┘└┘

WS (Word Select):     ┌───────────────────┐
      Left channel ───┘                   └───────────────
                                            Right channel

DOUT (Data):          │D15│D14│...│D1│D0│D15│D14│...│D1│D0│
                      │← Left channel →│← Right channel →│

  • BCLK: One clock per bit (e.g., 44100 × 16 × 2 = 1.4 MHz for 16-bit stereo)
  • WS/LRCLK: High = Right channel, Low = Left channel
  • DOUT: Serial audio data, MSB first

I2S on ESP32

ESP32 has two I2S peripherals. For audio input:

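#include "driver/i2s.h"   // Legacy I2S driver API (ESP-IDF v4.x)

i2s_config_t i2s_config = {
    .mode = I2S_MODE_MASTER | I2S_MODE_RX,  // ESP32 drives the clocks and receives data
    .sample_rate = 44100,
    // Note: the INMP441 sends 24-bit samples in 32-bit slots. If 16-bit reads look wrong,
    // try I2S_BITS_PER_SAMPLE_32BIT and shift each sample down in software.
    .bits_per_sample = I2S_BITS_PER_SAMPLE_16BIT,
    .channel_format = I2S_CHANNEL_FMT_ONLY_LEFT,
    .communication_format = I2S_COMM_FORMAT_STAND_I2S,
    .intr_alloc_flags = 0,   // Default interrupt priority
    .dma_buf_count = 8,      // Number of DMA buffers
    .dma_buf_len = 1024,     // Samples per buffer
    .use_apll = true,        // Use APLL for accurate sample rate
};

i2s_pin_config_t pin_config = {
    .bck_io_num = 26,        // Bit Clock
    .ws_io_num = 25,         // Word Select (Left/Right)
    .data_out_num = I2S_PIN_NO_CHANGE,  // RX only; if omitted this field defaults to GPIO 0
    .data_in_num = 33,       // Data input
};

// In your init function: install the driver, then route the pins
ESP_ERROR_CHECK(i2s_driver_install(I2S_NUM_0, &i2s_config, 0, NULL));
ESP_ERROR_CHECK(i2s_set_pin(I2S_NUM_0, &pin_config));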

DMA: Zero-Copy Audio Capture

DMA (Direct Memory Access) moves data between peripherals and memory without CPU involvement.

Without DMA:                        With DMA:

I2S → CPU → RAM                     I2S → DMA → RAM
                                              │
CPU must copy each sample              CPU is free to
(blocks processing)                    run FFT while DMA
                                       fills next buffer

Double Buffering (Ping-Pong)

Time →  ────────────────────────────────────────────────→

DMA:    │ Fill Buffer A │ Fill Buffer B │ Fill Buffer A │
                │                │                │
CPU:            │ Process A      │ Process B      │ Process A
                ▼                ▼                ▼

Buffer A:  [samples...]     (being processed)   [new samples]
Buffer B:  (being filled)    [samples...]       (being filled)

This allows continuous audio capture with no gaps.
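The swap logic behind ping-pong buffering is small. The sketch below is a hardware-independent illustration, not the ESP-IDF driver's internal implementation (the driver manages its own DMA buffers). The type `pingpong_t` and both function names are hypothetical.

```c
#include <stdint.h>

#define BUF_SAMPLES 1024

typedef struct {
    int16_t buf[2][BUF_SAMPLES];
    int fill_idx;               /* index of the buffer the producer is writing */
} pingpong_t;

/* Buffer the producer (e.g. DMA) should currently write into. */
int16_t *pingpong_fill_target(pingpong_t *pp)
{
    return pp->buf[pp->fill_idx];
}

/* Call when the producer has filled its buffer: returns the full buffer
 * for processing and redirects the producer to the other one. */
int16_t *pingpong_swap(pingpong_t *pp)
{
    int16_t *ready = pp->buf[pp->fill_idx];
    pp->fill_idx ^= 1;          /* toggle between 0 and 1 */
    return ready;
}
```

The consumer must finish with the returned buffer before the next swap, which is exactly the "FFT should complete before the next buffer" constraint.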

ESP32 Dual-Core Architecture

ESP32 has two Xtensa LX6 cores. We can dedicate each to specific tasks:

┌──────────────────────────────────────────────────────────────┐
│                         ESP32                                │
├─────────────────────────┬────────────────────────────────────┤
│        Core 0           │            Core 1                  │
├─────────────────────────┼────────────────────────────────────┤
│ Audio Capture Task      │ FFT Processing Task                │
│ - I2S DMA management    │ - Apply window function            │
│ - Buffer ready signal   │ - Compute FFT                      │
│ - Continuous sampling   │ - Calculate magnitudes             │
│                         │ - Map to display                   │
├─────────────────────────┼────────────────────────────────────┤
│                         │ Display Update Task                │
│                         │ - Render LED/OLED                  │
│                         │ - Gamma correction                 │
└─────────────────────────┴────────────────────────────────────┘

Task Pinning

// Pin audio capture to Core 0
xTaskCreatePinnedToCore(
    audio_capture_task,    // Function
    "audio",               // Name
    4096,                  // Stack size
    NULL,                  // Parameters
    10,                    // Priority (high)
    &audio_task_handle,    // Task handle
    0                      // Core 0
);

// Pin FFT processing to Core 1
xTaskCreatePinnedToCore(
    fft_process_task,
    "fft",
    8192,                  // Stack in bytes; keep large FFT buffers static, not on this stack
    NULL,
    5,                     // Medium priority
    &fft_task_handle,
    1                      // Core 1
);

WS2812B LED Matrix Driving

WS2812B LEDs use a single-wire timing-critical protocol. Each bit is encoded by pulse duration:

Bit 0:                    Bit 1:
High: 0.4µs              High: 0.8µs
Low:  0.85µs             Low:  0.45µs

│───────│                │─────────────│
│       └────────────│   │             └──────│
│← 0.4 →│← 0.85 →│       │← 0.8 →│← 0.45 →│

Total bit time: 1.25µs (800kHz data rate)

For an 8×32 matrix (256 LEDs × 24 bits = 6,144 bits):

  • Transmission time: 6,144 × 1.25µs = 7.68ms
  • Maximum refresh rate: ~130 Hz

ESP32 RMT Peripheral

The RMT (Remote Control Transceiver) peripheral generates precise timing for WS2812B:

#include "driver/rmt.h"

// Configure RMT for WS2812B
rmt_config_t config = {
    .rmt_mode = RMT_MODE_TX,
    .channel = RMT_CHANNEL_0,
    .gpio_num = LED_GPIO,
    .clk_div = 2,                    // 80 MHz APB clock / 2 = 40 MHz (25 ns per tick)
    .mem_block_num = 1,
};

// Timing for WS2812B, in RMT ticks at 40 MHz
#define T0H 16  // 0.4µs
#define T1H 32  // 0.8µs
#define T0L 34  // 0.85µs
#define T1L 18  // 0.45µs
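The tick constants above follow directly from the pulse widths and the 40 MHz tick clock. The sketch below shows the conversion and the frame-time arithmetic in portable C; `ns_to_ticks` and `ws2812_frame_us` are hypothetical helpers, and the frame time excludes the reset (latch) gap between frames.

```c
#include <stdint.h>

/* Convert a duration in nanoseconds to timer ticks at tick_hz,
 * rounding to the nearest tick. */
uint32_t ns_to_ticks(uint32_t ns, uint32_t tick_hz)
{
    return (uint32_t)(((uint64_t)ns * tick_hz + 500000000ULL) / 1000000000ULL);
}

/* Time in microseconds to shift out one frame of led_count LEDs:
 * 24 bits per LED at 1.25 us per bit, reset gap not included. */
double ws2812_frame_us(uint32_t led_count)
{
    return led_count * 24 * 1.25;
}
```

For example, `ns_to_ticks(850, 40000000)` gives the T0L value of 34, and `ws2812_frame_us(256)` gives 7680 µs, matching the 7.68 ms figure for the 8×32 matrix.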

Project Specification

Hardware Requirements

Component                Quantity  Purpose
ESP32 DevKit             1         Main MCU
INMP441 I2S Mic          1         Digital audio input
8x32 WS2812B Matrix      1         Spectrum display
5V 3A Power Supply       1         LED power
Level Shifter (3.3V→5V)  1         Data line for LEDs
Capacitor (1000µF)       1         Power smoothing

Wiring Diagram

ESP32 DevKit                    INMP441 Microphone
┌─────────────┐                ┌────────────┐
│      GPIO26 │────────────────│ SCK (BCLK) │
│      GPIO25 │────────────────│ WS (LRCLK) │
│      GPIO33 │────────────────│ SD (DOUT)  │
│        3.3V │────────────────│ VDD        │
│         GND │────────────────│ GND        │
│             │                │ L/R → GND  │ (left channel)
│             │                └────────────┘
│             │
│             │                WS2812B LED Matrix
│             │                ┌────────────────────┐
│      GPIO13 │───[Level]──────│ DIN                │
│             │   Shifter      │                    │
│         GND │────────────────│ GND                │
│             │                └────────────────────┘
│             │
│             │                ← 5V 3A Supply
│             │                ┌────────────────────┐
│             │                │ 5V ──────→ LED VCC │
│             │                │ GND ─────→ LED GND │
│             │                └────────────────────┘
└─────────────┘

Note: Add 1000µF capacitor across LED power rails
      Add 330Ω resistor in series with DIN line

Functional Requirements

  1. Audio Capture
    • 44.1kHz sample rate, 16-bit mono
    • 1024-sample FFT window (23ms latency)
    • Continuous capture via DMA
  2. FFT Processing
    • Apply Hann window to reduce spectral leakage
    • Compute 1024-point FFT
    • Calculate magnitude in dB scale
  3. Frequency Mapping
    • Map 512 bins to 8 display bands (logarithmic)
    • Apply smoothing for pleasant visuals
    • Implement peak hold (optional)
  4. Display Output
    • 30+ FPS refresh rate
    • Rainbow color gradient
    • Gamma correction for LEDs
  5. Performance
    • Total latency < 50ms (audio to display)
    • No audio dropouts
    • CPU usage < 80% per core

Solution Architecture

System Pipeline

┌───────────┐    ┌───────────┐    ┌───────────┐    ┌───────────┐
│ I2S/DMA   │───→│ Window    │───→│   FFT     │───→│ Magnitude │
│ Capture   │    │ Function  │    │ (1024pt)  │    │ Calc      │
└───────────┘    └───────────┘    └───────────┘    └───────────┘
    23ms             1ms             8ms              2ms

                                                         │
                                                         ▼
┌───────────┐    ┌───────────┐    ┌───────────┐    ┌───────────┐
│ LED       │←───│ Color     │←───│ Smoothing │←───│ Bin       │
│ Output    │    │ Mapping   │    │ Filter    │    │ Mapping   │
└───────────┘    └───────────┘    └───────────┘    └───────────┘
    8ms              1ms              1ms              1ms

Total pipeline latency: ~45ms (well under perceptual threshold)

Task Structure

// Core 0: Audio capture
void audio_capture_task(void* param) {
    int16_t samples[1024];
    size_t bytes_read;

    while (1) {
        // DMA fills buffer, blocks until ready
        i2s_read(I2S_NUM_0, samples, sizeof(samples), &bytes_read, portMAX_DELAY);

        // Send to FFT task via queue
        xQueueSend(audio_queue, samples, 0);
    }
}

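// Core 1: FFT processing and display (requires #include "esp_dsp.h")
void fft_process_task(void* param) {
    // Static buffers: together they would overflow the task's 8 KB stack
    static int16_t samples[1024];
    static float fft_data[2 * 1024];  // Interleaved complex: re0, im0, re1, im1, ...
    float magnitudes[8];              // 8 frequency bands

    // Build the FFT twiddle tables once (NULL = let the library allocate them)
    ESP_ERROR_CHECK(dsps_fft2r_init_fc32(NULL, CONFIG_DSP_MAX_FFT_SIZE));

    while (1) {
        // Wait for audio data
        xQueueReceive(audio_queue, samples, portMAX_DELAY);

        // Convert to float, apply window, and zero the imaginary parts
        for (int i = 0; i < 1024; i++) {
            fft_data[2 * i]     = samples[i] * hann_window[i];
            fft_data[2 * i + 1] = 0.0f;
        }

        // Compute the in-place complex FFT, then undo the bit-reversed ordering
        dsps_fft2r_fc32(fft_data, 1024);
        dsps_bit_rev_fc32(fft_data, 1024);

        // Calculate magnitudes for 8 bands from the interleaved re/im pairs
        calculate_band_magnitudes(fft_data, magnitudes);

        // Update LED display
        update_leds(magnitudes);
    }
}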

Frequency Band Mapping

Human hearing is logarithmic. We use logarithmic spacing for natural-looking response:

FFT bin → Frequency → Display Bar

Bin Range    Frequency Range    Bar    Description
─────────    ───────────────    ───    ───────────
1-2          43-86 Hz           0      Sub-bass
3-5          86-215 Hz          1      Bass (kick drum)
6-12         215-516 Hz         2      Low-mid (bass guitar)
13-25        516-1075 Hz        3      Mid (vocals fundamental)
26-50        1075-2150 Hz       4      Upper-mid (presence)
51-100       2150-4300 Hz       5      High-mid (consonants)
101-200      4300-8600 Hz       6      High (sibilance)
201-400      8600-17200 Hz      7      Air (sparkle)
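One way to collapse bins into bars is to average the magnitudes inside each range. The sketch below copies the inclusive bin ranges from the table above; averaging (rather than taking the maximum or the sum) is an assumption made here, and `bins_to_bands` is a hypothetical name.

```c
#define NUM_BANDS 8

/* Inclusive bin ranges copied from the band table. */
static const int band_start[NUM_BANDS] = {1, 3, 6, 13, 26, 51, 101, 201};
static const int band_end[NUM_BANDS]   = {2, 5, 12, 25, 50, 100, 200, 400};

/* Average the per-bin magnitudes inside each band.
 * bin_mag must hold at least 401 entries; band_out holds NUM_BANDS. */
void bins_to_bands(const float *bin_mag, float *band_out)
{
    for (int b = 0; b < NUM_BANDS; b++) {
        float sum = 0.0f;
        for (int k = band_start[b]; k <= band_end[b]; k++) {
            sum += bin_mag[k];
        }
        band_out[b] = sum / (float)(band_end[b] - band_start[b] + 1);
    }
}
```

Averaging keeps wide high-frequency bands from overpowering narrow bass bands. Taking the per-band maximum instead makes isolated tones stand out more, at the cost of a jumpier display.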

Key Data Structures

// Pre-computed Hann window coefficients
float hann_window[1024];  // Computed once at startup

// FFT band configuration
typedef struct {
    uint16_t bin_start;
    uint16_t bin_end;
    float smoothing;      // 0.0 = instant, 0.9 = very smooth
    float peak;           // Peak hold value
} fft_band_t;

fft_band_t bands[8] = {
    {1, 2, 0.7, 0},
    {3, 5, 0.7, 0},
    // ... etc
};

// LED frame buffer
typedef struct {
    uint8_t r, g, b;
} rgb_t;

rgb_t led_buffer[256];  // 8x32 matrix
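The `smoothing` and `peak` fields suggest an exponential moving average with a peak-hold marker. The sketch below shows one possible update rule under that reading; `band_state_t`, `band_update`, and the linear `peak_decay` parameter are assumptions introduced for illustration.

```c
typedef struct {
    float level;   /* smoothed value drawn as the bar height */
    float peak;    /* peak-hold marker drawn above the bar */
} band_state_t;

/* Update one band once per frame.
 * smoothing: 0.0 = instant, 0.9 = very smooth.
 * peak_decay: amount the peak marker falls per frame. */
void band_update(band_state_t *s, float input, float smoothing, float peak_decay)
{
    /* Exponential moving average of the input */
    s->level = smoothing * s->level + (1.0f - smoothing) * input;

    if (s->level > s->peak) {
        s->peak = s->level;          /* new peak: jump up immediately */
    } else {
        s->peak -= peak_decay;       /* otherwise fall slowly */
        if (s->peak < s->level) {
            s->peak = s->level;      /* never drop below the bar itself */
        }
    }
}
```

Higher smoothing values give calmer bars but add visual lag, so the value is worth tuning against the latency target.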

Phased Implementation Guide

Phase 1: Audio Capture (Day 1-3)

Goal: See audio waveform in serial plotter

  1. Configure I2S
    • Set up INMP441 microphone
    • 44.1kHz, 16-bit, mono
    • Verify with serial output
  2. Visualize Raw Samples
    • Print samples to serial
    • Use Arduino Serial Plotter
    • Speak/clap → see waveform
  3. Verify DMA Operation
    • Check no buffer overruns
    • Measure timing consistency
    • Confirm continuous capture

Checkpoint: Serial plotter shows clean audio waveform when you speak

Phase 2: FFT Implementation (Day 4-7)

Goal: See frequency spectrum in serial

  1. Compute Basic FFT
    • Use ESP-DSP library
    • 1024-point FFT
    • Print raw bin values
  2. Add Windowing
    • Pre-compute Hann coefficients
    • Apply before FFT
    • Compare with/without (reduced leakage)
  3. Calculate Magnitudes
    • sqrt(real² + imag²)
    • Convert to dB: 20*log10(mag)
    • Print 8-band summary
  4. Test with Tone Generator
    • Play 440Hz tone from phone
    • Should peak in bin ~10 (440/43)
    • Verify frequency accuracy

Checkpoint: 1kHz tone shows peak in correct frequency range (around bin 23)

Phase 3: Dual-Core Pipeline (Day 8-10)

Goal: Parallel audio capture and FFT processing

  1. Create Task Structure
    • Audio capture on Core 0
    • FFT processing on Core 1
    • Queue for sample transfer
  2. Measure Performance
    • Time each stage
    • FFT should complete before next buffer
    • Check for queue overflows
  3. Handle Edge Cases
    • Queue full → drop oldest buffer
    • FFT too slow → reduce size or optimize
    • Monitor heap usage

Checkpoint: Continuous processing at ~43 buffers per second (1024 samples / 44100 Hz = 23.2ms per buffer)

Phase 4: LED Display (Day 11-14)

Goal: Visualization on LED matrix

  1. Configure RMT for WS2812B
    • Set timing parameters
    • Test with solid color
    • Verify all 256 LEDs work
  2. Implement Bar Graph
    • Map magnitude to bar height
    • Apply logarithmic scaling
    • Add color gradient (rainbow or single-color)
  3. Add Visual Polish
    • Smoothing filter (exponential moving average)
    • Peak hold with decay
    • Gamma correction for LEDs

Checkpoint: Dancing spectrum display responding to music

Phase 5: Optimization (Day 15-21)

Goal: Smooth, responsive, efficient

  1. Profile Performance
    • Measure total pipeline latency
    • Identify bottlenecks
    • Target <50ms total delay
  2. Optimize FFT
    • Use ESP-DSP optimized functions
    • Consider fixed-point math
    • Benchmark alternatives
  3. Reduce Memory Usage
    • Static allocation where possible
    • Share buffers carefully
    • Monitor for leaks
  4. Add Features
    • Multiple visualization modes
    • Sensitivity adjustment
    • Beat detection (bonus)

Testing Strategy

Unit Tests

Component  Test               Expected Result
I2S        Read 1024 samples  Non-zero values
FFT        440Hz input        Peak at bin 10±1
FFT        White noise        Flat spectrum
Window     Apply Hann         Edge samples = 0
LED        Set color          Correct color displayed

Performance Tests

Metric          Target    How to Measure
FFT time        < 15ms    esp_timer_get_time()
Display update  < 10ms    Timer around LED write
Total latency   < 50ms    Clap test (visual delay)
Frame rate      > 30 FPS  Count frames per second
CPU usage       < 80%     vTaskGetRunTimeStats()

Audio Quality Tests

  1. Frequency Accuracy
    • Play known frequencies (100Hz, 1kHz, 10kHz)
    • Verify correct bars light up
  2. Dynamic Range
    • Whisper → quiet bars
    • Loud music → full bars
    • No clipping at max volume
  3. Response Time
    • Sharp transients (clap)
    • Should appear within 2 frames (~66ms)

Common Pitfalls and Debugging

Audio Issues

Problem: No audio input

  • Check I2S pin connections
  • Verify microphone power (3.3V)
  • L/R pin determines channel (try toggling)

Problem: Audio is distorted

  • Check for clipping (samples at ±32767)
  • Reduce gain or add attenuation
  • Verify sample rate matches microphone

Problem: High-frequency noise

  • Add decoupling capacitor near mic (0.1µF)
  • Use shielded wires
  • Check for WiFi interference (disable WiFi if not needed)

FFT Issues

Problem: Spectrum looks wrong

  • Verify windowing is applied
  • Check FFT size matches sample count
  • Ensure correct bin-to-frequency mapping

Problem: All frequencies show same level

  • Check for DC offset (subtract average)
  • Verify magnitude calculation: sqrt(real² + imag²)
  • Window might not be applied
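Subtracting the block average, as suggested for the DC offset check, can be folded into the int-to-float conversion. This is a minimal sketch; `remove_dc` is a hypothetical helper name.

```c
#include <stddef.h>
#include <stdint.h>

/* Convert samples to float while subtracting the block mean,
 * so a constant offset does not pile up in bin 0. */
void remove_dc(const int16_t *in, float *out, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        sum += in[i];
    }

    float mean = (float)(sum / (double)n);
    for (size_t i = 0; i < n; i++) {
        out[i] = (float)in[i] - mean;
    }
}
```

Run this before the window is applied; a large offset otherwise leaks from bin 0 into the lowest display band.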

Display Issues

Problem: LEDs flicker

  • Check power supply (5V 3A is a minimum for 256 LEDs at reduced brightness; at roughly 60mA per LED, full white on every LED would need about 15A)
  • Add capacitor across power rails
  • Reduce brightness if power limited

Problem: Colors are wrong

  • WS2812B is GRB, not RGB
  • Check color order in library
  • Verify gamma correction
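If you drive the LEDs without a library that handles color order, the byte swap has to happen when the frame buffer is serialized. The sketch below reuses the `rgb_t` layout from the Key Data Structures section; `pack_grb` is a hypothetical helper.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint8_t r, g, b;
} rgb_t;

/* Serialize n pixels in the G, R, B byte order that WS2812B expects.
 * out must have room for 3 * n bytes. */
void pack_grb(const rgb_t *px, size_t n, uint8_t *out)
{
    for (size_t i = 0; i < n; i++) {
        out[3 * i + 0] = px[i].g;
        out[3 * i + 1] = px[i].r;
        out[3 * i + 2] = px[i].b;
    }
}
```

A quick check is to set one pixel to pure red: if it lights up green, the order is being applied wrongly (or twice).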

Problem: Only some LEDs work

  • Check data line connections
  • Level shifter may be needed (3.3V → 5V)
  • Test with fewer LEDs first

Extensions and Challenges

Beginner Extensions

  1. Multiple Display Modes
    • Bar graph, waterfall, oscilloscope
    • Button to cycle through modes
  2. Color Themes
    • Rainbow, fire, ocean, custom
    • Store preference in NVS

Intermediate Challenges

  1. Beat Detection
    • Detect kick drum hits
    • Flash LEDs on beat
    • Calculate BPM
  2. OLED Display
    • Alternative to LED matrix
    • Higher resolution spectrum
    • Show peak frequency and dB level

Advanced Challenges

  1. Stereo Analysis
    • Two microphones
    • Compare left/right channels
    • Visualize stereo field
  2. Wireless Audio
    • Bluetooth A2DP sink
    • Analyze streamed audio
    • No microphone needed
  3. Machine Learning
    • Train classifier on ESP32
    • Detect music vs speech
    • Identify specific songs

Real-World Connections

Commercial Products

Product            Your Project Skill
Equalizer apps     FFT analysis, visualization
Guitar tuners      Frequency detection
Smart speakers     Audio processing, DSP
Music visualizers  Real-time graphics

Industry Applications

  • Audio Engineering: Spectrum analyzers, room correction
  • Voice Assistants: Preprocessing before speech recognition
  • Musical Instruments: Electronic effects, synthesizers
  • Environmental Monitoring: Sound level meters, noise detection

Resources

Official Documentation

Resource           URL
ESP-IDF I2S        docs.espressif.com/projects/esp-idf/en/latest/esp32/api-reference/peripherals/i2s.html
ESP-DSP Library    github.com/espressif/esp-dsp
WS2812B Datasheet  cdn-shop.adafruit.com/datasheets/WS2812B.pdf

Books

Book                     Author           Relevant Chapters
DSP Guide                Steven W. Smith  Ch. 8-12: FFT (free online)
Making Embedded Systems  Elecia White     Ch. 6, 8: DMA, Multitasking
Mastering FreeRTOS       FreeRTOS.org     Tasks, Queues (free PDF)

Online Resources

Resource              Description
DSPGuide.com          Free complete DSP textbook
INMP441 Hookup Guide  SparkFun tutorial
FastLED Library       Arduino LED library

Self-Assessment Checklist

Fundamentals

  • I can explain the Nyquist theorem
  • I understand what FFT computes and why it's fast
  • I can describe why windowing reduces spectral leakage
  • I know how DMA enables zero-copy audio capture

Implementation

  • Audio waveform is clean in serial plotter
  • Known frequencies map to correct FFT bins
  • Display achieves 30+ FPS
  • Latency is imperceptible (under the 50ms target)

Code Quality

  • No audio dropouts during operation
  • Memory usage is stable over time
  • Both cores have headroom (<80% usage)
  • Display looks smooth and responsive

Interview Preparation

Be ready to answer these questions:

  1. "Explain how FFT converts time-domain audio to frequency-domain spectrum."
    • Decomposes signal into constituent frequencies, O(n log n), complex output, magnitude = energy
  2. "Why must sample rate be at least 2× the highest frequency?"
    • Nyquist theorem, aliasing if violated, frequencies fold back
  3. "How does I2S DMA work and why is it necessary?"
    • DMA moves data without CPU, prevents sample drops, ping-pong buffering
  4. "How do you divide work between ESP32's two cores?"
    • Pin tasks with xTaskCreatePinnedToCore, queues for communication
  5. "What is spectral leakage and how does windowing fix it?"
    • Discontinuity at buffer edges causes energy spread, window tapers edges to zero
  6. "How do you map 512 FFT bins to 8 display bars?"
    • Logarithmic frequency spacing (humans hear logarithmically), bin averaging

Next Project: P05-ota-smart-home-hub.md - OTA-Updatable Smart Home Hub