Learn Video/Audio Muxing & Demuxing: From Zero to Multimedia Systems Master

Goal: Deeply understand how multimedia data flows through processing pipelines—from raw pixels and audio samples to compressed, synchronized streams in container formats. You’ll master the architecture behind FFmpeg, VLC, and every video processing tool, understanding not just how to use libav libraries, but why containers separate metadata from payloads, how codecs achieve compression, and what happens when you click “play” on a video. After completing these projects, you’ll be able to build your own transcoder, debug A/V sync issues, parse any multimedia format, and read codec specifications with confidence.


Why Video/Audio Muxing & Demuxing Matters

Every second, hundreds of millions of hours of video are streamed globally. The video streaming market reached $811.37 billion in 2025 and is projected to grow to $2.66 trillion by 2032 (Grand View Research). Behind every Netflix stream, YouTube video, and Zoom call lies a sophisticated pipeline of muxing, demuxing, encoding, and decoding.

The Foundation of Modern Media

In 2000, Fabrice Bellard created FFmpeg—a project that would become the most widely adopted multimedia framework in history. Today, FFmpeg powers:

  • Netflix, YouTube, Facebook: Server-side transcoding at massive scale
  • VLC, OBS Studio: Real-time playback and streaming applications
  • Browsers (Chrome, Firefox): HTML5 video via libav integration
  • Mars Perseverance Rover: Image compression before transmitting to Earth (Ant Media)

FFmpeg received unanimous first place in a streaming technology survey due to its “robust and universally-accepted open-source encoding and transcoding” capabilities (The Streaming Company). 79% of video industry developers use H.264/AVC as their primary codec (Uploadcare).

The Muxing/Demuxing Pipeline

Traditional Video Processing (High-Level)         What Actually Happens (Low-Level)
┌──────────────────────────┐                     ┌──────────────────────────┐
│  ffmpeg -i input.mp4     │                     │  1. DEMUX: Parse MP4     │
│    -c:v libx264          │                     │     container, extract   │
│    -c:a aac output.mkv   │                     │     H.264 + AAC packets  │
└──────────────────────────┘                     └────────┬─────────────────┘
                                                          │
"Just works!"                                    ┌────────▼─────────────────┐
                                                 │  2. DECODE: Decompress   │
                                                 │     H.264 NAL units to   │
                                                 │     raw YUV frames       │
                                                 └────────┬─────────────────┘
                                                          │
                                                 ┌────────▼─────────────────┐
                                                 │  3. PROCESS: Apply       │
                                                 │     filters, resize,     │
                                                 │     color correction     │
                                                 └────────┬─────────────────┘
                                                          │
                                                 ┌────────▼─────────────────┐
                                                 │  4. ENCODE: Compress     │
                                                 │     YUV back to H.264    │
                                                 │     with x264 library    │
                                                 └────────┬─────────────────┘
                                                          │
                                                 ┌────────▼─────────────────┐
                                                 │  5. MUX: Write Matroska  │
                                                 │     container with new   │
                                                 │     streams + metadata   │
                                                 └──────────────────────────┘

You understand: "It transcodes video"          You DEEPLY understand: Every byte's journey

Why This Knowledge Matters

For Careers:

  • Multimedia engineers earn $120k-$180k+ (streaming platforms, game engines, video conferencing)
  • Understanding codec internals is a rare, high-value skill (most developers only know how to call FFmpeg CLI)
  • Interview questions focus on A/V sync, container formats, bitstream parsing—concepts these projects teach

For Technical Depth:

  • Video processing is pure systems programming: binary parsing, performance optimization, concurrency
  • Teaches real-world engineering tradeoffs: compression ratio vs speed, quality vs bandwidth
  • Bridges theory and practice: codec specs become tangible through implementation

For Building Real Systems:

  • Custom transcoders for specific workflows (medical imaging, security cameras, drones)
  • Media servers handling millions of concurrent streams
  • Video analytics pipelines (computer vision requires raw frames from demuxing)

Prerequisites & Background Knowledge

Essential Prerequisites (Must Have)

Before starting these projects, you should have:

  1. Strong C programming skills
    • Comfortable with pointers, structs, memory management
    • Experience with binary file I/O (fopen, fread, fwrite)
    • Understanding of malloc/free and avoiding memory leaks
  2. Binary data fundamentals
    • Reading hex dumps and understanding byte layouts
    • Endianness (little-endian vs big-endian)
    • Bitwise operations (&, |, >>, <<)
  3. Basic command-line skills
    • Using FFmpeg CLI (e.g., ffmpeg -i input.mp4 -c:v copy output.mkv)
    • Playing video files in VLC or mpv
    • Inspecting files with hexdump -C or xxd

Helpful But Not Required (You’ll Learn During Projects)

  • Color space theory (RGB, YUV)
  • Compression algorithms (DCT, entropy coding)
  • Threading and concurrency
  • Audio DSP fundamentals

Self-Assessment Questions

Stop and verify you can answer these before starting:

  • C Programming: Can you write a struct parser that reads binary data from a file?
  • Endianness: Why does 0x1234 appear as 34 12 in a hex dump on x86?
  • File I/O: How do you seek to byte offset 1000 in a file?
  • Binary Basics: What’s the difference between reading bytes vs. reading text lines?
  • FFmpeg CLI: Can you extract audio from a video file using ffmpeg?

If you answered “no” to more than 2 questions, start with:

  • “C Programming: A Modern Approach” by K.N. King - Chapters 22-24 (I/O, pointers, memory)
  • “Computer Systems: A Programmer’s Perspective” by Bryant & O’Hallaron - Chapter 2 (binary data representation)
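The endianness self-check above can also be answered in code; this is a minimal probe, not part of any project:

```c
#include <stdint.h>
#include <string.h>

/* On a little-endian CPU the least significant byte is stored at the
 * lowest address, which is why 0x1234 dumps as "34 12" on x86. */
int is_little_endian(void) {
    uint16_t value = 0x1234;
    uint8_t first_byte;
    memcpy(&first_byte, &value, 1);   /* byte at the lowest address */
    return first_byte == 0x34;
}
```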

Development Environment Setup

Required Tools:

# FFmpeg libraries (Ubuntu/Debian)
sudo apt-get install libavcodec-dev libavformat-dev libavutil-dev libswscale-dev libswresample-dev

# For video player project
sudo apt-get install libsdl2-dev

# For analysis
sudo apt-get install ffmpeg mediainfo bsdmainutils   # bsdmainutils provides hexdump

macOS (via Homebrew):

brew install ffmpeg pkg-config sdl2

Recommended Tools:

  • VLC / mpv: For testing output files
  • MediaInfo: Inspect container metadata (mediainfo video.mp4)
  • xxd / hexdump: View raw file bytes
  • ffprobe: Analyze streams (ffprobe -show_streams input.mp4)

Time Investment Expectations

Project Level | Time Estimate | What You’ll Build
Beginner (Projects 1-2) | 1-2 weeks | Simple parsers, raw format converters
Intermediate (Projects 3-4) | 2-4 weeks | Real demuxers, video players
Advanced (Project 5+) | 3-6 weeks | Codec-level parsing, full transcoders

Total mastery: 3-6 months part-time (2-3 hours/day)

Important Reality Check

This is genuinely hard. You will:

  • Spend hours debugging why “everything looks right” but the video won’t play
  • Stare at hex dumps trying to find off-by-one errors
  • Deal with cryptic FFmpeg error messages
  • Read codec specifications that feel like legal documents

But it’s worth it. The moment you:

  • See your first self-decoded video frame render
  • Build a tool that processes video faster than FFmpeg
  • Debug an A/V sync issue in production and know exactly what’s wrong
  • Read VLC source code and understand every line

…you’ll have knowledge that 99% of developers don’t possess.

Core Concept Analysis

To truly understand video/audio muxing and demuxing, you need to grasp these fundamental building blocks:

The Container vs. Codec Distinction

This is the #1 source of confusion for beginners. A video file has two layers:

video.mp4 (the file you see)
│
├─ Container Format (MP4)        ← Metadata, timing, stream organization
│  ├─ Video Stream Metadata      ← Resolution, framerate, duration
│  ├─ Audio Stream Metadata      ← Sample rate, channels, duration
│  └─ Interleaved Packets        ← Actual data chunks with timestamps
│     ├─ Video Packet #1 (PTS=0.0s)
│     ├─ Audio Packet #1 (PTS=0.0s)
│     ├─ Video Packet #2 (PTS=0.033s)
│     └─ Audio Packet #2 (PTS=0.021s)
│
└─ Inside Each Packet:
   ├─ Video: H.264 Codec         ← Compressed video bitstream
   └─ Audio: AAC Codec           ← Compressed audio bitstream

Analogy: The container is like a ZIP file. The codecs are like JPEG or PNG inside the ZIP. FFmpeg demuxes the ZIP (unpacking the streams), then decodes the images (decompressing the data).

Key Concepts Explained

Concept | What It Means | Why It Matters
Container Format | The “box” that holds streams (MP4, MKV, AVI, TS) - stores metadata, timing, and interleaves data | Determines file compatibility, seeking behavior, streaming support
Codec | Algorithm that compresses/decompresses actual video/audio data (H.264, AAC, VP9) | Determines quality, file size, encoding speed
Muxing | Combining multiple streams (video, audio, subtitles) into a single container file | Creates final playable files with synchronized A/V
Demuxing | Extracting individual streams from a container file | First step in playback, transcoding, or analysis
Packets | Chunks of compressed data belonging to a stream | Basic unit of data transfer in multimedia pipelines
Frames | Decoded raw data (pixels for video, samples for audio) | What you actually see/hear after decompression
PTS/DTS | Presentation/Decode Timestamps - controls playback timing and sync | Prevents A/V desync, enables seeking
Bitstream | The raw encoded data format within a codec (NAL units for H.264) | Understanding this lets you parse codec internals

The Full Multimedia Data Hierarchy

┌─────────────────────────────────────────────────────────┐
│                    FILE (video.mp4)                     │  What you download
│  ┌───────────────────────────────────────────────────┐  │
│  │           CONTAINER (MP4 Format)                  │  │  Muxing/Demuxing layer
│  │  ┌─────────────────────────────────────────────┐  │  │
│  │  │  STREAM #0: Video (codec: H.264)            │  │  │
│  │  │    ┌──────────────────────────────────────┐ │  │  │
│  │  │    │  PACKET #0 (PTS=0.0s, 5KB)           │ │  │  │  Transport units
│  │  │    │    ┌──────────────────────────────┐  │ │  │  │
│  │  │    │    │  NAL UNIT: SPS (config)      │  │ │  │  │  Codec bitstream
│  │  │    │    │  NAL UNIT: IDR Slice (data)  │  │ │  │  │
│  │  │    │    └──────────────────────────────┘  │ │  │  │
│  │  │    │         ↓ DECODE                     │ │  │  │
│  │  │    │    ┌──────────────────────────────┐  │ │  │  │
│  │  │    │    │  FRAME: 1920x1080 YUV pixels │  │ │  │  │  Raw data
│  │  │    │    └──────────────────────────────┘  │ │  │  │
│  │  │    └──────────────────────────────────────┘ │  │  │
│  │  └─────────────────────────────────────────────┘  │  │
│  │  ┌─────────────────────────────────────────────┐  │  │
│  │  │  STREAM #1: Audio (codec: AAC)              │  │  │
│  │  │    ┌──────────────────────────────────────┐ │  │  │
│  │  │    │  PACKET #0 (PTS=0.0s, 512B)          │ │  │  │
│  │  │    │    ↓ DECODE                          │ │  │  │
│  │  │    │  1024 PCM samples (float32)          │ │  │  │
│  │  │    └──────────────────────────────────────┘ │  │  │
│  │  └─────────────────────────────────────────────┘  │  │
│  └───────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────┘

Container Formats: The Organizational Layer

Different containers optimize for different use cases:

Container | Extension | Use Case | Characteristics
MP4 | .mp4 | Streaming, mobile, web | Atom-based structure, fast seeking, MPEG-4 Part 14 standard
Matroska (MKV) | .mkv | High-quality archival | Flexible, EBML-based, unlimited tracks, chapter support
MPEG-TS | .ts | Broadcasting, live streaming | Fixed 188-byte packets, error resilience, HLS segments
AVI | .avi | Legacy Windows | Simple RIFF format, limited codec support
WebM | .webm | Web video | Subset of Matroska, VP8/VP9 + Opus, royalty-free

Key Insight: You can have H.264 video in MP4, MKV, TS, or AVI—same codec, different container. Transmuxing (changing container without re-encoding) is fast because you’re just repackaging packets.

Codec Landscape: The Compression Layer

Video Codecs:

Codec Evolution: Compression vs. Complexity

Old (1990s)         Efficient (2000s)       Modern (2010s)       Future (2020s)
┌─────────┐         ┌─────────┐            ┌─────────┐          ┌─────────┐
│ MPEG-1  │  ────▶  │ H.264   │  ────▶     │ H.265   │  ────▶   │  AV1    │
│ MPEG-2  │         │ (AVC)   │            │ (HEVC)  │          │  VVC    │
└─────────┘         └─────────┘            └─────────┘          └─────────┘
  Simple              79% market            50% better            Open, 30%
  Low compression     Universal support     compression          better than
  Fast decode         Hardware everywhere   Patent issues        HEVC

Audio Codecs:

Codec | Use Case | Characteristics
PCM (WAV) | Uncompressed | Raw samples, huge files, no quality loss
MP3 | Music, legacy | Lossy, widely supported, moderate compression
AAC | Streaming, mobile | Lossy, better than MP3, part of MPEG-4
Opus | VoIP, WebRTC | Lossy, best quality at low bitrates, open-source
FLAC | Archival, audiophiles | Lossless compression, 50-60% size reduction

PTS/DTS: The Synchronization System

Every packet has timestamps that control when to decode and present:

Video Stream (with B-frames):
Decode Order (DTS):  I₀  P₁  P₂  B₃  B₄  P₅
Display Order (PTS): I₀  B₃  B₄  P₁  P₂  P₅

Timeline:
DTS: [0ms] [33ms] [66ms] [100ms] [133ms] [166ms]
PTS: [0ms] [100ms] [133ms] [33ms] [66ms] [166ms]
          ↑                        ↑
          Decode I-frame first     Display later

Why DTS ≠ PTS? B-frames (bi-directional) reference future frames, so you must decode P₁ and P₂ before displaying B₃ and B₄.

A/V Sync: Audio PTS and video PTS must align. If video PTS = 1.5s and audio PTS = 1.2s, you have 300ms desync (noticeable drift).
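Containers store PTS as integer ticks in a per-stream time base rather than seconds. The drift calculation above can be sketched as follows; the time-base and PTS values in the usage note are hypothetical, chosen to reproduce the 1.5s/1.2s example, and the local timebase struct mirrors (but is not) libavutil's AVRational:

```c
#include <stdint.h>

/* Timestamps live in per-stream time bases, e.g. 1/90000 ticks for a
 * 90 kHz video clock, or 1/48000 when audio ticks equal the sample rate. */
typedef struct { int num, den; } timebase;

double pts_to_seconds(int64_t pts, timebase tb) {
    return (double)pts * tb.num / tb.den;
}

/* Positive result: video is ahead of audio, in milliseconds. */
double av_drift_ms(int64_t vpts, timebase vtb, int64_t apts, timebase atb) {
    return (pts_to_seconds(vpts, vtb) - pts_to_seconds(apts, atb)) * 1000.0;
}
```

With vpts = 135000 in 1/90000 ticks (1.5 s) and apts = 57600 in 1/48000 ticks (1.2 s), av_drift_ms returns roughly +300 ms, the noticeable drift described above.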

Bitstreams: The Codec’s Internal Language

Inside a video packet lies a bitstream—data laid out in the codec’s own standardized format:

H.264 Packet (Annex B Format):
┌──────────────────────────────────────────────────────┐
│ 00 00 00 01                                          │  Start code (NAL separator)
├──────────────────────────────────────────────────────┤
│ 67 64 00 1F AC D9 40 50 05 BB 01 10 00 00 03 ...    │  NAL Unit: SPS (sequence params)
├──────────────────────────────────────────────────────┤
│ 00 00 00 01                                          │  Start code
├──────────────────────────────────────────────────────┤
│ 68 EE 3C 80                                          │  NAL Unit: PPS (picture params)
├──────────────────────────────────────────────────────┤
│ 00 00 00 01                                          │  Start code
├──────────────────────────────────────────────────────┤
│ 65 88 84 00 33 FF ...                                │  NAL Unit: IDR Slice (actual frame data)
└──────────────────────────────────────────────────────┘

Understanding bitstreams lets you:

  • Parse codec configurations (resolution, profile, level)
  • Identify keyframes (I-frames) for seeking
  • Debug codec errors at the byte level
  • Optimize encoding parameters
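As a small illustration, the Annex B structure above can be probed with a few lines of C. This is a sketch, not a full parser; it only recognizes 4-byte start codes (the 3-byte 00 00 01 form also exists):

```c
#include <stddef.h>
#include <stdint.h>

/* H.264 Annex B separates NAL units with start codes. The low 5 bits of
 * the byte following a start code give the NAL unit type:
 * 7 = SPS, 8 = PPS, 5 = IDR slice. */
int nal_type_at(const uint8_t *buf, size_t len, size_t pos) {
    if (pos + 4 < len && buf[pos] == 0x00 && buf[pos + 1] == 0x00 &&
        buf[pos + 2] == 0x00 && buf[pos + 3] == 0x01)
        return buf[pos + 4] & 0x1F;
    return -1;  /* no 4-byte start code at this offset */
}
```

Applied to the bytes in the diagram: 0x67 & 0x1F = 7 (SPS), 0x68 & 0x1F = 8 (PPS), 0x65 & 0x1F = 5 (IDR slice).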

Concept Summary Table

Concept Cluster | What You Need to Internalize
Container vs. Codec | The container is packaging, the codec is compression. Same codec can live in different containers.
Muxing/Demuxing | Muxing = combine streams into a file. Demuxing = extract streams from a file. Both operate on packets, not pixels.
Packets vs. Frames | Packets are compressed chunks (what’s in the file). Frames are raw data (what you decode to).
PTS/DTS Timing | PTS = when to display. DTS = when to decode. They differ when frames are reordered (B-frames).
Color Spaces | RGB = how computers think. YUV = how video codecs think. Conversion is mandatory.
Bitstreams | The codec’s internal format. Understanding this lets you parse codec metadata and debug encoding issues.

Deep Dive Reading by Concept

This section maps each concept from above to specific book chapters for deeper understanding. Read these before or alongside the projects to build strong mental models.

Container Formats & File I/O

Concept | Book & Chapter
Binary file parsing in C | C Programming: A Modern Approach by K.N. King — Ch. 22: “Input/Output”
Struct layouts and alignment | Computer Systems: A Programmer’s Perspective by Bryant & O’Hallaron — Ch. 3.9: “Heterogeneous Data Structures”
Endianness handling | Computer Systems: A Programmer’s Perspective by Bryant & O’Hallaron — Ch. 2.1: “Information Storage”

Color Spaces & Image Representation

Concept | Book & Chapter
RGB and YUV color models | Computer Graphics from Scratch by Gabriel Gambetta — Ch. 2: “Basic Rendering”
Pixel formats (planar vs. packed) | Digital Video and HD by Charles Poynton — Ch. 3-4

Compression & Codecs

Concept | Book & Chapter
H.264 fundamentals and NAL units | H.264 and MPEG-4 Video Compression by Iain Richardson — Ch. 1-5
Video compression basics | Video Demystified by Keith Jack — Ch. 8-9

Multimedia Pipelines

Concept | Book & Chapter
FFmpeg libav architecture | FFmpeg libav tutorial — Complete walkthrough
Audio/Video synchronization | Video Demystified by Keith Jack — Ch. 11: “Timing and Synchronization”

Essential Reading Order

For maximum comprehension, read in this order:

  1. Foundation (Week 1):
    • C Programming: A Modern Approach Ch. 22 (binary file I/O)
    • Computer Systems: A Programmer’s Perspective Ch. 2 (data representation)
  2. Multimedia Basics (Week 2):
    • FFmpeg libav tutorial (pipeline and container overview)
  3. Codec Internals (Week 3-4):
    • H.264 and MPEG-4 Video Compression Ch. 1-5 (as you build Projects 3-5)

Quick Start Guide (First 48 Hours)

Feeling overwhelmed? Start here:

Day 1: Get Your Hands Dirty (4 hours)

  1. Install tools (30 minutes)
    # Ubuntu/Debian
    sudo apt-get install ffmpeg mediainfo bsdmainutils   # bsdmainutils provides hexdump
    
  2. Explore a real video file (1 hour)
    # Download a sample video
    wget http://commondatastorage.googleapis.com/gtv-videos-bucket/sample/BigBuckBunny.mp4
    
    # Inspect with mediainfo
    mediainfo BigBuckBunny.mp4
    
    # View container metadata
    ffprobe -show_format -show_streams BigBuckBunny.mp4
    
    # Look at raw bytes
    hexdump -C BigBuckBunny.mp4 | head -50
    
  3. Experiment with FFmpeg (1 hour)
    # Extract video stream only
    ffmpeg -i BigBuckBunny.mp4 -c:v copy -an video_only.mp4
    
    # Extract audio stream only
    ffmpeg -i BigBuckBunny.mp4 -c:a copy -vn audio_only.m4a
    
    # Transmux (change container, don't re-encode)
    ffmpeg -i BigBuckBunny.mp4 -c copy output.mkv
    
    # Re-encode video
    ffmpeg -i BigBuckBunny.mp4 -c:v libx264 -crf 28 smaller.mp4
    
  4. Start Project 1 (1.5 hours)
    • Download a WAV file
    • Read the binary header with C
    • Print sample rate, channels, bit depth

Day 2: Build Something (4 hours)

  1. Finish Project 1 (2 hours)
    • Modify PCM samples (volume change)
    • Write a valid WAV file
  2. Read about containers (1 hour)
  3. Plan your learning path (1 hour)
    • Decide: deep-dive containers (Project 3) or full pipeline (Project 4)?
    • Schedule 2-3 hours/day for next 2 weeks

Path 1: “I Want to Build Production Tools” (Practical)

Goal: Build production-quality tools using libav APIs

Projects: 1 → 2 → 4 → Final Project
Timeline: 6-8 weeks part-time
Best for: Backend engineers, video platform developers

Why this path: Project 4 (Video Player) forces you to learn the entire FFmpeg architecture. You’ll use libavformat, libavcodec, libswscale—the same APIs used by VLC, OBS, and every professional tool.

Path 2: “I Want to Understand Codec Internals” (Advanced)

Goal: Parse bitstreams, understand compression algorithms

Projects: 1 → 2 → 5 → 3 → Final Project
Timeline: 8-12 weeks part-time
Best for: Codec developers, video compression researchers

Why this path: Project 5 (H.264 NAL Parser) teaches you how data is compressed. You’ll read codec specifications and implement parsers from scratch.

Path 3: “I’m a Beginner, Build My Confidence” (Gentle)

Goal: Learn multimedia fundamentals without overwhelming complexity

Projects: 1 → 2 → (stop and practice) → 4
Timeline: 4-6 weeks part-time
Best for: Students, junior developers

Why this path: Projects 1-2 are approachable and give immediate feedback. Take time to internalize concepts before tackling libav APIs.


Project 1: WAV Audio File Parser & Writer

  • File: VIDEO_AUDIO_MUXING_DEMUXING_PROJECTS.md
  • Programming Language: C
  • Coolness Level: Level 2: Practical but Forgettable
  • Business Potential: 2. The “Micro-SaaS / Pro Tool”
  • Difficulty: Level 1: Beginner
  • Knowledge Area: Multimedia / File Formats
  • Software or Tool: WAV Format
  • Main Book: “The Audio Programming Book” by Richard Boulanger

What you’ll build: A C program that reads WAV files, displays header information, manipulates raw PCM audio data, and writes modified WAV files.

Why it teaches multimedia fundamentals: WAV is the “hello world” of container formats—it has a simple, well-documented structure with a header followed by raw audio samples. You’ll learn binary file parsing, endianness handling, and the fundamental concept of separating metadata (container) from payload (audio data).

Core challenges you’ll face:

  • Parsing binary structures (maps to understanding container headers)
  • Handling different sample formats (8-bit, 16-bit, 32-bit float)
  • Understanding sample rate, channels, and bit depth relationships
  • Writing properly formatted binary files

Key Concepts:

  • Binary file I/O in C: “C Programming: A Modern Approach” by K.N. King - Chapter 22 (Input/Output)
  • Endianness and byte ordering: “Computer Systems: A Programmer’s Perspective” by Bryant & O’Hallaron - Chapter 2
  • Audio fundamentals (sample rate, bit depth): “Digital Audio Fundamentals” article on Wikipedia

Difficulty: Beginner
Time estimate: Weekend
Prerequisites: Basic C programming, understanding of binary/hex

Real world outcome:

  • Run your program on any WAV file and see: sample rate, channels, duration, bit depth printed to console
  • Apply effects like volume change, reverse audio, or fade in/out
  • Play your modified WAV file in any audio player to verify it works

Learning milestones:

  1. Successfully parse WAV header → understand container structure concept
  2. Read and modify PCM samples → grasp raw vs. encoded data distinction
  3. Write valid WAV file → understand muxing at its simplest level

Real World Outcome

You’ll build a C program that opens any WAV file, displays its metadata in a formatted output, and can apply audio effects. When you run it, you’ll see:

$ ./wav_parser audio.wav

=== WAV File Information ===
File size: 5,292,044 bytes
Format: WAVE (RIFF container)

Audio Properties:
  Sample rate: 44100 Hz
  Channels: 2 (Stereo)
  Bit depth: 16 bits
  Duration: 30.02 seconds
  Data size: 5,291,856 bytes
  Samples per channel: 1,322,964

Calculated Bitrate: 1,411 kbps (uncompressed)

$ ./wav_effect volume audio.wav output.wav 1.5

Processing: audio.wav → output.wav
Effect: Volume boost (1.5x)
Progress: [████████████████████] 100%
Wrote 5,292,044 bytes to output.wav
Done! Play with: mpv output.wav

$ mpv output.wav
# Audio plays 50% louder than original

What you can verify:

  1. Open the output file in Audacity — waveform should show 1.5x amplitude
  2. Compare file sizes: ls -lh audio.wav output.wav (should be identical)
  3. Inspect with MediaInfo: mediainfo output.wav (all metadata preserved)
  4. Listen with any player — VLC, mpv, Windows Media Player all work

Effects you’ll implement:

  • Volume change: Multiply every sample by a gain factor
  • Reverse: Write samples in reverse order (plays backward)
  • Fade in/out: Apply gradual volume ramp
  • Channel swap: Exchange left/right stereo channels

The Core Question You’re Answering

“What IS a container format? How does metadata separate from payload, and why is WAV the simplest example of this pattern?”

Before you write any code, sit with this question. Most developers think “WAV is just an audio file,” but it’s actually a RIFF container that happens to hold PCM audio. Understanding this distinction—that the container format (RIFF/WAV) is separate from the audio encoding (PCM)—is the foundation for understanding MP4, MKV, and every multimedia format.

The insight: WAV has a header (metadata: sample rate, channels) followed by a data chunk (payload: raw samples). This two-part structure is the universal pattern in all multimedia containers.


Concepts You Must Understand First

Stop and research these before coding:

  1. Binary File I/O in C
    • How do you read bytes from a file with fread()?
    • What’s the difference between text mode ("r") and binary mode ("rb")?
    • How do you seek to a specific byte offset with fseek()?
    • Book Reference: C Programming: A Modern Approach Ch. 22 — K.N. King
  2. Structs and Memory Layout
    • Can you define a C struct that matches a binary format specification?
    • Do you understand struct padding and how to disable it (__attribute__((packed)))?
    • Why does sizeof(struct wav_header) sometimes give unexpected results?
    • Book Reference: Computer Systems: A Programmer’s Perspective Ch. 3.9 — Bryant & O’Hallaron
  3. Endianness (Byte Ordering)
    • Why does 0x12345678 appear as 78 56 34 12 in a hexdump?
    • What’s little-endian vs big-endian?
    • WAV uses little-endian—how do you handle this on big-endian machines?
    • Book Reference: Computer Systems: A Programmer’s Perspective Ch. 2.1 — Bryant & O’Hallaron
  4. Audio Fundamentals
    • What is sample rate? Why is 44100 Hz standard?
    • What is bit depth? What’s the range of a 16-bit signed sample?
    • How do stereo channels interleave? (L-R-L-R or L-L-L-R-R-R?)
    • Resource: Digital Audio Basics on Wikipedia

Questions to Guide Your Design

Before implementing, think through these:

  1. Header Parsing
    • Should you read the header into a struct or parse field-by-field?
    • How will you validate the RIFF signature (RIFF at offset 0)?
    • What happens if the file is truncated or corrupted?
  2. Sample Manipulation
    • How will you represent 16-bit samples in C? (int16_t? short?)
    • For volume change: What happens if you multiply a sample by 2.0 and it overflows?
    • For reverse: Will you read the entire file into memory or process in chunks?
  3. Output File Writing
    • Can you reuse the input header or must you recalculate fields?
    • How do you ensure the output is byte-for-byte identical (except data) to the input?
    • What tools will you use to verify the output WAV is valid?

Thinking Exercise

Exercise: Trace a WAV File Hex Dump

Before coding, manually parse a real WAV file with hexdump:

hexdump -C small_audio.wav | head -20

You’ll see output like:

00000000  52 49 46 46 24 08 00 00  57 41 56 45 66 6d 74 20  |RIFF$...WAVEfmt |
00000010  10 00 00 00 01 00 02 00  44 ac 00 00 10 b1 02 00  |........D.......|
00000020  04 00 10 00 64 61 74 61  00 08 00 00 ...          |....data........|

Questions while analyzing:

  • Where is the RIFF signature? (Offset 0x00: 52 49 46 46)
  • What’s the RIFF size field value? (Offset 0x04: 24 08 00 00 = 0x824 little-endian = 2084 bytes, i.e. file size minus 8)
  • Where is the fmt chunk? (Offset 0x0C: 66 6d 74 20)
  • What’s the sample rate? (Offset 0x18: 44 ac 00 00 = 0x0000AC44 little-endian = 44100 Hz)
  • Where does audio data start? (At offset 0x2C, immediately after the 8-byte data chunk header)

Do this exercise before coding. It builds intuition for binary parsing.
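The same fields can then be pulled out programmatically with a packed struct. This sketch parses the exact header bytes from the dump above; note that the fixed 44-byte layout is an assumption, since real files may carry extra chunks (LIST, fact) before data:

```c
#include <stdint.h>
#include <string.h>

/* Canonical 44-byte WAV layout: RIFF header, fmt chunk, data chunk header.
 * __attribute__((packed)) keeps the struct byte-for-byte identical to the
 * on-disk layout. Robust code walks chunks instead of assuming 44 bytes. */
struct wav_header {
    char     riff[4];        /* "RIFF" */
    uint32_t riff_size;      /* file size minus 8, little-endian */
    char     wave[4];        /* "WAVE" */
    char     fmt[4];         /* "fmt " */
    uint32_t fmt_size;       /* 16 for PCM */
    uint16_t audio_format;   /* 1 = PCM */
    uint16_t num_channels;
    uint32_t sample_rate;
    uint32_t byte_rate;      /* sample_rate * channels * bits / 8 */
    uint16_t block_align;
    uint16_t bits_per_sample;
    char     data[4];        /* "data" */
    uint32_t data_size;
} __attribute__((packed));

/* The header bytes from the hexdump exercise above. */
static const uint8_t example_header[44] = {
    'R','I','F','F', 0x24,0x08,0x00,0x00, 'W','A','V','E',
    'f','m','t',' ', 0x10,0x00,0x00,0x00, 0x01,0x00, 0x02,0x00,
    0x44,0xAC,0x00,0x00, 0x10,0xB1,0x02,0x00, 0x04,0x00, 0x10,0x00,
    'd','a','t','a', 0x00,0x08,0x00,0x00
};

/* With a real file this memcpy becomes fread(h, sizeof *h, 1, fp).
 * Fields read correctly as-is on little-endian hosts. */
void wav_parse_header(const uint8_t raw[44], struct wav_header *h) {
    memcpy(h, raw, 44);
}
```

Parsing example_header this way yields sample_rate 44100, num_channels 2, and data_size 2048, matching the hand analysis.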


The Interview Questions They’ll Ask

Prepare to answer these:

  1. “Explain the difference between a container format and a codec.” (Answer: Container is metadata + packaging (WAV/RIFF). Codec is compression algorithm (PCM is uncompressed, so it’s the simplest ‘codec’).)

  2. “Why is endianness important when parsing binary files?” (Answer: Different CPUs store multi-byte integers differently. WAV is little-endian; you must convert on big-endian systems.)

  3. “How would you detect if a WAV file is corrupted?” (Answer: Verify RIFF signature, check that chunk sizes match file length, validate format parameters (bit depth 8/16/24, sample rate > 0).)

  4. “What’s the difference between 16-bit and 24-bit audio?” (Answer: Dynamic range. 16-bit = 96 dB, 24-bit = 144 dB. More bits = quieter noise floor, larger file size.)

  5. “How do you prevent integer overflow when applying gain to audio samples?” (Answer: Clamp values to [-32768, 32767] for 16-bit, or use floating-point intermediate representation.)

  6. “Why is WAV uncompressed, and when would you use it over MP3?” (Answer: WAV is lossless and simple. Use for: audio editing, mastering, when file size doesn’t matter. MP3 is lossy but 10x smaller.)


Hints in Layers

Hint 1: Starting Point Read the WAV file format specification from Multimedia Wiki. Draw a diagram showing: RIFF header, fmt chunk, data chunk. Identify byte offsets for each field.

Hint 2: Parsing Strategy Define a struct for the RIFF header (12 bytes) and fmt chunk (16+ bytes). Use fread() to populate the struct. Print each field with printf() to verify parsing. Remember: WAV is little-endian, your CPU might not be!

Hint 3: Sample Manipulation Pseudocode

// Volume change with clamping (the conceptual flow, made concrete)
for (size_t i = 0; i < num_samples; i++) {
    float modified = (float)samples[i] * gain_factor;

    // Clamp to prevent 16-bit signed overflow
    if (modified > 32767.0f)  modified = 32767.0f;
    if (modified < -32768.0f) modified = -32768.0f;

    output[i] = (int16_t)modified;
}

Hint 4: Tools for Verification After writing a WAV file:

  • Use file output.wav (should say “RIFF … WAVE audio”)
  • Use mediainfo output.wav (displays all metadata)
  • Use ffprobe output.wav (FFmpeg’s inspector)
  • Compare hex dumps: diff <(hexdump -C input.wav | head -5) <(hexdump -C output.wav | head -5)

Books That Will Help

Topic | Book | Chapter
Binary file I/O in C | C Programming: A Modern Approach by K.N. King | Ch. 22: “Input/Output”
Struct layouts and packing | Computer Systems: A Programmer’s Perspective by Bryant & O’Hallaron | Ch. 3.9: “Heterogeneous Data Structures”
Endianness and byte order | Computer Systems: A Programmer’s Perspective by Bryant & O’Hallaron | Ch. 2.1: “Information Storage”
Audio signal processing basics | The Audio Programming Book by Richard Boulanger | Ch. 1: “Introduction to Digital Audio”

Common Pitfalls & Debugging

Problem 1: “Header parses correctly, but file won’t play”

  • Why: Data chunk offset calculation is wrong, or you forgot to copy the entire file
  • Fix: Use fseek(fp, 0, SEEK_SET) to rewind after reading header. Copy input → output byte-for-byte first, then modify samples.
  • Quick test: diff input.wav output.wav (should only differ in data section)

Problem 2: “Audio sounds distorted after volume boost”

  • Why: Integer overflow. 32767 * 1.5 = 49150 exceeds 16-bit signed range.
  • Fix: Implement clamping: if (sample > 32767) sample = 32767;
  • Quick test: Inspect waveform in Audacity—should be flat-topped at max amplitude, not wrap around

Problem 3: “sizeof(struct wav_header) is 48 instead of 44 bytes”

  • Why: Compiler added padding for alignment
  • Fix: Use __attribute__((packed)) in GCC: struct wav_header { ... } __attribute__((packed));
  • Quick test: printf("Header size: %zu\n", sizeof(struct wav_header)); (should be 44 or your calculated size)
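
For reference, here is a sketch of such a packed header for the canonical 44-byte PCM layout (the struct and field names are illustrative, not from any particular library):

```c
#include <stdint.h>

// Canonical 44-byte PCM WAV header; packed so sizeof() matches the
// on-disk layout instead of being padded by the compiler.
struct __attribute__((packed)) wav_header {
    char     riff[4];         // "RIFF"
    uint32_t file_size;       // total file size minus 8
    char     wave[4];         // "WAVE"
    char     fmt[4];          // "fmt "
    uint32_t fmt_size;        // 16 for PCM
    uint16_t audio_format;    // 1 = uncompressed PCM
    uint16_t num_channels;    // 1 = mono, 2 = stereo
    uint32_t sample_rate;     // e.g. 44100
    uint32_t byte_rate;       // sample_rate * num_channels * bits/8
    uint16_t block_align;     // num_channels * bits/8
    uint16_t bits_per_sample; // e.g. 16
    char     data[4];         // "data"
    uint32_t data_size;       // payload bytes that follow
};
```

With the packed attribute in place, the printf quick-test above should report 44.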

Problem 4: “Stereo audio plays as noise”

  • Why: Samples are interleaved (L-R-L-R), but you’re processing them as separate channels
  • Fix: Read pairs of samples: int16_t left = samples[i]; int16_t right = samples[i+1];
  • Quick test: Extract only left channel (every other sample) and play—should sound mono but clear

Problem 5: “Big-endian machine reads garbage values”

  • Why: WAV is little-endian, your CPU is big-endian (rare but happens on some ARM/MIPS systems)
  • Fix: Use byte-swapping macros: #include <byteswap.h> and value = bswap_32(value);
  • Quick test: Print sample rate—if it’s a huge number (e.g., 3,000,000,000), endianness is wrong
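
A portable alternative to conditional byte swapping is to read multi-byte fields byte-by-byte; this sketch (read_le32 is an assumed helper name) gives the correct value on any host:

```c
#include <stdint.h>

// Read a 32-bit little-endian value byte-by-byte; correct regardless of
// host byte order, so no bswap is needed even on big-endian CPUs.
uint32_t read_le32(const unsigned char *p) {
    return  (uint32_t)p[0]        | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}
```

For example, the bytes 44 AC 00 00 (as they appear in the file) decode to 44100 on every machine.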

Project 2: BMP Image Sequence to Raw Video Converter

  • File: VIDEO_AUDIO_MUXING_DEMUXING_PROJECTS.md
  • Programming Language: C
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 2. The “Micro-SaaS / Pro Tool”
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: Video / Data Processing
  • Software or Tool: YUV / FFmpeg
  • Main Book: “Computer Graphics from Scratch” by Gabriel Gambetta

What you’ll build: A tool that takes a folder of BMP images and creates an uncompressed raw video file (YUV4MPEG2 format), then uses FFmpeg CLI to encode it.

Why it teaches video fundamentals: Before understanding compressed video, you must understand raw video—frames as arrays of pixels, color spaces (RGB vs YUV), frame timing. YUV4MPEG2 (Y4M) is a simple header + raw frames format that FFmpeg can read.
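
As a sketch of how simple the format is, a minimal Y4M writer can be only a few lines (write_gray_y4m and its parameters are illustrative, not a required API):

```c
#include <stdio.h>

// Write a tiny grayscale Y4M clip: one ASCII header line, then
// "FRAME\n" + planar YUV 4:2:0 data (full-size Y plane, then
// quarter-size U and V planes) for each frame.
int write_gray_y4m(const char *path, int w, int h, int fps, int nframes) {
    FILE *f = fopen(path, "wb");
    if (!f) return -1;
    fprintf(f, "YUV4MPEG2 W%d H%d F%d:1 Ip A1:1 C420\n", w, h, fps);
    for (int i = 0; i < nframes; i++) {
        fprintf(f, "FRAME\n");
        for (int n = 0; n < w * h; n++)     fputc(128, f); // Y: mid gray
        for (int n = 0; n < w * h / 4; n++) fputc(128, f); // U: neutral
        for (int n = 0; n < w * h / 4; n++) fputc(128, f); // V: neutral
    }
    return fclose(f) == 0 ? 0 : -1;
}
```

Calling write_gray_y4m("output.y4m", 16, 16, 25, 50) produces a two-second gray clip that ffmpeg -i output.y4m should accept.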

Core challenges you’ll face:

  • Understanding RGB to YUV color space conversion (this is how video codecs think!)
  • Handling frame dimensions, aspect ratios, and pixel formats
  • Writing sequential frame data with proper timing metadata
  • Understanding planar vs. packed pixel formats
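
The first challenge above can be sketched with the common BT.601 integer approximation (these coefficients are one widely used convention; exact values depend on the standard and on full vs. studio range):

```c
#include <stdint.h>

// Integer approximation of BT.601 RGB -> studio-range YUV.
// Assumes arithmetic right shift for negative intermediates (GCC/Clang).
void rgb_to_yuv(uint8_t r, uint8_t g, uint8_t b,
                uint8_t *y, uint8_t *u, uint8_t *v) {
    *y = (uint8_t)((( 66 * r + 129 * g +  25 * b + 128) >> 8) +  16);
    *u = (uint8_t)(((-38 * r -  74 * g + 112 * b + 128) >> 8) + 128);
    *v = (uint8_t)(((112 * r -  94 * g -  18 * b + 128) >> 8) + 128);
}
```

Sanity checks: white (255,255,255) maps to Y=235 with neutral chroma (U=V=128), and black maps to Y=16—the studio-range extremes.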

Key Concepts:

  • Color spaces (RGB, YUV): “Computer Graphics from Scratch” by Gabriel Gambetta - Chapter on color
  • Image file formats: BMP specification is publicly available and simple
  • Video fundamentals: FFmpeg libav tutorial - Introduction section

Difficulty: Beginner-Intermediate Time estimate: Weekend - 1 week Prerequisites: C programming, basic understanding of images as pixel arrays

Real world outcome:

  • Feed your Y4M file to FFmpeg: ffmpeg -i output.y4m -c:v libx264 video.mp4
  • Play the resulting MP4 in VLC—you created a video from scratch!
  • Vary frame rate and see how it affects playback speed

Learning milestones:

  1. Successfully convert BMP to raw pixel data → understand frames as data arrays
  2. Write valid Y4M file → grasp raw video container concept
  3. Successfully encode with FFmpeg → see the compression ratio difference

Project 3: MPEG-TS Demuxer from Scratch

  • File: VIDEO_AUDIO_MUXING_DEMUXING_PROJECTS.md
  • Main Programming Language: C
  • Alternative Programming Languages: Rust, C++, Go
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: Level 1: The “Resume Gold”
  • Difficulty: Level 2: Intermediate (The Developer)
  • Knowledge Area: Video/Audio, Binary Parsing
  • Software or Tool: FFmpeg, MPEG-TS
  • Main Book: Digital Video and HD: Algorithms and Interfaces by Charles Poynton

What you’ll build: A C program that parses MPEG Transport Stream (.ts) files, extracts the packet structure, identifies streams (video/audio PIDs), and dumps elementary streams to separate files.

Why it teaches demuxing deeply: MPEG-TS is the format used for broadcast TV, streaming (HLS), and Blu-rays. It has a relatively simple packet structure (188-byte fixed packets) but teaches you about Program Association Tables (PAT), Program Map Tables (PMT), PIDs, and how multiple streams are interleaved. This is real demuxing.

Core challenges you’ll face:

  • Parsing fixed-size packet headers and sync bytes
  • Understanding PID (Packet Identifier) routing
  • Parsing PAT/PMT tables to discover stream types
  • Reconstructing elementary streams from fragmented packets
  • Handling adaptation fields and stuffing bytes
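
The fixed 4-byte header named in the first two challenges can be decoded with a few shifts and masks; a sketch under the ISO/IEC 13818-1 field layout (struct and field names here are my own):

```c
#include <stdint.h>

// Decode the 4-byte MPEG-TS packet header (ISO/IEC 13818-1).
// Every TS packet is 188 bytes and starts with sync byte 0x47.
typedef struct {
    int pid;                 // 13-bit packet identifier (which stream/table)
    int payload_unit_start;  // 1 = a new PES packet or PSI section begins here
    int adaptation_field;    // 2 bits: 01 payload only, 10 AF only, 11 both
    int continuity_counter;  // 4-bit counter, increments per packet of a PID
} ts_header;

int parse_ts_header(const uint8_t *pkt, ts_header *h) {
    if (pkt[0] != 0x47) return -1;  // lost sync
    h->payload_unit_start = (pkt[1] >> 6) & 1;
    h->pid                = ((pkt[1] & 0x1F) << 8) | pkt[2];
    h->adaptation_field   = (pkt[3] >> 4) & 3;
    h->continuity_counter = pkt[3] & 0x0F;
    return 0;
}
```

Your demuxer's main loop is then: read 188 bytes, parse this header, and route the payload by PID.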

Resources for key challenges:

  • FFmpeg mpegts.c source - Reference implementation to study
  • ISO/IEC 13818-1 specification (MPEG-2 Systems) - The actual standard
  • “Digital Video and HD: Algorithms and Interfaces” by Charles Poynton - Comprehensive reference

Key Concepts:

  • Binary protocol parsing: “Computer Systems: A Programmer’s Perspective” - Chapter 7 (Linking) for understanding structured binary
  • Transport streams: MPEG-TS specification overview on Wikipedia
  • Packet-based multiplexing: FFmpeg libav tutorial - Demuxing section

Difficulty: Intermediate Time estimate: 1-2 weeks Prerequisites: Comfortable with C, binary file parsing, bitwise operations

Real world outcome:

  • Run on any .ts file (record from TV, download HLS segment)
  • Print stream table: “PID 256: H.264 Video, PID 257: AAC Audio”
  • Extract raw H.264 stream to file, verify with ffprobe extracted.h264
  • Feed extracted stream back through FFmpeg to remux into MP4

Learning milestones:

  1. Parse packet headers, find sync bytes → understand transport layer
  2. Parse PAT/PMT, identify streams → understand stream discovery
  3. Extract complete elementary stream → understand demuxing fully

Project 4: Video Player Using libav (FFmpeg Libraries)

  • File: VIDEO_AUDIO_MUXING_DEMUXING_PROJECTS.md
  • Main Programming Language: C
  • Alternative Programming Languages: C++, Rust, Python
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: Level 2: The “Micro-SaaS / Pro Tool”
  • Difficulty: Level 2: Intermediate (The Developer)
  • Knowledge Area: Video/Audio, Multimedia
  • Software or Tool: FFmpeg, SDL2
  • Main Book: Video Demystified by Keith Jack

What you’ll build: A minimal video player in C that uses FFmpeg’s libavformat (demuxing), libavcodec (decoding), and SDL2 (display) to play video files with audio sync.

Why it teaches FFmpeg architecture: This is how real video players work. You’ll understand AVFormatContext, AVCodecContext, AVPacket, AVFrame—the core abstractions that power everything from VLC to YouTube’s backend. You’ll experience firsthand the demux → decode → render pipeline.

Core challenges you’ll face:

  • Opening containers and finding streams with libavformat
  • Setting up decoders with libavcodec
  • Converting pixel formats with libswscale
  • Audio/video synchronization using PTS values
  • Real-time playback timing
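
The PTS challenge boils down to one conversion: a timestamp only means something relative to its stream's time base. A tiny sketch of the arithmetic (mirroring what av_q2d(time_base) * pts computes in libav code):

```c
#include <stdint.h>

// Convert a stream timestamp into seconds using the stream's time base
// (a rational tb_num/tb_den, e.g. 1/90000 for MPEG's 90 kHz clock).
double pts_to_seconds(int64_t pts, int tb_num, int tb_den) {
    return (double)pts * tb_num / tb_den;
}
```

For example, a PTS of 90000 in a 1/90000 time base is exactly 1.0 second; comparing the video clock and audio clock in these units is the core of A/V sync.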

Key Concepts:

  • FFmpeg data structures: FFmpeg libav tutorial - “Learn FFmpeg libav the Hard Way”
  • A/V sync: “Video Demystified” by Keith Jack - Chapter on timing
  • SDL2 basics: SDL2 documentation and tutorials

Difficulty: Intermediate Time estimate: 2-3 weeks Prerequisites: Solid C, understanding of pointers and memory management, basic threading concepts

Real world outcome:

  • Play any video file (MP4, MKV, AVI) in your own player window
  • See frames render on screen with synchronized audio
  • Add features: seek, pause, volume control
  • Understand exactly what VLC does under the hood

Learning milestones:

  1. Open file, enumerate streams → understand libavformat
  2. Decode video frame, display it → understand libavcodec
  3. Sync audio and video playback → understand PTS/DTS timing
  4. Handle multiple container formats → appreciate format abstraction

Project 5: H.264 NAL Unit Parser

  • File: VIDEO_AUDIO_MUXING_DEMUXING_PROJECTS.md
  • Main Programming Language: C
  • Alternative Programming Languages: Rust, C++, Go
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: Level 1: The “Resume Gold”
  • Difficulty: Level 3: Advanced (The Engineer)
  • Knowledge Area: Video Codecs, Binary Parsing
  • Software or Tool: H.264, x264
  • Main Book: H.264 and MPEG-4 Video Compression by Iain Richardson

What you’ll build: A tool that parses H.264/AVC bitstreams, identifies NAL (Network Abstraction Layer) units, extracts SPS/PPS parameters, and reports frame types (I/P/B frames).

Why it teaches codec internals: Demuxing gets you packets, but those packets contain encoded bitstreams. H.264 organizes data into NAL units—understanding this layer bridges the gap between container and raw pixels. You’ll see I-frames (keyframes), understand why seeking jumps to keyframes, and grasp the concept of reference frames.

Core challenges you’ll face:

  • Finding NAL unit start codes (0x000001 or 0x00000001)
  • Parsing NAL unit headers (type, reference IDC)
  • Understanding SPS (Sequence Parameter Set) and PPS (Picture Parameter Set)
  • Exponential-Golomb coding for parsing syntax elements
  • Distinguishing slice types (I/P/B)
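
The Exp-Golomb challenge is small enough to sketch in full; this ue(v) decoder (the bit reader and all names are illustrative) is the workhorse for SPS/PPS parsing:

```c
#include <stdint.h>
#include <stddef.h>

// Minimal MSB-first bit reader plus ue(v) Exp-Golomb decoder, the
// variable-length code H.264 uses for most SPS/PPS syntax elements.
typedef struct { const uint8_t *buf; size_t bitpos; } bitreader;

static int get_bit(bitreader *br) {
    int b = (br->buf[br->bitpos >> 3] >> (7 - (br->bitpos & 7))) & 1;
    br->bitpos++;
    return b;
}

// ue(v): count leading zero bits, then read that many more bits.
// Codewords: 1 -> 0, 010 -> 1, 011 -> 2, 00100 -> 3, ...
uint32_t read_ue(bitreader *br) {
    int zeros = 0;
    while (get_bit(br) == 0) zeros++;
    uint32_t val = 1;
    for (int i = 0; i < zeros; i++)
        val = (val << 1) | (uint32_t)get_bit(br);
    return val - 1;
}
```

A signed se(v) decoder is a one-line mapping on top of ue(v), so this little routine unlocks most of the SPS fields, including the width/height your parser will report.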

Resources for key challenges:

  • Vcodex H.264 Overview - Excellent technical introduction
  • “H.264 and MPEG-4 Video Compression” by Iain Richardson - The definitive book
  • x264 source code - Study a real implementation

Key Concepts:

  • Bitstream parsing and variable-length codes: H.264 specification ITU-T H.264
  • Video compression fundamentals: “H.264 and MPEG-4 Video Compression” by Iain Richardson - Chapters 1-5
  • NAL unit structure: Vcodex H.264 Overview

Difficulty: Advanced Time estimate: 2-3 weeks Prerequisites: Strong C, comfortable with bit manipulation, understanding of video frames

Real world outcome:

  • Run on any H.264 file: ./h264parse video.h264
  • Output like:
    NAL Unit 0: SPS (width=1920, height=1080, profile=High)
    NAL Unit 1: PPS
    NAL Unit 2: IDR Slice (I-frame, keyframe)
    NAL Unit 3: Non-IDR Slice (P-frame, refs=1)
    ...
    
  • Understand why ffmpeg -i input.mp4 -c:v copy -f h264 output.h264 produces what it does

Learning milestones:

  1. Find and count NAL units → understand bitstream structure
  2. Parse SPS, extract resolution → understand parameter sets
  3. Identify frame types → understand I/P/B frame dependencies

Project Comparison Table

| Project | Difficulty | Time | Depth of Understanding | Fun Factor |
|---|---|---|---|---|
| WAV Parser/Writer | Beginner | Weekend | ⭐⭐ (container basics) | ⭐⭐⭐ (immediate audio feedback) |
| BMP to Raw Video | Beginner-Int | Weekend-1wk | ⭐⭐⭐ (video fundamentals) | ⭐⭐⭐⭐ (create videos!) |
| MPEG-TS Demuxer | Intermediate | 1-2 weeks | ⭐⭐⭐⭐ (real demuxing) | ⭐⭐⭐ (satisfying parsing) |
| libav Video Player | Intermediate | 2-3 weeks | ⭐⭐⭐⭐⭐ (full pipeline) | ⭐⭐⭐⭐⭐ (build a player!) |
| H.264 NAL Parser | Advanced | 2-3 weeks | ⭐⭐⭐⭐⭐ (codec internals) | ⭐⭐⭐ (deep but abstract) |

Recommendation

Based on the goal of understanding how FFmpeg works and learning low-level programming:

Start with: Project 1 (WAV Parser) → Takes a weekend, builds binary parsing confidence

Then: Project 2 (BMP to Raw Video) → Understand video fundamentals before compression

Main learning: Project 4 (libav Video Player) → This is where everything clicks. The ffmpeg-libav-tutorial by Leandro Moreira is exceptional and will guide you through building this.

Deep dive: Project 3 (MPEG-TS Demuxer) if you want to understand container internals, or Project 5 (H.264 Parser) if you want to understand codec internals.


Final Overall Project: Build a Media Transcoder CLI

What you’ll build: A complete command-line transcoder (like a mini-FFmpeg) that can:

  • Read any video file (MP4, MKV, AVI, TS)
  • Decode video and audio streams
  • Apply filters (resize, crop, audio gain)
  • Re-encode to different codecs (using libavcodec)
  • Mux into a different container format

Why this is the capstone: This project ties together everything:

  • Demuxing (libavformat)
  • Decoding (libavcodec)
  • Frame processing (libavutil, libswscale, libswresample)
  • Encoding (libavcodec)
  • Muxing (libavformat)

You’ll implement the exact pipeline that FFmpeg uses: input → demux → decode → filter → encode → mux → output

Core challenges you’ll face:

  • Managing multiple codecs simultaneously
  • Handling different timebase conversions
  • Memory management for frame buffers
  • Supporting various input/output format combinations
  • Implementing proper flush/drain on stream end
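
The timebase-conversion challenge reduces to rescaling integer timestamps between rational clocks. A naive sketch of what av_rescale_q does (FFmpeg's real version additionally guards against 64-bit overflow and supports rounding modes):

```c
#include <stdint.h>

// Rescale a timestamp from time base src_num/src_den to dst_num/dst_den,
// e.g. from MPEG's 1/90000 clock to milliseconds (1/1000).
// Naive version: can overflow for large ts; av_rescale_q avoids that.
int64_t rescale_ts(int64_t ts, int src_num, int src_den,
                   int dst_num, int dst_den) {
    return ts * src_num * dst_den / ((int64_t)src_den * dst_num);
}
```

Every packet crossing from the input stream's time base to the output stream's time base needs this conversion, or your muxed file will play at the wrong speed.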

Difficulty: Advanced Time estimate: 1 month+ Prerequisites: Completed Projects 1-4, solid understanding of FFmpeg libraries

Real world outcome:

  • Run: ./mytranscoder input.mkv -vcodec h264 -acodec aac -s 1280x720 output.mp4
  • Produce valid, playable output files
  • Understand exactly what ffmpeg -i input.mkv -c:v libx264 -c:a aac -s 1280x720 output.mp4 does internally
  • Be able to read FFmpeg source code and understand it

Learning milestones:

  1. Transmux (change container, copy codecs) → understand format independence
  2. Transcode video only → understand decode/encode cycle
  3. Add audio transcoding → understand multi-stream handling
  4. Add filters → understand frame processing pipeline
  5. Handle edge cases → production-quality understanding

Essential Resources

The single best resource for this entire learning journey:

📚 FFmpeg libav tutorial by Leandro Moreira - This GitHub repo walks you through everything from “hello world” to transcoding, with excellent explanations and working C code.


Summary

This learning path covers video/audio muxing and demuxing through six hands-on projects (five core projects plus a capstone). Here’s the complete list:

| # | Project Name | Main Language | Difficulty | Time Estimate |
|---|---|---|---|---|
| 1 | WAV Audio File Parser & Writer | C | Beginner | Weekend |
| 2 | BMP Image Sequence to Raw Video Converter | C | Beginner-Intermediate | Weekend-1 week |
| 3 | MPEG-TS Demuxer from Scratch | C | Intermediate | 1-2 weeks |
| 4 | Video Player Using libav (FFmpeg Libraries) | C | Intermediate | 2-3 weeks |
| 5 | H.264 NAL Unit Parser | C | Advanced | 2-3 weeks |
| Final | Media Transcoder CLI (Mini-FFmpeg) | C | Advanced | 1 month+ |

For beginners: Start with projects #1, #2, then #4 (skip #3 and #5 initially)

For intermediate developers: Jump to projects #2, #4, #5 (do #1 as a warm-up if needed)

For advanced/codec-focused: Focus on projects #5, #3, then Final Project

Expected Outcomes

After completing these projects, you will:

  • Understand the full multimedia pipeline: From raw pixels/samples → containers → codecs → bitstreams
  • Master FFmpeg’s libav architecture: AVFormatContext, AVCodecContext, AVPacket, AVFrame—the core abstractions powering VLC, OBS, and every video tool
  • Parse binary formats confidently: Read codec specifications (H.264, AAC) and implement parsers from scratch
  • Debug A/V sync issues: Understand PTS/DTS timestamps and how to fix desynchronization
  • Build production-quality tools: Create transcoders, video players, and demuxers comparable to professional implementations
  • Read multimedia source code: Understand FFmpeg, VLC, and industry codebases line-by-line
  • Answer interview questions: Confidently explain containers vs codecs, color spaces, compression algorithms, and streaming protocols

You’ll have built 6 working multimedia projects that demonstrate deep understanding of video/audio processing from first principles—knowledge that 99% of developers never acquire.


Sources: