Learn Video/Audio Muxing & Demuxing: From Zero to Multimedia Systems Master

Goal: Deeply understand how multimedia data flows through processing pipelines—from raw pixels and audio samples to compressed, synchronized streams in container formats. You’ll master the architecture behind FFmpeg, VLC, and every video processing tool, understanding not just how to use libav libraries, but why containers separate metadata from payloads, how codecs achieve compression, and what happens when you click “play” on a video. After completing these projects, you’ll be able to build your own transcoder, debug A/V sync issues, parse any multimedia format, and read codec specifications with confidence.


Why Video/Audio Muxing & Demuxing Matters

Every second, hundreds of millions of hours of video are streamed globally. The video streaming market reached $811.37 billion in 2025 and is projected to grow to $2.66 trillion by 2032 (Grand View Research). Behind every Netflix stream, YouTube video, and Zoom call lies a sophisticated pipeline of muxing, demuxing, encoding, and decoding.

The Foundation of Modern Media

In 2000, Fabrice Bellard created FFmpeg—a project that would become the most widely adopted multimedia framework in history. Today, FFmpeg powers:

  • Netflix, YouTube, Facebook: Server-side transcoding at massive scale
  • VLC, OBS Studio: Real-time playback and streaming applications
  • Browsers (Chrome, Firefox): HTML5 video via libav integration
  • Mars Perseverance Rover: Image compression before transmitting to Earth (Ant Media)

FFmpeg received unanimous first place in a streaming technology survey due to its “robust and universally-accepted open-source encoding and transcoding” capabilities (The Streaming Company). 79% of video industry developers use H.264/AVC as their primary codec (Uploadcare).

The Muxing/Demuxing Pipeline

Traditional Video Processing (High-Level)         What Actually Happens (Low-Level)
┌──────────────────────────┐                     ┌──────────────────────────┐
│  ffmpeg -i input.mp4     │                     │  1. DEMUX: Parse MP4     │
│    -c:v libx264          │                     │     container, extract   │
│    -c:a aac output.mkv   │                     │     H.264 + AAC packets  │
└──────────────────────────┘                     └────────┬─────────────────┘
                                                          │
"Just works!"                                    ┌────────▼─────────────────┐
                                                 │  2. DECODE: Decompress   │
                                                 │     H.264 NAL units to   │
                                                 │     raw YUV frames       │
                                                 └────────┬─────────────────┘
                                                          │
                                                 ┌────────▼─────────────────┐
                                                 │  3. PROCESS: Apply       │
                                                 │     filters, resize,     │
                                                 │     color correction     │
                                                 └────────┬─────────────────┘
                                                          │
                                                 ┌────────▼─────────────────┐
                                                 │  4. ENCODE: Compress     │
                                                 │     YUV back to H.264    │
                                                 │     with x264 library    │
                                                 └────────┬─────────────────┘
                                                          │
                                                 ┌────────▼─────────────────┐
                                                 │  5. MUX: Write Matroska  │
                                                 │     container with new   │
                                                 │     streams + metadata   │
                                                 └──────────────────────────┘

You understand: "It transcodes video"          You DEEPLY understand: Every byte's journey

Why This Knowledge Matters

For Careers:

  • Multimedia engineers earn $120k-$180k+ (streaming platforms, game engines, video conferencing)
  • Understanding codec internals is a rare, high-value skill (most developers only know how to call FFmpeg CLI)
  • Interview questions focus on A/V sync, container formats, bitstream parsing—concepts these projects teach

For Technical Depth:

  • Video processing is pure systems programming: binary parsing, performance optimization, concurrency
  • Teaches real-world engineering tradeoffs: compression ratio vs speed, quality vs bandwidth
  • Bridges theory and practice: codec specs become tangible through implementation

For Building Real Systems:

  • Custom transcoders for specific workflows (medical imaging, security cameras, drones)
  • Media servers handling millions of concurrent streams
  • Video analytics pipelines (computer vision requires raw frames from demuxing)

Prerequisites & Background Knowledge

Essential Prerequisites (Must Have)

Before starting these projects, you should have:

  1. Strong C programming skills
    • Comfortable with pointers, structs, memory management
    • Experience with binary file I/O (fopen, fread, fwrite)
    • Understanding of malloc/free and avoiding memory leaks
  2. Binary data fundamentals
    • Reading hex dumps and understanding byte layouts
    • Endianness (little-endian vs big-endian)
    • Bitwise operations (&, |, >>, <<)
  3. Basic command-line skills
    • Using FFmpeg CLI (e.g., ffmpeg -i input.mp4 -c:v copy output.mkv)
    • Playing video files in VLC or mpv
    • Inspecting files with hexdump -C or xxd

Helpful But Not Required (You’ll Learn During Projects)

  • Color space theory (RGB, YUV)
  • Compression algorithms (DCT, entropy coding)
  • Threading and concurrency
  • Audio DSP fundamentals

Self-Assessment Questions

Stop and verify you can answer these before starting:

  • C Programming: Can you write a struct parser that reads binary data from a file?
  • Endianness: Why does 0x1234 appear as 34 12 in a hex dump on x86?
  • File I/O: How do you seek to byte offset 1000 in a file?
  • Binary Basics: What’s the difference between reading bytes vs. reading text lines?
  • FFmpeg CLI: Can you extract audio from a video file using ffmpeg?

If you answered “no” to more than 2 questions, start with:

  • “C Programming: A Modern Approach” by K.N. King - Chapters 22-24 (I/O, pointers, memory)
  • “Computer Systems: A Programmer’s Perspective” by Bryant & O’Hallaron - Chapter 2 (binary data representation)
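The endianness self-check above can also be answered in code; this is a minimal probe, not part of any project:

```c
#include <stdint.h>
#include <string.h>

/* On a little-endian CPU the least significant byte is stored at the
 * lowest address, which is why 0x1234 dumps as "34 12" on x86. */
int is_little_endian(void) {
    uint16_t value = 0x1234;
    uint8_t first_byte;
    memcpy(&first_byte, &value, 1);   /* byte at the lowest address */
    return first_byte == 0x34;
}
```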

Development Environment Setup

Required Tools:

# FFmpeg libraries (Ubuntu/Debian)
sudo apt-get install libavcodec-dev libavformat-dev libavutil-dev libswscale-dev libswresample-dev

# For video player project
sudo apt-get install libsdl2-dev

# For analysis
sudo apt-get install ffmpeg mediainfo bsdmainutils   # bsdmainutils provides hexdump

macOS (via Homebrew):

brew install ffmpeg pkg-config sdl2

Recommended Tools:

  • VLC / mpv: For testing output files
  • MediaInfo: Inspect container metadata (mediainfo video.mp4)
  • xxd / hexdump: View raw file bytes
  • ffprobe: Analyze streams (ffprobe -show_streams input.mp4)

Time Investment Expectations

Project Level | Time Estimate | What You’ll Build
Beginner (Projects 1-2) | 1-2 weeks | Simple parsers, raw format converters
Intermediate (Projects 3-4) | 2-4 weeks | Real demuxers, video players
Advanced (Project 5+) | 3-6 weeks | Codec-level parsing, full transcoders

Total mastery: 3-6 months part-time (2-3 hours/day)

Important Reality Check

This is genuinely hard. You will:

  • Spend hours debugging why “everything looks right” but the video won’t play
  • Stare at hex dumps trying to find off-by-one errors
  • Deal with cryptic FFmpeg error messages
  • Read codec specifications that feel like legal documents

But it’s worth it. The moment you:

  • See your first self-decoded video frame render
  • Build a tool that processes video faster than FFmpeg
  • Debug an A/V sync issue in production and know exactly what’s wrong
  • Read VLC source code and understand every line

…you’ll have knowledge that 99% of developers don’t possess.

Core Concept Analysis

To truly understand video/audio muxing and demuxing, you need to grasp these fundamental building blocks:

The Container vs. Codec Distinction

This is the #1 source of confusion for beginners. A video file has two layers:

video.mp4 (the file you see)
│
├─ Container Format (MP4)        ← Metadata, timing, stream organization
│  ├─ Video Stream Metadata      ← Resolution, framerate, duration
│  ├─ Audio Stream Metadata      ← Sample rate, channels, duration
│  └─ Interleaved Packets        ← Actual data chunks with timestamps
│     ├─ Video Packet #1 (PTS=0.0s)
│     ├─ Audio Packet #1 (PTS=0.0s)
│     ├─ Video Packet #2 (PTS=0.033s)
│     └─ Audio Packet #2 (PTS=0.021s)
│
└─ Inside Each Packet:
   ├─ Video: H.264 Codec         ← Compressed video bitstream
   └─ Audio: AAC Codec           ← Compressed audio bitstream

Analogy: The container is like a ZIP file. The codecs are like JPEG or PNG inside the ZIP. FFmpeg demuxes the ZIP (unpacking the streams), then decodes the images (decompressing the data).

Key Concepts Explained

Concept | What It Means | Why It Matters
Container Format | The “box” that holds streams (MP4, MKV, AVI, TS) - stores metadata, timing, and interleaves data | Determines file compatibility, seeking behavior, streaming support
Codec | Algorithm that compresses/decompresses actual video/audio data (H.264, AAC, VP9) | Determines quality, file size, encoding speed
Muxing | Combining multiple streams (video, audio, subtitles) into a single container file | Creates final playable files with synchronized A/V
Demuxing | Extracting individual streams from a container file | First step in playback, transcoding, or analysis
Packets | Chunks of compressed data belonging to a stream | Basic unit of data transfer in multimedia pipelines
Frames | Decoded raw data (pixels for video, samples for audio) | What you actually see/hear after decompression
PTS/DTS | Presentation/Decode Timestamps - controls playback timing and sync | Prevents A/V desync, enables seeking
Bitstream | The raw encoded data format within a codec (NAL units for H.264) | Understanding this lets you parse codec internals

The Full Multimedia Data Hierarchy

┌─────────────────────────────────────────────────────────┐
│                    FILE (video.mp4)                     │  What you download
│  ┌───────────────────────────────────────────────────┐  │
│  │           CONTAINER (MP4 Format)                  │  │  Muxing/Demuxing layer
│  │  ┌─────────────────────────────────────────────┐  │  │
│  │  │  STREAM #0: Video (codec: H.264)            │  │  │
│  │  │    ┌──────────────────────────────────────┐ │  │  │
│  │  │    │  PACKET #0 (PTS=0.0s, 5KB)           │ │  │  │  Transport units
│  │  │    │    ┌──────────────────────────────┐  │ │  │  │
│  │  │    │    │  NAL UNIT: SPS (config)      │  │ │  │  │  Codec bitstream
│  │  │    │    │  NAL UNIT: IDR Slice (data)  │  │ │  │  │
│  │  │    │    └──────────────────────────────┘  │ │  │  │
│  │  │    │         ↓ DECODE                     │ │  │  │
│  │  │    │    ┌──────────────────────────────┐  │ │  │  │
│  │  │    │    │  FRAME: 1920x1080 YUV pixels │  │ │  │  │  Raw data
│  │  │    │    └──────────────────────────────┘  │ │  │  │
│  │  │    └──────────────────────────────────────┘ │  │  │
│  │  └─────────────────────────────────────────────┘  │  │
│  │  ┌─────────────────────────────────────────────┐  │  │
│  │  │  STREAM #1: Audio (codec: AAC)              │  │  │
│  │  │    ┌──────────────────────────────────────┐ │  │  │
│  │  │    │  PACKET #0 (PTS=0.0s, 512B)          │ │  │  │
│  │  │    │    ↓ DECODE                          │ │  │  │
│  │  │    │  1024 PCM samples (float32)          │ │  │  │
│  │  │    └──────────────────────────────────────┘ │  │  │
│  │  └─────────────────────────────────────────────┘  │  │
│  └───────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────┘

Container Formats: The Organizational Layer

Different containers optimize for different use cases:

Container | Extension | Use Case | Characteristics
MP4 | .mp4 | Streaming, mobile, web | Atom-based structure, fast seeking, MPEG-4 Part 14 standard
Matroska (MKV) | .mkv | High-quality archival | Flexible, EBML-based, unlimited tracks, chapter support
MPEG-TS | .ts | Broadcasting, live streaming | Fixed 188-byte packets, error resilience, HLS segments
AVI | .avi | Legacy Windows | Simple RIFF format, limited codec support
WebM | .webm | Web video | Subset of Matroska, VP8/VP9 + Opus, royalty-free

Key Insight: You can have H.264 video in MP4, MKV, TS, or AVI—same codec, different container. Transmuxing (changing container without re-encoding) is fast because you’re just repackaging packets.

Codec Landscape: The Compression Layer

Video Codecs:

Codec Evolution: Compression vs. Complexity

Old (1990s)         Efficient (2000s)       Modern (2010s)       Future (2020s)
┌─────────┐         ┌─────────┐            ┌─────────┐          ┌─────────┐
│ MPEG-1  │  ────▶  │ H.264   │  ────▶     │ H.265   │  ────▶   │  AV1    │
│ MPEG-2  │         │ (AVC)   │            │ (HEVC)  │          │  VVC    │
└─────────┘         └─────────┘            └─────────┘          └─────────┘
  Simple              79% market            50% better            Open, 30%
  Low compression     Universal support     compression          better than
  Fast decode         Hardware everywhere   Patent issues        HEVC

Audio Codecs:

Codec | Use Case | Characteristics
PCM (WAV) | Uncompressed | Raw samples, huge files, no quality loss
MP3 | Music, legacy | Lossy, widely supported, moderate compression
AAC | Streaming, mobile | Lossy, better than MP3, part of MPEG-4
Opus | VoIP, WebRTC | Lossy, best quality at low bitrates, open-source
FLAC | Archival, audiophiles | Lossless compression, 50-60% size reduction

PTS/DTS: The Synchronization System

Every packet has timestamps that control when to decode and present:

Video Stream (with B-frames):
Decode Order (DTS):  I₀  P₁  P₂  B₃  B₄  P₅
Display Order (PTS): I₀  B₃  B₄  P₁  P₂  P₅

Timeline:
DTS: [0ms] [33ms] [66ms] [100ms] [133ms] [166ms]
PTS: [0ms] [100ms] [133ms] [33ms] [66ms] [166ms]
          ↑                        ↑
          Decode I-frame first     Display later

Why DTS ≠ PTS? B-frames (bi-directional) reference future frames, so you must decode P₁ and P₂ before displaying B₃ and B₄.

A/V Sync: Audio PTS and video PTS must align. If video PTS = 1.5s and audio PTS = 1.2s, you have 300ms desync (noticeable drift).
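Containers store PTS as integer ticks in a per-stream time base rather than seconds. The drift calculation above can be sketched as follows; the time-base and PTS values in the usage note are hypothetical, chosen to reproduce the 1.5s/1.2s example, and the local timebase struct mirrors (but is not) libavutil's AVRational:

```c
#include <stdint.h>

/* Timestamps live in per-stream time bases, e.g. 1/90000 ticks for a
 * 90 kHz video clock, or 1/48000 when audio ticks equal the sample rate. */
typedef struct { int num, den; } timebase;

double pts_to_seconds(int64_t pts, timebase tb) {
    return (double)pts * tb.num / tb.den;
}

/* Positive result: video is ahead of audio, in milliseconds. */
double av_drift_ms(int64_t vpts, timebase vtb, int64_t apts, timebase atb) {
    return (pts_to_seconds(vpts, vtb) - pts_to_seconds(apts, atb)) * 1000.0;
}
```

With vpts = 135000 in 1/90000 ticks (1.5 s) and apts = 57600 in 1/48000 ticks (1.2 s), av_drift_ms returns roughly +300 ms, the noticeable drift described above.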

Bitstreams: The Codec’s Internal Language

Inside a video packet lies a bitstream—data laid out in the codec’s own standardized format:

H.264 Packet (Annex B Format):
┌──────────────────────────────────────────────────────┐
│ 00 00 00 01                                          │  Start code (NAL separator)
├──────────────────────────────────────────────────────┤
│ 67 64 00 1F AC D9 40 50 05 BB 01 10 00 00 03 ...    │  NAL Unit: SPS (sequence params)
├──────────────────────────────────────────────────────┤
│ 00 00 00 01                                          │  Start code
├──────────────────────────────────────────────────────┤
│ 68 EE 3C 80                                          │  NAL Unit: PPS (picture params)
├──────────────────────────────────────────────────────┤
│ 00 00 00 01                                          │  Start code
├──────────────────────────────────────────────────────┤
│ 65 88 84 00 33 FF ...                                │  NAL Unit: IDR Slice (actual frame data)
└──────────────────────────────────────────────────────┘

Understanding bitstreams lets you:

  • Parse codec configurations (resolution, profile, level)
  • Identify keyframes (I-frames) for seeking
  • Debug codec errors at the byte level
  • Optimize encoding parameters
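As a small illustration, the Annex B structure above can be probed with a few lines of C. This is a sketch, not a full parser; it only recognizes 4-byte start codes (the 3-byte 00 00 01 form also exists):

```c
#include <stddef.h>
#include <stdint.h>

/* H.264 Annex B separates NAL units with start codes. The low 5 bits of
 * the byte following a start code give the NAL unit type:
 * 7 = SPS, 8 = PPS, 5 = IDR slice. */
int nal_type_at(const uint8_t *buf, size_t len, size_t pos) {
    if (pos + 4 < len && buf[pos] == 0x00 && buf[pos + 1] == 0x00 &&
        buf[pos + 2] == 0x00 && buf[pos + 3] == 0x01)
        return buf[pos + 4] & 0x1F;
    return -1;  /* no 4-byte start code at this offset */
}
```

Applied to the bytes in the diagram: 0x67 & 0x1F = 7 (SPS), 0x68 & 0x1F = 8 (PPS), 0x65 & 0x1F = 5 (IDR slice).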

Concept Summary Table

Concept Cluster | What You Need to Internalize
Container vs. Codec | The container is packaging, the codec is compression. Same codec can live in different containers.
Muxing/Demuxing | Muxing = combine streams into a file. Demuxing = extract streams from a file. Both operate on packets, not pixels.
Packets vs. Frames | Packets are compressed chunks (what’s in the file). Frames are raw data (what you decode to).
PTS/DTS Timing | PTS = when to display. DTS = when to decode. They differ when frames are reordered (B-frames).
Color Spaces | RGB = how computers think. YUV = how video codecs think. Conversion is mandatory.
Bitstreams | The codec’s internal format. Understanding this lets you parse codec metadata and debug encoding issues.

Deep Dive Reading by Concept

This section maps each concept from above to specific book chapters for deeper understanding. Read these before or alongside the projects to build strong mental models.

Container Formats & File I/O

Concept | Book & Chapter
Binary file parsing in C | C Programming: A Modern Approach by K.N. King — Ch. 22: “Input/Output”
Struct layouts and alignment | Computer Systems: A Programmer’s Perspective by Bryant & O’Hallaron — Ch. 3.9: “Heterogeneous Data Structures”
Endianness handling | Computer Systems: A Programmer’s Perspective by Bryant & O’Hallaron — Ch. 2.1: “Information Storage”

Color Spaces & Image Representation

Concept | Book & Chapter
RGB and YUV color models | Computer Graphics from Scratch by Gabriel Gambetta — Ch. 2: “Basic Rendering”
Pixel formats (planar vs. packed) | Digital Video and HD by Charles Poynton — Ch. 3-4

Compression & Codecs

Concept | Book & Chapter
H.264 fundamentals and NAL units | H.264 and MPEG-4 Video Compression by Iain Richardson — Ch. 1-5
Video compression basics | Video Demystified by Keith Jack — Ch. 8-9

Multimedia Pipelines

Concept | Book & Chapter
FFmpeg libav architecture | FFmpeg libav tutorial — Complete walkthrough
Audio/Video synchronization | Video Demystified by Keith Jack — Ch. 11: “Timing and Synchronization”

Essential Reading Order

For maximum comprehension, read in this order:

  1. Foundation (Week 1):
    • C Programming: A Modern Approach Ch. 22 (binary file I/O)
    • Computer Systems: A Programmer’s Perspective Ch. 2 (data representation)
  2. Multimedia Basics (Week 2):
    • FFmpeg libav tutorial (pipeline and container overview)
  3. Codec Internals (Week 3-4):
    • H.264 and MPEG-4 Video Compression Ch. 1-5 (as you build Projects 3-5)

Quick Start Guide (First 48 Hours)

Feeling overwhelmed? Start here:

Day 1: Get Your Hands Dirty (4 hours)

  1. Install tools (30 minutes)
    # Ubuntu/Debian
    sudo apt-get install ffmpeg mediainfo bsdmainutils   # bsdmainutils provides hexdump
    
  2. Explore a real video file (1 hour)
    # Download a sample video
    wget http://commondatastorage.googleapis.com/gtv-videos-bucket/sample/BigBuckBunny.mp4
    
    # Inspect with mediainfo
    mediainfo BigBuckBunny.mp4
    
    # View container metadata
    ffprobe -show_format -show_streams BigBuckBunny.mp4
    
    # Look at raw bytes
    hexdump -C BigBuckBunny.mp4 | head -50
    
  3. Experiment with FFmpeg (1 hour)
    # Extract video stream only
    ffmpeg -i BigBuckBunny.mp4 -c:v copy -an video_only.mp4
    
    # Extract audio stream only
    ffmpeg -i BigBuckBunny.mp4 -c:a copy -vn audio_only.m4a
    
    # Transmux (change container, don't re-encode)
    ffmpeg -i BigBuckBunny.mp4 -c copy output.mkv
    
    # Re-encode video
    ffmpeg -i BigBuckBunny.mp4 -c:v libx264 -crf 28 smaller.mp4
    
  4. Start Project 1 (1.5 hours)
    • Download a WAV file
    • Read the binary header with C
    • Print sample rate, channels, bit depth

Day 2: Build Something (4 hours)

  1. Finish Project 1 (2 hours)
    • Modify PCM samples (volume change)
    • Write a valid WAV file
  2. Read about containers (1 hour)
  3. Plan your learning path (1 hour)
    • Decide: deep-dive containers (Project 3) or full pipeline (Project 4)?
    • Schedule 2-3 hours/day for next 2 weeks

Path 1: “I Want to Build Production Tools” (Practical)

Goal: Build production-quality tools using libav APIs

Projects: 1 → 2 → 4 → Final Project
Timeline: 6-8 weeks part-time
Best for: Backend engineers, video platform developers

Why this path: Project 4 (Video Player) forces you to learn the entire FFmpeg architecture. You’ll use libavformat, libavcodec, libswscale—the same APIs used by VLC, OBS, and every professional tool.

Path 2: “I Want to Understand Codec Internals” (Advanced)

Goal: Parse bitstreams, understand compression algorithms

Projects: 1 → 2 → 5 → 3 → Final Project
Timeline: 8-12 weeks part-time
Best for: Codec developers, video compression researchers

Why this path: Project 5 (H.264 NAL Parser) teaches you how data is compressed. You’ll read codec specifications and implement parsers from scratch.

Path 3: “I’m a Beginner, Build My Confidence” (Gentle)

Goal: Learn multimedia fundamentals without overwhelming complexity

Projects: 1 → 2 → (stop and practice) → 4
Timeline: 4-6 weeks part-time
Best for: Students, junior developers

Why this path: Projects 1-2 are approachable and give immediate feedback. Take time to internalize concepts before tackling libav APIs.


Project 1: WAV Audio File Parser & Writer

  • File: VIDEO_AUDIO_MUXING_DEMUXING_PROJECTS.md
  • Programming Language: C
  • Coolness Level: Level 2: Practical but Forgettable
  • Business Potential: 2. The “Micro-SaaS / Pro Tool”
  • Difficulty: Level 1: Beginner
  • Knowledge Area: Multimedia / File Formats
  • Software or Tool: WAV Format
  • Main Book: “The Audio Programming Book” by Richard Boulanger

What you’ll build: A C program that reads WAV files, displays header information, manipulates raw PCM audio data, and writes modified WAV files.

Why it teaches multimedia fundamentals: WAV is the “hello world” of container formats—it has a simple, well-documented structure with a header followed by raw audio samples. You’ll learn binary file parsing, endianness handling, and the fundamental concept of separating metadata (container) from payload (audio data).

Core challenges you’ll face:

  • Parsing binary structures (maps to understanding container headers)
  • Handling different sample formats (8-bit, 16-bit, 32-bit float)
  • Understanding sample rate, channels, and bit depth relationships
  • Writing properly formatted binary files

Key Concepts:

  • Binary file I/O in C: “C Programming: A Modern Approach” by K.N. King - Chapter 22 (Input/Output)
  • Endianness and byte ordering: “Computer Systems: A Programmer’s Perspective” by Bryant & O’Hallaron - Chapter 2
  • Audio fundamentals (sample rate, bit depth): “Digital Audio Fundamentals” article on Wikipedia

Difficulty: Beginner
Time estimate: Weekend
Prerequisites: Basic C programming, understanding of binary/hex

Real world outcome:

  • Run your program on any WAV file and see: sample rate, channels, duration, bit depth printed to console
  • Apply effects like volume change, reverse audio, or fade in/out
  • Play your modified WAV file in any audio player to verify it works

Learning milestones:

  1. Successfully parse WAV header → understand container structure concept
  2. Read and modify PCM samples → grasp raw vs. encoded data distinction
  3. Write valid WAV file → understand muxing at its simplest level

Real World Outcome

You’ll build a C program that opens any WAV file, displays its metadata in a formatted output, and can apply audio effects. When you run it, you’ll see:

$ ./wav_parser audio.wav

=== WAV File Information ===
File size: 5,292,044 bytes
Format: WAVE (RIFF container)

Audio Properties:
  Sample rate: 44100 Hz
  Channels: 2 (Stereo)
  Bit depth: 16 bits
  Duration: 30.02 seconds
  Data size: 5,291,856 bytes
  Samples per channel: 1,322,964

Calculated Bitrate: 1,411 kbps (uncompressed)

$ ./wav_effect volume audio.wav output.wav 1.5

Processing: audio.wav → output.wav
Effect: Volume boost (1.5x)
Progress: [████████████████████] 100%
Wrote 5,292,044 bytes to output.wav
Done! Play with: mpv output.wav

$ mpv output.wav
# Audio plays 50% louder than original

What you can verify:

  1. Open the output file in Audacity — waveform should show 1.5x amplitude
  2. Compare file sizes: ls -lh audio.wav output.wav (should be identical)
  3. Inspect with MediaInfo: mediainfo output.wav (all metadata preserved)
  4. Listen with any player — VLC, mpv, Windows Media Player all work

Effects you’ll implement:

  • Volume change: Multiply every sample by a gain factor
  • Reverse: Write samples in reverse order (plays backward)
  • Fade in/out: Apply gradual volume ramp
  • Channel swap: Exchange left/right stereo channels

The Core Question You’re Answering

“What IS a container format? How does metadata separate from payload, and why is WAV the simplest example of this pattern?”

Before you write any code, sit with this question. Most developers think “WAV is just an audio file,” but it’s actually a RIFF container that happens to hold PCM audio. Understanding this distinction—that the container format (RIFF/WAV) is separate from the audio encoding (PCM)—is the foundation for understanding MP4, MKV, and every multimedia format.

The insight: WAV has a header (metadata: sample rate, channels) followed by a data chunk (payload: raw samples). This two-part structure is the universal pattern in all multimedia containers.


Concepts You Must Understand First

Stop and research these before coding:

  1. Binary File I/O in C
    • How do you read bytes from a file with fread()?
    • What’s the difference between text mode ("r") and binary mode ("rb")?
    • How do you seek to a specific byte offset with fseek()?
    • Book Reference: C Programming: A Modern Approach Ch. 22 — K.N. King
  2. Structs and Memory Layout
    • Can you define a C struct that matches a binary format specification?
    • Do you understand struct padding and how to disable it (__attribute__((packed)))?
    • Why does sizeof(struct wav_header) sometimes give unexpected results?
    • Book Reference: Computer Systems: A Programmer’s Perspective Ch. 3.9 — Bryant & O’Hallaron
  3. Endianness (Byte Ordering)
    • Why does 0x12345678 appear as 78 56 34 12 in a hexdump?
    • What’s little-endian vs big-endian?
    • WAV uses little-endian—how do you handle this on big-endian machines?
    • Book Reference: Computer Systems: A Programmer’s Perspective Ch. 2.1 — Bryant & O’Hallaron
  4. Audio Fundamentals
    • What is sample rate? Why is 44100 Hz standard?
    • What is bit depth? What’s the range of a 16-bit signed sample?
    • How do stereo channels interleave? (L-R-L-R or L-L-L-R-R-R?)
    • Resource: Digital Audio Basics on Wikipedia

Questions to Guide Your Design

Before implementing, think through these:

  1. Header Parsing
    • Should you read the header into a struct or parse field-by-field?
    • How will you validate the RIFF signature (RIFF at offset 0)?
    • What happens if the file is truncated or corrupted?
  2. Sample Manipulation
    • How will you represent 16-bit samples in C? (int16_t? short?)
    • For volume change: What happens if you multiply a sample by 2.0 and it overflows?
    • For reverse: Will you read the entire file into memory or process in chunks?
  3. Output File Writing
    • Can you reuse the input header or must you recalculate fields?
    • How do you ensure the output is byte-for-byte identical (except data) to the input?
    • What tools will you use to verify the output WAV is valid?

Thinking Exercise

Exercise: Trace a WAV File Hex Dump

Before coding, manually parse a real WAV file with hexdump:

hexdump -C small_audio.wav | head -20

You’ll see output like:

00000000  52 49 46 46 24 08 00 00  57 41 56 45 66 6d 74 20  |RIFF$...WAVEfmt |
00000010  10 00 00 00 01 00 02 00  44 ac 00 00 10 b1 02 00  |........D.......|
00000020  04 00 10 00 64 61 74 61  00 08 00 00 ...          |....data........|

Questions while analyzing:

  • Where is the RIFF signature? (Offset 0x00: 52 49 46 46)
  • What’s the RIFF size field value? (Offset 0x04: 24 08 00 00 = 0x824 little-endian = 2084 bytes, i.e. file size minus 8)
  • Where is the fmt chunk? (Offset 0x0C: 66 6d 74 20)
  • What’s the sample rate? (Offset 0x18: 44 ac 00 00 = 0x0000AC44 little-endian = 44100 Hz)
  • Where does audio data start? (At offset 0x2C, immediately after the 8-byte data chunk header)

Do this exercise before coding. It builds intuition for binary parsing.
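The same fields can then be pulled out programmatically with a packed struct. This sketch parses the exact header bytes from the dump above; note that the fixed 44-byte layout is an assumption, since real files may carry extra chunks (LIST, fact) before data:

```c
#include <stdint.h>
#include <string.h>

/* Canonical 44-byte WAV layout: RIFF header, fmt chunk, data chunk header.
 * __attribute__((packed)) keeps the struct byte-for-byte identical to the
 * on-disk layout. Robust code walks chunks instead of assuming 44 bytes. */
struct wav_header {
    char     riff[4];        /* "RIFF" */
    uint32_t riff_size;      /* file size minus 8, little-endian */
    char     wave[4];        /* "WAVE" */
    char     fmt[4];         /* "fmt " */
    uint32_t fmt_size;       /* 16 for PCM */
    uint16_t audio_format;   /* 1 = PCM */
    uint16_t num_channels;
    uint32_t sample_rate;
    uint32_t byte_rate;      /* sample_rate * channels * bits / 8 */
    uint16_t block_align;
    uint16_t bits_per_sample;
    char     data[4];        /* "data" */
    uint32_t data_size;
} __attribute__((packed));

/* The header bytes from the hexdump exercise above. */
static const uint8_t example_header[44] = {
    'R','I','F','F', 0x24,0x08,0x00,0x00, 'W','A','V','E',
    'f','m','t',' ', 0x10,0x00,0x00,0x00, 0x01,0x00, 0x02,0x00,
    0x44,0xAC,0x00,0x00, 0x10,0xB1,0x02,0x00, 0x04,0x00, 0x10,0x00,
    'd','a','t','a', 0x00,0x08,0x00,0x00
};

/* With a real file this memcpy becomes fread(h, sizeof *h, 1, fp).
 * Fields read correctly as-is on little-endian hosts. */
void wav_parse_header(const uint8_t raw[44], struct wav_header *h) {
    memcpy(h, raw, 44);
}
```

Parsing example_header this way yields sample_rate 44100, num_channels 2, and data_size 2048, matching the hand analysis.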


The Interview Questions They’ll Ask

Prepare to answer these:

  1. “Explain the difference between a container format and a codec.” (Answer: Container is metadata + packaging (WAV/RIFF). Codec is compression algorithm (PCM is uncompressed, so it’s the simplest ‘codec’).)

  2. “Why is endianness important when parsing binary files?” (Answer: Different CPUs store multi-byte integers differently. WAV is little-endian; you must convert on big-endian systems.)

  3. “How would you detect if a WAV file is corrupted?” (Answer: Verify RIFF signature, check that chunk sizes match file length, validate format parameters (bit depth 8/16/24, sample rate > 0).)

  4. “What’s the difference between 16-bit and 24-bit audio?” (Answer: Dynamic range. 16-bit = 96 dB, 24-bit = 144 dB. More bits = quieter noise floor, larger file size.)

  5. “How do you prevent integer overflow when applying gain to audio samples?” (Answer: Clamp values to [-32768, 32767] for 16-bit, or use floating-point intermediate representation.)

  6. “Why is WAV uncompressed, and when would you use it over MP3?” (Answer: WAV is lossless and simple. Use for: audio editing, mastering, when file size doesn’t matter. MP3 is lossy but 10x smaller.)


Hints in Layers

Hint 1: Starting Point Read the WAV file format specification from Multimedia Wiki. Draw a diagram showing: RIFF header, fmt chunk, data chunk. Identify byte offsets for each field.

Hint 2: Parsing Strategy Define a struct for the RIFF header (12 bytes) and fmt chunk (16+ bytes). Use fread() to populate the struct. Print each field with printf() to verify parsing. Remember: WAV is little-endian, your CPU might not be!

Hint 3: Sample Manipulation Pseudocode

// Volume change with clamping (the conceptual flow, made concrete)
for (size_t i = 0; i < num_samples; i++) {
    float modified = (float)samples[i] * gain_factor;

    // Clamp to prevent 16-bit signed overflow
    if (modified > 32767.0f)  modified = 32767.0f;
    if (modified < -32768.0f) modified = -32768.0f;

    output[i] = (int16_t)modified;
}

Hint 4: Tools for Verification After writing a WAV file:

  • Use file output.wav (should say “RIFF … WAVE audio”)
  • Use mediainfo output.wav (displays all metadata)
  • Use ffprobe output.wav (FFmpeg’s inspector)
  • Compare hex dumps: diff <(hexdump -C input.wav | head -5) <(hexdump -C output.wav | head -5)

Books That Will Help

Topic | Book | Chapter
Binary file I/O in C | C Programming: A Modern Approach by K.N. King | Ch. 22: “Input/Output”
Struct layouts and packing | Computer Systems: A Programmer’s Perspective by Bryant & O’Hallaron | Ch. 3.9: “Heterogeneous Data Structures”
Endianness and byte order | Computer Systems: A Programmer’s Perspective by Bryant & O’Hallaron | Ch. 2.1: “Information Storage”
Audio signal processing basics | The Audio Programming Book by Richard Boulanger | Ch. 1: “Introduction to Digital Audio”

Common Pitfalls & Debugging

Problem 1: “Header parses correctly, but file won’t play”

  • Why: Data chunk offset calculation is wrong, or you forgot to copy the entire file
  • Fix: Use fseek(fp, 0, SEEK_SET) to rewind after reading header. Copy input → output byte-for-byte first, then modify samples.
  • Quick test: diff input.wav output.wav (should only differ in data section)

Problem 2: “Audio sounds distorted after volume boost”

  • Why: Integer overflow. 32767 * 1.5 = 49150 exceeds 16-bit signed range.
  • Fix: Implement clamping: if (sample > 32767) sample = 32767;
  • Quick test: Inspect waveform in Audacity—should be flat-topped at max amplitude, not wrap around

Problem 3: “sizeof(struct wav_header) is 48 instead of 44 bytes”

  • Why: Compiler added padding for alignment
  • Fix: Use __attribute__((packed)) in GCC: struct wav_header { ... } __attribute__((packed));
  • Quick test: printf("Header size: %zu\n", sizeof(struct wav_header)); (should be 44 or your calculated size)
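
For reference, here is a sketch of such a packed header for the canonical 44-byte PCM layout (the struct and field names are illustrative, not from any particular library):

```c
#include <stdint.h>

// Canonical 44-byte PCM WAV header; packed so sizeof() matches the
// on-disk layout instead of being padded by the compiler.
struct __attribute__((packed)) wav_header {
    char     riff[4];         // "RIFF"
    uint32_t file_size;       // total file size minus 8
    char     wave[4];         // "WAVE"
    char     fmt[4];          // "fmt "
    uint32_t fmt_size;        // 16 for PCM
    uint16_t audio_format;    // 1 = uncompressed PCM
    uint16_t num_channels;    // 1 = mono, 2 = stereo
    uint32_t sample_rate;     // e.g. 44100
    uint32_t byte_rate;       // sample_rate * num_channels * bits/8
    uint16_t block_align;     // num_channels * bits/8
    uint16_t bits_per_sample; // e.g. 16
    char     data[4];         // "data"
    uint32_t data_size;       // payload bytes that follow
};
```

With the packed attribute in place, the printf quick-test above should report 44.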

Problem 4: “Stereo audio plays as noise”

  • Why: Samples are interleaved (L-R-L-R), but you’re processing them as separate channels
  • Fix: Read pairs of samples: int16_t left = samples[i]; int16_t right = samples[i+1];
  • Quick test: Extract only left channel (every other sample) and play—should sound mono but clear

Problem 5: “Big-endian machine reads garbage values”

  • Why: WAV is little-endian, your CPU is big-endian (rare but happens on some ARM/MIPS systems)
  • Fix: Use byte-swapping macros: #include <byteswap.h> and value = bswap_32(value);
  • Quick test: Print sample rate—if it’s a huge number (e.g., 3,000,000,000), endianness is wrong
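
A portable alternative to conditional byte swapping is to read multi-byte fields byte-by-byte; this sketch (read_le32 is an assumed helper name) gives the correct value on any host:

```c
#include <stdint.h>

// Read a 32-bit little-endian value byte-by-byte; correct regardless of
// host byte order, so no bswap is needed even on big-endian CPUs.
uint32_t read_le32(const unsigned char *p) {
    return  (uint32_t)p[0]        | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}
```

For example, the bytes 44 AC 00 00 (as they appear in the file) decode to 44100 on every machine.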

Project 2: BMP Image Sequence to Raw Video Converter

  • File: VIDEO_AUDIO_MUXING_DEMUXING_PROJECTS.md
  • Programming Language: C
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 2. The “Micro-SaaS / Pro Tool”
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: Video / Data Processing
  • Software or Tool: YUV / FFmpeg
  • Main Book: “Computer Graphics from Scratch” by Gabriel Gambetta

What you’ll build: A tool that takes a folder of BMP images and creates an uncompressed raw video file (YUV4MPEG2 format), then uses FFmpeg CLI to encode it.

Why it teaches video fundamentals: Before understanding compressed video, you must understand raw video—frames as arrays of pixels, color spaces (RGB vs YUV), frame timing. YUV4MPEG2 (Y4M) is a simple header + raw frames format that FFmpeg can read.
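
As a sketch of how simple the format is, a minimal Y4M writer can be only a few lines (write_gray_y4m and its parameters are illustrative, not a required API):

```c
#include <stdio.h>

// Write a tiny grayscale Y4M clip: one ASCII header line, then
// "FRAME\n" + planar YUV 4:2:0 data (full-size Y plane, then
// quarter-size U and V planes) for each frame.
int write_gray_y4m(const char *path, int w, int h, int fps, int nframes) {
    FILE *f = fopen(path, "wb");
    if (!f) return -1;
    fprintf(f, "YUV4MPEG2 W%d H%d F%d:1 Ip A1:1 C420\n", w, h, fps);
    for (int i = 0; i < nframes; i++) {
        fprintf(f, "FRAME\n");
        for (int n = 0; n < w * h; n++)     fputc(128, f); // Y: mid gray
        for (int n = 0; n < w * h / 4; n++) fputc(128, f); // U: neutral
        for (int n = 0; n < w * h / 4; n++) fputc(128, f); // V: neutral
    }
    return fclose(f) == 0 ? 0 : -1;
}
```

Calling write_gray_y4m("output.y4m", 16, 16, 25, 50) produces a two-second gray clip that ffmpeg -i output.y4m should accept.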

Core challenges you’ll face:

  • Understanding RGB to YUV color space conversion (this is how video codecs think!)
  • Handling frame dimensions, aspect ratios, and pixel formats
  • Writing sequential frame data with proper timing metadata
  • Understanding planar vs. packed pixel formats
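
The first challenge above can be sketched with the common BT.601 integer approximation (these coefficients are one widely used convention; exact values depend on the standard and on full vs. studio range):

```c
#include <stdint.h>

// Integer approximation of BT.601 RGB -> studio-range YUV.
// Assumes arithmetic right shift for negative intermediates (GCC/Clang).
void rgb_to_yuv(uint8_t r, uint8_t g, uint8_t b,
                uint8_t *y, uint8_t *u, uint8_t *v) {
    *y = (uint8_t)((( 66 * r + 129 * g +  25 * b + 128) >> 8) +  16);
    *u = (uint8_t)(((-38 * r -  74 * g + 112 * b + 128) >> 8) + 128);
    *v = (uint8_t)(((112 * r -  94 * g -  18 * b + 128) >> 8) + 128);
}
```

Sanity checks: white (255,255,255) maps to Y=235 with neutral chroma (U=V=128), and black maps to Y=16—the studio-range extremes.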

Key Concepts:

  • Color spaces (RGB, YUV): “Computer Graphics from Scratch” by Gabriel Gambetta - Chapter on color
  • Image file formats: BMP specification is publicly available and simple
  • Video fundamentals: FFmpeg libav tutorial - Introduction section

Difficulty: Beginner-Intermediate Time estimate: Weekend - 1 week Prerequisites: C programming, basic understanding of images as pixel arrays

Real world outcome:

  • Feed your Y4M file to FFmpeg: ffmpeg -i output.y4m -c:v libx264 video.mp4
  • Play the resulting MP4 in VLC—you created a video from scratch!
  • Vary frame rate and see how it affects playback speed

Learning milestones:

  1. Successfully convert BMP to raw pixel data → understand frames as data arrays
  2. Write valid Y4M file → grasp raw video container concept
  3. Successfully encode with FFmpeg → see the compression ratio difference

Project 3: MPEG-TS Demuxer from Scratch

  • File: VIDEO_AUDIO_MUXING_DEMUXING_PROJECTS.md
  • Main Programming Language: C
  • Alternative Programming Languages: Rust, C++, Go
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: Level 1: The “Resume Gold”
  • Difficulty: Level 2: Intermediate (The Developer)
  • Knowledge Area: Video/Audio, Binary Parsing
  • Software or Tool: FFmpeg, MPEG-TS
  • Main Book: Digital Video and HD: Algorithms and Interfaces by Charles Poynton

What you’ll build: A C program that parses MPEG Transport Stream (.ts) files, extracts the packet structure, identifies streams (video/audio PIDs), and dumps elementary streams to separate files.

Why it teaches demuxing deeply: MPEG-TS is the format used for broadcast TV, streaming (HLS), and Blu-rays. It has a relatively simple packet structure (188-byte fixed packets) but teaches you about Program Association Tables (PAT), Program Map Tables (PMT), PIDs, and how multiple streams are interleaved. This is real demuxing.

Core challenges you’ll face:

  • Parsing fixed-size packet headers and sync bytes
  • Understanding PID (Packet Identifier) routing
  • Parsing PAT/PMT tables to discover stream types
  • Reconstructing elementary streams from fragmented packets
  • Handling adaptation fields and stuffing bytes
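
The fixed 4-byte header named in the first two challenges can be decoded with a few shifts and masks; a sketch under the ISO/IEC 13818-1 field layout (struct and field names here are my own):

```c
#include <stdint.h>

// Decode the 4-byte MPEG-TS packet header (ISO/IEC 13818-1).
// Every TS packet is 188 bytes and starts with sync byte 0x47.
typedef struct {
    int pid;                 // 13-bit packet identifier (which stream/table)
    int payload_unit_start;  // 1 = a new PES packet or PSI section begins here
    int adaptation_field;    // 2 bits: 01 payload only, 10 AF only, 11 both
    int continuity_counter;  // 4-bit counter, increments per packet of a PID
} ts_header;

int parse_ts_header(const uint8_t *pkt, ts_header *h) {
    if (pkt[0] != 0x47) return -1;  // lost sync
    h->payload_unit_start = (pkt[1] >> 6) & 1;
    h->pid                = ((pkt[1] & 0x1F) << 8) | pkt[2];
    h->adaptation_field   = (pkt[3] >> 4) & 3;
    h->continuity_counter = pkt[3] & 0x0F;
    return 0;
}
```

Your demuxer's main loop is then: read 188 bytes, parse this header, and route the payload by PID.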

Resources for key challenges:

  • FFmpeg mpegts.c source - Reference implementation to study
  • ISO/IEC 13818-1 specification (MPEG-2 Systems) - The actual standard
  • “Digital Video and HD: Algorithms and Interfaces” by Charles Poynton - Comprehensive reference

Key Concepts:

  • Binary protocol parsing: “Computer Systems: A Programmer’s Perspective” - Chapter 7 (Linking) for understanding structured binary
  • Transport streams: MPEG-TS specification overview on Wikipedia
  • Packet-based multiplexing: FFmpeg libav tutorial - Demuxing section

Difficulty: Intermediate Time estimate: 1-2 weeks Prerequisites: Comfortable with C, binary file parsing, bitwise operations

Real world outcome:

  • Run on any .ts file (record from TV, download HLS segment)
  • Print stream table: “PID 256: H.264 Video, PID 257: AAC Audio”
  • Extract raw H.264 stream to file, verify with ffprobe extracted.h264
  • Feed extracted stream back through FFmpeg to remux into MP4

Learning milestones:

  1. Parse packet headers, find sync bytes → understand transport layer
  2. Parse PAT/PMT, identify streams → understand stream discovery
  3. Extract complete elementary stream → understand demuxing fully

Project 4: Video Player Using libav (FFmpeg Libraries)

  • File: VIDEO_AUDIO_MUXING_DEMUXING_PROJECTS.md
  • Main Programming Language: C
  • Alternative Programming Languages: C++, Rust, Python
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: Level 2: The “Micro-SaaS / Pro Tool”
  • Difficulty: Level 2: Intermediate (The Developer)
  • Knowledge Area: Video/Audio, Multimedia
  • Software or Tool: FFmpeg, SDL2
  • Main Book: Video Demystified by Keith Jack

What you’ll build: A minimal video player in C that uses FFmpeg’s libavformat (demuxing), libavcodec (decoding), and SDL2 (display) to play video files with audio sync.

Why it teaches FFmpeg architecture: This is how real video players work. You’ll understand AVFormatContext, AVCodecContext, AVPacket, AVFrame—the core abstractions that power everything from VLC to YouTube’s backend. You’ll experience firsthand the demux → decode → render pipeline.

Core challenges you’ll face:

  • Opening containers and finding streams with libavformat
  • Setting up decoders with libavcodec
  • Converting pixel formats with libswscale
  • Audio/video synchronization using PTS values
  • Real-time playback timing
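
The PTS challenge boils down to one conversion: a timestamp only means something relative to its stream's time base. A tiny sketch of the arithmetic (mirroring what av_q2d(time_base) * pts computes in libav code):

```c
#include <stdint.h>

// Convert a stream timestamp into seconds using the stream's time base
// (a rational tb_num/tb_den, e.g. 1/90000 for MPEG's 90 kHz clock).
double pts_to_seconds(int64_t pts, int tb_num, int tb_den) {
    return (double)pts * tb_num / tb_den;
}
```

For example, a PTS of 90000 in a 1/90000 time base is exactly 1.0 second; comparing the video clock and audio clock in these units is the core of A/V sync.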

Key Concepts:

  • FFmpeg data structures: FFmpeg libav tutorial - “Learn FFmpeg libav the Hard Way”
  • A/V sync: “Video Demystified” by Keith Jack - Chapter on timing
  • SDL2 basics: SDL2 documentation and tutorials

Difficulty: Intermediate Time estimate: 2-3 weeks Prerequisites: Solid C, understanding of pointers and memory management, basic threading concepts

Real world outcome:

  • Play any video file (MP4, MKV, AVI) in your own player window
  • See frames render on screen with synchronized audio
  • Add features: seek, pause, volume control
  • Understand exactly what VLC does under the hood

Learning milestones:

  1. Open file, enumerate streams → understand libavformat
  2. Decode video frame, display it → understand libavcodec
  3. Sync audio and video playback → understand PTS/DTS timing
  4. Handle multiple container formats → appreciate format abstraction

Project 5: H.264 NAL Unit Parser

  • File: VIDEO_AUDIO_MUXING_DEMUXING_PROJECTS.md
  • Main Programming Language: C
  • Alternative Programming Languages: Rust, C++, Go
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: Level 1: The “Resume Gold”
  • Difficulty: Level 3: Advanced (The Engineer)
  • Knowledge Area: Video Codecs, Binary Parsing
  • Software or Tool: H.264, x264
  • Main Book: H.264 and MPEG-4 Video Compression by Iain Richardson

What you’ll build: A tool that parses H.264/AVC bitstreams, identifies NAL (Network Abstraction Layer) units, extracts SPS/PPS parameters, and reports frame types (I/P/B frames).

Why it teaches codec internals: Demuxing gets you packets, but those packets contain encoded bitstreams. H.264 organizes data into NAL units—understanding this layer bridges the gap between container and raw pixels. You’ll see I-frames (keyframes), understand why seeking jumps to keyframes, and grasp the concept of reference frames.

Core challenges you’ll face:

  • Finding NAL unit start codes (0x000001 or 0x00000001)
  • Parsing NAL unit headers (type, reference IDC)
  • Understanding SPS (Sequence Parameter Set) and PPS (Picture Parameter Set)
  • Exponential-Golomb coding for parsing syntax elements
  • Distinguishing slice types (I/P/B)
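
The Exp-Golomb challenge is small enough to sketch in full; this ue(v) decoder (the bit reader and all names are illustrative) is the workhorse for SPS/PPS parsing:

```c
#include <stdint.h>
#include <stddef.h>

// Minimal MSB-first bit reader plus ue(v) Exp-Golomb decoder, the
// variable-length code H.264 uses for most SPS/PPS syntax elements.
typedef struct { const uint8_t *buf; size_t bitpos; } bitreader;

static int get_bit(bitreader *br) {
    int b = (br->buf[br->bitpos >> 3] >> (7 - (br->bitpos & 7))) & 1;
    br->bitpos++;
    return b;
}

// ue(v): count leading zero bits, then read that many more bits.
// Codewords: 1 -> 0, 010 -> 1, 011 -> 2, 00100 -> 3, ...
uint32_t read_ue(bitreader *br) {
    int zeros = 0;
    while (get_bit(br) == 0) zeros++;
    uint32_t val = 1;
    for (int i = 0; i < zeros; i++)
        val = (val << 1) | (uint32_t)get_bit(br);
    return val - 1;
}
```

A signed se(v) decoder is a one-line mapping on top of ue(v), so this little routine unlocks most of the SPS fields, including the width/height your parser will report.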

Resources for key challenges:

  • Vcodex H.264 Overview - Excellent technical introduction
  • “H.264 and MPEG-4 Video Compression” by Iain Richardson - The definitive book
  • x264 source code - Study a real implementation

Key Concepts:

  • Bitstream parsing and variable-length codes: H.264 specification ITU-T H.264
  • Video compression fundamentals: “H.264 and MPEG-4 Video Compression” by Iain Richardson - Chapters 1-5
  • NAL unit structure: Vcodex H.264 Overview

Difficulty: Advanced Time estimate: 2-3 weeks Prerequisites: Strong C, comfortable with bit manipulation, understanding of video frames

Real world outcome:

  • Run on any H.264 file: ./h264parse video.h264
  • Output like:
    NAL Unit 0: SPS (width=1920, height=1080, profile=High)
    NAL Unit 1: PPS
    NAL Unit 2: IDR Slice (I-frame, keyframe)
    NAL Unit 3: Non-IDR Slice (P-frame, refs=1)
    ...
    
  • Understand why ffmpeg -i input.mp4 -c:v copy -f h264 output.h264 produces what it does

Learning milestones:

  1. Find and count NAL units → understand bitstream structure
  2. Parse SPS, extract resolution → understand parameter sets
  3. Identify frame types → understand I/P/B frame dependencies

Project Comparison Table

| Project | Difficulty | Time | Depth of Understanding | Fun Factor |
|---|---|---|---|---|
| WAV Parser/Writer | Beginner | Weekend | ⭐⭐ (container basics) | ⭐⭐⭐ (immediate audio feedback) |
| BMP to Raw Video | Beginner-Int | Weekend-1wk | ⭐⭐⭐ (video fundamentals) | ⭐⭐⭐⭐ (create videos!) |
| MPEG-TS Demuxer | Intermediate | 1-2 weeks | ⭐⭐⭐⭐ (real demuxing) | ⭐⭐⭐ (satisfying parsing) |
| libav Video Player | Intermediate | 2-3 weeks | ⭐⭐⭐⭐⭐ (full pipeline) | ⭐⭐⭐⭐⭐ (build a player!) |
| H.264 NAL Parser | Advanced | 2-3 weeks | ⭐⭐⭐⭐⭐ (codec internals) | ⭐⭐⭐ (deep but abstract) |

Recommendation

Based on the goal of understanding how FFmpeg works and learning low-level programming:

Start with: Project 1 (WAV Parser) → Takes a weekend, builds binary parsing confidence

Then: Project 2 (BMP to Raw Video) → Understand video fundamentals before compression

Main learning: Project 4 (libav Video Player) → This is where everything clicks. The ffmpeg-libav-tutorial by Leandro Moreira is exceptional and will guide you through building this.

Deep dive: Project 3 (MPEG-TS Demuxer) if you want to understand container internals, or Project 5 (H.264 Parser) if you want to understand codec internals.


Final Overall Project: Build a Media Transcoder CLI

What you’ll build: A complete command-line transcoder (like a mini-FFmpeg) that can:

  • Read any video file (MP4, MKV, AVI, TS)
  • Decode video and audio streams
  • Apply filters (resize, crop, audio gain)
  • Re-encode to different codecs (using libavcodec)
  • Mux into a different container format

Why this is the capstone: This project ties together everything:

  • Demuxing (libavformat)
  • Decoding (libavcodec)
  • Frame processing (libavutil, libswscale, libswresample)
  • Encoding (libavcodec)
  • Muxing (libavformat)

You’ll implement the exact pipeline that FFmpeg uses: input → demux → decode → filter → encode → mux → output

Core challenges you’ll face:

  • Managing multiple codecs simultaneously
  • Handling different timebase conversions
  • Memory management for frame buffers
  • Supporting various input/output format combinations
  • Implementing proper flush/drain on stream end
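
The timebase-conversion challenge reduces to rescaling integer timestamps between rational clocks. A naive sketch of what av_rescale_q does (FFmpeg's real version additionally guards against 64-bit overflow and supports rounding modes):

```c
#include <stdint.h>

// Rescale a timestamp from time base src_num/src_den to dst_num/dst_den,
// e.g. from MPEG's 1/90000 clock to milliseconds (1/1000).
// Naive version: can overflow for large ts; av_rescale_q avoids that.
int64_t rescale_ts(int64_t ts, int src_num, int src_den,
                   int dst_num, int dst_den) {
    return ts * src_num * dst_den / ((int64_t)src_den * dst_num);
}
```

Every packet crossing from the input stream's time base to the output stream's time base needs this conversion, or your muxed file will play at the wrong speed.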

Difficulty: Advanced Time estimate: 1 month+ Prerequisites: Completed Projects 1-4, solid understanding of FFmpeg libraries

Real world outcome:

  • Run: ./mytranscoder input.mkv -vcodec h264 -acodec aac -s 1280x720 output.mp4
  • Produce valid, playable output files
  • Understand exactly what ffmpeg -i input.mkv -c:v libx264 -c:a aac -s 1280x720 output.mp4 does internally
  • Be able to read FFmpeg source code and understand it

Learning milestones:

  1. Transmux (change container, copy codecs) → understand format independence
  2. Transcode video only → understand decode/encode cycle
  3. Add audio transcoding → understand multi-stream handling
  4. Add filters → understand frame processing pipeline
  5. Handle edge cases → production-quality understanding

Essential Resources

The single best resource for this entire learning journey:

📚 FFmpeg libav tutorial by Leandro Moreira - This GitHub repo walks you through everything from “hello world” to transcoding, with excellent explanations and working C code.


Summary

This learning path covers video/audio muxing and demuxing through six hands-on projects (five core projects plus a capstone). Here’s the complete list:

| # | Project Name | Main Language | Difficulty | Time Estimate |
|---|---|---|---|---|
| 1 | WAV Audio File Parser & Writer | C | Beginner | Weekend |
| 2 | BMP Image Sequence to Raw Video Converter | C | Beginner-Intermediate | Weekend-1 week |
| 3 | MPEG-TS Demuxer from Scratch | C | Intermediate | 1-2 weeks |
| 4 | Video Player Using libav (FFmpeg Libraries) | C | Intermediate | 2-3 weeks |
| 5 | H.264 NAL Unit Parser | C | Advanced | 2-3 weeks |
| Final | Media Transcoder CLI (Mini-FFmpeg) | C | Advanced | 1 month+ |

For beginners: Start with projects #1, #2, then #4 (skip #3 and #5 initially)

For intermediate developers: Jump to projects #2, #4, #5 (do #1 as a warm-up if needed)

For advanced/codec-focused: Focus on projects #5, #3, then Final Project

Expected Outcomes

After completing these projects, you will:

  • Understand the full multimedia pipeline: From raw pixels/samples → containers → codecs → bitstreams
  • Master FFmpeg’s libav architecture: AVFormatContext, AVCodecContext, AVPacket, AVFrame—the core abstractions powering VLC, OBS, and every video tool
  • Parse binary formats confidently: Read codec specifications (H.264, AAC) and implement parsers from scratch
  • Debug A/V sync issues: Understand PTS/DTS timestamps and how to fix desynchronization
  • Build production-quality tools: Create transcoders, video players, and demuxers comparable to professional implementations
  • Read multimedia source code: Understand FFmpeg, VLC, and industry codebases line-by-line
  • Answer interview questions: Confidently explain containers vs codecs, color spaces, compression algorithms, and streaming protocols

You’ll have built 6 working multimedia projects that demonstrate deep understanding of video/audio processing from first principles—knowledge that 99% of developers never acquire.


Sources: