Compression Algorithms Deep Dive: Project-Based Learning
Understanding How Data Compression Really Works
Compression is everywhere: ZIP files, JPEG images, MP3 audio, Netflix streaming, website loading. Yet most developers treat it as a black box. By building compressors yourself, you’ll understand the mathematical foundations, the engineering trade-offs, and why certain algorithms dominate specific domains.
Core Concept Analysis
To truly understand compression, you need to grasp these fundamental building blocks:
1. Information Theory Foundations
- Entropy: The theoretical minimum bits needed to represent data (Shannon’s limit)
- Redundancy: The gap between actual data size and theoretical minimum
- Symbol probability: How often each byte/character appears
2. Lossless Compression Techniques
- Statistical coding: Huffman, Arithmetic, ANS (encode based on frequency)
- Dictionary coding: LZ77, LZ78, LZW (encode based on repetition)
- Transform coding: BWT, MTF (rearrange data to be more compressible)
3. Lossy Compression Techniques
- Transform domain: DCT, DFT, Wavelets (move to frequency space)
- Quantization: Reduce precision strategically
- Perceptual models: Exploit human sensory limitations
4. The Compression Pipeline
Most real compressors combine multiple techniques:
Input → Transform (BWT) → Statistical Model (Context) → Entropy Coder (ANS/Huffman) → Output
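To make the Shannon limit above concrete, here is a minimal C sketch (the function name is illustrative) that measures the order-0 entropy of a file's byte histogram; multiplying the result by the file size gives the lower bound that the entropy coders in the later projects approach.
#include <math.h>
#include <stdio.h>

/* Order-0 Shannon entropy of a file, in bits per byte (treats bytes as independent). */
double file_entropy(FILE *f) {
    unsigned long long count[256] = {0}, total = 0;
    int c;
    while ((c = fgetc(f)) != EOF) { count[c]++; total++; }
    if (total == 0) return 0.0;
    double h = 0.0;
    for (int i = 0; i < 256; i++) {
        if (count[i] == 0) continue;
        double p = (double)count[i] / (double)total;
        h -= p * log2(p);              /* each symbol ideally costs -log2(p) bits */
    }
    return h;                          /* roughly 4.5 for English text, near 8.0 for random data */
}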
Project 1: Run-Length Encoding (RLE) Compressor
- File: COMPRESSION_ALGORITHMS_DEEP_DIVE_PROJECTS.md
- Main Programming Language: C
- Alternative Programming Languages: Python, Rust, Go
- Coolness Level: Level 2: Practical but Forgettable
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 1: Beginner
- Knowledge Area: Basic Compression / Encoding
- Software or Tool: Simple Compressor
- Main Book: “Introduction to Data Compression” by Khalid Sayood (Chapter 2)
What you’ll build: A compressor that replaces repeated consecutive bytes with a count and the byte value. Input “AAABBCCCC” becomes “3A2B4C”.
Why it teaches compression fundamentals: RLE is the simplest compression algorithm that actually works. It teaches the core insight: compression finds and eliminates redundancy. You’ll immediately see when compression helps (repeated data) and when it hurts (random data expands!).
Core challenges you’ll face:
- Encoding format design (how to represent count + byte) → maps to format trade-offs
- Handling non-repeated bytes (1A1B1C is worse than ABC) → maps to expansion problem
- Count overflow (what if 300 A’s in a row?) → maps to boundary conditions
- Binary vs text (works differently on different data) → maps to data type awareness
- Escape sequences (distinguishing counts from data) → maps to encoding ambiguity
Key Concepts:
- Run-Length Encoding: Wikipedia RLE
- Entropy Basics: “Introduction to Data Compression” Chapter 1 - Khalid Sayood
- When RLE Works: Hydrolix RLE Guide
Difficulty: Beginner. Time estimate: 1-2 days. Prerequisites: Basic C programming, file I/O.
Real world outcome:
$ ./rle compress image.bmp image.rle
Original: 1,234,567 bytes
Compressed: 234,567 bytes (81% reduction)
$ ./rle compress random.bin random.rle
Original: 1,000 bytes
Compressed: 1,847 bytes (EXPANSION - random data!)
$ ./rle decompress image.rle restored.bmp
Decompressed successfully. Files match: YES
Implementation Hints: The naive approach “3A2B4C” fails for text that contains digits. Instead, use a binary format: for runs ≥4, output a special marker byte (e.g., 0xFF), then count, then the byte. For runs <4, output bytes literally. Handle the marker byte appearing in data by escaping it (output 0xFF 0x01 0xFF). For count overflow, output multiple run records. Test on BMP images (large uniform regions) vs compressed files (no runs).
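A minimal sketch of the marker-byte scheme described above (the marker value, minimum run, and function names are illustrative, not a fixed format):
#include <stdio.h>

#define MARKER  0xFF
#define MIN_RUN 4

/* Emit one finished run. Any MARKER byte is written as a run record, which doubles
   as the escape for a literal 0xFF appearing in the data. */
static void emit_run(FILE *out, int byte, int run) {
    if (run >= MIN_RUN || byte == MARKER) {
        fputc(MARKER, out); fputc(run, out); fputc(byte, out);
    } else {
        for (int i = 0; i < run; i++) fputc(byte, out);
    }
}

void rle_encode(FILE *in, FILE *out) {
    int prev = fgetc(in), c, run = 1;
    if (prev == EOF) return;
    while ((c = fgetc(in)) != EOF) {
        if (c == prev && run < 255) { run++; continue; }   /* cap the count at one byte */
        emit_run(out, prev, run);                          /* overflow becomes a second record */
        prev = c; run = 1;
    }
    emit_run(out, prev, run);
}
The decoder mirrors this: a MARKER byte announces a (count, value) record; anything else is a literal.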
Learning milestones:
- Basic RLE works on simple input → You understand the core algorithm
- Binary files compress/decompress correctly → You handle edge cases
- You understand why random data expands → You grasp entropy limits
- BMP images compress well, PNGs don’t → You understand data characteristics
Project 2: Huffman Coding Compressor
- File: COMPRESSION_ALGORITHMS_DEEP_DIVE_PROJECTS.md
- Main Programming Language: C
- Alternative Programming Languages: Rust, Python, Go
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 2: Intermediate
- Knowledge Area: Entropy Coding / Trees
- Software or Tool: Entropy Compressor
- Main Book: “Introduction to Data Compression” by Khalid Sayood (Chapter 3)
What you’ll build: A compressor that assigns shorter bit sequences to more frequent bytes. If ‘e’ appears 1000 times and ‘z’ appears once, ‘e’ might be encoded as “10” (2 bits) and ‘z’ as “1101110” (7 bits).
Why it teaches compression fundamentals: Huffman coding is the foundation of entropy coding. It’s optimal for symbol-by-symbol coding and appears inside DEFLATE (ZIP), JPEG, MP3, and more. Understanding Huffman means understanding why some data compresses better than others.
Core challenges you’ll face:
- Frequency counting (scanning input to build histogram) → maps to statistical modeling
- Tree construction (building optimal prefix-free tree) → maps to greedy algorithms
- Canonical codes (standardizing tree representation) → maps to format efficiency
- Bit manipulation (packing variable-length codes into bytes) → maps to low-level coding
- Header storage (transmitting the tree to decoder) → maps to metadata overhead
Key Concepts:
- Huffman Algorithm: “Algorithms” by Sedgewick & Wayne - Chapter 5.5
- Prefix-Free Codes: “Introduction to Data Compression” Chapter 3 - Khalid Sayood
- Canonical Huffman: Managing Gigabytes
- Information Entropy: “Elements of Information Theory” Chapter 2 - Cover & Thomas
Difficulty: Intermediate. Time estimate: 1 week. Prerequisites: Binary trees, bit manipulation, file I/O.
Real world outcome:
$ ./huffman compress book.txt book.huf
Original: 500,000 bytes
Compressed: 285,000 bytes (43% reduction)
Entropy: 4.56 bits/byte (theoretical limit: 285,000 bytes)
$ ./huffman analyze book.txt
Symbol frequencies:
' ' (space): 98,234 (19.6%) → code: 00 (2 bits)
'e': 62,145 (12.4%) → code: 010 (3 bits)
't': 45,231 (9.0%) → code: 0110 (4 bits)
...
'z': 234 (0.05%) → code: 11111110 (8 bits)
$ ./huffman decompress book.huf restored.txt
Decompressed successfully. Files match: YES
Implementation Hints: First pass: count byte frequencies. Build tree: create leaf nodes for each byte with its frequency, then repeatedly merge the two lowest-frequency nodes into a parent. Assign codes: traverse tree, appending 0 for left and 1 for right. For canonical Huffman: sort codes by length, then alphabetically within length, and assign sequential codes. Store only code lengths in header (much smaller). For bit packing, accumulate bits in a buffer and flush full bytes.
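The canonical-code step is short enough to sketch directly. This follows the length-to-code procedure that RFC 1951 also describes, assuming byte symbols and code lengths of at most 15 bits:
#include <stdint.h>

#define MAX_BITS 15

/* Assign canonical codes from code lengths alone: shorter lengths get numerically
   smaller codes, and symbols of equal length are ordered by symbol value. */
void canonical_codes(const uint8_t len[256], uint16_t code[256]) {
    uint16_t bl_count[MAX_BITS + 1] = {0};
    uint16_t next_code[MAX_BITS + 1] = {0};
    for (int s = 0; s < 256; s++) bl_count[len[s]]++;
    bl_count[0] = 0;                              /* length 0 marks an unused symbol */
    uint16_t c = 0;
    for (int bits = 1; bits <= MAX_BITS; bits++) {
        c = (uint16_t)((c + bl_count[bits - 1]) << 1);
        next_code[bits] = c;
    }
    for (int s = 0; s < 256; s++)
        if (len[s] != 0)
            code[s] = next_code[len[s]]++;
}
Because the codes are fully determined by the lengths, the header only has to carry one length per symbol.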
Learning milestones:
- Tree builds correctly → You understand the greedy algorithm
- Encoding produces correct bit stream → You can do bit manipulation
- Compression ratio approaches entropy → You understand optimality
- Canonical codes reduce header size → You understand practical optimization
Project 3: LZ77 Sliding Window Compressor
- File: COMPRESSION_ALGORITHMS_DEEP_DIVE_PROJECTS.md
- Main Programming Language: C
- Alternative Programming Languages: Rust, C++, Go
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 3. The “Service & Support” Model
- Difficulty: Level 3: Advanced
- Knowledge Area: Dictionary Compression
- Software or Tool: LZ Compressor
- Main Book: “Data Compression: The Complete Reference” by David Salomon
What you’ll build: A compressor that replaces repeated sequences with back-references. “ABCDEFABCDEF” becomes “ABCDEF(6,6)” meaning “go back 6 bytes, copy 6 bytes.”
Why it teaches compression fundamentals: LZ77 is the foundation of modern compression (ZIP, gzip, PNG, zstd). Unlike Huffman which exploits single-byte frequencies, LZ77 exploits repeated patterns of any length. This is why text (repeated words) and code (repeated patterns) compress so well.
Core challenges you’ll face:
- Match finding (finding longest match in window) → maps to search algorithms
- Hash chains (efficient duplicate detection) → maps to hash table design
- Lazy matching (sometimes waiting finds better match) → maps to optimization strategies
- Output format (literal vs match distinction) → maps to encoding design
- Window size trade-offs (bigger = better compression, more memory) → maps to resource trade-offs
Key Concepts:
- LZ77 Algorithm: Stanford LZ77 Tutorial
- Hash Chains: “Data Compression: The Complete Reference” Chapter 4 - Salomon
- Lazy Matching: zlib Algorithm
- LZSS Variant: Wikipedia LZSS
Difficulty: Advanced. Time estimate: 2-3 weeks. Prerequisites: Hash tables, string matching, understanding of Huffman.
Real world outcome:
$ ./lz77 compress linux-kernel.tar linux-kernel.lz77
Original: 850,000,000 bytes
Compressed: 125,000,000 bytes (85% reduction)
Compression speed: 45 MB/s
Window size: 32 KB
$ ./lz77 --analyze linux-kernel.tar
Match statistics:
Literal bytes: 15%
Match references: 85%
Average match length: 12.3 bytes
Longest match: 258 bytes
$ ./lz77 decompress linux-kernel.lz77 restored.tar
Decompression speed: 350 MB/s (asymmetric!)
Files match: YES
Implementation Hints: Use a hash table keyed on 3-byte sequences. For each position, compute hash of next 3 bytes and check for matches at stored positions. Chain collisions (multiple positions with same hash). For each potential match, extend forward to find longest match. Emit literals for unmatched bytes, emit (distance, length) for matches. Minimum match length is typically 3 (shorter matches aren’t worth the overhead). LZSS improvement: use a flag bit to distinguish literals from matches.
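A sketch of the hash-chain match finder under the assumptions above (3-byte hash, 32 KB window); inserting each new position into head[] and prev[] after it is processed is left to the caller, and all names are illustrative:
#include <stddef.h>
#include <stdint.h>

#define WINDOW    32768
#define HASH_BITS 15
#define HASH_SIZE (1 << HASH_BITS)
#define MIN_MATCH 3

/* Hash the 3 bytes at p into a table index. */
static uint32_t hash3(const uint8_t *p) {
    return (((uint32_t)p[0] << 10) ^ ((uint32_t)p[1] << 5) ^ p[2]) & (HASH_SIZE - 1);
}

/* Walk the hash chain for position `pos` and return the longest match length found
   (0 if shorter than MIN_MATCH). head[h] holds the most recent position hashed to h;
   prev[pos & (WINDOW-1)] chains back to the previous position with the same hash. */
size_t longest_match(const uint8_t *buf, size_t pos, size_t end,
                     const int32_t *head, const int32_t *prev, size_t *match_pos) {
    size_t best = 0;
    int32_t cand = head[hash3(buf + pos)];
    int budget = 128;                                  /* bound the work per position */
    while (cand >= 0 && (size_t)cand < pos && pos - (size_t)cand <= WINDOW && budget--) {
        size_t len = 0, limit = end - pos;
        while (len < limit && buf[(size_t)cand + len] == buf[pos + len]) len++;
        if (len > best) { best = len; *match_pos = (size_t)cand; }
        cand = prev[(size_t)cand & (WINDOW - 1)];
    }
    return best >= MIN_MATCH ? best : 0;
}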
Learning milestones:
- Basic LZ77 compresses/decompresses → You understand the core algorithm
- Hash chains speed up matching → You understand practical implementation
- Lazy matching improves ratio → You understand optimization trade-offs
- Decompression is much faster than compression → You understand asymmetry
Project 4: DEFLATE Implementation (LZ77 + Huffman)
- File: COMPRESSION_ALGORITHMS_DEEP_DIVE_PROJECTS.md
- Main Programming Language: C
- Alternative Programming Languages: Rust, C++
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 4. The “Open Core” Infrastructure
- Difficulty: Level 4: Expert
- Knowledge Area: Compression Pipeline
- Software or Tool: DEFLATE Compressor
- Main Book: “RFC 1951: DEFLATE Compressed Data Format Specification”
What you’ll build: A compressor combining LZ77 and Huffman coding—the algorithm behind ZIP, gzip, PNG, and HTTP compression. LZ77 finds matches, then Huffman encodes the literals, lengths, and distances.
Why it teaches compression fundamentals: DEFLATE shows how compression algorithms compose. LZ77 removes redundancy, then Huffman compresses the residual entropy. This two-stage approach is the template for most modern compressors.
Core challenges you’ll face:
- Block structure (stored, static Huffman, dynamic Huffman) → maps to adaptive algorithms
- Dynamic tree building (per-block optimal trees) → maps to local optimization
- Length/distance alphabets (special symbols for back-references) → maps to extended coding
- Bit-level alignment (variable-length codes packed into bytes) → maps to bit stream handling
- Compatibility (producing gzip/zlib compatible output) → maps to format compliance
Key Concepts:
- DEFLATE Specification: RFC 1951
- zlib/gzip Headers: RFC 1950 (zlib), RFC 1952 (gzip)
- Implementation Guide: zlib Algorithm
- PNG Usage: PNG Compression
Difficulty: Expert. Time estimate: 3-4 weeks. Prerequisites: LZ77, Huffman, bit manipulation mastery.
Real world outcome:
$ ./mydeflate compress document.pdf document.gz
Original: 1,234,567 bytes
Compressed: 987,654 bytes (20% reduction)
Compatible with: gzip, zlib, 7-zip
$ gzip -d document.gz # Verify with standard tool
$ diff document.pdf document # Identical!
$ ./mydeflate --stats document.pdf
Block analysis:
Block 1: Dynamic Huffman, 45KB → 32KB
Block 2: Dynamic Huffman, 45KB → 31KB
...
Literal symbols: 123,456
Match references: 45,678
Average match length: 8.2 bytes
Implementation Hints: DEFLATE has three block types: stored (no compression), static Huffman (predefined trees), dynamic Huffman (per-block trees). For dynamic blocks: after LZ77, count frequencies of literals (0-255), end-of-block (256), and length codes (257-285). Build Huffman tree. Similarly for distance codes (0-29). Encode the trees themselves using another layer of Huffman (code length codes). The RFC specifies exact bit orderings—follow it precisely. Use zlib’s output as reference to verify correctness.
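The bit-level packing is where most first attempts go wrong. Here is a minimal sketch of an LSB-first bit writer matching DEFLATE's packing order; note that Huffman codes themselves must be bit-reversed before being passed in (zlib does the same), while extra bits are already LSB-first. The type and function names are illustrative.
#include <stdint.h>
#include <stdio.h>

/* DEFLATE packs fields starting at the least significant bit of each output byte.
   `count` should stay at 24 or below so the accumulator never overflows. */
typedef struct { FILE *out; uint32_t bits; int nbits; } bitwriter;

void put_bits(bitwriter *bw, uint32_t value, int count) {
    bw->bits |= value << bw->nbits;          /* append above the bits already held */
    bw->nbits += count;
    while (bw->nbits >= 8) {                 /* flush whole bytes */
        fputc(bw->bits & 0xFF, bw->out);
        bw->bits >>= 8;
        bw->nbits -= 8;
    }
}

void flush_bits(bitwriter *bw) {             /* pad and emit the final partial byte */
    if (bw->nbits > 0) fputc(bw->bits & 0xFF, bw->out);
    bw->bits = 0;
    bw->nbits = 0;
}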
Learning milestones:
- Stored blocks work → You understand the format structure
- Static Huffman blocks compress → You’ve integrated LZ77 + Huffman
- Dynamic Huffman improves compression → You understand adaptive coding
- Output is gzip-compatible → You’ve achieved format compliance
Project 5: Arithmetic Coding Compressor
- File: COMPRESSION_ALGORITHMS_DEEP_DIVE_PROJECTS.md
- Main Programming Language: C
- Alternative Programming Languages: Rust, Python, C++
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 4: Expert
- Knowledge Area: Entropy Coding / Mathematics
- Software or Tool: Arithmetic Coder
- Main Book: “Introduction to Data Compression” by Khalid Sayood (Chapter 4)
What you’ll build: An encoder that represents the entire message as a single fraction between 0 and 1. Unlike Huffman (integer bits per symbol), arithmetic coding can use 0.3 bits for a very common symbol.
Why it teaches compression fundamentals: Arithmetic coding achieves near-optimal compression, reaching the theoretical entropy limit. It handles symbol probabilities that Huffman can’t efficiently encode. Understanding arithmetic coding explains why modern coders (ANS, range coding) exist.
Core challenges you’ll face:
- Interval subdivision (shrinking range based on probabilities) → maps to mathematical precision
- Fixed-point arithmetic (avoiding floating-point errors) → maps to numerical stability
- Renormalization (outputting bits when range narrows) → maps to bit streaming
- End-of-message (knowing when to stop decoding) → maps to termination handling
- Underflow prevention (middle-of-range scaling) → maps to edge case handling
Key Concepts:
- Arithmetic Coding Theory: “Introduction to Data Compression” Chapter 4 - Sayood
- Practical Implementation: “Data Compression: The Complete Reference” Chapter 4.4 - Salomon
- Fixed-Point Tricks: Arithmetic Coding Tutorial
- Comparison to Huffman: DeepRender Entropy Coding
Difficulty: Expert. Time estimate: 2-3 weeks. Prerequisites: Strong math, bit manipulation, probability concepts.
Real world outcome:
$ ./arithmetic compress skewed_data.bin skewed.ac
Original: 1,000,000 bytes
Compressed: 125,456 bytes
Huffman would give: 127,890 bytes (1.9% worse)
$ ./arithmetic --verbose compress test.txt test.ac
Symbol 'a' (prob 0.4): interval [0.0, 0.4)
Symbol 'b' (prob 0.3): interval [0.16, 0.28)
Symbol 'c' (prob 0.2): interval [0.244, 0.268)
Symbol 'a' (prob 0.4): interval [0.244, 0.2536)
...
Final interval: [0.25013, 0.25014)
Output: 0x4A3B... (17 bits)
Entropy: 1.846 bits/symbol
Arithmetic coding: 1.848 bits/symbol (optimal!)
Huffman coding: 1.9 bits/symbol
Implementation Hints: Use 32-bit integers for low/high range (e.g., 0x00000000 to 0xFFFFFFFF). For each symbol, subdivide current range proportionally to symbol probability. When high and low share the same top bit, output that bit and shift both left. Underflow: when range straddles 0.5 (low < 0.5 < high), track pending bits and scale around 0.5. For decoding, maintain a “code” value read from input, and determine which symbol’s subrange contains it. The tricky part is matching encoder/decoder range updates exactly.
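A sketch of the encoder's renormalization loop in the Witten-Neal-Cleary style the hints describe, using 32-bit fixed point. The bit sink here is a toy that prints characters, and `total` is assumed to stay below about 2^16 so the 64-bit products never lose precision; all names are illustrative.
#include <stdint.h>
#include <stdio.h>

static uint32_t low = 0, high = 0xFFFFFFFFu;
static int pending = 0;                      /* bits deferred by underflow handling */

static void output_bit(int b) { putchar('0' + b); }   /* toy sink for illustration */

static void output_with_pending(int b) {
    output_bit(b);
    while (pending) { output_bit(!b); pending--; }
}

/* Narrow [low, high] to the sub-interval of a symbol whose cumulative range is
   [cum_lo, cum_hi) out of `total`, then renormalize. */
void encode_symbol(uint32_t cum_lo, uint32_t cum_hi, uint32_t total) {
    uint64_t span = (uint64_t)(high - low) + 1;
    high = low + (uint32_t)(span * cum_hi / total) - 1;
    low  = low + (uint32_t)(span * cum_lo / total);
    for (;;) {
        if (high < 0x80000000u) {                        /* both bounds in lower half: emit 0 */
            output_with_pending(0);
        } else if (low >= 0x80000000u) {                 /* both bounds in upper half: emit 1 */
            output_with_pending(1);
        } else if (low >= 0x40000000u && high < 0xC0000000u) {
            pending++;                                   /* underflow: defer the bit */
            low  -= 0x40000000u;
            high -= 0x40000000u;
        } else {
            break;
        }
        low <<= 1;                                       /* zoom into the chosen half */
        high = (high << 1) | 1;
    }
    /* A final flush step (not shown) emits one more disambiguating bit plus any pending bits. */
}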
Learning milestones:
- Basic encoding produces correct output → You understand the algorithm
- Decoding recovers original → You’ve matched encoder/decoder precisely
- Compression beats Huffman on skewed data → You see the advantage
- Fixed-point implementation is stable → You’ve handled precision issues
Project 6: ANS (Asymmetric Numeral Systems) Compressor
- File: COMPRESSION_ALGORITHMS_DEEP_DIVE_PROJECTS.md
- Main Programming Language: C
- Alternative Programming Languages: Rust, C++
- Coolness Level: Level 5: Pure Magic
- Business Potential: 5. The “Industry Disruptor”
- Difficulty: Level 4: Expert
- Knowledge Area: Modern Entropy Coding
- Software or Tool: ANS Compressor
- Main Book: “Asymmetric Numeral Systems” by Jarek Duda (arXiv paper)
What you’ll build: A modern entropy coder that combines arithmetic coding’s compression ratio with Huffman’s speed. ANS is used in Facebook’s zstd, Apple’s LZFSE, and JPEG XL.
Why it teaches compression fundamentals: ANS represents the cutting edge of entropy coding. By building it, you’ll understand why zstd replaced gzip in performance, and why the compression community moved beyond arithmetic coding.
Core challenges you’ll face:
- State machine encoding (single integer instead of interval) → maps to simplified state
- Table construction (spreading symbols across state space) → maps to precomputation
- Streaming (outputting bits as state grows) → maps to bit stream management
- Reverse order (ANS encodes backward, decodes forward) → maps to stack semantics
- FSE variant (tabled ANS for speed) → maps to lookup table optimization
Key Concepts:
- ANS Paper: arXiv 1311.2540 - Jarek Duda
- Tutorial: Understanding ANS
- FSE (Finite State Entropy): Yann Collet’s Blog
- Patent-Free: ANS is Public Domain
Difficulty: Expert. Time estimate: 2-3 weeks. Prerequisites: Arithmetic coding understanding, probability, state machines.
Real world outcome:
$ ./ans compress book.txt book.ans
Original: 500,000 bytes
Compressed: 284,500 bytes (same as arithmetic coding!)
Speed: 450 MB/s (10x faster than arithmetic coding!)
$ ./ans benchmark book.txt
Algorithm     Size (% of original)   Encode     Decode
Huffman       56.8%                  320 MB/s   380 MB/s
Arithmetic    56.9%                   45 MB/s    48 MB/s
ANS (rANS)    56.9%                  450 MB/s   480 MB/s
ANS (tANS)    56.9%                  520 MB/s   600 MB/s
$ ./ans decompress book.ans restored.txt
Files match: YES
Implementation Hints:
For rANS (range ANS): state is a large integer. To encode symbol s with probability p: state' = (state / freq[s]) * total + cumfreq[s] + (state % freq[s]). Renormalize when state exceeds threshold by outputting low bits. Decoding is the inverse. For tANS: precompute a table of state transitions. For each (state, symbol) pair, compute next state and output bits. This avoids division during encode/decode. The tricky insight: encoding must happen in reverse order of intended decode order, so buffer symbols or encode backward.
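A sketch of one rANS encode step under the assumptions above (12-bit probabilities, byte-wise renormalization, toy output sink). It mirrors the structure of public-domain rANS implementations but is not drop-in code, and the constants are illustrative.
#include <stdint.h>
#include <stdio.h>

#define PROB_BITS  12
#define PROB_SCALE (1u << PROB_BITS)   /* all symbol frequencies must sum to this */
#define RANS_LOW   (1u << 23)          /* normalized state lives in [RANS_LOW, RANS_LOW * 256) */

static void put_byte(uint8_t b) { putchar(b); }   /* toy sink; real code writes the stream backward */

/* One encode step for a symbol with frequency f and cumulative frequency cf. */
uint32_t rans_encode(uint32_t state, uint32_t f, uint32_t cf) {
    uint32_t x_max = ((RANS_LOW >> PROB_BITS) << 8) * f;   /* renormalization bound */
    while (state >= x_max) {
        put_byte((uint8_t)(state & 0xFF));
        state >>= 8;
    }
    /* core step from the hints: state' = (state / f) * PROB_SCALE + cf + (state % f) */
    return (state / f) * PROB_SCALE + cf + (state % f);
}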
Learning milestones:
- rANS encodes/decodes correctly → You understand the core algorithm
- tANS table construction works → You understand the optimization
- Speed matches theoretical → You’ve implemented efficiently
- Compression ratio matches arithmetic coding → You’ve achieved optimality
Project 7: Burrows-Wheeler Transform (BWT) Compressor
- File: COMPRESSION_ALGORITHMS_DEEP_DIVE_PROJECTS.md
- Main Programming Language: C
- Alternative Programming Languages: Rust, C++, Python
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 3. The “Service & Support” Model
- Difficulty: Level 3: Advanced
- Knowledge Area: Transform Coding
- Software or Tool: BWT Compressor
- Main Book: “Managing Gigabytes” by Witten, Moffat, Bell (Chapter 4)
What you’ll build: A compressor using the Burrows-Wheeler Transform: rearrange input to cluster similar characters, then compress with Move-to-Front + entropy coding. This is how bzip2 works.
Why it teaches compression fundamentals: BWT shows that compression can be improved by transforming data before coding. Unlike LZ77 (looks for exact matches), BWT exploits the statistical structure of text—letters that follow similar contexts cluster together.
Core challenges you’ll face:
- Suffix sorting (sorting all rotations efficiently) → maps to string algorithms
- Inverse transform (recovering original from BWT) → maps to reversible transforms
- Move-to-Front (converting to small integers) → maps to locality exploitation
- Block sizing (larger blocks = better compression) → maps to resource trade-offs
- Zero-run coding (RLE for the many zeros after MTF) → maps to secondary coding
Key Concepts:
- BWT Algorithm: Wikipedia BWT
- Suffix Arrays: “Algorithms on Strings” by Crochemore et al.
- Move-to-Front: Wikipedia MTF
- bzip2 Implementation: bzip2 Source
Difficulty: Advanced. Time estimate: 2-3 weeks. Prerequisites: Sorting algorithms, understanding of suffix arrays.
Real world outcome:
$ ./bwt compress enwik8 enwik8.bwt
Original: 100,000,000 bytes
After BWT + MTF: many zeros and small values
After Huffman: 29,008,758 bytes (71% reduction)
Compression time: 45 seconds
$ ./bwt --analyze enwik8
BWT clustering effectiveness:
Before BWT: entropy 4.56 bits/byte
After BWT: entropy 3.21 bits/byte
After MTF: entropy 2.32 bits/byte
$ bunzip2 -c enwik8.bz2 | diff - restored
# Round-trip matches the original; real bzip2 achieves a similar ratio
Implementation Hints:
Naive BWT sorts all rotations—O(n² log n). Use suffix arrays for O(n log n) or better. For each position i, the BWT output is input[(suffix_array[i] - 1) mod n]. Store the position of the original string (where suffix_array[i] == 0) for decoding. For inverse BWT: build the “LF mapping” (first column to last column correspondence). Start at the original position and follow the mapping n times. MTF: maintain a list of recent characters, output the index, move that character to front. Combine with run-length coding for zeros and Huffman.
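The inverse transform is the part that surprises people, so here is a minimal sketch of the LF-mapping walk described above; `primary` is the row index of the original string in the sorted rotation matrix, and error handling is omitted.
#include <stdlib.h>

/* Inverse BWT via the LF mapping. L is the transform output (last column) of length n. */
void inverse_bwt(const unsigned char *L, int n, int primary, unsigned char *out) {
    int count[256] = {0}, first[256], seen[256] = {0};
    int *lf = malloc((size_t)n * sizeof(int));
    for (int i = 0; i < n; i++) count[L[i]]++;
    int sum = 0;
    for (int c = 0; c < 256; c++) { first[c] = sum; sum += count[c]; }   /* start of each char in column F */
    for (int i = 0; i < n; i++) {                /* LF(i): where row i's last-column char sits in column F */
        lf[i] = first[L[i]] + seen[L[i]];
        seen[L[i]]++;
    }
    int p = primary;
    for (int i = n - 1; i >= 0; i--) {           /* walk LF, recovering the text back to front */
        out[i] = L[p];
        p = lf[p];
    }
    free(lf);
}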
Learning milestones:
- BWT transforms and inverts correctly → You understand the algorithm
- Suffix array construction is efficient → You’ve implemented practical sorting
- MTF reduces entropy significantly → You understand the clustering effect
- Full pipeline matches bzip2 ratio → You’ve built a real compressor
Project 8: LZ4/Snappy-style Fast Compressor
- File: COMPRESSION_ALGORITHMS_DEEP_DIVE_PROJECTS.md
- Main Programming Language: C
- Alternative Programming Languages: Rust, C++
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 4. The “Open Core” Infrastructure
- Difficulty: Level 3: Advanced
- Knowledge Area: High-Speed Compression
- Software or Tool: Fast LZ Compressor
- Main Book: “LZ4 Explained” by Yann Collet (Blog Series)
What you’ll build: A compressor optimized for speed over ratio. LZ4 compresses at 500+ MB/s while still achieving 2:1 compression on typical data. Used in Linux kernel, databases, games.
Why it teaches compression fundamentals: Speed matters. Sometimes you need compression that’s faster than disk I/O. LZ4 shows how to strip LZ77 to its essentials: simple hash lookup, minimal format overhead, branchless decode loops.
Core challenges you’ll face:
- Single-probe hash table (one lookup, no chains) → maps to speed vs ratio trade-off
- Minimal format (no Huffman, just variable-length integers) → maps to format simplicity
- Cache-friendly access (everything fits in L1/L2) → maps to memory hierarchy
- Branchless decoding (predictable loops) → maps to CPU pipeline optimization
- Block independence (parallel encode/decode) → maps to parallelization
Key Concepts:
- LZ4 Specification: LZ4 Block Format
- Snappy Design: Snappy Documentation
- Speed Tricks: Yann Collet’s Blog
- Comparison: LZ4 vs DEFLATE
Difficulty: Advanced. Time estimate: 1-2 weeks. Prerequisites: LZ77 understanding, performance optimization, low-level C.
Real world outcome:
$ ./fastlz compress linux-kernel.tar kernel.lz4
Original: 850,000,000 bytes
Compressed: 340,000,000 bytes (60% reduction)
Speed: 650 MB/s (limited by memory bandwidth!)
$ ./fastlz decompress kernel.lz4 restored.tar
Speed: 2,800 MB/s (faster than SSD read!)
$ ./fastlz benchmark corpus.tar
Compressor    Size (% of original)   Compress   Decompress
gzip          35%                     45 MB/s     250 MB/s
LZ4           60%                    650 MB/s   2,800 MB/s
LZ4-HC        45%                     35 MB/s   2,800 MB/s
Snappy        58%                    450 MB/s   1,200 MB/s
Implementation Hints: Use a hash table with 4KB-64KB entries (power of 2). Hash the next 4 bytes, single-probe lookup—if match, extend; if not, emit literal and update table. Format: literal length (4 bits) + match length (4 bits) in one byte, followed by literals, followed by 2-byte match offset. Lengths ≥15 use continuation bytes. For speed: use memcpy for literals (compilers optimize this), unroll the match copy loop, use pointer arithmetic instead of index math. The decode loop should be <20 instructions.
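A sketch of emitting one sequence in the LZ4 block format described above (token, literal-length continuation bytes, literals, little-endian offset, match-length continuation); bounds checking is omitted and the function name is illustrative.
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Write one sequence into dst: `lit_len` literals from `lits`, then a match of
   `match_len` bytes (>= 4) at back-distance `offset`. Returns bytes written. */
size_t emit_sequence(uint8_t *dst, const uint8_t *lits, size_t lit_len,
                     uint16_t offset, size_t match_len) {
    uint8_t *p = dst;
    size_t ml = match_len - 4;                          /* format stores matchlen - MINMATCH */
    *p++ = (uint8_t)(((lit_len < 15 ? lit_len : 15) << 4) | (ml < 15 ? ml : 15));
    if (lit_len >= 15) {                                /* literal-length continuation bytes */
        size_t rest = lit_len - 15;
        while (rest >= 255) { *p++ = 255; rest -= 255; }
        *p++ = (uint8_t)rest;
    }
    memcpy(p, lits, lit_len);                           /* literals, copied verbatim */
    p += lit_len;
    *p++ = (uint8_t)(offset & 0xFF);                    /* 2-byte little-endian match offset */
    *p++ = (uint8_t)(offset >> 8);
    if (ml >= 15) {                                     /* match-length continuation bytes */
        size_t rest = ml - 15;
        while (rest >= 255) { *p++ = 255; rest -= 255; }
        *p++ = (uint8_t)rest;
    }
    return (size_t)(p - dst);
}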
Learning milestones:
- Basic LZ4 format encodes/decodes → You understand the format
- Compression speed exceeds 200 MB/s → You’ve optimized encoding
- Decompression speed exceeds 1 GB/s → You’ve optimized decoding
- Ratio matches reference LZ4 → You’ve matched the algorithm
Project 9: Zstandard-style Compressor
- File: COMPRESSION_ALGORITHMS_DEEP_DIVE_PROJECTS.md
- Main Programming Language: C
- Alternative Programming Languages: Rust, C++
- Coolness Level: Level 5: Pure Magic
- Business Potential: 5. The “Industry Disruptor”
- Difficulty: Level 5: Master
- Knowledge Area: State-of-the-Art Compression
- Software or Tool: Modern Compressor
- Main Book: “RFC 8878: Zstandard Compression”
What you’ll build: A compressor combining LZ77, Huffman, and FSE (ANS variant) with configurable compression levels. zstd achieves gzip-like ratios at LZ4-like speeds.
Why it teaches compression fundamentals: Zstandard represents the state of the art. It shows how to combine all the techniques: dictionary matching (LZ77), entropy coding (FSE), multiple compression levels, and modern CPU optimization. Understanding zstd means understanding modern compression.
Core challenges you’ll face:
- Multiple entropy backends (Huffman for literals, FSE for sequences) → maps to hybrid coding
- Sequence coding (literal length, match length, offset as tuples) → maps to structured encoding
- Repeat offsets (caching recent match distances) → maps to statistical optimization
- Compression levels (trading speed for ratio) → maps to configurable algorithms
- Dictionary compression (training on sample data) → maps to prior knowledge
Key Concepts:
- Zstandard Specification: RFC 8878
- FSE Algorithm: Finite State Entropy
- Implementation: zstd Source
- Comparison: Zstd vs Brotli
Difficulty: Master. Time estimate: 1-2 months. Prerequisites: All previous projects, deep understanding of entropy coding.
Real world outcome:
$ ./myzstd -1 compress linux-kernel.tar kernel.zst
Original: 850,000,000 bytes
Compressed: 150,000,000 bytes (82% reduction)
Speed: 400 MB/s
$ ./myzstd -19 compress linux-kernel.tar kernel-max.zst
Compressed: 95,000,000 bytes (89% reduction)
Speed: 5 MB/s (but still decompresses at 800 MB/s!)
$ ./myzstd --train-dict samples/ -o custom.dict
Trained dictionary on 1000 samples
Dictionary size: 112 KB
$ ./myzstd -D custom.dict compress small_files/*.json output.zst
Compression ratio improved by 40% with dictionary!
Implementation Hints: zstd encodes “sequences”: (literal_length, match_length, offset) tuples. Use FSE for the match/literal length codes, and Huffman or FSE for literals. Repeat offsets: track last 3 match offsets; if new offset matches one, use special codes. For compression levels: level 1 uses fast hash matching (like LZ4), high levels use optimal parsing with lazy evaluation. Dictionary: prepopulate the hash table and LZ77 window with trained data. Reference the actual zstd source—it’s well-commented.
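The repeat-offset idea is simple to sketch. Treat this as a simplified model: real zstd's repcode rules have extra cases (for example around zero literal lengths), and the names here are illustrative.
#include <stdint.h>

/* Keep the three most recently used offsets; a match that reuses one of them is
   signalled with a short repeat code instead of a full offset. */
typedef struct { uint32_t rep[3]; } rep_state;

int offset_code(rep_state *rs, uint32_t offset) {
    for (int i = 0; i < 3; i++) {
        if (rs->rep[i] == offset) {
            for (int j = i; j > 0; j--) rs->rep[j] = rs->rep[j - 1];   /* move the hit to front */
            rs->rep[0] = offset;
            return i;                       /* repeat code 0..2: cheap to encode */
        }
    }
    rs->rep[2] = rs->rep[1];                /* not a repeat: shift history, encode offset fully */
    rs->rep[1] = rs->rep[0];
    rs->rep[0] = offset;
    return -1;
}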
Learning milestones:
- Basic compression works → You’ve integrated the components
- Ratio matches gzip at 5x speed → You’ve achieved the zstd value proposition
- Multiple levels work → You’ve implemented configurable compression
- Dictionary training improves small files → You understand the full system
Project 10: JPEG Encoder from Scratch
- File: COMPRESSION_ALGORITHMS_DEEP_DIVE_PROJECTS.md
- Main Programming Language: C
- Alternative Programming Languages: Rust, C++, Python
- Coolness Level: Level 5: Pure Magic
- Business Potential: 4. The “Open Core” Infrastructure
- Difficulty: Level 4: Expert
- Knowledge Area: Image Compression / DCT
- Software or Tool: JPEG Encoder
- Main Book: “JPEG Still Image Data Compression Standard” by Pennebaker & Mitchell
What you’ll build: A JPEG encoder that converts raw RGB images to JPEG files, implementing color space conversion, DCT, quantization, and entropy coding.
Why it teaches compression fundamentals: JPEG is the canonical lossy compressor. It demonstrates transform coding (DCT), quantization (controlled quality loss), and exploiting human perception (we don’t see high frequencies well). These same principles appear in video (H.264), audio (MP3), and modern image formats (AVIF).
Core challenges you’ll face:
- Color space conversion (RGB → YCbCr, chroma subsampling) → maps to perceptual coding
- DCT computation (8x8 blocks, frequency domain) → maps to transform coding
- Quantization tables (quality factor, DC/AC treatment) → maps to lossy trade-offs
- Zig-zag ordering (low to high frequency) → maps to run-length optimization
- Huffman coding (DC deltas, AC run-length pairs) → maps to entropy coding
Key Concepts:
- DCT Explained: Understanding DCT
- JPEG Standard: “JPEG Still Image Data Compression Standard” - Pennebaker & Mitchell
- Quantization: Stanford JPEG Tutorial
- Implementation Guide: JPEG Explained (Computerphile)
Difficulty: Expert. Time estimate: 3-4 weeks. Prerequisites: Linear algebra (matrices), bit manipulation, Huffman coding.
Real world outcome:
$ ./myjpeg encode photo.ppm photo.jpg --quality 85
Original: 12,000,000 bytes (4000x3000 RGB)
Compressed: 1,234,567 bytes (90% reduction)
PSNR: 38.5 dB (excellent quality)
$ ./myjpeg --visualize-dct photo.ppm
Block (10, 20):
DC: 1024 (average brightness)
AC[0,1]: 156 (horizontal edge)
AC[1,0]: -89 (vertical edge)
AC[7,7]: 2 (high frequency, will be quantized to 0)
$ diff <(djpeg photo.jpg) <(djpeg reference.jpg)
# Visually identical to libjpeg output!
$ ./myjpeg encode photo.ppm photo-q10.jpg --quality 10
Compressed: 123,456 bytes
PSNR: 28.1 dB (visible artifacts, but recognizable)
Implementation Hints: RGB to YCbCr: Y = 0.299R + 0.587G + 0.114B, Cb/Cr similar. Subsample Cb/Cr 2:1 horizontally and vertically (4:2:0). For DCT: use the standard 8x8 transform matrix (or fast algorithm). Quantization: divide DCT coefficients by quantization table, round. Quality factor scales the table. Zig-zag: reorder 64 coefficients from low to high frequency. Encode DC as delta from previous block (Huffman). Encode AC as (run of zeros, value) pairs (Huffman with special tables). Write JFIF markers and headers.
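A naive (unoptimized) sketch of the 8x8 forward DCT and the quantization step; real encoders use a fast factored DCT, but this form matches the textbook definition and is fine for a first milestone.
#include <math.h>
#include <stdint.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

/* Textbook 8x8 forward DCT-II (slow but clear); input samples are level-shifted
   to the range -128..127 before calling this. */
void dct_8x8(const double in[8][8], double out[8][8]) {
    for (int u = 0; u < 8; u++)
        for (int v = 0; v < 8; v++) {
            double sum = 0.0;
            for (int x = 0; x < 8; x++)
                for (int y = 0; y < 8; y++)
                    sum += in[x][y] * cos((2 * x + 1) * u * M_PI / 16.0)
                                    * cos((2 * y + 1) * v * M_PI / 16.0);
            double cu = (u == 0) ? 1.0 / sqrt(2.0) : 1.0;
            double cv = (v == 0) ? 1.0 / sqrt(2.0) : 1.0;
            out[u][v] = 0.25 * cu * cv * sum;
        }
}

/* Quantization: divide each coefficient by its table entry and round. The large
   entries at high frequencies are what actually discard information. */
void quantize_block(const double dct[8][8], const uint8_t qtable[8][8], int16_t out[8][8]) {
    for (int u = 0; u < 8; u++)
        for (int v = 0; v < 8; v++)
            out[u][v] = (int16_t)lround(dct[u][v] / qtable[u][v]);
}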
Learning milestones:
- DCT/IDCT produces correct output → You understand the transform
- Quantization shows quality trade-off → You understand lossy compression
- Output opens in image viewers → You’ve got the format right
- Quality matches libjpeg → You’ve built a real encoder
Project 11: PNG Encoder from Scratch
- File: COMPRESSION_ALGORITHMS_DEEP_DIVE_PROJECTS.md
- Main Programming Language: C
- Alternative Programming Languages: Rust, C++
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 3. The “Service & Support” Model
- Difficulty: Level 3: Advanced
- Knowledge Area: Lossless Image Compression
- Software or Tool: PNG Encoder
- Main Book: “PNG: The Definitive Guide” by Greg Roelofs
What you’ll build: A PNG encoder implementing filtering, DEFLATE compression, and the PNG file format. Unlike JPEG, PNG is lossless—perfect for screenshots, diagrams, and images with text.
Why it teaches compression fundamentals: PNG shows how general-purpose compression (DEFLATE) can be optimized for images using preprocessing (filtering). The filter step exploits image-specific redundancy: pixels near each other are usually similar.
Core challenges you’ll face:
- PNG filtering (None, Sub, Up, Average, Paeth) → maps to prediction-based coding
- Filter selection (per-row optimal filter) → maps to adaptive preprocessing
- Interlacing (Adam7 progressive display) → maps to incremental rendering
- CRC checksums (per-chunk validation) → maps to data integrity
- Color types (grayscale, RGB, palette, alpha) → maps to format flexibility
Key Concepts:
- PNG Specification: PNG RFC
- Filtering Explained: PNG Filtering
- Optimization: OptiPNG Guide
- DEFLATE in PNG: “PNG: The Definitive Guide” Chapter 9 - Roelofs
Difficulty: Advanced. Time estimate: 2-3 weeks. Prerequisites: DEFLATE implementation, image basics.
Real world outcome:
$ ./mypng encode screenshot.ppm screenshot.png
Original: 6,000,000 bytes (2000x1000 RGB)
Compressed: 234,567 bytes (96% reduction)
Filter stats: Sub 40%, Up 35%, Average 20%, Paeth 5%
$ ./mypng --analyze screenshot.png
Chunk breakdown:
IHDR: 13 bytes (header)
IDAT: 234,500 bytes (compressed data)
IEND: 0 bytes (end marker)
$ pngcheck screenshot.png
OK: screenshot.png (2000x1000, 24-bit RGB, non-interlaced)
$ ./mypng encode diagram.ppm diagram.png --interlace
Interlaced output for progressive display
Implementation Hints: For each row, compute all 5 filter outputs: None (raw), Sub (pixel - left), Up (pixel - above), Average ((left + above) / 2), Paeth (nonlinear predictor). Choose filter that minimizes sum of absolute differences (heuristic). Prepend filter byte to each row. DEFLATE-compress the result. Write PNG chunks: IHDR (width, height, bit depth, color type), IDAT (compressed data), IEND (marker). Compute CRC32 for each chunk. For optimization, try different DEFLATE compression levels and filter strategies.
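The Paeth predictor is the only non-obvious filter, so here is a sketch of it and of filtering one row with it (filter type 4). `bpp` is bytes per pixel, and missing neighbors on the first row or column count as zero, as the specification requires; function names are illustrative.
#include <stddef.h>
#include <stdlib.h>

/* Paeth predictor: choose whichever of left (a), above (b), upper-left (c) is
   closest to the linear estimate a + b - c. */
static unsigned char paeth(unsigned char a, unsigned char b, unsigned char c) {
    int p = (int)a + (int)b - (int)c;
    int pa = abs(p - a), pb = abs(p - b), pc = abs(p - c);
    if (pa <= pb && pa <= pc) return a;
    if (pb <= pc) return b;
    return c;
}

/* Apply filter type 4 (Paeth) to one row of `len` bytes; prev is NULL for the first row. */
void filter_paeth_row(const unsigned char *row, const unsigned char *prev,
                      unsigned char *out, size_t len, size_t bpp) {
    for (size_t i = 0; i < len; i++) {
        unsigned char a = (i >= bpp) ? row[i - bpp] : 0;
        unsigned char b = prev ? prev[i] : 0;
        unsigned char c = (prev && i >= bpp) ? prev[i - bpp] : 0;
        out[i] = (unsigned char)(row[i] - paeth(a, b, c));
    }
}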
Learning milestones:
- Basic PNG encodes/decodes → You understand the format
- Filters improve compression ratio → You see prediction value
- Output validates with pngcheck → You’ve got the format right
- Ratio matches libpng → You’ve optimized filter selection
Project 12: MP3 Encoder (Simplified)
- File: COMPRESSION_ALGORITHMS_DEEP_DIVE_PROJECTS.md
- Main Programming Language: C
- Alternative Programming Languages: Rust, C++
- Coolness Level: Level 5: Pure Magic
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 5: Master
- Knowledge Area: Audio Compression / Psychoacoustics
- Software or Tool: Audio Encoder
- Main Book: “Introduction to Digital Audio Coding and Standards” by Bosi & Goldberg
What you’ll build: A simplified MP3-like encoder implementing MDCT, psychoacoustic modeling, quantization, and Huffman coding. You’ll hear the trade-off between quality and file size.
Why it teaches compression fundamentals: Audio compression exploits human hearing limitations: we can’t hear quiet sounds masked by loud ones, and we’re less sensitive to certain frequencies. MP3 pioneered this “perceptual coding” approach now used everywhere.
Core challenges you’ll face:
- MDCT transform (time to frequency domain) → maps to transform coding
- Psychoacoustic model (masking thresholds) → maps to perceptual coding
- Bit allocation (which frequencies need more bits) → maps to resource allocation
- Huffman tables (MP3’s predefined 32 tables) → maps to standardized coding
- Frame structure (headers, side info, main data) → maps to format design
Key Concepts:
- MDCT in MP3: MP3 MDCT Tutorial
- Psychoacoustic Model: Stanford MP3 Psychoacoustics
- MP3 Format: MP3 Technical Spec
- LAME Source: LAME MP3 Encoder
Difficulty: Master. Time estimate: 1-2 months. Prerequisites: Signal processing basics, DCT understanding, Huffman coding.
Real world outcome:
$ ./mymp3 encode song.wav song.mp3 --bitrate 128
Original: 52,920,000 bytes (5 minutes, 44.1kHz 16-bit stereo)
Compressed: 4,800,000 bytes (128 kbps)
Compression ratio: 11:1
$ ./mymp3 --analyze song.wav
Frame 1000:
Masking threshold at 1kHz: -45 dB
Signal at 1kHz: -20 dB (25 dB above threshold, needs bits)
Signal at 15kHz: -48 dB (below threshold, can remove!)
$ mpg123 song.mp3 # Plays in standard MP3 player!
$ ./mymp3 encode song.wav song-32.mp3 --bitrate 32
Audible artifacts (expected at low bitrate)
Implementation Hints: Use 1152 samples per frame (576 per granule). Apply polyphase filter bank (32 subbands) then MDCT (18 coefficients per subband = 576 total). Psychoacoustic model: compute FFT, find masking tones, calculate masking thresholds. Quantize coefficients: lower precision where masked. Select Huffman table per block that minimizes bits. Write frame header (sync, bitrate, sample rate), side info (block types, scale factors), and main data (Huffman-coded coefficients). Start with existing psychoacoustic model code—the math is complex.
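A naive sketch of the MDCT itself, straight from its definition; windowing and the polyphase filter bank are omitted, and MP3 uses N = 18 per subband for long blocks. Real encoders use a fast factorization instead of this double loop.
#include <math.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

/* Naive MDCT: 2N windowed time samples in, N frequency coefficients out. */
void mdct(const double *x, double *X, int N) {
    for (int k = 0; k < N; k++) {
        double sum = 0.0;
        for (int n = 0; n < 2 * N; n++)
            sum += x[n] * cos(M_PI / N * (n + 0.5 + N / 2.0) * (k + 0.5));
        X[k] = sum;
    }
}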
Learning milestones:
- MDCT transforms correctly → You understand the frequency domain
- Output plays in MP3 players → You’ve got the format right
- Quality is acceptable at 128kbps → Your quantization works
- Low bitrate sounds worse (as expected) → You understand the trade-offs
Project 13: Video Codec Fundamentals (H.264-lite)
- File: COMPRESSION_ALGORITHMS_DEEP_DIVE_PROJECTS.md
- Main Programming Language: C
- Alternative Programming Languages: Rust, C++
- Coolness Level: Level 5: Pure Magic
- Business Potential: 5. The “Industry Disruptor”
- Difficulty: Level 5: Master
- Knowledge Area: Video Compression
- Software or Tool: Video Encoder
- Main Book: “H.264 and MPEG-4 Video Compression” by Iain Richardson
What you’ll build: A simplified video encoder implementing I-frames (like JPEG), P-frames (motion-compensated prediction), and basic entropy coding. You’ll see how video achieves 100:1 compression.
Why it teaches compression fundamentals: Video compression extends image compression with temporal prediction: frames are similar to previous frames. This “motion compensation” is why video files are so small despite having 30+ images per second.
Core challenges you’ll face:
- Motion estimation (finding where blocks moved) → maps to search algorithms
- Motion compensation (predicting from previous frame) → maps to prediction-based coding
- Residual coding (encoding the difference) → maps to transform coding (DCT)
- Frame types (I/P/B frames) → maps to temporal dependencies
- Rate control (hitting target bitrate) → maps to resource allocation
Key Concepts:
- H.264 Overview: Wikipedia H.264
- Motion Estimation: “H.264 and MPEG-4 Video Compression” Chapter 6 - Richardson
- GOP Structure: FastPix Codec Guide
- x264 Source: x264 Encoder
Difficulty: Master. Time estimate: 2-3 months. Prerequisites: JPEG encoder, motion vectors concept, significant time investment.
Real world outcome:
$ ./myvideo encode clip.y4m clip.264 --bitrate 2000
Original: 2,799,360,000 bytes (30 sec, 1080p 4:2:0, 30fps)
Compressed: 7,500,000 bytes (2 Mbps)
Compression ratio: 373:1
$ ./myvideo --analyze clip.264
Frame types: I P P P P I P P P...
GOP size: 5
Average motion vector: (2.3, 1.1) pixels
Residual bits: 40% of total (motion saves 60%!)
$ ffplay clip.264 # Plays in ffmpeg!
$ ./myvideo encode clip.y4m clip-lowrate.264 --bitrate 500
Lower quality but still watchable
Implementation Hints: Start with I-frames only (just JPEG). Add P-frames: divide into 16x16 macroblocks, search previous frame for best match (minimize sum of absolute differences), encode motion vector and residual. For motion search: start with full search in ±16 pixels, then optimize with diamond or hexagonal search. Residual: DCT, quantize, entropy code (CAVLC is simpler than CABAC). For B-frames (advanced): search both previous and next frames. Use x264 as reference—it’s well-documented.
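A sketch of full-search motion estimation for one 16x16 macroblock on a luma plane, using sum of absolute differences as the hints suggest; the faster diamond or hexagonal searches replace the two outer loops. Names are illustrative.
#include <limits.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct { int dx, dy; uint32_t sad; } motion_vec;

/* Sum of absolute differences between two 16x16 blocks with row stride `stride`. */
static uint32_t sad_16x16(const uint8_t *cur, const uint8_t *ref, int stride) {
    uint32_t sad = 0;
    for (int y = 0; y < 16; y++)
        for (int x = 0; x < 16; x++)
            sad += (uint32_t)abs(cur[y * stride + x] - ref[y * stride + x]);
    return sad;
}

/* Full search in a +/-16 pixel window of the previous frame for the macroblock
   whose top-left corner is (bx, by) in the current frame. */
motion_vec search_block(const uint8_t *cur_frame, const uint8_t *prev_frame,
                        int width, int height, int bx, int by) {
    const uint8_t *cur = cur_frame + by * width + bx;
    motion_vec best = { 0, 0, UINT32_MAX };
    for (int dy = -16; dy <= 16; dy++)
        for (int dx = -16; dx <= 16; dx++) {
            int rx = bx + dx, ry = by + dy;
            if (rx < 0 || ry < 0 || rx + 16 > width || ry + 16 > height) continue;
            uint32_t s = sad_16x16(cur, prev_frame + ry * width + rx, width);
            if (s < best.sad) best = (motion_vec){ dx, dy, s };
        }
    return best;
}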
Learning milestones:
- I-frames encode correctly → You’ve built on JPEG
- P-frames with motion compensation work → You understand video compression
- Output plays in video players → You’ve got the format right
- Quality is reasonable at target bitrate → Your rate control works
Project 14: Delta Encoding & Diff Algorithm
- File: COMPRESSION_ALGORITHMS_DEEP_DIVE_PROJECTS.md
- Main Programming Language: C
- Alternative Programming Languages: Rust, Python, Go
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 4. The “Open Core” Infrastructure
- Difficulty: Level 2: Intermediate
- Knowledge Area: Differential Compression
- Software or Tool: Delta Compressor
- Main Book: “Managing Gigabytes” by Witten, Moffat, Bell
What you’ll build: A delta encoder that compresses a file by storing only the differences from a reference file. This is how Git stores changes, rsync syncs files, and game patches work.
Why it teaches compression fundamentals: Delta encoding exploits a different kind of redundancy: similarity between related files. A new version of a file is often 99% identical to the old version—delta encoding transmits only the 1%.
Core challenges you’ll face:
- Common subsequence finding (longest match detection) → maps to diff algorithms
- Edit operations (insert, delete, copy) → maps to operation encoding
- Block hashing (rsync’s rolling hash) → maps to incremental hashing
- Reference management (base file dependency) → maps to versioning systems
- Combining with compression (delta + LZ) → maps to layered compression
Key Concepts:
- Rsync Algorithm: rsync Technical Report
- Myers Diff: An O(ND) Difference Algorithm
- VCDIFF Format: RFC 3284
- Git Packfiles: Git Internals - Packfiles
Difficulty: Intermediate. Time estimate: 1-2 weeks. Prerequisites: Hash tables, basic LZ understanding.
Real world outcome:
$ ./delta create old_version.bin new_version.bin patch.delta
Old size: 10,000,000 bytes
New size: 10,050,000 bytes
Delta size: 125,000 bytes (1.2% of new file!)
$ ./delta apply old_version.bin patch.delta restored.bin
Applied 47 copy operations, 12 insert operations
Restored file matches new_version.bin: YES
$ ./delta analyze old_version.bin new_version.bin
Common blocks: 98.7%
Insertions: 45KB at offsets 1234, 56789, ...
Deletions: 5KB at offsets 9999, ...
# Works for game patches:
$ ./delta create game_v1.0.bin game_v1.1.bin update.patch
Patch size: 50MB (vs 5GB full download!)
Implementation Hints: Split the reference file into fixed-size blocks (e.g., 4KB). Compute hash of each block, store in hash table mapping hash → offset. For new file: use rolling hash (rsync’s adler32-like) to find blocks matching reference. Output sequence of: COPY(offset, length) for matches, INSERT(data) for new content. For better results, use variable-length matching like LZ77, but match against reference instead of recent data. VCDIFF is the standard delta format.
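A sketch of a rolling checksum in the spirit of rsync's weak hash; the point is that sliding the window by one byte is O(1), so every alignment of the new file can be checked against the block table. Names and the digest layout are illustrative.
#include <stddef.h>
#include <stdint.h>

/* Two running sums that can be updated in O(1) when the window slides by one byte. */
typedef struct { uint32_t a, b; size_t len; } rolling_sum;

void rs_init(rolling_sum *rs, const uint8_t *buf, size_t len) {
    rs->a = rs->b = 0; rs->len = len;
    for (size_t i = 0; i < len; i++) {
        rs->a += buf[i];
        rs->b += (uint32_t)(len - i) * buf[i];   /* older bytes carry larger weight */
    }
}

/* Slide the window: drop `out` (oldest byte), add `in` (newest byte). */
void rs_roll(rolling_sum *rs, uint8_t out, uint8_t in) {
    rs->a += in - out;
    rs->b += rs->a - (uint32_t)rs->len * out;
}

uint32_t rs_digest(const rolling_sum *rs) {
    return (rs->b << 16) | (rs->a & 0xFFFF);     /* pack both sums into one 32-bit key */
}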
Learning milestones:
- Basic delta works for similar files → You understand the concept
- Rolling hash enables efficient matching → You can handle large files
- Patch size is proportional to differences → Your algorithm is correct
- Patches work for binary files (not just text) → You handle all data types
Project 15: Context-Based Compression (PPM)
- File: COMPRESSION_ALGORITHMS_DEEP_DIVE_PROJECTS.md
- Main Programming Language: C
- Alternative Programming Languages: Rust, C++
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 3. The “Service & Support” Model
- Difficulty: Level 4: Expert
- Knowledge Area: Statistical Modeling
- Software or Tool: Context Compressor
- Main Book: “Text Compression” by Bell, Cleary, Witten
What you’ll build: A Prediction by Partial Matching (PPM) compressor that uses context (previous bytes) to predict the next byte. After “th”, ‘e’ is very likely—PPM exploits this.
Why it teaches compression fundamentals: PPM achieves among the best compression ratios for text. It shows that better statistical modeling = better compression. Modern neural network compressors are essentially sophisticated PPM.
Core challenges you’ll face:
- Context modeling (storing predictions for each context) → maps to trie data structures
- Escape mechanism (what if context hasn’t seen this symbol) → maps to smoothing techniques
- Order blending (combining predictions from different context lengths) → maps to ensemble methods
- Memory management (tries grow large) → maps to bounded-memory algorithms
- Arithmetic coding integration (variable probabilities) → maps to entropy coding
Key Concepts:
- PPM Algorithm: “Text Compression” Chapter 4 - Bell, Cleary, Witten
- Escape Methods: PPM Escape Mechanisms
- PAQ Compressors: Matt Mahoney’s PAQ
- Context Mixing: Context Mixing Explanation
Difficulty: Expert. Time estimate: 3-4 weeks. Prerequisites: Arithmetic coding, trie structures, probability.
Real world outcome:
$ ./ppm compress book.txt book.ppm
Original: 500,000 bytes
Compressed: 145,000 bytes (71% reduction)
Context order: 5
Comparison:
gzip: 192,000 bytes
bzip2: 152,000 bytes
PPM: 145,000 bytes (best!)
$ ./ppm --analyze book.txt
After "the ": 'a' 23%, 'm' 18%, 'b' 12%...
After "tion": ' ' 45%, 's' 32%, 'a' 8%...
Context "the " has 234 different followers
$ ./ppm decompress book.ppm restored.txt
Files match: YES
Implementation Hints: Build a trie where each node represents a context (sequence of bytes). Each node stores counts for each possible next byte. For encoding: look up current context (last N bytes), get probability distribution, encode with arithmetic coder. If symbol not seen in this context, emit escape symbol and try shorter context. PPM-D uses escape method D (count of distinct symbols). For memory limits, use context hashing or pruning. Start with order 4-5; higher orders need more memory but compress better.
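A toy order-2 context model to make the idea concrete. Real PPM keeps a trie over multiple orders with a proper escape mechanism; this sketch uses a flat table, add-one smoothing, and 16-bit counts, and all names are illustrative.
#include <stdint.h>

/* One slot per 2-byte context; the table is 32 MB, which is exactly the memory
   pressure the hints warn about. */
#define CTX_COUNT (1u << 16)

static uint16_t counts[CTX_COUNT][256];

static uint32_t ctx_of(uint8_t c1, uint8_t c2) { return ((uint32_t)c1 << 8) | c2; }

/* Scaled probability that `next` follows the context (c1, c2); feed freq/total
   to the arithmetic coder. */
uint32_t predict(uint8_t c1, uint8_t c2, uint8_t next, uint32_t *total) {
    const uint16_t *t = counts[ctx_of(c1, c2)];
    uint32_t sum = 0;
    for (int i = 0; i < 256; i++) sum += t[i];
    *total = sum + 256;                        /* +1 per symbol: crude escape smoothing */
    return (uint32_t)t[next] + 1;
}

void update_model(uint8_t c1, uint8_t c2, uint8_t next) {
    counts[ctx_of(c1, c2)][next]++;
}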
Learning milestones:
- Basic PPM compresses/decompresses → You understand context modeling
- Escape mechanism handles novel contexts → You’ve solved the zero-probability problem
- Compression beats gzip/bzip2 on text → Your model is good
- Memory-bounded version works → You can handle large files
Project 16: Neural Network Compressor (Simple)
- File: COMPRESSION_ALGORITHMS_DEEP_DIVE_PROJECTS.md
- Main Programming Language: Python
- Alternative Programming Languages: C++ with LibTorch, Rust
- Coolness Level: Level 5: Pure Magic
- Business Potential: 5. The “Industry Disruptor”
- Difficulty: Level 5: Master
- Knowledge Area: Learned Compression
- Software or Tool: Neural Compressor
- Main Book: “Neural Data Compression” (Various Papers)
What you’ll build: A compressor using a simple neural network (LSTM or Transformer) to predict the next byte, combined with arithmetic coding. This is the frontier of compression research.
Why it teaches compression fundamentals: Neural compressors show that compression = prediction. A perfect predictor gives perfect compression. Modern AI compressors like NNCP and DeepZip achieve state-of-the-art ratios by leveraging deep learning’s pattern recognition.
Core challenges you’ll face:
- Probability output (network predicts distribution over next byte) → maps to softmax modeling
- Online learning (model adapts while compressing) → maps to adaptive compression
- Bit-back coding (efficient latent coding) → maps to VAE compression
- Model size overhead (model must ship with data) → maps to break-even point
- Speed (neural networks are slow) → maps to practical trade-offs
Key Concepts:
- CMIX Compressor: CMIX - Best Compression
- NNCP Paper: Neural Network Compression
- Arithmetic Coding with NN: DeepZip
- Large Text Compression Benchmark: Hutter Prize
Difficulty: Master. Time estimate: 2-3 months. Prerequisites: Deep learning basics, arithmetic coding, PPM understanding.
Real world outcome:
$ ./neural_compress train corpus/*.txt --model model.pt
Training neural predictor...
Epoch 100: perplexity 2.34 (entropy 1.23 bits/byte)
$ ./neural_compress compress book.txt book.nc --model model.pt
Original: 500,000 bytes
Compressed: 125,000 bytes + 5,000,000 bytes (model)
Net size: 5,125,000 bytes (worse than gzip!)
$ ./neural_compress compress large_corpus.txt corpus.nc --model model.pt
Original: 1,000,000,000 bytes
Compressed: 98,000,000 bytes + 5,000,000 bytes (model)
Net compression: 10.3% of original (better than everything!)
# For small files, neural compression loses to model overhead
# For large files or pre-shared model, it wins
Implementation Hints: Use a character-level LSTM or small Transformer (1-2 layers). Input: last N bytes (embedded). Output: softmax over 256 possible next bytes. Train on similar data or train online while compressing. For compression: for each byte, get predicted distribution from network, encode with arithmetic coder, update network with actual byte. Key insight: better prediction = lower entropy = smaller output. Start with PyTorch for prototyping. For practical use, the model overhead (~5MB) only pays off for very large files or when model is pre-shared.
Learning milestones:
- Neural network predicts next byte → You’ve built the model
- Arithmetic coding uses predictions → You’ve integrated with compression
- Large files compress better than classical → Neural compression works!
- Online adaptation improves results → You understand adaptive compression
Project 17: Compression Benchmarking Suite
- File: COMPRESSION_ALGORITHMS_DEEP_DIVE_PROJECTS.md
- Main Programming Language: Python + C
- Alternative Programming Languages: Rust, Go
- Coolness Level: Level 2: Practical but Forgettable
- Business Potential: 3. The “Service & Support” Model
- Difficulty: Level 2: Intermediate
- Knowledge Area: Performance Analysis
- Software or Tool: Benchmark Tool
- Main Book: “The Art of Computer Systems Performance Analysis” by Raj Jain
What you’ll build: A benchmarking suite that compares compression algorithms across different data types (text, images, binaries), measuring ratio, speed, memory, and CPU usage.
Why it teaches compression fundamentals: Different algorithms excel on different data. Benchmarking reveals the trade-offs: LZ4 for speed, zstd for balance, LZMA for ratio, PPM for text. Understanding when to use each algorithm is as important as understanding how they work.
Core challenges you’ll face:
- Fair comparison (same settings, warm cache) → maps to experimental methodology
- Data variety (text, binary, images, random) → maps to workload characterization
- Metric selection (ratio vs speed vs memory) → maps to multi-objective optimization
- Statistical significance (variance between runs) → maps to statistical analysis
- Visualization (Pareto frontiers, charts) → maps to data presentation
Key Concepts:
- Compression Benchmarks: Squash Benchmark
- Silesia Corpus: Standard Test Data
- Large Text Benchmark: Matt Mahoney’s Benchmarks
- Performance Analysis: “Systems Performance” by Brendan Gregg
Difficulty: Intermediate. Time estimate: 1 week. Prerequisites: Basic statistics, scripting, system monitoring.
Real world outcome:
$ ./benchmark --corpus silesia/ --algorithms gzip,lz4,zstd,bzip2
Running benchmarks on 12 files, 211 MB total...
Results:
| Algorithm | Ratio | Compress | Decompress | Memory |
|-----------|-------|----------|------------|--------|
| lz4 | 2.1x | 650 MB/s | 2800 MB/s | 16 MB |
| zstd -1 | 2.8x | 450 MB/s | 1200 MB/s | 32 MB |
| zstd -19 | 3.5x | 3 MB/s | 1200 MB/s | 128 MB |
| gzip | 3.2x | 45 MB/s | 280 MB/s | 256 KB |
| bzip2 | 3.8x | 15 MB/s | 55 MB/s | 7 MB |
Best for speed: lz4
Best for ratio: bzip2
Best overall: zstd (Pareto optimal)
$ ./benchmark --visualize results.json
[Generates Pareto frontier chart: ratio vs speed]
Implementation Hints: Use standard test corpora (Silesia, Calgary, Canterbury). For each algorithm, run multiple times, discard first run (warm cache), compute mean and variance. Measure wall-clock time, CPU time, peak memory (via /proc or similar). Compute compression ratio, speed (bytes/second), and memory per byte. Plot Pareto frontier: algorithms on the efficient frontier offer best trade-offs. Include random data (incompressible) to verify no algorithm expands it much.
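A minimal C sketch of the timing discipline described above (untimed warm-up pass, repeated measurements, monotonic clock). The function name and the function-pointer workload are illustrative; a real harness would also record mean and variance rather than just the best pass.
#include <stddef.h>
#include <time.h>

/* Run `work` once untimed to warm the page cache, then time `reps` passes and
   report the best throughput in MB/s. Requires a POSIX clock_gettime. */
double best_mbps(void (*work)(void), size_t bytes, int reps) {
    double best = 0.0;
    work();                                           /* warm-up pass, not timed */
    for (int r = 0; r < reps; r++) {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        work();
        clock_gettime(CLOCK_MONOTONIC, &t1);
        double secs = (double)(t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        double mbps = (double)bytes / secs / 1e6;
        if (mbps > best) best = mbps;
    }
    return best;
}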
Learning milestones:
- Benchmarks run consistently → You understand experimental methodology
- Results match published benchmarks → Your methodology is sound
- You can recommend algorithms for use cases → You understand trade-offs
- Visualizations show clear patterns → You can communicate results
Project 18: Build Your Own Mini-Compression Library
- File: COMPRESSION_ALGORITHMS_DEEP_DIVE_PROJECTS.md
- Main Programming Language: C
- Alternative Programming Languages: Rust, C++
- Coolness Level: Level 5: Pure Magic
- Business Potential: 4. The “Open Core” Infrastructure
- Difficulty: Level 4: Expert
- Knowledge Area: Library Design
- Software or Tool: Compression Library
- Main Book: “The Practice of Programming” by Kernighan & Pike
What you’ll build: A unified compression library with a clean API, supporting multiple algorithms (LZ4, zstd-lite, Huffman), streaming, and configurable compression levels.
Why it teaches compression fundamentals: Building a library forces you to think about API design, memory management, error handling, and performance. You’ll understand why real libraries (zlib, zstd) are designed the way they are.
Core challenges you’ll face:
- API design (simple yet flexible interface) → maps to library design
- Streaming (compress without knowing total size) → maps to incremental processing
- Memory management (caller-provided vs library-managed) → maps to resource ownership
- Error handling (graceful failure, informative errors) → maps to robust code
- Thread safety (concurrent compression) → maps to parallel design
Key Concepts:
- zlib API Design: zlib Manual
- zstd API: zstd Documentation
- Library Design: “The Practice of Programming” Chapter 4 - Kernighan & Pike
- API Design Principles: “The Design of Everyday Things” - Don Norman
Difficulty: Expert. Time estimate: 2-3 weeks. Prerequisites: All previous compression projects, library design experience.
Real world outcome:
#include "minicompress.h"
// Simple one-shot API
mc_compress(src, src_len, dst, &dst_len, MC_ALGO_ZSTD, MC_LEVEL_DEFAULT);
// Streaming API
mc_stream_t *s = mc_stream_new(MC_ALGO_LZ4, MC_COMPRESS);
while (has_more_data) {
    mc_stream_input(s, chunk, chunk_len);
    while (mc_stream_output_available(s)) {
        size_t out_len;
        mc_stream_output(s, buffer, sizeof(buffer), &out_len);
        write(fd, buffer, out_len);
    }
}
mc_stream_finish(s);
mc_stream_free(s);
// Algorithm enumeration
for (mc_algo_t a = mc_algo_first(); a != MC_ALGO_END; a = mc_algo_next(a)) {
    printf("%s: ratio %.1fx, speed %d MB/s\n",
           mc_algo_name(a), mc_algo_typical_ratio(a), mc_algo_typical_speed(a));
}
Implementation Hints:
Design a clean C API with opaque types. Provide both one-shot (compress(src, dst)) and streaming (stream_input(), stream_output()) interfaces. Use function pointers for algorithm dispatch. Include bounds checking, input validation, and meaningful error codes. For streaming, implement internal buffering. For thread safety, avoid global state—keep all context in the stream object. Document memory ownership clearly. Provide both static and dynamic linking options.
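A sketch of the function-pointer dispatch suggested above, reusing the mc_* naming from the example API; the vtable type and backend table are illustrative, not part of any existing library.
#include <stddef.h>

/* One entry per supported algorithm; each backend fills in its own table slot. */
typedef struct {
    const char *name;
    int (*compress)(const void *src, size_t src_len, void *dst, size_t *dst_len, int level);
    int (*decompress)(const void *src, size_t src_len, void *dst, size_t *dst_len);
} mc_algo_vtable;

extern const mc_algo_vtable mc_backends[];   /* indexed by the MC_ALGO_* values */

/* The public one-shot entry point just dispatches through the table. */
int mc_compress(const void *src, size_t src_len, void *dst, size_t *dst_len,
                int algo, int level) {
    return mc_backends[algo].compress(src, src_len, dst, dst_len, level);
}
Keeping all per-call state inside the stream object (and none in globals) is also what makes the thread-safety goal straightforward.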
Learning milestones:
- One-shot API works for all algorithms → You understand abstraction
- Streaming API handles arbitrary input sizes → You understand buffering
- Memory management is correct (no leaks) → You’ve verified with valgrind
- Library is usable by others → You’ve achieved good API design
Project Comparison Table
| # | Project | Difficulty | Time | Depth | Fun Factor | Type |
|---|---|---|---|---|---|---|
| 1 | RLE | Beginner | 1-2 days | ★★☆☆☆ | ★★☆☆☆ | Lossless |
| 2 | Huffman | Intermediate | 1 week | ★★★★☆ | ★★★☆☆ | Lossless |
| 3 | LZ77 | Advanced | 2-3 weeks | ★★★★☆ | ★★★★☆ | Lossless |
| 4 | DEFLATE | Expert | 3-4 weeks | ★★★★★ | ★★★★☆ | Lossless |
| 5 | Arithmetic | Expert | 2-3 weeks | ★★★★★ | ★★★☆☆ | Lossless |
| 6 | ANS | Expert | 2-3 weeks | ★★★★★ | ★★★★☆ | Lossless |
| 7 | BWT | Advanced | 2-3 weeks | ★★★★☆ | ★★★★☆ | Lossless |
| 8 | LZ4-style | Advanced | 1-2 weeks | ★★★☆☆ | ★★★★★ | Lossless |
| 9 | Zstandard | Master | 1-2 months | ★★★★★ | ★★★★★ | Lossless |
| 10 | JPEG | Expert | 3-4 weeks | ★★★★★ | ★★★★★ | Lossy |
| 11 | PNG | Advanced | 2-3 weeks | ★★★★☆ | ★★★★☆ | Lossless |
| 12 | MP3 | Master | 1-2 months | ★★★★★ | ★★★★★ | Lossy |
| 13 | Video Codec | Master | 2-3 months | ★★★★★ | ★★★★★ | Lossy |
| 14 | Delta Encoding | Intermediate | 1-2 weeks | ★★★☆☆ | ★★★★☆ | Lossless |
| 15 | PPM | Expert | 3-4 weeks | ★★★★★ | ★★★★☆ | Lossless |
| 16 | Neural Compress | Master | 2-3 months | ★★★★★ | ★★★★★ | Lossless |
| 17 | Benchmarking | Intermediate | 1 week | ★★☆☆☆ | ★★★☆☆ | Tool |
| 18 | Mini Library | Expert | 2-3 weeks | ★★★★☆ | ★★★☆☆ | Library |
Recommended Learning Path
Path A: Understanding the Fundamentals (4-6 weeks)
Best if you’re new to compression and want solid foundations:
- Project 1: RLE → Understand the simplest case
- Project 2: Huffman → Understand entropy coding
- Project 3: LZ77 → Understand dictionary compression
- Project 4: DEFLATE → See how they combine
Path B: Modern High-Performance Compression (6-8 weeks)
Best if you want to understand current industry standards:
- Project 2: Huffman → Foundation of entropy coding
- Project 6: ANS → Modern entropy coding
- Project 8: LZ4 → Speed-optimized compression
- Project 9: Zstandard → State of the art
Path C: Media Compression Deep Dive (8-12 weeks)
Best if you’re interested in images, audio, or video:
- Project 2: Huffman → Needed for JPEG
- Project 10: JPEG → Image compression fundamentals
- Project 12: MP3 → Audio compression
- Project 13: Video Codec → Video compression
Path D: Research-Oriented Path (10-12 weeks)
Best if you want to push compression boundaries:
- Project 5: Arithmetic Coding → Optimal entropy coding
- Project 7: BWT → Transform-based compression
- Project 15: PPM → Context modeling
- Project 16: Neural Compressor → Frontier of compression
Path E: Complete Mastery (4-6 months)
For those who want to deeply understand all of compression:
- Start with Path A (fundamentals)
- Add Path B (modern algorithms)
- Add Projects 10-11 (image compression)
- Add Projects 15-16 (advanced modeling)
- Finish with Project 18 (library design)
Key Resources Summary
Books
- “Introduction to Data Compression” by Khalid Sayood - The textbook
- “Data Compression: The Complete Reference” by David Salomon - Comprehensive reference
- “Managing Gigabytes” by Witten, Moffat, Bell - Text compression focus
- “JPEG Still Image Data Compression Standard” by Pennebaker & Mitchell - Image compression
Online Resources
- Squash Compression Benchmark - Algorithm comparisons
- Matt Mahoney’s Data Compression Page - Comprehensive resource
- encode.su Forum - Active compression community
- Hutter Prize - Compression competition
Source Code to Study
- zlib - Reference DEFLATE
- zstd - Modern, well-documented
- lz4 - Speed-optimized
- bzip2 - BWT-based
- LAME - MP3 encoder
- x264 - H.264 encoder
RFCs and Standards
- RFC 1951 - DEFLATE
- RFC 8878 - Zstandard
- PNG Specification
- JPEG Standard
Summary
| # | Project | Main Language |
|---|---|---|
| 1 | Run-Length Encoding (RLE) Compressor | C |
| 2 | Huffman Coding Compressor | C |
| 3 | LZ77 Sliding Window Compressor | C |
| 4 | DEFLATE Implementation (LZ77 + Huffman) | C |
| 5 | Arithmetic Coding Compressor | C |
| 6 | ANS (Asymmetric Numeral Systems) Compressor | C |
| 7 | Burrows-Wheeler Transform (BWT) Compressor | C |
| 8 | LZ4/Snappy-style Fast Compressor | C |
| 9 | Zstandard-style Compressor | C |
| 10 | JPEG Encoder from Scratch | C |
| 11 | PNG Encoder from Scratch | C |
| 12 | MP3 Encoder (Simplified) | C |
| 13 | Video Codec Fundamentals (H.264-lite) | C |
| 14 | Delta Encoding & Diff Algorithm | C |
| 15 | Context-Based Compression (PPM) | C |
| 16 | Neural Network Compressor (Simple) | Python |
| 17 | Compression Benchmarking Suite | Python + C |
| 18 | Build Your Own Mini-Compression Library | C |