Compression Algorithms Deep Dive: Project-Based Learning

Understanding How Data Compression Really Works

Goal: Build a complete mental model of compression by implementing the core lossless and lossy families from first principles. You will learn how entropy sets theoretical limits, how transforms change the statistical structure of data, and how real compressors balance ratio, speed, and memory. By the end, you will be able to design a compression pipeline, justify its trade-offs, and validate it with measurable benchmarks.

Compression is everywhere: ZIP files, JPEG images, MP3 audio, Netflix streaming, website loading. Yet most developers treat it as a black box. By building compressors yourself, you’ll understand the mathematical foundations, the engineering trade-offs, and why certain algorithms dominate specific domains.


Why Compression Matters (Beyond Smaller Files)

Compression is fundamentally about bandwidth, latency, and energy:

  • Storage: Smaller files reduce disk usage and improve cache hit rates.
  • Networking: Fewer bytes on the wire means faster transfers and lower egress costs.
  • Performance: Some systems trade CPU for I/O speed (fast compression to reduce disk reads).
  • Reliability: Formats encode structure, checks, and framing for streaming and error detection.
  • Energy: Less I/O can save power on mobile devices and large data centers.

Compression is also an algorithm design masterclass. It is one of the few domains where theory (entropy limits) meets ruthless engineering constraints (bit-level encoders, CPU caches, and streaming requirements).

Think of compression as a negotiation between data structure and distribution:

Data Type        Typical Redundancy           Winning Strategy
---------        -------------------          ------------------------
Text / Logs      Repeated tokens, skew         Huffman + LZ
Images (raw)     Spatial correlation           Transform + Quantize
Video/Audio      Temporal + perceptual         Motion + psychoacoustics
Binary blobs     Mixed patterns                Hybrid (LZ + entropy)

The Big Picture: Compression as a Pipeline

Most real compressors are pipelines with distinct stages. Understanding the pipeline lets you mix and match techniques:

Raw Data
  ├─ Preprocess (delta, RLE, MTF)
  ├─ Transform (BWT, DCT, Wavelets)
  ├─ Modeling (context, dictionaries)
  └─ Entropy Coding (Huffman, Arithmetic, ANS)

Lossless pipelines focus on removing redundancy.
Lossy pipelines focus on removing perceptually irrelevant data.

Another useful mental model is modeling vs coding:

Modeling: "What patterns are in the data?"
Coding:   "How do I assign fewer bits to common patterns?"

Raw Bytes -> Model (probabilities / matches) -> Coder (bitstream)

If compression fails, one of these is broken:

  1. The model does not capture the true structure.
  2. The coder is inefficient or too much metadata is stored.

Prerequisites & Background Knowledge

Essential Prerequisites (Must Have)

  • Solid C or C++ file I/O and bit manipulation
  • Basic probability and logarithms
  • Comfort with arrays, trees, and hash tables

Helpful But Not Required

  • Familiarity with FFT/DCT for lossy codecs
  • Basic DSP concepts (sampling, windows)
  • Some exposure to networking and file formats

Self-Assessment Questions

  • Can you pack bits into a byte stream without losing order?
  • Can you build and traverse a binary tree?
  • Can you explain why Shannon entropy limits compression?

Development Environment Setup

  • C compiler with -Wall -Wextra
  • A hex viewer (e.g., xxd) for debugging bitstreams
  • A plotting tool (optional) for benchmark charts

Core Concept Analysis

To truly understand compression, you need to grasp these fundamental building blocks:

1. Information Theory Foundations

  • Entropy: The theoretical minimum bits needed to represent data (Shannon’s limit)
  • Redundancy: The gap between actual data size and theoretical minimum
  • Symbol probability: How often each byte/character appears

2. Lossless Compression Techniques

  • Statistical coding: Huffman, Arithmetic, ANS (encode based on frequency)
  • Dictionary coding: LZ77, LZ78, LZW (encode based on repetition)
  • Transform coding: BWT, MTF (rearrange data to be more compressible)

3. Lossy Compression Techniques

  • Transform domain: DCT, DFT, Wavelets (move to frequency space)
  • Quantization: Reduce precision strategically
  • Perceptual models: Exploit human sensory limitations

4. The Compression Pipeline

Most real compressors combine multiple techniques:

Input → Transform → Statistical Model → Entropy Coder → Output
        (BWT)       (Context)           (ANS/Huffman)

5. Format Anatomy & Bitstream Hygiene

  • Framing: How a decoder finds block boundaries in a stream
  • Headers: Metadata that enables decoding (tables, window sizes, checksums)
  • Alignment: Bit packing, padding, and byte order considerations
  • Error detection: CRCs and sentinel markers for robustness

6. Speed vs Ratio vs Memory

  • Speed: Fewer branches, cache-friendly tables, SIMD-friendly loops
  • Ratio: Better modeling usually costs more CPU and memory
  • Memory: Dictionary size and context length directly impact RAM

7. What Compresses Well (and Why)

High Redundancy                         Low Redundancy
AAAAAAAAAAAAA....                       3F 98 12 7C ...
RLE / LZ shines                         Entropy coder can't help much

If the data already looks random, your compressor should detect that and avoid expansion.


Compression Vocabulary & Metrics

Key terms you will use constantly:

  • Compression ratio = original_size / compressed_size
  • Bits per symbol (bps) = (compressed_bits / symbols)
  • Throughput = bytes processed per second
  • Window size = how far back a match can reference
  • Block size = chunk boundaries for resets and random access
  • Streaming = ability to decode without full file in memory

Sanity targets for each project:

  • Ratio: At least beats raw storage on data with obvious redundancy
  • Speed: Decompression runs faster than compression (most codecs are asymmetric this way)
  • Correctness: Bit-identical round-trip for lossless codecs

Concept Summary Table

Concept Cluster        What You Need to Internalize
---------------        ----------------------------
Information Theory     Entropy defines the lower bound; redundancy is what you remove.
Entropy Coding         Huffman is optimal per-symbol; arithmetic/ANS approach optimality for any distribution.
Dictionary Coding      LZ-style algorithms exploit repeated sequences with back-references.
Transforms             BWT/MTF rearrange data to make entropy coding more effective.
Lossy Compression      Transform + quantization trades fidelity for size.
Bitstream Engineering  Formats, headers, alignment, and metadata dominate real-world correctness.
Performance Tuning     Branch prediction and cache behavior often dominate throughput.
Streaming              Framing and block design enable progressive decoding and error recovery.

Success Metrics (What “Mastery” Looks Like)

You can consider this track successful when you can:

  1. Explain why a given dataset compresses well or poorly.
  2. Implement a compressor + decompressor pair and prove correctness.
  3. Measure compression ratio, throughput, and memory usage across algorithms.
  4. Design a pipeline that balances ratio vs speed.
  5. Debug bitstreams by inspecting headers and code tables.

How to Use This Guide

  1. Start with Project 1–3 to build the lossless foundation.
  2. Always build a decompressor before trusting your compressor.
  3. Use the benchmark suite (Project 17) after every milestone.
  4. Compare results against theory (entropy bounds, expected ratios).

Quick Start (First 48 Hours)

  1. Implement a minimal RLE encoder/decoder.
  2. Write a histogram tool that prints symbol frequencies.
  3. Pack bits into a buffer and verify with a hex dump.
  4. Compress a text file and explain the ratio using entropy estimates (see the sketch after this list).
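
To make steps 2 and 4 concrete, here is a minimal C sketch (compile with -lm; file handling is bare-bones and the names are illustrative) that builds a byte histogram and prints an order-0 entropy estimate, the lower bound any symbol-by-symbol coder can reach:

/* histogram.c - byte frequencies and order-0 entropy estimate (sketch) */
#include <stdio.h>
#include <math.h>

int main(int argc, char **argv)
{
    if (argc != 2) { fprintf(stderr, "usage: %s FILE\n", argv[0]); return 1; }
    FILE *f = fopen(argv[1], "rb");
    if (!f) { perror("fopen"); return 1; }

    unsigned long long count[256] = {0}, total = 0;
    int c;
    while ((c = fgetc(f)) != EOF) { count[c]++; total++; }
    fclose(f);
    if (total == 0) { printf("empty file\n"); return 0; }

    double entropy = 0.0;                  /* H = -sum p_i * log2(p_i) */
    for (int i = 0; i < 256; i++) {
        if (count[i] == 0) continue;
        double p = (double)count[i] / (double)total;
        entropy -= p * log2(p);
    }
    printf("bytes: %llu\n", total);
    printf("order-0 entropy: %.3f bits/byte\n", entropy);
    printf("ideal symbol-coded size: about %.0f bytes\n", entropy * total / 8.0);
    return 0;
}

If the estimate is close to 8 bits/byte, the data already looks random to a symbol coder and you should expect little or no gain.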

Path A: Lossless Foundations

  1. RLE → Huffman → LZ77 → DEFLATE
  2. Arithmetic → ANS → BWT
  3. Fast compressors (LZ4) → Zstd

Path B: Media Codecs

  1. JPEG → PNG → MP3 → H.264-lite
  2. Then add benchmarking + library integration

Path C: Research/Experimental

  1. Context-based (PPM) → Neural compression
  2. Benchmarking suite + custom mini-library

Deep Dive Reading by Concept

Concept                 Supporting Book
-------                 ---------------
Entropy & probability   Concrete Mathematics Ch. 1-3
Trees & greedy coding   Algorithms, 4th Edition Ch. 5
Data structures for LZ  Algorithms in C Parts 1-3
Bit-level encoding      Computer Systems: A Programmer’s Perspective Ch. 2
Recursion & transforms  The Recursive Book of Recursion Ch. 7-9

Project 1: Run-Length Encoding (RLE) Compressor

  • File: COMPRESSION_ALGORITHMS_DEEP_DIVE_PROJECTS.md
  • Main Programming Language: C
  • Alternative Programming Languages: Python, Rust, Go
  • Coolness Level: Level 2: Practical but Forgettable
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 1: Beginner
  • Knowledge Area: Basic Compression / Encoding
  • Software or Tool: Simple Compressor
  • Main Book: “Introduction to Data Compression” by Khalid Sayood (Chapter 2)

What you’ll build: A compressor that replaces repeated consecutive bytes with a count and the byte value. Input “AAABBCCCC” becomes “3A2B4C”.

Why it teaches compression fundamentals: RLE is the simplest compression algorithm that actually works. It teaches the core insight: compression finds and eliminates redundancy. You’ll immediately see when compression helps (repeated data) and when it hurts (random data expands!).

Core challenges you’ll face:

  • Encoding format design (how to represent count + byte) → maps to format trade-offs
  • Handling non-repeated bytes (1A1B1C is worse than ABC) → maps to expansion problem
  • Count overflow (what if 300 A’s in a row?) → maps to boundary conditions
  • Binary vs text (works differently on different data) → maps to data type awareness
  • Escape sequences (distinguishing counts from data) → maps to encoding ambiguity

Key Concepts:

Difficulty: Beginner | Time estimate: 1-2 days | Prerequisites: Basic C programming, file I/O

Real world outcome:

$ ./rle compress image.bmp image.rle
Original: 1,234,567 bytes
Compressed: 234,567 bytes (81% reduction)

$ ./rle compress random.bin random.rle
Original: 1,000 bytes
Compressed: 1,847 bytes (EXPANSION - random data!)

$ ./rle decompress image.rle restored.bmp
Decompressed successfully. Files match: YES

The Core Question You’re Answering

“When does a simple run-based encoding actually reduce size, and how do you design a format that never becomes ambiguous?”

Concepts You Must Understand First

  • Run boundaries and thresholds (when to emit literals vs runs)
  • Marker/escape byte design to avoid ambiguity
  • Byte-level I/O and binary-safe streams
  • Worst-case expansion behavior

Questions to Guide Your Design

  1. What minimum run length justifies a run token?
  2. How will you encode runs longer than 255?
  3. What happens when the marker byte appears in data?
  4. Will your format be streamable without a header?

Thinking Exercise

Manually RLE-encode: AAAABBBCCDAAAAAA using your proposed format. Does it expand?

The Interview Questions They’ll Ask

  1. Why can RLE expand data, and how do you detect it?
  2. What data types compress well with RLE?
  3. How do you handle long runs in a fixed-width counter?
  4. Why must the encoding be unambiguous?

Hints in Layers

Hint 1: Start With Raw Runs. Emit (count, byte) for runs and test on tiny inputs.

Hint 2: Add a Marker Byte. Only emit runs with a marker; literals pass through.

Hint 3: Escape the Marker. If data contains marker, emit an escape sequence.

Hint 4: Add a “no compression” path. If compressed size grows, store raw data.

Books That Will Help

  • Entropy & limits: Concrete Mathematics (Ch. 1-2)
  • Bit-level I/O: Computer Systems: A Programmer’s Perspective (Ch. 2)
  • Algorithm basics: Algorithms, 4th Edition (Ch. 1)

Common Pitfalls & Debugging

Problem: “Decompressor desyncs after a marker”

  • Why: Marker byte appears in data and is not escaped.
  • Fix: Reserve a marker and escape it in literal streams.
  • Quick test: Compress a file with many 0xFF bytes and verify round-trip.

Problem: “Random data doubles in size”

  • Why: RLE encodes everything as runs.
  • Fix: Use a minimum run threshold and raw-literal path.
  • Quick test: Compress a random buffer and ensure size does not exceed a cap.

Definition of Done

  • Compression/decompression is byte-identical for binary files
  • Marker escape works correctly for all byte values
  • Long runs split without overflow
  • Random input does not expand beyond a defined threshold

Implementation Hints: The naive approach “3A2B4C” fails for text that contains digits. Instead, use a binary format: for runs ≥4, output a special marker byte (e.g., 0xFF), then count, then the byte. For runs <4, output bytes literally. Handle the marker byte appearing in data by escaping it (output 0xFF 0x01 0xFF). For count overflow, output multiple run records. Test on BMP images (large uniform regions) vs compressed files (no runs).
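
A minimal sketch of the encode side of that format (the marker value, run threshold, and helper names are assumptions, not a fixed spec); the matching decoder copies bytes through until it sees the marker, then expands (count, byte):

/* Marker-based RLE encoder sketch. Format assumed here:
   MARKER count byte  means "count copies of byte" (count 1..255).
   Ordinary runs shorter than MIN_RUN are emitted as plain literals;
   the marker byte itself is always emitted as a run record, so it
   never appears bare in the literal stream. */
#include <stdio.h>
#include <stddef.h>

#define MARKER  0xFF
#define MIN_RUN 4

static void emit_run(FILE *out, unsigned char byte, size_t len)
{
    while (len > 0) {                        /* split runs longer than 255 */
        size_t chunk = len > 255 ? 255 : len;
        fputc(MARKER, out);
        fputc((int)chunk, out);
        fputc(byte, out);
        len -= chunk;
    }
}

void rle_encode(const unsigned char *in, size_t n, FILE *out)
{
    size_t i = 0;
    while (i < n) {
        size_t run = 1;
        while (i + run < n && in[i + run] == in[i]) run++;

        if (run >= MIN_RUN || in[i] == MARKER)
            emit_run(out, in[i], run);       /* worth a record, or must escape */
        else
            for (size_t k = 0; k < run; k++) fputc(in[i], out);   /* literals */
        i += run;
    }
}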

Learning milestones:

  1. Basic RLE works on simple input → You understand the core algorithm
  2. Binary files compress/decompress correctly → You handle edge cases
  3. You understand why random data expands → You grasp entropy limits
  4. BMP images compress well, PNGs don’t → You understand data characteristics

Project 2: Huffman Coding Compressor

  • File: COMPRESSION_ALGORITHMS_DEEP_DIVE_PROJECTS.md
  • Main Programming Language: C
  • Alternative Programming Languages: Rust, Python, Go
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: Entropy Coding / Trees
  • Software or Tool: Entropy Compressor
  • Main Book: “Introduction to Data Compression” by Khalid Sayood (Chapter 3)

What you’ll build: A compressor that assigns shorter bit sequences to more frequent bytes. If ‘e’ appears 1000 times and ‘z’ appears once, ‘e’ might be encoded as “10” (2 bits) and ‘z’ as “1101110” (7 bits).

Why it teaches compression fundamentals: Huffman coding is the foundation of entropy coding. It’s optimal for symbol-by-symbol coding and appears inside DEFLATE (ZIP), JPEG, MP3, and more. Understanding Huffman means understanding why some data compresses better than others.

Core challenges you’ll face:

  • Frequency counting (scanning input to build histogram) → maps to statistical modeling
  • Tree construction (building optimal prefix-free tree) → maps to greedy algorithms
  • Canonical codes (standardizing tree representation) → maps to format efficiency
  • Bit manipulation (packing variable-length codes into bytes) → maps to low-level coding
  • Header storage (transmitting the tree to decoder) → maps to metadata overhead

Key Concepts:

  • Huffman Algorithm: “Algorithms” by Sedgewick & Wayne - Chapter 5.5
  • Prefix-Free Codes: “Introduction to Data Compression” Chapter 3 - Khalid Sayood
  • Canonical Huffman: Managing Gigabytes
  • Information Entropy: “Elements of Information Theory” Chapter 2 - Cover & Thomas

Difficulty: Intermediate | Time estimate: 1 week | Prerequisites: Binary trees, bit manipulation, file I/O

Real world outcome:

$ ./huffman compress book.txt book.huf
Original: 500,000 bytes
Compressed: 285,000 bytes (43% reduction)
Entropy: 4.56 bits/byte (theoretical limit: 285,000 bytes)

$ ./huffman analyze book.txt
Symbol frequencies:
  ' ' (space): 98,234 (19.6%) → code: 00 (2 bits)
  'e': 62,145 (12.4%) → code: 010 (3 bits)
  't': 45,231 (9.0%) → code: 0110 (4 bits)
  ...
  'z': 234 (0.05%) → code: 11111110 (8 bits)

$ ./huffman decompress book.huf restored.txt
Decompressed successfully. Files match: YES

The Core Question You’re Answering

“How do you assign variable-length codes so frequent symbols cost fewer bits without losing decodability?”

Concepts You Must Understand First

  • Prefix-free codes and instantaneous decoding
  • Frequency histograms and entropy estimates
  • Canonical Huffman codes (compact headers)
  • Bit packing and byte alignment

Questions to Guide Your Design

  1. How will you handle the single-symbol edge case?
  2. Will you store a full tree or canonical code lengths?
  3. How do you pack variable-length codes into bytes?
  4. What happens if your input has only a few symbols?

Thinking Exercise

Given frequencies: A=5, B=2, C=1, D=1. Build a Huffman tree and list the codes.

The Interview Questions They’ll Ask

  1. Why are Huffman codes prefix-free?
  2. What makes canonical Huffman better for file formats?
  3. How close can Huffman get to entropy?
  4. What is the complexity of building the Huffman tree?

Hints in Layers

Hint 1: Build the Tree First. Use a priority queue and merge the two lowest nodes.

Hint 2: Generate Codes by Traversal. Left=0, right=1. Store lengths and codes.

Hint 3: Implement Canonical Codes. Sort by length, then symbol value, assign sequentially.

Hint 4: Add Bit-Packing. Use a bit buffer and flush when you have 8+ bits.

Books That Will Help

  • Huffman coding: Algorithms, 4th Edition (Ch. 5)
  • Greedy algorithms: Algorithms in C (Part 3)
  • Bit manipulation: Computer Systems: A Programmer’s Perspective (Ch. 2)

Common Pitfalls & Debugging

Problem: “Decoding fails on a one-symbol file”

  • Why: Tree has one node and no path bits.
  • Fix: Emit a 1-bit code for the single symbol.
  • Quick test: Compress a file of 1000 identical bytes.

Problem: “Header too large”

  • Why: Storing full tree structure.
  • Fix: Use canonical code lengths only.
  • Quick test: Compare header size for 256 symbols.

Definition of Done

  • Encodes/decodes binary files losslessly
  • Canonical codes implemented and header minimized
  • Compression ratio approaches entropy on skewed data
  • Single-symbol and empty-file edge cases work

Implementation Hints: First pass: count byte frequencies. Build tree: create leaf nodes for each byte with its frequency, then repeatedly merge the two lowest-frequency nodes into a parent. Assign codes: traverse tree, appending 0 for left and 1 for right. For canonical Huffman: sort codes by length, then alphabetically within length, and assign sequential codes. Store only code lengths in header (much smaller). For bit packing, accumulate bits in a buffer and flush full bytes.
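
The canonical-code step is short enough to sketch directly. This follows the usual length-to-code procedure (essentially the one RFC 1951 also describes), so a decoder can rebuild identical codes from the stored lengths alone; array sizes here are assumptions:

/* Assign canonical Huffman codes from code lengths only. Codes of the same
   length are consecutive integers, assigned in symbol order. */
#include <stdint.h>

#define MAX_SYMBOLS 256
#define MAX_BITS    15

void canonical_codes(const uint8_t len[MAX_SYMBOLS], uint32_t code[MAX_SYMBOLS])
{
    uint32_t bl_count[MAX_BITS + 1] = {0};   /* how many codes of each length */
    uint32_t next_code[MAX_BITS + 1] = {0};  /* first code of each length */

    for (int s = 0; s < MAX_SYMBOLS; s++)
        if (len[s]) bl_count[len[s]]++;

    uint32_t c = 0;
    for (int bits = 1; bits <= MAX_BITS; bits++) {
        c = (c + bl_count[bits - 1]) << 1;
        next_code[bits] = c;
    }

    for (int s = 0; s < MAX_SYMBOLS; s++)    /* assign in symbol order */
        if (len[s]) code[s] = next_code[len[s]]++;
}

Because both encoder and decoder run this same routine, the header only needs one code length per symbol (or less, once you compress the lengths too).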

Learning milestones:

  1. Tree builds correctly → You understand the greedy algorithm
  2. Encoding produces correct bit stream → You can do bit manipulation
  3. Compression ratio approaches entropy → You understand optimality
  4. Canonical codes reduce header size → You understand practical optimization

Project 3: LZ77 Sliding Window Compressor

  • File: COMPRESSION_ALGORITHMS_DEEP_DIVE_PROJECTS.md
  • Main Programming Language: C
  • Alternative Programming Languages: Rust, C++, Go
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Dictionary Compression
  • Software or Tool: LZ Compressor
  • Main Book: “Data Compression: The Complete Reference” by David Salomon

What you’ll build: A compressor that replaces repeated sequences with back-references. “ABCDEFABCDEF” becomes “ABCDEF(6,6)” meaning “go back 6 bytes, copy 6 bytes.”

Why it teaches compression fundamentals: LZ77 is the foundation of modern compression (ZIP, gzip, PNG, zstd). Unlike Huffman which exploits single-byte frequencies, LZ77 exploits repeated patterns of any length. This is why text (repeated words) and code (repeated patterns) compress so well.

Core challenges you’ll face:

  • Match finding (finding longest match in window) → maps to search algorithms
  • Hash chains (efficient duplicate detection) → maps to hash table design
  • Lazy matching (sometimes waiting finds better match) → maps to optimization strategies
  • Output format (literal vs match distinction) → maps to encoding design
  • Window size trade-offs (bigger = better compression, more memory) → maps to resource trade-offs

Key Concepts:

Difficulty: Advanced | Time estimate: 2-3 weeks | Prerequisites: Hash tables, string matching, understanding of Huffman

Real world outcome:

$ ./lz77 compress linux-kernel.tar linux-kernel.lz77
Original: 850,000,000 bytes
Compressed: 125,000,000 bytes (85% reduction)
Compression speed: 45 MB/s
Window size: 32 KB

$ ./lz77 --analyze linux-kernel.tar
Match statistics:
  Literal bytes: 15%
  Match references: 85%
  Average match length: 12.3 bytes
  Longest match: 258 bytes

$ ./lz77 decompress linux-kernel.lz77 restored.tar
Decompression speed: 350 MB/s (asymmetric!)
Files match: YES

The Core Question You’re Answering

“How do you exploit repeated substrings efficiently without slowing compression to a crawl?”

Concepts You Must Understand First

  • Sliding window and back-reference semantics
  • Minimum match length and literals
  • Hash table / hash chain match finding
  • Overlapping match copy during decode

Questions to Guide Your Design

  1. What window size balances ratio and speed?
  2. What minimum match length makes sense for your format?
  3. How will you encode literals vs matches?
  4. How will you prevent matches from crossing block boundaries?

Thinking Exercise

Encode “ABCABCABC” as LZ77 triples. How many bytes do you save?

The Interview Questions They’ll Ask

  1. Why is LZ77 asymmetric (decode faster than encode)?
  2. How do you handle overlapping matches on decode?
  3. What does “lazy matching” improve?
  4. How does window size affect compression ratio?

Hints in Layers

Hint 1: Implement a Naive Matcher. Start with a small window and brute-force matches.

Hint 2: Add Hashing. Hash 3- or 4-byte sequences for quick candidate matches.

Hint 3: Implement Lazy Matching. Look ahead one position for a better match.

Hint 4: Add LZSS Flags. Use a flag bit to distinguish literals from matches.

Books That Will Help

  • String matching: Algorithms in C (Part 5)
  • Hash tables: Algorithms, 4th Edition (Ch. 3.4)
  • Bit streams: Computer Systems: A Programmer’s Perspective (Ch. 2)

Common Pitfalls & Debugging

Problem: “Decoded output is corrupted”

  • Why: Overlapping copy not handled correctly.
  • Fix: Copy byte-by-byte when match overlaps output.
  • Quick test: Compress sequences like “AAAAAA”.

Problem: “Compression is too slow”

  • Why: Naive match search is O(n^2).
  • Fix: Use hash chains and limit search depth.
  • Quick test: Benchmark with and without hash chains.

Definition of Done

  • Compress/decompress round-trip correct
  • Hash-based matching yields significant speedup
  • Lazy matching improves ratio without breaking decode
  • Window size is configurable and documented

Implementation Hints: Use a hash table keyed on 3-byte sequences. For each position, compute hash of next 3 bytes and check for matches at stored positions. Chain collisions (multiple positions with same hash). For each potential match, extend forward to find longest match. Emit literals for unmatched bytes, emit (distance, length) for matches. Minimum match length is typically 3 (shorter matches aren’t worth the overhead). LZSS improvement: use a flag bit to distinguish literals from matches.
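
A sketch of the hash-chain match finder described above. Table sizes, the multiplicative hash, and the chain-depth limit are illustrative choices; head[] and prev[] start filled with -1, and the caller guarantees at least 3 bytes of lookahead at pos:

#include <stdint.h>
#include <stddef.h>

#define HASH_BITS   15
#define HASH_SIZE   (1u << HASH_BITS)
#define WINDOW_SIZE 32768
#define MAX_CHAIN   64                      /* cap search depth for speed */

static uint32_t hash3(const unsigned char *p)
{
    uint32_t v = p[0] | (p[1] << 8) | ((uint32_t)p[2] << 16);
    return (v * 2654435761u) >> (32 - HASH_BITS);
}

/* Remember position pos so later positions can find it. */
static void insert_pos(const unsigned char *buf, size_t pos,
                       int32_t head[HASH_SIZE], int32_t prev[WINDOW_SIZE])
{
    uint32_t h = hash3(buf + pos);
    prev[pos % WINDOW_SIZE] = head[h];
    head[h] = (int32_t)pos;
}

/* Longest match at pos; *dist receives its back-reference distance. */
static size_t find_match(const unsigned char *buf, size_t pos, size_t n,
                         int32_t head[HASH_SIZE], int32_t prev[WINDOW_SIZE],
                         size_t *dist)
{
    size_t best = 0;
    int32_t cand = head[hash3(buf + pos)];
    int chain = MAX_CHAIN;

    while (cand >= 0 && pos - (size_t)cand <= WINDOW_SIZE && chain-- > 0) {
        size_t len = 0;
        while (pos + len < n && buf[(size_t)cand + len] == buf[pos + len]) len++;
        if (len > best) { best = len; *dist = pos - (size_t)cand; }
        cand = prev[(size_t)cand % WINDOW_SIZE];  /* follow the collision chain */
    }
    return best;                                  /* emit a match only if >= 3 */
}

The encoder loop calls find_match, emits either a literal or a (distance, length) pair, then calls insert_pos for every position it consumes.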

Learning milestones:

  1. Basic LZ77 compresses/decompresses → You understand the core algorithm
  2. Hash chains speed up matching → You understand practical implementation
  3. Lazy matching improves ratio → You understand optimization trade-offs
  4. Decompression is much faster than compression → You understand asymmetry

Project 4: DEFLATE Implementation (LZ77 + Huffman)

  • File: COMPRESSION_ALGORITHMS_DEEP_DIVE_PROJECTS.md
  • Main Programming Language: C
  • Alternative Programming Languages: Rust, C++
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 4. The “Open Core” Infrastructure
  • Difficulty: Level 4: Expert
  • Knowledge Area: Compression Pipeline
  • Software or Tool: DEFLATE Compressor
  • Main Book: “RFC 1951: DEFLATE Compressed Data Format Specification”

What you’ll build: A compressor combining LZ77 and Huffman coding—the algorithm behind ZIP, gzip, PNG, and HTTP compression. LZ77 finds matches, then Huffman encodes the literals, lengths, and distances.

Why it teaches compression fundamentals: DEFLATE shows how compression algorithms compose. LZ77 removes redundancy, then Huffman compresses the residual entropy. This two-stage approach is the template for most modern compressors.

Core challenges you’ll face:

  • Block structure (stored, static Huffman, dynamic Huffman) → maps to adaptive algorithms
  • Dynamic tree building (per-block optimal trees) → maps to local optimization
  • Length/distance alphabets (special symbols for back-references) → maps to extended coding
  • Bit-level alignment (variable-length codes packed into bytes) → maps to bit stream handling
  • Compatibility (producing gzip/zlib compatible output) → maps to format compliance

Key Concepts:

Difficulty: Expert | Time estimate: 3-4 weeks | Prerequisites: LZ77, Huffman, bit manipulation mastery

Real world outcome:

$ ./mydeflate compress document.pdf document.gz
Original: 1,234,567 bytes
Compressed: 987,654 bytes (20% reduction)
Compatible with: gzip, zlib, 7-zip

$ gzip -d document.gz  # Verify with standard tool
$ diff document.pdf document  # Identical!

$ ./mydeflate --stats document.pdf
Block analysis:
  Block 1: Dynamic Huffman, 45KB → 32KB
  Block 2: Dynamic Huffman, 45KB → 31KB
  ...
  Literal symbols: 123,456
  Match references: 45,678
  Average match length: 8.2 bytes

The Core Question You’re Answering

“How do modern compressors combine dictionary and entropy coding to reach strong ratios with practical speed?”

Concepts You Must Understand First

  • DEFLATE block structure and bit ordering
  • Length/distance code tables
  • Dynamic Huffman code encoding
  • Stored vs static vs dynamic blocks

Questions to Guide Your Design

  1. When should you emit stored blocks vs compressed blocks?
  2. How will you build and encode dynamic trees per block?
  3. How do you pack bits in the correct order (LSB-first)?
  4. How will you validate compatibility with gzip/zlib?

Thinking Exercise

Manually encode a short literal-only DEFLATE block and identify the end-of-block symbol.

The Interview Questions They’ll Ask

  1. Why does DEFLATE need three block types?
  2. What are length/distance pairs and why encode them separately?
  3. How do code length codes work?
  4. How would you debug a corrupted DEFLATE stream?

Hints in Layers

Hint 1: Implement Stored Blocks. Get a working format before compression.

Hint 2: Add Static Huffman Blocks. Use fixed trees from the RFC.

Hint 3: Add Dynamic Huffman. Build trees from per-block frequencies.

Hint 4: Validate With gzip. Round-trip using external tools.

Books That Will Help

  • Huffman coding: Algorithms, 4th Edition (Ch. 5)
  • Bit-level formats: Computer Systems: A Programmer’s Perspective (Ch. 2)
  • Greedy trees: Algorithms in C (Part 3)

Common Pitfalls & Debugging

Problem: “gzip refuses to decode”

  • Why: Bit ordering or block header is wrong.
  • Fix: Verify LSB-first bit packing and block header bits.
  • Quick test: Encode a stored-only file and compare with zlib.

Problem: “Dynamic trees decode incorrectly”

  • Why: Code length codes not encoded in correct order.
  • Fix: Follow RFC 1951 ordering exactly.
  • Quick test: Dump code lengths and compare with a reference encoder.

Definition of Done

  • Stored, static, and dynamic blocks decode correctly
  • Output is gzip/zlib compatible
  • Length/distance coding matches RFC tables
  • Compression beats Huffman-only on repetitive data

Implementation Hints: DEFLATE has three block types: stored (no compression), static Huffman (predefined trees), dynamic Huffman (per-block trees). For dynamic blocks: after LZ77, count frequencies of literals (0-255), end-of-block (256), and length codes (257-285). Build Huffman tree. Similarly for distance codes (0-29). Encode the trees themselves using another layer of Huffman (code length codes). The RFC specifies exact bit orderings—follow it precisely. Use zlib’s output as reference to verify correctness.
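
Before adding any compression, it pays to get LSB-first bit packing and stored blocks byte-exact, since every later block type builds on them. A sketch under the RFC 1951 rules (the BitWriter type and names are assumptions; the gzip/zlib container is not shown):

#include <stdio.h>
#include <stdint.h>

typedef struct { FILE *out; uint32_t acc; int nbits; } BitWriter;

/* value must fit in count bits; DEFLATE packs bits LSB-first within bytes */
static void put_bits(BitWriter *bw, uint32_t value, int count)
{
    bw->acc |= value << bw->nbits;
    bw->nbits += count;
    while (bw->nbits >= 8) {
        fputc(bw->acc & 0xFF, bw->out);
        bw->acc >>= 8;
        bw->nbits -= 8;
    }
}

static void align_to_byte(BitWriter *bw)
{
    if (bw->nbits > 0) { fputc(bw->acc & 0xFF, bw->out); bw->acc = 0; bw->nbits = 0; }
}

/* One stored block: BFINAL, BTYPE=00, pad to byte, LEN, NLEN, raw bytes. */
static void stored_block(BitWriter *bw, const unsigned char *data,
                         uint16_t len, int is_final)
{
    put_bits(bw, is_final ? 1u : 0u, 1);   /* BFINAL */
    put_bits(bw, 0, 2);                    /* BTYPE = 00 (stored) */
    align_to_byte(bw);                     /* stored data starts byte-aligned */
    uint16_t nlen = (uint16_t)~len;        /* NLEN = one's complement of LEN */
    fputc(len & 0xFF, bw->out);            /* LEN, little-endian */
    fputc((len >> 8) & 0xFF, bw->out);
    fputc(nlen & 0xFF, bw->out);
    fputc((nlen >> 8) & 0xFF, bw->out);
    fwrite(data, 1, len, bw->out);
}

Once a stored-only file round-trips through zlib or gzip, the same put_bits routine carries you into static and dynamic Huffman blocks.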

Learning milestones:

  1. Stored blocks work → You understand the format structure
  2. Static Huffman blocks compress → You’ve integrated LZ77 + Huffman
  3. Dynamic Huffman improves compression → You understand adaptive coding
  4. Output is gzip-compatible → You’ve achieved format compliance

Project 5: Arithmetic Coding Compressor

  • File: COMPRESSION_ALGORITHMS_DEEP_DIVE_PROJECTS.md
  • Main Programming Language: C
  • Alternative Programming Languages: Rust, Python, C++
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 4: Expert
  • Knowledge Area: Entropy Coding / Mathematics
  • Software or Tool: Arithmetic Coder
  • Main Book: “Introduction to Data Compression” by Khalid Sayood (Chapter 4)

What you’ll build: An encoder that represents the entire message as a single fraction between 0 and 1. Unlike Huffman (integer bits per symbol), arithmetic coding can use 0.3 bits for a very common symbol.

Why it teaches compression fundamentals: Arithmetic coding achieves near-optimal compression, reaching the theoretical entropy limit. It handles symbol probabilities that Huffman can’t efficiently encode. Understanding arithmetic coding explains why modern coders (ANS, range coding) exist.

Core challenges you’ll face:

  • Interval subdivision (shrinking range based on probabilities) → maps to mathematical precision
  • Fixed-point arithmetic (avoiding floating-point errors) → maps to numerical stability
  • Renormalization (outputting bits when range narrows) → maps to bit streaming
  • End-of-message (knowing when to stop decoding) → maps to termination handling
  • Underflow prevention (middle-of-range scaling) → maps to edge case handling

Key Concepts:

  • Arithmetic Coding Theory: “Introduction to Data Compression” Chapter 4 - Sayood
  • Practical Implementation: “Data Compression: The Complete Reference” Chapter 4.4 - Salomon
  • Fixed-Point Tricks: Arithmetic Coding Tutorial
  • Comparison to Huffman: DeepRender Entropy Coding

Difficulty: Expert | Time estimate: 2-3 weeks | Prerequisites: Strong math, bit manipulation, probability concepts

Real world outcome:

$ ./arithmetic compress skewed_data.bin skewed.ac
Original: 1,000,000 bytes
Compressed: 125,456 bytes
Huffman would give: 127,890 bytes (1.9% worse)

$ ./arithmetic --verbose compress test.txt test.ac
Symbol 'a' (prob 0.4): interval [0.0, 0.4)
Symbol 'b' (prob 0.3): interval [0.16, 0.28)
Symbol 'c' (prob 0.2): interval [0.244, 0.268)
Symbol 'a' (prob 0.4): interval [0.244, 0.2536)
...
Final interval: [0.24471, 0.24472)
Output: 0x4A3B... (17 bits)

Entropy: 1.846 bits/symbol
Arithmetic coding: 1.848 bits/symbol (optimal!)
Huffman coding: 1.9 bits/symbol

The Core Question You’re Answering

“How can you code data at fractional bits per symbol without floating-point error?”

Concepts You Must Understand First

  • Range subdivision and cumulative probabilities
  • Fixed-point arithmetic for stability
  • Renormalization and underflow handling
  • Model updates (static vs adaptive)

Questions to Guide Your Design

  1. How many bits of precision will your range use?
  2. Will your model be static or adaptive?
  3. How will you terminate decoding reliably?
  4. How will you handle low-probability symbols?

Thinking Exercise

Encode the symbol sequence “ABAC” with probabilities A=0.5, B=0.25, C=0.25 using interval arithmetic.

The Interview Questions They’ll Ask

  1. Why does arithmetic coding beat Huffman for skewed data?
  2. What is renormalization and why is it needed?
  3. How do encoder and decoder stay synchronized?
  4. What is the difference between arithmetic coding and range coding?

Hints in Layers

Hint 1: Start With a Static Model. Use a fixed histogram and verify decode.

Hint 2: Implement Renormalization. Output top bits when high and low converge.

Hint 3: Add Underflow Handling. Track pending bits for the mid-range case.

Hint 4: Add Adaptive Updates. Update counts after each symbol.

Books That Will Help

  • Probability basics: Concrete Mathematics (Ch. 1)
  • Numerical stability: Math for Programmers (Ch. 7)
  • Bit-level coding: Computer Systems: A Programmer’s Perspective (Ch. 2)

Common Pitfalls & Debugging

Problem: “Decoder diverges after a few symbols”

  • Why: Off-by-one in range updates.
  • Fix: Use inclusive/exclusive bounds consistently.
  • Quick test: Encode a tiny message and step through each update.

Problem: “Underflow causes infinite loop”

  • Why: Pending bits not flushed correctly.
  • Fix: Implement E3 scaling with a counter of pending bits.
  • Quick test: Force the mid-range case with skewed probabilities.

Definition of Done

  • Encoder/decoder round-trip on random inputs
  • Compression beats Huffman on skewed data
  • Adaptive and static modes both work
  • No floating-point usage in core loop

Implementation Hints: Use 32-bit integers for low/high range (e.g., 0x00000000 to 0xFFFFFFFF). For each symbol, subdivide current range proportionally to symbol probability. When high and low share the same top bit, output that bit and shift both left. Underflow: when range straddles 0.5 (low < 0.5 < high), track pending bits and scale around 0.5. For decoding, maintain a “code” value read from input, and determine which symbol’s subrange contains it. The tricky part is matching encoder/decoder range updates exactly.
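
A sketch of that interval update and renormalization loop in the classic Witten-Neal-Cleary style. output_bit is an assumed bit writer you supply; a final flush (two more bits plus any pending underflow bits) and the mirrored decoder are still needed:

#include <stdint.h>

#define HALF    0x80000000u
#define QUARTER 0x40000000u

typedef struct {
    uint32_t low, high;        /* current interval, inclusive bounds */
    uint64_t pending;          /* underflow (E3) bits waiting to be emitted */
} ArEncoder;                   /* initialize: low=0, high=0xFFFFFFFF, pending=0 */

extern void output_bit(ArEncoder *e, int bit);   /* assumed bit writer */

static void output_bit_plus_pending(ArEncoder *e, int bit)
{
    output_bit(e, bit);
    while (e->pending > 0) { output_bit(e, !bit); e->pending--; }
}

/* cum[] holds cumulative frequencies: symbol s owns [cum[s], cum[s+1]) of total. */
void encode_symbol(ArEncoder *e, const uint32_t *cum, uint32_t total, int s)
{
    uint64_t range = (uint64_t)e->high - e->low + 1;
    e->high = e->low + (uint32_t)(range * cum[s + 1] / total) - 1;
    e->low  = e->low + (uint32_t)(range * cum[s]     / total);

    for (;;) {
        if (e->high < HALF) {                           /* E1: lower half */
            output_bit_plus_pending(e, 0);
        } else if (e->low >= HALF) {                    /* E2: upper half */
            output_bit_plus_pending(e, 1);
            e->low -= HALF; e->high -= HALF;
        } else if (e->low >= QUARTER && e->high < 3u * QUARTER) {
            e->pending++;                               /* E3: straddles the middle */
            e->low -= QUARTER; e->high -= QUARTER;
        } else {
            break;
        }
        e->low <<= 1;                                   /* zoom in */
        e->high = (e->high << 1) | 1;
    }
}

Keeping total no larger than about 2^16 guarantees every symbol still gets a non-empty subinterval after renormalization.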

Learning milestones:

  1. Basic encoding produces correct output → You understand the algorithm
  2. Decoding recovers original → You’ve matched encoder/decoder precisely
  3. Compression beats Huffman on skewed data → You see the advantage
  4. Fixed-point implementation is stable → You’ve handled precision issues

Project 6: ANS (Asymmetric Numeral Systems) Compressor

  • File: COMPRESSION_ALGORITHMS_DEEP_DIVE_PROJECTS.md
  • Main Programming Language: C
  • Alternative Programming Languages: Rust, C++
  • Coolness Level: Level 5: Pure Magic
  • Business Potential: 5. The “Industry Disruptor”
  • Difficulty: Level 4: Expert
  • Knowledge Area: Modern Entropy Coding
  • Software or Tool: ANS Compressor
  • Main Book: “Asymmetric Numeral Systems” by Jarek Duda (arXiv paper)

What you’ll build: A modern entropy coder that combines arithmetic coding’s compression ratio with Huffman’s speed. ANS is used in Facebook’s zstd, Apple’s LZFSE, and JPEG XL.

Why it teaches compression fundamentals: ANS represents the cutting edge of entropy coding. By building it, you’ll understand why zstd replaced gzip in performance, and why the compression community moved beyond arithmetic coding.

Core challenges you’ll face:

  • State machine encoding (single integer instead of interval) → maps to simplified state
  • Table construction (spreading symbols across state space) → maps to precomputation
  • Streaming (outputting bits as state grows) → maps to bit stream management
  • Reverse order (ANS encodes backward, decodes forward) → maps to stack semantics
  • FSE variant (tabled ANS for speed) → maps to lookup table optimization

Key Concepts:

Difficulty: Expert | Time estimate: 2-3 weeks | Prerequisites: Arithmetic coding understanding, probability, state machines

Real world outcome:

$ ./ans compress book.txt book.ans
Original: 500,000 bytes
Compressed: 284,500 bytes (same as arithmetic coding!)
Speed: 450 MB/s (10x faster than arithmetic coding!)

$ ./ans benchmark book.txt
Algorithm      Size (% of original)   Encode     Decode
Huffman        56.8%                  320 MB/s   380 MB/s
Arithmetic     56.9%                  45 MB/s    48 MB/s
ANS (rANS)     56.9%                  450 MB/s   480 MB/s
ANS (tANS)     56.9%                  520 MB/s   600 MB/s

$ ./ans decompress book.ans restored.txt
Files match: YES

The Core Question You’re Answering

“How can you achieve arithmetic-coding ratios with Huffman-like speed?”

Concepts You Must Understand First

  • ANS state transitions and renormalization
  • Frequency table construction
  • rANS vs tANS differences
  • Reverse-order encoding constraints

Questions to Guide Your Design

  1. How large is your state space and why?
  2. How will you normalize symbol frequencies?
  3. How will you encode in reverse order safely?
  4. Which variant (rANS/tANS) is your target?

Thinking Exercise

Build a tiny ANS table with symbols {A:3, B:1} and simulate two encode steps.

The Interview Questions They’ll Ask

  1. Why does ANS decode forward but encode backward?
  2. What makes tANS faster than rANS?
  3. How does ANS compare to arithmetic coding?
  4. What is renormalization in ANS?

Hints in Layers

Hint 1: Implement rANS First. It is simpler and avoids large tables.

Hint 2: Add Renormalization. Emit low bits when state grows too large.

Hint 3: Build tANS Tables. Precompute state transitions.

Hint 4: Benchmark vs Huffman. Compare both speed and ratio.

Books That Will Help

  • Probability: Concrete Mathematics (Ch. 1)
  • Bit-level encoding: Computer Systems: A Programmer’s Perspective (Ch. 2)
  • Algorithms: Algorithms, 4th Edition (Ch. 1)

Common Pitfalls & Debugging

Problem: “Decoder outputs reversed symbols”

  • Why: Encoding order not reversed.
  • Fix: Buffer symbols or encode backwards.
  • Quick test: Encode a short known sequence and verify decode order.

Problem: “State overflows”

  • Why: Renormalization thresholds too high.
  • Fix: Choose a maximum state and renormalize earlier.
  • Quick test: Log state range during long encoding runs.

Definition of Done

  • rANS encoder/decoder correct
  • tANS table builds and decodes correctly
  • Ratio matches arithmetic coding on test data
  • Encode/decode speeds exceed Huffman

Implementation Hints: For rANS (range ANS): state is a large integer. To encode symbol s with probability p: state' = (state / freq[s]) * total + cumfreq[s] + (state % freq[s]). Renormalize when state exceeds threshold by outputting low bits. Decoding is the inverse. For tANS: precompute a table of state transitions. For each (state, symbol) pair, compute next state and output bits. This avoids division during encode/decode. The tricky insight: encoding must happen in reverse order of intended decode order, so buffer symbols or encode backward.
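
A tiny self-contained rANS sketch of exactly that formula, kept deliberately simple: 64-bit state and no streaming renormalization, so it only suits short messages (the alphabet, frequencies, and message here are made up):

#include <stdio.h>
#include <stdint.h>

#define NSYM  3
#define TOTAL 16
static const uint32_t freq[NSYM] = { 11, 4, 1 };    /* e.g. 'A', 'B', 'C' */
static const uint32_t cum[NSYM]  = { 0, 11, 15 };   /* running sums of freq */

static uint64_t encode_sym(uint64_t x, int s)       /* x' = (x/f)*M + c + x%f */
{
    return (x / freq[s]) * TOTAL + cum[s] + (x % freq[s]);
}

static uint64_t decode_sym(uint64_t x, int *s_out)  /* exact inverse */
{
    uint32_t slot = (uint32_t)(x % TOTAL);
    int s = 0;
    while (s + 1 < NSYM && cum[s + 1] <= slot) s++; /* which symbol owns slot */
    *s_out = s;
    return freq[s] * (x / TOTAL) + slot - cum[s];
}

int main(void)
{
    const int msg[] = { 0, 1, 0, 2, 0, 0, 1, 0 };   /* symbol indices */
    const int n = 8;

    uint64_t x = 1;                                 /* initial state */
    for (int i = n - 1; i >= 0; i--)                /* encode back to front */
        x = encode_sym(x, msg[i]);
    printf("final state: %llu\n", (unsigned long long)x);

    for (int i = 0; i < n; i++) {                   /* decode front to back */
        int s;
        x = decode_sym(x, &s);
        printf("%d ", s);                           /* prints msg in order */
    }
    printf("\nstate returned to %llu (should be 1)\n", (unsigned long long)x);
    return 0;
}

Decoding pops symbols in the original order precisely because encoding ran back to front, and the state returning to its initial value is a handy correctness check; a real coder adds renormalization so the state stays bounded and bits stream out as you go.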

Learning milestones:

  1. rANS encodes/decodes correctly → You understand the core algorithm
  2. tANS table construction works → You understand the optimization
  3. Speed matches theoretical → You’ve implemented efficiently
  4. Compression ratio matches arithmetic coding → You’ve achieved optimality

Project 7: Burrows-Wheeler Transform (BWT) Compressor

  • File: COMPRESSION_ALGORITHMS_DEEP_DIVE_PROJECTS.md
  • Main Programming Language: C
  • Alternative Programming Languages: Rust, C++, Python
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Transform Coding
  • Software or Tool: BWT Compressor
  • Main Book: “Managing Gigabytes” by Witten, Moffat, Bell (Chapter 4)

What you’ll build: A compressor using the Burrows-Wheeler Transform: rearrange input to cluster similar characters, then compress with Move-to-Front + entropy coding. This is how bzip2 works.

Why it teaches compression fundamentals: BWT shows that compression can be improved by transforming data before coding. Unlike LZ77 (looks for exact matches), BWT exploits the statistical structure of text—letters that follow similar contexts cluster together.

Core challenges you’ll face:

  • Suffix sorting (sorting all rotations efficiently) → maps to string algorithms
  • Inverse transform (recovering original from BWT) → maps to reversible transforms
  • Move-to-Front (converting to small integers) → maps to locality exploitation
  • Block sizing (larger blocks = better compression) → maps to resource trade-offs
  • Zero-run coding (RLE for the many zeros after MTF) → maps to secondary coding

Key Concepts:

Difficulty: Advanced | Time estimate: 2-3 weeks | Prerequisites: Sorting algorithms, understanding of suffix arrays

Real world outcome:

$ ./bwt compress enwik8 enwik8.bwt
Original: 100,000,000 bytes
After BWT + MTF: many zeros and small values
After Huffman: 29,008,758 bytes (71% reduction)
Compression time: 45 seconds

$ ./bwt --analyze enwik8
BWT clustering effectiveness:
  Order-0 entropy before: 4.56 bits/byte
  After BWT: 4.56 bits/byte globally (BWT only permutes bytes; the gain shows up as local clustering)
  After BWT + MTF: 2.32 bits/byte

$ bunzip2 -c enwik8.bz2 | diff - restored
# Our restored file matches the original, and the compressed size is close to real bzip2's

The Core Question You’re Answering

“How can a reversible transform make data statistically easier to compress?”

Concepts You Must Understand First

  • Suffix arrays or suffix sorting
  • LF mapping for inverse BWT
  • Move-to-Front and run-length coding
  • Block size trade-offs

Questions to Guide Your Design

  1. Which suffix array algorithm will you use?
  2. How will you store the primary index for inverse transform?
  3. How large should blocks be for good compression?
  4. How will you handle non-text binary data?

Thinking Exercise

Compute the BWT of “banana$” and identify the primary index.

The Interview Questions They’ll Ask

  1. Why does BWT improve compression for text?
  2. What is LF mapping and why does it work?
  3. How does block size affect memory and ratio?
  4. Why combine BWT with MTF and RLE?

Hints in Layers

Hint 1: Implement BWT on Small Strings. Use naive rotation sorting to validate.

Hint 2: Add Inverse BWT. Use LF mapping to reconstruct.

Hint 3: Add MTF + RLE. Most output becomes zeros after MTF.

Hint 4: Replace Naive Sorting. Switch to suffix arrays for speed.

Books That Will Help

  • String algorithms: Algorithms in C (Part 5)
  • Sorting: Algorithms, 4th Edition (Ch. 2)
  • Recursion patterns: The Recursive Book of Recursion (Ch. 8)

Common Pitfalls & Debugging

Problem: “Inverse transform returns garbage”

  • Why: Primary index not stored correctly.
  • Fix: Save the row where the original string appears.
  • Quick test: BWT + inverse on short strings.

Problem: “Compression is too slow”

  • Why: O(n^2) rotation sorting.
  • Fix: Use suffix arrays or suffix sorting library.
  • Quick test: Compare runtime on 1MB vs 10MB.

Definition of Done

  • BWT/inverse round-trip works
  • MTF+RLE reduces entropy
  • Suffix array implementation is efficient
  • Output ratio comparable to bzip2 on text

Implementation Hints: Naive BWT sorts all rotations—O(n² log n). Use suffix arrays for O(n log n) or better. For each position i, the BWT output is input[(suffix_array[i] - 1) mod n]. Store the position of the original string (where suffix_array[i] == 0) for decoding. For inverse BWT: build the “LF mapping” (first column to last column correspondence). Start at the original position and follow the mapping n times. MTF: maintain a list of recent characters, output the index, move that character to front. Combine with run-length coding for zeros and Huffman.
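
A naive forward-BWT sketch for validating small inputs before switching to suffix arrays (this is the O(n² log n) rotation-sorting baseline mentioned above; error handling is omitted):

#include <stdio.h>
#include <stdlib.h>

static const unsigned char *g_buf;    /* shared with the qsort comparator */
static size_t g_n;

static int cmp_rotation(const void *a, const void *b)
{
    size_t i = *(const size_t *)a, j = *(const size_t *)b;
    for (size_t k = 0; k < g_n; k++) {
        unsigned char ci = g_buf[(i + k) % g_n], cj = g_buf[(j + k) % g_n];
        if (ci != cj) return (int)ci - (int)cj;
    }
    return 0;
}

/* out must hold n bytes; returns the primary index (row of the original). */
size_t bwt_naive(const unsigned char *in, size_t n, unsigned char *out)
{
    size_t *rot = malloc(n * sizeof *rot);
    for (size_t i = 0; i < n; i++) rot[i] = i;
    g_buf = in; g_n = n;
    qsort(rot, n, sizeof *rot, cmp_rotation);

    size_t primary = 0;
    for (size_t i = 0; i < n; i++) {
        if (rot[i] == 0) primary = i;                /* row of the original */
        out[i] = in[(rot[i] + n - 1) % n];           /* last column */
    }
    free(rot);
    return primary;
}

int main(void)
{
    const unsigned char in[] = "banana";
    unsigned char out[6];
    size_t p = bwt_naive(in, 6, out);
    printf("BWT: %.6s  primary index: %zu\n", (const char *)out, p);  /* nnbaaa, 3 */
    return 0;
}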

Learning milestones:

  1. BWT transforms and inverts correctly → You understand the algorithm
  2. Suffix array construction is efficient → You’ve implemented practical sorting
  3. MTF reduces entropy significantly → You understand the clustering effect
  4. Full pipeline matches bzip2 ratio → You’ve built a real compressor

Project 8: LZ4/Snappy-style Fast Compressor

  • File: COMPRESSION_ALGORITHMS_DEEP_DIVE_PROJECTS.md
  • Main Programming Language: C
  • Alternative Programming Languages: Rust, C++
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 4. The “Open Core” Infrastructure
  • Difficulty: Level 3: Advanced
  • Knowledge Area: High-Speed Compression
  • Software or Tool: Fast LZ Compressor
  • Main Book: “LZ4 Explained” by Yann Collet (Blog Series)

What you’ll build: A compressor optimized for speed over ratio. LZ4 compresses at 500+ MB/s while still achieving 2:1 compression on typical data. Used in Linux kernel, databases, games.

Why it teaches compression fundamentals: Speed matters. Sometimes you need compression that’s faster than disk I/O. LZ4 shows how to strip LZ77 to its essentials: simple hash lookup, minimal format overhead, branchless decode loops.

Core challenges you’ll face:

  • Single-probe hash table (one lookup, no chains) → maps to speed vs ratio trade-off
  • Minimal format (no Huffman, just variable-length integers) → maps to format simplicity
  • Cache-friendly access (everything fits in L1/L2) → maps to memory hierarchy
  • Branchless decoding (predictable loops) → maps to CPU pipeline optimization
  • Block independence (parallel encode/decode) → maps to parallelization

Key Concepts:

Difficulty: Advanced | Time estimate: 1-2 weeks | Prerequisites: LZ77 understanding, performance optimization, low-level C

Real world outcome:

$ ./fastlz compress linux-kernel.tar kernel.lz4
Original: 850,000,000 bytes
Compressed: 340,000,000 bytes (60% reduction)
Speed: 650 MB/s (limited by memory bandwidth!)

$ ./fastlz decompress kernel.lz4 restored.tar
Speed: 2,800 MB/s (faster than SSD read!)

$ ./fastlz benchmark corpus.tar
Compressor    Size (% of original)   Compress    Decompress
gzip          35%                    45 MB/s     250 MB/s
LZ4           60%                    650 MB/s    2,800 MB/s
LZ4-HC        45%                    35 MB/s     2,800 MB/s
Snappy        58%                    450 MB/s    1,200 MB/s

The Core Question You’re Answering

“How do you design a compressor that is faster than disk I/O while still shrinking data meaningfully?”

Concepts You Must Understand First

  • LZ4 token format and variable-length lengths
  • Hash-based match finding without chains
  • Overlapping match copy in decoder
  • Cache-friendly loops and branch prediction

Questions to Guide Your Design

  1. What hash size balances collisions vs cache use?
  2. How will you encode long literal or match lengths?
  3. How will you ensure decoding is branch-light?
  4. How will you validate block independence?

Thinking Exercise

Encode the string “ABABABAB” using the LZ4 token format. What literals and matches appear?

The Interview Questions They’ll Ask

  1. Why does LZ4 skip entropy coding?
  2. How does LZ4 remain so fast?
  3. What are the trade-offs vs DEFLATE?
  4. How do you design branchless decoders?

Hints in Layers

Hint 1: Implement the Token Format. Get literal/match lengths correct first.

Hint 2: Add a Simple Hash. One lookup, no chains.

Hint 3: Optimize the Decode Loop. Use memcpy and tight loops.

Hint 4: Benchmark Constantly. Measure speed on large buffers.

Books That Will Help

  • Low-level optimization: Computer Systems: A Programmer’s Perspective (Ch. 5-6)
  • Hash tables: Algorithms, 4th Edition (Ch. 3.4)
  • Algorithm analysis: Algorithms, 4th Edition (Ch. 1)

Common Pitfalls & Debugging

Problem: “Decoder crashes on long lengths”

  • Why: Length continuation bytes misparsed.
  • Fix: Implement length extension loops per spec.
  • Quick test: Encode a long literal run (>255 bytes).

Problem: “Compression ratio is terrible”

  • Why: Hash collisions and missed matches.
  • Fix: Improve hash and add short match extension.
  • Quick test: Compare against LZ4 on a text file.

Definition of Done

  • Format encodes/decodes correctly
  • Decode speed > 1 GB/s on large buffers
  • Compression ratio within 10% of LZ4 reference
  • Block independence verified

Implementation Hints: Use a small power-of-two hash table (a few thousand entries, roughly 4-64 KB) so it stays cache-resident. Hash the next 4 bytes and do a single-probe lookup: if it matches, extend; if not, emit a literal and update the table. Format: literal length (4 bits) + match length (4 bits) in one byte, followed by literals, followed by 2-byte match offset. Lengths ≥15 use continuation bytes. For speed: use memcpy for literals (compilers optimize this), unroll the match copy loop, use pointer arithmetic instead of index math. The decode loop should be <20 instructions.
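
The heart of the speed story is the single-probe lookup. A sketch of that step (the hash size, the "position 0 means empty" simplification, and the 64 KB offset cap are illustrative choices; 2654435761 is the usual Fibonacci-hash multiplier):

#include <stdint.h>
#include <string.h>
#include <stddef.h>

#define HASH_LOG  12                        /* 4096 entries: stays in L1 cache */
#define HASH_SIZE (1u << HASH_LOG)

static uint32_t read32(const unsigned char *p)
{
    uint32_t v;
    memcpy(&v, p, 4);                       /* safe unaligned read */
    return v;
}

static uint32_t hash4(uint32_t v)
{
    return (v * 2654435761u) >> (32 - HASH_LOG);
}

/* table[] maps hash -> most recent position with that hash (zero-initialized;
   position 0 is treated as "empty" for simplicity). Caller ensures pos+4 <= n.
   Returns the match length at pos, or 0 to emit a literal. */
static size_t probe(const unsigned char *buf, size_t pos, size_t n,
                    uint32_t table[HASH_SIZE], size_t *dist)
{
    uint32_t h = hash4(read32(buf + pos));
    size_t cand = table[h];
    table[h] = (uint32_t)pos;               /* single slot: newest position wins */

    if (cand == 0 || pos - cand > 65535 ||  /* empty, or beyond a 2-byte offset */
        read32(buf + cand) != read32(buf + pos))
        return 0;

    size_t len = 4;                         /* first 4 bytes already matched */
    while (pos + len < n && buf[cand + len] == buf[pos + len]) len++;
    *dist = pos - cand;
    return len;
}

One probe, one overwrite, no chains: that is the whole ratio-for-speed trade compared with the LZ77 matcher from Project 3.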

Learning milestones:

  1. Basic LZ4 format encodes/decodes → You understand the format
  2. Compression speed exceeds 200 MB/s → You’ve optimized encoding
  3. Decompression speed exceeds 1 GB/s → You’ve optimized decoding
  4. Ratio matches reference LZ4 → You’ve matched the algorithm

Project 9: Zstandard-style Compressor

  • File: COMPRESSION_ALGORITHMS_DEEP_DIVE_PROJECTS.md
  • Main Programming Language: C
  • Alternative Programming Languages: Rust, C++
  • Coolness Level: Level 5: Pure Magic
  • Business Potential: 5. The “Industry Disruptor”
  • Difficulty: Level 5: Master
  • Knowledge Area: State-of-the-Art Compression
  • Software or Tool: Modern Compressor
  • Main Book: “RFC 8878: Zstandard Compression”

What you’ll build: A compressor combining LZ77, Huffman, and FSE (ANS variant) with configurable compression levels. zstd achieves gzip-like ratios at LZ4-like speeds.

Why it teaches compression fundamentals: Zstandard represents the state of the art. It shows how to combine all the techniques: dictionary matching (LZ77), entropy coding (FSE), multiple compression levels, and modern CPU optimization. Understanding zstd means understanding modern compression.

Core challenges you’ll face:

  • Multiple entropy backends (Huffman for literals, FSE for sequences) → maps to hybrid coding
  • Sequence coding (literal length, match length, offset as tuples) → maps to structured encoding
  • Repeat offsets (caching recent match distances) → maps to statistical optimization
  • Compression levels (trading speed for ratio) → maps to configurable algorithms
  • Dictionary compression (training on sample data) → maps to prior knowledge

Key Concepts:

Difficulty: Master | Time estimate: 1-2 months | Prerequisites: All previous projects, deep understanding of entropy coding

Real world outcome:

$ ./myzstd -1 compress linux-kernel.tar kernel.zst
Original: 850,000,000 bytes
Compressed: 150,000,000 bytes (82% reduction)
Speed: 400 MB/s

$ ./myzstd -19 compress linux-kernel.tar kernel-max.zst
Compressed: 95,000,000 bytes (89% reduction)
Speed: 5 MB/s (but still decompresses at 800 MB/s!)

$ ./myzstd --train-dict samples/ -o custom.dict
Trained dictionary on 1000 samples
Dictionary size: 112 KB

$ ./myzstd -D custom.dict compress small_files/*.json output.zst
Compression ratio improved by 40% with dictionary!

The Core Question You’re Answering

“How do you combine LZ77, entropy coding, and dictionaries into a configurable, state-of-the-art compressor?”

Concepts You Must Understand First

  • Sequence coding (literal length, match length, offset)
  • Huffman vs FSE for different symbol streams
  • Repeat offsets and recent match caching
  • Dictionary training and usage

Questions to Guide Your Design

  1. Which parts of the stream use Huffman vs FSE?
  2. How will you implement compression levels cleanly?
  3. How do you integrate dictionaries into the LZ window?
  4. How will you validate ratio vs speed trade-offs?

Thinking Exercise

Given a sequence list of literals and matches, manually encode one tuple using length and offset codes.

The Interview Questions They’ll Ask

  1. Why does zstd outperform gzip on both ratio and speed?
  2. What are repeat offsets and why do they help?
  3. How does dictionary compression work for small files?
  4. Why are multiple compression levels useful?

Hints in Layers

Hint 1: Implement a Minimal Level 1 Mode. Start with fast LZ matching + Huffman.

Hint 2: Add FSE for Sequences. Use FSE for match lengths and offsets.

Hint 3: Add Repeat Offsets. Track the last three offsets.

Hint 4: Add Dictionary Training. Preload the hash table with dictionary data.

Books That Will Help

  • Huffman coding: Algorithms, 4th Edition (Ch. 5)
  • Hash tables: Algorithms, 4th Edition (Ch. 3.4)
  • Bit-level formats: Computer Systems: A Programmer’s Perspective (Ch. 2)

Common Pitfalls & Debugging

Problem: “High compression levels are too slow”

  • Why: Optimal parsing is expensive.
  • Fix: Limit search depth or use heuristics.
  • Quick test: Benchmark each level on the same corpus.

Problem: “Dictionary makes files larger”

  • Why: Dictionary not representative of data.
  • Fix: Train on a matching corpus and cap size.
  • Quick test: Compare ratio with/without dictionary on small files.

Definition of Done

  • Basic compression/decompression works
  • Level 1 vs level 10 show clear trade-offs
  • Dictionary improves small-file ratio
  • Ratio matches or exceeds gzip at similar speed

Implementation Hints: zstd encodes “sequences”: (literal_length, match_length, offset) tuples. Use FSE for the match/literal length codes, and Huffman or FSE for literals. Repeat offsets: track last 3 match offsets; if new offset matches one, use special codes. For compression levels: level 1 uses fast hash matching (like LZ4), high levels use optimal parsing with lazy evaluation. Dictionary: prepopulate the hash table and LZ77 window with trained data. Reference the actual zstd source—it’s well-commented.
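
A simplified sketch of the repeat-offset idea (real zstd repcode rules have extra subtleties, for example around zero literal lengths, so treat this as the concept rather than the spec):

#include <stddef.h>

typedef struct { size_t rep[3]; } RepOffsets;   /* most recent offset first */

/* Returns 0, 1, or 2 when the new match reuses a cached offset (emit a cheap
   repeat code), or -1 for a fresh offset (emit it in full). Updates the cache. */
int match_offset_code(RepOffsets *r, size_t offset)
{
    for (int i = 0; i < 3; i++) {
        if (r->rep[i] == offset) {
            for (int j = i; j > 0; j--)             /* move reused offset to front */
                r->rep[j] = r->rep[j - 1];
            r->rep[0] = offset;
            return i;
        }
    }
    r->rep[2] = r->rep[1];                          /* fresh offset: shift cache */
    r->rep[1] = r->rep[0];
    r->rep[0] = offset;
    return -1;
}

Repeat codes pay off because structured data (JSON, logs, tables) tends to reuse the same few back-reference distances over and over.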

Learning milestones:

  1. Basic compression works → You’ve integrated the components
  2. Ratio matches gzip at 5x speed → You’ve achieved the zstd value proposition
  3. Multiple levels work → You’ve implemented configurable compression
  4. Dictionary training improves small files → You understand the full system

Project 10: JPEG Encoder from Scratch

  • File: COMPRESSION_ALGORITHMS_DEEP_DIVE_PROJECTS.md
  • Main Programming Language: C
  • Alternative Programming Languages: Rust, C++, Python
  • Coolness Level: Level 5: Pure Magic
  • Business Potential: 4. The “Open Core” Infrastructure
  • Difficulty: Level 4: Expert
  • Knowledge Area: Image Compression / DCT
  • Software or Tool: JPEG Encoder
  • Main Book: “JPEG Still Image Data Compression Standard” by Pennebaker & Mitchell

What you’ll build: A JPEG encoder that converts raw RGB images to JPEG files, implementing color space conversion, DCT, quantization, and entropy coding.

Why it teaches compression fundamentals: JPEG is the canonical lossy compressor. It demonstrates transform coding (DCT), quantization (controlled quality loss), and exploiting human perception (we don’t see high frequencies well). These same principles appear in video (H.264), audio (MP3), and modern image formats (AVIF).

Core challenges you’ll face:

  • Color space conversion (RGB → YCbCr, chroma subsampling) → maps to perceptual coding
  • DCT computation (8x8 blocks, frequency domain) → maps to transform coding
  • Quantization tables (quality factor, DC/AC treatment) → maps to lossy trade-offs
  • Zig-zag ordering (low to high frequency) → maps to run-length optimization
  • Huffman coding (DC deltas, AC run-length pairs) → maps to entropy coding

Key Concepts:

Difficulty: Expert | Time estimate: 3-4 weeks | Prerequisites: Linear algebra (matrices), bit manipulation, Huffman coding

Real world outcome:

$ ./myjpeg encode photo.ppm photo.jpg --quality 85
Original: 12,000,000 bytes (4000x3000 RGB)
Compressed: 1,234,567 bytes (90% reduction)
PSNR: 38.5 dB (excellent quality)

$ ./myjpeg --visualize-dct photo.ppm
Block (10, 20):
  DC: 1024 (average brightness)
  AC[0,1]: 156 (horizontal edge)
  AC[1,0]: -89 (vertical edge)
  AC[7,7]: 2 (high frequency, will be quantized to 0)

$ diff <(djpeg photo.jpg) <(djpeg reference.jpg)
# Visually identical to libjpeg output!

$ ./myjpeg encode photo.ppm photo-q10.jpg --quality 10
Compressed: 123,456 bytes
PSNR: 28.1 dB (visible artifacts, but recognizable)

The Core Question You’re Answering

“How does JPEG turn pixels into frequency coefficients, then discard detail humans barely notice?”

Concepts You Must Understand First

  • RGB → YCbCr conversion and chroma subsampling
  • 8x8 DCT and quantization tables
  • Zig-zag ordering and run-length encoding
  • Huffman coding of DC/AC coefficients

Questions to Guide Your Design

  1. How will you implement a fast DCT?
  2. What quality factor mapping will you use?
  3. How will you handle edge padding for non-multiple-of-8 images?
  4. Will you use standard Huffman tables or custom ones?

Thinking Exercise

Take a simple 8x8 block with constant value. What do the DCT coefficients look like?

The Interview Questions They’ll Ask

  1. Why does JPEG use YCbCr instead of RGB?
  2. What is the role of quantization in JPEG?
  3. Why are coefficients zig-zag ordered?
  4. Why does JPEG introduce blocking artifacts?

Hints in Layers

Hint 1: Build the Pipeline Step by Step. RGB → YCbCr → DCT → Quantize → Zigzag → RLE.

Hint 2: Use Standard Tables First. Start with baseline quantization tables.

Hint 3: Validate With a Tiny Image. Encode an 8x8 test image and inspect output.

Hint 4: Add Huffman Coding. Only after the coefficient pipeline is correct.

Books That Will Help

  • Transforms: Computer Graphics from Scratch (Ch. 2)
  • Bit streams: Computer Systems: A Programmer’s Perspective (Ch. 2)
  • Greedy coding: Algorithms, 4th Edition (Ch. 5)

Common Pitfalls & Debugging

Problem: “Colors look wrong”

  • Why: Incorrect YCbCr conversion or subsampling offsets.
  • Fix: Validate conversion with known formulas.
  • Quick test: Encode a single-color image and check output.

Problem: “Decoder can’t read file”

  • Why: Missing or malformed JPEG markers.
  • Fix: Follow the JPEG file structure strictly.
  • Quick test: Open output in a standard viewer.

Definition of Done

  • Encoded JPEG opens in standard viewers
  • Quality factor changes file size predictably
  • DCT/quantization pipeline validated on test blocks
  • Compression ratio improves as expected

Implementation Hints: RGB to YCbCr: Y = 0.299R + 0.587G + 0.114B; Cb and Cr are weighted differences of blue and red from Y, offset by 128. Subsample Cb/Cr 2:1 horizontally and vertically (4:2:0). For the DCT, use the standard 8x8 transform matrix (or a fast factorized algorithm). Quantization: divide each DCT coefficient by the corresponding quantization-table entry and round; the quality factor scales the table. Zig-zag: reorder the 64 coefficients from low to high frequency. Encode DC as a delta from the previous block (Huffman). Encode AC as (run of zeros, value) pairs (Huffman with separate AC tables). Write JFIF markers and headers.
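
A minimal sketch of the color conversion and quantization steps, assuming 8-bit RGB input and a caller-supplied 64-entry quantization table (function names are illustrative, not part of any standard API):

#include <math.h>

/* RGB -> YCbCr with the usual JFIF weighting; inputs and outputs are 0..255. */
static void rgb_to_ycbcr(unsigned char r, unsigned char g, unsigned char b,
                         double *y, double *cb, double *cr) {
    *y  =  0.299    * r + 0.587    * g + 0.114    * b;
    *cb = -0.168736 * r - 0.331264 * g + 0.5      * b + 128.0;
    *cr =  0.5      * r - 0.418688 * g - 0.081312 * b + 128.0;
}

/* Quantize one 8x8 block of DCT coefficients: divide by the table entry and
   round to the nearest integer. Small high-frequency values collapse to 0. */
static void quantize_block(const double dct[64], const int qtable[64], int out[64]) {
    for (int i = 0; i < 64; i++)
        out[i] = (int)lround(dct[i] / qtable[i]);
}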

Learning milestones:

  1. DCT/IDCT produces correct output → You understand the transform
  2. Quantization shows quality trade-off → You understand lossy compression
  3. Output opens in image viewers → You’ve got the format right
  4. Quality matches libjpeg → You’ve built a real encoder

Project 11: PNG Encoder from Scratch

  • File: COMPRESSION_ALGORITHMS_DEEP_DIVE_PROJECTS.md
  • Main Programming Language: C
  • Alternative Programming Languages: Rust, C++
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Lossless Image Compression
  • Software or Tool: PNG Encoder
  • Main Book: “PNG: The Definitive Guide” by Greg Roelofs

What you’ll build: A PNG encoder implementing filtering, DEFLATE compression, and the PNG file format. Unlike JPEG, PNG is lossless—perfect for screenshots, diagrams, and images with text.

Why it teaches compression fundamentals: PNG shows how general-purpose compression (DEFLATE) can be optimized for images using preprocessing (filtering). The filter step exploits image-specific redundancy: pixels near each other are usually similar.

Core challenges you’ll face:

  • PNG filtering (None, Sub, Up, Average, Paeth) → maps to prediction-based coding
  • Filter selection (per-row optimal filter) → maps to adaptive preprocessing
  • Interlacing (Adam7 progressive display) → maps to incremental rendering
  • CRC checksums (per-chunk validation) → maps to data integrity
  • Color types (grayscale, RGB, palette, alpha) → maps to format flexibility

Key Concepts:

Difficulty: Advanced Time estimate: 2-3 weeks Prerequisites: DEFLATE implementation, image basics

Real world outcome:

$ ./mypng encode screenshot.ppm screenshot.png
Original: 8,000,000 bytes (2000x1000 RGB)
Compressed: 234,567 bytes (97% reduction)
Filter stats: Sub 40%, Up 35%, Average 20%, Paeth 5%

$ ./mypng --analyze screenshot.png
Chunk breakdown:
  IHDR: 13 bytes (header)
  IDAT: 234,500 bytes (compressed data)
  IEND: 0 bytes (end marker)

$ pngcheck screenshot.png
OK: screenshot.png (2000x1000, 24-bit RGB, non-interlaced)

$ ./mypng encode diagram.ppm diagram.png --interlace
Interlaced output for progressive display

The Core Question You’re Answering

“How do you build a lossless image format that supports filters, metadata, and reliable integrity checks?”

Concepts You Must Understand First

  • PNG chunk structure and CRCs
  • Scanline filters (None/Sub/Up/Average/Paeth)
  • DEFLATE integration
  • Color types and bit depths

Questions to Guide Your Design

  1. Which filter selection strategy will you use?
  2. How will you handle different color formats?
  3. How will you structure chunks and CRC calculation?
  4. Can you stream-encode without holding the full image?

Thinking Exercise

Apply the Sub filter to a single scanline [10, 13, 15, 20]. What output do you get?

The Interview Questions They’ll Ask

  1. Why does PNG use filters before DEFLATE?
  2. What is the Paeth predictor?
  3. How does PNG ensure file integrity?
  4. Why is PNG lossless but often larger than JPEG?

Hints in Layers

Hint 1: Write PNG Chunks First Build IHDR, IDAT, IEND with CRC.

Hint 2: Add Filters Implement all five and choose the best per line.

Hint 3: Plug in DEFLATE Reuse your Project 4 encoder if possible.

Hint 4: Validate With a Viewer Open output in standard image tools.

Books That Will Help

| Topic | Book | Chapter |
|-------|------|---------|
| Bit-level formats | Computer Systems: A Programmer’s Perspective | Ch. 2 |
| Compression basics | Algorithms, 4th Edition | Ch. 5 |
| Data integrity | Computer Networks | Ch. 3 |

Common Pitfalls & Debugging

Problem: “PNG viewer rejects file”

  • Why: CRC mismatch or chunk length error.
  • Fix: Verify CRC calculation and big-endian lengths.
  • Quick test: Recalculate CRC with a known library.

Problem: “Filters make file bigger”

  • Why: Poor filter selection.
  • Fix: Choose the filter whose output has the smallest sum of absolute values.
  • Quick test: Compare filter outputs on a gradient.

Definition of Done

  • PNG opens in standard viewers
  • Filters implemented and chosen per scanline
  • CRCs validate correctly
  • Output is lossless vs source

Implementation Hints: For each row, compute all 5 filter outputs: None (raw), Sub (pixel - left), Up (pixel - above), Average (floor of (left + above) / 2), Paeth (nonlinear predictor). Choose the filter that minimizes the sum of absolute values of the filtered bytes (the usual heuristic). Prepend the filter byte to each row. DEFLATE-compress the result. Write PNG chunks: IHDR (width, height, bit depth, color type), IDAT (compressed data), IEND (marker). Compute CRC32 for each chunk. For optimization, try different DEFLATE compression levels and filter strategies.
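
A sketch of the Paeth predictor and the minimum-sum filter heuristic, assuming one byte per sample for clarity (real scanlines must step by bytes-per-pixel):

#include <stdlib.h>

/* PNG Paeth predictor: pick whichever of left (a), up (b), upper-left (c)
   is closest to the linear estimate a + b - c. */
static unsigned char paeth(unsigned char a, unsigned char b, unsigned char c) {
    int p = (int)a + (int)b - (int)c;
    int pa = abs(p - a), pb = abs(p - b), pc = abs(p - c);
    if (pa <= pb && pa <= pc) return a;
    if (pb <= pc) return b;
    return c;
}

/* Heuristic filter score: sum of absolute values of the filtered bytes,
   treated as signed differences. Lower scores usually DEFLATE better. */
static unsigned long filter_score(const unsigned char *filtered, size_t len) {
    unsigned long sum = 0;
    for (size_t i = 0; i < len; i++)
        sum += (unsigned long)abs((signed char)filtered[i]);
    return sum;
}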

Learning milestones:

  1. Basic PNG encodes/decodes → You understand the format
  2. Filters improve compression ratio → You see prediction value
  3. Output validates with pngcheck → You’ve got the format right
  4. Ratio matches libpng → You’ve optimized filter selection

Project 12: MP3 Encoder (Simplified)

  • File: COMPRESSION_ALGORITHMS_DEEP_DIVE_PROJECTS.md
  • Main Programming Language: C
  • Alternative Programming Languages: Rust, C++
  • Coolness Level: Level 5: Pure Magic
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 5: Master
  • Knowledge Area: Audio Compression / Psychoacoustics
  • Software or Tool: Audio Encoder
  • Main Book: “Introduction to Digital Audio Coding and Standards” by Bosi & Goldberg

What you’ll build: A simplified MP3-like encoder implementing MDCT, psychoacoustic modeling, quantization, and Huffman coding. You’ll hear the trade-off between quality and file size.

Why it teaches compression fundamentals: Audio compression exploits human hearing limitations: we can’t hear quiet sounds masked by loud ones, and we’re less sensitive to certain frequencies. MP3 pioneered this “perceptual coding” approach now used everywhere.

Core challenges you’ll face:

  • MDCT transform (time to frequency domain) → maps to transform coding
  • Psychoacoustic model (masking thresholds) → maps to perceptual coding
  • Bit allocation (which frequencies need more bits) → maps to resource allocation
  • Huffman tables (MP3’s predefined 32 tables) → maps to standardized coding
  • Frame structure (headers, side info, main data) → maps to format design

Key Concepts:

Difficulty: Master Time estimate: 1-2 months Prerequisites: Signal processing basics, DCT understanding, Huffman coding

Real world outcome:

$ ./mymp3 encode song.wav song.mp3 --bitrate 128
Original: 45,000,000 bytes (5 minutes, 44.1kHz stereo)
Compressed: 4,800,000 bytes (128 kbps)
Compression ratio: 9.4:1

$ ./mymp3 --analyze song.wav
Frame 1000:
  Masking threshold at 1kHz: -45 dB
  Signal at 1kHz: -20 dB (25 dB above threshold, needs bits)
  Signal at 15kHz: -48 dB (below threshold, can remove!)

$ mpg123 song.mp3  # Plays in standard MP3 player!

$ ./mymp3 encode song.wav song-32.mp3 --bitrate 32
Audible artifacts (expected at low bitrate)

The Core Question You’re Answering

“How can you compress audio by removing what humans cannot hear?”

Concepts You Must Understand First

  • Psychoacoustic models and masking
  • MDCT transforms and band splitting
  • Quantization and bit allocation
  • Bit reservoir and frame structure

Questions to Guide Your Design

  1. How will you approximate masking without a full psychoacoustic model?
  2. Which sample rates and bitrates will you support?
  3. How will you encode side information and main data?
  4. How will you validate quality vs size?

Thinking Exercise

Take a 1 kHz sine wave and reason about which frequency bins should dominate after MDCT.

The Interview Questions They’ll Ask

  1. Why does MP3 use a psychoacoustic model?
  2. What is the purpose of the bit reservoir?
  3. How does MDCT differ from DCT?
  4. Why do low bitrates sound “swishy”?

Hints in Layers

Hint 1: Start With a Simplified Model Use fixed band thresholds to decide quantization.

Hint 2: Implement MDCT Correctly Verify transform with known signals.

Hint 3: Add Bit Allocation Allocate more bits to loud bands.

Hint 4: Build a Decoder Round-trip is the only real validation.

Books That Will Help

| Topic | Book | Chapter |
|-------|------|---------|
| Transforms | Computer Graphics from Scratch | Ch. 2 |
| Signal basics | Math for Programmers | Ch. 8 |
| Bit streams | Computer Systems: A Programmer’s Perspective | Ch. 2 |

Common Pitfalls & Debugging

Problem: “Audio sounds metallic”

  • Why: Quantization too aggressive in key bands.
  • Fix: Increase bits for low-frequency bands.
  • Quick test: Listen to pure tones at multiple bitrates.

Problem: “Decoder output clips”

  • Why: Incorrect scaling after inverse transform.
  • Fix: Normalize output and verify against input amplitude.
  • Quick test: Encode a sine wave and check peak values.

Definition of Done

  • Encoded audio decodes correctly
  • Quality improves when bitrate increases
  • Frames and side info parsed correctly
  • Output size aligns with bitrate target

Implementation Hints: Use 1152 samples per frame (576 per granule). Apply polyphase filter bank (32 subbands) then MDCT (18 coefficients per subband = 576 total). Psychoacoustic model: compute FFT, find masking tones, calculate masking thresholds. Quantize coefficients: lower precision where masked. Select Huffman table per block that minimizes bits. Write frame header (sync, bitrate, sample rate), side info (block types, scale factors), and main data (Huffman-coded coefficients). Start with existing psychoacoustic model code—the math is complex.
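
A direct O(N²) MDCT is slow but makes a good reference for validating a fast version. This sketch assumes 2·n windowed samples in and n coefficients out (n = 18 per subband for MP3 long blocks):

#include <math.h>

/* Naive MDCT: 2*n time-domain samples in, n frequency coefficients out.
   Apply the analysis window (e.g., a sine window) to `in` before calling. */
static void mdct(const double *in, double *out, int n) {
    const double pi = 3.14159265358979323846;
    for (int k = 0; k < n; k++) {
        double sum = 0.0;
        for (int m = 0; m < 2 * n; m++)
            sum += in[m] * cos(pi / n * (m + 0.5 + n / 2.0) * (k + 0.5));
        out[k] = sum;
    }
}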

Learning milestones:

  1. MDCT transforms correctly → You understand the frequency domain
  2. Output plays in MP3 players → You’ve got the format right
  3. Quality is acceptable at 128kbps → Your quantization works
  4. Low bitrate sounds worse (as expected) → You understand the trade-offs

Project 13: Video Codec Fundamentals (H.264-lite)

  • File: COMPRESSION_ALGORITHMS_DEEP_DIVE_PROJECTS.md
  • Main Programming Language: C
  • Alternative Programming Languages: Rust, C++
  • Coolness Level: Level 5: Pure Magic
  • Business Potential: 5. The “Industry Disruptor”
  • Difficulty: Level 5: Master
  • Knowledge Area: Video Compression
  • Software or Tool: Video Encoder
  • Main Book: “H.264 and MPEG-4 Video Compression” by Iain Richardson

What you’ll build: A simplified video encoder implementing I-frames (like JPEG), P-frames (motion-compensated prediction), and basic entropy coding. You’ll see how video achieves 100:1 compression.

Why it teaches compression fundamentals: Video compression extends image compression with temporal prediction: frames are similar to previous frames. This “motion compensation” is why video files are so small despite having 30+ images per second.

Core challenges you’ll face:

  • Motion estimation (finding where blocks moved) → maps to search algorithms
  • Motion compensation (predicting from previous frame) → maps to prediction-based coding
  • Residual coding (encoding the difference) → maps to transform coding (DCT)
  • Frame types (I/P/B frames) → maps to temporal dependencies
  • Rate control (hitting target bitrate) → maps to resource allocation

Key Concepts:

Difficulty: Master Time estimate: 2-3 months Prerequisites: JPEG encoder, motion vectors concept, significant time investment

Real world outcome:

$ ./myvideo encode clip.y4m clip.264 --bitrate 2000
Original: 2,799,360,000 bytes (30 sec, 1080p 4:2:0, 30fps)
Compressed: 7,500,000 bytes (2 Mbps)
Compression ratio: 373:1

$ ./myvideo --analyze clip.264
Frame types: I P P P P I P P P...
GOP size: 5
Average motion vector: (2.3, 1.1) pixels
Residual bits: 40% of total (motion saves 60%!)

$ ffplay clip.264  # Plays in ffmpeg!

$ ./myvideo encode clip.y4m clip-lowrate.264 --bitrate 500
Lower quality but still watchable

The Core Question You’re Answering

“How do modern video codecs exploit spatial and temporal redundancy?”

Concepts You Must Understand First

  • Intra prediction within frames
  • Inter prediction (motion estimation/compensation)
  • Macroblocks and block transforms
  • Rate-control and GOP structure

Questions to Guide Your Design

  1. How will you simplify motion estimation to keep it tractable?
  2. What block sizes will you support?
  3. How will you encode motion vectors and residuals?
  4. How will you structure keyframes (I/P frames)?

Thinking Exercise

Given two frames where an object shifts right by 2 pixels, what motion vector would you encode?

The Interview Questions They’ll Ask

  1. Why do video codecs use keyframes?
  2. What is motion compensation and why does it work?
  3. Why are smaller blocks better for motion but slower to encode?
  4. What are typical trade-offs in rate control?

Hints in Layers

Hint 1: Start With Intra-Only Encode each frame as a JPEG-like image.

Hint 2: Add Motion Vectors Track block matches between frames.

Hint 3: Add Residual Encoding Encode only the difference after prediction.

Hint 4: Add GOP Structure Insert I-frames periodically.

Books That Will Help

| Topic | Book | Chapter |
|-------|------|---------|
| Transforms | Computer Graphics from Scratch | Ch. 2 |
| Data structures | Algorithms in C | Part 3 |
| Bit-level formats | Computer Systems: A Programmer’s Perspective | Ch. 2 |

Common Pitfalls & Debugging

Problem: “Motion vectors explode”

  • Why: Search range too large or bad cost function.
  • Fix: Limit range and use SAD/SSD cost.
  • Quick test: Encode a simple translation sequence.

Problem: “Video drifts over time”

  • Why: Prediction error accumulates without keyframes.
  • Fix: Insert I-frames at fixed intervals.
  • Quick test: Compare frame 0 vs frame 100 for drift.

Definition of Done

  • Intra-only mode produces valid frames
  • Motion estimation reduces residual size
  • GOP structure implemented
  • Bitrate scales with configuration

Implementation Hints: Start with I-frames only (just JPEG). Add P-frames: divide into 16x16 macroblocks, search previous frame for best match (minimize sum of absolute differences), encode motion vector and residual. For motion search: start with full search in ±16 pixels, then optimize with diamond or hexagonal search. Residual: DCT, quantize, entropy code (CAVLC is simpler than CABAC). For B-frames (advanced): search both previous and next frames. Use x264 as reference—it’s well-documented.
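
A sketch of full-search motion estimation for one 16x16 macroblock over a luma plane, assuming the candidate block must stay entirely inside the reference frame:

#include <stdlib.h>
#include <limits.h>

/* Sum of absolute differences between the 16x16 block in `cur` at (bx,by)
   and the candidate block in `ref` at (bx+dx, by+dy). `stride` is row pitch. */
static long sad16(const unsigned char *cur, const unsigned char *ref,
                  int stride, int bx, int by, int dx, int dy) {
    long sum = 0;
    for (int y = 0; y < 16; y++)
        for (int x = 0; x < 16; x++)
            sum += abs(cur[(by + y) * stride + bx + x] -
                       ref[(by + dy + y) * stride + bx + dx + x]);
    return sum;
}

/* Full search in a +/- range window; returns the (dx,dy) with minimum SAD. */
static void motion_search(const unsigned char *cur, const unsigned char *ref,
                          int stride, int w, int h, int bx, int by, int range,
                          int *best_dx, int *best_dy) {
    long best = LONG_MAX;
    *best_dx = *best_dy = 0;
    for (int dy = -range; dy <= range; dy++) {
        for (int dx = -range; dx <= range; dx++) {
            if (bx + dx < 0 || by + dy < 0 ||
                bx + dx + 16 > w || by + dy + 16 > h)
                continue;                      /* candidate leaves the frame */
            long s = sad16(cur, ref, stride, bx, by, dx, dy);
            if (s < best) { best = s; *best_dx = dx; *best_dy = dy; }
        }
    }
}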

Learning milestones:

  1. I-frames encode correctly → You’ve built on JPEG
  2. P-frames with motion compensation work → You understand video compression
  3. Output plays in video players → You’ve got the format right
  4. Quality is reasonable at target bitrate → Your rate control works

Project 14: Delta Encoding & Diff Algorithm

  • File: COMPRESSION_ALGORITHMS_DEEP_DIVE_PROJECTS.md
  • Main Programming Language: C
  • Alternative Programming Languages: Rust, Python, Go
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 4. The “Open Core” Infrastructure
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: Differential Compression
  • Software or Tool: Delta Compressor
  • Main Book: “Managing Gigabytes” by Witten, Moffat, Bell

What you’ll build: A delta encoder that compresses a file by storing only the differences from a reference file. This is how Git stores changes, rsync syncs files, and game patches work.

Why it teaches compression fundamentals: Delta encoding exploits a different kind of redundancy: similarity between related files. A new version of a file is often 99% identical to the old version—delta encoding transmits only the 1%.

Core challenges you’ll face:

  • Common subsequence finding (longest match detection) → maps to diff algorithms
  • Edit operations (insert, delete, copy) → maps to operation encoding
  • Block hashing (rsync’s rolling hash) → maps to incremental hashing
  • Reference management (base file dependency) → maps to versioning systems
  • Combining with compression (delta + LZ) → maps to layered compression

Key Concepts:

Difficulty: Intermediate Time estimate: 1-2 weeks Prerequisites: Hash tables, basic LZ understanding

Real world outcome:

$ ./delta create old_version.bin new_version.bin patch.delta
Old size: 10,000,000 bytes
New size: 10,050,000 bytes
Delta size: 125,000 bytes (1.2% of new file!)

$ ./delta apply old_version.bin patch.delta restored.bin
Applied 47 copy operations, 12 insert operations
Restored file matches new_version.bin: YES

$ ./delta analyze old_version.bin new_version.bin
Common blocks: 98.7%
Insertions: 45KB at offsets 1234, 56789, ...
Deletions: 5KB at offsets 9999, ...

# Works for game patches:
$ ./delta create game_v1.0.bin game_v1.1.bin update.patch
Patch size: 50MB (vs 5GB full download!)

The Core Question You’re Answering

“How can you encode differences between sequences so patches are small and reversible?”

Concepts You Must Understand First

  • Delta encoding and rolling hashes
  • Longest common subsequence (LCS)
  • Patch formats and metadata
  • Streaming diff generation

Questions to Guide Your Design

  1. Will you use LCS, rolling hash, or both?
  2. How will you encode insertions/deletions efficiently?
  3. How will you ensure patch application is safe?
  4. How will you handle binary data vs text?

Thinking Exercise

Compute a diff between “banana” and “bandana” and describe the operations.

The Interview Questions They’ll Ask

  1. Why is rolling hash useful for diff?
  2. What is the trade-off between LCS accuracy and speed?
  3. How do binary diffs differ from text diffs?
  4. How do you ensure patch integrity?

Hints in Layers

Hint 1: Start With Simple LCS Make it correct before fast.

Hint 2: Add Rolling Hash Use a fixed block size to find matches.

Hint 3: Design a Patch Format Include offsets and lengths explicitly.

Hint 4: Add Validation Checksum before and after patch.

Books That Will Help

| Topic | Book | Chapter |
|-------|------|---------|
| String algorithms | Algorithms in C | Part 5 |
| Hashing | Algorithms, 4th Edition | Ch. 3.4 |
| Data integrity | Computer Networks | Ch. 3 |

Common Pitfalls & Debugging

Problem: “Patch fails on binary files”

  • Why: Treating binary as text.
  • Fix: Use a byte-based diff and avoid line-ending assumptions.
  • Quick test: Diff two small binary blobs.

Problem: “Patch larger than original”

  • Why: Poor match detection.
  • Fix: Increase block size or add rolling hash.
  • Quick test: Diff similar files and compare sizes.

Definition of Done

  • Patch apply reverses the diff exactly
  • Binary and text diffs both supported
  • Rolling hash improves patch size
  • Patch integrity validated with checksum

Implementation Hints: Split the reference file into fixed-size blocks (e.g., 4KB). Compute hash of each block, store in hash table mapping hash → offset. For new file: use rolling hash (rsync’s adler32-like) to find blocks matching reference. Output sequence of: COPY(offset, length) for matches, INSERT(data) for new content. For better results, use variable-length matching like LZ77, but match against reference instead of recent data. VCDIFF is the standard delta format.
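
A sketch of the rsync-style weak rolling checksum; sliding the window by one byte costs O(1), which is what makes scanning the new file against the reference's block table cheap:

#include <stdint.h>
#include <stddef.h>

/* Weak rolling checksum (Adler-32-like, in the spirit of rsync):
   two 16-bit sums packed into one 32-bit digest. */
typedef struct { uint32_t a, b; } rollsum_t;

static rollsum_t rollsum_init(const unsigned char *buf, size_t len) {
    rollsum_t s = {0, 0};
    for (size_t i = 0; i < len; i++) {
        s.a = (s.a + buf[i]) & 0xffff;
        s.b = (s.b + (uint32_t)(len - i) * buf[i]) & 0xffff;
    }
    return s;
}

/* Slide the window one byte: drop `out`, append `in`; `len` is window size. */
static rollsum_t rollsum_roll(rollsum_t s, unsigned char out, unsigned char in,
                              size_t len) {
    s.a = (s.a - out + in) & 0xffff;
    s.b = (s.b - (uint32_t)len * out + s.a) & 0xffff;
    return s;
}

static uint32_t rollsum_digest(rollsum_t s) { return s.a | (s.b << 16); }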

Learning milestones:

  1. Basic delta works for similar files → You understand the concept
  2. Rolling hash enables efficient matching → You can handle large files
  3. Patch size is proportional to differences → Your algorithm is correct
  4. Patches work for binary files (not just text) → You handle all data types

Project 15: Context-Based Compression (PPM)

  • File: COMPRESSION_ALGORITHMS_DEEP_DIVE_PROJECTS.md
  • Main Programming Language: C
  • Alternative Programming Languages: Rust, C++
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 4: Expert
  • Knowledge Area: Statistical Modeling
  • Software or Tool: Context Compressor
  • Main Book: “Text Compression” by Bell, Cleary, Witten

What you’ll build: A Prediction by Partial Matching (PPM) compressor that uses context (previous bytes) to predict the next byte. After “th”, ‘e’ is very likely—PPM exploits this.

Why it teaches compression fundamentals: PPM achieves among the best compression ratios for text. It shows that better statistical modeling = better compression. Modern neural network compressors are essentially sophisticated PPM.

Core challenges you’ll face:

  • Context modeling (storing predictions for each context) → maps to trie data structures
  • Escape mechanism (what if context hasn’t seen this symbol) → maps to smoothing techniques
  • Order blending (combining predictions from different context lengths) → maps to ensemble methods
  • Memory management (tries grow large) → maps to bounded-memory algorithms
  • Arithmetic coding integration (variable probabilities) → maps to entropy coding

Key Concepts:

Difficulty: Expert Time estimate: 3-4 weeks Prerequisites: Arithmetic coding, trie structures, probability

Real world outcome:

$ ./ppm compress book.txt book.ppm
Original: 500,000 bytes
Compressed: 145,000 bytes (71% reduction)
Context order: 5

Comparison:
  gzip:    192,000 bytes
  bzip2:   152,000 bytes
  PPM:     145,000 bytes (best!)

$ ./ppm --analyze book.txt
After "the ": 'a' 23%, 'm' 18%, 'b' 12%...
After "tion": ' ' 45%, 's' 32%, 'a' 8%...
Context "the " has 234 different followers

$ ./ppm decompress book.ppm restored.txt
Files match: YES

The Core Question You’re Answering

“How can context models predict the next symbol so well that entropy coding approaches the theoretical limit?”

Concepts You Must Understand First

  • Context depth and escape symbols
  • Probability smoothing (PPM variants)
  • Adaptive model updates
  • Memory usage vs accuracy

Questions to Guide Your Design

  1. What max context length will you allow?
  2. Which escape model (A/B/C) will you implement?
  3. How will you prevent memory blowup?
  4. How will you combine with an entropy coder?

Thinking Exercise

Build a context tree for the string “abracadabra” and predict the probability of the next ‘a’.

The Interview Questions They’ll Ask

  1. Why does PPM often beat LZ on text?
  2. What are escape symbols used for?
  3. How do you limit memory in context models?
  4. Why is PPM slow to encode but strong in ratio?

Hints in Layers

Hint 1: Start With PPM0 Use zero-length context only.

Hint 2: Add Context Depth Increase depth one level at a time.

Hint 3: Implement Escapes Add escape symbol for unseen contexts.

Hint 4: Add Pruning Drop rare contexts to save memory.

Books That Will Help

| Topic | Book | Chapter |
|-------|------|---------|
| Probability models | Concrete Mathematics | Ch. 2 |
| Trees | Algorithms in C | Part 5 |
| Bit-level coding | Computer Systems: A Programmer’s Perspective | Ch. 2 |

Common Pitfalls & Debugging

Problem: “Model grows too large”

  • Why: No pruning of rare contexts.
  • Fix: Apply frequency thresholds and limit depth.
  • Quick test: Log node count over input size.

Problem: “Decoder diverges”

  • Why: Model updates differ between encode/decode.
  • Fix: Ensure updates happen in identical order.
  • Quick test: Step through the first 20 symbols.

Definition of Done

  • PPM0 and deeper contexts work
  • Escape handling is correct
  • Compression ratio beats Huffman on text
  • Memory usage stays within bounds

Implementation Hints: Build a trie where each node represents a context (sequence of bytes). Each node stores counts for each possible next byte. For encoding: look up current context (last N bytes), get probability distribution, encode with arithmetic coder. If symbol not seen in this context, emit escape symbol and try shorter context. PPM-D uses escape method D (count of distinct symbols). For memory limits, use context hashing or pruning. Start with order 4-5; higher orders need more memory but compress better.
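
A minimal order-1 sketch of the core idea, assuming the probabilities feed your arithmetic coder; real PPM adds deeper contexts, escape symbols that fall back to shorter contexts, and pruning:

#include <stdint.h>

/* counts[ctx][sym] = how often `sym` followed the single-byte context `ctx`. */
static uint32_t counts[256][256];
static uint32_t totals[256];

/* Estimate P(sym | ctx) with a +1 reserve so unseen symbols never get
   probability zero (a crude stand-in for PPM's escape mechanism). */
static double predict(unsigned char ctx, unsigned char sym) {
    return (counts[ctx][sym] + 1.0) / (totals[ctx] + 256.0);
}

/* Update the model after coding a symbol; encoder and decoder must call
   this in exactly the same order or they diverge. */
static void update(unsigned char ctx, unsigned char sym) {
    counts[ctx][sym]++;
    totals[ctx]++;
}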

Learning milestones:

  1. Basic PPM compresses/decompresses → You understand context modeling
  2. Escape mechanism handles novel contexts → You’ve solved the zero-probability problem
  3. Compression beats gzip/bzip2 on text → Your model is good
  4. Memory-bounded version works → You can handle large files

Project 16: Neural Network Compressor (Simple)

  • File: COMPRESSION_ALGORITHMS_DEEP_DIVE_PROJECTS.md
  • Main Programming Language: Python
  • Alternative Programming Languages: C++ with LibTorch, Rust
  • Coolness Level: Level 5: Pure Magic
  • Business Potential: 5. The “Industry Disruptor”
  • Difficulty: Level 5: Master
  • Knowledge Area: Learned Compression
  • Software or Tool: Neural Compressor
  • Main Book: “Neural Data Compression” (Various Papers)

What you’ll build: A compressor using a simple neural network (LSTM or Transformer) to predict the next byte, combined with arithmetic coding. This is the frontier of compression research.

Why it teaches compression fundamentals: Neural compressors show that compression = prediction. A perfect predictor gives perfect compression. Modern AI compressors like NNCP and DeepZip achieve state-of-the-art ratios by leveraging deep learning’s pattern recognition.
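
The "compression = prediction" claim has a precise form: if the data follows distribution p and the model predicts q, an ideal entropy coder spends on average

\mathbb{E}_{x \sim p}\left[-\log_2 q(x)\right] = H(p) + D_{\mathrm{KL}}(p \parallel q) \;\ge\; H(p)

bits per symbol, so every bit of modeling error (the KL term) shows up directly as a larger output.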

Core challenges you’ll face:

  • Probability output (network predicts distribution over next byte) → maps to softmax modeling
  • Online learning (model adapts while compressing) → maps to adaptive compression
  • Bit-back coding (efficient latent coding) → maps to VAE compression
  • Model size overhead (model must ship with data) → maps to break-even point
  • Speed (neural networks are slow) → maps to practical trade-offs

Key Concepts:

Difficulty: Master Time estimate: 2-3 months Prerequisites: Deep learning basics, arithmetic coding, PPM understanding

Real world outcome:

$ ./neural_compress train corpus/*.txt --model model.pt
Training neural predictor...
Epoch 100: perplexity 2.34 (entropy 1.23 bits/byte)

$ ./neural_compress compress book.txt book.nc --model model.pt
Original: 500,000 bytes
Compressed: 125,000 bytes + 5,000,000 bytes (model)
Net size: 5,125,000 bytes (worse than gzip!)

$ ./neural_compress compress large_corpus.txt corpus.nc --model model.pt
Original: 1,000,000,000 bytes
Compressed: 98,000,000 bytes + 5,000,000 bytes (model)
Net compression: 10.3% of original, including the model (better than everything!)

# For small files, neural compression loses to model overhead
# For large files or pre-shared model, it wins

The Core Question You’re Answering

“How can a learned model compress data by mapping it into a smaller latent space?”

Concepts You Must Understand First

  • Autoencoders and latent representations
  • Quantization and entropy modeling
  • Loss functions for reconstruction vs size
  • Inference-time vs training-time constraints

Questions to Guide Your Design

  1. What input format will your model compress?
  2. How will you quantize the latent space?
  3. How do you transmit the model or assume it exists?
  4. What metric measures quality vs size?

Thinking Exercise

Train a tiny autoencoder on grayscale 8x8 images. Compare reconstruction error vs latent size.

The Interview Questions They’ll Ask

  1. Why is entropy modeling still required after a neural encoder?
  2. How do you balance reconstruction loss and bitrate?
  3. What makes neural compression slower than classical methods?
  4. How do you deploy a model in a real system?

Hints in Layers

Hint 1: Start With a Tiny Dataset Use MNIST or synthetic data.

Hint 2: Add Quantization Round or vector-quantize latent values.

Hint 3: Add Entropy Coding Compress the quantized latent values.

Hint 4: Compare to Baselines Benchmark vs PNG or gzip.

Books That Will Help

| Topic | Book | Chapter |
|-------|------|---------|
| Math foundations | Math for Programmers | Ch. 5-7 |
| Data representation | Computer Systems: A Programmer’s Perspective | Ch. 2 |
| Algorithms | Algorithms, 4th Edition | Ch. 1 |

Common Pitfalls & Debugging

Problem: “Model outputs blurry reconstructions”

  • Why: Latent space too small or loss too smooth.
  • Fix: Increase latent size or change loss metric.
  • Quick test: Compare PSNR at different latent sizes.

Problem: “Compressed size larger than baseline”

  • Why: No entropy coding or poor quantization.
  • Fix: Add entropy coding for latents.
  • Quick test: Compare to PNG on the same images.

Definition of Done

  • Encoder/decoder round-trip works
  • Latent quantization + entropy coding implemented
  • Compression beats a baseline on a simple dataset
  • Model inference runs within memory limits

Implementation Hints: Use a character-level LSTM or small Transformer (1-2 layers). Input: last N bytes (embedded). Output: softmax over 256 possible next bytes. Train on similar data or train online while compressing. For compression: for each byte, get predicted distribution from network, encode with arithmetic coder, update network with actual byte. Key insight: better prediction = lower entropy = smaller output. Start with PyTorch for prototyping. For practical use, the model overhead (~5MB) only pays off for very large files or when model is pre-shared.

Learning milestones:

  1. Neural network predicts next byte → You’ve built the model
  2. Arithmetic coding uses predictions → You’ve integrated with compression
  3. Large files compress better than classical → Neural compression works!
  4. Online adaptation improves results → You understand adaptive compression

Project 17: Compression Benchmarking Suite

  • File: COMPRESSION_ALGORITHMS_DEEP_DIVE_PROJECTS.md
  • Main Programming Language: Python + C
  • Alternative Programming Languages: Rust, Go
  • Coolness Level: Level 2: Practical but Forgettable
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: Performance Analysis
  • Software or Tool: Benchmark Tool
  • Main Book: “The Art of Computer Systems Performance Analysis” by Raj Jain

What you’ll build: A benchmarking suite that compares compression algorithms across different data types (text, images, binaries), measuring ratio, speed, memory, and CPU usage.

Why it teaches compression fundamentals: Different algorithms excel on different data. Benchmarking reveals the trade-offs: LZ4 for speed, zstd for balance, LZMA for ratio, PPM for text. Understanding when to use each algorithm is as important as understanding how they work.

Core challenges you’ll face:

  • Fair comparison (same settings, warm cache) → maps to experimental methodology
  • Data variety (text, binary, images, random) → maps to workload characterization
  • Metric selection (ratio vs speed vs memory) → maps to multi-objective optimization
  • Statistical significance (variance between runs) → maps to statistical analysis
  • Visualization (Pareto frontiers, charts) → maps to data presentation

Key Concepts:

Difficulty: Intermediate Time estimate: 1 week Prerequisites: Basic statistics, scripting, system monitoring

Real world outcome:

$ ./benchmark --corpus silesia/ --algorithms gzip,lz4,zstd,bzip2
Running benchmarks on 12 files, 211 MB total...

Results:
| Algorithm | Ratio | Compress | Decompress | Memory |
|-----------|-------|----------|------------|--------|
| lz4       | 2.1x  | 650 MB/s | 2800 MB/s  | 16 MB  |
| zstd -1   | 2.8x  | 450 MB/s | 1200 MB/s  | 32 MB  |
| zstd -19  | 3.5x  | 3 MB/s   | 1200 MB/s  | 128 MB |
| gzip      | 3.2x  | 45 MB/s  | 280 MB/s   | 256 KB |
| bzip2     | 3.8x  | 15 MB/s  | 55 MB/s    | 7 MB   |

Best for speed: lz4
Best for ratio: bzip2
Best overall: zstd (Pareto optimal)

$ ./benchmark --visualize results.json
[Generates Pareto frontier chart: ratio vs speed]

The Core Question You’re Answering

“How do you compare compressors fairly across ratio, speed, and memory?”

Concepts You Must Understand First

  • Corpus selection and representativeness
  • Throughput measurement and CPU pinning
  • Memory usage tracking
  • Repeatable benchmark harnesses

Questions to Guide Your Design

  1. What datasets represent text, binary, media, and mixed files?
  2. How will you normalize for CPU frequency changes?
  3. What metrics will you log and visualize?
  4. How will you ensure reproducibility?

Thinking Exercise

Design a benchmark table that compares ratio, compression speed, and decompression speed for three algorithms.

The Interview Questions They’ll Ask

  1. Why is decompression speed often more important than compression?
  2. How do you avoid benchmarking bias?
  3. Why should you use multiple corpora?
  4. How do you compare algorithms with different outputs?

Hints in Layers

Hint 1: Start With a Small Corpus Add a few representative files.

Hint 2: Fix CPU Frequency Disable turbo for consistent timing.

Hint 3: Automate Runs Repeat runs and compute averages.

Hint 4: Visualize Results Plot ratio vs speed.

Books That Will Help

| Topic | Book | Chapter |
|-------|------|---------|
| Algorithm analysis | Algorithms, 4th Edition | Ch. 1 |
| Performance | Computer Systems: A Programmer’s Perspective | Ch. 6 |
| Experimental method | The Pragmatic Programmer | Ch. 8 |

Common Pitfalls & Debugging

Problem: “Results vary wildly”

  • Why: CPU frequency scaling or cache effects.
  • Fix: Pin CPU, warm caches, and repeat runs.
  • Quick test: Run the same benchmark 10 times and compare variance.

Problem: “Algorithm looks worse than expected”

  • Why: Input corpus not representative.
  • Fix: Use diverse corpora and document them.
  • Quick test: Add a text and binary corpus.

Definition of Done

  • Benchmark harness runs end-to-end
  • Ratio, compress, decompress metrics captured
  • Results reproducible across runs
  • Reports are exportable (CSV/JSON)

Implementation Hints: Use standard test corpora (Silesia, Calgary, Canterbury). For each algorithm, run multiple times, discard first run (warm cache), compute mean and variance. Measure wall-clock time, CPU time, peak memory (via /proc or similar). Compute compression ratio, speed (bytes/second), and memory per byte. Plot Pareto frontier: algorithms on the efficient frontier offer best trade-offs. Include random data (incompressible) to verify no algorithm expands it much.
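
A minimal timing sketch using the POSIX monotonic clock; compress_fn is a placeholder for whichever codec you are measuring, not a real library call:

#include <stdio.h>
#include <time.h>
#include <stddef.h>

/* Placeholder signature for the codec under test (illustrative only). */
typedef size_t (*compress_fn)(const unsigned char *src, size_t src_len,
                              unsigned char *dst, size_t dst_cap);

/* Time one compression pass and report ratio and throughput. */
static void bench_one(compress_fn fn, const unsigned char *src, size_t src_len,
                      unsigned char *dst, size_t dst_cap) {
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    size_t out_len = fn(src, src_len, dst, dst_cap);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("ratio %.2fx, %.1f MB/s\n",
           (double)src_len / out_len, src_len / secs / 1e6);
}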

Learning milestones:

  1. Benchmarks run consistently → You understand experimental methodology
  2. Results match published benchmarks → Your methodology is sound
  3. You can recommend algorithms for use cases → You understand trade-offs
  4. Visualizations show clear patterns → You can communicate results

Project 18: Build Your Own Mini-Compression Library

  • File: COMPRESSION_ALGORITHMS_DEEP_DIVE_PROJECTS.md
  • Main Programming Language: C
  • Alternative Programming Languages: Rust, C++
  • Coolness Level: Level 5: Pure Magic
  • Business Potential: 4. The “Open Core” Infrastructure
  • Difficulty: Level 4: Expert
  • Knowledge Area: Library Design
  • Software or Tool: Compression Library
  • Main Book: “The Practice of Programming” by Kernighan & Pike

What you’ll build: A unified compression library with a clean API, supporting multiple algorithms (LZ4, zstd-lite, Huffman), streaming, and configurable compression levels.

Why it teaches compression fundamentals: Building a library forces you to think about API design, memory management, error handling, and performance. You’ll understand why real libraries (zlib, zstd) are designed the way they are.

Core challenges you’ll face:

  • API design (simple yet flexible interface) → maps to library design
  • Streaming (compress without knowing total size) → maps to incremental processing
  • Memory management (caller-provided vs library-managed) → maps to resource ownership
  • Error handling (graceful failure, informative errors) → maps to robust code
  • Thread safety (concurrent compression) → maps to parallel design

Key Concepts:

  • zlib API Design: zlib Manual
  • zstd API: zstd Documentation
  • Library Design: “The Practice of Programming” Chapter 4 - Kernighan & Pike
  • API Design Principles: “The Design of Everyday Things” - Don Norman

Difficulty: Expert Time estimate: 2-3 weeks Prerequisites: All previous compression projects, library design experience

Real world outcome:

#include "minicompress.h"

// Simple one-shot API
mc_compress(src, src_len, dst, &dst_len, MC_ALGO_ZSTD, MC_LEVEL_DEFAULT);

// Streaming API
mc_stream_t *s = mc_stream_new(MC_ALGO_LZ4, MC_COMPRESS);
while (has_more_data) {
    mc_stream_input(s, chunk, chunk_len);
    while (mc_stream_output_available(s)) {
        size_t out_len;
        mc_stream_output(s, buffer, sizeof(buffer), &out_len);
        write(fd, buffer, out_len);
    }
}
mc_stream_finish(s);
mc_stream_free(s);

// Algorithm enumeration
for (mc_algo_t a = mc_algo_first(); a != MC_ALGO_END; a = mc_algo_next(a)) {
    printf("%s: ratio %.1fx, speed %d MB/s\n",
           mc_algo_name(a), mc_algo_typical_ratio(a), mc_algo_typical_speed(a));
}

The Core Question You’re Answering

“How do you design a compression API that is usable, extensible, and safe for real applications?”

Concepts You Must Understand First

  • API design (streaming vs buffer-based)
  • File format versioning and headers
  • Plugin architecture for multiple codecs
  • Error handling and integrity checks

Questions to Guide Your Design

  1. What is the minimal API surface for encode/decode?
  2. How will you allow streaming compression?
  3. How will you version and identify formats?
  4. How will you test compatibility over time?

Thinking Exercise

Design a minimal API: compress(input, output, options) and decompress(input, output). What errors can each return?

The Interview Questions They’ll Ask

  1. What makes a compression API easy to integrate?
  2. How do you handle backward compatibility in formats?
  3. How do you prevent corrupted output from crashing clients?
  4. How do you expose tuning knobs without overwhelming users?

Hints in Layers

Hint 1: Start With One Codec Implement RLE end-to-end with your API.

Hint 2: Add Stream Interfaces Chunked input/output for large files.

Hint 3: Add Versioned Headers Magic bytes + version + codec ID.

Hint 4: Add a Plugin Registry Map codec IDs to encode/decode functions.

Books That Will Help

| Topic | Book | Chapter |
|-------|------|---------|
| API design | Clean Architecture | Ch. 11 |
| Error handling | Code Complete | Ch. 8 |
| Data formats | Computer Systems: A Programmer’s Perspective | Ch. 2 |

Common Pitfalls & Debugging

Problem: “Decoder crashes on corrupt input”

  • Why: No bounds checking or checksum.
  • Fix: Add length checks and optional CRC.
  • Quick test: Flip random bits in a file and ensure graceful errors.

Problem: “API too rigid for new codecs”

  • Why: Format assumptions baked in.
  • Fix: Use a codec registry and versioned headers.
  • Quick test: Add a dummy codec and verify registration.

Definition of Done

  • API supports streaming and buffer modes
  • Format headers include version + codec ID
  • Corrupt inputs fail safely
  • At least three codecs integrated

Implementation Hints: Design a clean C API with opaque types. Provide both one-shot (compress(src, dst)) and streaming (stream_input(), stream_output()) interfaces. Use function pointers for algorithm dispatch. Include bounds checking, input validation, and meaningful error codes. For streaming, implement internal buffering. For thread safety, avoid global state—keep all context in the stream object. Document memory ownership clearly. Provide both static and dynamic linking options.
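
A sketch of function-pointer dispatch behind a codec registry; all names are illustrative (not the API of zlib, zstd, or any other library), and the per-codec functions are assumed to be defined elsewhere in your project:

#include <stddef.h>

/* Each codec exposes the same one-shot entry points. */
typedef struct {
    const char *name;
    unsigned char id;   /* stored in the container header */
    int (*compress)(const unsigned char *src, size_t src_len,
                    unsigned char *dst, size_t *dst_len);
    int (*decompress)(const unsigned char *src, size_t src_len,
                      unsigned char *dst, size_t *dst_len);
} mc_codec_t;

/* Per-codec implementations, defined elsewhere in your library. */
int rle_compress(const unsigned char *, size_t, unsigned char *, size_t *);
int rle_decompress(const unsigned char *, size_t, unsigned char *, size_t *);
int huff_compress(const unsigned char *, size_t, unsigned char *, size_t *);
int huff_decompress(const unsigned char *, size_t, unsigned char *, size_t *);

static const mc_codec_t registry[] = {
    { "rle",     1, rle_compress,  rle_decompress  },
    { "huffman", 2, huff_compress, huff_decompress },
};

/* Look up a codec by the id read from the file header; returns NULL for an
   unknown id so the caller can fail gracefully on unsupported input. */
static const mc_codec_t *mc_find_codec(unsigned char id) {
    for (size_t i = 0; i < sizeof registry / sizeof registry[0]; i++)
        if (registry[i].id == id) return &registry[i];
    return NULL;
}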

Learning milestones:

  1. One-shot API works for all algorithms → You understand abstraction
  2. Streaming API handles arbitrary input sizes → You understand buffering
  3. Memory management is correct (no leaks) → You’ve verified with valgrind
  4. Library is usable by others → You’ve achieved good API design

Project Comparison Table

| # | Project | Difficulty | Time | Depth | Fun Factor | Type |
|---|---------|------------|------|-------|------------|------|
| 1 | RLE | Beginner | 1-2 days | ★★☆☆☆ | ★★☆☆☆ | Lossless |
| 2 | Huffman | Intermediate | 1 week | ★★★★☆ | ★★★☆☆ | Lossless |
| 3 | LZ77 | Advanced | 2-3 weeks | ★★★★☆ | ★★★★☆ | Lossless |
| 4 | DEFLATE | Expert | 3-4 weeks | ★★★★★ | ★★★★☆ | Lossless |
| 5 | Arithmetic | Expert | 2-3 weeks | ★★★★★ | ★★★☆☆ | Lossless |
| 6 | ANS | Expert | 2-3 weeks | ★★★★★ | ★★★★☆ | Lossless |
| 7 | BWT | Advanced | 2-3 weeks | ★★★★☆ | ★★★★☆ | Lossless |
| 8 | LZ4-style | Advanced | 1-2 weeks | ★★★☆☆ | ★★★★★ | Lossless |
| 9 | Zstandard | Master | 1-2 months | ★★★★★ | ★★★★★ | Lossless |
| 10 | JPEG | Expert | 3-4 weeks | ★★★★★ | ★★★★★ | Lossy |
| 11 | PNG | Advanced | 2-3 weeks | ★★★★☆ | ★★★★☆ | Lossless |
| 12 | MP3 | Master | 1-2 months | ★★★★★ | ★★★★★ | Lossy |
| 13 | Video Codec | Master | 2-3 months | ★★★★★ | ★★★★★ | Lossy |
| 14 | Delta Encoding | Intermediate | 1-2 weeks | ★★★☆☆ | ★★★★☆ | Lossless |
| 15 | PPM | Expert | 3-4 weeks | ★★★★★ | ★★★★☆ | Lossless |
| 16 | Neural Compress | Master | 2-3 months | ★★★★★ | ★★★★★ | Lossless |
| 17 | Benchmarking | Intermediate | 1 week | ★★☆☆☆ | ★★★☆☆ | Tool |
| 18 | Mini Library | Expert | 2-3 weeks | ★★★★☆ | ★★★☆☆ | Library |

Path A: Understanding the Fundamentals (4-6 weeks)

Best if you’re new to compression and want solid foundations:

  1. Project 1: RLE → Understand the simplest case
  2. Project 2: Huffman → Understand entropy coding
  3. Project 3: LZ77 → Understand dictionary compression
  4. Project 4: DEFLATE → See how they combine

Path B: Modern High-Performance Compression (6-8 weeks)

Best if you want to understand current industry standards:

  1. Project 2: Huffman → Foundation of entropy coding
  2. Project 6: ANS → Modern entropy coding
  3. Project 8: LZ4 → Speed-optimized compression
  4. Project 9: Zstandard → State of the art

Path C: Media Compression Deep Dive (8-12 weeks)

Best if you’re interested in images, audio, or video:

  1. Project 2: Huffman → Needed for JPEG
  2. Project 10: JPEG → Image compression fundamentals
  3. Project 12: MP3 → Audio compression
  4. Project 13: Video Codec → Video compression

Path D: Research-Oriented Path (10-12 weeks)

Best if you want to push compression boundaries:

  1. Project 5: Arithmetic Coding → Optimal entropy coding
  2. Project 7: BWT → Transform-based compression
  3. Project 15: PPM → Context modeling
  4. Project 16: Neural Compressor → Frontier of compression

Path E: Complete Mastery (4-6 months)

For those who want to deeply understand all of compression:

  1. Start with Path A (fundamentals)
  2. Add Path B (modern algorithms)
  3. Add Project 10-11 (image compression)
  4. Add Project 15-16 (advanced modeling)
  5. Finish with Project 18 (library design)

Key Resources Summary

Books

  • “Introduction to Data Compression” by Khalid Sayood - The textbook
  • “Data Compression: The Complete Reference” by David Salomon - Comprehensive reference
  • “Managing Gigabytes” by Witten, Moffat, Bell - Text compression focus
  • “JPEG Still Image Data Compression Standard” by Pennebaker & Mitchell - Image compression

Online Resources

Source Code to Study

  • zlib - Reference DEFLATE
  • zstd - Modern, well-documented
  • lz4 - Speed-optimized
  • bzip2 - BWT-based
  • LAME - MP3 encoder
  • x264 - H.264 encoder

RFCs and Standards

  • RFC 1951 - DEFLATE Compressed Data Format Specification
  • RFC 1950 - ZLIB Compressed Data Format Specification
  • RFC 1952 - GZIP File Format Specification
  • RFC 3284 - The VCDIFF Generic Differencing and Compression Data Format

Summary

| # | Project | Main Language |
|---|---------|---------------|
| 1 | Run-Length Encoding (RLE) Compressor | C |
| 2 | Huffman Coding Compressor | C |
| 3 | LZ77 Sliding Window Compressor | C |
| 4 | DEFLATE Implementation (LZ77 + Huffman) | C |
| 5 | Arithmetic Coding Compressor | C |
| 6 | ANS (Asymmetric Numeral Systems) Compressor | C |
| 7 | Burrows-Wheeler Transform (BWT) Compressor | C |
| 8 | LZ4/Snappy-style Fast Compressor | C |
| 9 | Zstandard-style Compressor | C |
| 10 | JPEG Encoder from Scratch | C |
| 11 | PNG Encoder from Scratch | C |
| 12 | MP3 Encoder (Simplified) | C |
| 13 | Video Codec Fundamentals (H.264-lite) | C |
| 14 | Delta Encoding & Diff Algorithm | C |
| 15 | Context-Based Compression (PPM) | C |
| 16 | Neural Network Compressor (Simple) | Python |
| 17 | Compression Benchmarking Suite | Python + C |
| 18 | Build Your Own Mini-Compression Library | C |