Compression Algorithms Deep Dive: Project-Based Learning
Understanding How Data Compression Really Works
Compression is everywhere: ZIP files, JPEG images, MP3 audio, Netflix streaming, website loading. Yet most developers treat it as a black box. By building compressors yourself, you’ll understand the mathematical foundations, the engineering trade-offs, and why certain algorithms dominate specific domains.
Core Concept Analysis
To truly understand compression, you need to grasp these fundamental building blocks:
1. Information Theory Foundations
- Entropy: The theoretical minimum bits needed to represent data (Shannon’s limit)
- Redundancy: The gap between actual data size and theoretical minimum
- Symbol probability: How often each byte/character appears
2. Lossless Compression Techniques
- Statistical coding: Huffman, Arithmetic, ANS (encode based on frequency)
- Dictionary coding: LZ77, LZ78, LZW (encode based on repetition)
- Transform coding: BWT, MTF (rearrange data to be more compressible)
3. Lossy Compression Techniques
- Transform domain: DCT, DFT, Wavelets (move to frequency space)
- Quantization: Reduce precision strategically
- Perceptual models: Exploit human sensory limitations
4. The Compression Pipeline
Most real compressors combine multiple techniques:
Input → Transform (BWT) → Statistical Model (Context) → Entropy Coder (ANS/Huffman) → Output
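To make the Shannon limit above concrete, here is a minimal C sketch (the function name is illustrative) that measures the order-0 entropy of a file's byte histogram; multiplying the result by the file size gives the lower bound that the entropy coders in the later projects approach.
#include <math.h>
#include <stdio.h>

/* Order-0 Shannon entropy of a file, in bits per byte (treats bytes as independent). */
double file_entropy(FILE *f) {
    unsigned long long count[256] = {0}, total = 0;
    int c;
    while ((c = fgetc(f)) != EOF) { count[c]++; total++; }
    if (total == 0) return 0.0;
    double h = 0.0;
    for (int i = 0; i < 256; i++) {
        if (count[i] == 0) continue;
        double p = (double)count[i] / (double)total;
        h -= p * log2(p);              /* each symbol ideally costs -log2(p) bits */
    }
    return h;                          /* roughly 4.5 for English text, near 8.0 for random data */
}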
Project 1: Run-Length Encoding (RLE) Compressor
- File: COMPRESSION_ALGORITHMS_DEEP_DIVE_PROJECTS.md
- Main Programming Language: C
- Alternative Programming Languages: Python, Rust, Go
- Coolness Level: Level 2: Practical but Forgettable
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 1: Beginner
- Knowledge Area: Basic Compression / Encoding
- Software or Tool: Simple Compressor
- Main Book: “Introduction to Data Compression” by Khalid Sayood (Chapter 2)
What you’ll build: A compressor that replaces repeated consecutive bytes with a count and the byte value. Input “AAABBCCCC” becomes “3A2B4C”.
Why it teaches compression fundamentals: RLE is the simplest compression algorithm that actually works. It teaches the core insight: compression finds and eliminates redundancy. You’ll immediately see when compression helps (repeated data) and when it hurts (random data expands!).
Core challenges you’ll face:
- Encoding format design (how to represent count + byte) → maps to format trade-offs
- Handling non-repeated bytes (1A1B1C is worse than ABC) → maps to expansion problem
- Count overflow (what if 300 A’s in a row?) → maps to boundary conditions
- Binary vs text (works differently on different data) → maps to data type awareness
- Escape sequences (distinguishing counts from data) → maps to encoding ambiguity
Key Concepts:
- Run-Length Encoding: Wikipedia RLE
- Entropy Basics: “Introduction to Data Compression” Chapter 1 - Khalid Sayood
- When RLE Works: Hydrolix RLE Guide
Difficulty: Beginner. Time estimate: 1-2 days. Prerequisites: Basic C programming, file I/O.
Real world outcome:
$ ./rle compress image.bmp image.rle
Original: 1,234,567 bytes
Compressed: 234,567 bytes (81% reduction)
$ ./rle compress random.bin random.rle
Original: 1,000 bytes
Compressed: 1,847 bytes (EXPANSION - random data!)
$ ./rle decompress image.rle restored.bmp
Decompressed successfully. Files match: YES
Implementation Hints: The naive approach “3A2B4C” fails for text that contains digits. Instead, use a binary format: for runs ≥4, output a special marker byte (e.g., 0xFF), then count, then the byte. For runs <4, output bytes literally. Handle the marker byte appearing in data by escaping it (output 0xFF 0x01 0xFF). For count overflow, output multiple run records. Test on BMP images (large uniform regions) vs compressed files (no runs).
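A minimal sketch of the marker-byte scheme described above (the marker value, minimum run, and function names are illustrative, not a fixed format):
#include <stdio.h>

#define MARKER  0xFF
#define MIN_RUN 4

/* Emit one finished run. Any MARKER byte is written as a run record, which doubles
   as the escape for a literal 0xFF appearing in the data. */
static void emit_run(FILE *out, int byte, int run) {
    if (run >= MIN_RUN || byte == MARKER) {
        fputc(MARKER, out); fputc(run, out); fputc(byte, out);
    } else {
        for (int i = 0; i < run; i++) fputc(byte, out);
    }
}

void rle_encode(FILE *in, FILE *out) {
    int prev = fgetc(in), c, run = 1;
    if (prev == EOF) return;
    while ((c = fgetc(in)) != EOF) {
        if (c == prev && run < 255) { run++; continue; }   /* cap the count at one byte */
        emit_run(out, prev, run);                          /* overflow becomes a second record */
        prev = c; run = 1;
    }
    emit_run(out, prev, run);
}
The decoder mirrors this: a MARKER byte announces a (count, value) record; anything else is a literal.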
Learning milestones:
- Basic RLE works on simple input → You understand the core algorithm
- Binary files compress/decompress correctly → You handle edge cases
- You understand why random data expands → You grasp entropy limits
- BMP images compress well, PNGs don’t → You understand data characteristics
Project 2: Huffman Coding Compressor
- File: COMPRESSION_ALGORITHMS_DEEP_DIVE_PROJECTS.md
- Main Programming Language: C
- Alternative Programming Languages: Rust, Python, Go
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 2: Intermediate
- Knowledge Area: Entropy Coding / Trees
- Software or Tool: Entropy Compressor
- Main Book: “Introduction to Data Compression” by Khalid Sayood (Chapter 3)
What you’ll build: A compressor that assigns shorter bit sequences to more frequent bytes. If ‘e’ appears 1000 times and ‘z’ appears once, ‘e’ might be encoded as “10” (2 bits) and ‘z’ as “1101110” (7 bits).
Why it teaches compression fundamentals: Huffman coding is the foundation of entropy coding. It’s optimal for symbol-by-symbol coding and appears inside DEFLATE (ZIP), JPEG, MP3, and more. Understanding Huffman means understanding why some data compresses better than others.
Core challenges you’ll face:
- Frequency counting (scanning input to build histogram) → maps to statistical modeling
- Tree construction (building optimal prefix-free tree) → maps to greedy algorithms
- Canonical codes (standardizing tree representation) → maps to format efficiency
- Bit manipulation (packing variable-length codes into bytes) → maps to low-level coding
- Header storage (transmitting the tree to decoder) → maps to metadata overhead
Key Concepts:
- Huffman Algorithm: “Algorithms” by Sedgewick & Wayne - Chapter 5.5
- Prefix-Free Codes: “Introduction to Data Compression” Chapter 3 - Khalid Sayood
- Canonical Huffman: Managing Gigabytes
- Information Entropy: “Elements of Information Theory” Chapter 2 - Cover & Thomas
Difficulty: Intermediate. Time estimate: 1 week. Prerequisites: Binary trees, bit manipulation, file I/O.
Real world outcome:
$ ./huffman compress book.txt book.huf
Original: 500,000 bytes
Compressed: 285,000 bytes (43% reduction)
Entropy: 4.56 bits/byte (theoretical limit: 285,000 bytes)
$ ./huffman analyze book.txt
Symbol frequencies:
' ' (space): 98,234 (19.6%) → code: 00 (2 bits)
'e': 62,145 (12.4%) → code: 010 (3 bits)
't': 45,231 (9.0%) → code: 0110 (4 bits)
...
'z': 234 (0.05%) → code: 11111110 (8 bits)
$ ./huffman decompress book.huf restored.txt
Decompressed successfully. Files match: YES
Implementation Hints: First pass: count byte frequencies. Build tree: create leaf nodes for each byte with its frequency, then repeatedly merge the two lowest-frequency nodes into a parent. Assign codes: traverse tree, appending 0 for left and 1 for right. For canonical Huffman: sort codes by length, then alphabetically within length, and assign sequential codes. Store only code lengths in header (much smaller). For bit packing, accumulate bits in a buffer and flush full bytes.
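The canonical-code step is short enough to sketch directly. This follows the length-to-code procedure that RFC 1951 also describes, assuming byte symbols and code lengths of at most 15 bits:
#include <stdint.h>

#define MAX_BITS 15

/* Assign canonical codes from code lengths alone: shorter lengths get numerically
   smaller codes, and symbols of equal length are ordered by symbol value. */
void canonical_codes(const uint8_t len[256], uint16_t code[256]) {
    uint16_t bl_count[MAX_BITS + 1] = {0};
    uint16_t next_code[MAX_BITS + 1] = {0};
    for (int s = 0; s < 256; s++) bl_count[len[s]]++;
    bl_count[0] = 0;                              /* length 0 marks an unused symbol */
    uint16_t c = 0;
    for (int bits = 1; bits <= MAX_BITS; bits++) {
        c = (uint16_t)((c + bl_count[bits - 1]) << 1);
        next_code[bits] = c;
    }
    for (int s = 0; s < 256; s++)
        if (len[s] != 0)
            code[s] = next_code[len[s]]++;
}
Because the codes are fully determined by the lengths, the header only has to carry one length per symbol.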
Learning milestones:
- Tree builds correctly → You understand the greedy algorithm
- Encoding produces correct bit stream → You can do bit manipulation
- Compression ratio approaches entropy → You understand optimality
- Canonical codes reduce header size → You understand practical optimization
Project 3: LZ77 Sliding Window Compressor
- File: COMPRESSION_ALGORITHMS_DEEP_DIVE_PROJECTS.md
- Main Programming Language: C
- Alternative Programming Languages: Rust, C++, Go
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 3. The “Service & Support” Model
- Difficulty: Level 3: Advanced
- Knowledge Area: Dictionary Compression
- Software or Tool: LZ Compressor
- Main Book: “Data Compression: The Complete Reference” by David Salomon
What you’ll build: A compressor that replaces repeated sequences with back-references. “ABCDEFABCDEF” becomes “ABCDEF(6,6)” meaning “go back 6 bytes, copy 6 bytes.”
Why it teaches compression fundamentals: LZ77 is the foundation of modern compression (ZIP, gzip, PNG, zstd). Unlike Huffman which exploits single-byte frequencies, LZ77 exploits repeated patterns of any length. This is why text (repeated words) and code (repeated patterns) compress so well.
Core challenges you’ll face:
- Match finding (finding longest match in window) → maps to search algorithms
- Hash chains (efficient duplicate detection) → maps to hash table design
- Lazy matching (sometimes waiting finds better match) → maps to optimization strategies
- Output format (literal vs match distinction) → maps to encoding design
- Window size trade-offs (bigger = better compression, more memory) → maps to resource trade-offs
Key Concepts:
- LZ77 Algorithm: Stanford LZ77 Tutorial
- Hash Chains: “Data Compression: The Complete Reference” Chapter 4 - Salomon
- Lazy Matching: zlib Algorithm
- LZSS Variant: Wikipedia LZSS
Difficulty: Advanced. Time estimate: 2-3 weeks. Prerequisites: Hash tables, string matching, understanding of Huffman.
Real world outcome:
$ ./lz77 compress linux-kernel.tar linux-kernel.lz77
Original: 850,000,000 bytes
Compressed: 125,000,000 bytes (85% reduction)
Compression speed: 45 MB/s
Window size: 32 KB
$ ./lz77 --analyze linux-kernel.tar
Match statistics:
Literal bytes: 15%
Match references: 85%
Average match length: 12.3 bytes
Longest match: 258 bytes
$ ./lz77 decompress linux-kernel.lz77 restored.tar
Decompression speed: 350 MB/s (asymmetric!)
Files match: YES
Implementation Hints: Use a hash table keyed on 3-byte sequences. For each position, compute hash of next 3 bytes and check for matches at stored positions. Chain collisions (multiple positions with same hash). For each potential match, extend forward to find longest match. Emit literals for unmatched bytes, emit (distance, length) for matches. Minimum match length is typically 3 (shorter matches aren’t worth the overhead). LZSS improvement: use a flag bit to distinguish literals from matches.
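A sketch of the hash-chain match finder under the assumptions above (3-byte hash, 32 KB window); inserting each new position into head[] and prev[] after it is processed is left to the caller, and all names are illustrative:
#include <stddef.h>
#include <stdint.h>

#define WINDOW    32768
#define HASH_BITS 15
#define HASH_SIZE (1 << HASH_BITS)
#define MIN_MATCH 3

/* Hash the 3 bytes at p into a table index. */
static uint32_t hash3(const uint8_t *p) {
    return (((uint32_t)p[0] << 10) ^ ((uint32_t)p[1] << 5) ^ p[2]) & (HASH_SIZE - 1);
}

/* Walk the hash chain for position `pos` and return the longest match length found
   (0 if shorter than MIN_MATCH). head[h] holds the most recent position hashed to h;
   prev[pos & (WINDOW-1)] chains back to the previous position with the same hash. */
size_t longest_match(const uint8_t *buf, size_t pos, size_t end,
                     const int32_t *head, const int32_t *prev, size_t *match_pos) {
    size_t best = 0;
    int32_t cand = head[hash3(buf + pos)];
    int budget = 128;                                  /* bound the work per position */
    while (cand >= 0 && (size_t)cand < pos && pos - (size_t)cand <= WINDOW && budget--) {
        size_t len = 0, limit = end - pos;
        while (len < limit && buf[(size_t)cand + len] == buf[pos + len]) len++;
        if (len > best) { best = len; *match_pos = (size_t)cand; }
        cand = prev[(size_t)cand & (WINDOW - 1)];
    }
    return best >= MIN_MATCH ? best : 0;
}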
Learning milestones:
- Basic LZ77 compresses/decompresses → You understand the core algorithm
- Hash chains speed up matching → You understand practical implementation
- Lazy matching improves ratio → You understand optimization trade-offs
- Decompression is much faster than compression → You understand asymmetry
Project 4: DEFLATE Implementation (LZ77 + Huffman)
- File: COMPRESSION_ALGORITHMS_DEEP_DIVE_PROJECTS.md
- Main Programming Language: C
- Alternative Programming Languages: Rust, C++
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 4. The “Open Core” Infrastructure
- Difficulty: Level 4: Expert
- Knowledge Area: Compression Pipeline
- Software or Tool: DEFLATE Compressor
- Main Book: “RFC 1951: DEFLATE Compressed Data Format Specification”
What you’ll build: A compressor combining LZ77 and Huffman coding—the algorithm behind ZIP, gzip, PNG, and HTTP compression. LZ77 finds matches, then Huffman encodes the literals, lengths, and distances.
Why it teaches compression fundamentals: DEFLATE shows how compression algorithms compose. LZ77 removes redundancy, then Huffman compresses the residual entropy. This two-stage approach is the template for most modern compressors.
Core challenges you’ll face:
- Block structure (stored, static Huffman, dynamic Huffman) → maps to adaptive algorithms
- Dynamic tree building (per-block optimal trees) → maps to local optimization
- Length/distance alphabets (special symbols for back-references) → maps to extended coding
- Bit-level alignment (variable-length codes packed into bytes) → maps to bit stream handling
- Compatibility (producing gzip/zlib compatible output) → maps to format compliance
Key Concepts:
- DEFLATE Specification: RFC 1951
- zlib/gzip Headers: RFC 1950 (zlib), RFC 1952 (gzip)
- Implementation Guide: zlib Algorithm
- PNG Usage: PNG Compression
Difficulty: Expert. Time estimate: 3-4 weeks. Prerequisites: LZ77, Huffman, bit manipulation mastery.
Real world outcome:
$ ./mydeflate compress document.pdf document.gz
Original: 1,234,567 bytes
Compressed: 987,654 bytes (20% reduction)
Compatible with: gzip, zlib, 7-zip
$ gzip -d document.gz # Verify with standard tool
$ diff document.pdf document # Identical!
$ ./mydeflate --stats document.pdf
Block analysis:
Block 1: Dynamic Huffman, 45KB → 32KB
Block 2: Dynamic Huffman, 45KB → 31KB
...
Literal symbols: 123,456
Match references: 45,678
Average match length: 8.2 bytes
Implementation Hints: DEFLATE has three block types: stored (no compression), static Huffman (predefined trees), dynamic Huffman (per-block trees). For dynamic blocks: after LZ77, count frequencies of literals (0-255), end-of-block (256), and length codes (257-285). Build Huffman tree. Similarly for distance codes (0-29). Encode the trees themselves using another layer of Huffman (code length codes). The RFC specifies exact bit orderings—follow it precisely. Use zlib’s output as reference to verify correctness.
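The bit-level packing is where most first attempts go wrong. Here is a minimal sketch of an LSB-first bit writer matching DEFLATE's packing order; note that Huffman codes themselves must be bit-reversed before being passed in (zlib does the same), while extra bits are already LSB-first. The type and function names are illustrative.
#include <stdint.h>
#include <stdio.h>

/* DEFLATE packs fields starting at the least significant bit of each output byte.
   `count` should stay at 24 or below so the accumulator never overflows. */
typedef struct { FILE *out; uint32_t bits; int nbits; } bitwriter;

void put_bits(bitwriter *bw, uint32_t value, int count) {
    bw->bits |= value << bw->nbits;          /* append above the bits already held */
    bw->nbits += count;
    while (bw->nbits >= 8) {                 /* flush whole bytes */
        fputc(bw->bits & 0xFF, bw->out);
        bw->bits >>= 8;
        bw->nbits -= 8;
    }
}

void flush_bits(bitwriter *bw) {             /* pad and emit the final partial byte */
    if (bw->nbits > 0) fputc(bw->bits & 0xFF, bw->out);
    bw->bits = 0;
    bw->nbits = 0;
}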
Learning milestones:
- Stored blocks work → You understand the format structure
- Static Huffman blocks compress → You’ve integrated LZ77 + Huffman
- Dynamic Huffman improves compression → You understand adaptive coding
- Output is gzip-compatible → You’ve achieved format compliance
Project 5: Arithmetic Coding Compressor
- File: COMPRESSION_ALGORITHMS_DEEP_DIVE_PROJECTS.md
- Main Programming Language: C
- Alternative Programming Languages: Rust, Python, C++
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 4: Expert
- Knowledge Area: Entropy Coding / Mathematics
- Software or Tool: Arithmetic Coder
- Main Book: “Introduction to Data Compression” by Khalid Sayood (Chapter 4)
What you’ll build: An encoder that represents the entire message as a single fraction between 0 and 1. Unlike Huffman (integer bits per symbol), arithmetic coding can use 0.3 bits for a very common symbol.
Why it teaches compression fundamentals: Arithmetic coding achieves near-optimal compression, reaching the theoretical entropy limit. It handles symbol probabilities that Huffman can’t efficiently encode. Understanding arithmetic coding explains why modern coders (ANS, range coding) exist.
Core challenges you’ll face:
- Interval subdivision (shrinking range based on probabilities) → maps to mathematical precision
- Fixed-point arithmetic (avoiding floating-point errors) → maps to numerical stability
- Renormalization (outputting bits when range narrows) → maps to bit streaming
- End-of-message (knowing when to stop decoding) → maps to termination handling
- Underflow prevention (middle-of-range scaling) → maps to edge case handling
Key Concepts:
- Arithmetic Coding Theory: “Introduction to Data Compression” Chapter 4 - Sayood
- Practical Implementation: “Data Compression: The Complete Reference” Chapter 4.4 - Salomon
- Fixed-Point Tricks: Arithmetic Coding Tutorial
- Comparison to Huffman: DeepRender Entropy Coding
Difficulty: Expert. Time estimate: 2-3 weeks. Prerequisites: Strong math, bit manipulation, probability concepts.
Real world outcome:
$ ./arithmetic compress skewed_data.bin skewed.ac
Original: 1,000,000 bytes
Compressed: 125,456 bytes
Huffman would give: 127,890 bytes (1.9% worse)
$ ./arithmetic --verbose compress test.txt test.ac
Symbol 'a' (prob 0.4): interval [0.0, 0.4)
Symbol 'b' (prob 0.3): interval [0.16, 0.28)
Symbol 'c' (prob 0.2): interval [0.244, 0.268)
Symbol 'a' (prob 0.4): interval [0.244, 0.2536)
...
Final interval: [0.25013, 0.25014)
Output: 0x4A3B... (17 bits)
Entropy: 1.846 bits/symbol
Arithmetic coding: 1.848 bits/symbol (optimal!)
Huffman coding: 1.9 bits/symbol
Implementation Hints: Use 32-bit integers for low/high range (e.g., 0x00000000 to 0xFFFFFFFF). For each symbol, subdivide current range proportionally to symbol probability. When high and low share the same top bit, output that bit and shift both left. Underflow: when range straddles 0.5 (low < 0.5 < high), track pending bits and scale around 0.5. For decoding, maintain a “code” value read from input, and determine which symbol’s subrange contains it. The tricky part is matching encoder/decoder range updates exactly.
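A sketch of the encoder's renormalization loop in the Witten-Neal-Cleary style the hints describe, using 32-bit fixed point. The bit sink here is a toy that prints characters, and `total` is assumed to stay below about 2^16 so the 64-bit products never lose precision; all names are illustrative.
#include <stdint.h>
#include <stdio.h>

static uint32_t low = 0, high = 0xFFFFFFFFu;
static int pending = 0;                      /* bits deferred by underflow handling */

static void output_bit(int b) { putchar('0' + b); }   /* toy sink for illustration */

static void output_with_pending(int b) {
    output_bit(b);
    while (pending) { output_bit(!b); pending--; }
}

/* Narrow [low, high] to the sub-interval of a symbol whose cumulative range is
   [cum_lo, cum_hi) out of `total`, then renormalize. */
void encode_symbol(uint32_t cum_lo, uint32_t cum_hi, uint32_t total) {
    uint64_t span = (uint64_t)(high - low) + 1;
    high = low + (uint32_t)(span * cum_hi / total) - 1;
    low  = low + (uint32_t)(span * cum_lo / total);
    for (;;) {
        if (high < 0x80000000u) {                        /* both bounds in lower half: emit 0 */
            output_with_pending(0);
        } else if (low >= 0x80000000u) {                 /* both bounds in upper half: emit 1 */
            output_with_pending(1);
        } else if (low >= 0x40000000u && high < 0xC0000000u) {
            pending++;                                   /* underflow: defer the bit */
            low  -= 0x40000000u;
            high -= 0x40000000u;
        } else {
            break;
        }
        low <<= 1;                                       /* zoom into the chosen half */
        high = (high << 1) | 1;
    }
    /* A final flush step (not shown) emits one more disambiguating bit plus any pending bits. */
}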
Learning milestones:
- Basic encoding produces correct output → You understand the algorithm
- Decoding recovers original → You’ve matched encoder/decoder precisely
- Compression beats Huffman on skewed data → You see the advantage
- Fixed-point implementation is stable → You’ve handled precision issues
Project 6: ANS (Asymmetric Numeral Systems) Compressor
- File: COMPRESSION_ALGORITHMS_DEEP_DIVE_PROJECTS.md
- Main Programming Language: C
- Alternative Programming Languages: Rust, C++
- Coolness Level: Level 5: Pure Magic
- Business Potential: 5. The “Industry Disruptor”
- Difficulty: Level 4: Expert
- Knowledge Area: Modern Entropy Coding
- Software or Tool: ANS Compressor
- Main Book: “Asymmetric Numeral Systems” by Jarek Duda (arXiv paper)
What you’ll build: A modern entropy coder that combines arithmetic coding’s compression ratio with Huffman’s speed. ANS is used in Facebook’s zstd, Apple’s LZFSE, and JPEG XL.
Why it teaches compression fundamentals: ANS represents the cutting edge of entropy coding. By building it, you’ll understand why zstd replaced gzip in performance, and why the compression community moved beyond arithmetic coding.
Core challenges you’ll face:
- State machine encoding (single integer instead of interval) → maps to simplified state
- Table construction (spreading symbols across state space) → maps to precomputation
- Streaming (outputting bits as state grows) → maps to bit stream management
- Reverse order (ANS encodes backward, decodes forward) → maps to stack semantics
- FSE variant (tabled ANS for speed) → maps to lookup table optimization
Key Concepts:
- ANS Paper: arXiv 1311.2540 - Jarek Duda
- Tutorial: Understanding ANS
- FSE (Finite State Entropy): Yann Collet’s Blog
- Patent-Free: ANS is Public Domain
Difficulty: Expert. Time estimate: 2-3 weeks. Prerequisites: Arithmetic coding understanding, probability, state machines.
Real world outcome:
$ ./ans compress book.txt book.ans
Original: 500,000 bytes
Compressed: 284,500 bytes (same as arithmetic coding!)
Speed: 450 MB/s (10x faster than arithmetic coding!)
$ ./ans benchmark book.txt
Algorithm     Size (% of original)   Encode     Decode
Huffman       56.8%                  320 MB/s   380 MB/s
Arithmetic    56.9%                   45 MB/s    48 MB/s
ANS (rANS)    56.9%                  450 MB/s   480 MB/s
ANS (tANS)    56.9%                  520 MB/s   600 MB/s
$ ./ans decompress book.ans restored.txt
Files match: YES
Implementation Hints:
For rANS (range ANS): state is a large integer. To encode symbol s with probability p: state' = (state / freq[s]) * total + cumfreq[s] + (state % freq[s]). Renormalize when state exceeds threshold by outputting low bits. Decoding is the inverse. For tANS: precompute a table of state transitions. For each (state, symbol) pair, compute next state and output bits. This avoids division during encode/decode. The tricky insight: encoding must happen in reverse order of intended decode order, so buffer symbols or encode backward.
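A sketch of one rANS encode step under the assumptions above (12-bit probabilities, byte-wise renormalization, toy output sink). It mirrors the structure of public-domain rANS implementations but is not drop-in code, and the constants are illustrative.
#include <stdint.h>
#include <stdio.h>

#define PROB_BITS  12
#define PROB_SCALE (1u << PROB_BITS)   /* all symbol frequencies must sum to this */
#define RANS_LOW   (1u << 23)          /* normalized state lives in [RANS_LOW, RANS_LOW * 256) */

static void put_byte(uint8_t b) { putchar(b); }   /* toy sink; real code writes the stream backward */

/* One encode step for a symbol with frequency f and cumulative frequency cf. */
uint32_t rans_encode(uint32_t state, uint32_t f, uint32_t cf) {
    uint32_t x_max = ((RANS_LOW >> PROB_BITS) << 8) * f;   /* renormalization bound */
    while (state >= x_max) {
        put_byte((uint8_t)(state & 0xFF));
        state >>= 8;
    }
    /* core step from the hints: state' = (state / f) * PROB_SCALE + cf + (state % f) */
    return (state / f) * PROB_SCALE + cf + (state % f);
}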
Learning milestones:
- rANS encodes/decodes correctly → You understand the core algorithm
- tANS table construction works → You understand the optimization
- Speed matches theoretical → You’ve implemented efficiently
- Compression ratio matches arithmetic coding → You’ve achieved optimality
Project 7: Burrows-Wheeler Transform (BWT) Compressor
- File: COMPRESSION_ALGORITHMS_DEEP_DIVE_PROJECTS.md
- Main Programming Language: C
- Alternative Programming Languages: Rust, C++, Python
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 3. The “Service & Support” Model
- Difficulty: Level 3: Advanced
- Knowledge Area: Transform Coding
- Software or Tool: BWT Compressor
- Main Book: “Managing Gigabytes” by Witten, Moffat, Bell (Chapter 4)
What you’ll build: A compressor using the Burrows-Wheeler Transform: rearrange input to cluster similar characters, then compress with Move-to-Front + entropy coding. This is how bzip2 works.
Why it teaches compression fundamentals: BWT shows that compression can be improved by transforming data before coding. Unlike LZ77 (looks for exact matches), BWT exploits the statistical structure of text—letters that follow similar contexts cluster together.
Core challenges you’ll face:
- Suffix sorting (sorting all rotations efficiently) → maps to string algorithms
- Inverse transform (recovering original from BWT) → maps to reversible transforms
- Move-to-Front (converting to small integers) → maps to locality exploitation
- Block sizing (larger blocks = better compression) → maps to resource trade-offs
- Zero-run coding (RLE for the many zeros after MTF) → maps to secondary coding
Key Concepts:
- BWT Algorithm: Wikipedia BWT
- Suffix Arrays: “Algorithms on Strings” by Crochemore et al.
- Move-to-Front: Wikipedia MTF
- bzip2 Implementation: bzip2 Source
Difficulty: Advanced. Time estimate: 2-3 weeks. Prerequisites: Sorting algorithms, understanding of suffix arrays.
Real world outcome:
$ ./bwt compress enwik8 enwik8.bwt
Original: 100,000,000 bytes
After BWT + MTF: many zeros and small values
After Huffman: 29,008,758 bytes (71% reduction)
Compression time: 45 seconds
$ ./bwt --analyze enwik8
BWT clustering effectiveness:
Before BWT: entropy 4.56 bits/byte
After BWT: entropy 3.21 bits/byte
After MTF: entropy 2.32 bits/byte
$ bunzip2 -c enwik8.bz2 | diff - restored
# Round-trip matches the original; real bzip2 achieves a similar ratio
Implementation Hints:
Naive BWT sorts all rotations—O(n² log n). Use suffix arrays for O(n log n) or better. For each position i, the BWT output is input[(suffix_array[i] - 1) mod n]. Store the position of the original string (where suffix_array[i] == 0) for decoding. For inverse BWT: build the “LF mapping” (first column to last column correspondence). Start at the original position and follow the mapping n times. MTF: maintain a list of recent characters, output the index, move that character to front. Combine with run-length coding for zeros and Huffman.
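The inverse transform is the part that surprises people, so here is a minimal sketch of the LF-mapping walk described above; `primary` is the row index of the original string in the sorted rotation matrix, and error handling is omitted.
#include <stdlib.h>

/* Inverse BWT via the LF mapping. L is the transform output (last column) of length n. */
void inverse_bwt(const unsigned char *L, int n, int primary, unsigned char *out) {
    int count[256] = {0}, first[256], seen[256] = {0};
    int *lf = malloc((size_t)n * sizeof(int));
    for (int i = 0; i < n; i++) count[L[i]]++;
    int sum = 0;
    for (int c = 0; c < 256; c++) { first[c] = sum; sum += count[c]; }   /* start of each char in column F */
    for (int i = 0; i < n; i++) {                /* LF(i): where row i's last-column char sits in column F */
        lf[i] = first[L[i]] + seen[L[i]];
        seen[L[i]]++;
    }
    int p = primary;
    for (int i = n - 1; i >= 0; i--) {           /* walk LF, recovering the text back to front */
        out[i] = L[p];
        p = lf[p];
    }
    free(lf);
}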
Learning milestones:
- BWT transforms and inverts correctly → You understand the algorithm
- Suffix array construction is efficient → You’ve implemented practical sorting
- MTF reduces entropy significantly → You understand the clustering effect
- Full pipeline matches bzip2 ratio → You’ve built a real compressor
Project 8: LZ4/Snappy-style Fast Compressor
- File: COMPRESSION_ALGORITHMS_DEEP_DIVE_PROJECTS.md
- Main Programming Language: C
- Alternative Programming Languages: Rust, C++
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 4. The “Open Core” Infrastructure
- Difficulty: Level 3: Advanced
- Knowledge Area: High-Speed Compression
- Software or Tool: Fast LZ Compressor
- Main Book: “LZ4 Explained” by Yann Collet (Blog Series)
What you’ll build: A compressor optimized for speed over ratio. LZ4 compresses at 500+ MB/s while still achieving 2:1 compression on typical data. Used in Linux kernel, databases, games.
Why it teaches compression fundamentals: Speed matters. Sometimes you need compression that’s faster than disk I/O. LZ4 shows how to strip LZ77 to its essentials: simple hash lookup, minimal format overhead, branchless decode loops.
Core challenges you’ll face:
- Single-probe hash table (one lookup, no chains) → maps to speed vs ratio trade-off
- Minimal format (no Huffman, just variable-length integers) → maps to format simplicity
- Cache-friendly access (everything fits in L1/L2) → maps to memory hierarchy
- Branchless decoding (predictable loops) → maps to CPU pipeline optimization
- Block independence (parallel encode/decode) → maps to parallelization
Key Concepts:
- LZ4 Specification: LZ4 Block Format
- Snappy Design: Snappy Documentation
- Speed Tricks: Yann Collet’s Blog
- Comparison: LZ4 vs DEFLATE
Difficulty: Advanced. Time estimate: 1-2 weeks. Prerequisites: LZ77 understanding, performance optimization, low-level C.
Real world outcome:
$ ./fastlz compress linux-kernel.tar kernel.lz4
Original: 850,000,000 bytes
Compressed: 340,000,000 bytes (60% reduction)
Speed: 650 MB/s (limited by memory bandwidth!)
$ ./fastlz decompress kernel.lz4 restored.tar
Speed: 2,800 MB/s (faster than SSD read!)
$ ./fastlz benchmark corpus.tar
Compressor    Size (% of original)   Compress   Decompress
gzip          35%                     45 MB/s     250 MB/s
LZ4           60%                    650 MB/s   2,800 MB/s
LZ4-HC        45%                     35 MB/s   2,800 MB/s
Snappy        58%                    450 MB/s   1,200 MB/s
Implementation Hints: Use a hash table with 4KB-64KB entries (power of 2). Hash the next 4 bytes, single-probe lookup—if match, extend; if not, emit literal and update table. Format: literal length (4 bits) + match length (4 bits) in one byte, followed by literals, followed by 2-byte match offset. Lengths ≥15 use continuation bytes. For speed: use memcpy for literals (compilers optimize this), unroll the match copy loop, use pointer arithmetic instead of index math. The decode loop should be <20 instructions.
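A sketch of emitting one sequence in the LZ4 block format described above (token, literal-length continuation bytes, literals, little-endian offset, match-length continuation); bounds checking is omitted and the function name is illustrative.
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Write one sequence into dst: `lit_len` literals from `lits`, then a match of
   `match_len` bytes (>= 4) at back-distance `offset`. Returns bytes written. */
size_t emit_sequence(uint8_t *dst, const uint8_t *lits, size_t lit_len,
                     uint16_t offset, size_t match_len) {
    uint8_t *p = dst;
    size_t ml = match_len - 4;                          /* format stores matchlen - MINMATCH */
    *p++ = (uint8_t)(((lit_len < 15 ? lit_len : 15) << 4) | (ml < 15 ? ml : 15));
    if (lit_len >= 15) {                                /* literal-length continuation bytes */
        size_t rest = lit_len - 15;
        while (rest >= 255) { *p++ = 255; rest -= 255; }
        *p++ = (uint8_t)rest;
    }
    memcpy(p, lits, lit_len);                           /* literals, copied verbatim */
    p += lit_len;
    *p++ = (uint8_t)(offset & 0xFF);                    /* 2-byte little-endian match offset */
    *p++ = (uint8_t)(offset >> 8);
    if (ml >= 15) {                                     /* match-length continuation bytes */
        size_t rest = ml - 15;
        while (rest >= 255) { *p++ = 255; rest -= 255; }
        *p++ = (uint8_t)rest;
    }
    return (size_t)(p - dst);
}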
Learning milestones:
- Basic LZ4 format encodes/decodes → You understand the format
- Compression speed exceeds 200 MB/s → You’ve optimized encoding
- Decompression speed exceeds 1 GB/s → You’ve optimized decoding
- Ratio matches reference LZ4 → You’ve matched the algorithm
Project 9: Zstandard-style Compressor
- File: COMPRESSION_ALGORITHMS_DEEP_DIVE_PROJECTS.md
- Main Programming Language: C
- Alternative Programming Languages: Rust, C++
- Coolness Level: Level 5: Pure Magic
- Business Potential: 5. The “Industry Disruptor”
- Difficulty: Level 5: Master
- Knowledge Area: State-of-the-Art Compression
- Software or Tool: Modern Compressor
- Main Book: “RFC 8878: Zstandard Compression”
What you’ll build: A compressor combining LZ77, Huffman, and FSE (ANS variant) with configurable compression levels. zstd achieves gzip-like ratios at LZ4-like speeds.
Why it teaches compression fundamentals: Zstandard represents the state of the art. It shows how to combine all the techniques: dictionary matching (LZ77), entropy coding (FSE), multiple compression levels, and modern CPU optimization. Understanding zstd means understanding modern compression.
Core challenges you’ll face:
- Multiple entropy backends (Huffman for literals, FSE for sequences) → maps to hybrid coding
- Sequence coding (literal length, match length, offset as tuples) → maps to structured encoding
- Repeat offsets (caching recent match distances) → maps to statistical optimization
- Compression levels (trading speed for ratio) → maps to configurable algorithms
- Dictionary compression (training on sample data) → maps to prior knowledge
Key Concepts:
- Zstandard Specification: RFC 8878
- FSE Algorithm: Finite State Entropy
- Implementation: zstd Source
- Comparison: Zstd vs Brotli
Difficulty: Master. Time estimate: 1-2 months. Prerequisites: All previous projects, deep understanding of entropy coding.
Real world outcome:
$ ./myzstd -1 compress linux-kernel.tar kernel.zst
Original: 850,000,000 bytes
Compressed: 150,000,000 bytes (82% reduction)
Speed: 400 MB/s
$ ./myzstd -19 compress linux-kernel.tar kernel-max.zst
Compressed: 95,000,000 bytes (89% reduction)
Speed: 5 MB/s (but still decompresses at 800 MB/s!)
$ ./myzstd --train-dict samples/ -o custom.dict
Trained dictionary on 1000 samples
Dictionary size: 112 KB
$ ./myzstd -D custom.dict compress small_files/*.json output.zst
Compression ratio improved by 40% with dictionary!
Implementation Hints: zstd encodes “sequences”: (literal_length, match_length, offset) tuples. Use FSE for the match/literal length codes, and Huffman or FSE for literals. Repeat offsets: track last 3 match offsets; if new offset matches one, use special codes. For compression levels: level 1 uses fast hash matching (like LZ4), high levels use optimal parsing with lazy evaluation. Dictionary: prepopulate the hash table and LZ77 window with trained data. Reference the actual zstd source—it’s well-commented.
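The repeat-offset idea is simple to sketch. Treat this as a simplified model: real zstd's repcode rules have extra cases (for example around zero literal lengths), and the names here are illustrative.
#include <stdint.h>

/* Keep the three most recently used offsets; a match that reuses one of them is
   signalled with a short repeat code instead of a full offset. */
typedef struct { uint32_t rep[3]; } rep_state;

int offset_code(rep_state *rs, uint32_t offset) {
    for (int i = 0; i < 3; i++) {
        if (rs->rep[i] == offset) {
            for (int j = i; j > 0; j--) rs->rep[j] = rs->rep[j - 1];   /* move the hit to front */
            rs->rep[0] = offset;
            return i;                       /* repeat code 0..2: cheap to encode */
        }
    }
    rs->rep[2] = rs->rep[1];                /* not a repeat: shift history, encode offset fully */
    rs->rep[1] = rs->rep[0];
    rs->rep[0] = offset;
    return -1;
}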
Learning milestones:
- Basic compression works → You’ve integrated the components
- Ratio matches gzip at 5x speed → You’ve achieved the zstd value proposition
- Multiple levels work → You’ve implemented configurable compression
- Dictionary training improves small files → You understand the full system
Project 10: JPEG Encoder from Scratch
- File: COMPRESSION_ALGORITHMS_DEEP_DIVE_PROJECTS.md
- Main Programming Language: C
- Alternative Programming Languages: Rust, C++, Python
- Coolness Level: Level 5: Pure Magic
- Business Potential: 4. The “Open Core” Infrastructure
- Difficulty: Level 4: Expert
- Knowledge Area: Image Compression / DCT
- Software or Tool: JPEG Encoder
- Main Book: “JPEG Still Image Data Compression Standard” by Pennebaker & Mitchell
What you’ll build: A JPEG encoder that converts raw RGB images to JPEG files, implementing color space conversion, DCT, quantization, and entropy coding.
Why it teaches compression fundamentals: JPEG is the canonical lossy compressor. It demonstrates transform coding (DCT), quantization (controlled quality loss), and exploiting human perception (we don’t see high frequencies well). These same principles appear in video (H.264), audio (MP3), and modern image formats (AVIF).
Core challenges you’ll face:
- Color space conversion (RGB → YCbCr, chroma subsampling) → maps to perceptual coding
- DCT computation (8x8 blocks, frequency domain) → maps to transform coding
- Quantization tables (quality factor, DC/AC treatment) → maps to lossy trade-offs
- Zig-zag ordering (low to high frequency) → maps to run-length optimization
- Huffman coding (DC deltas, AC run-length pairs) → maps to entropy coding
Key Concepts:
- DCT Explained: Understanding DCT
- JPEG Standard: “JPEG Still Image Data Compression Standard” - Pennebaker & Mitchell
- Quantization: Stanford JPEG Tutorial
- Implementation Guide: JPEG Explained (Computerphile)
Difficulty: Expert. Time estimate: 3-4 weeks. Prerequisites: Linear algebra (matrices), bit manipulation, Huffman coding.
Real world outcome:
$ ./myjpeg encode photo.ppm photo.jpg --quality 85
Original: 12,000,000 bytes (4000x3000 RGB)
Compressed: 1,234,567 bytes (90% reduction)
PSNR: 38.5 dB (excellent quality)
$ ./myjpeg --visualize-dct photo.ppm
Block (10, 20):
DC: 1024 (average brightness)
AC[0,1]: 156 (horizontal edge)
AC[1,0]: -89 (vertical edge)
AC[7,7]: 2 (high frequency, will be quantized to 0)
$ diff <(djpeg photo.jpg) <(djpeg reference.jpg)
# Visually identical to libjpeg output!
$ ./myjpeg encode photo.ppm photo-q10.jpg --quality 10
Compressed: 123,456 bytes
PSNR: 28.1 dB (visible artifacts, but recognizable)
Implementation Hints: RGB to YCbCr: Y = 0.299R + 0.587G + 0.114B, Cb/Cr similar. Subsample Cb/Cr 2:1 horizontally and vertically (4:2:0). For DCT: use the standard 8x8 transform matrix (or fast algorithm). Quantization: divide DCT coefficients by quantization table, round. Quality factor scales the table. Zig-zag: reorder 64 coefficients from low to high frequency. Encode DC as delta from previous block (Huffman). Encode AC as (run of zeros, value) pairs (Huffman with special tables). Write JFIF markers and headers.
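A naive (unoptimized) sketch of the 8x8 forward DCT and the quantization step; real encoders use a fast factored DCT, but this form matches the textbook definition and is fine for a first milestone.
#include <math.h>
#include <stdint.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

/* Textbook 8x8 forward DCT-II (slow but clear); input samples are level-shifted
   to the range -128..127 before calling this. */
void dct_8x8(const double in[8][8], double out[8][8]) {
    for (int u = 0; u < 8; u++)
        for (int v = 0; v < 8; v++) {
            double sum = 0.0;
            for (int x = 0; x < 8; x++)
                for (int y = 0; y < 8; y++)
                    sum += in[x][y] * cos((2 * x + 1) * u * M_PI / 16.0)
                                    * cos((2 * y + 1) * v * M_PI / 16.0);
            double cu = (u == 0) ? 1.0 / sqrt(2.0) : 1.0;
            double cv = (v == 0) ? 1.0 / sqrt(2.0) : 1.0;
            out[u][v] = 0.25 * cu * cv * sum;
        }
}

/* Quantization: divide each coefficient by its table entry and round. The large
   entries at high frequencies are what actually discard information. */
void quantize_block(const double dct[8][8], const uint8_t qtable[8][8], int16_t out[8][8]) {
    for (int u = 0; u < 8; u++)
        for (int v = 0; v < 8; v++)
            out[u][v] = (int16_t)lround(dct[u][v] / qtable[u][v]);
}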
Learning milestones:
- DCT/IDCT produces correct output → You understand the transform
- Quantization shows quality trade-off → You understand lossy compression
- Output opens in image viewers → You’ve got the format right
- Quality matches libjpeg → You’ve built a real encoder
Project 11: PNG Encoder from Scratch
- File: COMPRESSION_ALGORITHMS_DEEP_DIVE_PROJECTS.md
- Main Programming Language: C
- Alternative Programming Languages: Rust, C++
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 3. The “Service & Support” Model
- Difficulty: Level 3: Advanced
- Knowledge Area: Lossless Image Compression
- Software or Tool: PNG Encoder
- Main Book: “PNG: The Definitive Guide” by Greg Roelofs
What you’ll build: A PNG encoder implementing filtering, DEFLATE compression, and the PNG file format. Unlike JPEG, PNG is lossless—perfect for screenshots, diagrams, and images with text.
Why it teaches compression fundamentals: PNG shows how general-purpose compression (DEFLATE) can be optimized for images using preprocessing (filtering). The filter step exploits image-specific redundancy: pixels near each other are usually similar.
Core challenges you’ll face:
- PNG filtering (None, Sub, Up, Average, Paeth) → maps to prediction-based coding
- Filter selection (per-row optimal filter) → maps to adaptive preprocessing
- Interlacing (Adam7 progressive display) → maps to incremental rendering
- CRC checksums (per-chunk validation) → maps to data integrity
- Color types (grayscale, RGB, palette, alpha) → maps to format flexibility
Key Concepts:
- PNG Specification: PNG RFC
- Filtering Explained: PNG Filtering
- Optimization: OptiPNG Guide
- DEFLATE in PNG: “PNG: The Definitive Guide” Chapter 9 - Roelofs
Difficulty: Advanced. Time estimate: 2-3 weeks. Prerequisites: DEFLATE implementation, image basics.
Real world outcome:
$ ./mypng encode screenshot.ppm screenshot.png
Original: 6,000,000 bytes (2000x1000 RGB)
Compressed: 234,567 bytes (96% reduction)
Filter stats: Sub 40%, Up 35%, Average 20%, Paeth 5%
$ ./mypng --analyze screenshot.png
Chunk breakdown:
IHDR: 13 bytes (header)
IDAT: 234,500 bytes (compressed data)
IEND: 0 bytes (end marker)
$ pngcheck screenshot.png
OK: screenshot.png (2000x1000, 24-bit RGB, non-interlaced)
$ ./mypng encode diagram.ppm diagram.png --interlace
Interlaced output for progressive display
Implementation Hints: For each row, compute all 5 filter outputs: None (raw), Sub (pixel - left), Up (pixel - above), Average ((left + above) / 2), Paeth (nonlinear predictor). Choose filter that minimizes sum of absolute differences (heuristic). Prepend filter byte to each row. DEFLATE-compress the result. Write PNG chunks: IHDR (width, height, bit depth, color type), IDAT (compressed data), IEND (marker). Compute CRC32 for each chunk. For optimization, try different DEFLATE compression levels and filter strategies.
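The Paeth predictor is the only non-obvious filter, so here is a sketch of it and of filtering one row with it (filter type 4). `bpp` is bytes per pixel, and missing neighbors on the first row or column count as zero, as the specification requires; function names are illustrative.
#include <stddef.h>
#include <stdlib.h>

/* Paeth predictor: choose whichever of left (a), above (b), upper-left (c) is
   closest to the linear estimate a + b - c. */
static unsigned char paeth(unsigned char a, unsigned char b, unsigned char c) {
    int p = (int)a + (int)b - (int)c;
    int pa = abs(p - a), pb = abs(p - b), pc = abs(p - c);
    if (pa <= pb && pa <= pc) return a;
    if (pb <= pc) return b;
    return c;
}

/* Apply filter type 4 (Paeth) to one row of `len` bytes; prev is NULL for the first row. */
void filter_paeth_row(const unsigned char *row, const unsigned char *prev,
                      unsigned char *out, size_t len, size_t bpp) {
    for (size_t i = 0; i < len; i++) {
        unsigned char a = (i >= bpp) ? row[i - bpp] : 0;
        unsigned char b = prev ? prev[i] : 0;
        unsigned char c = (prev && i >= bpp) ? prev[i - bpp] : 0;
        out[i] = (unsigned char)(row[i] - paeth(a, b, c));
    }
}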
Learning milestones:
- Basic PNG encodes/decodes → You understand the format
- Filters improve compression ratio → You see prediction value
- Output validates with pngcheck → You’ve got the format right
- Ratio matches libpng → You’ve optimized filter selection
Project 12: MP3 Encoder (Simplified)
- File: COMPRESSION_ALGORITHMS_DEEP_DIVE_PROJECTS.md
- Main Programming Language: C
- Alternative Programming Languages: Rust, C++
- Coolness Level: Level 5: Pure Magic
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 5: Master
- Knowledge Area: Audio Compression / Psychoacoustics
- Software or Tool: Audio Encoder
- Main Book: “Introduction to Digital Audio Coding and Standards” by Bosi & Goldberg
What you’ll build: A simplified MP3-like encoder implementing MDCT, psychoacoustic modeling, quantization, and Huffman coding. You’ll hear the trade-off between quality and file size.
Why it teaches compression fundamentals: Audio compression exploits human hearing limitations: we can’t hear quiet sounds masked by loud ones, and we’re less sensitive to certain frequencies. MP3 pioneered this “perceptual coding” approach now used everywhere.
Core challenges you’ll face:
- MDCT transform (time to frequency domain) → maps to transform coding
- Psychoacoustic model (masking thresholds) → maps to perceptual coding
- Bit allocation (which frequencies need more bits) → maps to resource allocation
- Huffman tables (MP3’s predefined 32 tables) → maps to standardized coding
- Frame structure (headers, side info, main data) → maps to format design
Key Concepts:
- MDCT in MP3: MP3 MDCT Tutorial
- Psychoacoustic Model: Stanford MP3 Psychoacoustics
- MP3 Format: MP3 Technical Spec
- LAME Source: LAME MP3 Encoder
Difficulty: Master. Time estimate: 1-2 months. Prerequisites: Signal processing basics, DCT understanding, Huffman coding.
Real world outcome:
$ ./mymp3 encode song.wav song.mp3 --bitrate 128
Original: 52,920,000 bytes (5 minutes, 44.1kHz 16-bit stereo)
Compressed: 4,800,000 bytes (128 kbps)
Compression ratio: 11:1
$ ./mymp3 --analyze song.wav
Frame 1000:
Masking threshold at 1kHz: -45 dB
Signal at 1kHz: -20 dB (25 dB above threshold, needs bits)
Signal at 15kHz: -48 dB (below threshold, can remove!)
$ mpg123 song.mp3 # Plays in standard MP3 player!
$ ./mymp3 encode song.wav song-32.mp3 --bitrate 32
Audible artifacts (expected at low bitrate)
Implementation Hints: Use 1152 samples per frame (576 per granule). Apply polyphase filter bank (32 subbands) then MDCT (18 coefficients per subband = 576 total). Psychoacoustic model: compute FFT, find masking tones, calculate masking thresholds. Quantize coefficients: lower precision where masked. Select Huffman table per block that minimizes bits. Write frame header (sync, bitrate, sample rate), side info (block types, scale factors), and main data (Huffman-coded coefficients). Start with existing psychoacoustic model code—the math is complex.
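A naive sketch of the MDCT itself, straight from its definition; windowing and the polyphase filter bank are omitted, and MP3 uses N = 18 per subband for long blocks. Real encoders use a fast factorization instead of this double loop.
#include <math.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

/* Naive MDCT: 2N windowed time samples in, N frequency coefficients out. */
void mdct(const double *x, double *X, int N) {
    for (int k = 0; k < N; k++) {
        double sum = 0.0;
        for (int n = 0; n < 2 * N; n++)
            sum += x[n] * cos(M_PI / N * (n + 0.5 + N / 2.0) * (k + 0.5));
        X[k] = sum;
    }
}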
Learning milestones:
- MDCT transforms correctly → You understand the frequency domain
- Output plays in MP3 players → You’ve got the format right
- Quality is acceptable at 128kbps → Your quantization works
- Low bitrate sounds worse (as expected) → You understand the trade-offs
Project 13: Video Codec Fundamentals (H.264-lite)
- File: COMPRESSION_ALGORITHMS_DEEP_DIVE_PROJECTS.md
- Main Programming Language: C
- Alternative Programming Languages: Rust, C++
- Coolness Level: Level 5: Pure Magic
- Business Potential: 5. The “Industry Disruptor”
- Difficulty: Level 5: Master
- Knowledge Area: Video Compression
- Software or Tool: Video Encoder
- Main Book: “H.264 and MPEG-4 Video Compression” by Iain Richardson
What you’ll build: A simplified video encoder implementing I-frames (like JPEG), P-frames (motion-compensated prediction), and basic entropy coding. You’ll see how video achieves 100:1 compression.
Why it teaches compression fundamentals: Video compression extends image compression with temporal prediction: frames are similar to previous frames. This “motion compensation” is why video files are so small despite having 30+ images per second.
Core challenges you’ll face:
- Motion estimation (finding where blocks moved) → maps to search algorithms
- Motion compensation (predicting from previous frame) → maps to prediction-based coding
- Residual coding (encoding the difference) → maps to transform coding (DCT)
- Frame types (I/P/B frames) → maps to temporal dependencies
- Rate control (hitting target bitrate) → maps to resource allocation
Key Concepts:
- H.264 Overview: Wikipedia H.264
- Motion Estimation: “H.264 and MPEG-4 Video Compression” Chapter 6 - Richardson
- GOP Structure: FastPix Codec Guide
- x264 Source: x264 Encoder
Difficulty: Master. Time estimate: 2-3 months. Prerequisites: JPEG encoder, motion vectors concept, significant time investment.
Real world outcome:
$ ./myvideo encode clip.y4m clip.264 --bitrate 2000
Original: 2,799,360,000 bytes (30 sec, 1080p 4:2:0, 30fps)
Compressed: 7,500,000 bytes (2 Mbps)
Compression ratio: 373:1
$ ./myvideo --analyze clip.264
Frame types: I P P P P I P P P...
GOP size: 5
Average motion vector: (2.3, 1.1) pixels
Residual bits: 40% of total (motion saves 60%!)
$ ffplay clip.264 # Plays in ffmpeg!
$ ./myvideo encode clip.y4m clip-lowrate.264 --bitrate 500
Lower quality but still watchable
Implementation Hints: Start with I-frames only (just JPEG). Add P-frames: divide into 16x16 macroblocks, search previous frame for best match (minimize sum of absolute differences), encode motion vector and residual. For motion search: start with full search in ±16 pixels, then optimize with diamond or hexagonal search. Residual: DCT, quantize, entropy code (CAVLC is simpler than CABAC). For B-frames (advanced): search both previous and next frames. Use x264 as reference—it’s well-documented.
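A sketch of full-search motion estimation for one 16x16 macroblock on a luma plane, using sum of absolute differences as the hints suggest; the faster diamond or hexagonal searches replace the two outer loops. Names are illustrative.
#include <limits.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct { int dx, dy; uint32_t sad; } motion_vec;

/* Sum of absolute differences between two 16x16 blocks with row stride `stride`. */
static uint32_t sad_16x16(const uint8_t *cur, const uint8_t *ref, int stride) {
    uint32_t sad = 0;
    for (int y = 0; y < 16; y++)
        for (int x = 0; x < 16; x++)
            sad += (uint32_t)abs(cur[y * stride + x] - ref[y * stride + x]);
    return sad;
}

/* Full search in a +/-16 pixel window of the previous frame for the macroblock
   whose top-left corner is (bx, by) in the current frame. */
motion_vec search_block(const uint8_t *cur_frame, const uint8_t *prev_frame,
                        int width, int height, int bx, int by) {
    const uint8_t *cur = cur_frame + by * width + bx;
    motion_vec best = { 0, 0, UINT32_MAX };
    for (int dy = -16; dy <= 16; dy++)
        for (int dx = -16; dx <= 16; dx++) {
            int rx = bx + dx, ry = by + dy;
            if (rx < 0 || ry < 0 || rx + 16 > width || ry + 16 > height) continue;
            uint32_t s = sad_16x16(cur, prev_frame + ry * width + rx, width);
            if (s < best.sad) best = (motion_vec){ dx, dy, s };
        }
    return best;
}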
Learning milestones:
- I-frames encode correctly → You’ve built on JPEG
- P-frames with motion compensation work → You understand video compression
- Output plays in video players → You’ve got the format right
- Quality is reasonable at target bitrate → Your rate control works
Project 14: Delta Encoding & Diff Algorithm
- File: COMPRESSION_ALGORITHMS_DEEP_DIVE_PROJECTS.md
- Main Programming Language: C
- Alternative Programming Languages: Rust, Python, Go
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 4. The “Open Core” Infrastructure
- Difficulty: Level 2: Intermediate
- Knowledge Area: Differential Compression
- Software or Tool: Delta Compressor
- Main Book: “Managing Gigabytes” by Witten, Moffat, Bell
What you’ll build: A delta encoder that compresses a file by storing only the differences from a reference file. This is how Git stores changes, rsync syncs files, and game patches work.
Why it teaches compression fundamentals: Delta encoding exploits a different kind of redundancy: similarity between related files. A new version of a file is often 99% identical to the old version—delta encoding transmits only the 1%.
Core challenges you’ll face:
- Common subsequence finding (longest match detection) → maps to diff algorithms
- Edit operations (insert, delete, copy) → maps to operation encoding
- Block hashing (rsync’s rolling hash) → maps to incremental hashing
- Reference management (base file dependency) → maps to versioning systems
- Combining with compression (delta + LZ) → maps to layered compression
Key Concepts:
- Rsync Algorithm: rsync Technical Report
- Myers Diff: An O(ND) Difference Algorithm
- VCDIFF Format: RFC 3284
- Git Packfiles: Git Internals - Packfiles
Difficulty: Intermediate. Time estimate: 1-2 weeks. Prerequisites: Hash tables, basic LZ understanding.
Real world outcome:
$ ./delta create old_version.bin new_version.bin patch.delta
Old size: 10,000,000 bytes
New size: 10,050,000 bytes
Delta size: 125,000 bytes (1.2% of new file!)
$ ./delta apply old_version.bin patch.delta restored.bin
Applied 47 copy operations, 12 insert operations
Restored file matches new_version.bin: YES
$ ./delta analyze old_version.bin new_version.bin
Common blocks: 98.7%
Insertions: 45KB at offsets 1234, 56789, ...
Deletions: 5KB at offsets 9999, ...
# Works for game patches:
$ ./delta create game_v1.0.bin game_v1.1.bin update.patch
Patch size: 50MB (vs 5GB full download!)
Implementation Hints: Split the reference file into fixed-size blocks (e.g., 4KB). Compute hash of each block, store in hash table mapping hash → offset. For new file: use rolling hash (rsync’s adler32-like) to find blocks matching reference. Output sequence of: COPY(offset, length) for matches, INSERT(data) for new content. For better results, use variable-length matching like LZ77, but match against reference instead of recent data. VCDIFF is the standard delta format.
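A sketch of a rolling checksum in the spirit of rsync's weak hash; the point is that sliding the window by one byte is O(1), so every alignment of the new file can be checked against the block table. Names and the digest layout are illustrative.
#include <stddef.h>
#include <stdint.h>

/* Two running sums that can be updated in O(1) when the window slides by one byte. */
typedef struct { uint32_t a, b; size_t len; } rolling_sum;

void rs_init(rolling_sum *rs, const uint8_t *buf, size_t len) {
    rs->a = rs->b = 0; rs->len = len;
    for (size_t i = 0; i < len; i++) {
        rs->a += buf[i];
        rs->b += (uint32_t)(len - i) * buf[i];   /* older bytes carry larger weight */
    }
}

/* Slide the window: drop `out` (oldest byte), add `in` (newest byte). */
void rs_roll(rolling_sum *rs, uint8_t out, uint8_t in) {
    rs->a += in - out;
    rs->b += rs->a - (uint32_t)rs->len * out;
}

uint32_t rs_digest(const rolling_sum *rs) {
    return (rs->b << 16) | (rs->a & 0xFFFF);     /* pack both sums into one 32-bit key */
}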
Learning milestones:
- Basic delta works for similar files → You understand the concept
- Rolling hash enables efficient matching → You can handle large files
- Patch size is proportional to differences → Your algorithm is correct
- Patches work for binary files (not just text) → You handle all data types
Project 15: Context-Based Compression (PPM)
- File: COMPRESSION_ALGORITHMS_DEEP_DIVE_PROJECTS.md
- Main Programming Language: C
- Alternative Programming Languages: Rust, C++
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 3. The “Service & Support” Model
- Difficulty: Level 4: Expert
- Knowledge Area: Statistical Modeling
- Software or Tool: Context Compressor
- Main Book: “Text Compression” by Bell, Cleary, Witten
What you’ll build: A Prediction by Partial Matching (PPM) compressor that uses context (previous bytes) to predict the next byte. After “th”, ‘e’ is very likely—PPM exploits this.
Why it teaches compression fundamentals: PPM achieves among the best compression ratios for text. It shows that better statistical modeling = better compression. Modern neural network compressors are essentially sophisticated PPM.
Core challenges you’ll face:
- Context modeling (storing predictions for each context) → maps to trie data structures
- Escape mechanism (what if context hasn’t seen this symbol) → maps to smoothing techniques
- Order blending (combining predictions from different context lengths) → maps to ensemble methods
- Memory management (tries grow large) → maps to bounded-memory algorithms
- Arithmetic coding integration (variable probabilities) → maps to entropy coding
Key Concepts:
- PPM Algorithm: “Text Compression” Chapter 4 - Bell, Cleary, Witten
- Escape Methods: PPM Escape Mechanisms
- PAQ Compressors: Matt Mahoney’s PAQ
- Context Mixing: Context Mixing Explanation
Difficulty: Expert. Time estimate: 3-4 weeks. Prerequisites: Arithmetic coding, trie structures, probability.
Real world outcome:
$ ./ppm compress book.txt book.ppm
Original: 500,000 bytes
Compressed: 145,000 bytes (71% reduction)
Context order: 5
Comparison:
gzip: 192,000 bytes
bzip2: 152,000 bytes
PPM: 145,000 bytes (best!)
$ ./ppm --analyze book.txt
After "the ": 'a' 23%, 'm' 18%, 'b' 12%...
After "tion": ' ' 45%, 's' 32%, 'a' 8%...
Context "the " has 234 different followers
$ ./ppm decompress book.ppm restored.txt
Files match: YES
Implementation Hints: Build a trie where each node represents a context (sequence of bytes). Each node stores counts for each possible next byte. For encoding: look up current context (last N bytes), get probability distribution, encode with arithmetic coder. If symbol not seen in this context, emit escape symbol and try shorter context. PPM-D uses escape method D (count of distinct symbols). For memory limits, use context hashing or pruning. Start with order 4-5; higher orders need more memory but compress better.
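A toy order-2 context model to make the idea concrete. Real PPM keeps a trie over multiple orders with a proper escape mechanism; this sketch uses a flat table, add-one smoothing, and 16-bit counts, and all names are illustrative.
#include <stdint.h>

/* One slot per 2-byte context; the table is 32 MB, which is exactly the memory
   pressure the hints warn about. */
#define CTX_COUNT (1u << 16)

static uint16_t counts[CTX_COUNT][256];

static uint32_t ctx_of(uint8_t c1, uint8_t c2) { return ((uint32_t)c1 << 8) | c2; }

/* Scaled probability that `next` follows the context (c1, c2); feed freq/total
   to the arithmetic coder. */
uint32_t predict(uint8_t c1, uint8_t c2, uint8_t next, uint32_t *total) {
    const uint16_t *t = counts[ctx_of(c1, c2)];
    uint32_t sum = 0;
    for (int i = 0; i < 256; i++) sum += t[i];
    *total = sum + 256;                        /* +1 per symbol: crude escape smoothing */
    return (uint32_t)t[next] + 1;
}

void update_model(uint8_t c1, uint8_t c2, uint8_t next) {
    counts[ctx_of(c1, c2)][next]++;
}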
Learning milestones:
- Basic PPM compresses/decompresses → You understand context modeling
- Escape mechanism handles novel contexts → You’ve solved the zero-probability problem
- Compression beats gzip/bzip2 on text → Your model is good
- Memory-bounded version works → You can handle large files
Project 16: Neural Network Compressor (Simple)
- File: COMPRESSION_ALGORITHMS_DEEP_DIVE_PROJECTS.md
- Main Programming Language: Python
- Alternative Programming Languages: C++ with LibTorch, Rust
- Coolness Level: Level 5: Pure Magic
- Business Potential: 5. The “Industry Disruptor”
- Difficulty: Level 5: Master
- Knowledge Area: Learned Compression
- Software or Tool: Neural Compressor
- Main Book: “Neural Data Compression” (Various Papers)
What you’ll build: A compressor using a simple neural network (LSTM or Transformer) to predict the next byte, combined with arithmetic coding. This is the frontier of compression research.
Why it teaches compression fundamentals: Neural compressors show that compression = prediction. A perfect predictor gives perfect compression. Modern AI compressors like NNCP and DeepZip achieve state-of-the-art ratios by leveraging deep learning’s pattern recognition.
Core challenges you’ll face:
- Probability output (network predicts distribution over next byte) → maps to softmax modeling
- Online learning (model adapts while compressing) → maps to adaptive compression
- Bit-back coding (efficient latent coding) → maps to VAE compression
- Model size overhead (model must ship with data) → maps to break-even point
- Speed (neural networks are slow) → maps to practical trade-offs
Key Concepts:
- CMIX Compressor: CMIX - Best Compression
- NNCP Paper: Neural Network Compression
- Arithmetic Coding with NN: DeepZip
- Large Text Compression Benchmark: Hutter Prize
Difficulty: Master. Time estimate: 2-3 months. Prerequisites: Deep learning basics, arithmetic coding, PPM understanding.
Real world outcome:
$ ./neural_compress train corpus/*.txt --model model.pt
Training neural predictor...
Epoch 100: perplexity 2.34 (entropy 1.23 bits/byte)
$ ./neural_compress compress book.txt book.nc --model model.pt
Original: 500,000 bytes
Compressed: 125,000 bytes + 5,000,000 bytes (model)
Net size: 5,125,000 bytes (worse than gzip!)
$ ./neural_compress compress large_corpus.txt corpus.nc --model model.pt
Original: 1,000,000,000 bytes
Compressed: 98,000,000 bytes + 5,000,000 bytes (model)
Net compression: 10.3% of original (better than everything!)
# For small files, neural compression loses to model overhead
# For large files or pre-shared model, it wins
Implementation Hints: Use a character-level LSTM or small Transformer (1-2 layers). Input: last N bytes (embedded). Output: softmax over 256 possible next bytes. Train on similar data or train online while compressing. For compression: for each byte, get predicted distribution from network, encode with arithmetic coder, update network with actual byte. Key insight: better prediction = lower entropy = smaller output. Start with PyTorch for prototyping. For practical use, the model overhead (~5MB) only pays off for very large files or when model is pre-shared.
Learning milestones:
- Neural network predicts next byte → You’ve built the model
- Arithmetic coding uses predictions → You’ve integrated with compression
- Large files compress better than classical → Neural compression works!
- Online adaptation improves results → You understand adaptive compression
Project 17: Compression Benchmarking Suite
- File: COMPRESSION_ALGORITHMS_DEEP_DIVE_PROJECTS.md
- Main Programming Language: Python + C
- Alternative Programming Languages: Rust, Go
- Coolness Level: Level 2: Practical but Forgettable
- Business Potential: 3. The “Service & Support” Model
- Difficulty: Level 2: Intermediate
- Knowledge Area: Performance Analysis
- Software or Tool: Benchmark Tool
- Main Book: “The Art of Computer Systems Performance Analysis” by Raj Jain
What you’ll build: A benchmarking suite that compares compression algorithms across different data types (text, images, binaries), measuring ratio, speed, memory, and CPU usage.
Why it teaches compression fundamentals: Different algorithms excel on different data. Benchmarking reveals the trade-offs: LZ4 for speed, zstd for balance, LZMA for ratio, PPM for text. Understanding when to use each algorithm is as important as understanding how they work.
Core challenges you’ll face:
- Fair comparison (same settings, warm cache) → maps to experimental methodology
- Data variety (text, binary, images, random) → maps to workload characterization
- Metric selection (ratio vs speed vs memory) → maps to multi-objective optimization
- Statistical significance (variance between runs) → maps to statistical analysis
- Visualization (Pareto frontiers, charts) → maps to data presentation
Key Concepts:
- Compression Benchmarks: Squash Benchmark
- Silesia Corpus: Standard Test Data
- Large Text Benchmark: Matt Mahoney’s Benchmarks
- Performance Analysis: “Systems Performance” by Brendan Gregg
Difficulty: Intermediate. Time estimate: 1 week. Prerequisites: Basic statistics, scripting, system monitoring.
Real world outcome:
$ ./benchmark --corpus silesia/ --algorithms gzip,lz4,zstd,bzip2
Running benchmarks on 12 files, 211 MB total...
Results:
| Algorithm | Ratio | Compress | Decompress | Memory |
|-----------|-------|----------|------------|--------|
| lz4 | 2.1x | 650 MB/s | 2800 MB/s | 16 MB |
| zstd -1 | 2.8x | 450 MB/s | 1200 MB/s | 32 MB |
| zstd -19 | 3.5x | 3 MB/s | 1200 MB/s | 128 MB |
| gzip | 3.2x | 45 MB/s | 280 MB/s | 256 KB |
| bzip2 | 3.8x | 15 MB/s | 55 MB/s | 7 MB |
Best for speed: lz4
Best for ratio: bzip2
Best overall: zstd (Pareto optimal)
$ ./benchmark --visualize results.json
[Generates Pareto frontier chart: ratio vs speed]
Implementation Hints: Use standard test corpora (Silesia, Calgary, Canterbury). For each algorithm, run multiple times, discard first run (warm cache), compute mean and variance. Measure wall-clock time, CPU time, peak memory (via /proc or similar). Compute compression ratio, speed (bytes/second), and memory per byte. Plot Pareto frontier: algorithms on the efficient frontier offer best trade-offs. Include random data (incompressible) to verify no algorithm expands it much.
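A minimal C sketch of the timing discipline described above (untimed warm-up pass, repeated measurements, monotonic clock). The function name and the function-pointer workload are illustrative; a real harness would also record mean and variance rather than just the best pass.
#include <stddef.h>
#include <time.h>

/* Run `work` once untimed to warm the page cache, then time `reps` passes and
   report the best throughput in MB/s. Requires a POSIX clock_gettime. */
double best_mbps(void (*work)(void), size_t bytes, int reps) {
    double best = 0.0;
    work();                                           /* warm-up pass, not timed */
    for (int r = 0; r < reps; r++) {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        work();
        clock_gettime(CLOCK_MONOTONIC, &t1);
        double secs = (double)(t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        double mbps = (double)bytes / secs / 1e6;
        if (mbps > best) best = mbps;
    }
    return best;
}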
Learning milestones:
- Benchmarks run consistently → You understand experimental methodology
- Results match published benchmarks → Your methodology is sound
- You can recommend algorithms for use cases → You understand trade-offs
- Visualizations show clear patterns → You can communicate results
Project 18: Build Your Own Mini-Compression Library
- File: COMPRESSION_ALGORITHMS_DEEP_DIVE_PROJECTS.md
- Main Programming Language: C
- Alternative Programming Languages: Rust, C++
- Coolness Level: Level 5: Pure Magic
- Business Potential: 4. The “Open Core” Infrastructure
- Difficulty: Level 4: Expert
- Knowledge Area: Library Design
- Software or Tool: Compression Library
- Main Book: “The Practice of Programming” by Kernighan & Pike
What you’ll build: A unified compression library with a clean API, supporting multiple algorithms (LZ4, zstd-lite, Huffman), streaming, and configurable compression levels.
Why it teaches compression fundamentals: Building a library forces you to think about API design, memory management, error handling, and performance. You’ll understand why real libraries (zlib, zstd) are designed the way they are.
Core challenges you’ll face:
- API design (simple yet flexible interface) → maps to library design
- Streaming (compress without knowing total size) → maps to incremental processing
- Memory management (caller-provided vs library-managed) → maps to resource ownership
- Error handling (graceful failure, informative errors) → maps to robust code
- Thread safety (concurrent compression) → maps to parallel design
Key Concepts:
- zlib API Design: zlib Manual
- zstd API: zstd Documentation
- Library Design: “The Practice of Programming” Chapter 4 - Kernighan & Pike
- API Design Principles: “The Design of Everyday Things” - Don Norman
Difficulty: Expert. Time estimate: 2-3 weeks. Prerequisites: All previous compression projects, library design experience.
Real world outcome:
#include "minicompress.h"
// Simple one-shot API
mc_compress(src, src_len, dst, &dst_len, MC_ALGO_ZSTD, MC_LEVEL_DEFAULT);
// Streaming API
mc_stream_t *s = mc_stream_new(MC_ALGO_LZ4, MC_COMPRESS);
while (has_more_data) {
    mc_stream_input(s, chunk, chunk_len);
    while (mc_stream_output_available(s)) {
        size_t out_len;
        mc_stream_output(s, buffer, sizeof(buffer), &out_len);
        write(fd, buffer, out_len);
    }
}
mc_stream_finish(s);
mc_stream_free(s);
// Algorithm enumeration
for (mc_algo_t a = mc_algo_first(); a != MC_ALGO_END; a = mc_algo_next(a)) {
    printf("%s: ratio %.1fx, speed %d MB/s\n",
           mc_algo_name(a), mc_algo_typical_ratio(a), mc_algo_typical_speed(a));
}
Implementation Hints:
Design a clean C API with opaque types. Provide both one-shot (compress(src, dst)) and streaming (stream_input(), stream_output()) interfaces. Use function pointers for algorithm dispatch. Include bounds checking, input validation, and meaningful error codes. For streaming, implement internal buffering. For thread safety, avoid global state—keep all context in the stream object. Document memory ownership clearly. Provide both static and dynamic linking options.
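A sketch of the function-pointer dispatch suggested above, reusing the mc_* naming from the example API; the vtable type and backend table are illustrative, not part of any existing library.
#include <stddef.h>

/* One entry per supported algorithm; each backend fills in its own table slot. */
typedef struct {
    const char *name;
    int (*compress)(const void *src, size_t src_len, void *dst, size_t *dst_len, int level);
    int (*decompress)(const void *src, size_t src_len, void *dst, size_t *dst_len);
} mc_algo_vtable;

extern const mc_algo_vtable mc_backends[];   /* indexed by the MC_ALGO_* values */

/* The public one-shot entry point just dispatches through the table. */
int mc_compress(const void *src, size_t src_len, void *dst, size_t *dst_len,
                int algo, int level) {
    return mc_backends[algo].compress(src, src_len, dst, dst_len, level);
}
Keeping all per-call state inside the stream object (and none in globals) is also what makes the thread-safety goal straightforward.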
Learning milestones:
- One-shot API works for all algorithms → You understand abstraction
- Streaming API handles arbitrary input sizes → You understand buffering
- Memory management is correct (no leaks) → You’ve verified with valgrind
- Library is usable by others → You’ve achieved good API design
Project Comparison Table
| # | Project | Difficulty | Time | Depth | Fun Factor | Type |
|---|---|---|---|---|---|---|
| 1 | RLE | Beginner | 1-2 days | ★★☆☆☆ | ★★☆☆☆ | Lossless |
| 2 | Huffman | Intermediate | 1 week | ★★★★☆ | ★★★☆☆ | Lossless |
| 3 | LZ77 | Advanced | 2-3 weeks | ★★★★☆ | ★★★★☆ | Lossless |
| 4 | DEFLATE | Expert | 3-4 weeks | ★★★★★ | ★★★★☆ | Lossless |
| 5 | Arithmetic | Expert | 2-3 weeks | ★★★★★ | ★★★☆☆ | Lossless |
| 6 | ANS | Expert | 2-3 weeks | ★★★★★ | ★★★★☆ | Lossless |
| 7 | BWT | Advanced | 2-3 weeks | ★★★★☆ | ★★★★☆ | Lossless |
| 8 | LZ4-style | Advanced | 1-2 weeks | ★★★☆☆ | ★★★★★ | Lossless |
| 9 | Zstandard | Master | 1-2 months | ★★★★★ | ★★★★★ | Lossless |
| 10 | JPEG | Expert | 3-4 weeks | ★★★★★ | ★★★★★ | Lossy |
| 11 | PNG | Advanced | 2-3 weeks | ★★★★☆ | ★★★★☆ | Lossless |
| 12 | MP3 | Master | 1-2 months | ★★★★★ | ★★★★★ | Lossy |
| 13 | Video Codec | Master | 2-3 months | ★★★★★ | ★★★★★ | Lossy |
| 14 | Delta Encoding | Intermediate | 1-2 weeks | ★★★☆☆ | ★★★★☆ | Lossless |
| 15 | PPM | Expert | 3-4 weeks | ★★★★★ | ★★★★☆ | Lossless |
| 16 | Neural Compress | Master | 2-3 months | ★★★★★ | ★★★★★ | Lossless |
| 17 | Benchmarking | Intermediate | 1 week | ★★☆☆☆ | ★★★☆☆ | Tool |
| 18 | Mini Library | Expert | 2-3 weeks | ★★★★☆ | ★★★☆☆ | Library |
Recommended Learning Path
Path A: Understanding the Fundamentals (4-6 weeks)
Best if you’re new to compression and want solid foundations:
- Project 1: RLE → Understand the simplest case
- Project 2: Huffman → Understand entropy coding
- Project 3: LZ77 → Understand dictionary compression
- Project 4: DEFLATE → See how they combine
Path B: Modern High-Performance Compression (6-8 weeks)
Best if you want to understand current industry standards:
- Project 2: Huffman → Foundation of entropy coding
- Project 6: ANS → Modern entropy coding
- Project 8: LZ4 → Speed-optimized compression
- Project 9: Zstandard → State of the art
Path C: Media Compression Deep Dive (8-12 weeks)
Best if you’re interested in images, audio, or video:
- Project 2: Huffman → Needed for JPEG
- Project 10: JPEG → Image compression fundamentals
- Project 12: MP3 → Audio compression
- Project 13: Video Codec → Video compression
Path D: Research-Oriented Path (10-12 weeks)
Best if you want to push compression boundaries:
- Project 5: Arithmetic Coding → Optimal entropy coding
- Project 7: BWT → Transform-based compression
- Project 15: PPM → Context modeling
- Project 16: Neural Compressor → Frontier of compression
Path E: Complete Mastery (4-6 months)
For those who want to deeply understand all of compression:
- Start with Path A (fundamentals)
- Add Path B (modern algorithms)
- Add Projects 10-11 (image compression)
- Add Projects 15-16 (advanced modeling)
- Finish with Project 18 (library design)
Key Resources Summary
Books
- “Introduction to Data Compression” by Khalid Sayood - The textbook
- “Data Compression: The Complete Reference” by David Salomon - Comprehensive reference
- “Managing Gigabytes” by Witten, Moffat, Bell - Text compression focus
- “JPEG Still Image Data Compression Standard” by Pennebaker & Mitchell - Image compression
Online Resources
- Squash Compression Benchmark - Algorithm comparisons
- Matt Mahoney’s Data Compression Page - Comprehensive resource
- encode.su Forum - Active compression community
- Hutter Prize - Compression competition
Source Code to Study
- zlib - Reference DEFLATE
- zstd - Modern, well-documented
- lz4 - Speed-optimized
- bzip2 - BWT-based
- LAME - MP3 encoder
- x264 - H.264 encoder
RFCs and Standards
- RFC 1951 - DEFLATE
- RFC 8878 - Zstandard
- PNG Specification
- JPEG Standard
Summary
| # | Project | Main Language |
|---|---|---|
| 1 | Run-Length Encoding (RLE) Compressor | C |
| 2 | Huffman Coding Compressor | C |
| 3 | LZ77 Sliding Window Compressor | C |
| 4 | DEFLATE Implementation (LZ77 + Huffman) | C |
| 5 | Arithmetic Coding Compressor | C |
| 6 | ANS (Asymmetric Numeral Systems) Compressor | C |
| 7 | Burrows-Wheeler Transform (BWT) Compressor | C |
| 8 | LZ4/Snappy-style Fast Compressor | C |
| 9 | Zstandard-style Compressor | C |
| 10 | JPEG Encoder from Scratch | C |
| 11 | PNG Encoder from Scratch | C |
| 12 | MP3 Encoder (Simplified) | C |
| 13 | Video Codec Fundamentals (H.264-lite) | C |
| 14 | Delta Encoding & Diff Algorithm | C |
| 15 | Context-Based Compression (PPM) | C |
| 16 | Neural Network Compressor (Simple) | Python |
| 17 | Compression Benchmarking Suite | Python + C |
| 18 | Build Your Own Mini-Compression Library | C |