Project 7: PCA Image Compressor
An image compressor that uses Principal Component Analysis (PCA) to reduce image size while preserving visual quality. See how keeping different numbers of principal components affects the result.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Level 3: Advanced (The Engineer) |
| Main Programming Language | Python |
| Alternative Programming Languages | Julia, C++, Rust |
| Coolness Level | Level 4: Hardcore Tech Flex |
| Business Potential | 2. The “Micro-SaaS / Pro Tool” (Solo-Preneur Potential) |
| Knowledge Area | Dimensionality Reduction / Image Processing |
| Software or Tool | PCA Compressor |
| Main Book | “Hands-On Machine Learning” by Aurélien Géron |
1. Learning Objectives
By completing this project, you will:
- Translate math definitions into deterministic implementation steps.
- Build validation checks that make correctness observable.
- Diagnose numerical, logical, and data-shape failures early.
- Explain tradeoffs in interviews using evidence from your own build.
2. All Theory Needed (Per-Concept Breakdown)
This project applies the following theory clusters:
- Symbolic-to-numeric translation (expressions, data shapes, invariants)
- Stability constraints (precision, scaling, stopping criteria)
- Optimization or inference logic (depending on project objective)
- Evaluation discipline (error analysis, test coverage, reproducibility)
Concept A: Mathematical Representation Discipline
Fundamentals A math expression is not executable until you define representation, ordering, and domain constraints. The same equation can be represented as a token stream, tree, matrix pipeline, or probability graph. Choosing representation determines what bugs you can catch early.
Deep Dive into the concept Most project failures begin before algorithm selection: they start with ambiguous representation. If your parser cannot distinguish unary minus from subtraction, your calculator fails. If your matrix dimensions are implicit rather than validated, your linear algebra pipeline fails silently. If your probabilistic assumptions (independence, stationarity, or class priors) are not explicit, your inference can look accurate on one split and collapse on another. The core implementation move is to treat representation as a contract. Define each object with shape, domain, and semantic intent. Then enforce invariants at boundaries: input parser, preprocessing, training loop, evaluation stage. This makes debugging local instead of global.
How this fits this project You will encode each operation with explicit contracts and invariant checks.
Definitions & key terms
- Invariant: Property that must hold before and after each operation.
- Shape contract: Expected dimensional structure of vectors/matrices/tensors.
- Domain constraint: Allowed value range (for example log input > 0).
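These definitions can be made executable as boundary checks. A minimal sketch, assuming NumPy arrays as the data representation (the helper name `validate_image_matrix` is hypothetical):

```python
import numpy as np

def validate_image_matrix(X, expected_cols=None):
    """Enforce shape and domain invariants at a pipeline boundary."""
    X = np.asarray(X, dtype=float)
    if X.ndim != 2:
        raise ValueError(f"shape contract violated: expected 2-D, got {X.ndim}-D")
    if expected_cols is not None and X.shape[1] != expected_cols:
        raise ValueError(
            f"shape contract violated: expected {expected_cols} columns, got {X.shape[1]}"
        )
    if not np.isfinite(X).all():
        raise ValueError("domain constraint violated: non-finite values present")
    return X

# A well-formed 3x4 matrix passes; a NaN triggers a local, diagnostic failure.
ok = validate_image_matrix([[1, 2, 3, 4]] * 3, expected_cols=4)
```

Because the check runs at the boundary, a bad input fails with a named invariant instead of propagating NaNs into the eigendecomposition.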
Mental model diagram
```
User Input -> Representation Layer -> Validated Operation -> Observable Output
              (tokens/shapes)         (invariants pass)      (tests/plots/logs)
```
How it works
- Parse/ingest data into typed structures.
- Validate shape/domain invariants.
- Execute operation.
- Compare observed output with expected behavior.
- Record failure signature if mismatch appears.
Minimal concrete example
```
PSEUDOCODE
read expression
tokenize with precedence rules
if token sequence invalid -> return syntax error
evaluate tree
if domain violation -> return bounded diagnostic
print value and confidence check
```
Common misconceptions
- “If it runs once, representation is correct.” -> false.
- “Type checks are enough without shape checks.” -> false.
Check-your-understanding questions
- Which invariant catches division-by-zero earliest?
- Why does shape validation belong at boundaries rather than only in core logic?
- Predict failure if tokenization ignores unary minus.
Check-your-understanding answers
- Domain check on denominator before operation execution.
- Boundary validation keeps errors local and diagnostic.
- Expressions like `-2^2` get misinterpreted and produce wrong precedence behavior.
Real-world applications Feature preprocessing, model-serving input validation, and experiment-tracking schema enforcement.
Where you’ll apply it This project and every downstream project in the sprint.
References
- CSAPP (Bryant & O’Hallaron), floating-point chapter
- Math for Programmers (Paul Orland), representation-oriented chapters
Key insight Correct representation reduces the complexity of every later decision.
Summary Stable ML math implementations start with explicit contracts, not implicit assumptions.
Homework/Exercises
- Write five invariants for your project.
- Build a failing test input for each invariant.
Solutions
- Include at least one shape, one domain, one convergence, one reproducibility, and one output-range invariant.
- Each failing input should trigger exactly one diagnostic to keep root-cause analysis clean.
3. Build Blueprint
- Scope the smallest end-to-end slice that produces visible output.
- Add deterministic tests and edge-case probes.
- Layer complexity only after baseline behavior is stable.
- Add metrics logging before optimization.
- Run failure drills: perturb inputs, scale values, and check stability.
4. Real-World Outcome (Target)
```
$ python pca_compress.py face.png
Original image: 256x256 = 65,536 pixels
Computing covariance matrix...
Finding eigenvectors (principal components)...

Compression results:
  10 components:  15.3% original size, PSNR = 24.5 dB  [saved: face_10.png]
  50 components:  38.2% original size, PSNR = 31.2 dB  [saved: face_50.png]
  100 components: 61.4% original size, PSNR = 38.7 dB  [saved: face_100.png]

[Visual: Side-by-side comparison of original and compressed images]
[Visual: Scree plot showing eigenvalue magnitudes - "elbow" at ~50 components]
```
Implementation Hints: For a grayscale image of size m×n, treat each row as a data point (m points of dimension n).
- Center the data: subtract the mean from each row
- Compute the covariance matrix: `C = X.T @ X / (m-1)`
- Find the eigenvectors of C, sorted by eigenvalue magnitude
- Keep the top k eigenvectors as your principal components
- Project: `X_compressed = X @ V_k`
- Reconstruct: `X_reconstructed = X_compressed @ V_k.T + mean`
The eigenvalues tell you how much variance each component captures.
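The hints above combine into one short function. A minimal sketch (the function name `pca_compress` and the rank-1 test data are assumptions for illustration, not part of the target CLI):

```python
import numpy as np

def pca_compress(X, k):
    """Project rows of X onto the top-k principal components and reconstruct."""
    X = np.asarray(X, dtype=float)
    m = X.shape[0]
    mean = X.mean(axis=0)
    Xc = X - mean                              # center the data
    C = Xc.T @ Xc / (m - 1)                    # covariance matrix
    eigenvalues, eigenvectors = np.linalg.eigh(C)
    idx = np.argsort(eigenvalues)[::-1]        # sort by descending eigenvalue
    eigenvalues = eigenvalues[idx]
    V_k = eigenvectors[:, idx[:k]]             # top-k principal components
    X_proj = Xc @ V_k                          # project (compress)
    X_rec = X_proj @ V_k.T + mean              # reconstruct
    return X_rec, eigenvalues

# Rank-1 data lies on a single direction, so one component reconstructs it exactly.
X = np.outer(np.arange(1, 9, dtype=float), [1.0, 2.0, 3.0])
X_rec, evals = pca_compress(X, k=1)
```

The rank-1 check is a useful unit test: all but the first eigenvalue should be (numerically) zero, and reconstruction should be exact.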
Learning milestones:
- Compression works and image is recognizable → You understand projection and reconstruction
- Scree plot shows variance explained → You understand what eigenvectors capture
- You can explain PCA without using library functions → You’ve internalized the algorithm
5. Core Design Notes from Main Guide
Core Question
How can we find the “most important” directions in high-dimensional data, and why does projecting onto these directions preserve the essential information while discarding the noise?
When you look at a face image with 65,536 pixels, most of that data is redundant. PCA reveals a profound truth: high-dimensional data often lives on a much lower-dimensional manifold. The eigenvectors of the covariance matrix point in the directions of maximum variance–the directions that matter most. This is not just compression; it is discovering the hidden structure in data. Every face can be approximated as a weighted combination of “eigenfaces.” This same principle powers feature extraction, noise reduction, and visualization of high-dimensional datasets.
Concepts You Must Understand First
Stop and research these before coding:
- What is the covariance matrix and what does it tell us?
- Why does Cov(X,Y) being positive mean X and Y tend to move together?
- Why is the covariance matrix always symmetric and positive semi-definite?
- How does centering the data (subtracting the mean) affect the covariance?
- Book Reference: “All of Statistics” Chapter 3 - Larry Wasserman
- Why are eigenvectors of the covariance matrix the “principal components”?
- What optimization problem does the first principal component solve?
- Why do subsequent components have to be orthogonal to previous ones?
- What is the connection between maximizing variance and minimizing reconstruction error?
- Book Reference: “Hands-On Machine Learning” Chapter 8 - Aurelien Geron
- Projection and reconstruction in linear algebra
- What does it mean to project a vector onto a subspace?
- Why is projection not invertible (you lose information)?
- How do you reconstruct from the projection?
- Book Reference: “Linear Algebra Done Right” Chapter 6 - Sheldon Axler
- Explained variance ratio and the scree plot
- What does it mean when an eigenvalue is large vs. small?
- How do you decide how many components to keep?
- What is the “elbow method” and why does it work?
- Book Reference: “Data Science for Business” Chapter 5 - Provost & Fawcett
- The connection between PCA and SVD
- Why can you compute PCA using Singular Value Decomposition?
- When would you use SVD instead of eigendecomposition of the covariance matrix?
- What are the computational advantages of SVD for large datasets?
- Book Reference: “Numerical Linear Algebra” Chapter 4 - Trefethen & Bau
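To make the PCA-SVD connection concrete, here is a sketch comparing both routes on random data. The eigenvalues of the covariance matrix should equal the squared singular values of the centered data matrix divided by (m - 1):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
Xc = X - X.mean(axis=0)
m = Xc.shape[0]

# Route 1: eigendecomposition of the covariance matrix.
C = Xc.T @ Xc / (m - 1)
eigvals = np.sort(np.linalg.eigvalsh(C))[::-1]   # descending order

# Route 2: SVD of the centered data matrix (never forms C explicitly).
s = np.linalg.svd(Xc, compute_uv=False)          # singular values, descending
svd_vals = s**2 / (m - 1)
```

The SVD route also avoids squaring the condition number, which matters once images push the feature count into the tens of thousands.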
Questions to Guide Your Design
Before implementing, think through these:
- How will you represent the image data? Will each pixel be a feature (image as one vector), or each row a data point? The choice affects what “variance” means.
- How will you center the data? You must subtract the mean before computing the covariance. Do you store the mean so you can add it back during reconstruction?
- What happens at the boundaries of compression? With 0 components, you get the mean image. With all components, you get perfect reconstruction. What happens in between?
- How will you handle color images? Process each RGB channel separately? Convert to grayscale? Stack channels as additional dimensions?
- How will you visualize the principal components themselves? The eigenvectors are images too. What do they look like? What patterns do they capture?
- How will you measure compression quality? PSNR? Visual inspection? Explained variance ratio?
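Since PSNR recurs as the quality metric in this project, a minimal sketch of the standard formula, PSNR = 10 · log10(MAX² / MSE); the helper name `psnr` and the 8-bit maximum of 255 are assumptions:

```python
import numpy as np

def psnr(original, reconstructed, max_value=255.0):
    """Peak signal-to-noise ratio in dB; higher means closer to the original."""
    diff = np.asarray(original, float) - np.asarray(reconstructed, float)
    mse = np.mean(diff**2)
    if mse == 0:
        return float("inf")               # identical images
    return 10.0 * np.log10(max_value**2 / mse)

# All-zero "reconstruction" of a flat gray image: MSE = 256, PSNR ~ 24 dB.
value = psnr(np.full((4, 4), 16.0), np.zeros((4, 4)))
```

Values in the high 30s dB are usually visually near-indistinguishable from the original, which matches the target output above.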
Thinking Exercise
Before writing any code, trace through PCA on tiny 2D data:
```
Consider 4 data points: (2,3), (4,5), (6,7), (8,9)

Step 1: Center the data
  Mean = (5, 6)
  Centered: (-3,-3), (-1,-1), (1,1), (3,3)

Step 2: Compute covariance matrix
  Cov = [[sum(x*x), sum(x*y)],
         [sum(y*x), sum(y*y)]] / 3
      = [[20, 20],
         [20, 20]] / 3
      = [[6.67, 6.67],
         [6.67, 6.67]]

Step 3: Find eigenvalues and eigenvectors
  det([[6.67-lambda, 6.67], [6.67, 6.67-lambda]]) = 0
  lambda^2 - 13.33*lambda = 0
  lambda = 13.33 or lambda = 0
  For lambda = 13.33: eigenvector = [1, 1]  (normalized: [0.707, 0.707])
  For lambda = 0:     eigenvector = [1, -1] (normalized: [0.707, -0.707])
```
Step 4: Interpret
- The data lies PERFECTLY on the line y = x.
- The first principal component (variance 13.33) points along this line.
- The second component (variance 0) is perpendicular: NO variance in that direction!
- With 1 component, you capture 100% of the variance.
- This shows PCA finds the line of best fit!
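You can confirm the hand trace with NumPy. Note that `np.linalg.eigh` returns eigenvalues in ascending order, and eigenvector signs may differ from the trace:

```python
import numpy as np

X = np.array([[2, 3], [4, 5], [6, 7], [8, 9]], dtype=float)
Xc = X - X.mean(axis=0)                 # Step 1: mean is (5, 6)
C = Xc.T @ Xc / (len(X) - 1)            # Step 2: [[6.67, 6.67], [6.67, 6.67]]
eigenvalues, eigenvectors = np.linalg.eigh(C)   # Step 3: ascending order
```

The largest eigenvalue is 40/3 ≈ 13.33 with eigenvector [0.707, 0.707], exactly as derived by hand.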
Interview Questions
- “What is PCA and why do we use it?” Expected answer: PCA finds the directions of maximum variance in data. It is used for dimensionality reduction, visualization, noise reduction, and feature extraction.
- “Walk me through the PCA algorithm step by step.” Expected answer: Center the data, compute the covariance matrix, find the eigenvectors sorted by eigenvalue magnitude, project the data onto the top k eigenvectors, and reconstruct by inverse projection plus the mean.
- “What is the relationship between eigenvectors of the covariance matrix and principal components?” Expected answer: They are the same thing. The eigenvector with the largest eigenvalue is the first principal component: the direction of maximum variance.
- “How do you decide how many principal components to keep?” Expected answer: Look at the explained variance ratio. Use a scree plot and the elbow method, choose enough components to explain 95% of the variance, or use cross-validation if there is a downstream task.
- “What are the limitations of PCA?” Expected answer: It only captures linear relationships. It is sensitive to scaling (standardize first!). All components are linear combinations of the original features (not always interpretable). It cannot handle missing data directly.
- “When would you use SVD instead of eigendecomposition for PCA?” Expected answer: When the data matrix is short and wide (more features than samples, as with flattened images), forming and decomposing the huge n_features × n_features covariance matrix is wasteful or infeasible; SVD of the data matrix avoids forming it, and is also more numerically stable because it avoids squaring the condition number.
- “What is the reconstruction error in PCA and how does it relate to discarded eigenvalues?” Expected answer: The reconstruction error equals the sum of the discarded eigenvalues. Larger discarded eigenvalues mean more information lost.
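The last identity is easy to verify numerically. A sketch on random data, normalizing the squared reconstruction error by (m - 1) to match the covariance convention:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 6))
Xc = X - X.mean(axis=0)
m = Xc.shape[0]

C = Xc.T @ Xc / (m - 1)
eigenvalues, eigenvectors = np.linalg.eigh(C)
idx = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[idx], eigenvectors[:, idx]

k = 3
V_k = eigenvectors[:, :k]
residual = Xc - Xc @ V_k @ V_k.T       # the part of the data PCA throws away
error = np.sum(residual**2) / (m - 1)  # normalized squared reconstruction error
discarded = eigenvalues[k:].sum()      # sum of discarded eigenvalues
```

This check doubles as a regression test for your own implementation: if `error` and `discarded` diverge, your sorting, centering, or projection is wrong.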
Hints in Layers (Treat as pseudocode guidance)
Hint 1: Start with tiny images (8x8 digits from sklearn). You can verify your implementation against sklearn.decomposition.PCA before moving to larger images.
Hint 2: The core PCA algorithm in Python:
```python
import numpy as np

# X is (n_samples, n_features), already centered
n_samples = X.shape[0]
cov_matrix = X.T @ X / (n_samples - 1)
eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)
# eigh returns ascending order, so sort by descending eigenvalue
idx = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[idx]
eigenvectors = eigenvectors[:, idx]
```
Hint 3: For projection and reconstruction:
```python
# Keep top k components
V_k = eigenvectors[:, :k]            # shape: (n_features, k)
# Project (reduce dimensionality)
X_projected = X_centered @ V_k       # shape: (n_samples, k)
# Reconstruct
X_reconstructed = X_projected @ V_k.T + mean
```
Hint 4: For images, reshape! A 256x256 image has 65536 pixels. For n images, create a matrix of shape (n, 65536). Each row is a flattened image.
Hint 5: To visualize eigenfaces, reshape the eigenvector back to image dimensions. The first few eigenfaces capture broad patterns (lighting, face shape). Later ones capture fine details and noise.
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| PCA Theory and Algorithm | “Hands-On Machine Learning” - Aurelien Geron | Chapter 8 |
| Covariance and Statistics | “All of Statistics” - Larry Wasserman | Chapter 3 |
| Linear Algebra of PCA | “Linear Algebra Done Right” - Sheldon Axler | Chapter 6 |
| SVD and Numerical Aspects | “Numerical Linear Algebra” - Trefethen & Bau | Chapter 4-5 |
| Image Processing with PCA | “Computer Vision” - Szeliski | Chapter 6 |
| Eigenfaces Application | “Pattern Recognition and ML” - Bishop | Chapter 12 |
| Variance and Information | “Information Theory” - Cover & Thomas | Chapter 8 |
6. Validation, Pitfalls, and Completion
Common Pitfalls and Debugging
Problem 1: “Outputs drift after a few iterations”
- Why: Hidden numerical instability (unscaled features, aggressive step size, or repeated subtraction of nearly equal values).
- Fix: Normalize inputs, reduce step size, and track relative error rather than only absolute error.
- Quick test: Run the same task with two scales of input (for example x and 10x) and compare normalized error curves.
Problem 2: “Results are inconsistent across runs”
- Why: Random seeds, data split randomness, or non-deterministic ordering are uncontrolled.
- Fix: Set seeds, log configuration, and store split indices and hyperparameters with each run.
- Quick test: Re-run three times with the same seed and confirm metrics remain inside a tight tolerance band.
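The quick test above can be automated with a seeded run. A minimal sketch, where `run_experiment` is a hypothetical stand-in for your actual pipeline:

```python
import numpy as np

def run_experiment(seed):
    """A stand-in for one run: seeded RNG makes the result repeatable."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(20, 4))
    # Report the largest eigenvalue of the sample covariance as the "metric".
    return float(np.linalg.eigvalsh(np.cov(X.T)).max())

# Three runs with the same seed must produce identical metrics.
results = [run_experiment(seed=42) for _ in range(3)]
```

With a fully seeded pipeline the tolerance band collapses to exact equality; any spread signals an uncontrolled source of randomness.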
Problem 3: “The project works on the demo case but fails on edge cases”
- Why: Tests only cover happy-path inputs.
- Fix: Add adversarial inputs (empty values, extreme ranges, near-singular matrices, rare classes).
- Quick test: Build an edge-case test matrix and ensure every scenario reports expected behavior.
Definition of Done
- Core functionality works on reference inputs
- Edge cases are tested and documented
- Results are reproducible (seeded and versioned configuration)
- Performance or convergence behavior is measured and explained
- A short retrospective explains what failed first and how you fixed it
7. Extension Ideas
- Add a stress-test mode with adversarial inputs.
- Add a short benchmark report (runtime + memory + error trend).
- Add a reproducibility bundle (seed, config, and fixed test corpus).
8. Why This Project Matters
PCA is eigenvalue decomposition applied to the covariance matrix. Building this from scratch (not using sklearn!) forces you to compute covariance, find eigenvectors, project data, and reconstruct. This is real ML, using real linear algebra.
This project is valuable because it creates observable evidence of mathematical reasoning under real implementation constraints.