Project 1: Vectorized KNN From First Principles
Build a deterministic KNN classifier using vectorized array operations and explicit decision policies.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Level 2: Intermediate |
| Time Estimate | 6-10 hours |
| Main Programming Language | Python |
| Alternative Programming Languages | Julia, R |
| Coolness Level | Level 3: Genuinely Clever |
| Business Potential | 1. Resume Gold |
| Prerequisites | NumPy basics, vector shapes, basic classification vocabulary |
| Key Topics | Vectorization, broadcasting, KNN, scaling sensitivity, deterministic tie rules |
1. Learning Objectives
By completing this project, you will:
- Explain KNN as geometry on feature space instead of as a black-box API.
- Implement distance and voting logic with vectorized operations.
- Build deterministic rules for ties and edge cases.
- Demonstrate why scaling changes distance-based models.
- Produce a repeatable CLI-style benchmark transcript.
2. All Theory Needed (Per-Concept Breakdown)
2.1 Distance Geometry in Feature Spaces
Fundamentals
KNN predicts by local neighborhood voting. The core assumption is that points close in feature space tend to share labels. “Close” is defined by a distance metric such as Euclidean or Manhattan distance. This assumption only holds if features are represented on compatible scales and meaningful axes. If one feature has a much larger numeric range than the others, it dominates distances and distorts neighborhoods.
Deep Dive into the concept
Distance geometry is the first practical place where ML abstractions become concrete. Every sample is a vector, and distance is a function over vector differences. In Euclidean distance, each feature contributes squared deviation. This amplifies large deviations and makes scale control essential. If income ranges from 0-200000 and tenure ranges from 0-24, income dominates unless normalized.
In KNN, prediction complexity grows with dataset size because each query compares against all stored points in the brute-force variant. This makes KNN simple to train but expensive to query. Approximate nearest neighbor methods reduce query cost but introduce approximation error. For this project, brute-force is preferred because it exposes full mechanics.
Ties are common with discrete or noisy data. A robust KNN implementation should declare tie-breaking policy explicitly: stable index order, distance-weighted voting, or class-priority list. Undocumented tie behavior causes non-reproducible outputs and interview-grade design flaws.
Distance metrics encode assumptions about feature geometry. Euclidean assumes isotropic contribution after scaling; Manhattan can be more robust in some sparse/high-dimensional settings. Weighted-distance variants can encode domain importance but also overfit if weights are tuned carelessly.
Finally, KNN has no explicit learned parameters in basic form, but it still has hyperparameters (k, distance metric, weighting strategy). These parameters must be selected with validation, not test data.
How this fits into the projects
You apply this geometry intuition directly in Project 1 and reuse scale intuition in Projects 3-5.
Definitions & key terms
- k: number of nearest neighbors considered for voting.
- distance metric: function that measures pairwise dissimilarity between feature vectors.
- decision boundary: boundary separating class regions in feature space.
- local model: prediction depends on nearby examples rather than global coefficients.
Mental model diagram
Class A: o Class B: x
o o
o x
q(?)
x x
x
Neighborhood around q determines predicted label.
Changing scale stretches space and changes neighbors.
How it works
- Store training matrix and label vector.
- For each query, compute distance to all training rows.
- Rank distances and select top-k indices.
- Vote labels with deterministic tie policy.
- Return label and optional confidence proxy.
Invariants:
- Feature/label row alignment preserved.
- Distance computation is shape-safe.
- Tie behavior deterministic.
Failure modes:
- Mismatched feature scales.
- Non-deterministic sorting with equal distances.
- Incorrect axis in vectorized reduction.
Minimal concrete example
X_train: shape (5,2)
y_train: [A, A, B, B, B]
query: [2.4, 1.7]
compute distances -> [0.5, 1.3, 0.4, 1.1, 2.0]
nearest k=3 labels -> [B, A, B]
prediction -> B
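The worked example above can be reproduced end-to-end in a few lines. The coordinates here are invented to mirror the example qualitatively (nearest three labels [B, A, B], prediction B); they are not the fixture's actual values.

```python
import numpy as np

# Hypothetical 5x2 training set matching the worked example's shape and labels.
X_train = np.array([[2.0, 1.5], [3.5, 1.0], [2.7, 1.9], [1.5, 2.3], [4.2, 2.5]])
y_train = np.array(["A", "A", "B", "B", "B"])
query = np.array([2.4, 1.7])

# One distance per training row via a broadcasted subtraction.
d = np.sqrt(((X_train - query) ** 2).sum(axis=1))

# Stable sort keeps equal distances in index order; take the top k=3.
idx = np.argsort(d, kind="stable")[:3]

# Majority vote; note np.unique sorts classes, so exact vote ties would
# break toward the alphabetically first class in this simple sketch.
labels, counts = np.unique(y_train[idx], return_counts=True)
prediction = labels[np.argmax(counts)]
print(prediction)  # B
```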
Common misconceptions
- “KNN doesn’t need preprocessing.” Correction: it is highly preprocessing-sensitive.
- “Higher k is always better.” Correction: high k can wash out local structure.
Check-your-understanding questions
- Why can KNN fail on unscaled features?
- What is the trade-off between low and high `k`?
- Why is deterministic tie-breaking important?
Check-your-understanding answers
- Larger-range features dominate distance, distorting neighborhood identity.
- Low `k` is high variance; high `k` is high bias.
- Reproducibility and predictable behavior in production and testing.
Real-world applications
- Content recommendation baselines.
- Similar-customer retrieval.
- Prototype classifiers in low-data workflows.
Where you’ll apply it
- This project directly.
- Also used in P03 and P05 when interpreting scale effects.
References
- NumPy broadcasting docs: https://numpy.org/doc/stable/user/basics.broadcasting.html
- scikit-learn neighbors overview: https://scikit-learn.org/stable/modules/neighbors.html
Key insights
KNN performance is mostly a data geometry problem, not a model complexity problem.
Summary
You should now treat KNN as a transparent geometric decision process with explicit trade-offs.
Homework/Exercises to practice the concept
- Compare Euclidean vs Manhattan distance on same dataset.
- Evaluate k values {1,3,5,9} and record variance in predictions.
- Inject one high-scale feature and measure prediction drift.
Solutions to the homework/exercises
- Expect metric-specific changes near class boundaries.
- Low `k` fluctuates more; moderate `k` stabilizes.
- Without scaling, the high-scale feature likely dominates neighborhood choice.
3. Project Specification
3.1 What You Will Build
A command-driven KNN workflow that loads a fixed dataset, predicts labels for query points, and prints neighbor-level diagnostics.
3.2 Functional Requirements
- Accept dataset input and query point(s).
- Compute vectorized distances for each query.
- Support configurable `k` and metric (`euclidean`, `manhattan`).
- Return deterministic predictions and vote distribution.
- Emit helpful error messages for malformed input.
3.3 Non-Functional Requirements
- Performance: Predict 1000 queries against 10k rows in under 2s on laptop hardware.
- Reliability: Same input yields same output.
- Usability: Clear CLI output with no hidden decisions.
3.4 Example Usage / Output
$ ml-lab knn-demo --dataset toy.csv --k 5 --metric euclidean --query "2.4,1.7"
Predicted class: B
Nearest labels: [A,B,B,B,C]
Vote: A=1 B=3 C=1
3.5 Data Formats / Schemas / Protocols
- CSV features: numeric columns only for baseline.
- Label column: string class label.
- Query format: comma-separated numeric values matching feature count.
3.6 Edge Cases
- `k > n_samples` should fail with an explicit error.
- Equal-distance ties should follow the documented policy.
- Missing/NaN features should fail-fast unless imputation explicitly enabled.
3.7 Real World Outcome
3.7.1 How to Run (Copy/Paste)
ml-lab knn-demo --dataset data/toy_knn.csv --k 5 --metric euclidean --query "2.4,1.7"
3.7.2 Golden Path Demo (Deterministic)
$ ml-lab knn-demo --dataset fixtures/knn_fixed.csv --k 3 --query "4.0,2.0"
Nearest labels: [B,B,A]
Predicted class: B
Exit code: 0
3.7.3 If CLI: exact terminal transcript
$ ml-lab knn-demo --dataset fixtures/knn_fixed.csv --k 3 --query "4.0,2.0"
Loaded rows: 120
Distance metric: euclidean
Nearest labels: [B,B,A]
Vote counts: A=1 B=2
Predicted class: B
Exit code: 0
$ ml-lab knn-demo --dataset fixtures/knn_fixed.csv --k 999 --query "4.0,2.0"
ERROR: k (999) cannot exceed number of samples (120)
Exit code: 2
4. Solution Architecture
4.1 High-Level Design
Input Parser -> Data Validator -> Distance Engine -> Neighbor Selector -> Vote Resolver -> Reporter
4.2 Key Components
| Component | Responsibility | Key Decisions |
|---|---|---|
| Parser | Read CLI args and query vectors | strict format validation |
| Validator | Enforce schema/shape constraints | fail-fast on mismatch |
| Distance Engine | Vectorized metric computation | axis-safe operations |
| Vote Resolver | Deterministic class decision | stable tie policy |
| Reporter | Human-readable diagnostics | include vote counts |
4.3 Data Structures (No Full Code)
X_train: float matrix [n_samples, n_features]
y_train: label vector [n_samples]
queries: float matrix [n_queries, n_features]
4.4 Algorithm Overview
- Validate query dimensionality.
- Compute pairwise distances from each query to all train points.
- Sort and slice nearest k.
- Count labels and resolve tie deterministically.
Complexity:
- Time: O(n_queries * n_samples * n_features)
- Space: O(n_samples) per query (or larger if batching)
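The time complexity above corresponds to one broadcasted tensor when queries are batched: every query row is subtracted from every training row across all features. A minimal sketch of that batched distance engine (the space note applies because the intermediate is `(n_queries, n_samples, n_features)`):

```python
import numpy as np

def pairwise_euclidean(Q, X):
    # (n_queries, 1, n_features) - (1, n_samples, n_features) broadcasts to
    # (n_queries, n_samples, n_features); reduce the last axis to get the
    # (n_queries, n_samples) distance matrix.
    return np.sqrt(((Q[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1))

Q = np.array([[0.0, 0.0], [1.0, 1.0]])
X = np.array([[0.0, 1.0], [3.0, 4.0]])
print(pairwise_euclidean(Q, X))  # shape (2, 2)
```

If the full intermediate tensor is too large, chunking the queries trades a Python loop over batches for bounded memory.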
5. Implementation Guide
5.1 Development Environment Setup
python -m venv .venv
source .venv/bin/activate
pip install numpy pandas
# install the project's command-line entrypoint (e.g. `pip install -e .`) so `ml-lab` is on PATH
5.2 Project Structure
knn-project/
|-- data/
|-- fixtures/
|-- src/
|-- tests/
`-- reports/
5.3 The Core Question You’re Answering
How do local geometric assumptions become concrete classifier behavior?
5.4 Concepts You Must Understand First
- Vectorized axis operations
- Stable sorting behavior
- Distance scaling effects
5.5 Questions to Guide Your Design
- Which axis reduction produces one distance per sample?
- How do you ensure reproducibility for ties?
- Should confidence be vote ratio or distance-weighted score?
5.6 Thinking Exercise
Manually classify three points and compare with your tool output.
5.7 The Interview Questions They’ll Ask
- Why is KNN expensive at inference?
- How do KD-trees help and when do they fail?
- Why can normalization change decision boundaries?
- How do you tune `k` safely?
5.8 Hints in Layers
- Hint 1: build one-query path first.
- Hint 2: print intermediate shapes.
- Hint 3: add deterministic sort policy.
- Hint 4: run a scale stress-test fixture.
5.9 Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| NumPy fundamentals | Python for Data Analysis | Ch. 4 |
| Basic ML intuition | Hands-On Machine Learning | Ch. 2 |
5.10 Implementation Phases
- Phase 1: Parsing + validation.
- Phase 2: Distance + voting core.
- Phase 3: Diagnostics + edge-case handling.
5.11 Key Implementation Decisions
| Decision | Options | Recommendation | Rationale |
|---|---|---|---|
| Tie policy | random, stable index, weighted | stable index | reproducibility |
| Metric default | euclidean, manhattan | euclidean | common baseline |
6. Testing Strategy
6.1 Test Categories
| Category | Purpose | Examples |
|---|---|---|
| Unit | validate core math | distance output shape/value |
| Integration | end-to-end CLI | deterministic transcript |
| Edge Case | invalid input behavior | k too large, malformed query |
6.2 Critical Test Cases
- Golden fixture predicts expected class.
- Scale-perturbed fixture changes expected decision.
- Tie fixture follows documented tie policy.
6.3 Test Data
Use fixed CSV fixtures with known nearest neighbors.
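The critical test cases above can be sketched in pytest style. `predict` here is a hypothetical stand-in for your prediction path, and its tie policy (alphabetically first among tied classes, via `np.unique`) is an assumption that may differ from the policy you document; the tests assert against whatever policy you actually fix.

```python
import numpy as np

def predict(X, y, q, k):
    # Stand-in prediction path: stable-sorted distances, majority vote.
    d = np.sqrt(((X - q) ** 2).sum(axis=1))
    idx = np.argsort(d, kind="stable")[:k]
    labels, counts = np.unique(y[idx], return_counts=True)
    return labels[np.argmax(counts)]

def test_golden_fixture_predicts_expected_class():
    X = np.array([[0.0, 0.0], [0.1, 0.1], [5.0, 5.0]])
    y = np.array(["A", "A", "B"])
    assert predict(X, y, np.array([0.05, 0.05]), k=3) == "A"

def test_tie_fixture_follows_documented_policy():
    X = np.array([[0.0, 0.0], [1.0, 1.0]])
    y = np.array(["B", "A"])
    # Both neighbors are equidistant from the midpoint; this helper's
    # policy picks the alphabetically first tied class.
    assert predict(X, y, np.array([0.5, 0.5]), k=2) == "A"
```

The scale-perturbation case follows the same pattern: duplicate a fixture, multiply one column by a large constant, and assert the prediction changes.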
7. Common Pitfalls & Debugging
7.1 Frequent Mistakes
| Pitfall | Symptom | Solution |
|---|---|---|
| Wrong axis reduction | shape errors or wrong distances | assert output shape |
| Unscaled features | unstable/biased predictions | apply normalization |
| Undocumented tie logic | inconsistent outputs | enforce policy |
7.2 Debugging Strategies
- Print intermediate vector shapes for first query.
- Compare one-query manual calculation with tool result.
7.3 Performance Traps
- Recomputing expensive transforms per query; precompute reusable pieces.
8. Extensions & Challenges
8.1 Beginner Extensions
- Add a configurable `k`-sweep report.
- Add optional Manhattan metric.
8.2 Intermediate Extensions
- Add distance-weighted voting.
- Add confusion-matrix summary over validation set.
8.3 Advanced Extensions
- Add approximate nearest-neighbor backend.
- Add multi-threaded batch query mode.
9. Real-World Connections
9.1 Industry Applications
- Similar-item search baselines.
- Case-based recommendation systems.
9.2 Related Open Source Projects
- scikit-learn neighbors module.
- FAISS for approximate nearest-neighbor search.
9.3 Interview Relevance
This project maps directly to nearest-neighbor theory, scaling pitfalls, and complexity analysis questions.
10. Resources
10.1 Essential Reading
- NumPy docs (broadcasting and array operations).
- scikit-learn neighbors user guide.
10.2 Video Resources
- 3Blue1Brown linear algebra intuition.
- Intro ML distance-based learning lectures.
10.3 Tools & Documentation
- NumPy docs: https://numpy.org/doc/stable/
- scikit-learn docs: https://scikit-learn.org/stable/
10.4 Related Projects in This Series
- Next: Project 2
11. Self-Assessment Checklist
11.1 Understanding
- I can explain why scaling matters in KNN.
- I can describe tie-resolution trade-offs.
11.2 Implementation
- Golden fixture passes deterministically.
- Error handling for invalid `k` works.
11.3 Growth
- I documented one optimization idea for inference speed.
12. Submission / Completion Criteria
Minimum Viable Completion:
- Deterministic KNN predictions with clear logs.
- Basic edge-case handling.
Full Completion:
- Includes metric comparison across `k` values.
- Includes scaling before/after analysis.
Excellence (Going Above & Beyond):
- Includes weighted voting and robust benchmark notes.