Learn Basic Python ML: From NumPy to Scikit-learn

Goal: Master the foundational libraries of the Python machine learning stack. Learn the end-to-end workflow of a classical ML project: loading and cleaning data, visualization, model training, and evaluation.


Why Learn the Python ML Stack?

Python has become the undisputed lingua franca of machine learning. Its power comes from a rich ecosystem of libraries that provide the building blocks for everything from simple data analysis to complex artificial intelligence. Understanding this core stack is the first and most crucial step into the world of data science and AI.

By the end of these projects, you will be able to:

  • Manipulate and analyze datasets with NumPy and Pandas.
  • Create insightful data visualizations with Matplotlib and Seaborn.
  • Train, evaluate, and use predictive models with Scikit-learn.
  • Understand the complete workflow of a typical machine learning project.

Core Concept Analysis

The Machine Learning Workflow

A typical machine learning project follows a standard sequence of steps. Our projects will be structured around this workflow, introducing the right library for each job.

┌───────────────────────────────────────────────────┐
│ 1. Problem Definition & Data Gathering            │
│   "What question are we trying to answer?"        │
└───────────────────────────────────────────────────┘
                 │
                 ▼
┌───────────────────────────────────────────────────┐
│ 2. Exploratory Data Analysis (EDA)                │
│   • Load, clean, and explore the data.            │
│   • Tools: Pandas, Matplotlib, Seaborn            │
└───────────────────────────────────────────────────┘
                 │
                 ▼
┌───────────────────────────────────────────────────┐
│ 3. Data Preprocessing & Feature Engineering       │
│   • Handle NaNs, scale numbers, encode text.      │
│   • Tools: Scikit-learn, Pandas                   │
└───────────────────────────────────────────────────┘
                 │
                 ▼
┌───────────────────────────────────────────────────┐
│ 4. Model Training                                 │
│   • Choose an algorithm and train it on the data. │
│   • Tools: Scikit-learn                           │
└───────────────────────────────────────────────────┘
                 │
                 ▼
┌───────────────────────────────────────────────────┐
│ 5. Model Evaluation                               │
│   • Test the model on unseen data. Check metrics. │
│   • Tools: Scikit-learn                           │
└───────────────────────────────────────────────────┘
                 │
                 ▼
┌───────────────────────────────────────────────────┐
│ 6. Prediction / Inference                         │
│   • Use the trained model on new data.            │
└───────────────────────────────────────────────────┘

The Core Libraries

  • NumPy: The foundation. It provides the powerful ndarray object for efficient numerical computation. The rest of this stack is built on top of it.
  • Pandas: The data manipulation powerhouse. It gives you the DataFrame, a spreadsheet-like object for cleaning, filtering, transforming, and exploring data.
  • Matplotlib/Seaborn: The visualization duo. Matplotlib is the low-level plotting library, while Seaborn provides a high-level, statistically oriented interface for creating beautiful plots with less code.
  • Scikit-learn: The “batteries-included” machine learning library. It offers a consistent API for dozens of classification, regression, and clustering models, plus tools for preprocessing and evaluation.
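
To see how the four libraries hand off to one another, here is a minimal, hypothetical sketch (the study-hours data is invented purely for illustration):

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# NumPy: raw numerical arrays
hours = np.array([1, 2, 3, 4, 5], dtype=float)
scores = np.array([52, 58, 61, 70, 74], dtype=float)

# Pandas: a labeled, spreadsheet-like view of the same data
df = pd.DataFrame({'hours': hours, 'score': scores})
print(df.describe())

# Seaborn/Matplotlib: visualize the relationship
sns.scatterplot(x='hours', y='score', data=df)
plt.show()

# Scikit-learn: fit a simple model and make a prediction
model = LinearRegression().fit(df[['hours']], df['score'])
print(model.predict(pd.DataFrame({'hours': [6]})))  # predicted score after 6 hours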

Project List


Project 1: NumPy from Scratch - The K-Nearest Neighbors Algorithm

  • File: LEARN_PYTHON_ML_BASICS.md
  • Main Programming Language: Python
  • Alternative Programming Languages: Julia, MATLAB
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: Numerical Computing / Machine Learning Fundamentals
  • Software or Tool: NumPy
  • Main Book: “Python for Data Analysis, 2nd Edition” by Wes McKinney

What you’ll build: The K-Nearest Neighbors (KNN) classification algorithm implemented entirely from scratch using only Python and the NumPy library.

Why it teaches the basics: This project forces you to understand the low-level numerical operations that underpin many ML algorithms. By the time you get to the one-line model.fit() in Scikit-learn, you’ll appreciate the array manipulations, distance calculations, and sorting that happen behind the scenes. It’s a rite of passage for understanding vector-based computation.

Core challenges you’ll face:

  • Storing data in ndarrays → maps to understanding NumPy arrays, shapes, and dtypes
  • Calculating Euclidean distance → maps to using broadcasting and vectorized operations (np.sqrt, np.sum) to avoid slow Python loops
  • Finding the ‘k’ closest neighbors → maps to using np.argsort to efficiently find the indices of the smallest distances
  • Voting for the majority class → maps to counting occurrences in a NumPy array

Key Concepts:

  • Vectorization: Performing operations on entire arrays at once instead of iterating element-by-element.
  • Broadcasting: How NumPy treats arrays with different shapes during arithmetic operations.
  • Euclidean Distance: The straight-line distance between two points in Euclidean space.
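
To make vectorization and broadcasting concrete, here is a minimal sketch (the points are invented) of the distance computation this project is built around:

import numpy as np

points = np.array([[1, 2], [2, 3], [3, 1], [4, 2]])  # shape (4, 2)
query = np.array([2.5, 1.5])                         # shape (2,)

# Broadcasting: the (2,) query is stretched across all four rows
diffs = points - query                               # shape (4, 2)

# Vectorization: square, sum each row, and take the root with no Python loop
distances = np.sqrt(np.sum(diffs**2, axis=1))        # shape (4,)
print(distances)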

Difficulty: Intermediate. Time estimate: Weekend. Prerequisites: Solid Python basics (functions, loops, lists).

Real world outcome: You’ll have a function that can predict the class of a new data point based on your training data.

import numpy as np
from collections import Counter

# You will build this function
def predict_knn(X_train, y_train, new_point, k):
    # 1. Calculate distances from new_point to all points in X_train
    distances = np.sqrt(np.sum((X_train - new_point)**2, axis=1))

    # 2. Get the indices of the k nearest neighbors
    k_nearest_indices = np.argsort(distances)[:k]

    # 3. Get the labels of those neighbors
    k_nearest_labels = y_train[k_nearest_indices]

    # 4. Return the most common label (the prediction),
    #    using collections.Counter for the majority vote
    return Counter(k_nearest_labels).most_common(1)[0][0]

# Example usage:
X_train = np.array([[1, 2], [2, 3], [3, 1], [4, 2]])
y_train = np.array([0, 0, 1, 1])  # Two classes: 0 and 1

new_point = np.array([2.5, 1.5])
prediction = predict_knn(X_train, y_train, new_point, k=3)
print(f"The predicted class is: {prediction}")

Learning milestones:

  1. You can calculate the distance between two NumPy vectors → You understand basic array arithmetic.
  2. Your distance calculation works for an entire array of vectors against one vector → You understand broadcasting.
  3. You can find the labels of the k closest points → You are using argsort correctly.
  4. You have a working predict function → You have implemented a complete, albeit simple, machine learning algorithm.

Project 2: The Data Explorer - Titanic Dataset EDA

  • File: LEARN_PYTHON_ML_BASICS.md
  • Main Programming Language: Python
  • Alternative Programming Languages: R (for comparison of data analysis tools)
  • Coolness Level: Level 2: Practical but Forgettable
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 1: Beginner
  • Knowledge Area: Exploratory Data Analysis (EDA) / Data Visualization
  • Software or Tool: Pandas, Matplotlib, Seaborn
  • Main Book: “Python for Data Analysis, 2nd Edition” by Wes McKinney

What you’ll build: A Jupyter Notebook that loads the famous Titanic dataset and uses Pandas and Seaborn to explore the data, find patterns, and answer questions like “Did women and children really have a better chance of survival?”

Why it teaches the basics: This project is a perfect introduction to the 80% of data science work that isn’t model training: data cleaning, manipulation, and visualization. You’ll learn how to use the Pandas DataFrame as your primary tool for wrangling data and Seaborn for creating insightful plots with minimal code.

Core challenges you’ll face:

  • Loading and inspecting data → maps to pd.read_csv(), .head(), .info(), and .describe()
  • Handling missing values → maps to finding NaNs with .isnull().sum() and deciding whether to fill them (.fillna()) or drop them
  • Answering questions with data → maps to using .groupby() and .value_counts() to create summary statistics
  • Visualizing relationships → maps to using seaborn.countplot, seaborn.histplot, and seaborn.heatmap to “see” the data
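
The missing-value step from the list above is worth trying in isolation; a minimal sketch might look like this (column names assume the common Kaggle version of the dataset):

import pandas as pd

df = pd.read_csv('titanic.csv')

# Count missing values per column
print(df.isnull().sum())

# Fill missing ages with the median age; drop rows that lack 'Embarked'
df['Age'] = df['Age'].fillna(df['Age'].median())
df = df.dropna(subset=['Embarked'])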

Key Concepts:

  • DataFrame: The core data structure of Pandas.
  • Exploratory Data Analysis (EDA): The process of summarizing the main characteristics of a dataset, often with visual methods.
  • Feature Engineering: Creating new input features from existing ones.

Difficulty: Beginner. Time estimate: Weekend. Prerequisites: Basic Python.

Real world outcome: A well-documented Jupyter Notebook with compelling visualizations that tell a story about the Titanic disaster.

Example Code Snippets in your Notebook:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the data
df = pd.read_csv('titanic.csv')

# Inspect the data
print(df.info())

# Ask a question: What was the survival rate by gender?
print(df.groupby('Sex')['Survived'].mean())

# Visualize it
sns.countplot(x='Survived', hue='Sex', data=df)
plt.show()

# Visualize the age distribution of passengers
sns.histplot(df['Age'].dropna(), kde=True)
plt.show()

# Create a new feature for family size
df['FamilySize'] = df['SibSp'] + df['Parch'] + 1

# See correlations between numeric features
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm')
plt.show()

Learning milestones:

  1. You can load a CSV into a Pandas DataFrame and inspect its properties → You’ve mastered the first step.
  2. You can identify and handle missing data in the ‘Age’ column → You are learning data cleaning.
  3. You can create a bar chart showing survival rates by class → You are using visualization to answer questions.
  4. You can create a new ‘FamilySize’ feature → You understand basic feature engineering.

Project 3: The Predictor - House Price Regression

  • File: LEARN_PYTHON_ML_BASICS.md
  • Main Programming Language: Python
  • Alternative Programming Languages: R
  • Coolness Level: Level 2: Practical but Forgettable
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: Machine Learning / Regression
  • Software or Tool: Scikit-learn, NumPy, Pandas
  • Main Book: “Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow” by Aurélien Géron

What you’ll build: An end-to-end machine learning model that predicts median house values in California using the Scikit-learn library.

Why it teaches the basics: This project introduces you to the core, consistent API of Scikit-learn, the most important ML library for beginners. You’ll learn the fit/transform/predict pattern, how to split your data correctly, and how to evaluate your model’s performance for a regression problem.

Core challenges you’ll face:

  • Splitting data → maps to using train_test_split to create training and testing sets to prevent data leakage
  • Feature Scaling → maps to using StandardScaler to standardize features (zero mean, unit variance), a crucial step for many algorithms
  • The Scikit-learn API → maps to instantiating a model, training it with .fit(), and making predictions with .predict()
  • Evaluating performance → maps to using regression metrics like Mean Squared Error (mean_squared_error)

Key Concepts:

  • Regression: Predicting a continuous numerical value.
  • Train-Test Split: The practice of separating data into a training set and a test set to evaluate a model’s performance on unseen data.
  • Feature Scaling: Rescaling numerical features onto a comparable scale so that features with large magnitudes do not dominate the model.
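
As a rough illustration of standardization (the values are invented; Scikit-learn’s StandardScaler applies the same mean/standard-deviation recipe to each feature):

import numpy as np

incomes = np.array([30_000, 45_000, 60_000, 120_000], dtype=float)

# Standardize: subtract the mean, divide by the standard deviation
scaled = (incomes - incomes.mean()) / incomes.std()
print(scaled)  # now centered on 0 with unit variance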

Difficulty: Intermediate. Time estimate: Weekend. Prerequisites: Projects 1 and 2, or a basic understanding of NumPy/Pandas.

Real world outcome: A trained model that can take in housing data and predict a price, along with a number that tells you how accurate the model is.

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# 1. Load data
housing = fetch_california_housing(as_frame=True)
X, y = housing.data, housing.target

# 2. Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 3. Preprocess data (Scale features)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test) # Use the scaler fitted on the training data!

# 4. Train model
model = LinearRegression()
model.fit(X_train_scaled, y_train)

# 5. Evaluate model
predictions = model.predict(X_test_scaled)
mse = mean_squared_error(y_test, predictions)
print(f"Mean Squared Error: {mse}")

# 6. Predict on a new sample
# (You'd need a new, scaled sample of data here)
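# For example (an illustrative sketch, not part of the original outline),
# predict the value of the first house in the test set:
sample = X_test_scaled[[0]]  # keep a 2-D shape: (1, n_features)
print(f"Predicted value: {model.predict(sample)[0]:.2f}")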

Learning milestones:

  1. You can split your data into training and testing sets → You understand the importance of validation on unseen data.
  2. You correctly fit the StandardScaler on the training data and use it to transform both sets → You have avoided the common data leakage pitfall.
  3. You can train a LinearRegression model using the .fit() method → You have mastered the core Scikit-learn API.
  4. You can evaluate your model’s performance using an appropriate metric → You know how to measure success.

Project 4: The Classifier - Iris Flower Species

  • File: LEARN_PYTHON_ML_BASICS.md
  • Main Programming Language: Python
  • Alternative Programming Languages: R
  • Coolness Level: Level 1: Pure Corporate Snoozefest (but essential)
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: Machine Learning / Classification
  • Software or Tool: Scikit-learn, Seaborn
  • Main Book: “Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow” by Aurélien Géron

What you’ll build: A model that can classify an iris flower into one of three species based on the dimensions of its petals and sepals.

Why it teaches the basics: This project reinforces the Scikit-learn API but applies it to classification, the other major type of supervised learning. You’ll learn that the training process is identical (.fit(), .predict()), but the models and evaluation metrics are different. This drives home the consistency and power of the Scikit-learn API design.

Core challenges you’ll face:

  • Applying the same API to a new problem → maps to recognizing the universal fit/predict pattern
  • Trying different classification algorithms → maps to swapping out LogisticRegression for KNeighborsClassifier or SVC
  • Using classification-specific metrics → maps to understanding accuracy_score, confusion_matrix, and classification_report
  • Visualizing the results → maps to using a confusion matrix heatmap to see what your model is getting wrong
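
For instance, the “swap the algorithm” experiment above can be a minimal loop like this sketch (the model choices and parameters are illustrative):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Only the constructor changes between experiments; fit/score stay the same
for model in (LogisticRegression(max_iter=1000), KNeighborsClassifier(n_neighbors=3), SVC()):
    score = model.fit(X_train, y_train).score(X_test, y_test)
    print(f"{type(model).__name__}: {score:.3f}")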

Key Concepts:

  • Classification: Predicting a discrete category or class label.
  • Accuracy: A common (but sometimes misleading) metric for classification performance.
  • Confusion Matrix: A table that breaks down the performance of a classification model, showing true positives, true negatives, false positives, and false negatives.
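
To see why accuracy alone can mislead, consider a hypothetical imbalanced problem (the numbers are invented):

import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# 95 negatives and 5 positives; a lazy model predicts "negative" every time
y_true = np.array([0] * 95 + [1] * 5)
y_pred = np.zeros(100, dtype=int)

print(accuracy_score(y_true, y_pred))    # 0.95 -- looks great...
print(confusion_matrix(y_true, y_pred))  # ...but every positive case is missed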

Difficulty: Intermediate. Time estimate: A few hours. Prerequisites: Project 3.

Real world outcome: A trained classifier and a report that shows how accurately it can identify iris species.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import seaborn as sns
import matplotlib.pyplot as plt

# 1. Load and split data
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# 2. Train model (the iris features are all in centimeters and similar in scale, so we skip scaling here)
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)

# 3. Evaluate model
predictions = model.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, predictions)}")
print("\nClassification Report:")
print(classification_report(y_test, predictions))

# 4. Visualize Confusion Matrix
cm = confusion_matrix(y_test, predictions)
sns.heatmap(cm, annot=True, fmt='d')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()

Learning milestones:

  1. You have trained a classification model → You can now solve a new class of problems.
  2. You can swap one algorithm for another with only one line of code change → You appreciate the consistent API of Scikit-learn.
  3. You can produce and interpret a classification report → You can measure classification performance beyond simple accuracy.
  4. You can create and interpret a confusion matrix → You can visually diagnose where your model is making mistakes.

Summary

Project                           Main Libraries    Difficulty     Key Takeaway
1. NumPy from Scratch: KNN        NumPy             Intermediate   Understanding the low-level numerical computing behind ML.
2. The Data Explorer: Titanic     Pandas, Seaborn   Beginner       How to clean, explore, and visualize a dataset.
3. The Predictor: House Prices    Scikit-learn      Intermediate   The end-to-end workflow for a regression problem.
4. The Classifier: Iris Flowers   Scikit-learn      Intermediate   Adapting the workflow for a classification problem.

This learning path provides a comprehensive introduction to the foundational tools and concepts of classical machine learning in Python.
