Learn Basic Python ML: From NumPy to Scikit-learn
Goal: Master the foundational libraries of the Python machine learning stack. Learn the end-to-end workflow of a classical ML project: loading and cleaning data, visualization, model training, and evaluation.
Why Learn the Python ML Stack?
Python has become the undisputed lingua franca of machine learning. Its power comes from a rich ecosystem of libraries that provide the building blocks for everything from simple data analysis to complex artificial intelligence. Understanding this core stack is the first and most crucial step into the world of data science and AI.
By the end of these projects, you will be able to:
- Manipulate and analyze datasets with NumPy and Pandas.
- Create insightful data visualizations with Matplotlib and Seaborn.
- Train, evaluate, and use predictive models with Scikit-learn.
- Understand the complete workflow of a typical machine learning project.
Core Concept Analysis
The Machine Learning Workflow
A typical machine learning project follows a standard sequence of steps. Our projects will be structured around this workflow, introducing the right library for each job.
┌────────────────────────────────────────────────────────┐
│ 1. Problem Definition & Data Gathering                 │
│    "What question are we trying to answer?"            │
└────────────────────────────────────────────────────────┘
                            │
                            ▼
┌────────────────────────────────────────────────────────┐
│ 2. Exploratory Data Analysis (EDA)                     │
│    • Load, clean, and explore the data.                │
│    • Tools: Pandas, Matplotlib, Seaborn                │
└────────────────────────────────────────────────────────┘
                            │
                            ▼
┌────────────────────────────────────────────────────────┐
│ 3. Data Preprocessing & Feature Engineering            │
│    • Handle missing values, scale numbers, encode text.│
│    • Tools: Scikit-learn, Pandas                       │
└────────────────────────────────────────────────────────┘
                            │
                            ▼
┌────────────────────────────────────────────────────────┐
│ 4. Model Training                                      │
│    • Choose an algorithm and train it on the data.     │
│    • Tools: Scikit-learn                               │
└────────────────────────────────────────────────────────┘
                            │
                            ▼
┌────────────────────────────────────────────────────────┐
│ 5. Model Evaluation                                    │
│    • Test the model on unseen data. Check metrics.     │
│    • Tools: Scikit-learn                               │
└────────────────────────────────────────────────────────┘
                            │
                            ▼
┌────────────────────────────────────────────────────────┐
│ 6. Prediction / Inference                              │
│    • Use the trained model on new data.                │
└────────────────────────────────────────────────────────┘
The Core Libraries
- NumPy: The foundation. It provides the powerful `ndarray` object for efficient numerical computation. All the other libraries are built on top of it.
- Pandas: The data-manipulation powerhouse. It gives you the `DataFrame`, a spreadsheet-like object for cleaning, filtering, transforming, and exploring data.
- Matplotlib/Seaborn: The visualization duo. Matplotlib is the low-level plotting library, while Seaborn provides a high-level, statistically oriented interface for creating beautiful plots with less code.
- Scikit-learn: The “batteries-included” machine learning library. It offers a consistent API for dozens of classification, regression, and clustering models, plus tools for preprocessing and evaluation.
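These libraries are almost always imported under conventional short aliases. The snippets in the projects below assume a preamble like this:

```python
# Conventional aliases used across the Python data stack
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Scikit-learn is imported piece by piece from its submodules, e.g.:
from sklearn.model_selection import train_test_split
```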
Project List
Project 1: NumPy from Scratch - The K-Nearest Neighbors Algorithm
- File: LEARN_PYTHON_ML_BASICS.md
- Main Programming Language: Python
- Alternative Programming Languages: Julia, MATLAB
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 2: Intermediate
- Knowledge Area: Numerical Computing / Machine Learning Fundamentals
- Software or Tool: NumPy
- Main Book: “Python for Data Analysis, 2nd Edition” by Wes McKinney
What you’ll build: The K-Nearest Neighbors (KNN) classification algorithm implemented entirely from scratch using only Python and the NumPy library.
Why it teaches the basics: This project forces you to understand the low-level numerical operations that underpin many ML algorithms. Before you ever use the one-line `model.fit()` from Scikit-learn, you’ll see the array manipulations, distance calculations, and sorting that happen behind the scenes. It’s a rite of passage for understanding vector-based computation.
Core challenges you’ll face:
- Storing data in `ndarray`s → maps to understanding NumPy arrays, shapes, and dtypes
- Calculating Euclidean distance → maps to using broadcasting and vectorized operations (`np.sqrt`, `np.sum`) to avoid slow Python loops
- Finding the ‘k’ closest neighbors → maps to using `np.argsort` to efficiently find the indices of the smallest distances
- Voting for the majority class → maps to counting occurrences in a NumPy array
Key Concepts:
- Vectorization: Performing operations on entire arrays at once instead of iterating element-by-element.
- Broadcasting: How NumPy treats arrays with different shapes during arithmetic operations.
- Euclidean Distance: The straight-line distance between two points in Euclidean space.
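To make vectorization and broadcasting concrete, here is a small sketch (array names are illustrative) comparing a pure-Python loop with the equivalent NumPy one-liner:

```python
import numpy as np

X = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 1.0]])  # three 2-D points
p = np.array([2.5, 1.5])                             # one query point

# Loop version: one distance at a time (slow for large arrays)
loop_dists = [sum((xi - pi) ** 2 for xi, pi in zip(row, p)) ** 0.5 for row in X]

# Vectorized version: X - p broadcasts p across every row of X,
# and the squaring/summing runs in optimized C code
vec_dists = np.sqrt(np.sum((X - p) ** 2, axis=1))

print(loop_dists)  # [1.5811..., 1.5811..., 0.7071...]
print(vec_dists)   # same distances, computed in one shot
```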
Difficulty: Intermediate. Time estimate: A weekend. Prerequisites: Solid Python basics (functions, loops, lists).
Real world outcome: You’ll have a function that can predict the class of a new data point based on your training data.
```python
import numpy as np
from collections import Counter

# You will build this function
def predict_knn(X_train, y_train, new_point, k):
    # 1. Calculate distances from new_point to all points in X_train
    #    (broadcasting subtracts new_point from every row at once)
    distances = np.sqrt(np.sum((X_train - new_point) ** 2, axis=1))
    # 2. Get the indices of the k nearest neighbors
    k_nearest_indices = np.argsort(distances)[:k]
    # 3. Get the labels of those neighbors
    k_nearest_labels = y_train[k_nearest_indices]
    # 4. Return the most common label (the prediction)
    return Counter(k_nearest_labels).most_common(1)[0][0]

# Example usage:
X_train = np.array([[1, 2], [2, 3], [3, 1], [4, 2]])
y_train = np.array([0, 0, 1, 1])  # Two classes: 0 and 1
new_point = np.array([2.5, 1.5])

prediction = predict_knn(X_train, y_train, new_point, k=3)
print(f"The predicted class is: {prediction}")
```
Learning milestones:
- You can calculate the distance between two NumPy vectors → You understand basic array arithmetic.
- Your distance calculation works for an entire array of vectors against one vector → You understand broadcasting.
- You can find the labels of the `k` closest points → You are using `argsort` correctly.
- You have a working `predict_knn` function → You have implemented a complete, albeit simple, machine learning algorithm.
Project 2: The Data Explorer - Titanic Dataset EDA
- File: LEARN_PYTHON_ML_BASICS.md
- Main Programming Language: Python
- Alternative Programming Languages: R (for comparison of data analysis tools)
- Coolness Level: Level 2: Practical but Forgettable
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 1: Beginner
- Knowledge Area: Exploratory Data Analysis (EDA) / Data Visualization
- Software or Tool: Pandas, Matplotlib, Seaborn
- Main Book: “Python for Data Analysis, 2nd Edition” by Wes McKinney
What you’ll build: A Jupyter Notebook that loads the famous Titanic dataset and uses Pandas and Seaborn to explore the data, find patterns, and answer questions like “Did women and children really have a better chance of survival?”
Why it teaches the basics: This project is a perfect introduction to the 80% of data science work that isn’t model training: data cleaning, manipulation, and visualization. You’ll learn how to use the Pandas DataFrame as your primary tool for wrangling data and Seaborn for creating insightful plots with minimal code.
Core challenges you’ll face:
- Loading and inspecting data → maps to `pd.read_csv()`, `.head()`, `.info()`, and `.describe()`
- Handling missing values → maps to finding `NaN`s with `.isnull().sum()` and deciding whether to fill them (`.fillna()`) or drop them (see the sketch after this list)
- Answering questions with data → maps to using `.groupby()` and value counts to create summary statistics
- Visualizing relationships → maps to using `seaborn.countplot`, `seaborn.histplot`, and `seaborn.heatmap` to “see” the data
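As a concrete example of those missing-value decisions, here is a minimal sketch, assuming the standard Kaggle Titanic CSV (where Age has some gaps and Cabin is mostly empty):

```python
import pandas as pd

df = pd.read_csv('titanic.csv')

# Count missing values per column to decide what to do
print(df.isnull().sum())

# Age: numeric with moderate gaps -> fill with the median
df['Age'] = df['Age'].fillna(df['Age'].median())

# Cabin: mostly missing -> often simpler to drop the column
df = df.drop(columns=['Cabin'])
```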
Key Concepts:
- DataFrame: The core data structure of Pandas.
- Exploratory Data Analysis (EDA): The process of summarizing the main characteristics of a dataset, often with visual methods.
- Feature Engineering: Creating new input features from existing ones.
Difficulty: Beginner. Time estimate: A weekend. Prerequisites: Basic Python.
Real world outcome: A well-documented Jupyter Notebook with compelling visualizations that tell a story about the Titanic disaster.
Example Code Snippets in your Notebook:
```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the data
df = pd.read_csv('titanic.csv')

# Inspect the data
print(df.info())

# Ask a question: what was the survival rate by gender?
print(df.groupby('Sex')['Survived'].mean())

# Visualize it
sns.countplot(x='Survived', hue='Sex', data=df)
plt.show()

# Visualize the age distribution of passengers
sns.histplot(df['Age'].dropna(), kde=True)
plt.show()

# Create a new feature for family size
df['FamilySize'] = df['SibSp'] + df['Parch'] + 1

# See correlation between numeric features
# (numeric_only=True avoids errors from text columns like Name)
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm')
plt.show()
```
Learning milestones:
- You can load a CSV into a Pandas DataFrame and inspect its properties → You’ve mastered the first step.
- You can identify and handle missing data in the ‘Age’ column → You are learning data cleaning.
- You can create a bar chart showing survival rates by class → You are using visualization to answer questions.
- You can create a new ‘FamilySize’ feature → You understand basic feature engineering.
Project 3: The Predictor - House Price Regression
- File: LEARN_PYTHON_ML_BASICS.md
- Main Programming Language: Python
- Alternative Programming Languages: R
- Coolness Level: Level 2: Practical but Forgettable
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 2: Intermediate
- Knowledge Area: Machine Learning / Regression
- Software or Tool: Scikit-learn, NumPy, Pandas
- Main Book: “Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow” by Aurélien Géron
What you’ll build: An end-to-end machine learning model that predicts median house values in California using the Scikit-learn library.
Why it teaches the basics: This project introduces you to the core, consistent API of Scikit-learn, the most important ML library for beginners. You’ll learn the fit/transform/predict pattern, how to split your data correctly, and how to evaluate your model’s performance for a regression problem.
Core challenges you’ll face:
- Splitting data → maps to using `train_test_split` to create training and testing sets to prevent data leakage
- Feature Scaling → maps to using `StandardScaler` to normalize features, a crucial step for many algorithms
- The Scikit-learn API → maps to instantiating a model, training it with `.fit()`, and making predictions with `.predict()`
- Evaluating performance → maps to using regression metrics like Mean Squared Error (`mean_squared_error`)
Key Concepts:
- Regression: Predicting a continuous numerical value.
- Train-Test Split: The practice of separating data into a training set and a test set to evaluate a model’s performance on unseen data.
- Feature Scaling: Scaling numerical features to a standard range to prevent features with large scales from dominating the model.
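Under the hood, `StandardScaler` simply subtracts each feature’s training-set mean and divides by its standard deviation. A quick sketch (with a made-up toy array) to verify the formula with plain NumPy:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])

scaled = StandardScaler().fit_transform(X)
manual = (X - X.mean(axis=0)) / X.std(axis=0)  # the same formula by hand

print(np.allclose(scaled, manual))  # True
```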
Difficulty: Intermediate. Time estimate: A weekend. Prerequisites: Projects 1 and 2, or a basic understanding of NumPy/Pandas.
Real world outcome: A trained model that can take in housing data and predict a price, along with a number that tells you how accurate the model is.
```python
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# 1. Load data
housing = fetch_california_housing(as_frame=True)
X, y = housing.data, housing.target

# 2. Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 3. Preprocess data (scale features)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # Use the scaler fitted on the training data!

# 4. Train model
model = LinearRegression()
model.fit(X_train_scaled, y_train)

# 5. Evaluate model
predictions = model.predict(X_test_scaled)
mse = mean_squared_error(y_test, predictions)
print(f"Mean Squared Error: {mse:.3f}")

# 6. Predict on a new sample (it must be scaled with the same scaler)
new_sample = scaler.transform(X_test.iloc[[0]])
print(f"Predicted median house value: {model.predict(new_sample)[0]:.2f}")
```
Learning milestones:
- You can split your data into training and testing sets → You understand the importance of validation on unseen data.
- You correctly fit the `StandardScaler` on the training data and use it to transform both sets → You have avoided the common data leakage pitfall.
- You can train a `LinearRegression` model using the `.fit()` method → You have mastered the core Scikit-learn API.
- You can evaluate your model’s performance using an appropriate metric → You know how to measure success.
Project 4: The Classifier - Iris Flower Species
- File: LEARN_PYTHON_ML_BASICS.md
- Main Programming Language: Python
- Alternative Programming Languages: R
- Coolness Level: Level 1: Pure Corporate Snoozefest (but essential)
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 2: Intermediate
- Knowledge Area: Machine Learning / Classification
- Software or Tool: Scikit-learn, Seaborn
- Main Book: “Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow” by Aurélien Géron
What you’ll build: A model that can classify an iris flower into one of three species based on the dimensions of its petals and sepals.
Why it teaches the basics: This project reinforces the Scikit-learn API but applies it to classification, the other major type of supervised learning. You’ll learn that the training process is identical (.fit(), .predict()), but the models and evaluation metrics are different. This drives home the consistency and power of the Scikit-learn API design.
Core challenges you’ll face:
- Applying the same API to a new problem → maps to recognizing the universal `fit`/`predict` pattern
- Trying different classification algorithms → maps to swapping out `LogisticRegression` for `KNeighborsClassifier` or `SVC` (see the sketch after this list)
- Using classification-specific metrics → maps to understanding `accuracy_score`, `confusion_matrix`, and `classification_report`
- Visualizing the results → maps to using a confusion matrix heatmap to see what your model is getting wrong
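Because every Scikit-learn estimator shares the `fit`/`predict` interface, comparing algorithms really is a one-line change. A minimal sketch on the iris data (exact accuracies will depend on the split):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Swapping the estimator is the only change between experiments
for model in (KNeighborsClassifier(n_neighbors=3),
              LogisticRegression(max_iter=1000),
              SVC()):
    model.fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    print(f"{type(model).__name__}: {acc:.3f}")
```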
Key Concepts:
- Classification: Predicting a discrete category or class label.
- Accuracy: A common (but sometimes misleading) metric for classification performance.
- Confusion Matrix: A table that breaks down the performance of a classification model, showing true positives, true negatives, false positives, and false negatives.
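To see how a confusion matrix is laid out (Scikit-learn’s convention: rows are actual classes, columns are predicted), here is a tiny hand-checkable example with made-up labels:

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

print(confusion_matrix(y_true, y_pred))
# [[1 1 0]    row 0: one class-0 sample correct, one mislabeled as 1
#  [0 2 0]    row 1: both class-1 samples correct
#  [1 0 1]]   row 2: one correct, one mislabeled as 0
```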
Difficulty: Intermediate. Time estimate: A few hours. Prerequisites: Project 3.
Real world outcome: A trained classifier and a report that shows how accurately it can identify iris species.
```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import seaborn as sns
import matplotlib.pyplot as plt

# 1. Load and split data
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# 2. Train model (the iris features share similar scales, so scaling is optional here)
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)

# 3. Evaluate model
predictions = model.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, predictions)}")
print("\nClassification Report:")
print(classification_report(y_test, predictions))

# 4. Visualize the confusion matrix
cm = confusion_matrix(y_test, predictions)
sns.heatmap(cm, annot=True, fmt='d')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()
```
Learning milestones:
- You have trained a classification model → You can now solve a new class of problems.
- You can swap one algorithm for another with only one line of code change → You appreciate the consistent API of Scikit-learn.
- You can produce and interpret a classification report → You can measure classification performance beyond simple accuracy.
- You can create and interpret a confusion matrix → You can visually diagnose where your model is making mistakes.
Summary
| Project | Main Libraries | Difficulty | Key Takeaway |
|---|---|---|---|
| 1. NumPy from Scratch: KNN | NumPy | Intermediate | Understanding the low-level numerical computing behind ML. |
| 2. The Data Explorer: Titanic | Pandas, Seaborn | Beginner | How to clean, explore, and visualize a dataset. |
| 3. The Predictor: House Prices | Scikit-learn | Intermediate | The end-to-end workflow for a regression problem. |
| 4. The Classifier: Iris Flowers | Scikit-learn | Intermediate | Adapting the workflow for a classification problem. |
This learning path provides a comprehensive introduction to the foundational tools and concepts of classical machine learning in Python.