Learn Statistics: From Scratch to Data-Driven Thinking
Goal: Build a strong, intuitive understanding of statistics from the ground up. You will go from basic data summarization to making informed decisions and predictions, using practical, real-world projects.
Why Learn Statistics?
Statistics is the science of learning from data. It’s the foundation of data science, machine learning, and any field where decisions are made under uncertainty. Most developers have a vague notion of it, but few can wield it confidently.
- Become a Better Thinker: Learn to spot biases, question assumptions, and separate signal from noise in any context.
- Unlock Data Science: Statistics is the “why” behind the “how” of machine learning algorithms.
- Build Smarter Products: Use A/B testing and data analysis to build things your users actually want.
- Win Arguments with Data: Move from “I think” to “I can show you.”
After completing these projects, you will:
- Intuitively understand core concepts like probability distributions, confidence intervals, and p-values.
- Be able to clean, analyze, and visualize datasets with professional tools like Python’s Pandas and Matplotlib.
- Confidently perform hypothesis tests to make data-driven decisions.
- Build and interpret simple predictive models using linear regression.
- Never look at a news headline citing a “study” the same way again.
Core Concept Analysis
The Statistics Learning Ladder
Your journey will follow two main paths: Descriptive Statistics (describing what you see) and Inferential Statistics (making guesses about what you can’t see).
┌─────────────────────────────────────────────────────────────┐
│ 6. REGRESSION │
│ (Predicting values, e.g., house price from size) │
└─────────────────────────────────────────────────────────────┘
▲
│
┌─────────────────────────────────────────────────────────────┐
│ 5. INFERENTIAL STATISTICS │
│ (Hypothesis Testing, A/B Tests, p-values, CI) │
│ (Is this new drug better than the old one?) │
└─────────────────────────────────────────────────────────────┘
▲
│
┌─────────────────────────────────────────────────────────────┐
│ 4. PROBABILITY DISTRIBUTIONS │
│ (Normal, Binomial, Poisson - Modeling random events) │
│ (What's the range of likely outcomes for coin flips?) │
└─────────────────────────────────────────────────────────────┘
▲
│
┌─────────────────────────────────────────────────────────────┐
│ 3. PROBABILITY │
│ (The language of uncertainty, Bayes' Theorem) │
│ (What are the chances of drawing a red card?) │
└─────────────────────────────────────────────────────────────┘
▲
│
┌─────────────────────────────────────────────────────────────┐
│ 1 & 2. DESCRIPTIVE STATISTICS │
│ (Mean, Median, Mode, Variance, Std Dev, Histograms) │
│ (What does our data look like?) │
└─────────────────────────────────────────────────────────────┘
Key Concepts Explained
1. Descriptive Statistics (The “What”)
These tools summarize a dataset into a few key numbers.
- Measures of Central Tendency: Where is the “center” of the data?
- Mean: The average. Sum of all values / number of values. Sensitive to outliers.
- Median: The middle value when sorted. Robust to outliers.
- Mode: The most frequent value.
- Measures of Dispersion (Spread): How spread out is the data?
- Range: Maximum value - Minimum value. Very simple.
- Variance (σ²): The average of the squared differences from the Mean. Hard to interpret directly.
- Standard Deviation (σ): The square root of the variance. Easy to interpret as it’s in the original units of the data. A low SD means data is clustered around the mean.
- Quartiles & IQR: Quartiles divide the data into four equal parts. The Interquartile Range (IQR = Q3 - Q1) covers the middle 50% of the data.
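To see these numbers come out of real code, here is a minimal sketch using NumPy on a small made-up dataset (the values are invented for illustration):

```python
import numpy as np

# Hypothetical "hours slept" values, purely for illustration
data = np.array([4.0, 6.0, 7.0, 7.5, 7.5, 8.0, 9.0])

mean = data.mean()                       # sum of values / number of values
median = np.median(data)                 # middle value when sorted
variance = data.var(ddof=1)              # squared differences from the mean, averaged with n-1 (sample variance)
std_dev = data.std(ddof=1)               # square root of the variance, in the original units
q1, q3 = np.percentile(data, [25, 75])   # first and third quartiles
iqr = q3 - q1                            # the middle 50% of the data

print(f"mean={mean:.2f}  median={median:.2f}  std={std_dev:.2f}  IQR={iqr:.2f}")
```

Pandas Series expose the same operations (`.mean()`, `.median()`, `.std()`, `.quantile()`), which is what Project 1 below uses.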
2. Probability (The “Maybe”)
Probability is a number between 0 (impossible) and 1 (certain) that represents the likelihood of an event.
- Key idea: If we repeat an experiment many times, the proportion of times an event occurs will approach its probability.
- Conditional Probability: The probability of event A happening, given that event B has already happened. Written as P(A|B).
- Bayes’ Theorem: A revolutionary idea that lets you update your beliefs in light of new evidence. It connects P(A|B) with P(B|A). It’s the engine behind medical diagnoses and spam filters.
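Here is a tiny worked example of that belief update, with invented numbers purely for illustration: suppose 1% of messages are spam, and the word "prize" appears in 20% of spam but only 1% of ham.

```python
# Invented numbers, purely to illustrate Bayes' Theorem
p_spam = 0.01             # P(Spam): prior probability that a message is spam
p_word_given_spam = 0.20  # P("prize" | Spam)
p_word_given_ham = 0.01   # P("prize" | Ham)

# Total probability of seeing the word at all: P(Word)
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)

# Bayes' Theorem: P(Spam | Word) = P(Word | Spam) * P(Spam) / P(Word)
p_spam_given_word = p_word_given_spam * p_spam / p_word
print(f"P(Spam | 'prize') = {p_spam_given_word:.1%}")  # roughly 17%
```

Even a word that is twenty times more common in spam only pushes the posterior to about 17%, because the prior (1% spam) is so low. That is exactly the kind of belief-updating the theorem formalizes.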
3. Probability Distributions (The “Shape”)
A distribution describes the probabilities of all possible outcomes.
- Normal Distribution (The “Bell Curve”): Describes many natural phenomena (heights, blood pressure). Defined by its mean and standard deviation.
- Binomial Distribution: Describes the number of successes in a fixed number of independent trials (e.g., number of heads in 10 coin flips).
- Poisson Distribution: Describes the number of events in a fixed interval of time or space, if these events happen at a known average rate (e.g., number of customers arriving at a store in an hour).
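A quick way to build intuition for these shapes is to sample from them; a minimal sketch with NumPy (the parameter values are arbitrary examples):

```python
import numpy as np

rng = np.random.default_rng(0)

# Normal: e.g. adult heights in cm (mean 175, std 7) -- illustrative parameters
heights = rng.normal(loc=175, scale=7, size=10_000)

# Binomial: number of heads in 10 fair coin flips, repeated 10,000 times
heads = rng.binomial(n=10, p=0.5, size=10_000)

# Poisson: customers per hour when the average rate is 12 per hour
arrivals = rng.poisson(lam=12, size=10_000)

print(heights.mean(), heads.mean(), arrivals.mean())  # roughly 175, 5, 12
```

Plotting a histogram of each array shows the bell curve, the symmetric count distribution around 5 heads, and the slightly skewed Poisson shape.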
4. Inferential Statistics (The “Guess”)
This is where we use data from a small sample to make an educated guess (an inference) about a large population.
- Central Limit Theorem (CLT): The most important idea in statistics. It states that if you take many large enough samples from almost any population (one with finite variance), the distribution of the sample means will be approximately normal. This is what allows us to make inferences.
- Confidence Interval (CI): An estimated range of values which is likely to include the true population parameter (e.g., “We are 95% confident that the true average height of all men is between 5′9″ and 5′11″”).
- Hypothesis Testing: A formal procedure for checking if your data supports a certain hypothesis.
- Null Hypothesis (H₀): The default assumption, usually stating “no effect” or “no difference” (e.g., “This new drug has no effect”).
- Alternative Hypothesis (H₁): The claim you want to prove (e.g., “This new drug reduces recovery time”).
- p-value: The probability of observing your data (or something more extreme), if the null hypothesis were true. A small p-value (typically < 0.05) suggests that your observation is surprising under the null hypothesis, providing evidence against it.
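You can watch the Central Limit Theorem happen with a short simulation; a minimal sketch, using a deliberately skewed exponential population chosen just for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# A skewed, decidedly non-normal population
population = rng.exponential(scale=2.0, size=100_000)

# Take many samples and record each sample's mean
sample_means = np.array([
    rng.choice(population, size=50).mean()
    for _ in range(5_000)
])

# The distribution of sample means is approximately normal,
# centered on the population mean, with a much smaller spread.
print(f"population mean:      {population.mean():.2f}")
print(f"mean of sample means: {sample_means.mean():.2f}")
print(f"std of sample means:  {sample_means.std():.2f}")  # roughly population std / sqrt(50)
```

Histogram the raw population and then the sample means: the first is lopsided, the second looks like a bell curve. That second, narrower bell curve is what confidence intervals and hypothesis tests are built on.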
Project List
These projects are designed to be done in order, building your skills from the ground up using Python, a powerful and beginner-friendly tool for statistics.
Project 1: Personal Data Dashboard
- File: LEARN_STATISTICS_FROM_SCRATCH.md
- Main Programming Language: Python
- Alternative Programming Languages: R, Google Sheets
- Coolness Level: Level 2: Practical but Forgettable
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 1: Beginner
- Knowledge Area: Descriptive Statistics / Data Visualization
- Software or Tool: Python, Pandas, Matplotlib
- Main Book: Think Stats, 2nd Edition by Allen B. Downey
What you’ll build: A simple script that ingests data you’ve collected about yourself (e.g., hours slept, cups of coffee, pages read per day) and calculates key descriptive statistics, then generates a histogram and a box plot.
Why it teaches statistics: This project makes abstract concepts like mean, median, standard deviation, and quartiles tangible because they describe your own life. You’ll see firsthand how an outlier (a sleepless night) affects the mean but not the median.
Core challenges you’ll face:
- Collecting and formatting data → maps to the basics of data entry in a CSV file
- Calculating mean, median, and mode → maps to understanding central tendency
- Calculating variance and standard deviation → maps to quantifying the “spread” or “consistency” of your habits
- Creating a histogram and box plot → maps to visualizing the shape and distribution of your data
Key Concepts:
- Descriptive Statistics: “Think Stats” Ch. 2 - Allen B. Downey
- Histograms: “Think Stats” Ch. 3
- DataFrames: Pandas Documentation - 10 Minutes to pandas
- Plotting: Matplotlib Pyplot Tutorial
Difficulty: Beginner Time estimate: Weekend Prerequisites: Basic Python (variables, lists, functions).
Real world outcome: You’ll run a Python script and it will output something like this to your console and generate a plot:
Analysis of 'Hours Slept' (30 days):
- Mean: 7.2 hours
- Median: 7.5 hours
- Standard Deviation: 1.1 hours
- Min: 4.0 hours, Max: 9.0 hours
You will also see a histogram showing the distribution of your sleep hours, immediately telling you your most common sleep duration.
Implementation Hints:
- Data Collection: For one week, record a daily metric in a simple text file named my_data.csv:

```
date,hours_slept,coffees
2025-12-01,7.5,2
2025-12-02,6.0,3
2025-12-03,8.0,1
...
```

- Setup: Install the necessary Python libraries:

```
pip install pandas matplotlib
```

- Python Script:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Read the data
df = pd.read_csv('my_data.csv')

# Select the column to analyze
sleep_data = df['hours_slept']

# Calculate descriptive statistics
mean_sleep = sleep_data.mean()
median_sleep = sleep_data.median()
std_dev_sleep = sleep_data.std()

print("Analysis of 'Hours Slept':")
print(f"- Mean: {mean_sleep:.2f} hours")
# ... print other stats

# Create a histogram
plt.figure()  # Create a new figure
plt.hist(sleep_data, bins=5, edgecolor='black')
plt.title("Distribution of Hours Slept")
plt.xlabel("Hours")
plt.ylabel("Frequency")
plt.savefig("sleep_histogram.png")  # Save the plot to a file
print("Histogram saved to sleep_histogram.png")
```

Questions to guide you:
- Why might the mean and median be different? What does that tell you?
- If your standard deviation is high, what does that say about your sleep schedule’s consistency?
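The project also calls for a box plot, which the script above doesn’t draw; a minimal sketch, assuming the same my_data.csv file:

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('my_data.csv')

# A box plot shows the median, the quartiles (the box is the IQR), and flags outliers
plt.figure()
plt.boxplot(df['hours_slept'])
plt.title("Hours Slept - Box Plot")
plt.ylabel("Hours")
plt.savefig("sleep_boxplot.png")
print("Box plot saved to sleep_boxplot.png")
```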
Learning milestones:
- You successfully load a CSV into a Pandas DataFrame → You’ve learned the basic unit of data analysis in Python.
- You calculate the mean, median, and standard deviation → You can summarize any dataset.
- You generate a histogram → You can visualize the shape and frequency of your data.
- You can explain what the standard deviation means in the context of your own data → You’ve built intuition, not just memorized a formula.
Project 2: Is This Game Rigged? A Loot Box Simulator
- File: LEARN_STATISTICS_FROM_SCRATCH.md
- Main Programming Language: Python
- Alternative Programming Languages: JavaScript, C#
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 1: Beginner
- Knowledge Area: Probability / Monte Carlo Simulation
- Software or Tool: Python, Numpy
- Main Book: Think Bayes, 2nd Edition by Allen B. Downey
What you’ll build: A simulator that models opening “loot boxes” from a video game. You’ll run thousands of simulated trials to estimate the real-world cost of obtaining a rare item and visualize the distribution of outcomes.
Why it teaches statistics: This project makes probability concrete. Instead of calculating complex formulas, you’ll discover probabilities through experimentation (a “Monte Carlo method”). You’ll build a deep, intuitive understanding of expected value, probability distributions, and the law of large numbers.
Core challenges you’ll face:
- Modeling a probabilistic event → maps to using a random number generator to simulate a loot box opening
- Running thousands of simulations → maps to using loops to repeat an experiment many times
- Calculating the average outcome (Expected Value) → maps to understanding that the average of many random trials converges on a predictable value
- Visualizing the distribution of costs → maps to seeing that while the average cost might be $100, many people will pay much more
Key Concepts:
- Probability: “Think Stats” Ch. 4
- Monte Carlo Method: A Gentle Introduction to Monte Carlo Simulation
- Expected Value: Khan Academy - Expected Value
- The Law of Large Numbers: As you run more trials, the average result gets closer to the expected value.
Difficulty: Beginner Time estimate: Weekend Prerequisites: Project 1, comfort with loops and functions in Python.
Real world outcome: You’ll define the drop rates for items in a loot box (e.g., Legendary: 1%, Epic: 10%, Common: 89%). Your script will then simulate buying boxes until you get a Legendary item, and repeat this 10,000 times. It will output:
Simulating 10,000 players trying to get a Legendary item...
(Loot box cost: $1)
- Average cost: $100.23
- Cheapest attempt: $1
- Most expensive attempt: $750
- 95% of players got it for less than: $300
Distribution of costs has been saved to lootbox_costs.png
The generated histogram will visually prove that a “1% chance” doesn’t mean you’re guaranteed to get it in 100 tries.
Implementation Hints:
- Setup:

```
pip install numpy matplotlib
```

- Define Probabilities:

```python
# Item: probability
DROP_RATES = {
    'Legendary': 0.01,
    'Epic': 0.10,
    'Common': 0.89
}
LOOT_BOX_COST = 1
```

- Simulate One “Player”: Write a function simulate_one_player() that:
  - Initializes cost = 0.
  - Enters a while True loop.
  - In the loop, adds cost += LOOT_BOX_COST.
  - Generates a random number between 0 and 1: roll = np.random.rand().
  - Checks if you got the legendary: if roll < DROP_RATES['Legendary']: break.
  - Returns the final cost.
- Run Many Simulations:
  - Create an empty list all_costs = [].
  - Loop 10,000 times, calling simulate_one_player() in each iteration and appending the result to all_costs.
- Analyze and Visualize:
  - Convert all_costs to a NumPy array for easy calculations: costs_arr = np.array(all_costs).
  - Calculate the mean, max, and percentiles: np.mean(costs_arr), np.max(costs_arr), np.percentile(costs_arr, 95).
  - Create a histogram of costs_arr using Matplotlib.
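Putting those steps together, a minimal end-to-end sketch (the drop rates and box cost are the illustrative values from the hints, not data from any real game):

```python
import numpy as np
import matplotlib.pyplot as plt

DROP_RATES = {'Legendary': 0.01, 'Epic': 0.10, 'Common': 0.89}
LOOT_BOX_COST = 1
NUM_PLAYERS = 10_000

def simulate_one_player():
    """Buy boxes until a Legendary drops; return the total amount spent."""
    cost = 0
    while True:
        cost += LOOT_BOX_COST
        roll = np.random.rand()  # uniform random number in [0, 1)
        if roll < DROP_RATES['Legendary']:
            return cost

all_costs = [simulate_one_player() for _ in range(NUM_PLAYERS)]
costs_arr = np.array(all_costs)

print(f"Simulating {NUM_PLAYERS:,} players trying to get a Legendary item...")
print(f"- Average cost: ${np.mean(costs_arr):.2f}")
print(f"- Cheapest attempt: ${np.min(costs_arr)}")
print(f"- Most expensive attempt: ${np.max(costs_arr)}")
print(f"- 95% of players got it for less than: ${np.percentile(costs_arr, 95):.0f}")

plt.figure()
plt.hist(costs_arr, bins=50, edgecolor='black')
plt.title("Cost to Obtain a Legendary Item (10,000 simulated players)")
plt.xlabel("Total cost ($)")
plt.ylabel("Number of players")
plt.savefig("lootbox_costs.png")
print("Distribution of costs has been saved to lootbox_costs.png")
```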
Learning milestones:
- Your script can simulate a single probabilistic event → You understand how to model chance.
- You can simulate one player’s entire journey to getting the item → You can model a sequence of random events.
- The average cost from your simulation is close to the theoretical expected value (1 / 0.01 = 100) → You’ve witnessed the Law of Large Numbers in action.
- You can look at the histogram and explain why some players are “unlucky” and pay much more than the average → You understand the concept of a probability distribution.
Project 3: A “Dumb” Spam Filter
- File: LEARN_STATISTICS_FROM_SCRATCH.md
- Main Programming Language: Python
- Alternative Programming Languages: N/A
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 2. The “Micro-SaaS / Pro Tool”
- Difficulty: Level 2: Intermediate
- Knowledge Area: Conditional Probability / Bayes’ Theorem
- Software or Tool: Python, Pandas, Scikit-learn
- Main Book: Data Science from Scratch, 2nd Edition by Joel Grus
What you’ll build: A simple spam filter for SMS messages using the Naive Bayes algorithm. You’ll train it on a dataset of real messages labeled as “spam” or “ham” (not spam).
Why it teaches statistics: This is a classic, tangible application of Bayes’ Theorem. You will learn how to “update your beliefs” (the probability of a message being spam) based on the evidence (the words in the message). It’s a bridge from pure statistics to machine learning.
Core challenges you’ll face:
- Understanding Bayes’ Theorem intuitively → maps to P(Spam | Word) = P(Word | Spam) * P(Spam) / P(Word)
- Calculating word probabilities → maps to counting word frequencies in spam vs. ham messages
- Handling words not seen during training → maps to Laplace smoothing (adding 1 to all counts)
- Combining probabilities for multiple words → maps to the “naive” assumption of independence (and why we use log probabilities)
Key Concepts:
- Bayes’ Theorem: “Think Bayes” Ch. 1 - Allen B. Downey
- Naive Bayes Classifier: “Data Science from Scratch” Ch. 13 - Joel Grus
- Text Processing: Scikit-learn documentation on Text Feature Extraction
Difficulty: Intermediate Time estimate: 1-2 weeks Prerequisites: Project 1 & 2. Understanding of Python dictionaries.
Real world outcome: You will train your classifier. Then, you can give it a new, unseen message, and it will output a prediction:
New message: "Claim your free prize now! Click here."
Prediction: Spam (Probability: 98.7%)
New message: "Hey, are we still on for lunch tomorrow?"
Prediction: Ham (Probability: 99.9%)
Implementation Hints:
- Get Data: Find a dataset. The “SMS Spam Collection Data Set” from the UCI Machine Learning Repository is perfect for this. It’s a single file with two columns: the label (ham/spam) and the message text.
- Setup:

```
pip install pandas scikit-learn
```

- High-level plan (using Scikit-learn’s tools):
  - Load the data with Pandas.
  - Split the data into a training set and a testing set (train_test_split).
  - Create a data processing “pipeline”:
    - Step 1: Vectorizer (CountVectorizer): This tool converts your text messages into numerical data by counting the occurrences of each word.
    - Step 2: Classifier (MultinomialNB): This is the Naive Bayes algorithm.
  - Train the pipeline on your training data (pipeline.fit()).
  - Test its accuracy on your unseen test data (pipeline.score()).
- The “From Scratch” Intuition (what Scikit-learn is doing for you):
  - Calculate Prior Probabilities: What’s the overall probability of any given message being spam? P(Spam) = (Number of spam messages) / (Total number of messages).
  - Calculate Word Probabilities (Likelihoods):
    - For each word in your vocabulary, calculate P(Word | Spam) (How often does this word appear in spam messages?) and P(Word | Ham).
    - This involves counting total words in spam vs. ham and applying Laplace smoothing.
  - Classify a New Message:
    - For a new message, you want to calculate P(Spam | Message).
    - Using Bayes’ rule, this is proportional to P(Message | Spam) * P(Spam).
    - The “naive” part is assuming words are independent: P(Message | Spam) ≈ P(word1 | Spam) * P(word2 | Spam) * ...
    - To avoid numbers getting too small (“underflow”), you add the log probabilities instead of multiplying.
    - Compare the final “score” for spam vs. ham and pick the higher one.
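A minimal sketch of the Scikit-learn route described above. It assumes the dataset has been saved locally as a tab-separated file named sms_spam.tsv with the label followed by the message text; adjust the loading step to match your copy of the data:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Load the labeled messages (file name and format are assumptions -- adapt to your copy)
df = pd.read_csv('sms_spam.tsv', sep='\t', names=['label', 'message'])

# Hold out a test set so accuracy is measured on unseen messages
X_train, X_test, y_train, y_test = train_test_split(
    df['message'], df['label'], test_size=0.2, random_state=42
)

# Count word occurrences, then feed the counts to Naive Bayes
pipeline = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('classifier', MultinomialNB()),
])

pipeline.fit(X_train, y_train)
print(f"Accuracy on test set: {pipeline.score(X_test, y_test):.3f}")

# Classify new, unseen messages
new_messages = [
    "Claim your free prize now! Click here.",
    "Hey, are we still on for lunch tomorrow?",
]
for msg, pred in zip(new_messages, pipeline.predict(new_messages)):
    print(f"{pred}: {msg}")
```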
Learning milestones:
- You can calculate the prior probability of spam from the dataset → You understand base rates.
- You can calculate P("free" | Spam) and P("free" | Ham) → You understand likelihoods and how they form evidence.
- Your model correctly classifies most messages in the test set → You’ve successfully built a predictive model.
- You can explain why the model classified a message as spam by looking at the word probabilities → You understand how the algorithm “thinks”.
Project 4: Is This Die Loaded?
- File: LEARN_STATISTICS_FROM_SCRATCH.md
- Main Programming Language: Python
- Alternative Programming Languages: R
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 2: Intermediate
- Knowledge Area: Inferential Statistics / Hypothesis Testing
- Software or Tool: Python, SciPy
- Main Book: Introductory Statistics (OpenStax, free online textbook)
What you’ll build: A script that uses the Chi-Squared Goodness of Fit test to determine if a series of dice rolls deviates significantly from what you’d expect from a fair die.
Why it teaches statistics: This is a perfect introduction to hypothesis testing. You will formalize a question (“Is this die fair?”), state your null and alternative hypotheses, and use a statistical test to get a p-value that helps you make a conclusion.
Core challenges you’ll face:
- Formulating a null hypothesis → maps to stating the default assumption: “The die is fair”
- Calculating expected frequencies → maps to understanding that for a fair die, each side should appear N/6 times in N rolls
- Understanding the Chi-Squared statistic → maps to quantifying the total difference between observed and expected counts
- Interpreting the p-value → maps to answering: “If the die was fair, how likely is it we’d see a deviation this large or larger?”
Key Concepts:
- Hypothesis Testing: “Introductory Statistics” Ch. 9 - OpenStax
- Chi-Squared Goodness of Fit Test: “Introductory Statistics” Ch. 11
- p-value: A Gentle Introduction to p-values
Difficulty: Intermediate Time estimate: Weekend Prerequisites: Project 1 & 2.
Real world outcome: You will simulate rolling a die 600 times. First, a fair die, then a loaded die. Your script will analyze the results and produce a clear conclusion.
Analyzing 600 rolls of a simulated FAIR die...
Observed counts: [98, 105, 95, 102, 99, 101]
Expected counts: [100, 100, 100, 100, 100, 100]
Chi-Squared Statistic: 0.56, p-value: 0.989
Conclusion: The p-value is high. We do not have evidence to reject the null hypothesis. The die appears to be fair.
Analyzing 600 rolls of a simulated LOADED die (6 is twice as likely)...
Observed counts: [85, 89, 81, 88, 82, 175]
Expected counts: [100, 100, 100, 100, 100, 100]
Chi-Squared Statistic: 82.1, p-value: 1.5e-16
Conclusion: The p-value is very small. We reject the null hypothesis. The die is likely loaded.
Implementation Hints:
- Setup:

```
pip install numpy scipy matplotlib
```

- Simulate Rolls:
  - Fair die: np.random.randint(1, 7, size=600)
  - Loaded die: Use np.random.choice([1, 2, 3, 4, 5, 6], size=600, p=[...]) where you provide custom probabilities (e.g., p=[1/7, 1/7, 1/7, 1/7, 1/7, 2/7]).
- Get Observed Counts: Use NumPy’s np.unique(rolls, return_counts=True) to get the counts for each face.
- Get Expected Counts: For 600 rolls of a fair die, the expected count for each face is 600 / 6 = 100.
- Perform the Test: The scipy library makes this easy.

```python
from scipy.stats import chisquare

observed_counts = [...]  # Your counts from step 3
expected_counts = [100, 100, 100, 100, 100, 100]

chi2, p_value = chisquare(f_obs=observed_counts, f_exp=expected_counts)
print(f"Chi-Squared Statistic: {chi2:.2f}, p-value: {p_value:.3e}")

alpha = 0.05  # Significance level
if p_value < alpha:
    print("Conclusion: The p-value is small. We reject the null hypothesis.")
else:
    print("Conclusion: The p-value is high. We fail to reject the null hypothesis.")
```

The Chi-Squared statistic itself is calculated as Σ [ (Observed − Expected)² / Expected ] over all categories. You can calculate this manually to prove to yourself you understand the formula.
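To convince yourself the formula and the library agree, here is a minimal sketch of that manual calculation with hypothetical observed counts (the numbers are made up for illustration):

```python
import numpy as np
from scipy.stats import chi2 as chi2_dist, chisquare

# Hypothetical observed counts for 600 rolls (made up for illustration)
observed = np.array([90, 110, 95, 105, 85, 115])
expected = np.full(6, observed.sum() / 6)  # a fair die puts 100 in each bin

# Chi-Squared statistic: sum over categories of (Observed - Expected)^2 / Expected
chi2_stat = np.sum((observed - expected) ** 2 / expected)

# p-value: chance of a statistic at least this large, with k - 1 degrees of freedom
dof = len(observed) - 1
p_value = chi2_dist.sf(chi2_stat, dof)

print(f"Manual: chi2 = {chi2_stat:.2f}, p-value = {p_value:.3f}")
print(f"SciPy:  chi2 = {chisquare(observed, expected).statistic:.2f}, "
      f"p-value = {chisquare(observed, expected).pvalue:.3f}")
```

Both lines should print the same statistic and p-value.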
Learning milestones:
- You can state a clear Null and Alternative hypothesis for a problem → You understand the foundation of hypothesis testing.
- You can calculate the expected frequencies for a given scenario → You can define what “fairness” or “no effect” looks like.
- Your script produces a small p-value for the loaded die and a large one for the fair die → You understand what the test output signifies.
- You can correctly interpret the p-value in a sentence → You can translate statistical results into a plain English conclusion.
Project 5: Does More Studying Mean Higher Grades?
- File: LEARN_STATISTICS_FROM_SCRATCH.md
- Main Programming Language: Python
- Alternative Programming Languages: R, Excel
- Coolness Level: Level 2: Practical but Forgettable
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 3: Advanced
- Knowledge Area: Correlation / Linear Regression
- Software or Tool: Python, Pandas, Scikit-learn, Matplotlib
- Main Book: The Art of Data Science by Roger D. Peng and Elizabeth Matsui
What you’ll build: A script that analyzes a dataset of students’ hours studied and exam scores. It will calculate the correlation, fit a linear regression model, and create a scatter plot with the regression line overlaid.
Why it teaches statistics: This is the quintessential introduction to predictive modeling. You will learn to quantify the relationship between two variables and build a simple model that can make predictions (e.g., “If a student studies for 10 hours, what is their predicted score?”).
Core challenges you’ll face:
- Understanding correlation vs. causation → maps to realizing that just because two variables move together doesn’t mean one causes the other
- Fitting a line to data → maps to the concept of “least squares,” finding the line that minimizes the total error
- Interpreting the model’s coefficients → maps to understanding the meaning of the slope and intercept
- Evaluating model performance → maps to what R-squared means (the proportion of variance explained by the model)
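To ground the least-squares idea, here is a minimal sketch of the underlying formulas for simple linear regression on a small made-up dataset; Scikit-learn computes the same quantities for you in the project script:

```python
import numpy as np

# Hypothetical data: hours studied vs. exam score (invented for illustration)
x = np.array([2.0, 3.5, 5.0, 6.5, 8.0, 9.5])
y = np.array([58, 66, 71, 80, 88, 95])

x_mean, y_mean = x.mean(), y.mean()

# Least-squares slope and intercept for the line y = m*x + b
m = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
b = y_mean - m * x_mean

# Correlation coefficient; for simple linear regression, R-squared = r**2
r = np.corrcoef(x, y)[0, 1]
r_squared = r ** 2

print(f"Score = {m:.2f} * Hours + {b:.2f}")
print(f"Correlation r = {r:.2f}, R-squared = {r_squared:.2f}")
```

The slope is the covariance of x and y divided by the variance of x, which is exactly what minimizing the total squared error works out to.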
Key Concepts:
- Correlation: “Introductory Statistics” Ch. 12 - OpenStax
- Linear Regression: “Introductory Statistics” Ch. 12
- R-squared: StatQuest: R-squared explained
Difficulty: Advanced Time estimate: 1-2 weeks Prerequisites: All previous projects, a good grasp of high school algebra (the equation of a line, y = mx + b).
Real world outcome: Your script will analyze a dataset and produce a scatter plot. The plot will show a cloud of points (each point is a student) and a straight line running through them. You’ll also get a statistical summary:
Correlation between Hours Studied and Exam Score: 0.89 (Strong positive correlation)
Linear Regression Model:
Score = 5.5 * (Hours Studied) + 45.2
- Intercept (b): 45.2 (Predicted score for 0 hours studied)
- Slope (m): 5.5 (Each additional hour of study is associated with a 5.5 point increase in score)
- R-squared: 0.79 (The model explains 79% of the variance in exam scores)
Prediction for a student who studies 8 hours: 89.2
Implementation Hints:
- Data: Create a simple CSV file scores.csv or find one online:

```
hours_studied,exam_score
2,65
3.5,72
...
```

- Setup:

```
pip install pandas scikit-learn matplotlib
```

- Analysis Script:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt

# Load and prepare data
df = pd.read_csv('scores.csv')
X = df[['hours_studied']]  # Features (needs to be 2D)
y = df['exam_score']       # Target

# Calculate correlation
correlation = df['hours_studied'].corr(df['exam_score'])
print(f"Correlation: {correlation:.2f}")

# Build and train the model
model = LinearRegression()
model.fit(X, y)

# Get model parameters
slope = model.coef_[0]
intercept = model.intercept_
r_squared = model.score(X, y)
print(f"Model: Score = {slope:.2f} * Hours + {intercept:.2f}")
print(f"R-squared: {r_squared:.2f}")

# Make a prediction
hours_to_predict = [[8]]  # Needs to be 2D
predicted_score = model.predict(hours_to_predict)
print(f"Predicted score for {hours_to_predict[0][0]} hours: {predicted_score[0]:.2f}")

# Plotting
plt.figure()
plt.scatter(X, y, label='Actual Scores')
plt.plot(X, model.predict(X), color='red', label='Regression Line')
plt.title("Hours Studied vs. Exam Score")
plt.xlabel("Hours Studied")
plt.ylabel("Exam Score")
plt.legend()
plt.savefig("regression_plot.png")
```
Learning milestones:
- You create a scatter plot and visually identify a trend → You can spot potential relationships in data.
- You can calculate and interpret the correlation coefficient → You can quantify the strength and direction of a linear relationship.
- You fit a linear regression model and can explain the meaning of the slope and intercept → You can model a relationship mathematically.
- You can use your model to make a prediction for a new data point → You have built your first predictive model.
Summary
| Project | Difficulty | Main Language | Key Learning |
|---|---|---|---|
| 1. Personal Data Dashboard | Beginner | Python | Descriptive Statistics, Visualization |
| 2. Loot Box Simulator | Beginner | Python | Probability, Monte Carlo Simulation |
| 3. “Dumb” Spam Filter | Intermediate | Python | Bayes’ Theorem, Text Classification |
| 4. Is This Die Loaded? | Intermediate | Python | Hypothesis Testing, Chi-Squared Test |
| 5. Study vs. Grades | Advanced | Python | Correlation, Linear Regression |