

Learn Statistics: From Scratch to Data-Driven Thinking

Goal: Build a strong, intuitive understanding of statistics from the ground up. You will go from basic data summarization to making informed decisions and predictions, using practical, real-world projects.


Why Learn Statistics?

Statistics is the science of learning from data. It’s the foundation of data science, machine learning, and any field where decisions are made under uncertainty. Most developers have a vague notion of it, but few can wield it confidently.

  • Become a Better Thinker: Learn to spot biases, question assumptions, and separate signal from noise in any context.
  • Unlock Data Science: Statistics is the “why” behind the “how” of machine learning algorithms.
  • Build Smarter Products: Use A/B testing and data analysis to build things your users actually want.
  • Win Arguments with Data: Move from “I think” to “I can show you.”

After completing these projects, you will:

  • Intuitively understand core concepts like probability distributions, confidence intervals, and p-values.
  • Be able to clean, analyze, and visualize datasets with professional tools like Python’s Pandas and Matplotlib.
  • Confidently perform hypothesis tests to make data-driven decisions.
  • Build and interpret simple predictive models using linear regression.
  • Never look at a news headline citing a “study” the same way again.

Core Concept Analysis

The Statistics Learning Ladder

Your journey will follow two main paths: Descriptive Statistics (describing what you see) and Inferential Statistics (making guesses about what you can’t see).

┌─────────────────────────────────────────────────────────────┐
│                       6. REGRESSION                         │
│       (Predicting values, e.g., house price from size)      │
└─────────────────────────────────────────────────────────────┘
                                 ▲
                                 │
┌─────────────────────────────────────────────────────────────┐
│                   5. INFERENTIAL STATISTICS                 │
│      (Hypothesis Testing, A/B Tests, p-values, CI)          │
│       (Is this new drug better than the old one?)           │
└─────────────────────────────────────────────────────────────┘
                                 ▲
                                 │
┌─────────────────────────────────────────────────────────────┐
│                 4. PROBABILITY DISTRIBUTIONS                │
│ (Normal, Binomial, Poisson - Modeling random events)        │
│    (What's the range of likely outcomes for coin flips?)    │
└─────────────────────────────────────────────────────────────┘
                                 ▲
                                 │
┌─────────────────────────────────────────────────────────────┐
│                        3. PROBABILITY                       │
│     (The language of uncertainty, Bayes' Theorem)           │
│         (What are the chances of drawing a red card?)       │
└─────────────────────────────────────────────────────────────┘
                                 ▲
                                 │
┌─────────────────────────────────────────────────────────────┐
│                 1 & 2. DESCRIPTIVE STATISTICS               │
│      (Mean, Median, Mode, Variance, Std Dev, Histograms)    │
│            (What does our data look like?)                  │
└─────────────────────────────────────────────────────────────┘

Key Concepts Explained

1. Descriptive Statistics (The “What”)

These tools summarize a dataset into a few key numbers.

  • Measures of Central Tendency: Where is the “center” of the data?
    • Mean: The average. Sum of all values / number of values. Sensitive to outliers.
    • Median: The middle value when sorted. Robust to outliers.
    • Mode: The most frequent value.
  • Measures of Dispersion (Spread): How spread out is the data?
    • Range: Maximum value - Minimum value. Very simple.
    • Variance (σ²): The average of the squared differences from the Mean. Hard to interpret directly.
    • Standard Deviation (σ): The square root of the variance. Easy to interpret as it’s in the original units of the data. A low SD means data is clustered around the mean.
    • Quartiles & IQR: Quartiles divide the sorted data into four equal parts. The Interquartile Range (IQR = Q3 - Q1) spans the middle 50% of the data.
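
For instance, a few lines of pandas (with made-up sleep data) show how a single outlier drags the mean while leaving the median nearly untouched:

    import pandas as pd

    # Nine typical nights of sleep plus one all-nighter (the outlier)
    hours = pd.Series([7.5, 7.0, 8.0, 7.5, 6.5, 7.0, 8.5, 7.5, 7.0, 0.0])

    print(hours.mean())    # 6.65  -- pulled down by the single 0.0
    print(hours.median())  # 7.25  -- barely affected
    print(hours.std())     # sample standard deviation, in the original units (hours)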

2. Probability (The “Maybe”)

Probability is a number between 0 (impossible) and 1 (certain) that represents the likelihood of an event.

  • Key idea: If we repeat an experiment many times, the proportion of times an event occurs will approach its probability.
  • Conditional Probability: The probability of event A happening, given that event B has already happened. Written as P(A|B).
  • Bayes’ Theorem: A revolutionary idea that lets you update your beliefs in light of new evidence. It connects P(A|B) with P(B|A). It’s the engine behind medical diagnoses and spam filters.
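
As a toy worked example (every number below is made up for illustration): suppose 1% of messages are spam, and the word “prize” appears in 20% of spam messages but only 1% of legitimate ones. Bayes’ Theorem tells you how much seeing “prize” should shift your belief:

    # Toy Bayes' Theorem example -- all rates here are assumptions for illustration
    p_spam = 0.01                 # P(Spam): prior probability of spam
    p_word_given_spam = 0.20      # P("prize" | Spam)
    p_word_given_ham = 0.01       # P("prize" | Ham)

    # Total probability of seeing the word at all: P(Word)
    p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)

    # Bayes' Theorem: P(Spam | Word) = P(Word | Spam) * P(Spam) / P(Word)
    p_spam_given_word = p_word_given_spam * p_spam / p_word
    print(f"P(Spam | 'prize') = {p_spam_given_word:.2f}")  # roughly 0.17

Even a word 20 times more common in spam only pushes the probability to about 17%, because spam itself is rare in this example. That base-rate effect is why the spam filter in Project 3 combines evidence from many words.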

3. Probability Distributions (The “Shape”)

A distribution describes the probabilities of all possible outcomes.

  • Normal Distribution (The “Bell Curve”): Describes many natural phenomena (heights, blood pressure). Defined by its mean and standard deviation.
  • Binomial Distribution: Describes the number of successes in a fixed number of independent trials (e.g., number of heads in 10 coin flips).
  • Poisson Distribution: Describes the number of events in a fixed interval of time or space, if these events happen at a known average rate (e.g., number of customers arriving at a store in an hour).
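
To get a feel for these shapes, you can sample from each with NumPy and plot histograms (the parameters below are just illustrative choices):

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(seed=42)

    samples = {
        "Normal (heights, cm)": rng.normal(loc=170, scale=10, size=10_000),
        "Binomial (heads in 10 flips)": rng.binomial(n=10, p=0.5, size=10_000),
        "Poisson (arrivals per hour)": rng.poisson(lam=4, size=10_000),
    }

    # One histogram per distribution, side by side
    fig, axes = plt.subplots(1, 3, figsize=(12, 3))
    for ax, (title, data) in zip(axes, samples.items()):
        ax.hist(data, bins=30, edgecolor='black')
        ax.set_title(title, fontsize=9)
    plt.tight_layout()
    plt.savefig("distributions.png")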

4. Inferential Statistics (The “Guess”)

This is where we use data from a small sample to make an educated guess (an inference) about a large population.

  • Central Limit Theorem (CLT): The most important idea in statistics. It states that if you take many large enough samples from any population, the distribution of the sample means will be approximately normal. This is what allows us to make inferences (a quick demo follows this list).
  • Confidence Interval (CI): An estimated range of values which is likely to include the true population parameter (e.g., “We are 95% confident that the true average height of all men is between 5’9” and 5’11””).
  • Hypothesis Testing: A formal procedure for checking if your data supports a certain hypothesis.
    • Null Hypothesis (H₀): The default assumption, usually stating “no effect” or “no difference” (e.g., “This new drug has no effect”).
    • Alternative Hypothesis (H₁): The claim you want to prove (e.g., “This new drug reduces recovery time”).
    • p-value: The probability of observing your data (or something more extreme), if the null hypothesis were true. A small p-value (typically < 0.05) suggests that your observation is surprising under the null hypothesis, providing evidence against it.
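
A minimal NumPy sketch of the CLT in action: individual die rolls are uniformly distributed, yet the means of many 50-roll samples pile up into a bell shape:

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng()

    # 5,000 samples of 50 die rolls each; record each sample's mean
    sample_means = [rng.integers(1, 7, size=50).mean() for _ in range(5_000)]

    plt.hist(sample_means, bins=30, edgecolor='black')
    plt.title("Sample means of 50 die rolls (approximately normal, per the CLT)")
    plt.savefig("clt_demo.png")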

Project List

These projects are designed to be done in order, building your skills from the ground up using Python, a powerful and beginner-friendly tool for statistics.


Project 1: Personal Data Dashboard

  • File: LEARN_STATISTICS_FROM_SCRATCH.md
  • Main Programming Language: Python
  • Alternative Programming Languages: R, Google Sheets
  • Coolness Level: Level 2: Practical but Forgettable
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 1: Beginner
  • Knowledge Area: Descriptive Statistics / Data Visualization
  • Software or Tool: Python, Pandas, Matplotlib
  • Main Book: Think Stats, 2nd Edition by Allen B. Downey

What you’ll build: A simple script that ingests data you’ve collected about yourself (e.g., hours slept, cups of coffee, pages read per day) and calculates key descriptive statistics, then generates a histogram and a box plot.

Why it teaches statistics: This project makes abstract concepts like mean, median, standard deviation, and quartiles tangible because they describe your own life. You’ll see firsthand how an outlier (a sleepless night) affects the mean but not the median.

Core challenges you’ll face:

  • Collecting and formatting data → maps to the basics of data entry in a CSV file
  • Calculating mean, median, and mode → maps to understanding central tendency
  • Calculating variance and standard deviation → maps to quantifying the “spread” or “consistency” of your habits
  • Creating a histogram and box plot → maps to visualizing the shape and distribution of your data

Key Concepts:

Difficulty: Beginner
Time estimate: Weekend
Prerequisites: Basic Python (variables, lists, functions).

Real world outcome: You’ll run a Python script and it will output something like this to your console and generate a plot:

Analysis of 'Hours Slept' (30 days):
- Mean: 7.2 hours
- Median: 7.5 hours
- Standard Deviation: 1.1 hours
- Min: 4.0 hours, Max: 9.0 hours

You will also see a histogram showing the distribution of your sleep hours, immediately telling you your most common sleep duration.

Implementation Hints:

  1. Data Collection: Record one or more daily metrics in a plain CSV file named my_data.csv (the sample output above assumes about 30 days of entries):
    date,hours_slept,coffees
    2025-12-01,7.5,2
    2025-12-02,6.0,3
    2025-12-03,8.0,1
    ...
    
  2. Setup: Install the necessary Python libraries: pip install pandas matplotlib
  3. Python Script:
    import pandas as pd
    import matplotlib.pyplot as plt
    
    # Read the data
    df = pd.read_csv('my_data.csv')
    
    # Select the column to analyze
    sleep_data = df['hours_slept']
    
    # Calculate descriptive statistics
    mean_sleep = sleep_data.mean()
    median_sleep = sleep_data.median()
    std_dev_sleep = sleep_data.std()
    
    print("Analysis of 'Hours Slept':")
    print(f"- Mean: {mean_sleep:.2f} hours")
    # ... print other stats
    
    # Create a histogram
    plt.figure() # Create a new figure
    plt.hist(sleep_data, bins=5, edgecolor='black')
    plt.title("Distribution of Hours Slept")
    plt.xlabel("Hours")
    plt.ylabel("Frequency")
    plt.savefig("sleep_histogram.png") # Save the plot to a file
    print("Histogram saved to sleep_histogram.png")
    

    Questions to guide you:

    • Why might the mean and median be different? What does that tell you?
    • If your standard deviation is high, what does that say about your sleep schedule’s consistency?

Learning milestones:

  1. You successfully load a CSV into a Pandas DataFrame → You’ve learned the basic unit of data analysis in Python.
  2. You calculate the mean, median, and standard deviation → You can summarize any dataset.
  3. You generate a histogram → You can visualize the shape and frequency of your data.
  4. You can explain what the standard deviation means in the context of your own data → You’ve built intuition, not just memorized a formula.

Project 2: Is This Game Rigged? A Loot Box Simulator

  • File: LEARN_STATISTICS_FROM_SCRATCH.md
  • Main Programming Language: Python
  • Alternative Programming Languages: JavaScript, C#
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 1: Beginner
  • Knowledge Area: Probability / Monte Carlo Simulation
  • Software or Tool: Python, Numpy
  • Main Book: Think Bayes, 2nd Edition by Allen B. Downey

What you’ll build: A simulator that models opening “loot boxes” from a video game. You’ll run thousands of simulated trials to estimate the real-world cost of obtaining a rare item and visualize the distribution of outcomes.

Why it teaches statistics: This project makes probability concrete. Instead of calculating complex formulas, you’ll discover probabilities through experimentation (a “Monte Carlo method”). You’ll build a deep, intuitive understanding of expected value, probability distributions, and the law of large numbers.

Core challenges you’ll face:

  • Modeling a probabilistic event → maps to using a random number generator to simulate a loot box opening
  • Running thousands of simulations → maps to using loops to repeat an experiment many times
  • Calculating the average outcome (Expected Value) → maps to understanding that the average of many random trials converges on a predictable value
  • Visualizing the distribution of costs → maps to seeing that while the average cost might be $100, many people will pay much more

Key Concepts:

  • Probability: “Think Stats” Ch. 4
  • Monte Carlo Method: A Gentle Introduction to Monte Carlo Simulation
  • Expected Value: Khan Academy - Expected Value
  • The Law of Large Numbers: As you run more trials, the average result gets closer to the expected value.

Difficulty: Beginner
Time estimate: Weekend
Prerequisites: Project 1, comfort with loops and functions in Python.

Real world outcome: You’ll define the drop rates for items in a loot box (e.g., Legendary: 1%, Epic: 10%, Common: 89%). Your script will then simulate buying boxes until you get a Legendary item, and repeat this 10,000 times. It will output:

Simulating 10,000 players trying to get a Legendary item...
(Loot box cost: $1)

- Average cost: $100.23
- Cheapest attempt: $1
- Most expensive attempt: $750
- 95% of players got it for less than: $300

Distribution of costs has been saved to lootbox_costs.png

The generated histogram will visually prove that a “1% chance” doesn’t mean you’re guaranteed to get it in 100 tries.

Implementation Hints:

  1. Setup: pip install numpy matplotlib
  2. Define Probabilities:
    # Item: probability
    DROP_RATES = {
        'Legendary': 0.01,
        'Epic': 0.10,
        'Common': 0.89
    }
    LOOT_BOX_COST = 1
    
  3. Simulate One “Player”: Write a function simulate_one_player() that:
    • Initializes cost = 0.
    • Enters a while True loop.
    • In the loop, cost += LOOT_BOX_COST.
    • Generate a random number between 0 and 1: roll = np.random.rand().
    • Check if you got the legendary: if roll < DROP_RATES['Legendary']: break.
    • The function returns the final cost.
  4. Run Many Simulations:
    • Create an empty list all_costs = [].
    • Loop 10,000 times, calling simulate_one_player() in each iteration and appending the result to all_costs.
  5. Analyze and Visualize:
    • Convert all_costs to a NumPy array for easy calculations: costs_arr = np.array(all_costs).
    • Calculate the mean, max, and percentiles: np.mean(costs_arr), np.max(costs_arr), np.percentile(costs_arr, 95).
    • Create a histogram of costs_arr using Matplotlib (a complete sketch of steps 2-5 follows this list).
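
A minimal sketch that ties steps 2-5 together (assuming the DROP_RATES and LOOT_BOX_COST values from step 2):

    import numpy as np
    import matplotlib.pyplot as plt

    DROP_RATES = {'Legendary': 0.01, 'Epic': 0.10, 'Common': 0.89}
    LOOT_BOX_COST = 1

    def simulate_one_player():
        """Buy loot boxes until a Legendary drops; return the total amount spent."""
        cost = 0
        while True:
            cost += LOOT_BOX_COST
            if np.random.rand() < DROP_RATES['Legendary']:
                return cost

    # Repeat the experiment for 10,000 simulated players
    costs_arr = np.array([simulate_one_player() for _ in range(10_000)])

    print(f"Average cost: ${np.mean(costs_arr):.2f}")   # should land near 1 / 0.01 = $100
    print(f"Most expensive attempt: ${np.max(costs_arr)}")
    print(f"95% of players paid less than: ${np.percentile(costs_arr, 95):.0f}")

    plt.hist(costs_arr, bins=50, edgecolor='black')
    plt.title("Cost to obtain a Legendary item (10,000 simulated players)")
    plt.xlabel("Total cost ($)")
    plt.ylabel("Number of players")
    plt.savefig("lootbox_costs.png")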

Learning milestones:

  1. Your script can simulate a single probabilistic event → You understand how to model chance.
  2. You can simulate one player’s entire journey to getting the item → You can model a sequence of random events.
  3. The average cost from your simulation is close to the theoretical expected value (1 / 0.01 = 100) → You’ve witnessed the Law of Large Numbers in action.
  4. You can look at the histogram and explain why some players are “unlucky” and pay much more than the average → You understand the concept of a probability distribution.

Project 3: A “Dumb” Spam Filter

  • File: LEARN_STATISTICS_FROM_SCRATCH.md
  • Main Programming Language: Python
  • Alternative Programming Languages: N/A
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 2. The “Micro-SaaS / Pro Tool”
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: Conditional Probability / Bayes’ Theorem
  • Software or Tool: Python, Pandas, Scikit-learn
  • Main Book: Data Science from Scratch, 2nd Edition by Joel Grus

What you’ll build: A simple spam filter for SMS messages using the Naive Bayes algorithm. You’ll train it on a dataset of real messages labeled as “spam” or “ham” (not spam).

Why it teaches statistics: This is a classic, tangible application of Bayes’ Theorem. You will learn how to “update your beliefs” (the probability of a message being spam) based on the evidence (the words in the message). It’s a bridge from pure statistics to machine learning.

Core challenges you’ll face:

  • Understanding Bayes’ Theorem intuitively → maps to P(Spam | Word) = P(Word | Spam) * P(Spam) / P(Word)
  • Calculating word probabilities → maps to counting word frequencies in spam vs. ham messages
  • Handling words not seen during training → maps to Laplace smoothing (adding 1 to all counts)
  • Combining probabilities for multiple words → maps to the “naive” assumption of independence (and why we use log probabilities)

Key Concepts:

Difficulty: Intermediate
Time estimate: 1-2 weeks
Prerequisites: Projects 1 & 2, plus an understanding of Python dictionaries.

Real world outcome: You will train your classifier. Then, you can give it a new, unseen message, and it will output a prediction:

New message: "Claim your free prize now! Click here."
Prediction: Spam (Probability: 98.7%)

New message: "Hey, are we still on for lunch tomorrow?"
Prediction: Ham (Probability: 99.9%)

Implementation Hints:

  1. Get Data: Find a dataset. The “SMS Spam Collection Data Set” from the UCI Machine Learning Repository is perfect for this. It’s a single file with two columns: label (ham/spam) and the message text.
  2. Setup: pip install pandas scikit-learn
  3. High-level plan (using Scikit-learn’s tools; a sketch follows this list):
    • Load the data with Pandas.
    • Split the data into a training set and a testing set (train_test_split).
    • Create a data processing “pipeline”:
      • Step 1: Vectorizer (CountVectorizer): This tool converts your text messages into numerical data by counting the occurrences of each word.
      • Step 2: Classifier (MultinomialNB): This is the Naive Bayes algorithm.
    • Train the pipeline on your training data (pipeline.fit()).
    • Test its accuracy on your unseen test data (pipeline.score()).
  4. The “From Scratch” Intuition (what Scikit-learn is doing for you):
    • Calculate Prior Probabilities: What’s the overall probability of any given message being spam? P(Spam) = (Number of spam messages) / (Total number of messages).
    • Calculate Word Probabilities (Likelihoods):
      • For each word in your vocabulary, calculate P(Word | Spam) (How often does this word appear in spam messages?) and P(Word | Ham).
      • This involves counting total words in spam vs. ham and applying Laplace smoothing.
    • Classify a New Message:
      • For a new message, you want to calculate P(Spam | Message).
      • Using Bayes’ rule, this is proportional to P(Message | Spam) * P(Spam).
      • The “naive” part is assuming words are independent: P(Message | Spam) ≈ P(word1 | Spam) * P(word2 | Spam) * ...
      • To avoid numbers getting too small (“underflow”), you add the log probabilities instead of multiplying.
      • Compare the final “score” for spam vs. ham and pick the higher one.
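
A minimal sketch of the Scikit-learn route from step 3 (the filename sms_spam.csv and the column names label/message are assumptions; adjust them to match however you saved the UCI dataset):

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    # Assumes a CSV with columns 'label' ("ham"/"spam") and 'message'
    df = pd.read_csv('sms_spam.csv')

    X_train, X_test, y_train, y_test = train_test_split(
        df['message'], df['label'], test_size=0.2, random_state=42)

    # Step 1: turn messages into word counts; Step 2: Naive Bayes on those counts
    pipeline = make_pipeline(CountVectorizer(), MultinomialNB())
    pipeline.fit(X_train, y_train)

    print(f"Accuracy on unseen messages: {pipeline.score(X_test, y_test):.3f}")

    # Try it on brand-new messages
    new_messages = ["Claim your free prize now! Click here.",
                    "Hey, are we still on for lunch tomorrow?"]
    print(pipeline.predict(new_messages))        # e.g. ['spam' 'ham']
    print(pipeline.predict_proba(new_messages))  # per-class probabilities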

Learning milestones:

  1. You can calculate the prior probability of spam from the dataset → You understand base rates.
  2. You can calculate P("free" | Spam) and P("free" | Ham) → You understand likelihoods and how they form evidence.
  3. Your model correctly classifies most messages in the test set → You’ve successfully built a predictive model.
  4. You can explain why the model classified a message as spam by looking at the word probabilities → You understand how the algorithm “thinks”.

Project 4: Is This Die Loaded?

  • File: LEARN_STATISTICS_FROM_SCRATCH.md
  • Main Programming Language: Python
  • Alternative Programming Languages: R
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: Inferential Statistics / Hypothesis Testing
  • Software or Tool: Python, SciPy
  • Main Book: Introductory Statistics (OpenStax, free online textbook)

What you’ll build: A script that uses the Chi-Squared Goodness of Fit test to determine if a series of dice rolls deviates significantly from what you’d expect from a fair die.

Why it teaches statistics: This is a perfect introduction to hypothesis testing. You will formalize a question (“Is this die fair?”), state your null and alternative hypotheses, and use a statistical test to get a p-value that helps you make a conclusion.

Core challenges you’ll face:

  • Formulating a null hypothesis → maps to stating the default assumption: “The die is fair”
  • Calculating expected frequencies → maps to understanding that for a fair die, each side should appear N/6 times in N rolls
  • Understanding the Chi-Squared statistic → maps to quantifying the total difference between observed and expected counts
  • Interpreting the p-value → maps to answering: “If the die was fair, how likely is it we’d see a deviation this large or larger?”

Key Concepts:

  • Hypothesis Testing: “Introductory Statistics” Ch. 9 - OpenStax
  • Chi-Squared Goodness of Fit Test: “Introductory Statistics” Ch. 11
  • p-value: A Gentle Introduction to p-values

Difficulty: Intermediate
Time estimate: Weekend
Prerequisites: Projects 1 & 2.

Real world outcome: You will simulate rolling a die 600 times. First, a fair die, then a loaded die. Your script will analyze the results and produce a clear conclusion.

Analyzing 600 rolls of a simulated FAIR die...
Observed counts: [98, 105, 95, 102, 99, 101]
Expected counts: [100, 100, 100, 100, 100, 100]
Chi-Squared Statistic: 0.60, p-value: 0.988
Conclusion: The p-value is high. We do not have evidence to reject the null hypothesis. The die appears to be fair.

Analyzing 600 rolls of a simulated LOADED die (6 is twice as likely)...
Observed counts: [85, 89, 81, 88, 82, 175]
Expected counts: [100, 100, 100, 100, 100, 100]
Chi-Squared Statistic: 68.00, p-value: 2.7e-13
Conclusion: The p-value is very small. We reject the null hypothesis. The die is likely loaded.

Implementation Hints:

  1. Setup: pip install numpy scipy matplotlib
  2. Simulate Rolls:
    • Fair die: np.random.randint(1, 7, size=600)
    • Loaded die: Use np.random.choice([1, 2, 3, 4, 5, 6], size=600, p=[...]) where you provide custom probabilities (e.g., p=[1/7, 1/7, 1/7, 1/7, 1/7, 2/7]).
  3. Get Observed Counts: Use NumPy’s np.unique(rolls, return_counts=True) to get the counts for each face.
  4. Get Expected Counts: For 600 rolls of a fair die, the expected count for each face is 600 / 6 = 100.
  5. Perform the Test: The scipy library makes this easy.
    from scipy.stats import chisquare
    
    observed_counts = [...] # Your counts from step 3
    expected_counts = [100, 100, 100, 100, 100, 100]
    
    chi2, p_value = chisquare(f_obs=observed_counts, f_exp=expected_counts)
    
    print(f"Chi-Squared Statistic: {chi2:.2f}, p-value: {p_value:.3e}")
    
    alpha = 0.05 # Significance level
    if p_value < alpha:
        print("Conclusion: The p-value is small. We reject the null hypothesis.")
    else:
        print("Conclusion: The p-value is high. We fail to reject the null hypothesis.")
    

    The Chi-Squared statistic itself is calculated as Σ [ (Observed - Expected)² / Expected ] for all categories. You can calculate this manually to prove to yourself you understand the formula.
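
    A quick manual check (reusing observed_counts and expected_counts from the snippet above) should reproduce SciPy's statistic:

    import numpy as np

    observed = np.array(observed_counts)
    expected = np.array(expected_counts)

    # Sum of (Observed - Expected)^2 / Expected over all six faces
    chi2_manual = np.sum((observed - expected) ** 2 / expected)
    print(f"Manual Chi-Squared: {chi2_manual:.2f}")  # should match scipy's chi2 value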

Learning milestones:

  1. You can state a clear Null and Alternative hypothesis for a problem → You understand the foundation of hypothesis testing.
  2. You can calculate the expected frequencies for a given scenario → You can define what “fairness” or “no effect” looks like.
  3. Your script produces a small p-value for the loaded die and a large one for the fair die → You understand what the test output signifies.
  4. You can correctly interpret the p-value in a sentence → You can translate statistical results into a plain English conclusion.

Project 5: Does More Studying Mean Higher Grades?

  • File: LEARN_STATISTICS_FROM_SCRATCH.md
  • Main Programming Language: Python
  • Alternative Programming Languages: R, Excel
  • Coolness Level: Level 2: Practical but Forgettable
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Correlation / Linear Regression
  • Software or Tool: Python, Pandas, Scikit-learn, Matplotlib
  • Main Book: The Art of Data Science by Roger D. Peng and Elizabeth Matsui

What you’ll build: A script that analyzes a dataset of students’ hours studied and exam scores. It will calculate the correlation, fit a linear regression model, and create a scatter plot with the regression line overlaid.

Why it teaches statistics: This is the quintessential introduction to predictive modeling. You will learn to quantify the relationship between two variables and build a simple model that can make predictions (e.g., “If a student studies for 10 hours, what is their predicted score?”).

Core challenges you’ll face:

  • Understanding correlation vs. causation → maps to realizing that just because two variables move together doesn’t mean one causes the other
  • Fitting a line to data → maps to the concept of “least squares,” finding the line that minimizes the total error (a sketch follows this list)
  • Interpreting the model’s coefficients → maps to understanding the meaning of the slope and intercept
  • Evaluating model performance → maps to what R-squared means (the proportion of variance explained by the model)
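
For intuition about what “least squares” actually computes, the slope, intercept, and R-squared have closed forms you can check with a few lines of NumPy (the numbers below are made up for illustration):

    import numpy as np

    # Hypothetical data: hours studied (x) and exam scores (y)
    x = np.array([2.0, 3.5, 5.0, 6.5, 8.0])
    y = np.array([65.0, 72.0, 78.0, 85.0, 89.0])

    # Least-squares line: slope = covariance(x, y) / variance(x), intercept = mean(y) - slope * mean(x)
    slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    intercept = y.mean() - slope * x.mean()

    # R-squared = 1 - (residual sum of squares / total sum of squares)
    residuals = y - (slope * x + intercept)
    r_squared = 1 - np.sum(residuals ** 2) / np.sum((y - y.mean()) ** 2)

    print(f"Score = {slope:.2f} * Hours + {intercept:.2f},  R-squared = {r_squared:.2f}")

Scikit-learn’s LinearRegression (used in the script below) fits exactly this line; the closed form just makes the “minimize total squared error” idea concrete.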

Key Concepts:

  • Correlation: “Introductory Statistics” Ch. 12 - OpenStax
  • Linear Regression: “Introductory Statistics” Ch. 12
  • R-squared: StatQuest: R-squared explained

Difficulty: Advanced
Time estimate: 1-2 weeks
Prerequisites: All previous projects and a good grasp of high school algebra (the equation of a line, y = mx + b).

Real world outcome: Your script will analyze a dataset and produce a scatter plot. The plot will show a cloud of points (each point is a student) and a straight line running through them. You’ll also get a statistical summary:

Correlation between Hours Studied and Exam Score: 0.89 (Strong positive correlation)

Linear Regression Model:
Score = 5.5 * (Hours Studied) + 45.2

- Intercept (b): 45.2 (Predicted score for 0 hours studied)
- Slope (m): 5.5 (Each additional hour of study is associated with a 5.5 point increase in score)
- R-squared: 0.79 (The model explains 79% of the variance in exam scores)

Prediction for a student who studies 8 hours: 89.2

Implementation Hints:

  1. Data: Create a simple CSV file scores.csv or find one online.
    hours_studied,exam_score
    2,65
    3.5,72
    ...
    
  2. Setup: pip install pandas scikit-learn matplotlib
  3. Analysis Script:
    import pandas as pd
    from sklearn.linear_model import LinearRegression
    import matplotlib.pyplot as plt
    
    # Load and prepare data
    df = pd.read_csv('scores.csv')
    X = df[['hours_studied']] # Features (needs to be 2D)
    y = df['exam_score']      # Target
    
    # Calculate correlation
    correlation = df['hours_studied'].corr(df['exam_score'])
    print(f"Correlation: {correlation:.2f}")
    
    # Build and train the model
    model = LinearRegression()
    model.fit(X, y)
    
    # Get model parameters
    slope = model.coef_[0]
    intercept = model.intercept_
    r_squared = model.score(X, y)
    print(f"Model: Score = {slope:.2f} * Hours + {intercept:.2f}")
    print(f"R-squared: {r_squared:.2f}")
    
    # Make a prediction
    hours_to_predict = [[8]] # Needs to be 2D
    predicted_score = model.predict(hours_to_predict)
    print(f"Predicted score for {hours_to_predict[0][0]} hours: {predicted_score[0]:.2f}")
    
    # Plotting
    plt.figure()
    plt.scatter(X, y, label='Actual Scores')
    plt.plot(X, model.predict(X), color='red', label='Regression Line')
    plt.title("Hours Studied vs. Exam Score")
    plt.xlabel("Hours Studied")
    plt.ylabel("Exam Score")
    plt.legend()
    plt.savefig("regression_plot.png")
    

Learning milestones:

  1. You create a scatter plot and visually identify a trend → You can spot potential relationships in data.
  2. You can calculate and interpret the correlation coefficient → You can quantify the strength and direction of a linear relationship.
  3. You fit a linear regression model and can explain the meaning of the slope and intercept → You can model a relationship mathematically.
  4. You can use your model to make a prediction for a new data point → You have built your first predictive model.

Summary

Project                       Difficulty     Main Language   Key Learning
1. Personal Data Dashboard    Beginner       Python          Descriptive Statistics, Visualization
2. Loot Box Simulator         Beginner       Python          Probability, Monte Carlo Simulation
3. “Dumb” Spam Filter         Intermediate   Python          Bayes’ Theorem, Text Classification
4. Is This Die Loaded?        Intermediate   Python          Hypothesis Testing, Chi-Squared Test
5. Study vs. Grades           Advanced       Python          Correlation, Linear Regression