Learn Statistics: From Scratch to Data-Driven Thinking
Goal: Build a strong, intuitive understanding of statistics from the ground up. You will go from basic data summarization to making informed decisions and predictions, using practical, real-world projects.
Why Learn Statistics?
Statistics is the science of learning from data. It's the foundation of data science, machine learning, and any field where decisions are made under uncertainty. Most developers have a vague notion of it, but few can wield it confidently.
- Become a Better Thinker: Learn to spot biases, question assumptions, and separate signal from noise in any context.
- Unlock Data Science: Statistics is the "why" behind the "how" of machine learning algorithms.
- Build Smarter Products: Use A/B testing and data analysis to build things your users actually want.
- Win Arguments with Data: Move from "I think" to "I can show you."
After completing these projects, you will:
- Intuitively understand core concepts like probability distributions, confidence intervals, and p-values.
- Be able to clean, analyze, and visualize datasets with professional tools like Python's Pandas and Matplotlib.
- Confidently perform hypothesis tests to make data-driven decisions.
- Build and interpret simple predictive models using linear regression.
- Never look at a news headline citing a "study" the same way again.
Core Concept Analysis
The Statistics Learning Ladder
Your journey will follow two main paths: Descriptive Statistics (describing what you see) and Inferential Statistics (making guesses about what you can't see).
┌────────────────────────────────────────────────────────────┐
│ 6. REGRESSION                                              │
│    (Predicting values, e.g., house price from size)        │
└────────────────────────────────────────────────────────────┘
                              ▲
                              │
┌────────────────────────────────────────────────────────────┐
│ 5. INFERENTIAL STATISTICS                                  │
│    (Hypothesis Testing, A/B Tests, p-values, CI)           │
│    (Is this new drug better than the old one?)             │
└────────────────────────────────────────────────────────────┘
                              ▲
                              │
┌────────────────────────────────────────────────────────────┐
│ 4. PROBABILITY DISTRIBUTIONS                               │
│    (Normal, Binomial, Poisson - Modeling random events)    │
│    (What's the range of likely outcomes for coin flips?)   │
└────────────────────────────────────────────────────────────┘
                              ▲
                              │
┌────────────────────────────────────────────────────────────┐
│ 3. PROBABILITY                                             │
│    (The language of uncertainty, Bayes' Theorem)           │
│    (What are the chances of drawing a red card?)           │
└────────────────────────────────────────────────────────────┘
                              ▲
                              │
┌────────────────────────────────────────────────────────────┐
│ 1 & 2. DESCRIPTIVE STATISTICS                              │
│    (Mean, Median, Mode, Variance, Std Dev, Histograms)     │
│    (What does our data look like?)                         │
└────────────────────────────────────────────────────────────┘
Key Concepts Explained
1. Descriptive Statistics (The "What")
These tools summarize a dataset into a few key numbers.
- Measures of Central Tendency: Where is the "center" of the data?
  - Mean: The average: the sum of all values divided by the number of values. Sensitive to outliers.
  - Median: The middle value when sorted. Robust to outliers.
  - Mode: The most frequent value.
- Measures of Dispersion (Spread): How spread out is the data?
  - Range: Maximum value minus minimum value. Very simple, but driven entirely by the two most extreme points.
  - Variance (σ²): The average of the squared differences from the mean. Hard to interpret directly.
  - Standard Deviation (σ): The square root of the variance. Easy to interpret because it is in the original units of the data. A low SD means the data is clustered around the mean.
  - Quartiles & IQR: Quartiles divide the sorted data into four equal parts. The Interquartile Range (IQR = Q3 - Q1) spans the middle 50% of the data.
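To see these definitions move, here is a minimal sketch in plain Python (the sleep-hours numbers are made up) showing how one outlier drags the mean while the median stays put:

```python
import statistics

sleep_hours = [7.0, 7.5, 6.5, 8.0, 7.0, 7.5, 7.0]

print(statistics.mean(sleep_hours))    # ~7.21
print(statistics.median(sleep_hours))  # 7.0
print(statistics.stdev(sleep_hours))   # ~0.49 (sample standard deviation)

# One sleepless night (an outlier) drags the mean but not the median:
sleep_hours.append(2.0)
print(statistics.mean(sleep_hours))    # ~6.56 -- pulled down by the outlier
print(statistics.median(sleep_hours))  # still 7.0 -- robust to the outlier
```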
2. Probability (The "Maybe")
Probability is a number between 0 (impossible) and 1 (certain) that represents the likelihood of an event.
- Key idea: If we repeat an experiment many times, the proportion of times an event occurs will approach its probability.
- Conditional Probability: The probability of event A happening, given that event B has already happened. Written as P(A|B).
- Bayes' Theorem: A revolutionary idea that lets you update your beliefs in light of new evidence. It connects P(A|B) with P(B|A). It's the engine behind medical diagnoses and spam filters.
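To make Bayes' Theorem concrete, here is a small worked example; the disease and test rates are invented for illustration, only the formula is the theorem's:

```python
# A disease affects 1% of people; the test detects 99% of true cases but
# also wrongly flags 5% of healthy people. Given a positive test, how
# likely is the disease?
p_disease = 0.01
p_pos_given_disease = 0.99
p_pos_given_healthy = 0.05

# Law of total probability: P(Pos) across both groups
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# Bayes' Theorem: P(Disease | Pos) = P(Pos | Disease) * P(Disease) / P(Pos)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(f"P(Disease | Positive test) = {p_disease_given_pos:.3f}")  # ~0.167
```

Even with a 99%-accurate test, a positive result only means about a 17% chance of disease, because the disease itself is rare. Updating on evidence without forgetting the base rate is the whole point of the theorem.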
3. Probability Distributions (The "Shape")
A distribution describes the probabilities of all possible outcomes.
- Normal Distribution (The "Bell Curve"): Describes many natural phenomena (heights, blood pressure). Defined by its mean and standard deviation.
- Binomial Distribution: Describes the number of successes in a fixed number of independent trials (e.g., number of heads in 10 coin flips).
- Poisson Distribution: Describes the number of events in a fixed interval of time or space, if these events happen at a known average rate (e.g., number of customers arriving at a store in an hour).
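A quick sketch of sampling from all three distributions with NumPy; the parameters are illustrative, and the printed statistics should land near their theoretical values:

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # seeded for reproducibility

# Normal: e.g., heights with mean 170 cm and standard deviation 10 cm
heights = rng.normal(loc=170, scale=10, size=10_000)

# Binomial: number of heads in 10 fair coin flips, repeated 10,000 times
heads = rng.binomial(n=10, p=0.5, size=10_000)

# Poisson: customers per hour when the average rate is 4 per hour
arrivals = rng.poisson(lam=4, size=10_000)

print(heights.mean(), heights.std())    # close to 170 and 10
print(heads.mean())                     # close to n * p = 5
print(arrivals.mean(), arrivals.var())  # both close to 4, a Poisson signature
```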
4. Inferential Statistics (The โGuessโ)
This is where we use data from a small sample to make an educated guess (an inference) about a large population.
- Central Limit Theorem (CLT): The most important idea in statistics. It states that if you take many large enough samples from any population, the distribution of the sample means will be approximately normal. This is what allows us to make inferences.
- Confidence Interval (CI): An estimated range of values which is likely to include the true population parameter (e.g., "We are 95% confident that the true average height of all men is between 5'9" and 5'11"").
- Hypothesis Testing: A formal procedure for checking if your data supports a certain hypothesis.
- Null Hypothesis (H₀): The default assumption, usually stating "no effect" or "no difference" (e.g., "This new drug has no effect").
- Alternative Hypothesis (H₁): The claim you want to prove (e.g., "This new drug reduces recovery time").
- p-value: The probability of observing your data (or something more extreme), if the null hypothesis were true. A small p-value (typically < 0.05) suggests that your observation is surprising under the null hypothesis, providing evidence against it.
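A minimal sketch of the CLT with NumPy, assuming a deliberately skewed (exponential) population to show that the population's shape does not matter:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# A decidedly non-normal population: exponential, skewed hard to the right
population = rng.exponential(scale=2.0, size=100_000)

# Take 5,000 samples of size 50 and record each sample's mean
sample_means = np.array(
    [rng.choice(population, size=50).mean() for _ in range(5_000)]
)

# Plot a histogram of sample_means and you will see a bell shape,
# centered on the population mean, despite the skewed population.
print(f"Population mean: {population.mean():.3f}")         # ~2.0
print(f"Mean of sample means: {sample_means.mean():.3f}")  # also ~2.0
print(f"Std of sample means: {sample_means.std():.3f}")    # ~ pop std / sqrt(50)
```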
Project List
These projects are designed to be done in order, building your skills from the ground up using Python, a powerful and beginner-friendly tool for statistics.
Project 1: Personal Data Dashboard
- File: LEARN_STATISTICS_FROM_SCRATCH.md
- Main Programming Language: Python
- Alternative Programming Languages: R, Google Sheets
- Coolness Level: Level 2: Practical but Forgettable
- Business Potential: 1. The "Resume Gold"
- Difficulty: Level 1: Beginner
- Knowledge Area: Descriptive Statistics / Data Visualization
- Software or Tool: Python, Pandas, Matplotlib
- Main Book: Think Stats, 2nd Edition by Allen B. Downey
What you'll build: A simple script that ingests data you've collected about yourself (e.g., hours slept, cups of coffee, pages read per day) and calculates key descriptive statistics, then generates a histogram and a box plot.
Why it teaches statistics: This project makes abstract concepts like mean, median, standard deviation, and quartiles tangible because they describe your own life. You'll see firsthand how an outlier (a sleepless night) affects the mean but not the median.
Core challenges you'll face:
- Collecting and formatting data → maps to the basics of data entry in a CSV file
- Calculating mean, median, and mode → maps to understanding central tendency
- Calculating variance and standard deviation → maps to quantifying the "spread" or "consistency" of your habits
- Creating a histogram and box plot → maps to visualizing the shape and distribution of your data
Key Concepts:
- Descriptive Statistics: "Think Stats" Ch. 2 - Allen B. Downey
- Histograms: "Think Stats" Ch. 3
- DataFrames: Pandas Documentation - 10 Minutes to pandas
- Plotting: Matplotlib Pyplot Tutorial
Difficulty: Beginner. Time estimate: Weekend. Prerequisites: Basic Python (variables, lists, functions).
Real world outcome: You'll run a Python script and it will output something like this to your console and generate a plot:
Analysis of 'Hours Slept' (30 days):
- Mean: 7.2 hours
- Median: 7.5 hours
- Standard Deviation: 1.1 hours
- Min: 4.0 hours, Max: 9.0 hours
You will also see a histogram showing the distribution of your sleep hours, immediately telling you your most common sleep duration.
Implementation Hints:
- Data Collection: Each day, record your metrics in a simple text file named `my_data.csv`:

  ```
  date,hours_slept,coffees
  2025-12-01,7.5,2
  2025-12-02,6.0,3
  2025-12-03,8.0,1
  ...
  ```

- Setup: Install the necessary Python libraries: `pip install pandas matplotlib`
- Python Script:

  ```python
  import pandas as pd
  import matplotlib.pyplot as plt

  # Read the data
  df = pd.read_csv('my_data.csv')

  # Select the column to analyze
  sleep_data = df['hours_slept']

  # Calculate descriptive statistics
  mean_sleep = sleep_data.mean()
  median_sleep = sleep_data.median()
  std_dev_sleep = sleep_data.std()

  print("Analysis of 'Hours Slept':")
  print(f"- Mean: {mean_sleep:.2f} hours")
  # ... print other stats

  # Create a histogram
  plt.figure()  # Create a new figure
  plt.hist(sleep_data, bins=5, edgecolor='black')
  plt.title("Distribution of Hours Slept")
  plt.xlabel("Hours")
  plt.ylabel("Frequency")
  plt.savefig("sleep_histogram.png")  # Save the plot to a file
  print("Histogram saved to sleep_histogram.png")
  ```

Questions to guide you:
- Why might the mean and median be different? What does that tell you?
- If your standard deviation is high, what does that say about your sleep schedule's consistency?
Learning milestones:
- You successfully load a CSV into a Pandas DataFrame → You've learned the basic unit of data analysis in Python.
- You calculate the mean, median, and standard deviation → You can summarize any dataset.
- You generate a histogram → You can visualize the shape and frequency of your data.
- You can explain what the standard deviation means in the context of your own data → You've built intuition, not just memorized a formula.
Project 2: Is This Game Rigged? A Loot Box Simulator
- File: LEARN_STATISTICS_FROM_SCRATCH.md
- Main Programming Language: Python
- Alternative Programming Languages: JavaScript, C#
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 1. The "Resume Gold"
- Difficulty: Level 1: Beginner
- Knowledge Area: Probability / Monte Carlo Simulation
- Software or Tool: Python, Numpy
- Main Book: Think Bayes, 2nd Edition by Allen B. Downey
What you'll build: A simulator that models opening "loot boxes" from a video game. You'll run thousands of simulated trials to estimate the real-world cost of obtaining a rare item and visualize the distribution of outcomes.
Why it teaches statistics: This project makes probability concrete. Instead of calculating complex formulas, you'll discover probabilities through experimentation (a "Monte Carlo method"). You'll build a deep, intuitive understanding of expected value, probability distributions, and the law of large numbers.
Core challenges you'll face:
- Modeling a probabilistic event → maps to using a random number generator to simulate a loot box opening
- Running thousands of simulations → maps to using loops to repeat an experiment many times
- Calculating the average outcome (Expected Value) → maps to understanding that the average of many random trials converges on a predictable value
- Visualizing the distribution of costs → maps to seeing that while the average cost might be $100, many people will pay much more
Key Concepts:
- Probability: "Think Stats" Ch. 4
- Monte Carlo Method: A Gentle Introduction to Monte Carlo Simulation
- Expected Value: Khan Academy - Expected Value
- The Law of Large Numbers: As you run more trials, the average result gets closer to the expected value.
Difficulty: Beginner. Time estimate: Weekend. Prerequisites: Project 1, comfort with loops and functions in Python.
Real world outcome: You'll define the drop rates for items in a loot box (e.g., Legendary: 1%, Epic: 10%, Common: 89%). Your script will then simulate buying boxes until you get a Legendary item, and repeat this 10,000 times. It will output:
Simulating 10,000 players trying to get a Legendary item...
(Loot box cost: $1)
- Average cost: $100.23
- Cheapest attempt: $1
- Most expensive attempt: $750
- 95% of players got it for less than: $300
Distribution of costs has been saved to lootbox_costs.png
The generated histogram will visually prove that a "1% chance" doesn't mean you're guaranteed to get it in 100 tries.
Implementation Hints:
- Setup: `pip install numpy matplotlib`
- Define Probabilities:

  ```python
  # Item: probability
  DROP_RATES = {
      'Legendary': 0.01,
      'Epic': 0.10,
      'Common': 0.89
  }
  LOOT_BOX_COST = 1
  ```

- Simulate One "Player": Write a function `simulate_one_player()` that:
  - Initializes `cost = 0`.
  - Enters a `while True` loop.
  - In the loop, adds `cost += LOOT_BOX_COST`.
  - Generates a random number between 0 and 1: `roll = np.random.rand()`.
  - Checks if you got the legendary: `if roll < DROP_RATES['Legendary']: break`.
  - Returns the final `cost`.
- Run Many Simulations:
  - Create an empty list `all_costs = []`.
  - Loop 10,000 times, calling `simulate_one_player()` in each iteration and appending the result to `all_costs`.
- Analyze and Visualize:
  - Convert `all_costs` to a NumPy array for easy calculations: `costs_arr = np.array(all_costs)`.
  - Calculate the mean, max, and percentiles: `np.mean(costs_arr)`, `np.max(costs_arr)`, `np.percentile(costs_arr, 95)`.
  - Create a histogram of `costs_arr` using Matplotlib.
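Assembled into one script, the hints above might look like this (one possible structure, using the names from the hints):

```python
import numpy as np
import matplotlib.pyplot as plt

DROP_RATES = {'Legendary': 0.01, 'Epic': 0.10, 'Common': 0.89}
LOOT_BOX_COST = 1

def simulate_one_player():
    """Buy boxes until a Legendary drops; return the total amount spent."""
    cost = 0
    while True:
        cost += LOOT_BOX_COST
        if np.random.rand() < DROP_RATES['Legendary']:
            return cost

# Repeat the experiment for 10,000 simulated players
all_costs = [simulate_one_player() for _ in range(10_000)]
costs_arr = np.array(all_costs)

print(f"- Average cost: ${np.mean(costs_arr):.2f}")  # near 1 / 0.01 = 100
print(f"- Cheapest attempt: ${np.min(costs_arr)}")
print(f"- Most expensive attempt: ${np.max(costs_arr)}")
print(f"- 95% of players got it for less than: ${np.percentile(costs_arr, 95):.0f}")

plt.hist(costs_arr, bins=50, edgecolor='black')
plt.title("Cost to Obtain a Legendary Item (10,000 players)")
plt.xlabel("Total cost ($)")
plt.ylabel("Number of players")
plt.savefig("lootbox_costs.png")
```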
Learning milestones:
- Your script can simulate a single probabilistic event → You understand how to model chance.
- You can simulate one player's entire journey to getting the item → You can model a sequence of random events.
- The average cost from your simulation is close to the theoretical expected value (1 / 0.01 = 100) → You've witnessed the Law of Large Numbers in action.
- You can look at the histogram and explain why some players are "unlucky" and pay much more than the average → You understand the concept of a probability distribution.
Project 3: A "Dumb" Spam Filter
- File: LEARN_STATISTICS_FROM_SCRATCH.md
- Main Programming Language: Python
- Alternative Programming Languages: N/A
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 2. The "Micro-SaaS / Pro Tool"
- Difficulty: Level 2: Intermediate
- Knowledge Area: Conditional Probability / Bayes' Theorem
- Software or Tool: Python, Pandas, Scikit-learn
- Main Book: Data Science from Scratch, 2nd Edition by Joel Grus
What you'll build: A simple spam filter for SMS messages using the Naive Bayes algorithm. You'll train it on a dataset of real messages labeled as "spam" or "ham" (not spam).
Why it teaches statistics: This is a classic, tangible application of Bayes' Theorem. You will learn how to "update your beliefs" (the probability of a message being spam) based on the evidence (the words in the message). It's a bridge from pure statistics to machine learning.
Core challenges you'll face:
- Understanding Bayes' Theorem intuitively → maps to P(Spam | Word) = P(Word | Spam) * P(Spam) / P(Word)
- Calculating word probabilities → maps to counting word frequencies in spam vs. ham messages
- Handling words not seen during training → maps to Laplace smoothing (adding 1 to all counts)
- Combining probabilities for multiple words → maps to the "naive" assumption of independence (and why we use log probabilities); the snippet below shows both tricks
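A tiny sketch of those last two challenges; every count and probability here is invented, so only the shape of the computation matters:

```python
import math

# Hypothetical training counts (invented for illustration):
count_free_in_spam = 120    # occurrences of "free" across all spam messages
total_words_in_spam = 8000  # total word count in spam messages
vocab_size = 3000           # number of distinct words in the vocabulary

# Laplace smoothing: add 1 to the numerator and the vocabulary size to the
# denominator, so a word never seen in spam still gets a small probability.
p_free_given_spam = (count_free_in_spam + 1) / (total_words_in_spam + vocab_size)

# Naive independence assumption, combined in log space to avoid underflow:
# log P(message | spam) ~ sum of log P(word_i | spam)
word_probs = [p_free_given_spam, 0.002, 0.0005]  # made-up per-word likelihoods
log_score = sum(math.log(p) for p in word_probs)
print(p_free_given_spam, log_score)
```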
Key Concepts:
- Bayes' Theorem: "Think Bayes" Ch. 1 - Allen B. Downey
- Naive Bayes Classifier: "Data Science from Scratch" Ch. 13 - Joel Grus
- Text Processing: Scikit-learn documentation on Text Feature Extraction
Difficulty: Intermediate. Time estimate: 1-2 weeks. Prerequisites: Projects 1 & 2. Understanding of Python dictionaries.
Real world outcome: You will train your classifier. Then, you can give it a new, unseen message, and it will output a prediction:
New message: "Claim your free prize now! Click here."
Prediction: Spam (Probability: 98.7%)
New message: "Hey, are we still on for lunch tomorrow?"
Prediction: Ham (Probability: 99.9%)
Implementation Hints:
- Get Data: Find a dataset. The "SMS Spam Collection Data Set" from the UCI Machine Learning Repository is perfect for this. It's a single file with two columns: the label (`ham`/`spam`) and the message text.
- Setup: `pip install pandas scikit-learn`
- High-level plan (using Scikit-learn's tools; a minimal sketch appears after these hints):
  - Load the data with Pandas.
  - Split the data into a training set and a testing set (`train_test_split`).
  - Create a data processing "pipeline":
    - Step 1: Vectorizer (`CountVectorizer`): This tool converts your text messages into numerical data by counting the occurrences of each word.
    - Step 2: Classifier (`MultinomialNB`): This is the Naive Bayes algorithm.
  - Train the pipeline on your training data (`pipeline.fit()`).
  - Test its accuracy on your unseen test data (`pipeline.score()`).
- The "From Scratch" Intuition (what Scikit-learn is doing for you):
  - Calculate Prior Probabilities: What's the overall probability of any given message being spam? P(Spam) = (Number of spam messages) / (Total number of messages).
  - Calculate Word Probabilities (Likelihoods):
    - For each word in your vocabulary, calculate P(Word | Spam) (how often does this word appear in spam messages?) and P(Word | Ham).
    - This involves counting total words in spam vs. ham and applying Laplace smoothing.
  - Classify a New Message:
    - For a new message, you want to calculate P(Spam | Message).
    - Using Bayes' rule, this is proportional to P(Message | Spam) * P(Spam).
    - The "naive" part is assuming words are independent: P(Message | Spam) ≈ P(word1 | Spam) * P(word2 | Spam) * ...
    - To avoid numbers getting too small ("underflow"), you add the log probabilities instead of multiplying.
    - Compare the final "score" for spam vs. ham and pick the higher one.
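As referenced in the high-level plan, a minimal sketch of the scikit-learn route might look like this; the file name and separator match the standard UCI download, so adjust them to your copy of the data:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# The UCI "SMS Spam Collection" is a tab-separated file with no header row
df = pd.read_csv('SMSSpamCollection', sep='\t', names=['label', 'message'])

# Hold out 20% of the messages to measure accuracy on unseen data
X_train, X_test, y_train, y_test = train_test_split(
    df['message'], df['label'], test_size=0.2, random_state=42
)

# Vectorizer + classifier chained into one pipeline
pipeline = make_pipeline(CountVectorizer(), MultinomialNB())
pipeline.fit(X_train, y_train)

print(f"Accuracy on unseen messages: {pipeline.score(X_test, y_test):.3f}")
print(pipeline.predict(["Claim your free prize now! Click here."]))  # likely 'spam'
```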
Learning milestones:
- You can calculate the prior probability of spam from the dataset → You understand base rates.
- You can calculate P("free" | Spam) and P("free" | Ham) → You understand likelihoods and how they form evidence.
- Your model correctly classifies most messages in the test set → You've successfully built a predictive model.
- You can explain why the model classified a message as spam by looking at the word probabilities → You understand how the algorithm "thinks".
Project 4: Is This Die Loaded?
- File: LEARN_STATISTICS_FROM_SCRATCH.md
- Main Programming Language: Python
- Alternative Programming Languages: R
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 1. The โResume Goldโ
- Difficulty: Level 2: Intermediate
- Knowledge Area: Inferential Statistics / Hypothesis Testing
- Software or Tool: Python, SciPy
- Main Book: Introductory Statistics (OpenStax, free online textbook)
What you'll build: A script that uses the Chi-Squared Goodness of Fit test to determine if a series of dice rolls deviates significantly from what you'd expect from a fair die.
Why it teaches statistics: This is a perfect introduction to hypothesis testing. You will formalize a question ("Is this die fair?"), state your null and alternative hypotheses, and use a statistical test to get a p-value that helps you make a conclusion.
Core challenges you'll face:
- Formulating a null hypothesis → maps to stating the default assumption: "The die is fair"
- Calculating expected frequencies → maps to understanding that for a fair die, each side should appear N/6 times in N rolls
- Understanding the Chi-Squared statistic → maps to quantifying the total difference between observed and expected counts
- Interpreting the p-value → maps to answering: "If the die were fair, how likely is it we'd see a deviation this large or larger?"
Key Concepts:
- Hypothesis Testing: "Introductory Statistics" Ch. 9 - OpenStax
- Chi-Squared Goodness of Fit Test: "Introductory Statistics" Ch. 11
- p-value: A Gentle Introduction to p-values
Difficulty: Intermediate. Time estimate: Weekend. Prerequisites: Projects 1 & 2.
Real world outcome: You will simulate rolling a die 600 times. First, a fair die, then a loaded die. Your script will analyze the results and produce a clear conclusion.
Analyzing 600 rolls of a simulated FAIR die...
Observed counts: [98, 105, 95, 102, 99, 101]
Expected counts: [100, 100, 100, 100, 100, 100]
Chi-Squared Statistic: 0.60, p-value: 0.988
Conclusion: The p-value is high. We do not have evidence to reject the null hypothesis. The die appears to be fair.
Analyzing 600 rolls of a simulated LOADED die (6 is twice as likely)...
Observed counts: [85, 89, 81, 88, 82, 175]
Expected counts: [100, 100, 100, 100, 100, 100]
Chi-Squared Statistic: 68.00, p-value: 2.7e-13
Conclusion: The p-value is very small. We reject the null hypothesis. The die is likely loaded.
Implementation Hints:
- Setup: `pip install numpy scipy matplotlib`
- Simulate Rolls:
  - Fair die: `np.random.randint(1, 7, size=600)`
  - Loaded die: Use `np.random.choice([1, 2, 3, 4, 5, 6], size=600, p=[...])` where you provide custom probabilities (e.g., `p=[1/7, 1/7, 1/7, 1/7, 1/7, 2/7]`).
- Get Observed Counts: Use NumPy's `np.unique(rolls, return_counts=True)` to get the counts for each face.
- Get Expected Counts: For 600 rolls of a fair die, the expected count for each face is 600 / 6 = 100.
- Perform the Test: The `scipy` library makes this easy.

  ```python
  from scipy.stats import chisquare

  observed_counts = [...]  # Your counts from step 3
  expected_counts = [100, 100, 100, 100, 100, 100]

  chi2, p_value = chisquare(f_obs=observed_counts, f_exp=expected_counts)
  print(f"Chi-Squared Statistic: {chi2:.2f}, p-value: {p_value:.3e}")

  alpha = 0.05  # Significance level
  if p_value < alpha:
      print("Conclusion: The p-value is small. We reject the null hypothesis.")
  else:
      print("Conclusion: The p-value is high. We fail to reject the null hypothesis.")
  ```

  The Chi-Squared statistic itself is calculated as Σ [ (Observed - Expected)² / Expected ] over all categories. You can calculate this manually to prove to yourself you understand the formula.
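As a manual check, here is that computation applied to the loaded-die counts from the sample output above (degrees of freedom = six faces minus one):

```python
import numpy as np
from scipy.stats import chi2 as chi2_dist

observed = np.array([85, 89, 81, 88, 82, 175])  # loaded-die counts from above
expected = np.full(6, 100)

# Sum of (Observed - Expected)^2 / Expected over all six faces
chi2_stat = ((observed - expected) ** 2 / expected).sum()

# Probability of a statistic this large or larger under H0, using the
# chi-squared distribution with 6 - 1 = 5 degrees of freedom; this matches
# what scipy.stats.chisquare reports on the same inputs
p_value = chi2_dist.sf(chi2_stat, df=5)
print(f"Chi-Squared: {chi2_stat:.2f}, p-value: {p_value:.1e}")  # 68.00, ~2.7e-13
```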
Learning milestones:
- You can state a clear Null and Alternative hypothesis for a problem → You understand the foundation of hypothesis testing.
- You can calculate the expected frequencies for a given scenario → You can define what "fairness" or "no effect" looks like.
- Your script produces a small p-value for the loaded die and a large one for the fair die → You understand what the test output signifies.
- You can correctly interpret the p-value in a sentence → You can translate statistical results into a plain English conclusion.
Project 5: Does More Studying Mean Higher Grades?
- File: LEARN_STATISTICS_FROM_SCRATCH.md
- Main Programming Language: Python
- Alternative Programming Languages: R, Excel
- Coolness Level: Level 2: Practical but Forgettable
- Business Potential: 1. The "Resume Gold"
- Difficulty: Level 3: Advanced
- Knowledge Area: Correlation / Linear Regression
- Software or Tool: Python, Pandas, Scikit-learn, Matplotlib
- Main Book: The Art of Data Science by Roger D. Peng and Elizabeth Matsui
What you'll build: A script that analyzes a dataset of students' hours studied and exam scores. It will calculate the correlation, fit a linear regression model, and create a scatter plot with the regression line overlaid.
Why it teaches statistics: This is the quintessential introduction to predictive modeling. You will learn to quantify the relationship between two variables and build a simple model that can make predictions (e.g., "If a student studies for 10 hours, what is their predicted score?").
Core challenges you'll face:
- Understanding correlation vs. causation → maps to realizing that just because two variables move together doesn't mean one causes the other
- Fitting a line to data → maps to the concept of "least squares," finding the line that minimizes the total error (sketched below)
- Interpreting the model's coefficients → maps to understanding the meaning of the slope and intercept
- Evaluating model performance → maps to what R-squared means (the proportion of variance explained by the model)
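For extra intuition, the regression in this project can also be computed by hand before reaching for a library. A sketch with invented data, using the textbook formulas slope = cov(x, y) / var(x) and intercept = mean(y) - slope * mean(x):

```python
import numpy as np

# Invented data: one (hours, score) pair per student
hours = np.array([2.0, 3.5, 5.0, 6.0, 8.0, 9.0])
scores = np.array([65, 72, 78, 80, 88, 93])

# Least squares by hand: slope from covariance and variance, intercept
# so that the line passes through the point of means
slope = np.cov(hours, scores, ddof=1)[0, 1] / np.var(hours, ddof=1)
intercept = scores.mean() - slope * hours.mean()

# For simple (one-variable) regression, R-squared is the squared
# correlation coefficient between x and y
r = np.corrcoef(hours, scores)[0, 1]

print(f"Score = {slope:.2f} * Hours + {intercept:.2f}")
print(f"R-squared: {r ** 2:.2f}")
```

Scikit-learn's `LinearRegression` (used in the script below) computes exactly these quantities, just generalized to any number of features.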
Key Concepts:
- Correlation: "Introductory Statistics" Ch. 12 - OpenStax
- Linear Regression: "Introductory Statistics" Ch. 12
- R-squared: StatQuest: R-squared explained
Difficulty: Advanced. Time estimate: 1-2 weeks. Prerequisites: All previous projects, plus a good grasp of high school algebra (the equation of a line, y = mx + b).
Real world outcome: Your script will analyze a dataset and produce a scatter plot. The plot will show a cloud of points (each point is a student) and a straight line running through them. You'll also get a statistical summary:
Correlation between Hours Studied and Exam Score: 0.89 (Strong positive correlation)
Linear Regression Model:
Score = 5.5 * (Hours Studied) + 45.2
- Intercept (b): 45.2 (Predicted score for 0 hours studied)
- Slope (m): 5.5 (Each additional hour of study is associated with a 5.5 point increase in score)
- R-squared: 0.79 (The model explains 79% of the variance in exam scores)
Prediction for a student who studies 8 hours: 89.2
Implementation Hints:
- Data: Create a simple CSV file `scores.csv` or find one online:

  ```
  hours_studied,exam_score
  2,65
  3.5,72
  ...
  ```

- Setup: `pip install pandas scikit-learn matplotlib`
- Analysis Script:

  ```python
  import pandas as pd
  from sklearn.linear_model import LinearRegression
  import matplotlib.pyplot as plt

  # Load and prepare data
  df = pd.read_csv('scores.csv')
  X = df[['hours_studied']]  # Features (needs to be 2D)
  y = df['exam_score']       # Target

  # Calculate correlation
  correlation = df['hours_studied'].corr(df['exam_score'])
  print(f"Correlation: {correlation:.2f}")

  # Build and train the model
  model = LinearRegression()
  model.fit(X, y)

  # Get model parameters
  slope = model.coef_[0]
  intercept = model.intercept_
  r_squared = model.score(X, y)
  print(f"Model: Score = {slope:.2f} * Hours + {intercept:.2f}")
  print(f"R-squared: {r_squared:.2f}")

  # Make a prediction
  hours_to_predict = [[8]]  # Needs to be 2D
  predicted_score = model.predict(hours_to_predict)
  print(f"Predicted score for {hours_to_predict[0][0]} hours: {predicted_score[0]:.2f}")

  # Plotting
  plt.figure()
  plt.scatter(X, y, label='Actual Scores')
  plt.plot(X, model.predict(X), color='red', label='Regression Line')
  plt.title("Hours Studied vs. Exam Score")
  plt.xlabel("Hours Studied")
  plt.ylabel("Exam Score")
  plt.legend()
  plt.savefig("regression_plot.png")
  ```
Learning milestones:
- You create a scatter plot and visually identify a trend → You can spot potential relationships in data.
- You can calculate and interpret the correlation coefficient → You can quantify the strength and direction of a linear relationship.
- You fit a linear regression model and can explain the meaning of the slope and intercept → You can model a relationship mathematically.
- You can use your model to make a prediction for a new data point → You have built your first predictive model.
Summary
| Project | Difficulty | Main Language | Key Learning |
|---|---|---|---|
| 1. Personal Data Dashboard | Beginner | Python | Descriptive Statistics, Visualization |
| 2. Loot Box Simulator | Beginner | Python | Probability, Monte Carlo Simulation |
| 3. "Dumb" Spam Filter | Intermediate | Python | Bayes' Theorem, Text Classification |
| 4. Is This Die Loaded? | Intermediate | Python | Hypothesis Testing, Chi-Squared Test |
| 5. Study vs. Grades | Advanced | Python | Correlation, Linear Regression |