Project 4: Correlation and Scatter Plots
Build an analysis that measures and visualizes relationships between variables.
Project Overview
| Attribute | Value |
|---|---|
| Difficulty | Level 1: Beginner |
| Time Estimate | Weekend |
| Main Language | R |
| Alternative Languages | Python, Julia |
| Knowledge Area | Correlation |
| Tools | ggplot2 |
| Main Book | “OpenIntro Statistics” by Diez et al. |
What you’ll build: Scatter plots with correlation coefficients and interpretation notes.
Why it teaches stats: Correlation is the entry point for understanding relationships.
Core challenges you’ll face:
- Computing correlations correctly
- Recognizing non-linear relationships
- Avoiding over-interpretation
Real World Outcome
You will generate scatter plots and report correlation values for key variable pairs.
Example Output:
Correlation (mpg, wt): -0.87
Saved plot: mpg_vs_wt.png
Verification steps:
- Compare correlations across different pairs
- Check for outliers influencing results
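The reporting step can be sketched in Python, one of the alternative languages listed above. This is a minimal sketch, not a reference implementation: the helper names `pearson_r`, `report_line`, and `save_scatter`, the variable names, and the five data points are all made up for illustration. The plotting helper assumes matplotlib is installed and imports it lazily, so the numeric part runs without it.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: co-variation scaled by the spread of each variable."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    sxy = sum(a * b for a, b in zip(dx, dy))
    sxx = sum(a * a for a in dx)
    syy = sum(b * b for b in dy)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return sxy / math.sqrt(sxx * syy)

def report_line(name_a, name_b, r):
    """Format one line of the report, rounded to two decimals."""
    return f"Correlation ({name_a}, {name_b}): {r:.2f}"

def save_scatter(xs, ys, x_label, y_label, path):
    """Save a labeled scatter plot (assumes matplotlib is available)."""
    import matplotlib
    matplotlib.use("Agg")  # render to a file without needing a display
    import matplotlib.pyplot as plt
    fig, ax = plt.subplots()
    ax.scatter(xs, ys)
    ax.set_xlabel(x_label)
    ax.set_ylabel(y_label)
    ax.set_title(f"r = {pearson_r(xs, ys):.2f}")
    fig.savefig(path)
    plt.close(fig)

# Made-up illustration data, not taken from any real dataset.
weight = [1.0, 2.0, 3.0, 4.0, 5.0]
mileage = [30.0, 27.0, 21.0, 20.0, 12.0]

r = pearson_r(weight, mileage)
line = report_line("mileage", "weight", r)
print(line)
# To also write the plot: save_scatter(weight, mileage, "weight", "mileage", "mileage_vs_weight.png")
```

Running the same function over several variable pairs gives you the values to compare in the verification steps.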
The Core Question You’re Answering
“How strongly do two variables move together, and what does that mean?”
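Correlation is useful but easy to misuse: it captures only linear association, a single outlier can distort it, and it says nothing about causation.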
Concepts You Must Understand First
Stop and research these before coding:
- Pearson correlation
  - What does a correlation of -0.8 mean?
  - Book Reference: “OpenIntro Statistics”, Ch. 3
- Nonlinear relationships
  - Why can correlation miss curved patterns?
  - Book Reference: “OpenIntro Statistics”, Ch. 3
- Outliers
  - How can one point distort correlation?
  - Book Reference: “OpenIntro Statistics”, Ch. 3
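To see why correlation can miss a curved pattern, it helps to compute it on data where y is completely determined by x but the relationship is not a straight line. The sketch below uses made-up points and an illustrative `pearson_r` helper (an assumption of this example, not part of any library).

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation from centered sums of products."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Symmetric x values, so the left and right halves of the parabola cancel out.
xs = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
ys_curve = [x * x for x in xs]          # perfect dependence, but curved
ys_line = [2.0 * x + 1.0 for x in xs]   # perfect dependence, straight line

r_curve = pearson_r(xs, ys_curve)
r_line = pearson_r(xs, ys_line)
print(f"parabola: r = {r_curve:.2f}")
print(f"line:     r = {r_line:.2f}")
```

The parabola yields a correlation of zero even though y is a deterministic function of x, because Pearson correlation only measures how well a straight line describes the data.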
Questions to Guide Your Design
- Variable selection
  - Which variable pairs are meaningful to compare?
  - How will you justify each pair?
- Interpretation
  - How will you explain correlation vs causation?
  - Will you test robustness with outlier removal?
Thinking Exercise
Correlation vs Causation
Find two variables with high correlation. Propose a third variable that might explain both.
Questions while working:
- Why doesn’t correlation prove causality?
- What experiments could test causality?
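One way to build intuition for the exercise is to simulate a third variable that drives two others. In the sketch below, `z` is a hypothetical confounder, and `x` and `y` each depend on `z` plus independent noise, so neither causes the other. The partial correlation formula used at the end is a standard way to ask how much linear association remains once `z` is accounted for; it assumes linear relationships, and all names, sample sizes, and noise levels are illustrative choices.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation from centered sums of products."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(42)  # fixed seed so the run is repeatable
n = 2000

# z is the hidden common cause; x and y never influence each other.
z = [rng.gauss(0, 1) for _ in range(n)]
x = [zi + rng.gauss(0, 0.5) for zi in z]
y = [zi + rng.gauss(0, 0.5) for zi in z]

r_xy = pearson_r(x, y)
r_xz = pearson_r(x, z)
r_yz = pearson_r(y, z)

# Partial correlation of x and y, controlling for z.
partial_xy_given_z = (r_xy - r_xz * r_yz) / math.sqrt(
    (1 - r_xz ** 2) * (1 - r_yz ** 2)
)

print(f"r(x, y)     = {r_xy:.2f}")
print(f"r(x, y | z) = {partial_xy_given_z:.2f}")
```

The raw correlation is strong, yet it largely disappears after controlling for `z`. In real data you rarely observe the confounder, which is why randomized experiments are the usual way to test causality.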
The Interview Questions They’ll Ask
Prepare to answer these:
- “What does correlation measure?”
- “Why doesn’t correlation imply causation?”
- “How can outliers affect correlation?”
- “What is the difference between Pearson and Spearman?”
- “How do you interpret a scatter plot?”
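For the Pearson versus Spearman question, a small comparison can help: Spearman correlation is Pearson correlation computed on ranks, so it measures monotonic rather than linear association. The sketch below is a from-scratch illustration with made-up data; `average_ranks` and `spearman_rho` are hypothetical helper names, and ties receive the average of the ranks they span.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation from centered sums of products."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson_r(average_ranks(xs), average_ranks(ys))

# Strictly increasing but strongly curved relationship.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [math.exp(x) for x in xs]

r = pearson_r(xs, ys)
rho = spearman_rho(xs, ys)
print(f"Pearson r    = {r:.2f}")
print(f"Spearman rho = {rho:.2f}")
```

Because `ys` always increases with `xs`, the ranks agree perfectly and Spearman reports 1, while Pearson is noticeably lower because the points do not fall on a straight line.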
Hints in Layers
Hint 1 (Starting Point): Start with a simple scatter plot.
Hint 2 (Next Level): Compute the correlation and annotate the plot with it.
Hint 3 (Technical Details): Fit a straight line and check the residual plot; a systematic pattern in the residuals signals nonlinearity.
Hint 4 (Tools/Debugging): Remove a suspected outlier and see how the correlation changes.
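The outlier check in Hint 4 can be made systematic by dropping each point in turn and recomputing the correlation. The following is a minimal sketch with made-up data; `leave_one_out` is a hypothetical helper name, and "influence" here simply means the absolute change in the correlation.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation from centered sums of products."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def leave_one_out(xs, ys):
    """Return (index, r without that point) for every point."""
    results = []
    for i in range(len(xs)):
        r_without = pearson_r(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        results.append((i, r_without))
    return results

# Five points with almost no trend, plus one point far from the rest.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 15.0]
ys = [3.0, 1.0, 4.0, 2.0, 3.0, 15.0]

r_all = pearson_r(xs, ys)
loo = leave_one_out(xs, ys)

# The most influential point is the one whose removal changes r the most.
most_influential, r_without = max(loo, key=lambda item: abs(r_all - item[1]))

print(f"r with all points: {r_all:.2f}")
print(f"r without point {most_influential}: {r_without:.2f}")
```

Here a single distant point turns a weak correlation into a strong one. If one removal changes your result this much, report both values and explain the difference rather than silently dropping the point.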
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Pearson correlation | “OpenIntro Statistics” | Ch. 3 |
| Nonlinear patterns | “OpenIntro Statistics” | Ch. 3 |
| Outliers | “OpenIntro Statistics” | Ch. 3 |
Implementation Hints
- Always plot before computing correlation.
- Label axes with units.
- Add a trend line but explain limitations.
Learning Milestones
- First milestone: You can compute correlation coefficients.
- Second milestone: You can interpret scatter plots.
- Final milestone: You can explain correlation caveats clearly.