Project 8: Descriptive Statistics Observatory
Build an observatory that profiles the shape, spread, and robustness of tabular data and surfaces transformation opportunities.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Level 2: Intermediate |
| Time Estimate | 1 week |
| Main Programming Language | Python |
| Alternative Programming Languages | R, SQL+BI |
| Coolness Level | Level 3: Genuinely Clever |
| Business Potential | 2. The “Micro-SaaS / Pro Tool” |
| Prerequisites | Projects 1, 6, 7 |
| Key Topics | Mean/median/mode, variance, skewness, kurtosis, quantiles, outliers |
1. Learning Objectives
- Build robust and classical descriptive summaries.
- Interpret histogram/boxplot/density tradeoffs.
- Propose transformation strategies with clear rationale.
- Distinguish anomalies from natural heavy-tail behavior.
2. All Theory Needed (Per-Concept Breakdown)
2.1 Robust vs Non-Robust Summary Statistics
- Fundamentals: The mean is statistically efficient under symmetric, light-tailed noise; the median and IQR stay stable in the presence of outliers.
- Deep Dive into the concept: A good observatory compares both and explains divergence, rather than declaring one universally correct.
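The divergence is easy to demonstrate. The sketch below uses invented values with one extreme observation among typical ones:

```python
import numpy as np

# Invented example: one extreme order total among typical ones.
values = np.array([12.0, 14.0, 13.0, 15.0, 11.0, 13.0, 14.0, 900.0])

mean = values.mean()            # dragged far from the bulk by the outlier
median = np.median(values)      # barely moves
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1                   # robust spread measure

print(f"mean={mean:.1f} median={median:.1f} IQR={iqr:.2f}")
```

When mean and median disagree this sharply, the report should surface both and explain the gap rather than silently picking one.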
2.2 Shape and Transformation
- Fundamentals: Skewness, kurtosis, and quantiles characterize distribution shape.
- Deep Dive into the concept: Transformations (e.g., log or rescaling) can stabilize variance and improve comparability, but they change the scale on which results are interpreted.
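A small sketch of the effect, assuming simulated lognormal data and pandas' sample skewness (`Series.skew`):

```python
import numpy as np
import pandas as pd

# A lognormal variable is strongly right-skewed; its log is symmetric.
rng = np.random.default_rng(42)
raw = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=10_000))

skew_raw = raw.skew()          # large positive skew
skew_log = np.log(raw).skew()  # near zero after the transform

print(f"skew before log: {skew_raw:.2f}, after: {skew_log:.2f}")
```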
3. Project Specification
3.1 What You Will Build
A profiling tool that emits quality and distribution dashboards for tabular data.
3.2 Functional Requirements
- Produce center/spread/shape metrics for each numeric column.
- Emit visualization bundle (histogram, density, boxplot).
- Suggest transformations with reason codes.
- Flag outlier candidates with tunable rules.
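The first requirement can be sketched as a per-column metric table; the column and metric names below are illustrative, not a required schema:

```python
import pandas as pd

def profile_numeric(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column center/spread/shape table; metric names are illustrative."""
    rows = []
    for col in df.select_dtypes(include="number").columns:
        s = df[col].dropna()
        q1, q3 = s.quantile([0.25, 0.75])
        rows.append({
            "column": col,
            "mean": s.mean(),
            "median": s.median(),
            "std": s.std(),
            "iqr": q3 - q1,
            "skew": s.skew(),
            "kurtosis": s.kurt(),  # pandas reports excess kurtosis
        })
    return pd.DataFrame(rows).set_index("column")

demo = pd.DataFrame({"price": [10.0, 12.0, 11.0, 13.0, 200.0],
                     "label": list("abcde")})
prof = profile_numeric(demo)
print(prof)
```

Non-numeric columns are skipped here; a fuller tool would route them to categorical profiling instead.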
3.3 Non-Functional Requirements
- Works on >1M rows efficiently.
- Reproducible output bundle with run metadata.
3.4 Example Usage / Output
```
$ python descriptive_observatory.py --input data/retail_orders.csv
Columns profiled: 34
Skewness alerts: 7
Outlier flags: 18442
Dashboard exported: outputs/descriptive_observatory/index.html
```
3.5 Real World Outcome
You get a decision-ready profile showing where inference/modeling assumptions are likely to fail.
4. Solution Architecture
Input table -> profiler -> visualizer -> transform recommender -> report
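A minimal sketch of that stage chain, with toy stages and assumed artifact keys (`data`, `metrics`, `suggestions`):

```python
# Each stage reads and enriches a shared artifact dict; names are assumptions.
def profiler(artifact):
    data = artifact["data"]
    artifact["metrics"] = {"n": len(data), "mean": sum(data) / len(data)}
    return artifact

def recommender(artifact):
    # Toy rule: suggest a log transform when the mean sits far above the minimum.
    if artifact["metrics"]["mean"] > 2 * min(artifact["data"]):
        artifact["suggestions"].append("log-transform")
    return artifact

def run_pipeline(data, stages):
    artifact = {"data": data, "metrics": None, "figures": [], "suggestions": []}
    for stage in stages:
        artifact = stage(artifact)
    return artifact

result = run_pipeline([1.0, 2.0, 50.0], [profiler, recommender])
print(result["metrics"], result["suggestions"])
```

Keeping stages as plain functions over one artifact makes it easy to insert the visualizer between profiler and recommender later.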
5. Implementation Guide
5.1 Development Environment Setup
```
pip install pandas numpy matplotlib seaborn
```
5.2 Project Structure
```
P08/
  descriptive_observatory.py
  templates/
  outputs/
```
5.3 The Core Question You Are Answering
“What does this data actually look like before I run inference or modeling?”
5.4 Concepts You Must Understand First
- Center/spread metrics
- Quantiles and IQR
- Skewness and kurtosis
- Data transformation logic
5.5 Questions to Guide Your Design
- What threshold triggers warnings?
- How will you avoid over-flagging outliers?
- Which outputs matter most to stakeholders?
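One way to keep outlier rules tunable is to expose the Tukey fence multiplier `k` as a parameter; the data and default below are illustrative:

```python
import numpy as np

def tukey_flags(values, k=1.5):
    """Flag points outside [Q1 - k*IQR, Q3 + k*IQR]; k is the tunable knob."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [bool(v < lo or v > hi) for v in values]

data = [10, 11, 12, 11, 10, 12, 11, 16]
flags_strict = tukey_flags(data)        # k=1.5 flags the mild high value
flags_loose = tukey_flags(data, k=4.0)  # widening k stops flagging it
```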
5.6 Thinking Exercise
Compare interpretation of the same variable before and after log transform.
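One concrete interpretation shift to notice in this exercise: averaging on the log scale and transforming back yields the geometric mean, not the arithmetic mean.

```python
import numpy as np

# Invented values spanning two orders of magnitude.
values = np.array([1.0, 10.0, 100.0])

arith_mean = values.mean()                       # 37.0
geo_mean = float(np.exp(np.log(values).mean()))  # ~10.0
```

So "the average order value" means something different after a log transform, and the report should say which one it shows.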
5.7 The Interview Questions They’ll Ask
- Why median over mean in skewed data?
- What does kurtosis capture?
- Why can histograms mislead?
- How do you choose outlier rules?
- Why do transformations change interpretation?
5.8 Hints in Layers
- Hint 1: Build metric table first.
- Hint 2: Add plot generation.
- Hint 3: Add transformation suggestions.
- Hint 4: Add cohort-level profiling.
5.9 Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Descriptive stats | Think Stats | Ch. 2-4 |
| Visualization | Fundamentals of Data Visualization | selected |
| EDA mindset | Exploratory Data Analysis (Tukey) | core sections |
6. Testing Strategy
- Synthetic heavy-tail data tests.
- Known outlier insertion tests.
- Plot generation smoke tests.
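A known-outlier insertion test might look like this sketch, assuming a simple Tukey-fence detector (the planted values are arbitrary):

```python
import numpy as np

def tukey_outlier_mask(values, k=1.5):
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Plant extremes in clean data, then check the detector recovers them.
rng = np.random.default_rng(0)
clean = rng.normal(loc=50.0, scale=2.0, size=1_000)
planted = np.concatenate([clean, [500.0, -400.0]])

mask = tukey_outlier_mask(planted)
recovered = planted[mask]
```

Also count false positives on the clean portion to catch over-flagging, not just misses.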
7. Common Pitfalls & Debugging
| Pitfall | Symptom | Solution |
|---|---|---|
| One-size outlier rule | noisy false positives | cohort-based thresholds |
| Missing type checks | broken summaries | schema validation upfront |
| Over-smoothed densities | hidden multimodality | compare with histogram bins |
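The density pitfall can be reproduced with binning alone: a 2-bin histogram (analogous to an over-wide KDE bandwidth) merges two clear modes, while finer bins separate them. The data here are simulated:

```python
import numpy as np

# Two well-separated modes around -3 and +3.
rng = np.random.default_rng(1)
bimodal = np.concatenate([rng.normal(-3, 0.5, 500), rng.normal(3, 0.5, 500)])

coarse, _ = np.histogram(bimodal, bins=2)
fine, _ = np.histogram(bimodal, bins=40)

def peak_count(counts):
    """Count bins strictly higher than both neighbors."""
    return sum(
        counts[i] > counts[i - 1] and counts[i] > counts[i + 1]
        for i in range(1, len(counts) - 1)
    )

print("peaks coarse:", peak_count(coarse), "fine:", peak_count(fine))
```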
8. Extensions & Challenges
- Add robust scaling and winsorization options.
- Add automated data quality scorecard.
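A winsorization sketch, assuming simple quantile clipping (not SciPy's `scipy.stats.mstats.winsorize` API):

```python
import numpy as np

def winsorize(values, lower=0.05, upper=0.95):
    """Clip values to the given quantiles instead of dropping them."""
    lo, hi = np.quantile(values, [lower, upper])
    return np.clip(values, lo, hi)

data = np.array([1.0, 2.0, 3.0, 4.0, 100.0])
capped = winsorize(data)  # the extreme value is pulled toward the bulk
```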
9. Real-World Connections
- Data quality gates for product analytics.
- Monitoring KPI drift in operations.
10. Resources
- Think Stats
- Exploratory Data Analysis (John W. Tukey)
11. Self-Assessment Checklist
- I can justify metric choices under skew/outliers.
- I can explain one transformation tradeoff.
- My report is reproducible.
12. Submission / Completion Criteria
Minimum: profiling + plots + outlier flags.
Full: includes transformation rationale and segment-aware diagnostics.