Project 8: Descriptive Statistics Observatory

Build an observatory that profiles data shape, spread, robustness, and transformation opportunities.

Quick Reference

Difficulty: Level 2 (Intermediate)
Time Estimate: 1 week
Main Programming Language: Python
Alternative Programming Languages: R, SQL+BI
Coolness Level: Level 3 (Genuinely Clever)
Business Potential: 2, the “Micro-SaaS / Pro Tool”
Prerequisites: Projects 1, 6, 7
Key Topics: Mean/median/mode, variance, skewness, kurtosis, quantiles, outliers

1. Learning Objectives

  1. Build robust and classical descriptive summaries.
  2. Interpret histogram/boxplot/density tradeoffs.
  3. Propose transformation strategies with clear rationale.
  4. Distinguish anomalies from natural heavy-tail behavior.

2. All Theory Needed (Per-Concept Breakdown)

2.1 Robust vs Non-Robust Summary Statistics

  • Fundamentals: The mean is statistically efficient under symmetric, light-tailed noise; the median and IQR stay stable when outliers or heavy tails are present.
  • Deep Dive into the concept: A good observatory reports both the classical and the robust summaries and explains why they diverge, rather than declaring one universally correct.
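A minimal sketch of this comparison, using synthetic data with a few planted extremes (the column names and thresholds here are illustrative, not part of the spec):

```python
import numpy as np

rng = np.random.default_rng(0)
clean = rng.normal(loc=50, scale=5, size=1_000)
dirty = np.append(clean, [500.0, 750.0, 1000.0])  # three planted extremes

def summary(x):
    # Classical pair (mean/std) next to the robust pair (median/IQR)
    q1, q3 = np.percentile(x, [25, 75])
    return {"mean": x.mean(), "median": np.median(x),
            "std": x.std(), "iqr": q3 - q1}

s_clean, s_dirty = summary(clean), summary(dirty)
# The mean and std shift sharply; the median and IQR barely move.
for k in ("mean", "median", "std", "iqr"):
    print(f"{k:>6}: {s_clean[k]:8.2f} -> {s_dirty[k]:8.2f}")
```

Reporting both pairs side by side, as above, is what lets the observatory explain divergence instead of hiding it.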

2.2 Shape and Transformation

  • Fundamentals: Skewness, kurtosis, and quantiles characterize distribution shape: asymmetry, tail weight, and where the mass sits.
  • Deep Dive into the concept: Transformations (log, scaling) can stabilize variance and improve comparability across columns, but they change what the summary statistics mean.
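A short sketch of both points at once: lognormal data is heavily right-skewed on the raw scale, a log transform makes it symmetric, and the resulting summaries now live in log-units. Parameter choices here are arbitrary illustrations.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
raw = pd.Series(rng.lognormal(mean=3.0, sigma=1.0, size=5_000))  # right-skewed
logged = np.log(raw)  # exactly normal by construction

print(f"raw:    skew={raw.skew():5.2f}  kurtosis={raw.kurtosis():6.2f}")
print(f"logged: skew={logged.skew():5.2f}  kurtosis={logged.kurtosis():6.2f}")
# Interpretation changes: np.exp(logged.mean()) is the geometric mean of
# `raw`, not its arithmetic mean.
```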

3. Project Specification

3.1 What You Will Build

A profiling tool that emits quality and distribution dashboards for tabular data.

3.2 Functional Requirements

  1. Produce center/spread/shape metrics for each numeric column.
  2. Emit a visualization bundle (histogram, density, boxplot) per column.
  3. Suggest transformations with reason codes.
  4. Flag outlier candidates using tunable rules.
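One way requirements 1 and 4 could fit together, assuming a hypothetical `profile_numeric` helper and a tunable IQR fence (the output schema is a sketch, not prescribed):

```python
import numpy as np
import pandas as pd

def profile_numeric(df: pd.DataFrame, iqr_k: float = 1.5) -> pd.DataFrame:
    """One row of center/spread/shape metrics per numeric column,
    plus an outlier count from a tunable IQR fence."""
    rows = []
    for col in df.select_dtypes(include="number"):
        s = df[col].dropna()
        q1, q3 = s.quantile([0.25, 0.75])
        lo, hi = q1 - iqr_k * (q3 - q1), q3 + iqr_k * (q3 - q1)
        rows.append({
            "column": col,
            "mean": s.mean(), "median": s.median(),
            "std": s.std(), "iqr": q3 - q1,
            "skew": s.skew(), "kurtosis": s.kurtosis(),
            "outliers": int(((s < lo) | (s > hi)).sum()),
        })
    return pd.DataFrame(rows)

df = pd.DataFrame({"price": [10, 12, 11, 13, 400], "label": list("abcde")})
report = profile_numeric(df)
print(report)
```

Making `iqr_k` a parameter is what satisfies "tunable rules": stricter cohorts can raise it, noisier ones lower it.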

3.3 Non-Functional Requirements

  • Runs efficiently on tables with more than 1M rows.
  • Reproducible output bundle with run metadata.

3.4 Example Usage / Output

$ python descriptive_observatory.py --input data/retail_orders.csv
Columns profiled: 34
Skewness alerts: 7
Outlier flags: 18442
Dashboard exported: outputs/descriptive_observatory/index.html
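A minimal argument-parsing skeleton matching the invocation above; `--outdir` is an assumed extra option, not something the spec prescribes.

```python
import argparse

parser = argparse.ArgumentParser(prog="descriptive_observatory.py")
parser.add_argument("--input", required=True, help="path to the input CSV")
parser.add_argument("--outdir", default="outputs/descriptive_observatory",
                    help="directory for the exported dashboard bundle")

# Parsing the example invocation from above:
args = parser.parse_args(["--input", "data/retail_orders.csv"])
print(args.input, args.outdir)
```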

3.5 Real World Outcome

You get a decision-ready profile showing where inference/modeling assumptions are likely to fail.


4. Solution Architecture

Input table -> profiler -> visualizer -> transform recommender -> report
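Each arrow in the diagram can become one function. A stubbed sketch of the four stages, with all names and the "max > 10x median" recommendation heuristic purely illustrative:

```python
import pandas as pd

def profile(df):                 # input table -> metric table
    return df.describe().T

def visualize(metrics):          # metric table -> figure specs (stubbed)
    return [f"hist:{col}" for col in metrics.index]

def recommend(metrics):          # metric table -> transform suggestions
    return {col: "log" for col in metrics.index
            if metrics.loc[col, "max"] > 10 * metrics.loc[col, "50%"]}

def report(metrics, figures, suggestions):
    return {"metrics": metrics, "figures": figures, "suggestions": suggestions}

df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 200.0], "y": [5.0, 6.0, 7.0, 8.0]})
metrics = profile(df)
bundle = report(metrics, visualize(metrics), recommend(metrics))
print(bundle["suggestions"])   # only the long-tailed column gets a hint
```

Keeping each stage a pure function over the metric table makes the pipeline easy to test stage by stage.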

5. Implementation Guide

5.1 Development Environment Setup

pip install pandas numpy matplotlib seaborn

5.2 Project Structure

P08/
  descriptive_observatory.py
  templates/
  outputs/

5.3 The Core Question You Are Answering

“What does this data actually look like before I run inference or modeling?”

5.4 Concepts You Must Understand First

  1. Center/spread metrics
  2. Quantiles and IQR
  3. Skewness and kurtosis
  4. Data transformation logic

5.5 Questions to Guide Your Design

  1. What threshold triggers warnings?
  2. How will you avoid over-flagging outliers?
  3. Which outputs matter most to stakeholders?

5.6 Thinking Exercise

Compare interpretation of the same variable before and after log transform.

5.7 The Interview Questions They’ll Ask

  1. Why median over mean in skewed data?
  2. What does kurtosis capture?
  3. Why can histograms mislead?
  4. How do you choose outlier rules?
  5. Why do transformations change interpretation?

5.8 Hints in Layers

  • Hint 1: Build metric table first.
  • Hint 2: Add plot generation.
  • Hint 3: Add transformation suggestions.
  • Hint 4: Add cohort-level profiling.

5.9 Books That Will Help

Descriptive stats: Think Stats, Ch. 2-4
Visualization: Fundamentals of Data Visualization, selected chapters
EDA mindset: Tukey, Exploratory Data Analysis, core sections

6. Testing Strategy

  • Synthetic heavy-tail data tests.
  • Known outlier insertion tests.
  • Plot generation smoke tests.
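The first two test ideas can be sketched directly against an IQR flagging rule (here an inline helper; swap in your own profiler's rule):

```python
import numpy as np

rng = np.random.default_rng(42)

def iqr_outliers(x, k=1.5):
    q1, q3 = np.percentile(x, [25, 75])
    return (x < q1 - k * (q3 - q1)) | (x > q3 + k * (q3 - q1))

def test_known_outlier_insertion():
    base = rng.normal(0, 1, 1_000)
    planted = np.append(base, [15.0, -12.0])     # plant two extreme points
    assert iqr_outliers(planted)[-2:].all(), "planted outliers must be flagged"

def test_heavy_tail_flags_more():
    normal = rng.normal(0, 1, 5_000)
    heavy = rng.standard_t(df=2, size=5_000)     # Student-t: heavy tails
    assert iqr_outliers(heavy).sum() > iqr_outliers(normal).sum()

test_known_outlier_insertion()
test_heavy_tail_flags_more()
print("ok")
```

Fixing the RNG seed keeps these tests deterministic across runs.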

7. Common Pitfalls & Debugging

Pitfall: one-size-fits-all outlier rule. Symptom: noisy false positives. Solution: cohort-based thresholds.
Pitfall: missing type checks. Symptom: broken summaries. Solution: schema validation upfront.
Pitfall: over-smoothed densities. Symptom: hidden multimodality. Solution: compare with histogram bins.
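The multimodality pitfall in concrete form: a coarse histogram still shows the empty valley between two modes that an over-wide KDE bandwidth would smooth away. Data and bin count are illustrative.

```python
import numpy as np

rng = np.random.default_rng(7)
# Two well-separated modes; an over-smoothed density can merge them into one.
bimodal = np.concatenate([rng.normal(-3, 0.5, 500), rng.normal(3, 0.5, 500)])

counts, edges = np.histogram(bimodal, bins=10)
centers = (edges[:-1] + edges[1:]) / 2
mid = int(np.argmin(np.abs(centers)))   # the bin closest to zero
print(counts)
# The valley bin near zero is nearly empty relative to the mode bins,
# so both modes stay visible in the histogram.
```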

8. Extensions & Challenges

  • Add robust scaling and winsorization options.
  • Add automated data quality scorecard.
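For the winsorization extension, a minimal sketch: clip values into a quantile band rather than dropping them, so the sample size is preserved. The quantile cutoffs are arbitrary defaults.

```python
import numpy as np

def winsorize(x, lower=0.05, upper=0.95):
    """Clip values into the [lower, upper] quantile band instead of
    dropping them, preserving the sample size."""
    lo, hi = np.quantile(x, [lower, upper])
    return np.clip(x, lo, hi)

x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])
print(winsorize(x))   # the extreme value is pulled in; length is unchanged
```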

9. Real-World Connections

  • Data quality gates for product analytics.
  • Monitoring KPI drift in operations.

10. Resources

  • Think Stats
  • Tukey EDA

11. Self-Assessment Checklist

  • I can justify metric choices under skew/outliers.
  • I can explain one transformation tradeoff.
  • My report is reproducible.

12. Submission / Completion Criteria

Minimum: profiling + plots + outlier flags.

Full: includes transformation rationale and segment-aware diagnostics.