Project 8: Descriptive Statistics Observatory
Build an observatory that profiles the shape, spread, and robustness of tabular data and surfaces transformation opportunities.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Level 2: Intermediate |
| Time Estimate | 1 week |
| Main Programming Language | Python |
| Alternative Programming Languages | R, SQL+BI |
| Coolness Level | Level 3: Genuinely Clever |
| Business Potential | 2. The “Micro-SaaS / Pro Tool” |
| Prerequisites | Projects 1, 6, 7 |
| Key Topics | Mean/median/mode, variance, skewness, kurtosis, quantiles, outliers |
1. Learning Objectives
- Build robust and classical descriptive summaries.
- Interpret histogram/boxplot/density tradeoffs.
- Propose transformation strategies with clear rationale.
- Distinguish anomalies from natural heavy-tail behavior.
2. All Theory Needed (Per-Concept Breakdown)
2.1 Robust vs Non-Robust Summary Statistics
- Fundamentals: The mean is statistically efficient under symmetric, light-tailed noise; the median and IQR stay stable in the presence of outliers.
- Deep Dive into the concept: A good observatory compares both and explains divergence, rather than declaring one universally correct.
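The divergence is easy to demonstrate. The sketch below uses invented values with one extreme observation among typical ones:

```python
import numpy as np

# Invented example: one extreme order total among typical ones.
values = np.array([12.0, 14.0, 13.0, 15.0, 11.0, 13.0, 14.0, 900.0])

mean = values.mean()            # dragged far from the bulk by the outlier
median = np.median(values)      # barely moves
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1                   # robust spread measure

print(f"mean={mean:.1f} median={median:.1f} IQR={iqr:.2f}")
```

When mean and median disagree this sharply, the report should surface both and explain the gap rather than silently picking one.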
2.2 Shape and Transformation
- Fundamentals: Skewness, kurtosis, and quantiles characterize distribution shape.
- Deep Dive into the concept: Transformations (e.g., log or rescaling) can stabilize variance and improve comparability, but they change the scale on which results are interpreted.
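A small sketch of the effect, assuming simulated lognormal data and pandas' sample skewness (`Series.skew`):

```python
import numpy as np
import pandas as pd

# A lognormal variable is strongly right-skewed; its log is symmetric.
rng = np.random.default_rng(42)
raw = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=10_000))

skew_raw = raw.skew()          # large positive skew
skew_log = np.log(raw).skew()  # near zero after the transform

print(f"skew before log: {skew_raw:.2f}, after: {skew_log:.2f}")
```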
3. Project Specification
3.1 What You Will Build
A profiling tool that emits quality and distribution dashboards for tabular data.
3.2 Functional Requirements
- Produce center/spread/shape metrics for each numeric column.
- Emit visualization bundle (histogram, density, boxplot).
- Suggest transformations with reason codes.
- Flag outlier candidates with tunable rules.
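The first requirement can be sketched as a per-column metric table; the column and metric names below are illustrative, not a required schema:

```python
import pandas as pd

def profile_numeric(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column center/spread/shape table; metric names are illustrative."""
    rows = []
    for col in df.select_dtypes(include="number").columns:
        s = df[col].dropna()
        q1, q3 = s.quantile([0.25, 0.75])
        rows.append({
            "column": col,
            "mean": s.mean(),
            "median": s.median(),
            "std": s.std(),
            "iqr": q3 - q1,
            "skew": s.skew(),
            "kurtosis": s.kurt(),  # pandas reports excess kurtosis
        })
    return pd.DataFrame(rows).set_index("column")

demo = pd.DataFrame({"price": [10.0, 12.0, 11.0, 13.0, 200.0],
                     "label": list("abcde")})
prof = profile_numeric(demo)
print(prof)
```

Non-numeric columns are skipped here; a fuller tool would route them to categorical profiling instead.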
3.3 Non-Functional Requirements
- Works on >1M rows efficiently.
- Reproducible output bundle with run metadata.
3.4 Example Usage / Output
```
$ python descriptive_observatory.py --input data/retail_orders.csv
Columns profiled: 34
Skewness alerts: 7
Outlier flags: 18442
Dashboard exported: outputs/descriptive_observatory/index.html
```
3.5 Real World Outcome
You get a decision-ready profile showing where inference/modeling assumptions are likely to fail.
4. Solution Architecture
Input table -> profiler -> visualizer -> transform recommender -> report
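A minimal sketch of that stage chain, with toy stages and assumed artifact keys (`data`, `metrics`, `suggestions`):

```python
# Each stage reads and enriches a shared artifact dict; names are assumptions.
def profiler(artifact):
    data = artifact["data"]
    artifact["metrics"] = {"n": len(data), "mean": sum(data) / len(data)}
    return artifact

def recommender(artifact):
    # Toy rule: suggest a log transform when the mean sits far above the minimum.
    if artifact["metrics"]["mean"] > 2 * min(artifact["data"]):
        artifact["suggestions"].append("log-transform")
    return artifact

def run_pipeline(data, stages):
    artifact = {"data": data, "metrics": None, "figures": [], "suggestions": []}
    for stage in stages:
        artifact = stage(artifact)
    return artifact

result = run_pipeline([1.0, 2.0, 50.0], [profiler, recommender])
print(result["metrics"], result["suggestions"])
```

Keeping stages as plain functions over one artifact makes it easy to insert the visualizer between profiler and recommender later.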
5. Implementation Guide
5.1 Development Environment Setup
```
pip install pandas numpy matplotlib seaborn
```
5.2 Project Structure
```
P08/
  descriptive_observatory.py
  templates/
  outputs/
```
5.3 The Core Question You Are Answering
“What does this data actually look like before I run inference or modeling?”
5.4 Concepts You Must Understand First
- Center/spread metrics
- Quantiles and IQR
- Skewness and kurtosis
- Data transformation logic
5.5 Questions to Guide Your Design
- What threshold triggers warnings?
- How will you avoid over-flagging outliers?
- Which outputs matter most to stakeholders?
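One way to keep outlier rules tunable is to expose the Tukey fence multiplier `k` as a parameter; the data and default below are illustrative:

```python
import numpy as np

def tukey_flags(values, k=1.5):
    """Flag points outside [Q1 - k*IQR, Q3 + k*IQR]; k is the tunable knob."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [bool(v < lo or v > hi) for v in values]

data = [10, 11, 12, 11, 10, 12, 11, 16]
flags_strict = tukey_flags(data)        # k=1.5 flags the mild high value
flags_loose = tukey_flags(data, k=4.0)  # widening k stops flagging it
```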
5.6 Thinking Exercise
Compare interpretation of the same variable before and after log transform.
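One concrete interpretation shift to notice in this exercise: averaging on the log scale and transforming back yields the geometric mean, not the arithmetic mean.

```python
import numpy as np

# Invented values spanning two orders of magnitude.
values = np.array([1.0, 10.0, 100.0])

arith_mean = values.mean()                       # 37.0
geo_mean = float(np.exp(np.log(values).mean()))  # ~10.0
```

So "the average order value" means something different after a log transform, and the report should say which one it shows.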
5.7 The Interview Questions They’ll Ask
- Why median over mean in skewed data?
- What does kurtosis capture?
- Why can histograms mislead?
- How do you choose outlier rules?
- Why do transformations change interpretation?
5.8 Hints in Layers
- Hint 1: Build metric table first.
- Hint 2: Add plot generation.
- Hint 3: Add transformation suggestions.
- Hint 4: Add cohort-level profiling.
5.9 Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Descriptive stats | Think Stats | Ch. 2-4 |
| Visualization | Fundamentals of Data Visualization | selected |
| EDA mindset | Exploratory Data Analysis (Tukey) | core sections |
6. Testing Strategy
- Synthetic heavy-tail data tests.
- Known outlier insertion tests.
- Plot generation smoke tests.
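A known-outlier insertion test might look like this sketch, assuming a simple Tukey-fence detector (the planted values are arbitrary):

```python
import numpy as np

def tukey_outlier_mask(values, k=1.5):
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Plant extremes in clean data, then check the detector recovers them.
rng = np.random.default_rng(0)
clean = rng.normal(loc=50.0, scale=2.0, size=1_000)
planted = np.concatenate([clean, [500.0, -400.0]])

mask = tukey_outlier_mask(planted)
recovered = planted[mask]
```

Also count false positives on the clean portion to catch over-flagging, not just misses.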
7. Common Pitfalls & Debugging
| Pitfall | Symptom | Solution |
|---|---|---|
| One-size outlier rule | noisy false positives | cohort-based thresholds |
| Missing type checks | broken summaries | schema validation upfront |
| Over-smoothed densities | hidden multimodality | compare with histogram bins |
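The density pitfall can be reproduced with binning alone: a 2-bin histogram (analogous to an over-wide KDE bandwidth) merges two clear modes, while finer bins separate them. The data here are simulated:

```python
import numpy as np

# Two well-separated modes around -3 and +3.
rng = np.random.default_rng(1)
bimodal = np.concatenate([rng.normal(-3, 0.5, 500), rng.normal(3, 0.5, 500)])

coarse, _ = np.histogram(bimodal, bins=2)
fine, _ = np.histogram(bimodal, bins=40)

def peak_count(counts):
    """Count bins strictly higher than both neighbors."""
    return sum(
        counts[i] > counts[i - 1] and counts[i] > counts[i + 1]
        for i in range(1, len(counts) - 1)
    )

print("peaks coarse:", peak_count(coarse), "fine:", peak_count(fine))
```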
8. Extensions & Challenges
- Add robust scaling and winsorization options.
- Add automated data quality scorecard.
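A winsorization sketch, assuming simple quantile clipping (not SciPy's `scipy.stats.mstats.winsorize` API):

```python
import numpy as np

def winsorize(values, lower=0.05, upper=0.95):
    """Clip values to the given quantiles instead of dropping them."""
    lo, hi = np.quantile(values, [lower, upper])
    return np.clip(values, lo, hi)

data = np.array([1.0, 2.0, 3.0, 4.0, 100.0])
capped = winsorize(data)  # the extreme value is pulled toward the bulk
```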
9. Real-World Connections
- Data quality gates for product analytics.
- Monitoring KPI drift in operations.
10. Resources
- Think Stats
- Exploratory Data Analysis (John W. Tukey)
11. Self-Assessment Checklist
- I can justify metric choices under skew/outliers.
- I can explain one transformation tradeoff.
- My report is reproducible.
12. Submission / Completion Criteria
Minimum: profiling + plots + outlier flags.
Full: includes transformation rationale and segment-aware diagnostics.