Project 2: Titanic EDA Storybook

Build a reproducible exploratory analysis report that converts raw data into actionable modeling decisions.

Quick Reference

| Attribute | Value |
| --- | --- |
| Difficulty | Level 1: Beginner |
| Time Estimate | 6-10 hours |
| Main Programming Language | Python |
| Alternative Programming Languages | R, Julia |
| Coolness Level | Level 2: Practical but Forgettable |
| Business Potential | 1. Resume Gold |
| Prerequisites | pandas basics, chart literacy, descriptive statistics |
| Key Topics | EDA workflow, missingness, segmentation, leakage suspicion |

1. Learning Objectives

  1. Profile dataset quality with explicit checks.
  2. Produce question-driven visual analysis instead of chart dumping.
  3. Convert findings into preprocessing and modeling rules.
  4. Identify leakage and subgroup-risk candidates early.

2. All Theory Needed (Per-Concept Breakdown)

2.1 EDA as Risk Discovery

Fundamentals

EDA is a pre-model risk discovery process. It should identify data quality issues, distribution behavior, and subgroup disparities that could invalidate model conclusions.

Deep Dive into the concept

EDA begins with structure: row count, column definitions, missingness map, duplicate logic, and data type integrity. Without this baseline, every later chart is suspect. Next comes univariate analysis for each feature to understand central tendency, spread, skew, and rare values. Then bivariate analysis ties features to target outcomes through grouped summaries and plots.
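
That structural baseline can be sketched in a few lines of pandas. The column names below mirror the Titanic dataset, but the rows are made up so the snippet runs standalone:

```python
import pandas as pd

# Tiny illustrative stand-in for the Titanic table (real column names,
# invented rows) so the baseline checks below run on their own.
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Pclass": [3, 1, 3, 2],
    "Age": [22.0, 38.0, None, 35.0],
    "Cabin": [None, "C85", None, None],
})

rows, cols = df.shape                    # row and column counts
dtypes = df.dtypes.astype(str)           # data type integrity
null_map = df.isna().sum()               # missingness map
dup_count = int(df.duplicated().sum())   # exact-duplicate check

print(null_map.sort_values(ascending=False))
```

Every later chart inherits its credibility from these four checks, which is why they come first.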

The key mindset shift: each analysis output must map to a decision. A null-heavy column may require imputation or exclusion. A heavy skew may require transform. A subgroup disparity may force separate diagnostics and fairness monitoring. EDA is effective only when it constrains downstream modeling behavior.

Segment analysis is crucial. Global averages conceal subgroup failures. In this project, comparing survival by sex, class, and age segments demonstrates how aggregate narratives can miss structural effects.

Leakage clues can also appear in EDA. Features that are temporally downstream of the target event or direct proxies should be flagged immediately.

How this fits into the projects

This concept powers P02 and sets guardrails for P03-P06.

Definitions & key terms

  • Univariate analysis: one variable at a time.
  • Bivariate analysis: relationship between two variables.
  • Missingness mechanism: reason data is absent.
  • Leakage candidate: feature potentially invalid for prediction.

Mental model diagram

Profile -> Distributions -> Segment checks -> Risk notes -> Modeling rules

How it works

  1. Audit schema and quality.
  2. Ask hypothesis-driven questions.
  3. Compute grouped summaries.
  4. Visualize with context.
  5. Write actionable conclusions.

Failure modes:

  • plotting without decisions,
  • ignoring subgroup counts,
  • mistaking correlation for causation.

Minimal concrete example

Hypothesis: "Passenger class influences survival"
Group by class -> compute survival rates -> visualize bars -> annotate sample counts
Action: include class-aware feature engineering
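
The same hypothesis can be sketched in pandas (toy rows with the real column names; plotting and annotation are omitted here):

```python
import pandas as pd

# Toy rows carrying the real Titanic column names
df = pd.DataFrame({
    "Pclass": [1, 1, 2, 3, 3, 3],
    "Survived": [1, 1, 1, 0, 1, 0],
})

# Group by class -> survival rate plus sample count (always report n)
by_class = df.groupby("Pclass")["Survived"].agg(rate="mean", n="count")
```

Carrying the `n` column alongside the rate is what lets the final chart annotate sample counts instead of presenting bare percentages.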

Common misconceptions

  • "EDA is optional if the model is strong." Correction: hidden data risks can invalidate "strong" models.

Check-your-understanding questions

  1. Why should EDA findings map to explicit pipeline actions?
  2. How can subgroup analysis change model design?
  3. What is a leakage red flag in exploratory work?

Check-your-understanding answers

  1. Because EDA without action has no engineering value.
  2. It can require separate thresholds or extra features.
  3. Features unavailable at prediction time but highly correlated with target.

Real-world applications

  • Churn diagnostics
  • Incident triage analytics
  • Pricing and demand exploration

Where you’ll apply it

  • This project directly.
  • Also in P03/P04/P05 for feature and metric decisions.

References

  • pandas user guide: https://pandas.pydata.org/docs/user_guide/index.html
  • Seaborn tutorial: https://seaborn.pydata.org/tutorial.html

Key insights

EDA is a decision engine, not a visualization exercise.

Summary

Strong EDA turns raw records into constrained, testable modeling assumptions.

Homework/Exercises to practice the concept

  1. Produce a null-rate by segment table.
  2. Identify one likely leakage column and justify.
  3. Write three preprocessing rules from your findings.

Solutions to the homework/exercises

  1. Use grouped null summaries by ticket class/sex.
  2. Flag columns with post-event semantics.
  3. Example: class-aware age imputation, cabin sparsity handling, subgroup metric reporting.
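
Exercise 1 can be sketched like this (toy rows; the grouping column is dropped so it does not appear in its own null rates):

```python
import pandas as pd

df = pd.DataFrame({
    "Pclass": [1, 1, 3, 3],
    "Age": [38.0, None, 22.0, None],
    "Cabin": ["C85", None, None, None],
})

# Null rate per column within each ticket-class segment
null_by_class = df.drop(columns="Pclass").isna().groupby(df["Pclass"]).mean()
```

A table like `null_by_class` is what justifies class-aware imputation: if null rates differ sharply by segment, a single global fill value hides structure.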

3. Project Specification

3.1 What You Will Build

A markdown report plus plots that narrates key Titanic patterns and recommends preprocessing for downstream modeling.

3.2 Functional Requirements

  1. Generate data profile table.
  2. Produce at least five hypothesis-linked charts.
  3. Include subgroup analysis with sample counts.
  4. Document at least three actionable preprocessing rules.

3.3 Non-Functional Requirements

  • Performance: report generation under 60s for local dataset.
  • Reliability: same input yields same summaries and plots.
  • Usability: clear sectioned narrative for reviewers.

3.4 Example Usage / Output

$ ml-lab eda-report --dataset titanic.csv --out reports/titanic_eda.md
Generated: profile table, 6 plots, 8 findings, 4 preprocessing actions

3.5 Data Formats / Schemas / Protocols

  • Input: CSV with target Survived.
  • Output: markdown report + PNG figures.

3.6 Edge Cases

  • Missing target column should fail.
  • Zero-row filtered subsets should be reported gracefully.
  • High-cardinality text columns should be capped in chart summaries.
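
A hedged sketch of the first edge case, mirroring the documented error message and exit code (the function name is illustrative, not an existing ml-lab API):

```python
import pandas as pd

def validate_target(df: pd.DataFrame, target: str = "Survived") -> None:
    # Fail fast per the spec: a missing target column means exit code 2
    if target not in df.columns:
        print(f'ERROR: target column "{target}" not found')
        raise SystemExit(2)

validate_target(pd.DataFrame({"Survived": [0, 1]}))  # valid schema: no error
```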

3.7 Real World Outcome

3.7.1 How to Run (Copy/Paste)

ml-lab eda-report --dataset data/titanic.csv --out reports/titanic_eda.md

3.7.2 Golden Path Demo (Deterministic)

$ ml-lab eda-report --dataset fixtures/titanic_fixed.csv --out reports/fixed.md
Rows: 891
Findings: 8
Exit code: 0

3.7.3 If CLI: exact terminal transcript

$ ml-lab eda-report --dataset fixtures/titanic_fixed.csv --out reports/fixed.md
Rows: 891 | Columns: 12
Missing: age=177, cabin=687, embarked=2
Saved report: reports/fixed.md
Saved figures: reports/figures/
Exit code: 0

$ ml-lab eda-report --dataset fixtures/bad_schema.csv --out reports/bad.md
ERROR: target column "Survived" not found
Exit code: 2

4. Solution Architecture

4.1 High-Level Design

Loader -> Profiler -> Hypothesis Engine -> Plot Generator -> Narrative Writer

4.2 Key Components

| Component | Responsibility | Key Decisions |
| --- | --- | --- |
| Profiler | quality stats and schema checks | fail-fast vs warn |
| Hypothesis Engine | grouped summaries | enforce question templates |
| Plot Generator | produce stable visuals | fixed style and order |
| Narrative Writer | findings -> actions | each finding needs an action |

4.3 Data Structures (No Full Code)

profile_table: columns x quality metrics
finding_item: {question, evidence, action, risk}
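
The finding_item record could be carried as a small dataclass (a sketch; the field names come straight from the spec, the example values are illustrative):

```python
from dataclasses import dataclass, asdict

@dataclass
class FindingItem:
    question: str
    evidence: str
    action: str   # every finding must map to a pipeline action
    risk: str

item = FindingItem(
    question="Does passenger class influence survival?",
    evidence="survival rate by Pclass with sample counts",
    action="class-aware feature engineering",
    risk="small subgroup counts in 2nd class",
)
```

Making `action` a required field enforces the "each finding needs an action" rule structurally rather than by convention.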

4.4 Algorithm Overview

  1. Load and validate schema.
  2. Build quality profile.
  3. Execute hypothesis queries.
  4. Generate linked figures.
  5. Write final report.
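
The five steps can be sketched end to end (the function is illustrative, not the real CLI; figure generation and report writing are stubbed out):

```python
import pandas as pd

def build_report(df: pd.DataFrame) -> dict:
    # 1. Validate schema (loading itself happens upstream)
    if "Survived" not in df.columns:
        raise ValueError('target column "Survived" not found')
    # 2. Build quality profile
    profile = pd.DataFrame({"nulls": df.isna().sum(),
                            "dtype": df.dtypes.astype(str)})
    # 3. Execute one hypothesis query: survival rate by class, with counts
    by_class = df.groupby("Pclass")["Survived"].agg(rate="mean", n="count")
    # 4./5. Figure generation and narrative writing omitted in this sketch
    return {"profile": profile, "by_class": by_class}

report = build_report(pd.DataFrame({"Survived": [0, 1, 1],
                                    "Pclass": [3, 1, 1]}))
```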

5. Implementation Guide

5.1 Development Environment Setup

$ pip install pandas seaborn matplotlib
$ mkdir -p data reports figures src

5.2 Project Structure

eda-project/
|-- data/
|-- reports/
|-- figures/
`-- src/

5.3 The Core Question You’re Answering

What should change in modeling strategy based on what data actually contains?

5.4 Concepts You Must Understand First

  • Null handling
  • Grouped summaries
  • Segment diagnostics

5.5 Questions to Guide Your Design

  • Which findings are actionable?
  • Which charts are redundant?
  • What confidence caveats are needed?

5.6 Thinking Exercise

Write a one-page "risks before modeling" memo from the raw profile only.

5.7 The Interview Questions They’ll Ask

  1. What are your first five EDA checks?
  2. How do you detect leakage candidates in tabular data?
  3. How do you prioritize EDA outputs under time pressure?

5.8 Hints in Layers

  • Start with profile table.
  • Keep one chart per question.
  • Always attach action to finding.

5.9 Books That Will Help

| Topic | Book | Chapter |
| --- | --- | --- |
| pandas workflow | Python for Data Analysis | Ch. 6-8 |
| statistical summaries | Practical Statistics for Data Scientists | Ch. 1-3 |

5.10 Implementation Phases

  • Phase 1: Profile builder.
  • Phase 2: Hypothesis + plots.
  • Phase 3: Narrative output and action mapping.

5.11 Key Implementation Decisions

| Decision | Options | Recommendation | Rationale |
| --- | --- | --- | --- |
| Report format | notebook, markdown | markdown | review-friendly artifact |
| Plot count | many, targeted | targeted | signal over noise |

6. Testing Strategy

6.1 Test Categories

| Category | Purpose | Examples |
| --- | --- | --- |
| Unit | profile accuracy | null counts, cardinality |
| Integration | report generation | full command run |
| Edge | schema failures | missing target column |

6.2 Critical Test Cases

  1. Fixed dataset generates expected profile values.
  2. Missing target causes explicit non-zero exit.
  3. Report includes action items for each finding.
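
Test case 1 might look like this as a unit test (the helper and fixture values are illustrative, standing in for the real Profiler and fixture file):

```python
import pandas as pd

def null_counts(df: pd.DataFrame) -> dict:
    # Profiler helper under test: per-column null counts
    return {c: int(n) for c, n in df.isna().sum().items()}

def test_fixed_profile_values():
    fixture = pd.DataFrame({"Age": [22.0, None], "Cabin": [None, None]})
    assert null_counts(fixture) == {"Age": 1, "Cabin": 2}

test_fixed_profile_values()
```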

6.3 Test Data

  • titanic_fixed.csv
  • bad_schema.csv

7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

| Pitfall | Symptom | Solution |
| --- | --- | --- |
| Too many charts | report unreadable | enforce question-driven plotting |
| No action mapping | findings don't help modeling | require an action field per finding |
| Ignoring sample counts | unstable conclusions | annotate counts on all segment summaries |

7.2 Debugging Strategies

  • Diff profile outputs between runs.
  • Validate grouped summaries with manual spot checks.

7.3 Performance Traps

Rendering high-cardinality plots unnecessarily; pre-filter top categories.
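
Pre-filtering can be as simple as capping plot input to the k most frequent categories (a sketch with toy values):

```python
import pandas as pd

# Toy high-cardinality text column
cabins = pd.Series(["C85", "E46", "C85", "B28", "C85", "E46"])

# Keep only the k most frequent categories before plotting
top = cabins.value_counts().head(2)
```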


8. Extensions & Challenges

8.1 Beginner Extensions

  • Add automatic data dictionary extraction.
  • Add null heatmap by segment.

8.2 Intermediate Extensions

  • Add leakage suspicion scoring rubric.
  • Add drift comparison against a second snapshot.

8.3 Advanced Extensions

  • Auto-generate model-ready preprocessing config from findings.

9. Real-World Connections

9.1 Industry Applications

  • KPI anomaly diagnostics
  • Customer behavior analysis

9.2 Industry Tools

  • ydata-profiling
  • Great Expectations (for data checks)

9.3 Interview Relevance

Builds explainability around data quality and risk-aware modeling.


10. Resources

10.1 Essential Reading

  • pandas docs
  • seaborn docs

10.2 Video Resources

  • Exploratory data analysis lectures (university/open courses)

10.3 Tools & Documentation

  • https://pandas.pydata.org/
  • https://seaborn.pydata.org/

11. Self-Assessment Checklist

11.1 Understanding

  • I can explain each report finding and its modeling implication.

11.2 Implementation

  • Report regenerates deterministically from same input.

11.3 Growth

  • I can prioritize top three risks under a 30-minute review constraint.

12. Submission / Completion Criteria

Minimum Viable Completion:

  • Profile + five hypothesis-driven findings.

Full Completion:

  • Segment analysis and leakage notes included.

Excellence (Going Above & Beyond):

  • Structured action plan directly consumable by downstream modeling project.