Project 2: Titanic EDA Storybook

Build a reproducible exploratory analysis report that converts raw data into actionable modeling decisions.

Quick Reference

| Attribute | Value |
| --- | --- |
| Difficulty | Level 1: Beginner |
| Time Estimate | 6-10 hours |
| Main Programming Language | Python |
| Alternative Programming Languages | R, Julia |
| Coolness Level | Level 2: Practical but Forgettable |
| Business Potential | 1. Resume Gold |
| Prerequisites | pandas basics, chart literacy, descriptive statistics |
| Key Topics | EDA workflow, missingness, segmentation, leakage suspicion |

1. Learning Objectives

  1. Profile dataset quality with explicit checks.
  2. Produce question-driven visual analysis instead of chart dumping.
  3. Convert findings into preprocessing and modeling rules.
  4. Identify leakage and subgroup-risk candidates early.

2. All Theory Needed (Per-Concept Breakdown)

2.1 EDA as Risk Discovery

Fundamentals

EDA is a pre-model risk discovery process. It should identify data quality issues, distribution behavior, and subgroup disparities that could invalidate model conclusions.

Deep Dive into the concept

EDA begins with structure: row count, column definitions, missingness map, duplicate logic, and data type integrity. Without this baseline, every later chart is suspect. Next comes univariate analysis for each feature to understand central tendency, spread, skew, and rare values. Then bivariate analysis ties features to target outcomes through grouped summaries and plots.
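
That structural baseline can be sketched in a few lines of pandas. The column names below mirror the Titanic dataset, but the rows are made up so the snippet runs standalone:

```python
import pandas as pd

# Tiny illustrative stand-in for the Titanic table (real column names,
# invented rows) so the baseline checks below run on their own.
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Pclass": [3, 1, 3, 2],
    "Age": [22.0, 38.0, None, 35.0],
    "Cabin": [None, "C85", None, None],
})

rows, cols = df.shape                    # row and column counts
dtypes = df.dtypes.astype(str)           # data type integrity
null_map = df.isna().sum()               # missingness map
dup_count = int(df.duplicated().sum())   # exact-duplicate check

print(null_map.sort_values(ascending=False))
```

Every later chart inherits its credibility from these four checks, which is why they come first.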

The key mindset shift: each analysis output must map to a decision. A null-heavy column may require imputation or exclusion. A heavy skew may require transform. A subgroup disparity may force separate diagnostics and fairness monitoring. EDA is effective only when it constrains downstream modeling behavior.

Segment analysis is crucial. Global averages conceal subgroup failures. In this project, comparing survival by sex, class, and age segments demonstrates how aggregate narratives can miss structural effects.

Leakage clues can also appear in EDA. Features that are temporally downstream of the target event or direct proxies should be flagged immediately.

How this fits into the projects

This concept powers P02 and sets guardrails for P03-P06.

Definitions & key terms

  • Univariate analysis: one variable at a time.
  • Bivariate analysis: relationship between two variables.
  • Missingness mechanism: reason data is absent.
  • Leakage candidate: feature potentially invalid for prediction.

Mental model diagram

Profile -> Distributions -> Segment checks -> Risk notes -> Modeling rules

How it works

  1. Audit schema and quality.
  2. Ask hypothesis-driven questions.
  3. Compute grouped summaries.
  4. Visualize with context.
  5. Write actionable conclusions.

Failure modes:

  • plotting without decisions,
  • ignoring subgroup counts,
  • mistaking correlation for causation.

Minimal concrete example

Hypothesis: "Passenger class influences survival"
Group by class -> compute survival rates -> visualize bars -> annotate sample counts
Action: include class-aware feature engineering
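
The same hypothesis can be sketched in pandas (toy rows with the real column names; plotting and annotation are omitted here):

```python
import pandas as pd

# Toy rows carrying the real Titanic column names
df = pd.DataFrame({
    "Pclass": [1, 1, 2, 3, 3, 3],
    "Survived": [1, 1, 1, 0, 1, 0],
})

# Group by class -> survival rate plus sample count (always report n)
by_class = df.groupby("Pclass")["Survived"].agg(rate="mean", n="count")
```

Carrying the `n` column alongside the rate is what lets the final chart annotate sample counts instead of presenting bare percentages.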

Common misconceptions

  • "EDA is optional if the model is strong." Correction: hidden data risks can invalidate "strong" models.

Check-your-understanding questions

  1. Why should EDA findings map to explicit pipeline actions?
  2. How can subgroup analysis change model design?
  3. What is a leakage red flag in exploratory work?

Check-your-understanding answers

  1. Because EDA without action has no engineering value.
  2. It can require separate thresholds or extra features.
  3. Features unavailable at prediction time but highly correlated with target.

Real-world applications

  • Churn diagnostics
  • Incident triage analytics
  • Pricing and demand exploration

Where you’ll apply it

  • This project directly.
  • Also in P03/P04/P05 for feature and metric decisions.

References

  • pandas user guide: https://pandas.pydata.org/docs/user_guide/index.html
  • Seaborn tutorial: https://seaborn.pydata.org/tutorial.html

Key insights

EDA is a decision engine, not a visualization exercise.

Summary

Strong EDA turns raw records into constrained, testable modeling assumptions.

Homework/Exercises to practice the concept

  1. Produce a null-rate by segment table.
  2. Identify one likely leakage column and justify.
  3. Write three preprocessing rules from your findings.

Solutions to the homework/exercises

  1. Use grouped null summaries by ticket class/sex.
  2. Flag columns with post-event semantics.
  3. Example: class-aware age imputation, cabin sparsity handling, subgroup metric reporting.
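
Exercise 1 can be sketched like this (toy rows; the grouping column is dropped so it does not appear in its own null rates):

```python
import pandas as pd

df = pd.DataFrame({
    "Pclass": [1, 1, 3, 3],
    "Age": [38.0, None, 22.0, None],
    "Cabin": ["C85", None, None, None],
})

# Null rate per column within each ticket-class segment
null_by_class = df.drop(columns="Pclass").isna().groupby(df["Pclass"]).mean()
```

A table like `null_by_class` is what justifies class-aware imputation: if null rates differ sharply by segment, a single global fill value hides structure.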

3. Project Specification

3.1 What You Will Build

A markdown report plus plots that narrates key Titanic patterns and recommends preprocessing for downstream modeling.

3.2 Functional Requirements

  1. Generate data profile table.
  2. Produce at least five hypothesis-linked charts.
  3. Include subgroup analysis with sample counts.
  4. Document at least three actionable preprocessing rules.

3.3 Non-Functional Requirements

  • Performance: report generation under 60s for local dataset.
  • Reliability: same input yields same summaries and plots.
  • Usability: clear sectioned narrative for reviewers.

3.4 Example Usage / Output

$ ml-lab eda-report --dataset titanic.csv --out reports/titanic_eda.md
Generated: profile table, 6 plots, 8 findings, 4 preprocessing actions

3.5 Data Formats / Schemas / Protocols

  • Input: CSV with target Survived.
  • Output: markdown report + PNG figures.

3.6 Edge Cases

  • Missing target column should fail.
  • Zero-row filtered subsets should be reported gracefully.
  • High-cardinality text columns should be capped in chart summaries.
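
A hedged sketch of the first edge case, mirroring the documented error message and exit code (the function name is illustrative, not an existing ml-lab API):

```python
import pandas as pd

def validate_target(df: pd.DataFrame, target: str = "Survived") -> None:
    # Fail fast per the spec: a missing target column means exit code 2
    if target not in df.columns:
        print(f'ERROR: target column "{target}" not found')
        raise SystemExit(2)

validate_target(pd.DataFrame({"Survived": [0, 1]}))  # valid schema: no error
```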

3.7 Real World Outcome

3.7.1 How to Run (Copy/Paste)

ml-lab eda-report --dataset data/titanic.csv --out reports/titanic_eda.md

3.7.2 Golden Path Demo (Deterministic)

$ ml-lab eda-report --dataset fixtures/titanic_fixed.csv --out reports/fixed.md
Rows: 891
Findings: 8
Exit code: 0

3.7.3 If CLI: exact terminal transcript

$ ml-lab eda-report --dataset fixtures/titanic_fixed.csv --out reports/fixed.md
Rows: 891 | Columns: 12
Missing: age=177, cabin=687, embarked=2
Saved report: reports/fixed.md
Saved figures: reports/figures/
Exit code: 0

$ ml-lab eda-report --dataset fixtures/bad_schema.csv --out reports/bad.md
ERROR: target column "Survived" not found
Exit code: 2

4. Solution Architecture

4.1 High-Level Design

Loader -> Profiler -> Hypothesis Engine -> Plot Generator -> Narrative Writer

4.2 Key Components

| Component | Responsibility | Key Decisions |
| --- | --- | --- |
| Profiler | quality stats and schema checks | fail-fast vs warn |
| Hypothesis Engine | grouped summaries | enforce question templates |
| Plot Generator | produce stable visuals | fixed style and order |
| Narrative Writer | findings -> actions | each finding needs an action |

4.3 Data Structures (No Full Code)

profile_table: columns x quality metrics
finding_item: {question, evidence, action, risk}
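
The finding_item record could be carried as a small dataclass (a sketch; the field names come straight from the spec, the example values are illustrative):

```python
from dataclasses import dataclass, asdict

@dataclass
class FindingItem:
    question: str
    evidence: str
    action: str   # every finding must map to a pipeline action
    risk: str

item = FindingItem(
    question="Does passenger class influence survival?",
    evidence="survival rate by Pclass with sample counts",
    action="class-aware feature engineering",
    risk="small subgroup counts in 2nd class",
)
```

Making `action` a required field enforces the "each finding needs an action" rule structurally rather than by convention.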

4.4 Algorithm Overview

  1. Load and validate schema.
  2. Build quality profile.
  3. Execute hypothesis queries.
  4. Generate linked figures.
  5. Write final report.
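
The five steps can be sketched end to end (the function is illustrative, not the real CLI; figure generation and report writing are stubbed out):

```python
import pandas as pd

def build_report(df: pd.DataFrame) -> dict:
    # 1. Validate schema (loading itself happens upstream)
    if "Survived" not in df.columns:
        raise ValueError('target column "Survived" not found')
    # 2. Build quality profile
    profile = pd.DataFrame({"nulls": df.isna().sum(),
                            "dtype": df.dtypes.astype(str)})
    # 3. Execute one hypothesis query: survival rate by class, with counts
    by_class = df.groupby("Pclass")["Survived"].agg(rate="mean", n="count")
    # 4./5. Figure generation and narrative writing omitted in this sketch
    return {"profile": profile, "by_class": by_class}

report = build_report(pd.DataFrame({"Survived": [0, 1, 1],
                                    "Pclass": [3, 1, 1]}))
```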

5. Implementation Guide

5.1 Development Environment Setup

$ pip install pandas seaborn matplotlib
$ mkdir -p data reports figures src

5.2 Project Structure

eda-project/
|-- data/
|-- reports/
|-- figures/
`-- src/

5.3 The Core Question You’re Answering

What should change in modeling strategy based on what data actually contains?

5.4 Concepts You Must Understand First

  • Null handling
  • Grouped summaries
  • Segment diagnostics

5.5 Questions to Guide Your Design

  • Which findings are actionable?
  • Which charts are redundant?
  • What confidence caveats are needed?

5.6 Thinking Exercise

Write a one-page "risks before modeling" memo from the raw profile only.

5.7 The Interview Questions They’ll Ask

  1. What are your first five EDA checks?
  2. How do you detect leakage candidates in tabular data?
  3. How do you prioritize EDA outputs under time pressure?

5.8 Hints in Layers

  • Start with profile table.
  • Keep one chart per question.
  • Always attach action to finding.

5.9 Books That Will Help

| Topic | Book | Chapter |
| --- | --- | --- |
| pandas workflow | Python for Data Analysis | Ch. 6-8 |
| statistical summaries | Practical Statistics for Data Scientists | Ch. 1-3 |

5.10 Implementation Phases

  • Phase 1: Profile builder.
  • Phase 2: Hypothesis + plots.
  • Phase 3: Narrative output and action mapping.

5.11 Key Implementation Decisions

| Decision | Options | Recommendation | Rationale |
| --- | --- | --- | --- |
| Report format | notebook, markdown | markdown | review-friendly artifact |
| Plot count | many, targeted | targeted | signal over noise |

6. Testing Strategy

6.1 Test Categories

| Category | Purpose | Examples |
| --- | --- | --- |
| Unit | profile accuracy | null counts, cardinality |
| Integration | report generation | full command run |
| Edge | schema failures | missing target column |

6.2 Critical Test Cases

  1. Fixed dataset generates expected profile values.
  2. Missing target causes explicit non-zero exit.
  3. Report includes action items for each finding.
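
Test case 1 might look like this as a unit test (the helper and fixture values are illustrative, standing in for the real Profiler and fixture file):

```python
import pandas as pd

def null_counts(df: pd.DataFrame) -> dict:
    # Profiler helper under test: per-column null counts
    return {c: int(n) for c, n in df.isna().sum().items()}

def test_fixed_profile_values():
    fixture = pd.DataFrame({"Age": [22.0, None], "Cabin": [None, None]})
    assert null_counts(fixture) == {"Age": 1, "Cabin": 2}

test_fixed_profile_values()
```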

6.3 Test Data

  • titanic_fixed.csv
  • bad_schema.csv

7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

| Pitfall | Symptom | Solution |
| --- | --- | --- |
| Too many charts | report unreadable | enforce question-driven plotting |
| No action mapping | findings don't help modeling | require an action field per finding |
| Ignoring sample counts | unstable conclusions | annotate counts on all segment summaries |

7.2 Debugging Strategies

  • Diff profile outputs between runs.
  • Validate grouped summaries with manual spot checks.

7.3 Performance Traps

Rendering high-cardinality plots unnecessarily; pre-filter top categories.
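
Pre-filtering can be as simple as capping plot input to the k most frequent categories (a sketch with toy values):

```python
import pandas as pd

# Toy high-cardinality text column
cabins = pd.Series(["C85", "E46", "C85", "B28", "C85", "E46"])

# Keep only the k most frequent categories before plotting
top = cabins.value_counts().head(2)
```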


8. Extensions & Challenges

8.1 Beginner Extensions

  • Add automatic data dictionary extraction.
  • Add null heatmap by segment.

8.2 Intermediate Extensions

  • Add leakage suspicion scoring rubric.
  • Add drift comparison against a second snapshot.

8.3 Advanced Extensions

  • Auto-generate model-ready preprocessing config from findings.

9. Real-World Connections

9.1 Industry Applications

  • KPI anomaly diagnostics
  • Customer behavior analysis

9.2 Industry Tools

  • ydata-profiling
  • Great Expectations (for data checks)

9.3 Interview Relevance

Builds explainability around data quality and risk-aware modeling.


10. Resources

10.1 Essential Reading

  • pandas docs
  • seaborn docs

10.2 Video Resources

  • Exploratory data analysis lectures (university/open courses)

10.3 Tools & Documentation

  • https://pandas.pydata.org/
  • https://seaborn.pydata.org/

11. Self-Assessment Checklist

11.1 Understanding

  • I can explain each report finding and its modeling implication.

11.2 Implementation

  • Report regenerates deterministically from same input.

11.3 Growth

  • I can prioritize top three risks under a 30-minute review constraint.

12. Submission / Completion Criteria

Minimum Viable Completion:

  • Profile + five hypothesis-driven findings.

Full Completion:

  • Segment analysis and leakage notes included.

Excellence (Going Above & Beyond):

  • Structured action plan directly consumable by downstream modeling project.