Project 2: Titanic EDA Storybook
Build a reproducible exploratory analysis report that converts raw data into actionable modeling decisions.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Level 1: Beginner |
| Time Estimate | 6-10 hours |
| Main Programming Language | Python |
| Alternative Programming Languages | R, Julia |
| Coolness Level | Level 2: Practical but Forgettable |
| Business Potential | 1. Resume Gold |
| Prerequisites | pandas basics, chart literacy, descriptive statistics |
| Key Topics | EDA workflow, missingness, segmentation, leakage suspicion |
1. Learning Objectives
- Profile dataset quality with explicit checks.
- Produce question-driven visual analysis instead of chart dumping.
- Convert findings into preprocessing and modeling rules.
- Identify leakage and subgroup-risk candidates early.
2. All Theory Needed (Per-Concept Breakdown)
2.1 EDA as Risk Discovery
Fundamentals
EDA is a pre-model risk discovery process. It should identify data quality issues, distribution behavior, and subgroup disparities that could invalidate model conclusions.
Deep Dive into the concept
EDA begins with structure: row count, column definitions, missingness map, duplicate logic, and data type integrity. Without this baseline, every later chart is suspect. Next comes univariate analysis for each feature to understand central tendency, spread, skew, and rare values. Then bivariate analysis ties features to target outcomes through grouped summaries and plots.
The key mindset shift: each analysis output must map to a decision. A null-heavy column may require imputation or exclusion. A heavy skew may require transform. A subgroup disparity may force separate diagnostics and fairness monitoring. EDA is effective only when it constrains downstream modeling behavior.
Segment analysis is crucial. Global averages conceal subgroup failures. In this project, comparing survival by sex, class, and age segments demonstrates how aggregate narratives can miss structural effects.
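A minimal sketch of that segment check, using a tiny stand-in frame (column names follow the standard Titanic CSV; the data values here are illustrative, not the real dataset):

```python
import pandas as pd

# Tiny illustrative stand-in for the Titanic data.
df = pd.DataFrame({
    "Sex": ["female", "female", "male", "male", "male", "female"],
    "Pclass": [1, 3, 1, 3, 3, 1],
    "Survived": [1, 0, 0, 0, 1, 1],
})

# Grouped survival rate WITH sample counts: the global mean (0.5 here)
# conceals exactly the subgroup structure this table makes explicit.
segments = (
    df.groupby(["Sex", "Pclass"])["Survived"]
      .agg(rate="mean", n="size")
      .reset_index()
)
print(segments)
```

Carrying the `n` column into every chart annotation is what keeps small-sample segments from being over-interpreted.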
Leakage clues can also appear in EDA. Features that are temporally downstream of the target event or direct proxies should be flagged immediately.
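One crude screen for such proxies is absolute correlation with the target. A sketch, where `RefundIssued` is a hypothetical post-event column invented here to illustrate the flag:

```python
import pandas as pd

# Hypothetical frame: "RefundIssued" is downstream of the outcome, so its
# near-perfect correlation with the target is a leakage red flag.
df = pd.DataFrame({
    "Survived":     [1, 0, 1, 0, 1, 0],
    "Fare":         [71.3, 7.9, 53.1, 8.1, 26.0, 7.8],
    "RefundIssued": [0, 1, 0, 1, 0, 1],
})

# Crude screen: flag any feature whose |correlation| with the target
# exceeds a suspicion threshold, then review those columns manually.
corr = df.drop(columns="Survived").corrwith(df["Survived"]).abs()
suspects = corr[corr > 0.95].index.tolist()
print(suspects)
```

The threshold is a judgment call; the screen only nominates candidates, and the final leakage call still rests on the column's semantics.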
How this fits into the project series
This concept powers P02 and sets guardrails for P03-P06.
Definitions & key terms
- Univariate analysis: one variable at a time.
- Bivariate analysis: relationship between two variables.
- Missingness mechanism: reason data is absent.
- Leakage candidate: feature potentially invalid for prediction.
Mental model diagram
Profile -> Distributions -> Segment checks -> Risk notes -> Modeling rules
How it works
- Audit schema and quality.
- Ask hypothesis-driven questions.
- Compute grouped summaries.
- Visualize with context.
- Write actionable conclusions.
Failure modes:
- plotting without decisions
- ignoring subgroup counts
- mistaking correlation for causation
Minimal concrete example
Hypothesis: "Passenger class influences survival"
Group by class -> compute survival rates -> visualize bars -> annotate sample counts
Action: include class-aware feature engineering
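The example above can be sketched end to end in a few lines (illustrative numbers, not the real dataset; the bar chart would follow with e.g. `seaborn.barplot`, but the decision comes from the table):

```python
import pandas as pd

# Illustrative data for the hypothesis "passenger class influences survival".
df = pd.DataFrame({
    "Pclass":   [1, 1, 2, 2, 3, 3, 3, 3],
    "Survived": [1, 1, 1, 0, 0, 0, 1, 0],
})

# Evidence: survival rate per class, always paired with sample counts
# so that the bar chart can annotate n above each bar.
by_class = df.groupby("Pclass")["Survived"].agg(rate="mean", n="size")
print(by_class)

# Action (recorded with the finding, not left implicit):
action = "include class-aware feature engineering"
```

Note that the finding and its action travel together; a rate table without a recorded action is exactly the "plotting without decisions" failure mode.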
Common misconceptions
- ”"”EDA is optional if model is strong.””” Correction: hidden data risks can invalidate “"”strong””” models.
Check-your-understanding questions
- Why should EDA findings map to explicit pipeline actions?
- How can subgroup analysis change model design?
- What is a leakage red flag in exploratory work?
Check-your-understanding answers
- Because EDA without action has no engineering value.
- It can require separate thresholds or extra features.
- Features unavailable at prediction time but highly correlated with target.
Real-world applications
- Churn diagnostics
- Incident triage analytics
- Pricing and demand exploration
Where you’ll apply it
- This project directly.
- Also in P03/P04/P05 for feature and metric decisions.
References
- pandas user guide: https://pandas.pydata.org/docs/user_guide/index.html
- Seaborn tutorial: https://seaborn.pydata.org/tutorial.html
Key insights
EDA is a decision engine, not a visualization exercise.
Summary
Strong EDA turns raw records into constrained, testable modeling assumptions.
Homework/Exercises to practice the concept
- Produce a null-rate by segment table.
- Identify one likely leakage column and justify.
- Write three preprocessing rules from your findings.
Solutions to the homework/exercises
- Use grouped null summaries by ticket class/sex.
- Flag columns with post-event semantics.
- Example: class-aware age imputation, cabin sparsity handling, subgroup metric reporting.
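A minimal sketch of the class-aware age imputation mentioned above, using an illustrative frame (the grouped-`transform` pattern is the point, not these particular values):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Age":    [38.0, np.nan, 50.0, 22.0, np.nan, 28.0],
})

# Class-aware imputation: fill missing Age with the median of the same
# class, instead of a global median that blurs class structure.
df["Age"] = df.groupby("Pclass")["Age"].transform(lambda s: s.fillna(s.median()))
print(df)
```

Here the first-class gap gets the first-class median (44.0) rather than the global one, which is the whole argument for making imputation segment-aware.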
3. Project Specification
3.1 What You Will Build
A markdown report plus plots that narrates key Titanic patterns and recommends preprocessing rules for downstream modeling.
3.2 Functional Requirements
- Generate data profile table.
- Produce at least five hypothesis-linked charts.
- Include subgroup analysis with sample counts.
- Document at least three actionable preprocessing rules.
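The data profile table from the first requirement can be sketched as one row per column with dtype, null count, null rate, and cardinality (illustrative frame; the exact metric set is a design choice):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Survived": [1, 0, 1, 0],
    "Age":   [22.0, np.nan, 35.0, 54.0],
    "Cabin": ["C85", None, None, "E46"],
})

# One row per column: dtype, null count, null rate, cardinality.
profile = pd.DataFrame({
    "dtype": df.dtypes.astype(str),
    "nulls": df.isna().sum(),
    "null_rate": df.isna().mean(),
    "n_unique": df.nunique(),
})
print(profile)
```

`profile.to_markdown()` (if `tabulate` is installed) renders this directly into the report.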
3.3 Non-Functional Requirements
- Performance: report generation under 60s for local dataset.
- Reliability: same input yields same summaries and plots.
- Usability: clear sectioned narrative for reviewers.
3.4 Example Usage / Output
$ ml-lab eda-report --dataset titanic.csv --out reports/titanic_eda.md
Generated: profile table, 6 plots, 8 findings, 4 preprocessing actions
3.5 Data Formats / Schemas / Protocols
- Input: CSV with target column Survived.
- Output: markdown report + PNG figures.
3.6 Edge Cases
- Missing target column should fail.
- Zero-row filtered subsets should be reported gracefully.
- High-cardinality text columns should be capped in chart summaries.
3.7 Real World Outcome
3.7.1 How to Run (Copy/Paste)
ml-lab eda-report --dataset data/titanic.csv --out reports/titanic_eda.md
3.7.2 Golden Path Demo (Deterministic)
$ ml-lab eda-report --dataset fixtures/titanic_fixed.csv --out reports/fixed.md
Rows: 891
Findings: 8
Exit code: 0
3.7.3 If CLI: exact terminal transcript
$ ml-lab eda-report --dataset fixtures/titanic_fixed.csv --out reports/fixed.md
Rows: 891 | Columns: 12
Missing: age=177, cabin=687, embarked=2
Saved report: reports/fixed.md
Saved figures: reports/figures/
Exit code: 0
$ ml-lab eda-report --dataset fixtures/bad_schema.csv --out reports/bad.md
ERROR: target column "Survived" not found
Exit code: 2
4. Solution Architecture
4.1 High-Level Design
Loader -> Profiler -> Hypothesis Engine -> Plot Generator -> Narrative Writer
4.2 Key Components
| Component | Responsibility | Key Decisions |
|---|---|---|
| Profiler | quality stats and schema checks | fail-fast vs warn |
| Hypothesis Engine | grouped summaries | enforce question templates |
| Plot Generator | produce stable visuals | fixed style and order |
| Narrative Writer | findings -> actions | each finding needs action |
4.3 Data Structures (No Full Code)
profile_table: columns x quality metrics
finding_item: {question, evidence, action, risk}
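The `finding_item` record above can be sketched as a dataclass; the "each finding needs action" rule from the component table is enforced at construction (class and field names here mirror 4.3 but are otherwise an assumption):

```python
from dataclasses import asdict, dataclass

@dataclass(frozen=True)
class Finding:
    question: str
    evidence: str
    action: str
    risk: str

    def __post_init__(self):
        # Narrative Writer invariant: a finding without an action is rejected.
        if not self.action.strip():
            raise ValueError("every finding needs an action")

finding = Finding(
    question="Does passenger class influence survival?",
    evidence="1st class survival 0.63 vs 3rd class 0.24 (n=216/491)",
    action="add class-aware features and report metrics per class",
    risk="class correlates with fare; watch collinearity",
)
print(asdict(finding))
```

Making the record frozen keeps findings stable once written, which helps the determinism requirement in 3.3.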
4.4 Algorithm Overview
- Load and validate schema.
- Build quality profile.
- Execute hypothesis queries.
- Generate linked figures.
- Write final report.
5. Implementation Guide
5.1 Development Environment Setup
# install pandas, seaborn, matplotlib; create report directories
pip install pandas seaborn matplotlib
mkdir -p reports/figures
5.2 Project Structure
eda-project/
|-- data/
|-- reports/
|-- figures/
`-- src/
5.3 The Core Question You’re Answering
What should change in modeling strategy based on what data actually contains?
5.4 Concepts You Must Understand First
- Null handling
- Grouped summaries
- Segment diagnostics
5.5 Questions to Guide Your Design
- Which findings are actionable?
- Which charts are redundant?
- What confidence caveats are needed?
5.6 Thinking Exercise
Write a one-page "risks before modeling" memo from the raw profile only.
5.7 The Interview Questions They’ll Ask
- What are your first five EDA checks?
- How do you detect leakage candidates in tabular data?
- How do you prioritize EDA outputs under time pressure?
5.8 Hints in Layers
- Start with profile table.
- Keep one chart per question.
- Always attach action to finding.
5.9 Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| pandas workflow | Python for Data Analysis | Ch. 6-8 |
| statistical summaries | Practical Statistics for Data Scientists | Ch. 1-3 |
5.10 Implementation Phases
- Phase 1: Profile builder.
- Phase 2: Hypothesis + plots.
- Phase 3: Narrative output and action mapping.
5.11 Key Implementation Decisions
| Decision | Options | Recommendation | Rationale |
|---|---|---|---|
| Report format | notebook, markdown | markdown | review-friendly artifact |
| Plot count | many, targeted | targeted | signal over noise |
6. Testing Strategy
6.1 Test Categories
| Category | Purpose | Examples |
|---|---|---|
| Unit | profile accuracy | null counts, cardinality |
| Integration | report generation | full command run |
| Edge | schema failures | missing target column |
6.2 Critical Test Cases
- Fixed dataset generates expected profile values.
- Missing target causes explicit non-zero exit.
- Report includes action items for each finding.
6.3 Test Data
- fixtures/titanic_fixed.csv
- fixtures/bad_schema.csv
7. Common Pitfalls & Debugging
7.1 Frequent Mistakes
| Pitfall | Symptom | Solution |
|---|---|---|
| Too many charts | report unreadable | enforce question-driven plotting |
| No action mapping | findings don’t help modeling | require action field per finding |
| Ignoring sample counts | unstable conclusions | annotate counts on all segment summaries |
7.2 Debugging Strategies
- Diff profile outputs between runs.
- Validate grouped summaries with manual spot checks.
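The first strategy, diffing profile outputs between runs, reduces to an equality check on the regenerated frame (`build_profile` is a hypothetical stand-in for the project's profiler):

```python
import numpy as np
import pandas as pd

def build_profile(df: pd.DataFrame) -> pd.DataFrame:
    # Stand-in profiler: deterministic by construction.
    return pd.DataFrame({
        "nulls": df.isna().sum(),
        "n_unique": df.nunique(),
    })

df = pd.DataFrame({"Age": [22.0, np.nan, 35.0], "Pclass": [1, 3, 3]})

# Two runs on the same input must be identical; any diff points at
# hidden nondeterminism (unseeded sampling, unstable sort order, ...).
p1, p2 = build_profile(df), build_profile(df)
print(p1.equals(p2))
```

In CI, the same idea becomes a golden-file test: regenerate the report from the fixture CSV and diff it against a committed copy.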
7.3 Performance Traps
Rendering high-cardinality plots unnecessarily; pre-filter top categories.
8. Extensions & Challenges
8.1 Beginner Extensions
- Add automatic data dictionary extraction.
- Add null heatmap by segment.
8.2 Intermediate Extensions
- Add leakage suspicion scoring rubric.
- Add drift comparison against a second snapshot.
8.3 Advanced Extensions
- Auto-generate model-ready preprocessing config from findings.
9. Real-World Connections
9.1 Industry Applications
- KPI anomaly diagnostics
- Customer behavior analysis
9.2 Related Open Source Projects
- ydata-profiling
- Great Expectations (for data checks)
9.3 Interview Relevance
Builds explainability around data quality and risk-aware modeling.
10. Resources
10.1 Essential Reading
- pandas docs
- seaborn docs
10.2 Video Resources
- Exploratory data analysis lectures (university/open courses)
10.3 Tools & Documentation
- https://pandas.pydata.org/
- https://seaborn.pydata.org/
10.4 Related Projects in This Series
11. Self-Assessment Checklist
11.1 Understanding
- I can explain each report finding and its modeling implication.
11.2 Implementation
- Report regenerates deterministically from same input.
11.3 Growth
- I can prioritize top three risks under a 30-minute review constraint.
12. Submission / Completion Criteria
Minimum Viable Completion:
- Profile + five hypothesis-driven findings.
Full Completion:
- Segment analysis and leakage notes included.
Excellence (Going Above & Beyond):
- Structured action plan directly consumable by downstream modeling project.