Project 2: The Data Janitor - Cleaning a Messy Dataset
Build a notebook that diagnoses and fixes missing values, incorrect dtypes, and duplicate rows.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Intermediate |
| Time Estimate | Weekend |
| Language | Python |
| Prerequisites | Project 1 |
| Key Topics | missing data, dtype conversion, cleaning |
| Output | Cleaned dataset + report |
Learning Objectives
- Detect missing values and data quality issues.
- Apply fill/drop strategies by column.
- Convert columns to correct dtypes.
- Normalize string columns.
- Produce a documented cleaning report.
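As a rough illustration of the detection and string-normalization objectives, the sketch below trims and lowercases text columns, then maps placeholder tokens to real missing values so they show up in a missing-value count. The sample data, the column names, the `PLACEHOLDERS` list, and the `normalize_strings` helper are all assumptions made for this example, not part of the project specification.

```python
import pandas as pd

# Hypothetical sample data; the column names are illustrative only.
df = pd.DataFrame({
    "city": ["  New York", "new york ", "N/A", "Boston", ""],
    "amount": ["10", "n/a", "12.5", " 7 ", "3"],
})

# Tokens treated as "missing" (an assumption; adjust for your dataset).
PLACEHOLDERS = ["", "n/a", "na", "null", "none", "-"]


def normalize_strings(series: pd.Series) -> pd.Series:
    """Trim whitespace, lowercase, and turn placeholder tokens into missing values."""
    cleaned = series.astype("string").str.strip().str.lower()
    # mask() replaces values with <NA> wherever the condition is True.
    return cleaned.mask(cleaned.isin(PLACEHOLDERS))


df["city"] = normalize_strings(df["city"])
df["amount"] = normalize_strings(df["amount"])

# Missing values per column, counted after placeholders became visible.
missing_report = df.isna().sum()
print(missing_report)
```

Normalizing before counting matters here: the raw frame reports zero missing values because `"N/A"` and `""` are ordinary strings until they are mapped to `<NA>`.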
The Core Question You’re Answering
“How do I turn messy real-world data into a clean, analyzable table?”
Project Specification
Functional Requirements
- Missing-value report per column.
- At least two cleaning strategies applied.
- Type conversion for numeric and date columns.
- Duplicate detection and handling.
- Save cleaned output.
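One possible way to cover the requirements above in a single pass is sketched below. The inline CSV, the column names (`order_id`, `price`, `order_date`), and the output path are hypothetical stand-ins for your own dataset. Filling or dropping the remaining missing values is left out here so the block stays focused on the report, conversion, duplicate, and save steps.

```python
import io
import pathlib
import tempfile

import pandas as pd

# Hypothetical raw data standing in for a messy CSV export.
raw_csv = io.StringIO(
    "order_id,price,order_date\n"
    "1,19.99,2024-01-05\n"
    "2,abc,2024-01-06\n"
    "2,abc,2024-01-06\n"
    "3,,not a date\n"
)
# Read everything as text first so conversions are explicit decisions.
df = pd.read_csv(raw_csv, dtype=str)

# 1. Missing-value report per column (count and percentage).
report = pd.DataFrame({
    "missing": df.isna().sum(),
    "pct": (df.isna().mean() * 100).round(1),
})

# 2. Type conversion. errors="coerce" turns unparseable values into NaN/NaT
#    instead of raising, so bad values surface as missing data.
df["price"] = pd.to_numeric(df["price"], errors="coerce")
df["order_date"] = pd.to_datetime(
    df["order_date"], format="%Y-%m-%d", errors="coerce"
)

# 3. Duplicate detection and handling: count exact duplicate rows, then drop them.
n_dupes = int(df.duplicated().sum())
df = df.drop_duplicates().reset_index(drop=True)

# 4. Save the cleaned output (path is an assumption for this sketch).
out_path = pathlib.Path(tempfile.gettempdir()) / "cleaned_orders.csv"
df.to_csv(out_path, index=False)

print(report)
print(f"Dropped {n_dupes} duplicate row(s); wrote {len(df)} rows to {out_path}")
```

Note that coercion can raise the missing count (the `abc` price becomes `NaN`), so it is worth re-running the report after conversion and recording both numbers.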
Implementation Guide
Questions to Guide Your Design
- Which columns are critical and cannot be missing?
- When should you drop vs fill?
- How will you document decisions?
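One way to make the answers to these questions explicit is to encode them as a per-column policy and record what each step did. The sketch below is a minimal example under assumed names: the `STRATEGIES` mapping, the `apply_strategies` helper, and the sample columns are hypothetical, and the choice of median and mode fills is only one reasonable option.

```python
import pandas as pd

# Hypothetical sample data with gaps in every column.
df = pd.DataFrame({
    "customer_id": [101, None, 103, 104],
    "age": [34, None, 29, None],
    "segment": ["a", None, "b", "a"],
})

# Assumed policy: critical columns drop the row, others get filled.
STRATEGIES = {
    "customer_id": "drop",   # critical: a row without an ID is unusable
    "age": "median",         # numeric: fill with the column median
    "segment": "mode",       # categorical: fill with the most common value
}


def apply_strategies(frame: pd.DataFrame, strategies: dict):
    """Apply one strategy per column and return (cleaned frame, decision log)."""
    log = []
    for column, strategy in strategies.items():
        # Counted at the time the step runs, so earlier drops affect later counts.
        n_missing = int(frame[column].isna().sum())
        if strategy == "drop":
            frame = frame.dropna(subset=[column])
        elif strategy == "median":
            frame = frame.fillna({column: frame[column].median()})
        elif strategy == "mode":
            frame = frame.fillna({column: frame[column].mode().iloc[0]})
        else:
            raise ValueError(f"Unknown strategy {strategy!r} for {column!r}")
        log.append(
            {"column": column, "strategy": strategy, "missing_before": n_missing}
        )
    return frame.reset_index(drop=True), pd.DataFrame(log)


cleaned, decision_log = apply_strategies(df, STRATEGIES)
print(decision_log)
```

The returned `decision_log` can be pasted into the cleaning report as a table, which keeps the documentation in sync with the code that actually ran.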
Testing Strategy
- Missing-value counts are lower after cleaning than before.
- Numeric and date columns have the expected dtypes.
- No unintended duplicate rows remain.
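The three checks above could be automated along the following lines. This is a sketch, not a prescribed test suite: the `check_cleaning` helper, its parameters, and the sample frames are hypothetical, and dtype expectations are passed as predicate functions from `pd.api.types` rather than exact dtype strings.

```python
import pandas as pd
from pandas.api import types as ptypes


def check_cleaning(before, after, expected_dtypes, key_columns):
    """Return a list of failure messages; an empty list means all checks passed."""
    failures = []

    # Check 1: total missing count went down (skipped if there was nothing to fix).
    missing_before = int(before.isna().sum().sum())
    missing_after = int(after.isna().sum().sum())
    if missing_before > 0 and missing_after >= missing_before:
        failures.append(f"missing not reduced: {missing_before} -> {missing_after}")

    # Check 2: each listed column satisfies its dtype predicate.
    for column, is_expected in expected_dtypes.items():
        if not is_expected(after[column]):
            failures.append(f"{column}: unexpected dtype {after[column].dtype}")

    # Check 3: no duplicates remain on the key columns.
    if after.duplicated(subset=key_columns).any():
        failures.append(f"duplicates remain on {key_columns}")

    return failures


# Hypothetical before/after frames for demonstration.
before = pd.DataFrame({
    "order_id": ["1", "2", "2", "3"],
    "price": ["19.99", "5", "5", None],
    "order_date": ["2024-01-05", "2024-01-06", "2024-01-06", "2024-01-07"],
})
after = pd.DataFrame({
    "order_id": ["1", "2", "3"],
    "price": [19.99, 5.0, 12.495],
    "order_date": pd.to_datetime(["2024-01-05", "2024-01-06", "2024-01-07"]),
})

expected = {
    "price": ptypes.is_float_dtype,
    "order_date": ptypes.is_datetime64_any_dtype,
}
failures = check_cleaning(before, after, expected, key_columns=["order_id"])
print(failures or "all checks passed")
```

Returning messages instead of raising on the first problem lets the notebook show every failed check at once; wrapping the call in `assert not failures` turns it into a hard gate.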
Extensions
- Data quality score.
- Unit tests for cleaning steps.
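For the data quality score, there is no single standard formula. The sketch below assumes one simple definition: the average of completeness (share of non-missing cells) and uniqueness (share of non-duplicate rows), scaled to 0-100. The `quality_score` function and the sample data are hypothetical.

```python
import pandas as pd


def quality_score(frame: pd.DataFrame) -> dict:
    """Score a frame on completeness and uniqueness (assumed, equal-weight formula)."""
    # Share of cells that are not missing.
    completeness = 1.0 - float(frame.isna().to_numpy().mean())
    # Share of rows that are not exact duplicates of an earlier row.
    uniqueness = 1.0 - float(frame.duplicated().mean())
    return {
        "completeness": completeness,
        "uniqueness": uniqueness,
        "score": round(100 * (completeness + uniqueness) / 2, 1),
    }


# Hypothetical messy frame: one fully empty row and one duplicate row.
df = pd.DataFrame({
    "id": [1, 2, 2, None],
    "name": ["a", "b", "b", None],
})
cleaned = df.dropna().drop_duplicates()

score_before = quality_score(df)
score_after = quality_score(cleaned)
print(score_before["score"], "->", score_after["score"])
```

Reporting the score before and after cleaning gives the report a single headline number, though the component values are usually more informative than the average.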
This guide was generated from LEARN_PANDAS_DEEP_DIVE.md. For the complete learning path, see the parent directory LEARN_PANDAS_DEEP_DIVE/README.md.