Project 2: The Data Janitor - Cleaning a Messy Dataset

Build a notebook that diagnoses and fixes missing values, incorrect data types, and duplicate rows in a messy dataset.


Quick Reference

Attribute       Value
Difficulty      Intermediate
Time Estimate   Weekend
Language        Python
Prerequisites   Project 1
Key Topics      missing data, dtype conversion, cleaning
Output          Cleaned dataset + report

Learning Objectives

  1. Detect missing values and data quality issues.
  2. Apply fill/drop strategies by column.
  3. Convert columns to correct dtypes.
  4. Normalize string columns.
  5. Produce a documented cleaning report.
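
As a starting point for objectives 1 and 4, here is a minimal sketch of a per-column missing-value summary and a simple string normalizer. The sample data, the column names (`city`, `price`), and the helper names `missing_report` and `normalize_text` are assumptions made for illustration; your dataset will need its own rules.

```python
import numpy as np
import pandas as pd

# Hypothetical messy data; the column names are illustrative only.
df = pd.DataFrame({
    "city": ["  Boston", "boston ", "NEW YORK", None],
    "price": [10.0, np.nan, 12.5, np.nan],
})


def missing_report(frame: pd.DataFrame) -> pd.DataFrame:
    """Return missing count, missing percentage, and dtype per column."""
    counts = frame.isna().sum()
    report = pd.DataFrame({
        "missing": counts,
        "pct_missing": (counts / len(frame) * 100).round(1),
        "dtype": frame.dtypes.astype(str),
    })
    # Put the columns with the most missing values first.
    return report.sort_values("missing", ascending=False)


def normalize_text(series: pd.Series) -> pd.Series:
    """Trim surrounding whitespace and lowercase; missing values stay missing."""
    return series.str.strip().str.lower()


report = missing_report(df)
df["city"] = normalize_text(df["city"])
print(report)
print(df)
```

Running the report before and after each cleaning step gives you the raw numbers for the final cleaning report.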

The Core Question You’re Answering

“How do I turn messy real-world data into a clean, analyzable table?”


Project Specification

Functional Requirements

  1. Missing-value report per column.
  2. At least two cleaning strategies applied.
  3. Type conversion for numeric and date columns.
  4. Duplicate detection and handling.
  5. Save cleaned output.
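
Requirements 3 to 5 could be approached roughly as follows. This is a sketch under assumptions, not a prescribed solution: the columns (`order_id`, `amount`, `order_date`), the date format, and the choice to drop exact duplicate rows are all hypothetical. It writes to an in-memory buffer so it runs anywhere; in your notebook you would likely pass a file path to `to_csv` instead.

```python
import io

import pandas as pd

# Hypothetical raw data where every column arrived as text.
raw = pd.DataFrame({
    "order_id": ["1", "2", "2", "3"],
    "amount": ["10.5", "7", "7", "n/a"],
    "order_date": ["2024-01-05", "2024-01-06", "2024-01-06", "not a date"],
})

clean = raw.copy()

# errors="coerce" turns unparseable values into NaN / NaT instead of raising,
# so bad entries show up in the missing-value report.
clean["amount"] = pd.to_numeric(clean["amount"], errors="coerce")
clean["order_date"] = pd.to_datetime(
    clean["order_date"], format="%Y-%m-%d", errors="coerce"
)

# Count exact duplicate rows, then keep the first occurrence of each.
n_dupes = int(clean.duplicated().sum())
clean = clean.drop_duplicates().reset_index(drop=True)

# Save the cleaned output (here to a buffer; assumption: CSV is acceptable).
buffer = io.StringIO()
clean.to_csv(buffer, index=False)

print(f"duplicates removed: {n_dupes}")
print(clean.dtypes)
```

Note that coercion can increase the number of missing values (the `n/a` amount becomes NaN), so it is worth converting types before you decide on fill or drop strategies.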

Implementation Guide

Questions to Guide Your Design

  1. Which columns are critical and cannot be missing?
  2. When should you drop vs fill?
  3. How will you document decisions?
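
One possible way to answer these questions in code is to pick a strategy per column and record each decision as you go. The sketch below assumes a hypothetical dataset in which `customer_id` is critical (rows without it are dropped), `age` is filled with the median, and `segment` is filled with a placeholder. The `log_step` helper and the log fields are illustrative, not a required format.

```python
import numpy as np
import pandas as pd

# Hypothetical data; which columns count as critical is a design decision.
df = pd.DataFrame({
    "customer_id": [101, 102, None, 104],
    "age": [34.0, np.nan, 29.0, 41.0],
    "segment": ["a", None, "b", "a"],
})

decisions = []  # one record per cleaning step, later shown in the report


def log_step(column: str, action: str, reason: str, affected: int) -> None:
    """Record what was done to a column, why, and how many values it touched."""
    decisions.append({
        "column": column,
        "action": action,
        "reason": reason,
        "affected": affected,
    })


# Critical column: a row without an identifier cannot be used, so drop it.
rows_before = len(df)
df = df.dropna(subset=["customer_id"]).copy()
log_step("customer_id", "drop rows", "identifier is required",
         rows_before - len(df))

# Numeric column: fill with the median, which is robust to outliers.
n_missing = int(df["age"].isna().sum())
df["age"] = df["age"].fillna(df["age"].median())
log_step("age", "fill with median", "few gaps, skew-resistant", n_missing)

# Categorical column: keep the row but make the gap explicit.
n_missing = int(df["segment"].isna().sum())
df["segment"] = df["segment"].fillna("unknown")
log_step("segment", "fill with 'unknown'", "missing is informative", n_missing)

cleaning_log = pd.DataFrame(decisions)
print(cleaning_log)
```

The resulting `cleaning_log` table can be pasted into the report so that every drop or fill has a stated reason.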

Testing Strategy

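  • Missing-value counts are lower after cleaning than before, and zero in critical columns.
  • Numeric and date columns have the expected dtypes.
  • No unintended duplicate rows remain.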
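
These checks can be automated with a small validation function. The sketch below is one possible shape, assuming you pass in the lists of critical, numeric, and date columns yourself; the function name `validate_cleaning` and the example frames are hypothetical.

```python
import pandas as pd


def validate_cleaning(after, critical, numeric, dates):
    """Return a list of problem messages; an empty list means all checks passed."""
    problems = []
    for col in critical:
        if after[col].isna().any():
            problems.append(f"{col}: still has missing values")
    for col in numeric:
        if not pd.api.types.is_numeric_dtype(after[col]):
            problems.append(f"{col}: expected a numeric dtype")
    for col in dates:
        if not pd.api.types.is_datetime64_any_dtype(after[col]):
            problems.append(f"{col}: expected a datetime dtype")
    if after.duplicated().any():
        problems.append("duplicate rows remain")
    return problems


# Hypothetical before/after frames to exercise the checks.
messy = pd.DataFrame({
    "id": [1, 2, 2, None],
    "amount": ["5", "6", "6", "x"],
})
cleaned = pd.DataFrame({
    "id": [1.0, 2.0],
    "amount": [5.0, 6.0],
})

messy_problems = validate_cleaning(messy, ["id"], ["amount"], [])
cleaned_problems = validate_cleaning(cleaned, ["id"], ["amount"], [])
print(messy_problems)
print(cleaned_problems)
```

Returning messages instead of raising on the first failure lets the notebook display every remaining issue at once.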

Extensions

  • Data quality score.
  • Unit tests for cleaning steps.
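
For the data quality score extension, there is no single standard formula. The sketch below uses one assumed definition, the average of completeness (share of non-missing cells) and uniqueness (share of rows that are not exact duplicates); the function name `quality_score` and the equal weighting are illustrative choices you may want to change.

```python
import pandas as pd


def quality_score(frame: pd.DataFrame) -> float:
    """Score from 0 to 1: mean of cell completeness and row uniqueness.

    This definition is an assumption made for illustration, not a standard.
    """
    if frame.empty:
        return 0.0
    completeness = 1 - frame.isna().to_numpy().mean()
    uniqueness = 1 - frame.duplicated().mean()
    return round(float((completeness + uniqueness) / 2), 3)


# Hypothetical frames: 2 of 8 cells missing and 1 of 4 rows duplicated.
messy = pd.DataFrame({
    "a": [1, 1, None, 4],
    "b": ["x", "x", "y", None],
})
tidy = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})

print(quality_score(messy))
print(quality_score(tidy))
```

Reporting the score before and after cleaning gives a single number that summarizes the improvement.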

This guide was generated from LEARN_PANDAS_DEEP_DIVE.md. For the complete learning path, see the parent directory LEARN_PANDAS_DEEP_DIVE/README.md.