Project 3: Wrangling Data - The dplyr Verbs

Build a data cleaning pipeline using core dplyr verbs.


Project Overview

Attribute               Value
Difficulty              Level 1: Beginner
Time Estimate           Weekend
Main Language           R
Alternative Languages   Python (pandas), Julia
Knowledge Area          Data wrangling
Tools                   dplyr
Main Book               “R for Data Science” by Wickham & Grolemund

What you’ll build: A reproducible data cleaning script that filters, selects, groups, and summarizes data.

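Why it teaches stats: Every statistic inherits the flaws of its input. Unhandled missing values, wrongly filtered rows, or miscounted groups bias the means and counts you report, so valid conclusions depend on clean data.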

Core challenges you’ll face:

  • Handling missing values
  • Performing grouped summaries
  • Keeping transformations readable

Real World Outcome

You will output a cleaned dataset and a summary table for selected groups.

Example Output:

Filtered rows: 20
Grouped summary saved to summary.csv

Verification steps:

  • Check row counts after each step
  • Validate summaries against expectations
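A minimal sketch of a pipeline that produces output of this shape and runs both verification steps. It assumes the built-in mtcars dataset as a stand-in for your raw data, an illustrative filter rule (hp >= 100), and a temporary directory for the output file. None of these choices come from the project brief.

```r
library(dplyr)

# Assumption: mtcars (ships with R) stands in for the raw dataset.
raw <- mtcars

# Filter and select; the hp threshold is an illustrative rule only.
filtered <- raw %>%
  filter(!is.na(mpg), hp >= 100) %>%
  select(mpg, cyl, hp, wt)

# Verification step 1: check row counts after the filter.
message("Rows before: ", nrow(raw), ", after filter: ", nrow(filtered))

summary_tbl <- filtered %>%
  group_by(cyl) %>%
  summarise(n = n(), mean_mpg = mean(mpg), .groups = "drop")

# Verification step 2: group sizes must add up to the filtered row count.
stopifnot(sum(summary_tbl$n) == nrow(filtered))

# Assumption: write to a temp directory so the sketch leaves no files behind.
out_path <- file.path(tempdir(), "summary.csv")
write.csv(summary_tbl, out_path, row.names = FALSE)

cat("Filtered rows:", nrow(filtered), "\n")
cat("Grouped summary saved to", out_path, "\n")
```

In your own script you would replace the temp path with a project-relative one such as summary.csv.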

The Core Question You’re Answering

“How do I transform messy data into analysis-ready tables?”

Nearly every analysis starts here, because raw data rarely arrives in the shape a summary or model expects.


Concepts You Must Understand First

Stop and research these before coding:

  1. Filter and select
    • How do you subset data safely?
    • Book Reference: “R for Data Science”, Ch. 5
  2. Group by and summarize
    • How do you compute stats per group?
    • Book Reference: “R for Data Science”, Ch. 5
  3. Missing values
    • How do NA values affect calculations?
    • Book Reference: “R for Data Science”, Ch. 5
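To make the third concept concrete, here is a small sketch of how NA values propagate through filters and grouped summaries. The scores data frame and its column names are made up for illustration.

```r
library(dplyr)

# Toy data, invented for this example: one missing value in group "a".
scores <- data.frame(
  group = c("a", "a", "b", "b"),
  value = c(10, NA, 4, 6)
)

# Comparing with == NA yields NA, and filter() drops rows whose
# condition is NA, so this returns zero rows.
wrong <- scores %>% filter(value == NA)

# is.na() is the reliable way to find missing values.
missing_rows <- scores %>% filter(is.na(value))

by_group <- scores %>%
  group_by(group) %>%
  summarise(
    mean_raw   = mean(value),                # NA if any value in the group is NA
    mean_no_na = mean(value, na.rm = TRUE),  # ignores missing values
    n_missing  = sum(is.na(value)),          # report how many were ignored
    .groups = "drop"
  )

print(by_group)
```

Reporting n_missing next to the mean keeps the decision to drop NA values visible to whoever reads the summary.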

Questions to Guide Your Design

  1. Pipeline clarity
    • How will you structure chained operations?
    • Will you keep intermediate outputs for debugging?
  2. Validation
    • How will you verify that cleaning didn’t remove valid data?
    • How will you document assumptions?
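One possible answer to the validation questions, sketched under assumptions: keep each step as a named intermediate, log the row counts, and use anti_join() to inspect exactly which rows a step removed. The dataset (mtcars), the model key column, and the wt < 5 rule are illustrative choices, not part of the project brief.

```r
library(dplyr)

# Assumption: mtcars as stand-in data, with row names promoted to a key column.
raw <- mtcars
raw$model <- rownames(mtcars)

# Named intermediates make each step inspectable.
step1 <- raw %>% filter(!is.na(mpg))
step2 <- step1 %>% filter(wt < 5)  # hypothetical cleaning rule

# Rows present before the rule but absent after it, for manual review.
removed <- anti_join(step1, step2, by = "model")

# A small log documenting how many rows survive each step.
row_log <- data.frame(
  step = c("raw", "drop NA mpg", "wt < 5"),
  rows = c(nrow(raw), nrow(step1), nrow(step2))
)

print(row_log)
print(removed[, c("model", "wt")])
```

Reviewing the removed rows by eye is often the quickest way to spot a rule that discards valid data.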

Thinking Exercise

Grouped Summary

Group cars by cylinder count and compute average mpg. What patterns do you expect?

Questions while working:

  • Why do grouped summaries matter?
  • How does sample size affect interpretation?
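Once you have written down your expectations, a sketch like the following lets you check them. It assumes the exercise refers to the built-in mtcars dataset, where cyl is the cylinder count and mpg is miles per gallon.

```r
library(dplyr)

# Assumption: "cars" means the built-in mtcars dataset.
mpg_by_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarise(
    n        = n(),        # group size, needed to judge how stable the mean is
    mean_mpg = mean(mpg),
    sd_mpg   = sd(mpg),    # spread within each group
    .groups = "drop"
  )

print(mpg_by_cyl)
```

Reading n alongside mean_mpg addresses the sample-size question: a mean computed from a handful of cars deserves less confidence than one computed from many.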

The Interview Questions They’ll Ask

Prepare to answer these:

  1. “What are the core dplyr verbs?”
  2. “How do you handle missing values?”
  3. “Why is grouping important?”
  4. “How do you keep data pipelines readable?”
  5. “How do you validate transformations?”

Hints in Layers

Hint 1 (Starting Point): Use select() and filter() to create a smaller dataset.

Hint 2 (Next Level): Add group_by() and summarize() to compute stats.

Hint 3 (Technical Details): Use n() to track group sizes.

Hint 4 (Tools/Debugging): Print intermediate summaries to verify steps.
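The four hints can be combined in one chain. The sketch below uses a pass-through helper, peek(), which is a hypothetical function written for this example and not part of dplyr; it prints the table dimensions and returns the data unchanged so the pipe keeps flowing. The dataset (mtcars) and the hp > 100 rule are assumptions as well.

```r
library(dplyr)

# Hypothetical helper (not a dplyr function): report size, return data as-is.
peek <- function(df, label) {
  message(label, ": ", nrow(df), " rows x ", ncol(df), " cols")
  df
}

result <- mtcars %>%
  select(mpg, cyl, hp) %>%          # Hint 1: narrow the columns
  peek("after select") %>%          # Hint 4: print an intermediate check
  filter(hp > 100) %>%              # Hint 1: narrow the rows (illustrative rule)
  peek("after filter") %>%
  group_by(cyl) %>%                 # Hint 2: grouped stats
  summarise(
    n        = n(),                 # Hint 3: track group sizes
    mean_mpg = mean(mpg),
    .groups = "drop"
  )

print(result)
```

Because peek() returns its input, you can delete those lines once the pipeline is trusted without changing the result.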


Books That Will Help

Topic               Book                    Chapter
dplyr verbs         “R for Data Science”    Ch. 5
Grouped summaries   “R for Data Science”    Ch. 5
Missing values      “R for Data Science”    Ch. 5

Implementation Hints

  • Keep a consistent naming scheme for columns.
  • Use pipes to keep transformations readable.
  • Save cleaned data as a separate artifact.
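A sketch of these three hints together, assuming a snake_case naming scheme and a temp-directory output path. The toy data, the to_snake() helper, and the file name cars_clean.csv are all hypothetical and exist only for this example.

```r
library(dplyr)

# Toy raw data with inconsistent column names (invented for illustration).
raw <- data.frame(
  `Car Model`        = c("A", "B", "C"),
  `Miles Per Gallon` = c(21.0, NA, 30.4),
  check.names = FALSE
)

# Hypothetical helper: lower-case, then replace runs of other characters with "_".
to_snake <- function(x) gsub("[^a-z0-9]+", "_", tolower(x))

clean <- raw %>%
  rename_with(to_snake) %>%           # one naming scheme for every column
  filter(!is.na(miles_per_gallon))    # drop rows with no measurement

# Save the cleaned data under its own name; the raw object stays untouched.
clean_path <- file.path(tempdir(), "cars_clean.csv")
write.csv(clean, clean_path, row.names = FALSE)
```

Writing to a new file instead of overwriting the raw one means the script can always be re-run from the original input.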

Learning Milestones

  1. First milestone: You can filter and select data correctly.
  2. Second milestone: You can compute grouped summaries.
  3. Final milestone: You can build reproducible data pipelines.