Project 2: CSV/Data Transformation Pipeline

Build a CLI pipeline that cleans, validates, and transforms CSV data.

Quick Reference

Attribute        Value
Difficulty       Intermediate
Time Estimate    1-2 weeks
Language         Shell + awk/sed
Prerequisites    awk/sed basics, CSV structure
Key Topics       field parsing, validation, normalization

1. Learning Objectives

By completing this project, you will:

  1. Parse CSV safely with quoted fields and delimiters.
  2. Normalize inconsistent data with sed/awk pipelines.
  3. Validate required fields and numeric ranges.
  4. Generate summary statistics from cleaned data.
  5. Produce a clean output CSV suitable for import.

2. Theoretical Foundation

2.1 Core Concepts

  • CSV Rules: Quoted fields may contain commas or even newlines, so naive splitting on commas breaks (see the FPAT sketch after this list).
  • Normalization: Trimming whitespace, lowercasing emails, standardizing dates.
  • Validation: Required fields, type checks, and domain constraints.
  • ETL: Extract, transform, load is a pipeline of transformations.
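
A minimal sketch of quote-aware splitting with GNU awk's FPAT (a gawk extension). The pattern follows the gawk manual and does not handle empty fields, escaped quotes, or newlines inside quotes:

# A field is either a run of non-comma characters or a double-quoted
# string, which may itself contain commas.
echo '"Acme, Inc.",1200,2024-01-15' |
gawk 'BEGIN { FPAT = "([^,]+)|(\"[^\"]+\")" }
      { printf "fields=%d  name=%s  amount=%s  date=%s\n", NF, $1, $2, $3 }'
# fields=3  name="Acme, Inc."  amount=1200  date=2024-01-15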

2.2 Why This Matters

Every system eventually needs data cleanup. Knowing how to do it with text tools gives you speed and portability in production.

2.3 Common Misconceptions

  • “CSV is simple”: Quoted commas are the classic trap.
  • “Just use Excel”: Large files and automation require scripts.

3. Project Specification

3.1 What You Will Build

A script csv-pipeline.sh that:

  • Reads a CSV file
  • Normalizes and validates each row
  • Outputs clean CSV + summary report

3.2 Functional Requirements

  1. Input: CSV file path or STDIN.
  2. Normalization: Trim whitespace and normalize casing (prototype one-liners follow this list).
  3. Validation: Required fields not empty, numeric columns valid.
  4. Output: Write clean CSV and print counts.
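
Requirement 2 can be prototyped with plain sed/awk one-liners before being folded into the script. Treating field 3 as an email column is only an assumption for illustration, and both commands assume fields are not quoted yet:

# Trim whitespace at line edges and around commas (sed with -E for extended regex).
sed -E 's/^[[:space:]]+|[[:space:]]+$//g; s/[[:space:]]*,[[:space:]]*/,/g' input.csv

# Lowercase an assumed email column (field 3).
awk -F, -v OFS=, '{ $3 = tolower($3); print }' input.csv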

3.3 Non-Functional Requirements

  • Performance: Stream line-by-line.
  • Reliability: Invalid rows are logged and skipped; they must not crash the pipeline.
  • Usability: Clear report at end.

3.4 Example Usage / Output

$ ./csv-pipeline.sh sales.csv > clean.csv
Processed: 10000 rows
Valid: 9850
Invalid: 150

Note that stdout is redirected to clean.csv, so the summary report must go to stderr to keep the two streams separate.

3.5 Real World Outcome

You run the pipeline and receive a clean CSV file plus a report. The cleaned file is safe to import into a database or BI tool: casing is consistent and invalid rows have been filtered out.


4. Solution Architecture

4.1 High-Level Design

+---------+    +-----------+    +-----------+    +-----------+
| Input   | -> | Normalize | -> | Validate  | -> | Output    |
+---------+    +-----------+    +-----------+    +-----------+
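
One way to mirror these four stages is a literal shell pipeline of small filters; the function names below are illustrative, not part of the spec, and the naive -F, split ignores quoted commas (an FPAT version appears in 4.4):

#!/bin/sh
# Skeleton matching the boxes above: each stage reads stdin and writes stdout.

normalize() {   # trim whitespace around every field
  awk -F, -v OFS=, '{ for (i = 1; i <= NF; i++) gsub(/^[ \t]+|[ \t]+$/, "", $i); print }'
}

validate() {    # keep rows with a non-empty field 1 and a numeric field 2 (header handling omitted)
  awk -F, '$1 != "" && $2 ~ /^[0-9]+(\.[0-9]+)?$/'
}

cat "${1:-/dev/stdin}" | normalize | validate   # the Output stage is plain stdout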

4.2 Key Components

Component    Responsibility   Key Decisions
Parser       Split fields     awk FPAT or a CSV parser
Normalizer   Clean values     sed/awk rules
Validator    Filter rows      condition checks
Reporter     Count stats      awk aggregates

4.3 Data Structures

# awk counters updated while streaming (uninitialized variables start at 0)
valid++      # rows that passed every check
invalid++    # rows that were rejected

4.4 Algorithm Overview

Key Algorithm: Streaming ETL

  1. Parse row into fields.
  2. Normalize each field.
  3. Validate required fields.
  4. Emit row if valid.
  5. Update counters.

Complexity Analysis:

  • Time: O(N) for N rows
  • Space: O(1)
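
The whole loop fits in one gawk program. This is a sketch, not the reference solution: it assumes a header row and a name,amount,date column layout, reuses the FPAT pattern from 2.1 (so empty fields and escaped quotes are not handled), and writes the report to stderr so the clean CSV on stdout stays pure:

#!/bin/sh
# csv-pipeline.sh -- streaming ETL sketch (assumed columns: name,amount,date)
gawk '
BEGIN   { FPAT = "([^,]+)|(\"[^\"]+\")"; OFS = "," }
NR == 1 { print; next }                      # pass the header through untouched
{
    for (i = 1; i <= NF; i++) {              # normalize: trim edge whitespace
        gsub(/^[ \t]+|[ \t]+$/, "", $i)
    }
    if ($1 == "" || $2 !~ /^-?[0-9]+(\.[0-9]+)?$/) {   # validate name and amount
        invalid++
        printf "line %d: invalid row\n", NR > "/dev/stderr"
        next
    }
    valid++
    print                                    # emit the clean row on stdout
}
END {
    printf "Processed: %d rows\n", valid + invalid > "/dev/stderr"
    printf "Valid: %d\n",   valid                  > "/dev/stderr"
    printf "Invalid: %d\n", invalid                > "/dev/stderr"
}' "${1:-/dev/stdin}"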

5. Implementation Guide

5.1 Development Environment Setup

You need a POSIX shell and GNU awk (gawk), since FPAT is a gawk extension. Make the script executable:

chmod +x csv-pipeline.sh

5.2 Project Structure

csv-pipeline/
├── csv-pipeline.sh
└── README.md

5.3 The Core Question You Are Answering

“How do I clean and validate large CSV datasets using only Unix text tools?”

5.4 Concepts You Must Understand First

  1. CSV quoting rules
  2. awk field parsing with FPAT (a GNU awk extension)
  3. Regex validation patterns
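
The third item usually reduces to a handful of anchored patterns. These are common examples, not a mandated set, and the field numbers are placeholders:

# Anchored validation patterns used inside awk conditions (illustrative):
#   integer     $2 ~ /^-?[0-9]+$/
#   decimal     $2 ~ /^-?[0-9]+(\.[0-9]+)?$/
#   ISO date    $3 ~ /^[0-9]{4}-[0-9]{2}-[0-9]{2}$/
#   non-empty   $1 != ""
awk -F, '$2 ~ /^-?[0-9]+(\.[0-9]+)?$/' input.csv   # keep rows with a numeric second field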

5.5 Questions to Guide Your Design

  1. How do you handle commas inside quotes?
  2. What fields are required vs optional?
  3. How do you report invalid rows?

5.6 Thinking Exercise

Given a row with quoted commas, manually split fields and decide validity.
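
For instance, take the row below; the column meanings (name, email, amount) are assumed just for this exercise:

Row:       "Smith, Jane", JANE@example.com ,42
Split:     field 1 = "Smith, Jane"   field 2 = JANE@example.com   field 3 = 42   (3 fields, not 4)
Normalize: trim the spaces around field 2 and lowercase it to jane@example.com
Validate:  name is non-empty and 42 is numeric, so the row is valid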

5.7 The Interview Questions They Will Ask

  1. How do you parse CSV safely?
  2. Why is streaming better for large files?
  3. How do you validate numeric fields?

5.8 Hints in Layers

Hint 1: Start with a simple CSV without quotes (a before/after sketch follows Hint 3).

Hint 2: Add FPAT for quoted fields.

Hint 3: Add validation and counters.
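
To see how Hint 2 builds on Hint 1, the only change a quoted-field upgrade needs is swapping the naive separator for FPAT (GNU awk only); sample.csv is a placeholder name:

# Hint 1: naive comma splitting -- breaks on quoted commas
awk -F, '{ print $2 }' sample.csv

# Hint 2: the same field access, made quote-aware with FPAT
gawk 'BEGIN { FPAT = "([^,]+)|(\"[^\"]+\")" } { print $2 }' sample.csv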

5.9 Books That Will Help

Topic       Book                          Chapter
awk         “Sed & Awk”                   Ch. 7
pipelines   “The Linux Command Line”      Ch. 6

5.10 Implementation Phases

Phase 1: Foundation

  • Parse CSV
  • Normalize fields

Phase 2: Validation

  • Filter invalid rows

Phase 3: Reporting

  • Summary counts

6. Testing Strategy

6.1 Test Categories

Category            Purpose      Examples
Unit Tests          Parsing      quoted commas
Integration Tests   Pipeline     file -> clean output
Edge Cases          Empty rows   skip gracefully

6.2 Critical Test Cases

  1. Quoted comma should not split.
  2. Empty required field -> invalid.
  3. Numeric column with letters -> invalid.
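
A tiny fixture covering all three cases, plus a hand check of the expected counts (file names are placeholders and the layout matches the assumed name,amount,date columns):

# fixture.csv: quoted comma, empty required field, non-numeric amount
cat > fixture.csv <<'EOF'
name,amount,date
"Acme, Inc.",1200,2024-01-15
,300,2024-01-16
Globex,abc,2024-01-17
EOF

./csv-pipeline.sh fixture.csv > clean.csv
# Expected report: Processed: 3 rows, Valid: 1, Invalid: 2
# clean.csv must keep "Acme, Inc." as a single field.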

7. Common Pitfalls and Debugging

Pitfall         Symptom         Solution
Naive split     Broken fields   Use FPAT
Over-trimming   Data loss       Trim only edges
Invalid stats   Wrong counts    Validate with a small fixture

8. Extensions and Challenges

8.1 Beginner Extensions

  • Add --summary-only flag
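
A minimal way to support the flag (the name comes from the bullet above) is a small argument check before the main pipeline runs; this is a sketch only:

#!/bin/sh
summary_only=0
case "${1:-}" in
    --summary-only) summary_only=1; shift ;;   # report only, discard the clean CSV
esac
# Later: when summary_only=1, send the clean rows to /dev/null and keep the stderr report.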

8.2 Intermediate Extensions

  • Add schema file for validation rules
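
One possible, entirely hypothetical schema format maps each column to a type and a required flag, which the validator would read at startup instead of hard-coding its checks:

# schema.txt (hypothetical format: column:type:required|optional)
name:string:required
amount:numeric:required
date:date:optional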

8.3 Advanced Extensions

  • Output to SQLite directly

9. Real-World Connections

  • ETL jobs in data engineering
  • Pre-processing before ML training

10. Resources

  • GNU awk manual
  • RFC 4180 (the CSV format specification)

11. Self-Assessment Checklist

  • I can explain CSV quoting rules
  • I can validate fields in awk

12. Submission / Completion Criteria

Minimum Viable Completion:

  • Clean CSV output with invalid rows removed

Full Completion:

  • Report + validation

Excellence (Going Above and Beyond):

  • Schema-driven validation and SQL output