Project 2: CSV/Data Transformation Pipeline

Build a streaming CSV pipeline that cleans, validates, normalizes, and summarizes data at scale.

Quick Reference

Attribute                         | Value
Difficulty                        | Level 2: Intermediate
Time Estimate                     | 1-2 weeks
Main Programming Language         | Shell + awk
Alternative Programming Languages | Python, Go
Coolness Level                    | Level 3: Practical Data Engineering
Business Potential                | Level 4: Data Services / ETL
Prerequisites                     | Basic awk, sed substitution, CLI pipelines
Key Topics                        | CSV parsing, validation, normalization, aggregation

1. Learning Objectives

By completing this project, you will:

  1. Parse CSV safely, including quoted fields and embedded commas.
  2. Design streaming validation rules that do not require loading all data.
  3. Normalize inconsistent records into a canonical schema.
  4. Implement deduplication and summary statistics with awk arrays.
  5. Build a CLI interface with deterministic outputs and error reports.

2. All Theory Needed (Per-Concept Breakdown)

2.1 CSV Grammar, Quoting, and Edge Cases

Fundamentals

CSV looks simple but has subtle rules that impact parsing. A CSV file is a sequence of records separated by newlines. Fields are separated by commas, but if a field contains a comma, newline, or quote, it must be wrapped in quotes. Inside a quoted field, quotes are escaped by doubling them. This means you cannot parse CSV correctly by simply splitting on commas. Tools like awk can parse CSV if configured correctly (using FPAT or a custom parser), but you must understand the grammar first. Proper parsing is crucial for data correctness: a misparsed line can shift columns, corrupt numeric values, or break downstream aggregations. In a streaming pipeline, CSV grammar defines the record boundaries and determines whether your line-by-line processing is safe. If you ignore quoting rules, your pipeline will silently produce incorrect data.

Deep Dive into the concept

CSV is a minimal data interchange format with a deceptively rich grammar. The RFC 4180 definition specifies that records are separated by CRLF (or LF in practice), fields are separated by commas, and fields containing special characters must be quoted. A quoted field may include commas, newlines, or quotes. Quotes are escaped by doubling them (""). This leads to an important consequence: a record is not always a single line in the file. If a field contains a newline, the record spans multiple lines. Many real-world CSV files avoid embedded newlines, but you cannot assume this. Your pipeline should either explicitly disallow multiline fields (and log them as invalid) or implement a record-aware parser that can reassemble them.

In a streaming toolset like awk/sed/grep, this limitation matters. awk is line-oriented by default, so it treats each line as a record. If your CSV has multiline fields, awk will split the record and parse it incorrectly. For a learning project, you can choose to disallow multiline fields and document it, but you should detect them. A practical detection approach is to count quotes: if a line contains an odd number of quotes, it likely starts or ends a multiline field. You can buffer lines until quotes are balanced, then treat the combined lines as one record. This is a common technique in streaming CSV processors.
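The quote-parity buffering described above can be sketched in portable awk (assumption: quotes appear only as RFC 4180 delimiters/escapes; the " | " marker just makes the rejoined newline visible):

```shell
# Reassemble multiline CSV records by buffering until quotes balance.
printf '%s\n' 'a,"line one' 'line two",b' 'c,d' |
awk '{
  buf = (buf == "") ? $0 : buf "\n" $0   # append physical line to record buffer
  n = buf
  q = gsub(/"/, "", n)                   # count quotes seen so far (on a copy)
  if (q % 2 == 0) {                      # balanced -> complete logical record
    gsub(/\n/, " | ", buf)               # make the rejoined newline visible
    print "RECORD: " buf
    buf = ""
  }
}'
```

Here the first two physical lines become one logical record; a real pipeline would hand the buffered record to the CSV parser instead of printing it.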

Quoted commas are the most common issue. Consider "Acme, Inc",2026-01-01,1200,paid. A naive FS="," split yields five fields instead of four because of the comma inside quotes. The correct approach is to use FPAT (field pattern) in GNU awk, which defines what a field looks like rather than how to split. A common FPAT pattern is "([^"]|"")*"|[^,]*, which matches either a quoted field (with doubled quotes allowed inside) or an unquoted field without commas. In POSIX awk, FPAT is not available, so you may need a custom parser using split() and state tracking.

CSV also interacts with whitespace normalization. Some files include spaces after commas, while others do not. Some use tabs instead of commas (TSV). If your parser strips whitespace too early, you might change data values (e.g., leading spaces in text fields). The correct approach is to parse first, then normalize field-by-field based on schema rules. For numeric fields, you can trim whitespace and validate numeric ranges. For string fields, you can preserve or normalize case based on requirements. This makes data transformations explicit rather than accidental.
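A minimal parse-then-normalize sketch, under the simplifying assumption that this sample has no quoted commas (so FS="," stands in for a CSV-aware parser); the 4-column schema and field roles are illustrative:

```shell
printf '%s\n' 'Acme,2026-01-01,1200,PAID' 'Bad,2026-01-01,x' |
awk -F',' '{
  if (NF != 4) { print "INVALID line " NR > "/dev/stderr"; next }  # schema check first
  gsub(/^ +| +$/, "", $1)          # then normalize field-by-field: trim text
  $4 = tolower($4)                 # canonical case for the status enum
  print $1 "," $2 "," $3 "," $4    # emit normalized CSV
}'
```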

The schema is another critical concept. CSV is schema-less by default: it is just rows of fields. Your pipeline needs to impose a schema: expected column names, types, and required fields. In a streaming context, schema validation must be row-by-row. This means you need a strategy for missing fields, extra fields, or schema evolution. The simplest approach is strict: drop rows that do not match the expected field count and log them. A more flexible approach is to allow optional fields and fill missing values with defaults. For this project, define a strict schema and log invalid rows, but provide a --lenient mode that attempts to repair simple issues (like extra trailing commas).

Finally, encoding and locale can cause subtle problems. CSV is often UTF-8, but your tools operate on bytes. If you perform regex validation like [A-Z] or \w, locale settings can change what is considered a letter. Use explicit character classes or set LC_ALL=C if you want byte-level behavior. For numeric parsing, be aware of locale-specific decimal separators (comma vs dot). You can choose to support only dot-separated decimals and document that requirement. Consistency is key for deterministic pipelines.

How this fits in the project

This project is about building a pipeline that treats CSV as structured records. Understanding the grammar determines whether your field extraction, normalization, and validation are correct or silently corrupted.

Definitions & key terms

  • Record: One CSV row (may span multiple lines if quoted fields contain newlines).
  • Field: One column value.
  • Quoted field: Field wrapped in quotes to allow commas/newlines.
  • FPAT: awk feature to define fields by pattern (GNU awk).
  • Schema: Expected column names, types, and constraints.

Mental model diagram (ASCII)

Raw line -> [CSV parser] -> fields[] -> [validation] -> normalized record
"Acme, Inc",2026-01-01,1200,paid
        ^ comma inside quotes is part of field

How it works (step-by-step, with invariants and failure modes)

  1. Read a raw line (or combined lines if multiline quoting detected).
  2. Parse into fields using a CSV-aware method.
  3. Validate field count against schema.
  4. Normalize fields (trim, type cast, fix case).
  5. Emit normalized CSV or log invalid row.

Invariant: A valid record must match the schema field count. Failure modes: unmatched quotes, extra delimiters, unexpected encodings.

Minimal concrete example

# GNU awk example with FPAT (simplified pattern: does not handle empty
# fields or doubled quotes)
awk -v FPAT='"[^"]+"|[^,]+' '{print $1, $2, $3, $4}' sales.csv

Common misconceptions

  • “CSV = split on commas” -> Quoted fields make this incorrect.
  • “Records always fit on one line” -> Quoted fields can contain newlines.
  • “Whitespace can always be trimmed” -> Some text fields have meaningful spaces.

Check-your-understanding questions

  1. Why does FS="," fail for fields with embedded commas?
  2. How can you detect a multiline CSV record in a stream?
  3. What is the difference between parsing and normalization?
  4. Why should you define a schema for CSV processing?

Check-your-understanding answers

  1. It splits inside quoted fields, producing too many columns.
  2. By counting quotes; an odd number indicates an open quote.
  3. Parsing extracts fields; normalization cleans and transforms them.
  4. Without a schema, you cannot validate or enforce correctness.

Real-world applications

  • Importing sales data into analytics systems.
  • Cleaning CRM exports before database ingestion.
  • ETL pipelines for reporting and dashboards.

Where you’ll apply it

  • See §3.5 for the expected CSV schemas.
  • See §4.4 for the parsing algorithm and schema validation flow.
  • Also used in: P01 Log Analyzer for structured log parsing.

References

  • RFC 4180: Common Format and MIME Type for CSV Files
  • “Effective awk Programming” by Arnold Robbins, data parsing sections

Key insights

CSV parsing is a grammar problem; correct parsing is the foundation of trustworthy metrics.

Summary

CSV has explicit quoting rules that invalidate naive splitting. A reliable pipeline must parse with awareness of quotes and enforce a schema before normalization.

Homework/Exercises to practice the concept

  1. Create a CSV file with a quoted field that contains commas and verify parsing.
  2. Write a script that flags rows with an odd number of quotes.
  3. Design a schema for a dataset and list validation rules.

Solutions to the homework/exercises

  1. Use FPAT or a custom parser; verify the field count remains correct.
  2. Count quotes on a copy of the line (gsub returns the substitution count) and flag the row if the count is odd.
  3. Include field types, required fields, and numeric ranges.
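A possible script for exercise 2, counting quotes on a copy so the original line is preserved:

```shell
printf '%s\n' '"Acme, Inc",1200' '"Broken,1200' |
awk '{
  line = $0                        # work on a copy; keep $0 intact
  if (gsub(/"/, "", line) % 2)     # gsub returns the substitution count
    print "UNBALANCED line " NR ": " $0
}'
```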

2.2 Awk Parsing, Field Patterns, and Validation Rules

Fundamentals

awk treats each record as a line and splits it into fields using a field separator (FS). This works for simple delimited text but is insufficient for CSV with quoting. GNU awk provides FPAT, which flips parsing: instead of specifying separators, you specify what a valid field looks like. This is powerful for CSV parsing because you can match either quoted or unquoted fields. Validation is the next step: after parsing, you must check field counts, required fields, numeric ranges, and enumerated values. Validation should be streaming-friendly: each row is validated independently and either passed through or logged as invalid. The pipeline should never crash on a single bad row; it should count errors and produce a report.

Deep Dive into the concept

Parsing with awk is a careful balance between correctness and portability. POSIX awk supports FS but not FPAT. GNU awk extends awk with FPAT, making it ideal for CSV. If you need portability across macOS (BSD awk) and Linux, you must decide whether to require gawk or implement a pure-awk parser. In this project, you can require GNU awk and document that users need to install it on macOS. This simplifies parsing and makes your pipeline robust.

With FPAT, you define a pattern that matches valid fields. A common CSV FPAT, written as an awk string literal, is FPAT = "(\"([^\"]|\"\")*\")|([^,]*)". This pattern matches either quoted fields that may contain commas and doubled quotes, or unquoted fields without commas. The parser then yields fields with quotes intact. You must strip the quotes and unescape doubled quotes to normalize the data. This means parsing and normalization are separate stages: parsing extracts fields, normalization cleans them. This separation makes validation cleaner because you can validate on normalized values.

Validation rules should align with the schema. For example, if column 3 is amount, your validation rule may require it to be numeric and within a range. If column 4 is status, it may need to be in {paid, pending, failed}. You should implement validation as a function in awk, returning a boolean and a reason string. When invalid, the row should be emitted to an error file with line number, raw row, and reason. This design is critical for traceability and for debugging data issues later.

Another important aspect is error accounting. Many pipelines fail silently when rows are dropped. Instead, maintain counters: total rows, valid rows, invalid rows by reason. This gives visibility and allows you to set thresholds (e.g., fail the pipeline if more than 1% of rows are invalid). In awk, this is just another associative array: invalid[reason]++. At the end, print a summary table.

Field validation can also be extended to schema evolution. If you encounter extra columns, you can choose to ignore them or treat them as invalid. A robust pipeline should support a --strict flag to reject extra columns and a --lenient flag to allow them. This is important for real-world CSV, which often changes over time. In this project, implement strict mode by default and provide a lenient mode that logs a warning but still processes known columns.

Finally, validation should consider null/missing values. Decide on a representation ("" empty string, NULL, or -). Define how you will treat empty fields: maybe some are optional, others are required. This ensures your downstream aggregations do not break. For example, summing amounts should skip empty values, but counting rows should still include them.

Portability also affects validation. If your pipeline relies on GNU awk features like FPAT, you should detect gawk and fail with a clear message when it is missing. This avoids subtle misparsing on systems with older awk implementations. In tests, use small fixtures that include quoted commas, missing fields, and invalid numbers so you can verify both the parser and the validator in one pass. These fixtures become the truth source for correctness and help prevent regressions when you add new rules later.
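One way to implement the gawk detection (the function name and message are illustrative; exit code 2 follows this document's CLI convention in §3.7.4):

```shell
require_gawk() {
  command -v gawk >/dev/null 2>&1 && return 0
  echo "ERROR: csvpipe requires GNU awk (gawk)" >&2
  return 2
}

if require_gawk; then
  echo "gawk found"
else
  echo "gawk missing (would exit 2)"
fi
```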

How this fits in the project

This concept directly enables the pipeline to accept real CSV and produce trustworthy normalized output. It is the backbone of the validation and reporting logic.

Definitions & key terms

  • FS: Field separator in awk.
  • FPAT: Field pattern in GNU awk (defines what a field looks like).
  • Validation: Rule checking that enforces schema constraints.
  • Normalization: Standardizing values after parsing.
  • Lenient mode: A mode that attempts to repair or ignore minor issues.

Mental model diagram (ASCII)

raw line -> [FPAT parser] -> fields[] -> [validate] -> ok? -> normalize -> output
                                       -> invalid -> error log

How it works (step-by-step, with invariants and failure modes)

  1. Parse line into fields using FPAT.
  2. If field count != schema length, mark invalid.
  3. Validate types and required fields.
  4. Normalize (trim, unescape quotes, cast numbers).
  5. Emit normalized row or log invalid with reason.

Invariant: Only rows that satisfy schema rules reach the output. Failure modes: missing gawk, mismatched FPAT, unexpected quoting.

Minimal concrete example

# Validate numeric amount in column 3
awk -v FPAT='"[^"]+"|[^,]+' '{
  amt=$3; gsub(/"/,"",amt);
  if (amt ~ /^[0-9]+(\.[0-9]+)?$/) print $0; else print "INVALID", NR, $0 > "bad.csv"
}' sales.csv

Common misconceptions

  • “FPAT is optional” -> It is essential for quoted CSV fields.
  • “Validation is just field count” -> Types and ranges matter for correctness.
  • “Invalid rows should be dropped silently” -> You lose traceability.

Check-your-understanding questions

  1. Why is FPAT preferred over FS for CSV parsing?
  2. What is the difference between strict and lenient validation?
  3. How would you log invalid rows without breaking the output stream?
  4. Why should normalization happen after parsing?

Check-your-understanding answers

  1. FPAT can match quoted fields containing commas, which FS cannot handle.
  2. Strict rejects any mismatch; lenient attempts to recover and logs warnings.
  3. Redirect invalid rows to stderr or a separate file while printing valid rows to stdout.
  4. You need correct field boundaries before you can safely clean values.

Real-world applications

  • Cleaning data exports for analytics.
  • Validating datasets before database import.
  • Enforcing schema integrity in ETL pipelines.

Where you’ll apply it

  • See §3.2 for functional requirements around validation and error logs.
  • See §5.10 Phase 2 for implementation of strict vs lenient mode.
  • Also used in: P06 Personal DevOps Toolkit as a subcommand.

References

  • “Effective awk Programming” by Arnold Robbins, FPAT and validation examples
  • GNU awk manual: Field Patterns

Key insights

Parsing and validation are separate responsibilities; mixing them makes errors harder to diagnose.

Summary

Awk with FPAT enables correct CSV parsing. Robust validation rules and explicit error logging turn raw data into trustworthy inputs for analytics.

Homework/Exercises to practice the concept

  1. Implement a validation rule that enforces a date format YYYY-MM-DD.
  2. Create a CSV with missing fields and confirm your pipeline logs them.
  3. Add a --lenient mode that fills missing numeric fields with 0.

Solutions to the homework/exercises

  1. Use a regex like /^[0-9]{4}-[0-9]{2}-[0-9]{2}$/.
  2. Check NF against expected count and write invalid rows to a file.
  3. In lenient mode, if a field is empty, assign 0 before normalization.
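A sketch combining solutions 1 and 3 (the LENIENT variable name is illustrative; the date regex is spelled out for portability to awks without {n} intervals):

```shell
printf '%s\n' 'acme,2026-01-01,,paid' 'beta,01/02/2026,5,paid' |
awk -F',' -v LENIENT=1 '{
  if ($2 !~ /^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]$/) { print "bad_date line " NR; next }
  if ($3 == "" && LENIENT) $3 = 0   # lenient repair: fill missing amount
  print $1 "," $2 "," $3 "," $4
}'
```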

2.3 Normalization, Deduplication, and Streaming Aggregations

Fundamentals

Normalization standardizes data into a consistent format: trimming whitespace, standardizing case, and converting numbers. Deduplication removes repeated records, often by key (such as invoice ID). Aggregation computes summaries such as total revenue per product. In a streaming pipeline, normalization must happen before deduplication and aggregation to ensure consistency. Deduplication is trickier in streaming because it requires remembering previously seen keys. This is feasible when the key space is bounded or when you can allow approximate dedupe (e.g., per-day). Aggregation with awk uses associative arrays keyed by the dimension you want to summarize. These operations turn raw CSV into actionable insights.

Deep Dive into the concept

Normalization is the hidden work that makes data usable. If one row has "Widget-A" and another has " widget-a ", your aggregation by product will treat them as different values unless you normalize. A good normalization policy is explicit: trim leading/trailing whitespace, collapse internal whitespace if appropriate, and standardize case for categorical fields. For dates, convert to a canonical format. For numbers, remove currency symbols and commas. Each normalization rule should be tied to a field type in the schema; otherwise you risk corrupting data (for example, changing case in a case-sensitive ID field). A disciplined normalization step ensures downstream aggregates are accurate.

Deduplication in a stream is a classic trade-off between correctness and memory. The simplest approach is to maintain a set of keys seen so far and drop duplicates. This works if the dataset is small enough to fit in memory or if the key space is bounded. In practice, you can limit deduplication to a window (e.g., per day or per file). This means you only keep keys for the current batch and reset between runs. In a streaming pipeline, dedupe is often performed per run or per file, which is acceptable for many ETL tasks. You should document this limitation and provide a --dedupe-key option so the user controls what counts as a duplicate.

Aggregation with awk uses associative arrays: sum[product] += amount. This is efficient and expressive, but you must be careful about numeric conversion and missing values. awk treats empty strings as 0 in numeric contexts, which can hide data quality problems. A robust pipeline should validate numeric fields before aggregating and keep a separate count of invalid rows. Another detail is output ordering: awk iterates over arrays in arbitrary order. If you want deterministic output for tests, you must sort the output (e.g., for (k in sum) print ... | "sort"). Sorting is not strictly streaming because it buffers output, but the output size is bounded by the number of keys, not the number of rows, so it is typically safe.

Normalization and aggregation also interact with deduplication. If duplicates differ only in whitespace or casing, normalize before dedupe so that logically identical records are treated as duplicates. If duplicates differ in numeric fields, you need a policy: keep the first, keep the last, or merge values. For a learning project, the simplest rule is “keep first” and log duplicates. This preserves determinism and makes the behavior easy to explain.

A subtle issue arises with numeric precision. awk uses double-precision floating point, which can introduce rounding errors for large sums or financial data. If exact decimal arithmetic is required, you should note this limitation and suggest using bc or a language with decimal libraries. For this project, accept floating point but format with fixed decimals in output.

Finally, normalization enables schema mapping. Many CSV files have different column orders. A robust pipeline can allow a header row and map columns by name. In awk, you can parse the header line, build a map from column name to index, and then use that map to extract fields. This makes your pipeline resilient to column reordering. If you implement this, be explicit about the required column names.
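A header-mapping sketch (column names are illustrative):

```shell
printf '%s\n' 'amount,company' '100,acme' '50,beta' |
awk -F',' '
NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }  # name -> index map
{ print $(col["company"]) ": " $(col["amount"]) }        # access fields by name
'
```

Because fields are addressed through the map, reordering the input columns would not change the output.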

How this fits in the project

Normalization, deduplication, and aggregation are the heart of the transformation pipeline. Without them, you only have raw parsing. These steps produce the final clean output and summary reports.

Definitions & key terms

  • Normalization: Converting values to a consistent format.
  • Deduplication: Removing repeated records based on a key.
  • Aggregation: Summarizing data (sum, count, average).
  • Deterministic output: Output that does not depend on hash order or run time.
  • Schema mapping: Using header names to locate fields.

Mental model diagram (ASCII)

raw CSV -> parse -> normalize -> validate -> dedupe -> aggregate -> report

How it works (step-by-step, with invariants and failure modes)

  1. Parse fields and normalize strings/numbers.
  2. Validate required fields and types.
  3. Build a dedupe key and check if seen.
  4. If not seen, emit record and update aggregates.
  5. At end, emit summary and counts.

Invariant: Each output record is normalized and validated. Failure modes: unbounded dedupe set, inconsistent normalization rules.

Minimal concrete example

# Sum revenue per product after normalization
# (simplified: assumes no quoted commas; use FPAT for full CSV parsing)
awk -F',' '{
  gsub(/^[[:space:]]+|[[:space:]]+$/, "", $1); product=tolower($1);
  amount=$3+0; sum[product]+=amount
} END {for (p in sum) printf "%s,%.2f\n", p, sum[p]}' sales.csv

Common misconceptions

  • “Deduplication is always safe” -> It depends on your key and business rules.
  • “awk output order is stable” -> It is not; you need sorting for determinism.
  • “Normalization is just trimming” -> It includes casing, type casting, and canonical formats.

Check-your-understanding questions

  1. Why should normalization happen before deduplication?
  2. How can you keep output deterministic with awk aggregations?
  3. What is a safe dedupe key for a sales dataset?
  4. Why can floating point sums be risky for money?

Check-your-understanding answers

  1. It ensures logically identical records map to the same key.
  2. Sort the output or explicitly order keys.
  3. An invoice ID or unique transaction ID, not product name.
  4. Floating point can introduce rounding errors; decimals may be required.

Real-world applications

  • Cleaning CRM exports and deduplicating contacts.
  • Generating revenue reports from sales feeds.
  • Normalizing data before loading into a warehouse.

Where you’ll apply it

  • See §3.2 for deduplication and aggregation requirements.
  • See §3.7 for the golden path output and summary report.
  • Also used in: P04 System Inventory & Audit for aggregation of file metadata.

References

  • “Data Pipelines Pocket Reference” (CSV normalization patterns)
  • “Effective awk Programming” by Arnold Robbins

Key insights

Normalization and aggregation are not optional polish; they are the logic that turns raw CSV into decisions.

Summary

A streaming pipeline can normalize, deduplicate, and aggregate data in one pass if you carefully manage state and enforce a schema.

Homework/Exercises to practice the concept

  1. Normalize product names to lowercase and count distinct products.
  2. Implement deduplication by transaction ID and log duplicates.
  3. Generate a top-5 report sorted by revenue.

Solutions to the homework/exercises

  1. Use tolower() and an associative array of seen products.
  2. Track seen[id] and write duplicates to duplicates.csv.
  3. Output product,sum lines and pipe to sort -t',' -k2,2 -rn | head -5 to rank numerically by revenue.
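For exercise 3, the sort needs a field separator and a numeric key on the revenue column, e.g.:

```shell
printf '%s\n' 'widget-b,700' 'widget-a,1200' 'widget-c,500' |
sort -t',' -k2,2 -rn |    # numeric, descending, on column 2
head -5
```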

3. Project Specification

3.1 What You Will Build

A CLI tool called csvpipe that:

  • Reads CSV from a file or stdin.
  • Validates against a defined schema (field count, types, and required fields).
  • Normalizes data (trim whitespace, standardize casing, numeric parsing).
  • Deduplicates rows by a chosen key.
  • Produces a normalized CSV output and summary statistics.
  • Emits an error report for invalid rows and duplicates.

Included:

  • Strict and lenient validation modes.
  • Configurable schema via a simple config file.
  • Deterministic output and summaries.

Excluded:

  • Full SQL-style joins or multi-file merging.
  • Binary formats (Parquet, Avro).

3.2 Functional Requirements

  1. Input: Read from file or stdin.
  2. CSV parsing: Correctly parse quoted fields with commas.
  3. Schema validation: Enforce field count, types, and required fields.
  4. Normalization: Trim whitespace, standardize case, convert numbers.
  5. Deduplication: Drop duplicates by configurable key.
  6. Summary report: Output counts, invalid rows, and top-N aggregates.
  7. Error logging: Write invalid rows to invalid.csv with reasons.
  8. Exit codes: Report success/failure as numeric exit codes.

3.3 Non-Functional Requirements

  • Performance: Process at least 100,000 rows per minute.
  • Reliability: Deterministic output given the same input.
  • Usability: Clear CLI help and descriptive error messages.

3.4 Example Usage / Output

$ ./csvpipe.sh sales.csv --schema schema.json --normalize --dedupe --summary
Input rows: 125,402
Output rows: 123,988
Invalid rows: 1,414
Duplicates removed: 218
Top 5 products by revenue:
  widget-a  $1,203,110.00
  widget-c  $998,442.00
  widget-b  $880,923.00

3.5 Data Formats / Schemas / Protocols

Input CSV schema (example):

columns: [company, date, amount, status]
company: string (required)
date: YYYY-MM-DD (required)
amount: decimal (required)
status: enum(paid, pending, failed)

Normalized output CSV:

company,date,amount,status
acme inc,2026-01-01,1200.00,paid

Invalid row log format:

line,reason,raw
14,missing_field,"""Acme, Inc"",2026-01-01,1200"

3.6 Edge Cases

  • Embedded commas in quoted fields.
  • Missing trailing newline.
  • Extra columns or missing columns.
  • Non-numeric amounts (“N/A”).
  • Duplicate rows with conflicting data.

3.7 Real World Outcome

3.7.1 How to Run (Copy/Paste)

./csvpipe.sh sales.csv --schema schema.json --normalize --dedupe --summary

3.7.2 Golden Path Demo (Deterministic)

$ ./csvpipe.sh sample/sales.csv --schema sample/schema.json --normalize --dedupe --summary
Input rows: 5
Output rows: 4
Invalid rows: 1
Duplicates removed: 0
Top 3 products by revenue:
  widget-a  $1200.00
  widget-b  $700.00
  widget-c  $500.00

3.7.3 Failure Demo (Deterministic)

$ ./csvpipe.sh sample/missing.csv --schema sample/schema.json
ERROR: file not found: sample/missing.csv
exit code: 2

3.7.4 If CLI: exact terminal transcript

$ ./csvpipe.sh sample/sales.csv --schema sample/schema.json --summary
Input rows: 5
Output rows: 4
Invalid rows: 1
$ echo $?
0

Exit codes:

  • 0: Success, output generated.
  • 1: Validation errors exceed allowed threshold.
  • 2: Missing file or invalid arguments.

4. Solution Architecture

4.1 High-Level Design

Input CSV -> [Parser] -> [Validator] -> [Normalizer] -> [Dedupe] -> [Aggregator] -> Output
                                        |                             |
                                        v                             v
                                   invalid.csv                     summary report

4.2 Key Components

Component  | Responsibility               | Key Decisions
Parser     | CSV-aware field extraction   | Require gawk FPAT
Validator  | Enforce schema and rules     | Strict by default, lenient optional
Normalizer | Clean and standardize fields | Per-field normalization rules
Dedupe     | Remove duplicate keys        | Keep first occurrence
Aggregator | Compute summaries            | Use associative arrays and sort

4.3 Data Structures (No Full Code)

# seen[key] = 1 for dedupe
# sum[product] = total revenue
# invalid[reason] = count

4.4 Algorithm Overview

Key Algorithm: Streaming Validation + Aggregation

  1. Parse line with FPAT.
  2. Validate field count and types.
  3. Normalize fields.
  4. If dedupe enabled, skip seen keys.
  5. Emit normalized record and update aggregates.

Complexity Analysis:

  • Time: O(n) for n rows.
  • Space: O(k) for k unique keys and aggregates.
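A one-pass sketch of the algorithm above (assumption: the sample has no quoted commas, so FS="," stands in for the FPAT parser of step 1; the dedupe key and column roles are illustrative):

```shell
printf '%s\n' 'acme,2026-01-01,100,paid' 'acme,2026-01-01,100,paid' 'beta,2026-01-02,x,paid' |
awk -F',' '{
  if (NF != 4 || $3 !~ /^[0-9.]+$/) { invalid++; next }  # step 2: validate
  key = $1 "," $2 "," $3                                 # steps 3-4: dedupe key
  if (seen[key]++) { dupes++; next }
  out_rows++; sum[$1] += $3                              # step 5: emit + aggregate
} END { printf "valid=%d invalid=%d dupes=%d acme=%.2f\n", out_rows, invalid, dupes, sum["acme"] }'
```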

5. Implementation Guide

5.1 Development Environment Setup

# Install GNU awk on macOS
brew install gawk

5.2 Project Structure

csvpipe/
├── csvpipe.sh
├── lib/
│   ├── parse.awk
│   ├── validate.awk
│   ├── normalize.awk
│   └── aggregate.awk
├── sample/
│   ├── sales.csv
│   └── schema.json
└── tests/
    └── golden-output.txt

5.3 The Core Question You’re Answering

“How do I treat CSV as a stream of records and transform it into clean, validated output in one pass?”

5.4 Concepts You Must Understand First

  1. CSV quoting rules
    • How do quotes escape commas?
    • How do you detect multiline records?
  2. awk FPAT
    • Why does FPAT outperform FS for CSV?
    • What is the portability cost?
  3. Validation vs normalization
    • Which happens first and why?

5.5 Questions to Guide Your Design

  1. What is your schema and how strict should it be?
  2. What dedupe key makes sense for your dataset?
  3. How will you handle invalid rows (drop, log, or repair)?

5.6 Thinking Exercise

Given a file with 1 million rows, what is the maximum number of unique keys you can keep in memory before performance degrades? Estimate memory usage for 1 million keys.
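A back-of-envelope answer, assuming roughly 100 bytes of total per-key overhead (the real figure depends on the awk implementation and key length):

```shell
# Rough memory estimate for 1 million dedupe keys at an assumed ~100 bytes each.
awk 'BEGIN { keys = 1000000; bytes_per_key = 100
             printf "approx %.0f MB for %d keys\n", keys * bytes_per_key / 1e6, keys }'
```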

5.7 The Interview Questions They’ll Ask

  1. Why is CSV parsing harder than splitting on commas?
  2. How would you validate numeric ranges in awk?
  3. What trade-offs exist between strict and lenient validation?

5.8 Hints in Layers

Hint 1: Start by parsing

awk -v FPAT='"[^"]+"|[^,]+' '{print NF}' sales.csv

Hint 2: Add validation

awk -v FPAT='"[^"]+"|[^,]+' 'NF==4 {print} NF!=4 {print "INVALID", NR > "invalid.csv"}' sales.csv

Hint 3: Normalize and aggregate

awk -v FPAT='"[^"]+"|[^,]+' '{gsub(/"/,"",$1); sum[$1]+=$3} END {for (k in sum) print k, sum[k]}' sales.csv

Hint 4: Add dedupe

Track seen[$key] and skip if already seen.
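The classic awk idiom for this hint, keyed on column 1 here:

```shell
printf '%s\n' 'id1,a' 'id2,b' 'id1,c' |
awk -F',' '!seen[$1]++'   # prints a row only the first time its key appears
```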

5.9 Books That Will Help

Topic           | Book                                   | Chapter
CSV parsing     | “Effective awk Programming”            | Data parsing chapters
Shell pipelines | “The Linux Command Line”               | Ch. 6-7
Data validation | “Data Engineering” (selected sections) | Validation chapters

5.10 Implementation Phases

Phase 1: Foundation (2-3 days)

Goals: Parse CSV correctly and output normalized rows.

Tasks:

  1. Implement FPAT parsing with quote handling.
  2. Strip quotes and trim whitespace for string fields.

Checkpoint: Output CSV has correct field counts and clean values.

Phase 2: Core Functionality (4-5 days)

Goals: Add validation, error logging, and dedupe.

Tasks:

  1. Implement schema validation rules and error reporting.
  2. Add dedupe by key.

Checkpoint: Invalid rows are logged, duplicates are removed.

Phase 3: Reporting & Polish (2-3 days)

Goals: Add summary reports and deterministic ordering.

Tasks:

  1. Implement top-N aggregates and totals.
  2. Sort output for deterministic tests.

Checkpoint: Summary report matches golden output exactly.

5.11 Key Implementation Decisions

Decision     | Options             | Recommendation | Rationale
CSV parsing  | FS vs FPAT          | FPAT           | Correct handling of quotes
Dedupe scope | global vs per-run   | per-run        | Bounded memory
Output order | arbitrary vs sorted | sorted         | Deterministic tests

6. Testing Strategy

6.1 Test Categories

Category          | Purpose             | Examples
Unit Tests        | Field parsing       | quoted commas, empty fields
Integration Tests | End-to-end pipeline | sample -> golden output
Edge Case Tests   | Malformed data      | missing quotes, extra columns

6.2 Critical Test Cases

  1. Quoted commas in company names.
  2. Missing amount field.
  3. Duplicate transaction IDs.

6.3 Test Data

"Acme, Inc",2026-01-01,1200,paid
Widget-B,2026-01-01,700,paid
Widget-C,2026-01-02,500,failed
"Bad Row",2026-01-03,,paid

7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

Pitfall                 | Symptom            | Solution
Naive FS split          | Columns shift      | Use FPAT or custom parser
Silent invalid rows     | Counts don’t match | Log invalid rows and reasons
Non-deterministic order | Tests fail         | Sort outputs

7.2 Debugging Strategies

  • Print NF and field values for suspect lines.
  • Use a small sample file to reproduce errors.
  • Add a --debug flag to output parsed fields.

7.3 Performance Traps

Avoid sorting huge raw streams; sort only aggregated summaries.


8. Extensions & Challenges

8.1 Beginner Extensions

  • Add TSV support with --delimiter.
  • Add a --strict flag that fails on any invalid row.

8.2 Intermediate Extensions

  • Support header-based column mapping.
  • Add JSON output for summary reports.

8.3 Advanced Extensions

  • Implement approximate dedupe with Bloom filters.
  • Add configurable data quality thresholds.

9. Real-World Connections

9.1 Industry Applications

  • Data ingestion pipelines for analytics.
  • Cleaning CRM or billing exports.

9.2 Related Tools

  • csvkit: Python tools for CSV.
  • miller (mlr): CLI tool for data reshaping.

9.3 Interview Relevance

  • Stream processing and schema validation.
  • Data quality and deterministic reporting.

10. Resources

10.1 Essential Reading

  • “Effective awk Programming” by Arnold Robbins
  • RFC 4180

10.2 Video Resources

  • “Data Cleaning with awk” (tutorial)
  • “CSV Parsing Pitfalls” (talk)

10.3 Tools & Documentation

  • gawk manual: FPAT and field parsing
  • sed(1) manual page

11. Self-Assessment Checklist

11.1 Understanding

  • I can explain why naive CSV splitting fails.
  • I can describe how FPAT works.
  • I can explain my dedupe policy and its limitations.

11.2 Implementation

  • All functional requirements are met.
  • Invalid rows are logged with reasons.
  • Summary report is deterministic.

11.3 Growth

  • I can describe how to scale this pipeline.
  • I can explain the project clearly in an interview.

12. Submission / Completion Criteria

Minimum Viable Completion:

  • Correct CSV parsing with quoting.
  • Validation with error logging.
  • Summary report generation.

Full Completion:

  • Deduplication and deterministic ordering.
  • Strict/lenient modes implemented.
  • Golden output tests pass.

Excellence (Going Above & Beyond):

  • Header-based schema mapping.
  • Approximate dedupe or performance benchmarks.
  • JSON summary output.