Project 2: CSV/Data Transformation Pipeline
Build a streaming CSV pipeline that cleans, validates, normalizes, and summarizes data at scale.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Level 2: Intermediate |
| Time Estimate | 1-2 weeks |
| Main Programming Language | Shell + awk |
| Alternative Programming Languages | Python, Go |
| Coolness Level | Level 3: Practical Data Engineering |
| Business Potential | Level 4: Data Services / ETL |
| Prerequisites | Basic awk, sed substitution, CLI pipelines |
| Key Topics | CSV parsing, validation, normalization, aggregation |
1. Learning Objectives
By completing this project, you will:
- Parse CSV safely, including quoted fields and embedded commas.
- Design streaming validation rules that do not require loading all data.
- Normalize inconsistent records into a canonical schema.
- Implement deduplication and summary statistics with awk arrays.
- Build a CLI interface with deterministic outputs and error reports.
2. All Theory Needed (Per-Concept Breakdown)
2.1 CSV Grammar, Quoting, and Edge Cases
Fundamentals
CSV looks simple but has subtle rules that impact parsing. A CSV file is a sequence of records separated by newlines. Fields are separated by commas, but if a field contains a comma, newline, or quote, it must be wrapped in quotes. Inside a quoted field, quotes are escaped by doubling them. This means you cannot parse CSV correctly by simply splitting on commas. Tools like awk can parse CSV if configured correctly (using FPAT or a custom parser), but you must understand the grammar first. Proper parsing is crucial for data correctness: a misparsed line can shift columns, corrupt numeric values, or break downstream aggregations. In a streaming pipeline, CSV grammar defines the record boundaries and determines whether your line-by-line processing is safe. If you ignore quoting rules, your pipeline will silently produce incorrect data.
Deep Dive into the concept
CSV is a minimal data interchange format with a deceptively rich grammar. The RFC 4180 definition specifies that records are separated by CRLF (or LF in practice), fields are separated by commas, and fields containing special characters must be quoted. A quoted field may include commas, newlines, or quotes. Quotes are escaped by doubling them (""). This leads to an important consequence: a record is not always a single line in the file. If a field contains a newline, the record spans multiple lines. Many real-world CSV files avoid embedded newlines, but you cannot assume this. Your pipeline should either explicitly disallow multiline fields (and log them as invalid) or implement a record-aware parser that can reassemble them.
In a streaming toolset like awk/sed/grep, this limitation matters. awk is line-oriented by default, so it treats each line as a record. If your CSV has multiline fields, awk will split the record and parse it incorrectly. For a learning project, you can choose to disallow multiline fields and document it, but you should detect them. A practical detection approach is to count quotes: if a line contains an odd number of quotes, it likely starts or ends a multiline field. You can buffer lines until quotes are balanced, then treat the combined lines as one record. This is a common technique in streaming CSV processors.
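The quote-counting technique above can be sketched as a small awk pass. This is a sketch, not the project's final parser: it assumes RFC 4180 doubled-quote escaping, joins continuation lines with a space (a documented simplification; a production parser would preserve the newline), and `multiline.csv` is a hypothetical fixture created inline.

```shell
# Hypothetical fixture: one quoted field spans two physical lines.
printf '%s\n' 'id,note' '1,"line one' 'line two"' '2,ok' > multiline.csv

# Buffer physical lines until the quote count is even, then emit the
# reassembled logical record (continuation lines joined with a space).
awk '
{
  buf = (buf == "") ? $0 : buf " " $0
  if (gsub(/"/, "&", buf) % 2 == 0) {   # gsub returns the number of quotes
    print buf
    buf = ""
  }
}
END { if (buf != "") print "UNTERMINATED: " buf > "/dev/stderr" }
' multiline.csv
```

Note that `gsub(/"/, "&", buf)` replaces each quote with itself, so it counts quotes without modifying the buffer.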
Quoted commas are the most common issue. Consider "Acme, Inc",2026-01-01,1200,paid. A naive FS="," split yields five fields instead of four because of the comma inside the quotes. The correct approach is to use FPAT (field pattern) in GNU awk, which defines what a field looks like rather than how to split. A suitable pattern is `"([^"]|"")*"|[^,]+`, which matches either a quoted field (allowing doubled-quote escapes) or an unquoted field. In POSIX awk, FPAT is not available, so you may need a custom parser using split() and state tracking.
CSV also interacts with whitespace normalization. Some files include spaces after commas, while others do not. Some use tabs instead of commas (TSV). If your parser strips whitespace too early, you might change data values (e.g., leading spaces in text fields). The correct approach is to parse first, then normalize field-by-field based on schema rules. For numeric fields, you can trim whitespace and validate numeric ranges. For string fields, you can preserve or normalize case based on requirements. This makes data transformations explicit rather than accidental.
The schema is another critical concept. CSV is schema-less by default: it is just rows of fields. Your pipeline needs to impose a schema: expected column names, types, and required fields. In a streaming context, schema validation must be row-by-row. This means you need a strategy for missing fields, extra fields, or schema evolution. The simplest approach is strict: drop rows that do not match the expected field count and log them. A more flexible approach is to allow optional fields and fill missing values with defaults. For this project, define a strict schema and log invalid rows, but provide a --lenient mode that attempts to repair simple issues (like extra trailing commas).
Finally, encoding and locale can cause subtle problems. CSV is often UTF-8, but your tools operate on bytes. If you perform regex validation like [A-Z] or \w, locale settings can change what is considered a letter. Use explicit character classes or set LC_ALL=C if you want byte-level behavior. For numeric parsing, be aware of locale-specific decimal separators (comma vs dot). You can choose to support only dot-separated decimals and document that requirement. Consistency is key for deterministic pipelines.
How this fits into the project
This project is about building a pipeline that treats CSV as structured records. Understanding the grammar determines whether your field extraction, normalization, and validation are correct or silently corrupted.
Definitions & key terms
- Record: One CSV row (may span multiple lines if quoted fields contain newlines).
- Field: One column value.
- Quoted field: Field wrapped in quotes to allow commas/newlines.
- FPAT: awk feature to define fields by pattern (GNU awk).
- Schema: Expected column names, types, and constraints.
Mental model diagram (ASCII)
Raw line -> [CSV parser] -> fields[] -> [validation] -> normalized record
"Acme, Inc",2026-01-01,1200,paid
^ comma inside quotes is part of field
How it works (step-by-step, with invariants and failure modes)
- Read a raw line (or combined lines if multiline quoting detected).
- Parse into fields using a CSV-aware method.
- Validate field count against schema.
- Normalize fields (trim, type cast, fix case).
- Emit normalized CSV or log invalid row.
Invariant: A valid record must match the schema field count. Failure modes: unmatched quotes, extra delimiters, unexpected encodings.
Minimal concrete example
# GNU awk example with FPAT (simplified pattern: no empty fields or doubled quotes)
awk -v FPAT='"[^"]+"|[^,]+' '{print $1, $2, $3, $4}' sales.csv
Common misconceptions
- “CSV = split on commas” -> Quoted fields make this incorrect.
- “Records always fit on one line” -> Quoted fields can contain newlines.
- “Whitespace can always be trimmed” -> Some text fields have meaningful spaces.
Check-your-understanding questions
- Why does `FS=","` fail for fields with embedded commas?
- How can you detect a multiline CSV record in a stream?
- What is the difference between parsing and normalization?
- Why should you define a schema for CSV processing?
Check-your-understanding answers
- It splits inside quoted fields, producing too many columns.
- By counting quotes; an odd number indicates an open quote.
- Parsing extracts fields; normalization cleans and transforms them.
- Without a schema, you cannot validate or enforce correctness.
Real-world applications
- Importing sales data into analytics systems.
- Cleaning CRM exports before database ingestion.
- ETL pipelines for reporting and dashboards.
Where you’ll apply it
- See §3.5 for the expected CSV schemas.
- See §4.4 for the parsing algorithm and schema validation flow.
- Also used in: P01 Log Analyzer for structured log parsing.
References
- RFC 4180: Common Format and MIME Type for CSV Files
- “Effective awk Programming” by Arnold Robbins, data parsing sections
Key insights
CSV parsing is a grammar problem; correct parsing is the foundation of trustworthy metrics.
Summary
CSV has explicit quoting rules that invalidate naive splitting. A reliable pipeline must parse with awareness of quotes and enforce a schema before normalization.
Homework/Exercises to practice the concept
- Create a CSV file with a quoted field that contains commas and verify parsing.
- Write a script that flags rows with an odd number of quotes.
- Design a schema for a dataset and list validation rules.
Solutions to the homework/exercises
- Use FPAT or a custom parser; verify the field count remains correct.
- Count quotes with `gsub(/"/,"",$0)` and check whether the count is odd.
- Include field types, required fields, and numeric ranges.
2.2 Awk Parsing, Field Patterns, and Validation Rules
Fundamentals
awk treats each record as a line and splits it into fields using a field separator (FS). This works for simple delimited text but is insufficient for CSV with quoting. GNU awk provides FPAT, which flips parsing: instead of specifying separators, you specify what a valid field looks like. This is powerful for CSV parsing because you can match either quoted or unquoted fields. Validation is the next step: after parsing, you must check field counts, required fields, numeric ranges, and enumerated values. Validation should be streaming-friendly: each row is validated independently and either passed through or logged as invalid. The pipeline should never crash on a single bad row; it should count errors and produce a report.
Deep Dive into the concept
Parsing with awk is a careful balance between correctness and portability. POSIX awk supports FS but not FPAT. GNU awk extends awk with FPAT, making it ideal for CSV. If you need portability across macOS (BSD awk) and Linux, you must decide whether to require gawk or implement a pure-awk parser. In this project, you can require GNU awk and document that users need to install it on macOS. This simplifies parsing and makes your pipeline robust.
With FPAT, you define a pattern that matches valid fields. A common CSV FPAT is:
FPAT = "([^,]+)|(\"([^\"]|\"\")*\")"
This pattern matches either unquoted fields without commas or quoted fields that may contain doubled quotes. The parser then yields fields with quotes intact. You must strip the quotes and unescape doubled quotes to normalize the data. This means parsing and normalization are separate stages: parsing extracts fields, normalization cleans them. This separation makes validation cleaner because you can validate on normalized values.
Validation rules should align with the schema. For example, if column 3 is amount, your validation rule may require it to be numeric and within a range. If column 4 is status, it may need to be in {paid, pending, failed}. You should implement validation as a function in awk, returning a boolean and a reason string. When invalid, the row should be emitted to an error file with line number, raw row, and reason. This design is critical for traceability and for debugging data issues later.
Another important aspect is error accounting. Many pipelines fail silently when rows are dropped. Instead, maintain counters: total rows, valid rows, invalid rows by reason. This gives visibility and allows you to set thresholds (e.g., fail the pipeline if more than 1% of rows are invalid). In awk, this is just another associative array: invalid[reason]++. At the end, print a summary table.
Field validation can also be extended to schema evolution. If you encounter extra columns, you can choose to ignore them or treat them as invalid. A robust pipeline should support a --strict flag to reject extra columns and a --lenient flag to allow them. This is important for real-world CSV, which often changes over time. In this project, implement strict mode by default and provide a lenient mode that logs a warning but still processes known columns.
Finally, validation should consider null/missing values. Decide on a representation ("" empty string, NULL, or -). Define how you will treat empty fields: maybe some are optional, others are required. This ensures your downstream aggregations do not break. For example, summing amounts should skip empty values, but counting rows should still include them.
Portability also affects validation. If your pipeline relies on GNU awk features like FPAT, you should detect gawk and fail with a clear message when it is missing. This avoids subtle misparsing on systems with older awk implementations. In tests, use small fixtures that include quoted commas, missing fields, and invalid numbers so you can verify both the parser and the validator in one pass. These fixtures become the truth source for correctness and help prevent regressions when you add new rules later.
How this fits into the project
This concept directly enables the pipeline to accept real CSV and produce trustworthy normalized output. It is the backbone of the validation and reporting logic.
Definitions & key terms
- FS: Field separator in awk.
- FPAT: Field pattern in GNU awk (defines what a field looks like).
- Validation: Rule checking that enforces schema constraints.
- Normalization: Standardizing values after parsing.
- Lenient mode: A mode that attempts to repair or ignore minor issues.
Mental model diagram (ASCII)
raw line -> [FPAT parser] -> fields[] -> [validate] -> ok? -> normalize -> output
                                                        |
                                                        +-> invalid -> error log
How it works (step-by-step, with invariants and failure modes)
- Parse line into fields using FPAT.
- If field count != schema length, mark invalid.
- Validate types and required fields.
- Normalize (trim, unescape quotes, cast numbers).
- Emit normalized row or log invalid with reason.
Invariant: Only rows that satisfy schema rules reach the output. Failure modes: missing gawk, mismatched FPAT, unexpected quoting.
Minimal concrete example
# Validate numeric amount in column 3
awk -v FPAT='"[^"]+"|[^,]+' '{
amt=$3; gsub(/"/,"",amt);
if (amt ~ /^[0-9]+(\.[0-9]+)?$/) print $0; else print "INVALID", NR, $0 > "bad.csv"
}' sales.csv
Common misconceptions
- “FPAT is optional” -> It is essential for quoted CSV fields.
- “Validation is just field count” -> Types and ranges matter for correctness.
- “Invalid rows should be dropped silently” -> You lose traceability.
Check-your-understanding questions
- Why is FPAT preferred over FS for CSV parsing?
- What is the difference between strict and lenient validation?
- How would you log invalid rows without breaking the output stream?
- Why should normalization happen after parsing?
Check-your-understanding answers
- FPAT can match quoted fields containing commas, which FS cannot handle.
- Strict rejects any mismatch; lenient attempts to recover and logs warnings.
- Redirect invalid rows to stderr or a separate file while printing valid rows to stdout.
- You need correct field boundaries before you can safely clean values.
Real-world applications
- Cleaning data exports for analytics.
- Validating datasets before database import.
- Enforcing schema integrity in ETL pipelines.
Where you’ll apply it
- See §3.2 for functional requirements around validation and error logs.
- See §5.10 Phase 2 for implementation of strict vs lenient mode.
- Also used in: P06 Personal DevOps Toolkit as a subcommand.
References
- “Effective awk Programming” by Arnold Robbins, FPAT and validation examples
- GNU awk manual: Field Patterns
Key insights
Parsing and validation are separate responsibilities; mixing them makes errors harder to diagnose.
Summary
Awk with FPAT enables correct CSV parsing. Robust validation rules and explicit error logging turn raw data into trustworthy inputs for analytics.
Homework/Exercises to practice the concept
- Implement a validation rule that enforces a date format YYYY-MM-DD.
- Create a CSV with missing fields and confirm your pipeline logs them.
- Add a `--lenient` mode that fills missing numeric fields with 0.
Solutions to the homework/exercises
- Use a regex like `/^[0-9]{4}-[0-9]{2}-[0-9]{2}$/`.
- Check `NF` against the expected count and write invalid rows to a file.
- In lenient mode, if a field is empty, assign `0` before normalization.
2.3 Normalization, Deduplication, and Streaming Aggregations
Fundamentals
Normalization standardizes data into a consistent format: trimming whitespace, standardizing case, and converting numbers. Deduplication removes repeated records, often by key (such as invoice ID). Aggregation computes summaries such as total revenue per product. In a streaming pipeline, normalization must happen before deduplication and aggregation to ensure consistency. Deduplication is trickier in streaming because it requires remembering previously seen keys. This is feasible when the key space is bounded or when you can allow approximate dedupe (e.g., per-day). Aggregation with awk uses associative arrays keyed by the dimension you want to summarize. These operations turn raw CSV into actionable insights.
Deep Dive into the concept
Normalization is the hidden work that makes data usable. If one row has "Widget-A" and another has " widget-a ", your aggregation by product will treat them as different values unless you normalize. A good normalization policy is explicit: trim leading/trailing whitespace, collapse internal whitespace if appropriate, and standardize case for categorical fields. For dates, convert to a canonical format. For numbers, remove currency symbols and commas. Each normalization rule should be tied to a field type in the schema; otherwise you risk corrupting data (for example, changing case in a case-sensitive ID field). A disciplined normalization step ensures downstream aggregates are accurate.
Deduplication in a stream is a classic trade-off between correctness and memory. The simplest approach is to maintain a set of keys seen so far and drop duplicates. This works if the dataset is small enough to fit in memory or if the key space is bounded. In practice, you can limit deduplication to a window (e.g., per day or per file). This means you only keep keys for the current batch and reset between runs. In a streaming pipeline, dedupe is often performed per run or per file, which is acceptable for many ETL tasks. You should document this limitation and provide a --dedupe-key option so the user controls what counts as a duplicate.
Aggregation with awk uses associative arrays: sum[product] += amount. This is efficient and expressive, but you must be careful about numeric conversion and missing values. awk treats empty strings as 0 in numeric contexts, which can hide data quality problems. A robust pipeline should validate numeric fields before aggregating and keep a separate count of invalid rows. Another detail is output ordering: awk iterates over arrays in arbitrary order. If you want deterministic output for tests, you must sort the output (e.g., for (k in sum) print ... | "sort"). Sorting is not strictly streaming because it buffers output, but the output size is bounded by the number of keys, not the number of rows, so it is typically safe.
Normalization and aggregation also interact with deduplication. If duplicates differ only in whitespace or casing, normalize before dedupe so that logically identical records are treated as duplicates. If duplicates differ in numeric fields, you need a policy: keep the first, keep the last, or merge values. For a learning project, the simplest rule is “keep first” and log duplicates. This preserves determinism and makes the behavior easy to explain.
A subtle issue arises with numeric precision. awk uses double-precision floating point, which can introduce rounding errors for large sums or financial data. If exact decimal arithmetic is required, you should note this limitation and suggest using bc or a language with decimal libraries. For this project, accept floating point but format with fixed decimals in output.
Finally, normalization enables schema mapping. Many CSV files have different column orders. A robust pipeline can allow a header row and map columns by name. In awk, you can parse the header line, build a map from column name to index, and then use that map to extract fields. This makes your pipeline resilient to column reordering. If you implement this, be explicit about the required column names.
How this fits into the project
Normalization, deduplication, and aggregation are the heart of the transformation pipeline. Without them, you only have raw parsing. These steps produce the final clean output and summary reports.
Definitions & key terms
- Normalization: Converting values to a consistent format.
- Deduplication: Removing repeated records based on a key.
- Aggregation: Summarizing data (sum, count, average).
- Deterministic output: Output that does not depend on hash order or run time.
- Schema mapping: Using header names to locate fields.
Mental model diagram (ASCII)
raw CSV -> parse -> normalize -> validate -> dedupe -> aggregate -> report
How it works (step-by-step, with invariants and failure modes)
- Parse fields and normalize strings/numbers.
- Validate required fields and types.
- Build a dedupe key and check if seen.
- If not seen, emit record and update aggregates.
- At end, emit summary and counts.
Invariant: Each output record is normalized and validated. Failure modes: unbounded dedupe set, inconsistent normalization rules.
Minimal concrete example
# Sum revenue per product after normalization
# (FS="," for brevity; assumes no quoted commas in this input)
awk -F',' '{
gsub(/^[[:space:]]+|[[:space:]]+$/, "", $1); product=tolower($1);
amount=$3+0; sum[product]+=amount
} END {for (p in sum) printf "%s,%.2f\n", p, sum[p]}' sales.csv
Common misconceptions
- “Deduplication is always safe” -> It depends on your key and business rules.
- “awk output order is stable” -> It is not; you need sorting for determinism.
- “Normalization is just trimming” -> It includes casing, type casting, and canonical formats.
Check-your-understanding questions
- Why should normalization happen before deduplication?
- How can you keep output deterministic with awk aggregations?
- What is a safe dedupe key for a sales dataset?
- Why can floating point sums be risky for money?
Check-your-understanding answers
- It ensures logically identical records map to the same key.
- Sort the output or explicitly order keys.
- An invoice ID or unique transaction ID, not product name.
- Floating point can introduce rounding errors; decimals may be required.
Real-world applications
- Cleaning CRM exports and deduplicating contacts.
- Generating revenue reports from sales feeds.
- Normalizing data before loading into a warehouse.
Where you’ll apply it
- See §3.2 for deduplication and aggregation requirements.
- See §3.7 for the golden path output and summary report.
- Also used in: P04 System Inventory & Audit for aggregation of file metadata.
References
- “Data Pipelines Pocket Reference” (CSV normalization patterns)
- “Effective awk Programming” by Arnold Robbins
Key insights
Normalization and aggregation are not optional polish; they are the logic that turns raw CSV into decisions.
Summary
A streaming pipeline can normalize, deduplicate, and aggregate data in one pass if you carefully manage state and enforce a schema.
Homework/Exercises to practice the concept
- Normalize product names to lowercase and count distinct products.
- Implement deduplication by transaction ID and log duplicates.
- Generate a top-5 report sorted by revenue.
Solutions to the homework/exercises
- Use `tolower()` and an associative array of seen products.
- Track `seen[id]` and write duplicates to `duplicates.csv`.
- Output sums and pipe to `sort -rn | head -5`.
3. Project Specification
3.1 What You Will Build
A CLI tool called csvpipe that:
- Reads CSV from a file or stdin.
- Validates against a defined schema (field count, types, and required fields).
- Normalizes data (trim whitespace, standardize casing, numeric parsing).
- Deduplicates rows by a chosen key.
- Produces a normalized CSV output and summary statistics.
- Emits an error report for invalid rows and duplicates.
Included:
- Strict and lenient validation modes.
- Configurable schema via a simple config file.
- Deterministic output and summaries.
Excluded:
- Full SQL-style joins or multi-file merging.
- Binary formats (Parquet, Avro).
3.2 Functional Requirements
- Input: Read from file or stdin.
- CSV parsing: Correctly parse quoted fields with commas.
- Schema validation: Enforce field count, types, and required fields.
- Normalization: Trim whitespace, standardize case, convert numbers.
- Deduplication: Drop duplicates by configurable key.
- Summary report: Output counts, invalid rows, and top-N aggregates.
- Error logging: Write invalid rows to `invalid.csv` with reasons.
- Exit codes: Report success/failure as numeric exit codes.
3.3 Non-Functional Requirements
- Performance: Process at least 100,000 rows per minute.
- Reliability: Deterministic output given the same input.
- Usability: Clear CLI help and descriptive error messages.
3.4 Example Usage / Output
$ ./csvpipe.sh sales.csv --schema schema.json --normalize --dedupe --summary
Input rows: 125,402
Output rows: 123,988
Invalid rows: 1,414
Duplicates removed: 218
Top 5 products by revenue:
widget-a $1,203,110.00
widget-c $998,442.00
widget-b $880,923.00
3.5 Data Formats / Schemas / Protocols
Input CSV schema (example):
columns: [company, date, amount, status]
company: string (required)
date: YYYY-MM-DD (required)
amount: decimal (required)
status: enum(paid, pending, failed)
Normalized output CSV:
company,date,amount,status
acme inc,2026-01-01,1200.00,paid
Invalid row log format:
line,reason,raw
14,missing_field,"""Acme, Inc"",2026-01-01,1200"
3.6 Edge Cases
- Embedded commas in quoted fields.
- Missing trailing newline.
- Extra columns or missing columns.
- Non-numeric amounts (“N/A”).
- Duplicate rows with conflicting data.
3.7 Real World Outcome
3.7.1 How to Run (Copy/Paste)
./csvpipe.sh sales.csv --schema schema.json --normalize --dedupe --summary
3.7.2 Golden Path Demo (Deterministic)
$ ./csvpipe.sh sample/sales.csv --schema sample/schema.json --normalize --dedupe --summary
Input rows: 5
Output rows: 4
Invalid rows: 1
Duplicates removed: 0
Top 3 products by revenue:
widget-a $1200.00
widget-b $700.00
widget-c $500.00
3.7.3 Failure Demo (Deterministic)
$ ./csvpipe.sh sample/missing.csv --schema sample/schema.json
ERROR: file not found: sample/missing.csv
exit code: 2
3.7.4 If CLI: exact terminal transcript
$ ./csvpipe.sh sample/sales.csv --schema sample/schema.json --summary
Input rows: 5
Output rows: 4
Invalid rows: 1
$ echo $?
0
Exit codes:
- 0: Success, output generated.
- 1: Validation errors exceed the allowed threshold.
- 2: Missing file or invalid arguments.
4. Solution Architecture
4.1 High-Level Design
Input CSV -> [Parser] -> [Validator] -> [Normalizer] -> [Dedupe] -> [Aggregator] -> Output
                              |                                        |
                              v                                        v
                         invalid.csv                            summary report
4.2 Key Components
| Component | Responsibility | Key Decisions |
|---|---|---|
| Parser | CSV-aware field extraction | Require gawk FPAT |
| Validator | Enforce schema and rules | Strict by default, lenient optional |
| Normalizer | Clean and standardize fields | Per-field normalization rules |
| Dedupe | Remove duplicate keys | Keep first occurrence |
| Aggregator | Compute summaries | Use associative arrays and sort |
4.3 Data Structures (No Full Code)
# seen[key] = 1 for dedupe
# sum[product] = total revenue
# invalid[reason] = count
4.4 Algorithm Overview
Key Algorithm: Streaming Validation + Aggregation
- Parse line with FPAT.
- Validate field count and types.
- Normalize fields.
- If dedupe enabled, skip seen keys.
- Emit normalized record and update aggregates.
Complexity Analysis:
- Time: O(n) for n rows.
- Space: O(k) for k unique keys and aggregates.
5. Implementation Guide
5.1 Development Environment Setup
# Install GNU awk on macOS
brew install gawk
5.2 Project Structure
csvpipe/
├── csvpipe.sh
├── lib/
│ ├── parse.awk
│ ├── validate.awk
│ ├── normalize.awk
│ └── aggregate.awk
├── sample/
│ ├── sales.csv
│ └── schema.json
└── tests/
└── golden-output.txt
5.3 The Core Question You’re Answering
“How do I treat CSV as a stream of records and transform it into clean, validated output in one pass?”
5.4 Concepts You Must Understand First
- CSV quoting rules
- How do quotes escape commas?
- How do you detect multiline records?
- awk FPAT
- Why does FPAT outperform FS for CSV?
- What is the portability cost?
- Validation vs normalization
- Which happens first and why?
5.5 Questions to Guide Your Design
- What is your schema and how strict should it be?
- What dedupe key makes sense for your dataset?
- How will you handle invalid rows (drop, log, or repair)?
5.6 Thinking Exercise
Given a file with 1 million rows, what is the maximum number of unique keys you can keep in memory before performance degrades? Estimate memory usage for 1 million keys.
5.7 The Interview Questions They’ll Ask
- Why is CSV parsing harder than splitting on commas?
- How would you validate numeric ranges in awk?
- What trade-offs exist between strict and lenient validation?
5.8 Hints in Layers
Hint 1: Start by parsing
awk -v FPAT='"[^"]+"|[^,]+' '{print NF}' sales.csv
Hint 2: Add validation
awk -v FPAT='"[^"]+"|[^,]+' 'NF==4 {print} NF!=4 {print "INVALID", NR > "invalid.csv"}' sales.csv
Hint 3: Normalize and aggregate
awk -v FPAT='"[^"]+"|[^,]+' '{gsub(/"/,"",$1); sum[$1]+=$3} END {for (k in sum) print k, sum[k]}' sales.csv
Hint 4: Add dedupe
Track seen[$key] and skip if already seen.
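A sketch of the idiom, here keyed on column 1 of a hypothetical fixture:

```shell
printf '%s\n' 'a,1' 'a,2' 'b,3' > dedupe-demo.csv

# !seen[$1]++ is true only the first time a key appears.
awk -F',' '!seen[$1]++' dedupe-demo.csv   # prints a,1 and b,3
```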
5.9 Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| CSV parsing | “Effective awk Programming” | Data parsing chapters |
| shell pipelines | “The Linux Command Line” | Ch. 6-7 |
| data validation | “Data Engineering” (selected sections) | Validation chapters |
5.10 Implementation Phases
Phase 1: Foundation (2-3 days)
Goals: Parse CSV correctly and output normalized rows.
Tasks:
- Implement FPAT parsing with quote handling.
- Strip quotes and trim whitespace for string fields.
Checkpoint: Output CSV has correct field counts and clean values.
Phase 2: Core Functionality (4-5 days)
Goals: Add validation, error logging, and dedupe.
Tasks:
- Implement schema validation rules and error reporting.
- Add dedupe by key.
Checkpoint: Invalid rows are logged, duplicates are removed.
Phase 3: Reporting & Polish (2-3 days)
Goals: Add summary reports and deterministic ordering.
Tasks:
- Implement top-N aggregates and totals.
- Sort output for deterministic tests.
Checkpoint: Summary report matches golden output exactly.
5.11 Key Implementation Decisions
| Decision | Options | Recommendation | Rationale |
|---|---|---|---|
| CSV parsing | FS vs FPAT | FPAT | Correct handling of quotes |
| Dedupe scope | global vs per-run | per-run | bounded memory |
| Output order | arbitrary vs sorted | sorted | deterministic tests |
6. Testing Strategy
6.1 Test Categories
| Category | Purpose | Examples |
|---|---|---|
| Unit Tests | Field parsing | quoted commas, empty fields |
| Integration Tests | End-to-end pipeline | sample -> golden output |
| Edge Case Tests | malformed data | missing quotes, extra columns |
6.2 Critical Test Cases
- Quoted commas in company names.
- Missing amount field.
- Duplicate transaction IDs.
6.3 Test Data
"Acme, Inc",2026-01-01,1200,paid
Widget-B,2026-01-01,700,paid
Widget-C,2026-01-02,500,failed
"Bad Row",2026-01-03,,paid
7. Common Pitfalls & Debugging
7.1 Frequent Mistakes
| Pitfall | Symptom | Solution |
|---|---|---|
| Naive FS split | Columns shift | Use FPAT or custom parser |
| Silent invalid rows | Counts don’t match | Log invalid rows and reasons |
| Non-deterministic order | Tests fail | Sort outputs |
7.2 Debugging Strategies
- Print `NF` and field values for suspect lines.
- Use a small sample file to reproduce errors.
- Add a `--debug` flag to output parsed fields.
7.3 Performance Traps
Avoid sorting huge raw streams; sort only aggregated summaries.
8. Extensions & Challenges
8.1 Beginner Extensions
- Add TSV support with a `--delimiter` option.
- Add a `--strict` flag that fails on any invalid row.
8.2 Intermediate Extensions
- Support header-based column mapping.
- Add JSON output for summary reports.
8.3 Advanced Extensions
- Implement approximate dedupe with Bloom filters.
- Add configurable data quality thresholds.
9. Real-World Connections
9.1 Industry Applications
- Data ingestion pipelines for analytics.
- Cleaning CRM or billing exports.
9.2 Related Open Source Projects
- csvkit: Python tools for CSV.
- miller (mlr): CLI tool for data reshaping.
9.3 Interview Relevance
- Stream processing and schema validation.
- Data quality and deterministic reporting.
10. Resources
10.1 Essential Reading
- “Effective awk Programming” by Arnold Robbins
- RFC 4180
10.2 Video Resources
- “Data Cleaning with awk” (tutorial)
- “CSV Parsing Pitfalls” (talk)
10.3 Tools & Documentation
- gawk manual: FPAT and field parsing
- sed(1) manual page
10.4 Related Projects in This Series
- P01 Log Analyzer: Similar streaming aggregation.
- P06 Personal DevOps Toolkit: Integrates `csvpipe` as a subcommand.
11. Self-Assessment Checklist
11.1 Understanding
- I can explain why naive CSV splitting fails.
- I can describe how FPAT works.
- I can explain my dedupe policy and its limitations.
11.2 Implementation
- All functional requirements are met.
- Invalid rows are logged with reasons.
- Summary report is deterministic.
11.3 Growth
- I can describe how to scale this pipeline.
- I can explain the project clearly in an interview.
12. Submission / Completion Criteria
Minimum Viable Completion:
- Correct CSV parsing with quoting.
- Validation with error logging.
- Summary report generation.
Full Completion:
- Deduplication and deterministic ordering.
- Strict/lenient modes implemented.
- Golden output tests pass.
Excellence (Going Above & Beyond):
- Header-based schema mapping.
- Approximate dedupe or performance benchmarks.
- JSON summary output.