Project 1: Field Extractor CLI Tool
Build a `fieldex` command that extracts, reorders, and formats columns from delimited text with predictable, testable output.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Level 1: Beginner |
| Time Estimate | 6-10 hours |
| Main Programming Language | AWK (with a Bash wrapper) |
| Alternative Programming Languages | Python, Perl, Go |
| Coolness Level | Level 2: Practical |
| Business Potential | 1: Internal utility |
| Prerequisites | Basic shell usage, piping, simple regex |
| Key Topics | FS/OFS, NF, FPAT, CLI arg parsing, field ranges |
1. Learning Objectives
By completing this project, you will:
- Explain how AWK splits records into fields and rebuilds output records.
- Implement a CLI that accepts field lists and ranges like `1,3,5-7` and `2-`.
- Support delimiter configuration and consistent output formatting.
- Handle quoted CSV reliably using GNU awk `FPAT`.
- Build deterministic tests for CLI text tools.
2. All Theory Needed (Per-Concept Breakdown)
2.1 Records, Fields, and Separators (FS/OFS/RS/ORS)
Fundamentals
AWK processes text as a stream of records (usually lines) and splits each record into fields. The full line is stored in $0 and the fields are $1 through $NF. FS controls how input is split, while OFS controls how output fields are joined. RS and ORS do the same for records. The default FS is a single space, which collapses runs of whitespace, so tabular data “just works” without extra configuration. Understanding these variables is essential for a field extraction tool because your output is literally a new record assembled from specific fields. Changing a field or NF can cause AWK to rebuild $0 with OFS, which is a powerful but sometimes surprising behavior.
Deep Dive into the concept
At runtime, AWK executes a consistent pipeline: read a record using RS, set $0, split into fields using FS (or FPAT/FIELDWIDTHS), then run pattern-action rules. For most CLI tools, you keep RS="\n" and operate line by line. The key nuance is how FS behaves. If FS is a single space, AWK treats any run of whitespace as a separator and ignores leading/trailing whitespace. This is why awk '{print $1}' works on irregular spacing. But if FS is a literal comma or colon, AWK does not collapse adjacent delimiters; empty fields are preserved, which matters for CSV and /etc/passwd style data.
OFS is equally important. If you print multiple fields with commas in the print statement, AWK inserts OFS between them, not the original separator. This means print $1, $3 is “format output,” not “reproduce input.” For a CLI tool, this is a feature: you can standardize output regardless of input delimiters. However, it means you must set OFS deliberately and document it. If you print using concatenation (print $1 $3), AWK does not insert OFS, which yields different output. This distinction is the difference between a stable tool and a confusing one.
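A minimal sketch contrasting both behaviors, runnable with any POSIX awk:

# Default FS collapses runs of whitespace; a literal delimiter does not:
#   echo 'a   b  c' | awk '{ print NF }'          -> 3
#   echo 'a,,c'     | awk -F',' '{ print NF }'    -> 3  (middle field is empty)
BEGIN { OFS = "-" }
{
    print $1, $2     # comma inserts OFS:      "a-b"
    print $1 $2      # concatenation, no OFS:  "ab"
}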
RS can be configured to treat paragraphs or other blocks as records. You probably keep RS as newline in this project, but understanding it matters because a “field extractor” that claims to be general must explicitly state its assumptions. If you later extend to multi-line records, you will need to reason about how NF changes and how you rebuild $0 after modifications.
The rebuild behavior is the critical invariant: when you assign to a field ($3 = "x") or to NF, AWK marks the record as “dirty.” The next time $0 is referenced, AWK rebuilds it by joining the fields with OFS. This is how a field extraction tool can normalize spacing or output. But it also means that scripts with multiple rules can behave differently depending on order. If an earlier rule modifies fields, later rules see the modified record. This is why a CLI tool should prefer a single rule with a deterministic BEGIN for setup and a single action for output.
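A tiny sketch of the rebuild idiom; the no-op assignment `$1 = $1` is the usual way to force it:

# Re-join a colon-delimited record with a new output separator.
BEGIN { FS = ":"; OFS = "|" }
{
    $1 = $1      # marks the record dirty; $0 is rebuilt with OFS
    print $0     # "root:x:0" comes out as "root|x|0"
}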
Finally, consider portability. POSIX AWK guarantees FS, OFS, RS, ORS, and the record/field model. GNU awk adds FPAT for content-based splitting, which is extremely useful for CSV with quoted commas. If you require FPAT, the tool should document that it expects GNU awk. This is a design decision that affects users, tests, and error handling. For the fieldex tool, it is reasonable to detect gawk and enable --csv mode with FPAT, while leaving default FS parsing for other cases.
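One way to enforce this at startup is to probe a gawk-only feature. The sketch below assumes the `PROCINFO` array, which only gawk populates:

BEGIN {
    if (!("version" in PROCINFO)) {
        print "fieldex: --csv mode requires GNU awk (gawk)" > "/dev/stderr"
        exit 2
    }
    FPAT = "([^,]*)|(\"([^\"]|\"\")*\")"
}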
How this fits into the project
This concept is the foundation of the extractor: you are explicitly selecting fields and rebuilding output records in a controlled format.
Definitions & key terms
- Record -> The current input unit (`$0`), typically one line.
- Field -> A token within a record (`$1`..`$NF`).
- FS -> Input field separator (string or regex).
- OFS -> Output field separator inserted by `print`.
- RS -> Input record separator, usually newline.
- ORS -> Output record separator, usually newline.
- NF -> Number of fields in the current record.
Mental model diagram
raw line -> split by FS -> [$1][$2][$3]... -> select -> join with OFS -> output line
How it works (step-by-step)
- Read a line into `$0` using `RS`.
- Split `$0` into `$1`..`$NF` using `FS`.
- Select requested fields and build a list.
- `print` the list so AWK inserts `OFS` between fields.
- Emit `ORS` at the end of the record.
Invariants and failure modes:
- Invariant: `NF` equals the number of fields after splitting.
- Failure: If `FS` does not match the input format, fields shift or become empty.
Minimal concrete example
BEGIN { FS=":"; OFS="|" }
{ print $1, $7 }
Common misconceptions
- “`FS=" "` means a single space.” (It collapses runs of whitespace.)
- “`print $1 $2` uses OFS.” (It does not; it concatenates.)
- “Changing `$1` doesn’t change `$0`.” (It does once `$0` is rebuilt.)
Check-your-understanding questions
- Why does `FS=" "` behave differently than `FS=","`?
- What is the difference between `print $1, $2` and `print $1 $2`?
- What happens if you set `NF=2` in the middle of a script?
Check-your-understanding answers
- A single space collapses runs of whitespace; a literal comma does not.
- The comma inserts `OFS`; concatenation does not insert anything.
- AWK truncates the record to two fields and rebuilds `$0` on demand.
Real-world applications
- Extracting specific columns from CSV exports
- Normalizing log formats for downstream tools
- Selecting columns from `/etc/passwd` or `ps` output
Where you’ll apply it
- See §5.4 “Concepts You Must Understand First” and §5.10 Phase 1.
- Also used in: P04 CSV to JSON/SQL, P06 Multi-File Join
References
- The AWK Programming Language, Ch. 1-2
- Effective awk Programming, Ch. 4
- POSIX awk specification (`awk(1p)`)
Key insights
Field extraction is just controlled splitting and rejoining—get the separators right and everything else becomes predictable.
Summary
Records and fields are AWK’s core model. If you can predict how FS and OFS shape input and output, you can build reliable text tools with very little code.
Homework/Exercises to practice the concept
- Extract fields 1 and 3 from a space-delimited file and join with `|`.
- Switch `FS` from space to colon and observe how `NF` changes.
- Modify `$2` and print `$0` to see record rebuilding.
Solutions to the homework/exercises
# 1
BEGIN { OFS="|" }
{ print $1, $3 }
# 2
BEGIN { FS=":" }
{ print NF }
# 3
{ $2 = "X"; print $0 }
2.2 Field Selection Syntax and Range Expansion
Fundamentals
A field extractor needs a compact way to describe which columns to print. Humans use lists and ranges: 1,3,5-7,10-. The tool’s job is to parse that description, expand it into a concrete ordered list, and then print fields in that order. The tricky part is validating field indices, handling open-ended ranges, and deciding what to do when fields are missing. This is not an AWK language feature; it is a small parsing problem you solve in the wrapper or in AWK itself.
Deep Dive into the concept
Range expansion is a tiny domain-specific language. The input string is a comma-separated list of tokens. Each token is either a single positive integer (field index) or a range with a hyphen. Ranges can be start-end or start- (open ended). Your parser must detect invalid tokens (non-numeric, zero, negative) and decide on failure behavior. For example, -3 should be rejected, and 3-2 is either an error or treated as a descending range. Define this explicitly and test for it.
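As a sketch, the whole token grammar fits in one validation function (the name is illustrative):

# Accepts "3", "3-", and "3-5"; rejects "", "-3", "0", "a", and "3--4".
# Note: "3-2" passes; treating descending ranges as errors is a separate check.
function valid_token(t) {
    return t ~ /^[1-9][0-9]*(-([1-9][0-9]*)?)?$/
}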
Once you parse the list, you need to expand it into a deterministic sequence. If a user writes 1,3-5,3, do you allow duplicates? Many CLI tools preserve order and duplicates, which can be useful for reordering or duplication. That behavior should be documented because it affects output. A safe default is to preserve order but allow a --unique flag that removes duplicates. For this project, you can preserve order and avoid de-duplication to keep behavior simple and explainable.
Open-ended ranges require a decision on what - means. A common interpretation is start- meaning “from start to NF.” In AWK, NF is only known at runtime for each record. That means you cannot fully expand such ranges ahead of time unless you treat them as a special token resolved per record. A clean design is to parse ranges into a list of selectors that can be evaluated per record. For example, store range_start and a flag range_to_end. At print time, you loop from range_start to NF and print each field. This keeps your parser simple and your runtime logic clear.
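A sketch of that per-record evaluation, assuming the parser filled start[i], end[i], and to_end[i] for each of the nsel selectors:

# Resolve open-ended ranges against NF at print time.
{
    for (i = 1; i <= nsel; i++) {
        hi = to_end[i] ? NF : end[i]
        for (j = start[i]; j <= hi; j++)
            printf "%s%s", $j, (i == nsel && j == hi ? ORS : OFS)
    }
}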
The wrapper vs. pure AWK design choice matters. A Bash wrapper can parse command-line options, then pass a generated AWK program or variable values to AWK. Alternatively, you can implement parsing inside AWK using ARGV and split() on the -f option. The wrapper approach is easier to read and test for beginners, while the pure-AWK approach is more self-contained and portable. In this project, you can use a small shell wrapper and keep the AWK script focused on field selection.
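A sketch of that hand-off, assuming the wrapper exports user input through `-v` variables (the names `fields` and `ofs` are illustrative):

# Wrapper side, conceptually:
#   awk -v fields="$FIELDS" -v ofs="$OFS" -f fieldex.awk "$@"
BEGIN {
    if (ofs != "") OFS = ofs
    nsel = split(fields, sel, ",")
}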
There is also a subtle formatting consideration: if a requested field index is greater than NF, AWK returns an empty string. If you print a list of fields, you will still get separators. This can produce trailing delimiters. Decide whether to omit missing fields or output empty fields. The more predictable behavior is to output empty fields, because it preserves column positions. But for a CLI tool, many users prefer to drop missing fields. You can offer a --strict or --skip-missing flag. For this project, define a clear default and document it in the usage.
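The two behaviors are easy to contrast; a sketch assuming numeric selectors in sel[1..nsel] and a hypothetical skip_missing flag:

# Default: out-of-range indices print as empty fields, preserving positions.
#   printf 'a b c\n' | awk 'BEGIN{OFS="|"} {print $1, $5}'   ->  "a|"
# --skip-missing variant: drop indices beyond NF instead.
{
    line = ""; k = 0
    for (i = 1; i <= nsel; i++)
        if (sel[i] <= NF || !skip_missing)
            line = (k++ ? line OFS : "") $(sel[i])
    print line
}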
Finally, consider error handling. If the user passes -f without a value, or includes a malformed token like 1,,3, your tool should return a non-zero exit code and print a helpful message. This is part of being a production-quality CLI, even for a small tool.
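A sketch of fail-fast validation in BEGIN, reusing the token grammar above and the exit codes from §3.7.4:

BEGIN {
    if (fields == "") {
        print "fieldex: missing or empty -f FIELDS" > "/dev/stderr"
        exit 2
    }
    n = split(fields, s, ",")
    for (i = 1; i <= n; i++)
        if (s[i] !~ /^[1-9][0-9]*(-([1-9][0-9]*)?)?$/) {
            printf "fieldex: invalid field selector: %s\n", s[i] > "/dev/stderr"
            exit 2
        }
}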
How this fits into the project
Field selection and range expansion is the “business logic” of the tool. Everything else is setup and I/O.
Definitions & key terms
- Field list -> A comma-separated list of field selectors.
- Range -> A selector of the form `start-end` or `start-`.
- Open-ended range -> A range that ends at `NF`.
- Selector -> A single field index or a range definition.
Mental model diagram
"1,3-5,7-" -> tokens -> [1] [3-5] [7-] -> runtime expansion -> [1,3,4,5,7..NF]
How it works (step-by-step)
- Parse the `-f` argument into comma-separated tokens.
- For each token, detect a single index or a range.
- Validate indices (positive integers).
- Store selectors in order.
- For each input record, expand ranges using `NF` and print fields.
Invariants and failure modes:
- Invariant: all selectors are validated integers.
- Failure: invalid tokens should abort with exit code 2.
Minimal concrete example
# Given selectors in an array sel[1..n]; k and line are reset per record
k = 0; line = ""
for (i = 1; i <= n; i++) {
    if (sel[i] ~ /-/) {                      # range selector like "3-5" or "7-"
        split(sel[i], r, "-")
        start = r[1]
        end = (r[2] == "" ? NF : r[2])
        for (j = start; j <= end; j++)
            line = (k++ ? line OFS : "") $j
    } else {                                 # single field index
        line = (k++ ? line OFS : "") $(sel[i])
    }
}
print line    # POSIX awk has no built-in join(), so build the string directly
Common misconceptions
- “Ranges can be expanded before reading input.” (Open-ended ranges depend on `NF`.)
- “Missing fields are errors.” (In AWK, missing fields are empty strings.)
- “You should always de-duplicate fields.” (Order and duplicates can be meaningful.)
Check-your-understanding questions
- Why can’t `3-` be expanded without reading the record?
- What happens when you print a field index larger than `NF`?
- Should `1,1,2` print field 1 twice?
Check-your-understanding answers
- Because the end depends on `NF`, which is record-specific.
- AWK prints an empty string for that field.
- It depends on design; by default preserving order means yes.
Real-world applications
- Command-line replacements for `cut` with range support
- Data extraction in ETL pipelines
- Reordering columns for data import/export
Where you’ll apply it
- See §5.5 “Questions to Guide Your Design” and §5.10 Phase 2.
- Also used in: P07 Report Generator, P16 Data Pipeline
References
- The AWK Programming Language, Ch. 2
- Effective awk Programming, Ch. 5 (programming with arrays)
Key insights
Field selection is a tiny parsing language; treat it with the same rigor as any CLI interface.
Summary
Parsing and expanding field ranges is what turns a simple one-liner into a reliable CLI tool that people can reuse.
Homework/Exercises to practice the concept
- Write a parser that expands `2-4,1,6-` into explicit indices.
- Decide how to handle invalid tokens like `3--4` or `a`.
- Implement a `--unique` mode that removes duplicates.
Solutions to the homework/exercises
# 1 (sketch)
function expand(sel, out,    n, s, i, r, start, end, j, k) {
    n = split(sel, s, ",")
    for (i = 1; i <= n; i++) {
        if (s[i] ~ /-/) {
            split(s[i], r, "-")
            start = r[1]
            end = (r[2] == "" ? NF : r[2])
            for (j = start; j <= end; j++) out[++k] = j
        } else out[++k] = s[i]
    }
    return k
}
2.3 Robust CSV Parsing with FPAT
Fundamentals
CSV looks simple but gets tricky when fields contain commas or quotes. If you split on commas with FS=",", a value like "Smith, John" will be split incorrectly. GNU awk’s FPAT solves this by defining what a field is rather than what separates fields. A typical CSV FPAT matches either a quoted field or an unquoted field. This allows you to preserve commas inside quotes and then unescape quotes for output.
Deep Dive into the concept
CSV has a small but strict grammar: fields are separated by commas, fields may be quoted with double quotes, and quotes inside quoted fields are escaped by doubling them (""). If you parse with FS=",", you cannot distinguish commas inside quoted fields from separators. FPAT reverses the logic. Instead of describing separators, you define a regex that matches valid field content. For CSV, a common pattern is:
FPAT = "([^,]*)|(
\"([^\"]|\"\")*\"
)"
A stricter variant that requires non-empty unquoted fields is:
FPAT = "([^,]+)|(\"([^\"]|\"\")*\")"
This matches either unquoted text without commas or a quoted string that may contain escaped quotes. Once fields are matched, AWK treats each match as a field and sets $1..$NF accordingly.
However, FPAT is a GNU awk extension. That means portability is limited. Your CLI tool should detect GNU awk (gawk --version) or provide a --csv mode that explicitly requires it. For users on macOS, the default /usr/bin/awk is not GNU awk; this is a common portability problem. You can either instruct users to install gawk and run gawk -f fieldex.awk, or bundle a wrapper that checks and fails with a clear error message.
After parsing, you still need to unquote fields. CSV quoted fields preserve quotes, so "Alice" becomes a field value including quotes. If you output raw fields, your output may include quotes in unexpected places. For a field extractor tool, you typically want to output the value, so you should strip surrounding quotes and unescape doubled quotes. This can be done with a simple helper: if a field starts and ends with ", remove them, then replace "" with ".
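That helper is only a few lines; a sketch:

# Strip surrounding quotes, then unescape doubled quotes ("" -> ").
function unquote(s) {
    if (s ~ /^".*"$/) {
        s = substr(s, 2, length(s) - 2)
        gsub(/""/, "\"", s)
    }
    return s
}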
Edge cases: empty fields, trailing commas, and line endings. FPAT patterns should allow empty fields (e.g., "" or nothing between commas). This is easy to miss. Also, Windows CSV files may end with \r\n; you should normalize line endings or set RS appropriately. If you ignore this, your output may carry stray \r characters.
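A one-line guard covers the CRLF case; because sub() on $0 triggers a re-split, it should run before any field is read:

{ sub(/\r$/, "") }   # normalize Windows line endings; awk re-splits $0 afterward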
Finally, note that CSV can include newlines inside quoted fields. Handling that would require multi-line record splitting by setting RS to a regex that matches a full CSV record. This is advanced and not required for this project, but your documentation should state that multi-line CSV is not supported. Clear boundaries are part of robust CLI tool design.
How this fits into the project
If you want a “real” field extractor, it must not break on quoted CSV. FPAT is the simplest practical approach in GNU awk.
Definitions & key terms
- CSV -> Comma-separated values with quoted field rules.
- FPAT -> GNU awk variable defining field patterns.
- Quoted field -> A field enclosed in double quotes.
- Escaped quote -> A doubled quote inside a quoted field (`""`).
Mental model diagram
line: Bob,"Smith, John",42
FPAT -> [Bob] ["Smith, John"] [42]
How it works (step-by-step)
- Set `FPAT` to a regex that matches CSV fields.
- AWK scans the line and extracts matches as fields.
- Optionally unquote and unescape quoted fields.
- Print selected fields with `OFS`.
Invariants and failure modes:
- Invariant: quoted commas do not split fields.
- Failure: non-GNU awk ignores `FPAT` and breaks parsing.
Minimal concrete example
BEGIN { FPAT = "([^,]+)|(\"([^\"]|\"\")*\")"; OFS="|" }
{ f = $2; gsub(/^"|"$/, "", f); gsub(/""/, "\"", f); print $1, f }
Common misconceptions
- “CSV is just `FS=","`.” (Quoted commas break this.)
- “FPAT is portable.” (It is GNU awk only.)
- “Quoted fields should keep quotes.” (Depends on output requirements.)
Check-your-understanding questions
- Why does `FS=","` fail for `"Doe, Jane"`?
- What does `FPAT` change about field splitting?
- How do you handle doubled quotes inside a quoted field?
Check-your-understanding answers
- The comma inside quotes is treated as a separator.
- It matches field content instead of separators.
- Replace `""` with `"` after stripping the outer quotes.
Real-world applications
- Parsing CSV exports from spreadsheets
- Cleaning CRM or sales data
- Extracting columns from log CSVs with quoted strings
Where you’ll apply it
- See §5.4 “Concepts You Must Understand First” and §5.10 Phase 2.
- Also used in: P04 CSV to JSON/SQL, P10 Text Spreadsheet
References
- GNU awk User’s Guide: `FPAT`
- RFC 4180 (CSV format)
Key insights
CSV parsing is about defining field content, not just separators; FPAT is the practical AWK solution.
Summary
A reliable field extractor must handle quoted CSV. FPAT makes that possible with a clear regex and post-processing.
Homework/Exercises to practice the concept
- Write an `FPAT` that matches empty fields and quoted fields.
- Parse a CSV row with embedded commas and output the second column only.
- Add a flag to keep quotes rather than stripping them.
Solutions to the homework/exercises
BEGIN { FPAT = "([^,]*)|(\"([^\"]|\"\")*\")" }
{ f = $2; gsub(/^"|"$/, "", f); gsub(/""/, "\"", f); print f }
3. Project Specification
3.1 What You Will Build
A CLI tool called fieldex that extracts one or more fields from each input record. It supports delimiter configuration, field lists and ranges, and predictable output formatting. You will deliver a small Bash wrapper plus an AWK script, with tests and usage documentation.
Included:
- `-f` for field list/range selection
- `-d` for the input delimiter (`FS`) and `--ofs` for the output separator
- `--csv` mode using GNU awk `FPAT`
- Deterministic test fixtures
Excluded:
- Multi-line CSV parsing
- Binary formats
- GUI output
3.2 Functional Requirements
- Field selection: Accept `-f "1,3,5-7"` and `-f "2-"`.
- Delimiter control: Accept `-d ","` to set `FS`.
- Output separator: Accept `--ofs "|"` to set `OFS`.
- CSV mode: `--csv` uses `FPAT` and unquotes fields.
- Error handling: Invalid field specs return exit code 2.
3.3 Non-Functional Requirements
- Performance: Stream input without loading entire files.
- Reliability: Deterministic output for same inputs.
- Usability: Clear usage and helpful error messages.
3.4 Example Usage / Output
$ printf 'a b c\n1 2 3\n' | ./fieldex -f 1,3
# Output:
a c
1 3
3.5 Data Formats / Schemas / Protocols
- Input: delimited text, one record per line
- Output: delimited text, one record per line
- CSV mode: RFC 4180 style quoted fields (single-line only)
3.6 Edge Cases
- Field index exceeds `NF` -> output empty field
- Empty lines -> output empty line
- `-f` missing or invalid -> error with exit code 2
- `-d` empty string -> error with exit code 2
3.7 Real World Outcome
A user can run a single command to extract and reformat columns from mixed-delimiter files. They can see exact, stable outputs and understand errors when inputs or flags are invalid.
3.7.1 How to Run (Copy/Paste)
cd /path/to/fieldex
chmod +x fieldex
./fieldex -f 1,3 -d ',' --ofs '|' data.csv
3.7.2 Golden Path Demo (Deterministic)
Input file data.csv:
name,age,role
"Smith, John",32,engineer
Alice,29,manager
Command:
./fieldex --csv -f 1,2 --ofs '|' data.csv
Output:
name|age
Smith, John|32
Alice|29
3.7.3 Failure Demo (Deterministic)
$ ./fieldex -f 3--4 data.csv
fieldex: invalid field selector: 3--4
exit=2
3.7.4 Exit Codes
- `0` success
- `2` invalid arguments or field specs
- `3` input file not found/unreadable
4. Solution Architecture
4.1 High-Level Design
+----------+      +-----------------+      +--------------+
| CLI args | ---> | selector parser | ---> | AWK executor |
+----------+      +-----------------+      +--------------+
                                                  |
                                                  v
                                          formatted output
4.2 Key Components
| Component | Responsibility | Key Decisions |
|---|---|---|
| Bash wrapper | Parse flags, validate, build args | Keep parsing simple, fail fast |
| Selector parser | Expand ranges and store selectors | Preserve order and duplicates |
| AWK script | Apply selectors and print fields | Single action for deterministic run |
4.3 Data Structures (No Full Code)
# selectors[1..n] holds either "N" or "N-" or "N-M"
# output buffer holds selected fields per record
4.4 Algorithm Overview
Key Algorithm: Selector Expansion
- Split the `-f` string on commas.
- Validate each token.
- Store tokens in an array.
- For each record, expand ranges and print fields.
Complexity Analysis:
- Time: O(R * (F + S)) where R records, F fields, S selectors
- Space: O(S) selectors per record
5. Implementation Guide
5.1 Development Environment Setup
# Ensure GNU awk is available (it may be installed as gawk)
gawk --version | head -1
5.2 Project Structure
fieldex/
├── fieldex # wrapper
├── fieldex.awk # core logic
├── tests/
│ ├── sample.csv
│ └── expected.txt
└── README.md
5.3 The Core Question You’re Answering
“How do I treat text like a table and reshape it with almost no code?”
5.4 Concepts You Must Understand First
Stop and research these before coding:
- Field splitting (`FS`, `FPAT`)
- Output formatting (`OFS`, `print` vs concatenation)
- Array iteration and range expansion
5.5 Questions to Guide Your Design
- Should invalid selectors abort the program or be ignored?
- Should duplicates be preserved or removed?
- Should missing fields be printed as empty or skipped?
5.6 Thinking Exercise
Manually expand 2-4,1,6- for a record with 7 fields.
5.7 The Interview Questions They’ll Ask
- What is the difference between `FS` and `FPAT`?
- Why does `print $1, $2` insert `OFS`?
- How do you handle field ranges that depend on `NF`?
5.8 Hints in Layers
Hint 1: Start with a fixed list of fields and print them.
Hint 2: Parse -f into an array with split().
Hint 3: Support ranges by detecting - and looping.
Hint 4: Add --csv only after the basic tool works.
5.9 Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Field variables | The AWK Programming Language | Ch. 1 |
| Separators | Effective awk Programming | Ch. 4 |
| Arrays | Effective awk Programming | Ch. 5 |
5.10 Implementation Phases
Phase 1: Foundation (2-3 hours)
Goals:
- Parse `-f` and `-d`
- Print fixed fields
Tasks:
- Write a minimal wrapper that passes `FS` to AWK.
- Hardcode `print $1, $3` and validate output.
Checkpoint: Correct output for a small sample file.
Phase 2: Core Functionality (3-4 hours)
Goals:
- Parse field lists and ranges
- Set `OFS`
Tasks:
- Implement selector parsing.
- Expand ranges per record.
- Add the `--ofs` option.
Checkpoint: `-f 1,3,5-` works on files with variable `NF`.
Phase 3: Polish & Edge Cases (2-3 hours)
Goals:
- Add `--csv` mode
- Add tests and error handling
Tasks:
- Implement `FPAT` and the unquote logic.
- Write tests for invalid selectors.
- Document usage.
Checkpoint: Tests pass; CLI returns correct exit codes.
5.11 Key Implementation Decisions
| Decision | Options | Recommendation | Rationale |
|---|---|---|---|
| Range duplicates | Keep vs remove | Keep | Preserves explicit user intent |
| Missing fields | Empty vs skip | Empty | Predictable column count |
| CSV support | FS vs FPAT | FPAT | Correct handling of quoted commas |
6. Testing Strategy
6.1 Test Categories
| Category | Purpose | Examples |
|---|---|---|
| Unit Tests | Selector parsing | 1,3-5 expansion |
| Integration Tests | Full CLI invocation | ./fieldex -f 1,3 file |
| Edge Case Tests | Missing fields, invalid arg | -f 0, -f 3--4 |
6.2 Critical Test Cases
- CSV quoted field: Ensure `"Smith, John"` stays intact.
- Open-ended range: `-f 2-` expands to `2..NF`.
- Missing field: `-f 9` on a 3-field line prints an empty field.
6.3 Test Data
# input
A,B,"C,D"
1,2,3
Expected output for `./fieldex --csv -f 1,3 --ofs '|'`:
A|C,D
1|3
7. Common Pitfalls & Debugging
7.1 Frequent Mistakes
| Pitfall | Symptom | Solution |
|---|---|---|
| Wrong FS | Fields misaligned | Verify delimiter and whitespace |
| Missing OFS | Output glued together | Set OFS in BEGIN |
| Range parsing errors | Incorrect columns printed | Add validation + tests |
7.2 Debugging Strategies
- Print `NF` and the fields to confirm splitting before selection (sketch below).
- Echo parsed selectors to verify range expansion.
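A sketch of such a debug rule, writing to stderr so the data stream stays clean (see §7.3):

# Dump NF and every field to stderr ahead of the selection logic.
{
    printf "NF=%d", NF > "/dev/stderr"
    for (i = 1; i <= NF; i++) printf " [%d]=%s", i, $i > "/dev/stderr"
    print "" > "/dev/stderr"
}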
7.3 Performance Traps
- Printing extra debug output to stdout can corrupt data streams; use stderr.
8. Extensions & Challenges
8.1 Beginner Extensions
- Add a `--header` mode that preserves a header line.
- Add a `--unique` selector mode.
8.2 Intermediate Extensions
- Support multiple delimiters via a regex `FS`.
- Add `--null` output (NUL-separated) for pipeline safety.
8.3 Advanced Extensions
- Multi-line CSV support with a custom `RS`.
- Add schema validation for numeric columns.
9. Real-World Connections
9.1 Industry Applications
- ETL pipelines: Extract columns before loading into databases.
- Log processing: Select fields for analytics or alerting.
9.2 Related Open Source Projects
- GNU coreutils `cut`: Similar tool with fewer CSV features.
- xsv: Rust CSV toolkit that inspires robust CSV handling.
9.3 Interview Relevance
- Parsing CLI args, text processing, and basic streaming concepts.
10. Resources
10.1 Essential Reading
- The AWK Programming Language by Aho, Kernighan, Weinberger - Ch. 1-2
- Effective awk Programming by Arnold Robbins - Ch. 4-5
10.2 Video Resources
- AWK basics and field splitting demos (YouTube, GNU awk talks)
10.3 Tools & Documentation
- GNU awk User’s Guide: `FPAT`, `FIELDWIDTHS`, `getline`
- POSIX awk spec: portability notes
10.4 Related Projects in This Series
- P02 Line Stats Calculator for aggregation
- P04 CSV to JSON/SQL for structured output
11. Self-Assessment Checklist
11.1 Understanding
- I can explain `FS`, `OFS`, and how they affect output
- I can parse and expand field ranges
- I can explain when `FPAT` is required
11.2 Implementation
- All functional requirements are met
- All test cases pass
- Errors return correct exit codes
- Output is deterministic
11.3 Growth
- I can explain this tool in a job interview
- I documented key design decisions
12. Submission / Completion Criteria
Minimum Viable Completion:
- Implements `-f`, `-d`, and `--ofs`
- Handles invalid selectors with exit code 2
- Passes at least 5 tests
Full Completion:
- CSV mode implemented with `FPAT`
- Deterministic golden demo documented
Excellence (Going Above & Beyond):
- Multi-delimiter and NUL output modes
- Robust CLI help and man-page style docs
13. Additional Content Rules (Hard Requirements)
13.1 Determinism
- Use fixed sample input files under `tests/fixtures/`.
- In golden demos, always show the exact input files and expected output.
- Avoid locale-dependent sorting or formatting.
13.2 Outcome Completeness
- Include a success demo (§3.7.2) and a failure demo (§3.7.3).
- Exit codes are specified in §3.7.4.
13.3 Cross-Linking
- Concepts reference §5.4 and §5.10.
- Cross-links provided to P04, P06, P07, and P16.
13.4 No Placeholder Text
All sections are complete and concrete.