Project 2: Line Number and Statistics Calculator
Build a CLI that annotates lines with numbers and computes summary statistics over numeric columns.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Level 2: Beginner+ |
| Time Estimate | 1 week |
| Main Programming Language | AWK |
| Alternative Programming Languages | Python, Perl |
| Coolness Level | Level 2: Practical |
| Business Potential | Level 2: Useful internal utility |
| Prerequisites | FS/OFS basics, simple AWK programs |
| Key Topics | NR/FNR, BEGIN/END, aggregates, numeric coercion |
1. Learning Objectives
By completing this project, you will:
- Use `NR` and `FNR` to annotate input lines and multi-file streams.
- Implement running aggregates (count, sum, min, max, mean).
- Handle numeric conversion and missing/invalid data.
- Produce deterministic reports with headers and footers.
- Build tests for statistical correctness and edge cases.
2. All Theory Needed (Per-Concept Breakdown)
2.1 The Implicit Loop, NR/FNR, and Aggregation
Fundamentals
AWK processes records in a single implicit loop. NR counts total records seen across all files, while FNR resets to 1 for each new file. These counters let you label lines, track file boundaries, and implement global vs per-file statistics. The BEGIN and END blocks are essential for aggregation: initialize counters in BEGIN, update per record, and print summaries in END. This pattern is the foundation for any statistics tool built in AWK.
Deep Dive into the concept
The implicit loop is the runtime “engine” of AWK. For each record, AWK sets $0, splits fields, and evaluates rules. NR increments for every record processed across all inputs, which means it is a global counter. FNR resets to 1 for each new file, so it is a per-file counter. This difference is subtle but vital when you build tools that accept multiple input files. If you want a single numbering sequence across all inputs, use NR. If you want each file to restart line numbers, use FNR and a BEGINFILE hook (GNU awk) to print file headers.
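A minimal sketch makes the difference visible; the file names are illustrative:
# counters.awk -- run as: awk -f counters.awk a.txt b.txt
# NR keeps climbing across files; FNR restarts at 1 in b.txt
{ print FILENAME, "NR=" NR, "FNR=" FNR }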
Aggregation relies on state that persists across records. AWK variables are global by default and remain in memory until the program ends. This makes it easy to accumulate counts: count++, sum += $3, min = (NR==1 ? $3 : (min<$3?min:$3)), and so on. But it also creates pitfalls: if your variables are not initialized properly, AWK’s automatic numeric/string coercion can hide bugs. For example, an uninitialized variable used in numeric context is 0. That’s convenient for sum += x, but incorrect for min if you forget to initialize it to the first value.
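The seen-flag pattern described above, as a sketch (assuming the numeric value is in $3):
# initialize min/max from the first record, then compare
{ v = $3 + 0
  if (!seen) { min = max = v; seen = 1 }
  else { if (v < min) min = v; if (v > max) max = v } }
END { if (seen) print "min=" min, "max=" max }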
BEGIN and END blocks also influence determinism. If you print a header in BEGIN, the output is stable regardless of input. If you print a footer in END, you guarantee a summary even when the input is empty. This is essential for a statistics CLI: an empty input should produce a deterministic output (e.g., counts of 0) rather than nothing. Additionally, this structure makes tests easier to write because the output is always in the same shape.
When using NR for line numbering, remember that the implicit loop reads and processes lines in order. If you add getline or other manual reads later, NR will increment for those lines too, which can distort numbering. For this project, avoid getline in the core flow. Keep the logic within the main rule and treat the input stream as the source of truth. If you need to skip lines, do so with patterns rather than manual reads.
Finally, consider the difference between per-record updates and per-file summaries. GNU awk provides BEGINFILE and ENDFILE to let you reset counters for each file. If you want per-file stats, use these hooks; otherwise, stick to a single END summary. Explicitly document the behavior your tool implements so users understand what NR means in your output.
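For example, a per-file summary sketch using these GNU awk hooks (assuming the value is in field 2):
# gawk only: reset per-file state, report at each file's end
BEGINFILE { fcount = 0; fsum = 0 }
{ fcount++; fsum += $2 }
ENDFILE { printf "%s: count=%d avg=%.2f\n", FILENAME, fcount, (fcount ? fsum / fcount : 0) }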
How this fits into the project
Line numbering and stats are the core outputs of this tool, and the NR/FNR model determines whether your stats are global or per-file.
Definitions & key terms
- NR -> Total number of records processed.
- FNR -> Record number within the current file.
- BEGIN/END -> Blocks before and after main processing.
- Aggregate -> Summary computed over many records.
Mental model diagram
BEGIN -> for each record: update stats -> END -> print summary
How it works (step-by-step)
- Initialize counters in `BEGIN`.
- For each record, increment counters and update stats.
- Track min/max by comparing each new value against the running extremes.
- Print line numbers alongside selected fields.
- In `END`, compute derived metrics (mean) and output the summary.
Invariants and failure modes:
- Invariant: `NR` increases by 1 per processed record.
- Failure: min/max are incorrect if not initialized on the first numeric record.
Minimal concrete example
BEGIN { count=0; sum=0 }
{ count++; sum += $2; print NR ":" $0 }
END { if (count) print "avg", sum/count; else print "avg", 0 }
Common misconceptions
- “`NR` resets per file.” (That is `FNR`.)
- “`END` won’t run if there’s no input.” (It always runs.)
- “You must write loops to count records.” (AWK does it for you.)
Check-your-understanding questions
- What is the difference between `NR` and `FNR` with two input files?
- Why is `BEGIN` essential for deterministic headers?
- How would you compute per-file averages?
Check-your-understanding answers
- `NR` is global; `FNR` resets for each file.
- It ensures headers print even if the input is empty.
- Use `BEGINFILE`/`ENDFILE` (GNU awk) or reset counters when `FNR==1`.
Real-world applications
- Line-numbered logs for debugging
- Quick statistics on CSV columns
- Per-file summaries in batch processing
Where you’ll apply it
- See §5.4 concepts and §5.10 Phases 1-2.
- Also used in: P07 Report Generator, P16 Data Pipeline
References
- The AWK Programming Language, Ch. 1
- Effective awk Programming, Ch. 3-4
Key insights
Aggregations in AWK are just persistent variables updated in the implicit loop.
Summary
NR, FNR, and BEGIN/END are the control points that let a streaming language produce stable summaries.
Homework/Exercises to practice the concept
- Print line numbers only for lines that match `/ERROR/`.
- Compute per-file line counts with `FNR`.
- Build a summary that prints min, max, and mean.
Solutions to the homework/exercises
/ERROR/ { print NR ":" $0 }
# per-file counts
FNR==1 { if (NR>1) print prev, count; count=0; prev=FILENAME }
{ count++ }
END { print prev, count }
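For the third exercise, a sketch that assumes clean numeric data in field 2:
# min, max, and mean in one pass
{ v = $2 + 0; n++; s += v
  if (n == 1 || v < mn) mn = v
  if (n == 1 || v > mx) mx = v }
END { if (n) printf "min=%s max=%s mean=%.2f\n", mn, mx, s / n }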
2.2 Numeric Coercion and Robust Statistical Computations
Fundamentals
AWK is dynamically typed. The same variable can be treated as a string or a number depending on context. This is powerful but dangerous for statistics, because malformed input (like "N/A") can silently coerce to 0, skewing results. You must explicitly detect numeric fields and decide how to handle invalid data. You also need to understand how to compute statistics incrementally: count, sum, min, max, mean, and optionally variance. These operations can be done in one pass, which fits AWK’s streaming model.
Deep Dive into the concept
Numeric coercion in AWK follows simple rules: when a value appears in a numeric context (e.g., x+0, x > 3), AWK converts it to a number. Strings that start with digits are converted to that numeric prefix; strings without digits become 0. This means that a field containing "12ms" becomes 12, which might be acceptable or might be a silent bug. For a statistics CLI, you should define a numeric validation rule such as /^-?[0-9]+(\.[0-9]+)?$/. If a field does not match, you can skip it, count it as invalid, or treat it as zero. The choice must be explicit and documented.
Min/max computation has a standard pitfall: if you initialize min to 0, you break on all-positive data. The correct pattern is to initialize on the first valid numeric value. This can be done with a seen flag or by testing count==0 before assignment. Mean is straightforward (sum/count), but if you want variance or standard deviation, you should use a numerically stable online algorithm (e.g., Welford’s method) to avoid catastrophic cancellation. Even if you only implement mean, understanding these stability concerns helps you design correct tools.
For line-numbered statistics, you may want to output the original line and annotate with computed metrics. This is a user-interface decision: do you output stats at the end only, or inline? For this project, the requirement is to print line numbers and a final summary. This is common in log analysis: the annotated lines are “details,” and the summary is “roll-up.” Ensuring both outputs share the same delimiter and formatting avoids confusion.
Another subtlety is localization and numeric formatting. AWK’s printf can format numbers with precision control. For deterministic output, you should specify decimal places for averages (e.g., printf "%.3f"). If you rely on default formatting, the number of decimal places may vary. For tests, fixed formatting is essential.
Finally, think about missing values. If a line has fewer fields than expected, $n becomes an empty string. Your numeric validation should treat empty strings as invalid rather than 0, unless you explicitly want zeros. A good CLI gives the user control: a --strict mode that fails on invalid data, and a default mode that skips invalid rows but reports a count of skipped lines.
How this fits into the project
Statistics are the core deliverable: correct numeric handling is what makes your tool reliable.
Definitions & key terms
- Numeric coercion -> Automatic conversion between strings and numbers.
- Validation regex -> Pattern to detect numeric strings.
- Online algorithm -> One-pass computation of stats.
- Welford’s method -> Stable algorithm for variance.
Mental model diagram
field -> validate -> numeric value -> update {count,sum,min,max} -> report
How it works (step-by-step)
- Extract target field.
- Validate with regex.
- If valid, coerce to number and update aggregates.
- Track invalid count separately.
- Print summary with fixed formatting.
Invariants and failure modes:
- Invariant: counts reflect only valid numeric rows.
- Failure: invalid strings silently treated as 0 if not validated.
Minimal concrete example
{ if ($2 ~ /^-?[0-9]+(\.[0-9]+)?$/) {
    v = $2 + 0; count++; sum += v
    if (count == 1 || v < min) min = v
    if (count == 1 || v > max) max = v
  } else invalid++ }
END { if (count) printf "avg=%.2f\n", sum/count; print "invalid", invalid }
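If the rule above is saved as, say, stats.awk (the name is illustrative), a quick run shows the invalid row being counted instead of silently coerced to 0:
$ printf 'a 10\nb 20\nc X\n' | awk -f stats.awk
avg=15.00
invalid 1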
Common misconceptions
- “Non-numeric fields cause errors.” (They do not; they coerce to 0.)
- “`printf` is optional.” (It’s needed for deterministic formatting.)
- “Min/max can start at 0.” (Starting max at 0 works only for non-negative data; min must come from the first valid value.)
Check-your-understanding questions
- What does AWK do with the string `"12ms"` in a numeric context?
- Why is Welford’s method preferred for variance?
- How do you ensure averages print consistently across runs?
Check-your-understanding answers
- It converts it to `12`.
- It is numerically stable in one pass.
- Use `printf` with fixed precision.
Real-world applications
- Computing metrics from server logs
- Summarizing CSV exports
- Validating numeric data in ETL pipelines
Where you’ll apply it
- See §5.4 and §5.10 Phase 2.
- Also used in: P09 Network Log Analyzer, P16 Data Pipeline
References
- The AWK Programming Language, Ch. 2-3
- Donald Knuth, “The Art of Computer Programming” Vol. 2 (numerical stability)
Key insights
Reliable stats depend more on validation and initialization than on formulas.
Summary
AWK’s dynamic typing is powerful, but for statistics you must explicitly validate and format to avoid silent errors.
Homework/Exercises to practice the concept
- Add a `--strict` flag that errors on invalid numeric fields.
- Implement a running variance using Welford’s method.
- Format averages to exactly three decimal places.
Solutions to the homework/exercises
# strict mode sketch: abort on the first invalid value in field 2
$2 !~ /^-?[0-9]+(\.[0-9]+)?$/ { print "invalid" > "/dev/stderr"; exit 2 }
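For the remaining exercises, a sketch of Welford’s running variance plus fixed-precision averages; it assumes the value sits in field 2 and that input has already been validated:
# Welford's method: one-pass, numerically stable variance
{ v = $2 + 0; n++
  delta = v - mean
  mean += delta / n
  m2 += delta * (v - mean) }       # second factor uses the updated mean
END {
  if (n > 1) printf "var=%.6f\n", m2 / (n - 1)
  if (n) printf "avg=%.3f\n", mean # exactly three decimal places
}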
3. Project Specification
3.1 What You Will Build
A CLI tool that prints each line prefixed with a line number and produces a summary of numeric statistics for a chosen field (or fields). It supports custom delimiters and deterministic output formatting.
3.2 Functional Requirements
- Line numbering: Prefix each line with `NR` or `FNR` (configurable).
- Field selection: Choose which numeric field to analyze.
- Summary stats: Output count, min, max, sum, average, invalid count.
- Formatting: Fixed decimal precision.
- Error handling: Invalid options return exit code 2.
3.3 Non-Functional Requirements
- Performance: One-pass streaming computation.
- Reliability: Deterministic output with fixed precision.
- Usability: Clear help text and examples.
3.4 Example Usage / Output
$ printf 'a 10\nb 20\nc X\n' | ./linestats -f 2
1: a 10
2: b 20
3: c X
--
count=2 sum=30 min=10 max=20 avg=15.00 invalid=1
3.5 Data Formats / Schemas / Protocols
- Input: delimited text, one record per line
- Output: prefixed lines + summary footer
3.6 Edge Cases
- Empty input -> summary with count 0, avg 0
- Non-numeric values -> skipped or error (configurable)
- Missing fields -> treated as invalid
3.7 Real World Outcome
You can annotate logs with line numbers and compute numeric summaries in a single pass without writing a full program.
3.7.1 How to Run (Copy/Paste)
chmod +x linestats
./linestats -f 2 --precision 2 input.txt
3.7.2 Golden Path Demo (Deterministic)
# input.txt
alice 10
bob 20
carl 30
Command:
./linestats -f 2 --precision 2 input.txt
Output:
1: alice 10
2: bob 20
3: carl 30
--
count=3 sum=60 min=10 max=30 avg=20.00 invalid=0
3.7.3 Failure Demo (Deterministic)
$ ./linestats -f X input.txt
linestats: invalid field index: X
exit=2
3.7.4 Exit Codes
| Exit Code | Meaning |
|---|---|
| 0 | success |
| 2 | invalid arguments |
| 3 | input file unreadable |
4. Solution Architecture
4.1 High-Level Design
+--------+ +----------------+ +------------------+
| input | -> | line annotator | -> | stats aggregator |
+--------+ +----------------+ +------------------+
|
v
summary output
4.2 Key Components
| Component | Responsibility | Key Decisions |
|---|---|---|
| CLI parser | Parse `-f`, `--precision` | Fail fast on invalid args |
| Annotator | Print line numbers + line | Use NR or FNR |
| Aggregator | Update stats and print summary | Skip invalid by default |
4.3 Data Structures (No Full Code)
# stats: count, sum, min, max, invalid
# mode: global vs per-file
4.4 Algorithm Overview
Key Algorithm: Single-Pass Aggregation
- Validate field value.
- Update counters and sums.
- Track min/max via comparisons.
- Print summary in `END`.
Complexity Analysis:
- Time: O(R), where R is the number of records
- Space: O(1)
5. Implementation Guide
5.1 Development Environment Setup
awk --version | head -1
5.2 Project Structure
linestats/
├── linestats
├── linestats.awk
├── tests/
└── README.md
5.3 The Core Question You’re Answering
“How do I compute reliable summaries from a stream without loading it into memory?”
5.4 Concepts You Must Understand First
- `NR`/`FNR` and `BEGIN`/`END`
- Numeric coercion and validation
- `printf` formatting
5.5 Questions to Guide Your Design
- Should invalid rows be skipped or cause failure?
- Should numbering reset per file?
- What precision should be default?
5.6 Thinking Exercise
Compute min/max/avg by hand for a 5-line file with one invalid line.
5.7 The Interview Questions They’ll Ask
- How do `NR` and `FNR` differ?
- Why does AWK treat `"12ms"` as 12?
- How do you compute stats in one pass?
5.8 Hints in Layers
Hint 1: Start with `print NR ":" $0`.
Hint 2: Add `sum += $2` and `count++`.
Hint 3: Initialize min on the first numeric row.
Hint 4: Use `printf` for fixed precision.
5.9 Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| NR/FNR model | The AWK Programming Language | Ch. 1 |
| Numeric operations | The AWK Programming Language | Ch. 3 |
5.10 Implementation Phases
Phase 1: Foundation (2-3 hours)
- Build line numbering and header/footer
Phase 2: Core Stats (3-4 hours)
- Add sum/min/max/avg and validation
Phase 3: Polish (2-3 hours)
- Add CLI options, tests, and docs
5.11 Key Implementation Decisions
| Decision | Options | Recommendation | Rationale |
|---|---|---|---|
| Invalid data | Skip vs fail | Skip + count | Usable for messy inputs |
| Numbering | NR vs FNR | NR default | Global sequence |
6. Testing Strategy
6.1 Test Categories
| Category | Purpose | Examples |
|---|---|---|
| Unit Tests | Validation logic | regex on numeric fields |
| Integration Tests | CLI output | full input + summary |
| Edge Case Tests | Empty and invalid | no rows, all invalid |
6.2 Critical Test Cases
- Empty input returns count 0.
- Invalid numeric field increments invalid count.
- Mixed integers and decimals compute correct average.
6.3 Test Data
a 10
b 20
c X
7. Common Pitfalls & Debugging
7.1 Frequent Mistakes
| Pitfall | Symptom | Solution |
|---|---|---|
| Min/max wrong | Always 0 | Initialize on first valid |
| Invalid data counted | Average too low | Validate before sum |
| Precision inconsistent | Flaky tests | Use fixed printf |
7.2 Debugging Strategies
- Print intermediate stats to stderr for a small sample file.
7.3 Performance Traps
- Avoid storing all values in arrays if only basic stats are needed.
8. Extensions & Challenges
8.1 Beginner Extensions
- Add `--per-file` summaries.
8.2 Intermediate Extensions
- Add variance and standard deviation.
8.3 Advanced Extensions
- Add percentile calculation with streaming approximation.
9. Real-World Connections
9.1 Industry Applications
- Monitoring pipelines for outliers and anomalies.
9.2 Related Open Source Projects
- `awk` one-liners in DevOps monitoring scripts.
9.3 Interview Relevance
- Understanding streaming algorithms and numerical stability.
10. Resources
10.1 Essential Reading
- The AWK Programming Language, Ch. 1-3
- Effective awk Programming, Ch. 5
10.2 Video Resources
- “AWK for data summaries” tutorials
10.3 Tools & Documentation
- GNU awk manual: numeric/string conversion
10.4 Related Projects in This Series
- P01, P07 Report Generator, P09 Network Log Analyzer, P16 Data Pipeline
11. Self-Assessment Checklist
11.1 Understanding
- I can explain NR vs FNR
- I can validate numeric fields reliably
11.2 Implementation
- Summary stats are correct
- Output is deterministic
11.3 Growth
- I can defend my validation choices
12. Submission / Completion Criteria
Minimum Viable Completion:
- Line numbering works
- Summary stats correct for clean input
Full Completion:
- Handles invalid data and reports counts
- Deterministic golden demo documented
Excellence (Going Above & Beyond):
- Implements variance and percentiles
13. Additional Content Rules (Hard Requirements)
13.1 Determinism
- Fix precision with `--precision` and show it in demos.
- Use static fixtures for tests.
13.2 Outcome Completeness
- Success demo and failure demo are included.
- Exit codes specified in §3.7.4.
13.3 Cross-Linking
- Links to P01, P07, P09, P16.
13.4 No Placeholder Text
All sections are complete and concrete.