Project 1: my_wc - The Word Counter
Build a wc-style tool that counts lines, words, and bytes from stdin or files.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Beginner |
| Time Estimate | 1 weekend |
| Language | C (Alternatives: Rust, Go, Python) |
| Prerequisites | Basic C, loops, conditionals, file I/O |
| Key Topics | Streams, state machines, stdio buffering |
1. Learning Objectives
By completing this project, you will:
- Implement a streaming text processor that handles stdin and files.
- Design a simple state machine for word boundaries.
- Produce consistent CLI output and exit codes.
- Compare byte counting vs character counting in a real tool.
2. Theoretical Foundation
2.1 Core Concepts
- Streams as byte sequences of unknown length: Unix tools treat input as a stream. Your program should not assume a fixed size and must handle EOF correctly.
- Word boundaries as state: A word is a run of non-whitespace characters. Counting words is a state machine: you are either inside a word or outside a word.
- Buffering layers: getchar() and fgetc() read from a stdio buffer in user space; the OS only sees the read() syscalls that refill it. Buffering reduces syscall overhead.
2.2 Why This Matters
Most Unix tools are stream processors. If you can implement wc, you can implement log analyzers, metrics counters, and simple text transforms. This is a foundational mental model for high-performance CLI work.
2.3 Historical Context / Background
wc is one of the earliest Unix utilities. Its simplicity is intentional: one job, done well, composed with pipes. This tool embodies the Unix philosophy of composable filters.
2.4 Common Misconceptions
- “A word is any token”: In wc, a word is a sequence of non-whitespace characters, not a language token.
- “Bytes equal characters”: In UTF-8, one character can be multiple bytes. wc counts bytes, not characters, unless you implement multibyte logic.
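To make the byte/character distinction concrete, here is a minimal sketch that counts UTF-8 code points by skipping continuation bytes (those of the form 10xxxxxx). The function name utf8_count_chars is hypothetical, and the sketch assumes the input is already valid UTF-8; it does no validation.

```c
#include <stddef.h>

/* Hypothetical helper: count UTF-8 code points in a buffer.
 * Every code point has exactly one byte that is NOT a continuation
 * byte (10xxxxxx), so counting those bytes counts code points.
 * Assumes valid UTF-8; malformed input is not detected. */
size_t utf8_count_chars(const unsigned char *buf, size_t len)
{
    size_t chars = 0;
    for (size_t i = 0; i < len; i++) {
        if ((buf[i] & 0xC0) != 0x80)
            chars++;
    }
    return chars;
}
```

For the six-byte string "h\xC3\xA9llo" (the word "héllo"), the byte count is 6 while this function reports 5.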
3. Project Specification
3.1 What You Will Build
A command-line tool my_wc that reads from stdin or from one or more files and prints counts for lines, words, and bytes. When multiple files are provided, it prints per-file counts and a total line.
3.2 Functional Requirements
- Count from stdin when no files are provided.
- Count per file when one or more file paths are provided.
- Print totals when multiple files are processed.
- Return non-zero exit code if any file fails to open.
3.3 Non-Functional Requirements
- Performance: Must handle large files without loading entire file into memory.
- Reliability: Properly handle files with no trailing newline.
- Usability: Output format should match wc spacing and order.
3.4 Example Usage / Output
$ ./my_wc my_wc.c
250 850 5000 my_wc.c
$ echo "hello world from my wc" | ./my_wc
1 5 23
3.5 Real World Outcome
Run ./my_wc README.md and you see a single line with three aligned counts followed by the filename. Pipe data into the tool and it prints the totals without any filename. This is exactly the behavior you would see with the GNU wc command, making the tool a drop-in replacement for simple scripts.
4. Solution Architecture
4.1 High-Level Design
stdin/file -> buffered reader -> state machine -> counters -> formatted output
4.2 Key Components
| Component | Responsibility | Key Decisions |
|---|---|---|
| Input reader | Read bytes from stdin or file | Use stdio for simplicity |
| State machine | Track word boundaries | Whitespace detection with isspace |
| Counter aggregator | Maintain line/word/byte counts | Use 64-bit counters |
| Formatter | Print aligned columns | Match wc format |
4.3 Data Structures
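typedef struct {
    unsigned long long lines;  /* number of '\n' bytes seen */
    unsigned long long words;  /* number of transitions into a word */
    unsigned long long bytes;  /* total bytes read */
    int in_word;               /* 1 while inside a word, 0 otherwise */
} WcState;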
4.4 Algorithm Overview
Key Algorithm: Word boundary scan
- Read one byte at a time.
- Increment byte count always.
- If byte is \n, increment line count.
- If byte is whitespace and in_word is true, set in_word to false.
- If byte is non-whitespace and in_word is false, increment word count and set in_word to true.
Complexity Analysis:
- Time: O(n)
- Space: O(1)
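The scan described above can be sketched as follows. This is one possible shape, not the only one: the function names wc_step and wc_count_stream are hypothetical, and the WcState struct is repeated from section 4.3 so the block compiles on its own.

```c
#include <ctype.h>
#include <stdio.h>

typedef struct {
    unsigned long long lines;
    unsigned long long words;
    unsigned long long bytes;
    int in_word;
} WcState;

/* Feed one byte (0..255) into the counters. */
static void wc_step(WcState *s, int c)
{
    s->bytes++;
    if (c == '\n')
        s->lines++;
    if (isspace((unsigned char)c)) {
        s->in_word = 0;          /* leaving (or staying outside) a word */
    } else if (!s->in_word) {
        s->words++;              /* transition: outside -> inside */
        s->in_word = 1;
    }
}

/* Consume a stream until EOF, updating *s as bytes pass through. */
void wc_count_stream(FILE *fp, WcState *s)
{
    int c;                       /* int, so EOF stays distinguishable */
    while ((c = getc(fp)) != EOF)
        wc_step(s, c);
}
```

Because the word counter fires on the transition into a word, the last word is counted even when the input has no trailing newline.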
5. Implementation Guide
5.1 Development Environment Setup
cc -Wall -Wextra -O2 -o my_wc my_wc.c
5.2 Project Structure
my_wc/
├── src/
│ └── my_wc.c
├── tests/
│ └── test_samples.sh
├── Makefile
└── README.md
5.3 The Core Question You’re Answering
“How do I reliably count structure in a stream when I never see the whole file at once?”
The answer is to create a tiny state machine that updates counts as each byte passes through.
5.4 Concepts You Must Understand First
Stop and research these before coding:
- Standard Input vs Files
  - What does it mean to read from stdin?
  - How does stdin differ from a file opened with fopen()?
  - Book Reference: K&R Ch. 7
- Whitespace Classification
  - What characters count as whitespace in C?
  - Why should you use isspace() instead of checking only spaces?
- EOF and Return Values
  - Why does getc() return int?
  - How do you detect EOF without losing a valid byte?
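As a small illustration of the EOF question, the sketch below counts bytes with the result of getc() stored in an int. The function name count_bytes is hypothetical. The point is that EOF is a negative int, so it can never collide with a real byte value such as 0xFF.

```c
#include <stdio.h>

/* Hypothetical helper: count every byte in a stream.
 * getc() returns either a byte as an unsigned char converted to int
 * (0..255) or the negative value EOF, so the variable must be int. */
unsigned long long count_bytes(FILE *fp)
{
    unsigned long long n = 0;
    int c;                               /* int, not char */
    while ((c = getc(fp)) != EOF)
        n++;
    return n;
}
```

If c were declared as char, a 0xFF byte could compare equal to EOF (when char is signed) or EOF could never match at all (when char is unsigned).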
5.5 Questions to Guide Your Design
Before implementing, think through these:
- How will you handle files that cannot be opened?
- Should totals be printed even if one file fails?
- How will you align columns without knowing max width in advance?
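One possible answer to the alignment question is to collect all counts first, derive a column width from the largest number, and only then print. The sketch below assumes that approach; digit_width and format_row are hypothetical names, and the exact widths chosen by the system wc may differ.

```c
#include <stdio.h>

/* Number of decimal digits needed to print n. */
int digit_width(unsigned long long n)
{
    int width = 1;
    while (n >= 10) {
        n /= 10;
        width++;
    }
    return width;
}

/* Format one result row into buf using a shared column width.
 * name may be NULL when the input was stdin.
 * Returns what snprintf returns. */
int format_row(char *buf, size_t size, int width,
               unsigned long long lines, unsigned long long words,
               unsigned long long bytes, const char *name)
{
    if (name != NULL)
        return snprintf(buf, size, "%*llu %*llu %*llu %s",
                        width, lines, width, words, width, bytes, name);
    return snprintf(buf, size, "%*llu %*llu %*llu",
                    width, lines, width, words, width, bytes);
}
```

A caller would typically pass digit_width() of the largest total as width, so every row, including the total line, shares the same columns.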
5.6 Thinking Exercise
Trace a Small Input by Hand
Input: "Hi  there\n" (two spaces between the words)
Bytes: H i _ _ t h e r e \n
Questions while tracing:
- When do you increment the word count?
- What is the final line count?
- How many bytes are in the input?
5.7 The Interview Questions They’ll Ask
Prepare to answer these:
- “How would you count words in a stream without storing the whole input?”
- “Why does getc() return int and not char?”
- “What is the difference between a byte and a character?”
5.8 Hints in Layers
Hint 1: Start with lines
Count \n first and print only line counts.
Hint 2: Add the word state
Use an in_word flag to detect transitions.
Hint 3: Handle multiple files
Loop over argv, keep a total state, and print per-file counts.
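Hint 3 might look roughly like the sketch below. It is an illustration under assumptions: the names Counts, count_stream, and count_files are hypothetical, output is left unaligned for brevity, and the choice to keep going after a failed open and still print the total is one reasonable policy rather than a requirement stated here.

```c
#include <ctype.h>
#include <stdio.h>

typedef struct {
    unsigned long long lines, words, bytes;
} Counts;

/* Count one open stream with the in_word transition logic. */
static Counts count_stream(FILE *fp)
{
    Counts c = {0, 0, 0};
    int ch, in_word = 0;
    while ((ch = getc(fp)) != EOF) {
        c.bytes++;
        if (ch == '\n')
            c.lines++;
        if (isspace(ch)) {
            in_word = 0;
        } else if (!in_word) {
            in_word = 1;
            c.words++;
        }
    }
    return c;
}

/* Process each path, print one row per file, and add into *total.
 * Returns 0 if every file opened, 1 if any open failed. */
int count_files(int nfiles, char *const paths[], Counts *total)
{
    int status = 0;
    for (int i = 0; i < nfiles; i++) {
        FILE *fp = fopen(paths[i], "rb");
        if (fp == NULL) {
            perror(paths[i]);        /* report on stderr, keep going */
            status = 1;
            continue;
        }
        Counts c = count_stream(fp);
        fclose(fp);
        printf("%llu %llu %llu %s\n", c.lines, c.words, c.bytes, paths[i]);
        total->lines += c.lines;
        total->words += c.words;
        total->bytes += c.bytes;
    }
    if (nfiles > 1)
        printf("%llu %llu %llu total\n",
               total->lines, total->words, total->bytes);
    return status;
}
```

From main, this could be called as count_files(argc - 1, argv + 1, &total), with the return value used as the process exit code.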
5.9 Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Streams and stdio | “The C Programming Language” | Ch. 7 |
| CLI arguments | “The C Programming Language” | Ch. 5 |
| File I/O basics | “The Linux Programming Interface” | Ch. 5 |
5.10 Implementation Phases
Phase 1: Foundation (2-3 hours)
Goals:
- Read from stdin until EOF
- Count bytes and lines
Tasks:
- Implement a loop around getc().
- Increment counters and print results.
Checkpoint: Running with stdin prints accurate line and byte counts.
Phase 2: Core Functionality (3-5 hours)
Goals:
- Count words with state machine
- Add file input
Tasks:
- Add in_word logic.
- Add file loop using fopen() and fclose().
Checkpoint: Results match wc for simple files.
Phase 3: Polish & Edge Cases (2-3 hours)
Goals:
- Multiple file totals
- Error handling
Tasks:
- Track totals across files.
- Print errors to stderr and continue.
Checkpoint: Output matches wc formatting for multi-file input.
5.11 Key Implementation Decisions
| Decision | Options | Recommendation | Rationale |
|---|---|---|---|
| Input method | getc vs fread | getc | Simpler for state machine |
| Count type | int vs unsigned long long | unsigned long long | Avoid overflow on big files; matches the WcState fields |
6. Testing Strategy
6.1 Test Categories
| Category | Purpose | Examples |
|---|---|---|
| Unit Tests | Validate word boundary logic | Small in-memory strings |
| Integration Tests | Compare with system wc | diff <(wc file) <(my_wc file) |
| Edge Case Tests | Handle empty or whitespace-only files | Empty file, only newlines |
6.2 Critical Test Cases
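- Empty file: Should output 0 0 0.
- No trailing newline: Line count should reflect actual newlines, so a file containing only "one" reports 0 lines and 1 word.
- Multiple spaces and tabs: A run of whitespace separates words once; it must not produce empty words.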
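A unit test for the word boundary logic can run against in-memory strings, with no files involved. The sketch below assumes a hypothetical helper count_words_mem that applies the same in_word logic to a buffer, and checks it against the samples listed under Test Data plus a whitespace-only case.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: count words in an in-memory buffer. */
size_t count_words_mem(const char *buf, size_t len)
{
    size_t words = 0;
    int in_word = 0;
    for (size_t i = 0; i < len; i++) {
        if (isspace((unsigned char)buf[i])) {
            in_word = 0;
        } else if (!in_word) {
            in_word = 1;
            words++;
        }
    }
    return words;
}

/* Table-driven checks; aborts via assert on the first mismatch. */
void test_word_counts(void)
{
    static const struct {
        const char *text;
        size_t expected;
    } cases[] = {
        { "", 0 },
        { "one", 1 },
        { "two words", 2 },
        { "line1\nline2\n", 2 },
        { " \t\n", 0 },              /* whitespace only */
    };
    for (size_t i = 0; i < sizeof cases / sizeof cases[0]; i++)
        assert(count_words_mem(cases[i].text, strlen(cases[i].text))
               == cases[i].expected);
}
```

Keeping the counting logic in a function that takes a buffer makes it testable without touching stdin or the filesystem.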
6.3 Test Data
"" (empty)
"one"
"two words"
"line1\nline2\n"
7. Common Pitfalls & Debugging
7.1 Frequent Mistakes
| Pitfall | Symptom | Solution |
|---|---|---|
| Using char for getc | EOF never detected, or byte 0xFF mistaken for EOF | Store in int |
| Missing last word | Off-by-one word count | Count on transition into word |
| Not handling fopen errors | Crash on missing file | Check for NULL and perror |
7.2 Debugging Strategies
- Print state transitions when a word starts or ends.
- Compare against wc for known inputs.
7.3 Performance Traps
Using getc is fine for correctness, but you can experiment with fread for speed after it works.
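For the fread experiment, a chunked version might look like the sketch below. The name count_stream_chunked and the 64 KiB buffer size are arbitrary choices for illustration. The detail that is easy to miss is that in_word must live outside the chunk loop, since a word can straddle two chunks.

```c
#include <ctype.h>
#include <stdio.h>

typedef struct {
    unsigned long long lines, words, bytes;
} Counts;

/* Count a stream by reading fixed-size chunks with fread. */
Counts count_stream_chunked(FILE *fp)
{
    unsigned char buf[65536];        /* chunk size is an arbitrary choice */
    Counts c = {0, 0, 0};
    int in_word = 0;                 /* persists across chunk boundaries */
    size_t n;

    while ((n = fread(buf, 1, sizeof buf, fp)) > 0) {
        c.bytes += n;
        for (size_t i = 0; i < n; i++) {
            if (buf[i] == '\n')
                c.lines++;
            if (isspace(buf[i])) {
                in_word = 0;
            } else if (!in_word) {
                in_word = 1;
                c.words++;
            }
        }
    }
    return c;
}
```

Whether this is actually faster than getc on a given system is something to measure rather than assume.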
8. Extensions & Challenges
8.1 Beginner Extensions
- Add flags -l, -w, -c to select counts.
- Print a header row describing columns.
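One way to approach the flag extension is POSIX getopt. The sketch below is an assumption about how you might structure it: parse_flags and the SHOW_* constants are hypothetical names, and falling back to all three columns when no flag is given is a design choice made here.

```c
#define _POSIX_C_SOURCE 200809L      /* expose getopt under strict -std modes */
#include <unistd.h>                  /* getopt, optind (POSIX) */

enum { SHOW_LINES = 1, SHOW_WORDS = 2, SHOW_BYTES = 4 };

/* Parse -l, -w, -c into a bitmask stored in *mask.
 * Returns the index of the first non-option argument,
 * or -1 if an unknown flag was seen. */
int parse_flags(int argc, char *argv[], unsigned *mask)
{
    int opt;
    *mask = 0;
    while ((opt = getopt(argc, argv, "lwc")) != -1) {
        switch (opt) {
        case 'l': *mask |= SHOW_LINES; break;
        case 'w': *mask |= SHOW_WORDS; break;
        case 'c': *mask |= SHOW_BYTES; break;
        default:  return -1;         /* getopt already printed a message */
        }
    }
    if (*mask == 0)                  /* no flags: show everything */
        *mask = SHOW_LINES | SHOW_WORDS | SHOW_BYTES;
    return optind;
}
```

The remaining arguments, argv[result] through argv[argc - 1], are then the file paths to process.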
8.2 Intermediate Extensions
- Implement UTF-8 character counting.
- Support reading from a pipe and a list of files in one run.
8.3 Advanced Extensions
- Implement memory-mapped file counting (mmap).
- Make a parallel version that splits large files.
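As a starting point for the mmap extension, the sketch below counts only newlines in a file that is already open as a descriptor. The name count_lines_mmap is hypothetical, and the sketch assumes a POSIX system and a regular file; pipes and terminals cannot be mapped, so a real tool would need a fallback to the stream reader.

```c
#define _POSIX_C_SOURCE 200809L
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>

/* Count '\n' bytes in a regular file via a read-only mapping.
 * Returns -1 if fstat or mmap fails. */
long long count_lines_mmap(int fd)
{
    struct stat st;
    if (fstat(fd, &st) == -1)
        return -1;
    if (st.st_size == 0)
        return 0;                    /* mmap rejects a zero length */

    size_t len = (size_t)st.st_size;
    void *map = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
    if (map == MAP_FAILED)
        return -1;

    const unsigned char *p = map;
    long long lines = 0;
    for (size_t i = 0; i < len; i++) {
        if (p[i] == '\n')
            lines++;
    }
    munmap(map, len);
    return lines;
}
```

Extending this to words and bytes reuses the same in_word logic over the mapped region, and the byte count is simply st_size.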
9. Real-World Connections
9.1 Industry Applications
- Log analytics: Counting lines and tokens in log files.
- Data pipelines: Quick sanity checks on large datasets.
9.2 Related Open Source Projects
- GNU coreutils: The canonical implementation of wc.
9.3 Interview Relevance
Counting in a stream is a common way to test your understanding of state machines and I/O loops.
10. Resources
10.1 Essential Reading
- “The C Programming Language” by Kernighan and Ritchie - Ch. 5, 7
- “The Linux Programming Interface” by Michael Kerrisk - Ch. 5
10.2 Video Resources
- C stdio walkthroughs (any POSIX C tutorial series)
10.3 Tools & Documentation
- man 3 getc: Return values and EOF handling
- man 3 isspace: Whitespace classification
10.4 Related Projects in This Series
- my_cat: Teaches efficient read/write loops.
- my_grep: Builds on line-based parsing and matching.
11. Self-Assessment Checklist
11.1 Understanding
- I can explain how a word boundary is detected.
- I understand why EOF handling uses int.
- I can describe the difference between bytes and characters.
11.2 Implementation
- All functional requirements are met.
- Output matches wc for sample files.
- Error handling is correct.
11.3 Growth
- I can identify at least one optimization I would attempt next.
- I have documented lessons learned.
- I can explain this project in a job interview.
12. Submission / Completion Criteria
Minimum Viable Completion:
- Counts lines, words, and bytes from stdin.
- Counts lines, words, and bytes from one file.
- Matches wc for simple inputs.
Full Completion:
- Handles multiple files and prints totals.
- Produces formatted output aligned with GNU wc.
Excellence (Going Above & Beyond):
- Supports flags like -l, -w, -c.
- Includes automated tests comparing against system wc.
This guide was generated from LEARN_GNU_TOOLS_DEEP_DIVE.md. For the complete learning path, see the parent directory.