Project 1: my_wc - The Word Counter

Build a wc-style tool that counts lines, words, and bytes from stdin or files.

Quick Reference

Attribute	Value
Difficulty	Beginner
Time Estimate	1 weekend
Language	C (Alternatives: Rust, Go, Python)
Prerequisites	Basic C, loops, conditionals, file I/O
Key Topics	Streams, state machines, stdio buffering

1. Learning Objectives

By completing this project, you will:

Implement a streaming text processor that handles stdin and files.
Design a simple state machine for word boundaries.
Produce consistent CLI output and exit codes.
Compare byte counting vs character counting in a real tool.

2. Theoretical Foundation

2.1 Core Concepts

Streams as unbounded byte sequences: Unix tools treat input as a stream whose length is not known in advance. Your program should not assume a fixed size and must handle EOF correctly.
Word boundaries as state: A word is a run of non-whitespace characters. Counting words is a state machine: you are either inside a word or outside a word.
Buffering layers: getchar() and fgetc() read from a user-space buffer that stdio refills with read() syscalls, so the kernel sees one large read per buffer instead of one call per byte. Buffering reduces syscall overhead.

2.2 Why This Matters

Most Unix tools are stream processors. If you can implement wc, you can implement log analyzers, metrics counters, and simple text transforms. This is a foundational mental model for high-performance CLI work.

2.3 Historical Context / Background

wc is one of the earliest Unix utilities. Its simplicity is intentional: one job, done well, composed with pipes. This tool embodies the Unix philosophy of composable filters.

2.4 Common Misconceptions

“A word is any token”: In wc, a word is a sequence of non-whitespace characters, not a language token.
“Bytes equal characters”: In UTF-8, one character can occupy multiple bytes. The byte count (wc -c) and the character count (wc -m) differ for such input, and this project counts bytes unless you implement multibyte logic.
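To make the byte/character distinction concrete, here is a minimal sketch of counting UTF-8 code points. The function name count_utf8_chars is hypothetical, and the sketch assumes the input is valid UTF-8; it counts code points, not user-perceived characters.

```c
#include <stddef.h>

/* Count UTF-8 code points in buf[0..n) by skipping continuation bytes.
 * Continuation bytes have the bit pattern 10xxxxxx; every other byte
 * starts a new code point. Assumes the input is valid UTF-8. */
size_t count_utf8_chars(const char *buf, size_t n)
{
    size_t chars = 0;
    for (size_t i = 0; i < n; i++) {
        unsigned char b = (unsigned char)buf[i];
        if ((b & 0xC0) != 0x80)     /* not a continuation byte */
            chars++;
    }
    return chars;
}
```

For the 6-byte string "héllo" (where é is encoded as the two bytes 0xC3 0xA9), this returns 5.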

3. Project Specification

3.1 What You Will Build

A command-line tool my_wc that reads from stdin or from one or more files and prints counts for lines, words, and bytes. When multiple files are provided, it prints per-file counts and a total line.

3.2 Functional Requirements

Count from stdin when no files are provided.
Count per file when one or more file paths are provided.
Print totals when multiple files are processed.
Return a non-zero exit code if any file fails to open.

3.3 Non-Functional Requirements

Performance: Must handle large files without loading the entire file into memory.
Reliability: Properly handle files with no trailing newline.
Usability: Output format should match wc spacing and order.

3.4 Example Usage / Output

$ ./my_wc my_wc.c
  250  850 5000 my_wc.c

$ echo "hello world from my wc" | ./my_wc
    1   5   23

3.5 Real World Outcome

Run ./my_wc README.md and you see a single line with three aligned counts followed by the filename. Pipe data into the tool and it prints the counts without any filename. This mirrors the default behavior of GNU wc, so the tool can stand in for wc in simple scripts that pass no flags.
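One possible way to produce that output line is sketched below. The function format_counts and its width parameter are assumptions made for illustration; the sketch right-aligns each count in a caller-chosen minimum width and does not try to reproduce how GNU wc picks its column widths.

```c
#include <stdio.h>

/* Format one wc-style output line into out (capacity cap).
 * width is the minimum column width; name may be NULL for stdin.
 * Returns the value of snprintf: the length the full line needs. */
int format_counts(char *out, size_t cap,
                  unsigned long long lines,
                  unsigned long long words,
                  unsigned long long bytes,
                  int width, const char *name)
{
    if (name != NULL)
        return snprintf(out, cap, "%*llu %*llu %*llu %s\n",
                        width, lines, width, words, width, bytes, name);
    return snprintf(out, cap, "%*llu %*llu %*llu\n",
                    width, lines, width, words, width, bytes);
}
```

Formatting into a buffer instead of printing directly keeps the formatter easy to test; the caller can pass the result to fputs.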

4. Solution Architecture

4.1 High-Level Design

stdin/file -> buffered reader -> state machine -> counters -> formatted output

4.2 Key Components

Component	Responsibility	Key Decisions
Input reader	Read bytes from stdin or file	Use stdio for simplicity
State machine	Track word boundaries	Whitespace detection with `isspace`
Counter aggregator	Maintain line/word/byte counts	Use 64-bit counters
Formatter	Print aligned columns	Match `wc` format

4.3 Data Structures

typedef struct {
    unsigned long long lines;
    unsigned long long words;
    unsigned long long bytes;
    int in_word;
} WcState;

4.4 Algorithm Overview

Key Algorithm: Word boundary scan

Read one byte at a time.
Increment byte count always.
If byte is \n, increment line count.
If byte is whitespace and in_word is true, set in_word to false.
If byte is non-whitespace and in_word is false, increment word count and set in_word to true.

Complexity Analysis:

Time: O(n)
Space: O(1)
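The steps above can be sketched in C as follows. This is one straightforward reading of the algorithm, not the only valid structure; the helper names wc_update, wc_count_stream, and wc_count_buffer are hypothetical.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdio.h>

typedef struct {
    unsigned long long lines;
    unsigned long long words;
    unsigned long long bytes;
    int in_word;
} WcState;

/* Advance the state machine by one byte. */
void wc_update(WcState *st, unsigned char byte)
{
    st->bytes++;
    if (byte == '\n')
        st->lines++;
    if (isspace(byte)) {
        st->in_word = 0;            /* leaving (or staying outside) a word */
    } else if (!st->in_word) {
        st->in_word = 1;            /* transition outside -> inside */
        st->words++;
    }
}

/* Feed a whole stream through the state machine.
 * Returns 0 on success, -1 if a read error occurred. */
int wc_count_stream(FILE *fp, WcState *st)
{
    int c;                          /* int, so EOF is distinguishable */
    while ((c = getc(fp)) != EOF)
        wc_update(st, (unsigned char)c);
    return ferror(fp) ? -1 : 0;
}

/* Convenience wrapper for in-memory data (hypothetical test helper). */
WcState wc_count_buffer(const char *buf, size_t n)
{
    WcState st = {0, 0, 0, 0};
    for (size_t i = 0; i < n; i++)
        wc_update(&st, (unsigned char)buf[i]);
    return st;
}
```

Counting the word on the transition into it, not on the way out, means the last word is counted even when the input ends without trailing whitespace.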

5. Implementation Guide

5.1 Development Environment Setup

cc -Wall -Wextra -O2 -o my_wc my_wc.c

5.2 Project Structure

my_wc/
├── src/
│   └── my_wc.c
├── tests/
│   └── test_samples.sh
├── Makefile
└── README.md

5.3 The Core Question You’re Answering

“How do I reliably count structure in a stream when I never see the whole file at once?”

The answer is to create a tiny state machine that updates counts as each byte passes through.

5.4 Concepts You Must Understand First

Stop and research these before coding:

Standard Input vs Files
- What does it mean to read from stdin?
- How does stdin differ from a file opened with fopen()?
- Book Reference: K&R Ch. 7
Whitespace Classification
- What characters count as whitespace in C?
- Why should you use isspace() instead of checking only spaces?
EOF and Return Values
- Why does getc() return int?
- How do you detect EOF without losing a valid byte?
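As a small illustration of the EOF question, the sketch below counts bytes with the result of getc() stored in an int. The function name count_bytes is hypothetical.

```c
#include <stdio.h>

/* Count bytes until EOF. The result of getc() is kept in an int so that
 * the byte 0xFF (255) stays distinct from EOF (a negative value). */
unsigned long long count_bytes(FILE *fp)
{
    unsigned long long n = 0;
    int c;                 /* NOT char: see the comment above */
    while ((c = getc(fp)) != EOF)
        n++;
    return n;
}
```

If c were declared char, the behavior would depend on whether char is signed: a signed char turns byte 0xFF into a negative value that can compare equal to EOF and end the loop early, while an unsigned char can never equal EOF, so the loop never ends.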

5.5 Questions to Guide Your Design

Before implementing, think through these:

How will you handle files that cannot be opened?
Should totals be printed even if one file fails?
How will you align columns without knowing max width in advance?

5.6 Thinking Exercise

Trace a Small Input by Hand

Input: "Hi  there\n"
Bytes: H i _ _ t h e r e \n

Questions while tracing:

When do you increment the word count?
What is the final line count?
How many bytes are in the input?

5.7 The Interview Questions They’ll Ask

Prepare to answer these:

“How would you count words in a stream without storing the whole input?”
“Why does getc() return int and not char?”
“What is the difference between a byte and a character?”

5.8 Hints in Layers

Hint 1: Start with lines. Count \n first and print only line counts.

Hint 2: Add the word state. Use an in_word flag to detect transitions.

Hint 3: Handle multiple files. Loop over argv, keep a total state, and print per-file counts.

5.9 Books That Will Help

Topic	Book	Chapter
Streams and stdio	“The C Programming Language”	Ch. 7
CLI arguments	“The C Programming Language”	Ch. 5
File I/O basics	“The Linux Programming Interface”	Ch. 5

5.10 Implementation Phases

Phase 1: Foundation (2-3 hours)

Goals:

Read from stdin until EOF
Count bytes and lines

Tasks:

Implement a loop around getc().
Increment counters and print results.

Checkpoint: Running with stdin prints accurate line and byte counts.

Phase 2: Core Functionality (3-5 hours)

Goals:

Count words with state machine
Add file input

Tasks:

Add in_word logic.
Add file loop using fopen() and fclose().

Checkpoint: Results match wc for simple files.

Phase 3: Polish & Edge Cases (2-3 hours)

Goals:

Multiple file totals
Error handling

Tasks:

Track totals across files.
Print errors to stderr and continue.

Checkpoint: Output matches wc formatting for multi-file input.
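A minimal sketch of the Phase 3 loop follows, assuming a fixed column width of 7 and that the total line is still printed when some file fails. Both are design choices for this sketch, and the names Counts, count_stream, print_counts, and process_files are hypothetical.

```c
#include <ctype.h>
#include <stdio.h>

typedef struct {
    unsigned long long lines, words, bytes;
} Counts;

/* Count one open stream; returns 0 on success, -1 on read error. */
static int count_stream(FILE *fp, Counts *out)
{
    int c, in_word = 0;
    while ((c = getc(fp)) != EOF) {
        out->bytes++;
        if (c == '\n') out->lines++;
        if (isspace(c)) in_word = 0;
        else if (!in_word) { in_word = 1; out->words++; }
    }
    return ferror(fp) ? -1 : 0;
}

static void print_counts(const Counts *c, const char *name)
{
    printf("%7llu %7llu %7llu %s\n", c->lines, c->words, c->bytes, name);
}

/* Process paths[0..n). Returns the exit status: 0 if every file was
 * counted, 1 if any file failed. */
int process_files(int n, char **paths)
{
    Counts total = {0, 0, 0};
    int status = 0;

    for (int i = 0; i < n; i++) {
        FILE *fp = fopen(paths[i], "rb");
        if (fp == NULL) {
            perror(paths[i]);       /* report on stderr, keep going */
            status = 1;
            continue;
        }
        Counts c = {0, 0, 0};
        if (count_stream(fp, &c) != 0) {
            perror(paths[i]);
            status = 1;
        }
        fclose(fp);
        print_counts(&c, paths[i]);
        total.lines += c.lines;
        total.words += c.words;
        total.bytes += c.bytes;
    }
    if (n > 1)
        print_counts(&total, "total");
    return status;
}
```

A main function could call process_files(argc - 1, argv + 1) and return its result, so a failed open yields a non-zero exit code while the remaining files are still counted.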

5.11 Key Implementation Decisions

Decision	Options	Recommendation	Rationale
Input method	`getc` vs `fread`	`getc`	Simpler for state machine
Count type	`int` vs `unsigned long long`	`unsigned long long`	Avoid overflow on big files

6. Testing Strategy

6.1 Test Categories

Category	Purpose	Examples
Unit Tests	Validate word boundary logic	Small in-memory strings
Integration Tests	Compare with system `wc`	`diff <(wc file) <(./my_wc file)`
Edge Case Tests	Handle empty or whitespace-only files	Empty file, only newlines

6.2 Critical Test Cases

Empty file: Should output 0 0 0.
No trailing newline: The line count equals the number of \n bytes, so a file containing only "one" reports 0 lines and 1 word.
Multiple spaces and tabs: Words should be counted correctly.
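The critical cases above can be written as a table of inputs and expected counts. The sketch below uses a small in-memory counter (count_string, a hypothetical helper) so the expectations can be checked without touching the filesystem; the expected values were worked out by hand from the definitions of line, word, and byte used in this guide.

```c
#include <ctype.h>
#include <stddef.h>

typedef struct { unsigned long long lines, words, bytes; } Counts;

/* Minimal in-memory counter used only to check expectations. */
static Counts count_string(const char *s)
{
    Counts c = {0, 0, 0};
    int in_word = 0;
    for (; *s != '\0'; s++) {
        unsigned char b = (unsigned char)*s;
        c.bytes++;
        if (b == '\n') c.lines++;
        if (isspace(b)) in_word = 0;
        else if (!in_word) { in_word = 1; c.words++; }
    }
    return c;
}

/* Returns the number of failing cases (0 means all passed). */
int run_sample_cases(void)
{
    static const struct { const char *input; Counts want; } cases[] = {
        { "",               {0, 0, 0}  },   /* empty input            */
        { "one",            {0, 1, 3}  },   /* no trailing newline    */
        { "a \t  b\n",      {1, 2, 7}  },   /* mixed spaces and tabs  */
        { "line1\nline2\n", {2, 2, 12} },   /* two complete lines     */
    };
    int failures = 0;
    for (size_t i = 0; i < sizeof cases / sizeof cases[0]; i++) {
        Counts got = count_string(cases[i].input);
        if (got.lines != cases[i].want.lines ||
            got.words != cases[i].want.words ||
            got.bytes != cases[i].want.bytes)
            failures++;
    }
    return failures;
}
```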

6.3 Test Data

"" (empty)
"one"
"two words"
"line1\nline2\n"

7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

Pitfall	Symptom	Solution
Using `char` for `getc`	EOF never detected	Store in `int`
Missing last word	Off-by-one word count	Count on transition into word
Not handling `fopen` errors	Crash on missing file	Check for NULL and `perror`

7.2 Debugging Strategies

Print state transitions when a word starts or ends.
Compare against wc for known inputs.
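One way to print state transitions is sketched below. The function trace_words is hypothetical; it logs each word start and end with its byte offset to a stream of your choice, such as stderr, and returns the word count.

```c
#include <ctype.h>
#include <stdio.h>

/* Scan s and log every word start/end with its byte offset.
 * Returns the number of words found. */
unsigned long long trace_words(const char *s, FILE *log)
{
    unsigned long long words = 0;
    int in_word = 0;
    size_t i;
    for (i = 0; s[i] != '\0'; i++) {
        int space = isspace((unsigned char)s[i]);
        if (!space && !in_word) {
            in_word = 1;
            words++;
            fprintf(log, "offset %zu: word %llu starts\n", i, words);
        } else if (space && in_word) {
            in_word = 0;
            fprintf(log, "offset %zu: word %llu ends\n", i, words);
        }
    }
    if (in_word)
        fprintf(log, "offset %zu: word %llu ends at end of input\n", i, words);
    return words;
}
```

Comparing the logged offsets with a hand trace of the same input usually shows quickly whether a word is being missed or counted twice.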

7.3 Performance Traps

Using getc is fine for correctness, but you can experiment with fread for speed after it works.
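If you try the fread variant, a minimal sketch might look like the following. The 64 KiB chunk size is an arbitrary assumption and the name wc_count_chunked is hypothetical; measure before concluding that it is faster on your system.

```c
#include <ctype.h>
#include <stdio.h>

typedef struct {
    unsigned long long lines, words, bytes;
    int in_word;
} WcState;

/* Read fp in fixed-size chunks; the state persists across chunks so a
 * word split over a chunk boundary is still counted once.
 * Returns 0 on success, -1 on read error. */
int wc_count_chunked(FILE *fp, WcState *st)
{
    unsigned char buf[65536];       /* chunk size is an arbitrary choice */
    size_t n;

    while ((n = fread(buf, 1, sizeof buf, fp)) > 0) {
        st->bytes += n;
        for (size_t i = 0; i < n; i++) {
            unsigned char b = buf[i];
            if (b == '\n') st->lines++;
            if (isspace(b)) st->in_word = 0;
            else if (!st->in_word) { st->in_word = 1; st->words++; }
        }
    }
    return ferror(fp) ? -1 : 0;
}
```

The key detail is that in_word lives in the state struct and not in a local variable reset per chunk; otherwise a word straddling two chunks would be counted twice.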

8. Extensions & Challenges

8.1 Beginner Extensions

Add flags -l, -w, -c to select counts.
Print a header row describing columns.
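For the flag extension, a possible starting point is POSIX getopt() from <unistd.h>. In the sketch below, the SHOW_* constants and parse_flags are hypothetical names; selecting all three columns when no flag is given follows the default described in this guide.

```c
#define _POSIX_C_SOURCE 200809L
#include <unistd.h>

enum { SHOW_LINES = 1, SHOW_WORDS = 2, SHOW_BYTES = 4 };

/* Parse -l, -w, -c. On success returns the index of the first non-option
 * argument and stores the selected columns in *mask; returns -1 on an
 * unknown option. With no flags, all three columns are selected. */
int parse_flags(int argc, char **argv, unsigned *mask)
{
    int opt;
    *mask = 0;
    optind = 1;                     /* restart scanning (matters if called twice) */
    while ((opt = getopt(argc, argv, "lwc")) != -1) {
        switch (opt) {
        case 'l': *mask |= SHOW_LINES; break;
        case 'w': *mask |= SHOW_WORDS; break;
        case 'c': *mask |= SHOW_BYTES; break;
        default:  return -1;        /* getopt already printed a message */
        }
    }
    if (*mask == 0)
        *mask = SHOW_LINES | SHOW_WORDS | SHOW_BYTES;
    return optind;
}
```

The returned index tells the caller where the file arguments begin, so the multi-file loop can start at argv + index.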

8.2 Intermediate Extensions

Implement UTF-8 character counting.
Support reading from a pipe and a list of files in one run (GNU wc does this by treating a - argument as standard input).

8.3 Advanced Extensions

Implement memory-mapped file counting (mmap).
Make a parallel version that splits large files.
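As a starting point for the mmap extension, here is a hedged, POSIX-only sketch that counts newlines in a regular file. The function name count_lines_mmap is hypothetical, and the sketch deliberately covers only line counting; pipes and other non-regular inputs still need the stream-based path.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Count '\n' bytes in the file at path by mapping it read-only.
 * Returns 0 on success and stores the result in *lines; returns -1 on
 * any error (errno is set by the failing call). POSIX-only. */
int count_lines_mmap(const char *path, unsigned long long *lines)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat sb;
    if (fstat(fd, &sb) < 0) {
        close(fd);
        return -1;
    }

    *lines = 0;
    if (sb.st_size == 0) {          /* mmap rejects zero-length mappings */
        close(fd);
        return 0;
    }

    size_t len = (size_t)sb.st_size;
    char *data = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                      /* the mapping stays valid after close */
    if (data == MAP_FAILED)
        return -1;

    const char *p = data, *end = data + len;
    while ((p = memchr(p, '\n', (size_t)(end - p))) != NULL) {
        (*lines)++;
        p++;                        /* continue after this newline */
    }
    munmap(data, len);
    return 0;
}
```

Using memchr to jump between newlines leaves the scanning to the C library, which is typically faster than a byte-by-byte loop; a parallel version could split the mapped range into slices and sum the per-slice counts.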

9. Real-World Connections

9.1 Industry Applications

Log analytics: Counting lines and tokens in log files.
Data pipelines: Quick sanity checks on large datasets.

9.2 Related Open Source Projects

GNU coreutils: The canonical implementation of wc.

9.3 Interview Relevance

Counting in a stream is a common way to test your understanding of state machines and I/O loops.

10. Resources

10.1 Essential Reading

“The C Programming Language” by Kernighan and Ritchie - Ch. 5, 7
“The Linux Programming Interface” by Michael Kerrisk - Ch. 5

10.2 Video Resources

C stdio walkthroughs (any POSIX C tutorial series)

10.3 Tools & Documentation

man 3 getc: Return values and EOF handling
man 3 isspace: Whitespace classification

10.4 Related Projects in This Series

my_cat: Teaches efficient read/write loops.
my_grep: Builds on line-based parsing and matching.

11. Self-Assessment Checklist

11.1 Understanding

I can explain how a word boundary is detected.
I understand why EOF handling uses int.
I can describe the difference between bytes and characters.

11.2 Implementation

All functional requirements are met.
Output matches wc for sample files.
Error handling is correct.

11.3 Growth

I can identify at least one optimization I would attempt next.
I have documented lessons learned.
I can explain this project in a job interview.

12. Submission / Completion Criteria

Minimum Viable Completion:

Counts lines, words, and bytes from stdin.
Counts lines, words, and bytes from one file.
Matches wc for simple inputs.

Full Completion:

Handles multiple files and prints totals.
Produces formatted output aligned with GNU wc.

Excellence (Going Above & Beyond):

Supports flags like -l, -w, -c.
Includes automated tests comparing against system wc.

This guide was generated from LEARN_GNU_TOOLS_DEEP_DIVE.md. For the complete learning path, see the parent directory.
