Project 1: my_wc - The Word Counter

Build a wc-style tool that counts lines, words, and bytes from stdin or files.

Quick Reference

Attribute | Value
Difficulty | Beginner
Time Estimate | 1 weekend
Language | C (Alternatives: Rust, Go, Python)
Prerequisites | Basic C, loops, conditionals, file I/O
Key Topics | Streams, state machines, stdio buffering

1. Learning Objectives

By completing this project, you will:

  1. Implement a streaming text processor that handles stdin and files.
  2. Design a simple state machine for word boundaries.
  3. Produce consistent CLI output and exit codes.
  4. Compare byte counting vs character counting in a real tool.

2. Theoretical Foundation

2.1 Core Concepts

  • Streams as unbounded byte sequences: Unix tools treat input as a stream of unknown length. Your program must not assume a fixed size and must handle EOF correctly.
  • Word boundaries as state: A word is a run of non-whitespace characters. Counting words is a state machine: you are either inside a word or outside a word.
  • Buffering layers: getchar() and fgetc() read from a user-space buffer managed by stdio; the kernel only sees the read() syscalls that refill that buffer. Buffering reduces syscall overhead.

2.2 Why This Matters

Most Unix tools are stream processors. If you can implement wc, you can implement log analyzers, metrics counters, and simple text transforms. This is a foundational mental model for high-performance CLI work.

2.3 Historical Context / Background

wc is one of the earliest Unix utilities. Its simplicity is intentional: one job, done well, composed with pipes. This tool embodies the Unix philosophy of composable filters.

2.4 Common Misconceptions

  • “A word is any token”: In wc, a word is a sequence of non-whitespace characters, not a language token.
  • “Bytes equal characters”: In UTF-8, one character can occupy multiple bytes. The byte count (wc -c) is not a character count; counting characters (as wc -m does) requires decoding multibyte sequences.
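
To see the byte/character gap concretely, the sketch below counts UTF-8 code points by skipping continuation bytes (those with the bit pattern 10xxxxxx). The function name count_utf8_chars is hypothetical, and the code assumes well-formed UTF-8; it does not validate the input.

```c
#include <stddef.h>

/* Count UTF-8 code points in buf[0..len-1].
 * Every code point has exactly one non-continuation byte, so counting
 * the bytes that do NOT match 10xxxxxx counts code points.
 * Assumption: the input is valid UTF-8; malformed input is not detected. */
size_t count_utf8_chars(const unsigned char *buf, size_t len)
{
    size_t chars = 0;
    for (size_t i = 0; i < len; i++) {
        if ((buf[i] & 0xC0) != 0x80)
            chars++;
    }
    return chars;
}
```

For the 6-byte string "h\xC3\xA9llo" (the word héllo), this returns 5, while the byte count stays 6.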

3. Project Specification

3.1 What You Will Build

A command-line tool my_wc that reads from stdin or from one or more files and prints counts for lines, words, and bytes. When multiple files are provided, it prints per-file counts and a total line.

3.2 Functional Requirements

  1. Count from stdin when no files are provided.
  2. Count per file when one or more file paths are provided.
  3. Print totals when multiple files are processed.
  4. Return a non-zero exit code if any file fails to open.

3.3 Non-Functional Requirements

  • Performance: Must handle large files without loading the entire file into memory.
  • Reliability: Properly handle files with no trailing newline.
  • Usability: Output format should match wc spacing and order.

3.4 Example Usage / Output

$ ./my_wc my_wc.c
  250  850 5000 my_wc.c

$ echo "hello world from my wc" | ./my_wc
    1   5   23

3.5 Real World Outcome

Run ./my_wc README.md and you see a single line with three aligned counts followed by the filename. Pipe data into the tool and it prints the counts without any filename. This mirrors the behavior of GNU wc, so the tool can stand in for it in simple scripts. Note that exact column widths vary between wc implementations and between file and pipe input.


4. Solution Architecture

4.1 High-Level Design

stdin/file -> buffered reader -> state machine -> counters -> formatted output

4.2 Key Components

Component | Responsibility | Key Decisions
Input reader | Read bytes from stdin or file | Use stdio for simplicity
State machine | Track word boundaries | Whitespace detection with isspace
Counter aggregator | Maintain line/word/byte counts | Use 64-bit counters
Formatter | Print aligned columns | Match wc format

4.3 Data Structures

typedef struct {
    unsigned long long lines;
    unsigned long long words;
    unsigned long long bytes;
    int in_word;
} WcState;

4.4 Algorithm Overview

Key Algorithm: Word boundary scan

  1. Read one byte at a time.
  2. Increment byte count always.
  3. If byte is \n, increment line count.
  4. If byte is whitespace and in_word is true, set in_word to false.
  5. If byte is non-whitespace and in_word is false, increment word count and set in_word to true.

Complexity Analysis:

  • Time: O(n)
  • Space: O(1)
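
The five steps above can be sketched as one function that is called once per byte. The name wc_feed is an assumption for illustration, and WcState is repeated from section 4.3 so the block compiles on its own. Note that isspace() is locale-dependent; in the default "C" locale it matches space, \t, \n, \v, \f, and \r.

```c
#include <ctype.h>

typedef struct {
    unsigned long long lines;
    unsigned long long words;
    unsigned long long bytes;
    int in_word;
} WcState;

/* Feed one byte through the scanner.
 * c must be in the range 0..255, which is what getc() returns
 * for every value other than EOF. */
void wc_feed(WcState *st, int c)
{
    st->bytes++;                 /* step 2: every byte counts */
    if (c == '\n')
        st->lines++;             /* step 3: newline ends a line */
    if (isspace(c)) {
        st->in_word = 0;         /* step 4: whitespace ends any word */
    } else if (!st->in_word) {
        st->words++;             /* step 5: first byte of a new word */
        st->in_word = 1;
    }
}
```

A caller would initialize the state with WcState st = {0}; and then call wc_feed(&st, c) for each byte read from the stream.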

5. Implementation Guide

5.1 Development Environment Setup

cc -Wall -Wextra -O2 -o my_wc my_wc.c

5.2 Project Structure

my_wc/
├── src/
│   └── my_wc.c
├── tests/
│   └── test_samples.sh
├── Makefile
└── README.md

5.3 The Core Question You’re Answering

“How do I reliably count structure in a stream when I never see the whole file at once?”

The answer is to create a tiny state machine that updates counts as each byte passes through.

5.4 Concepts You Must Understand First

Stop and research these before coding:

  1. Standard Input vs Files
    • What does it mean to read from stdin?
    • How does stdin differ from a file opened with fopen()?
    • Book Reference: K&R Ch. 7
  2. Whitespace Classification
    • What characters count as whitespace in C?
    • Why should you use isspace() instead of checking only spaces?
  3. EOF and Return Values
    • Why does getc() return int?
    • How do you detect EOF without losing a valid byte?
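
As a starting point for the EOF question, here is a minimal read loop. The function name count_bytes is hypothetical. The key detail is that the variable holding the result of getc() is an int: valid bytes come back as 0..255 and EOF is a negative value, so the two never collide.

```c
#include <stdio.h>

/* Count bytes until end of stream.
 * c is an int on purpose: if it were a char, the byte 0xFF could
 * compare equal to EOF (signed char) or EOF could never match
 * (unsigned char). */
unsigned long long count_bytes(FILE *fp)
{
    unsigned long long n = 0;
    int c;
    while ((c = getc(fp)) != EOF)
        n++;
    return n;
}
```

Calling count_bytes(stdin) consumes standard input to the end; a stream that contains the byte 0xFF is still counted in full.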

5.5 Questions to Guide Your Design

Before implementing, think through these:

  1. How will you handle files that cannot be opened?
  2. Should totals be printed even if one file fails?
  3. How will you align columns without knowing max width in advance?
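
One possible answer to the alignment question, offered as a sketch rather than a description of how GNU wc does it: collect all counts first, compute the number of digits in the largest value, and print every column with that width. The helper names digits and format_row are hypothetical.

```c
#include <stdio.h>

/* Number of decimal digits needed to print v (at least 1). */
int digits(unsigned long long v)
{
    int n = 1;
    while (v >= 10) {
        v /= 10;
        n++;
    }
    return n;
}

/* Write one output row into out, right-aligning each count in a
 * column of the given width. name may be NULL for stdin input.
 * Returns what snprintf returns. */
int format_row(char *out, size_t cap, int width,
               unsigned long long lines, unsigned long long words,
               unsigned long long bytes, const char *name)
{
    if (name != NULL)
        return snprintf(out, cap, "%*llu %*llu %*llu %s\n",
                        width, lines, width, words, width, bytes, name);
    return snprintf(out, cap, "%*llu %*llu %*llu\n",
                    width, lines, width, words, width, bytes);
}
```

The trade-off is that nothing can be printed until every file has been counted, so per-file results must be buffered. A fixed width avoids that at the cost of wider output for small files.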

5.6 Thinking Exercise

Trace a Small Input by Hand

Input: "Hi  there\n"
Bytes: H i _ _ t h e r e \n

Questions while tracing:

  • When do you increment the word count?
  • What is the final line count?
  • How many bytes are in the input?

5.7 The Interview Questions They’ll Ask

Prepare to answer these:

  1. “How would you count words in a stream without storing the whole input?”
  2. “Why does getc() return int and not char?”
  3. “What is the difference between a byte and a character?”

5.8 Hints in Layers

Hint 1: Start with lines. Count \n first and print only line counts.

Hint 2: Add the word state. Use an in_word flag to detect transitions.

Hint 3: Handle multiple files. Loop over argv, keep a total state, and print per-file counts.
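
Hint 3 could look roughly like the sketch below. The names Counts, count_stream, and count_files are hypothetical, and the fixed column width of 7 is an assumption chosen for simplicity. A real main() would pass argc - 1 and argv + 1, and use the return value as the exit code.

```c
#include <ctype.h>
#include <stdio.h>

typedef struct {
    unsigned long long lines, words, bytes;
} Counts;

/* Count one open stream. in_word is local, so every file starts
 * outside a word. */
static Counts count_stream(FILE *fp)
{
    Counts c = {0, 0, 0};
    int ch, in_word = 0;
    while ((ch = getc(fp)) != EOF) {
        c.bytes++;
        if (ch == '\n')
            c.lines++;
        if (isspace(ch))
            in_word = 0;
        else if (!in_word) {
            c.words++;
            in_word = 1;
        }
    }
    return c;
}

/* Count each path, print one row per readable file, and print a
 * total row when more than one path was given.
 * Returns 0 on success, 1 if any file could not be opened. */
int count_files(int n, char **paths, FILE *out)
{
    Counts total = {0, 0, 0};
    int status = 0;
    for (int i = 0; i < n; i++) {
        FILE *fp = fopen(paths[i], "rb");
        if (fp == NULL) {
            perror(paths[i]);   /* report on stderr and keep going */
            status = 1;
            continue;
        }
        Counts c = count_stream(fp);
        fclose(fp);
        fprintf(out, "%7llu %7llu %7llu %s\n",
                c.lines, c.words, c.bytes, paths[i]);
        total.lines += c.lines;
        total.words += c.words;
        total.bytes += c.bytes;
    }
    if (n > 1)
        fprintf(out, "%7llu %7llu %7llu total\n",
                total.lines, total.words, total.bytes);
    return status;
}
```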

5.9 Books That Will Help

Topic | Book | Chapter
Streams and stdio | “The C Programming Language” | Ch. 7
CLI arguments | “The C Programming Language” | Ch. 5
File I/O basics | “The Linux Programming Interface” | Ch. 5

5.10 Implementation Phases

Phase 1: Foundation (2-3 hours)

Goals:

  • Read from stdin until EOF
  • Count bytes and lines

Tasks:

  1. Implement a loop around getc().
  2. Increment counters and print results.

Checkpoint: Running with stdin prints accurate line and byte counts.

Phase 2: Core Functionality (3-5 hours)

Goals:

  • Count words with state machine
  • Add file input

Tasks:

  1. Add in_word logic.
  2. Add file loop using fopen() and fclose().

Checkpoint: Results match wc for simple files.

Phase 3: Polish & Edge Cases (2-3 hours)

Goals:

  • Multiple file totals
  • Error handling

Tasks:

  1. Track totals across files.
  2. Print errors to stderr and continue.

Checkpoint: Output matches wc formatting for multi-file input.

5.11 Key Implementation Decisions

Decision | Options | Recommendation | Rationale
Input method | getc vs fread | getc | Simpler for state machine
Count type | int vs long long | long long | Avoid overflow on big files

6. Testing Strategy

6.1 Test Categories

Category | Purpose | Examples
Unit Tests | Validate word boundary logic | Small in-memory strings
Integration Tests | Compare with system wc | diff <(wc file) <(my_wc file)
Edge Case Tests | Handle empty or whitespace-only files | Empty file, only newlines

6.2 Critical Test Cases

  1. Empty file: Should output 0 0 0.
  2. No trailing newline: The line count equals the number of \n bytes, so a file containing only "one" reports 0 lines but 1 word.
  3. Multiple spaces and tabs: Words should be counted correctly.

6.3 Test Data

"" (empty)
"one"
"two words"
"line1\nline2\n"

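The sample strings in 6.3 can drive a small table-driven unit test. The sketch below is one way to set that up; count_buffer and run_sample_tests are hypothetical names, and the expected values follow from the definitions in this guide (lines are \n bytes, words are runs of non-whitespace).

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

typedef struct {
    unsigned long long lines, words, bytes;
} Counts;

/* Same scan as the streaming version, applied to an in-memory buffer. */
Counts count_buffer(const char *buf, size_t len)
{
    Counts c = {0, 0, 0};
    int in_word = 0;
    for (size_t i = 0; i < len; i++) {
        unsigned char ch = (unsigned char)buf[i];
        c.bytes++;
        if (ch == '\n')
            c.lines++;
        if (isspace(ch))
            in_word = 0;
        else if (!in_word) {
            c.words++;
            in_word = 1;
        }
    }
    return c;
}

/* Run every sample case; returns the number of failures. */
int run_sample_tests(void)
{
    static const struct {
        const char *input;
        Counts want;
    } cases[] = {
        { "",               {0, 0, 0}  },
        { "one",            {0, 1, 3}  },
        { "two words",      {0, 2, 9}  },
        { "line1\nline2\n", {2, 2, 12} },
    };
    int failures = 0;
    for (size_t i = 0; i < sizeof cases / sizeof cases[0]; i++) {
        Counts got = count_buffer(cases[i].input, strlen(cases[i].input));
        if (got.lines != cases[i].want.lines ||
            got.words != cases[i].want.words ||
            got.bytes != cases[i].want.bytes) {
            fprintf(stderr, "case %zu failed\n", i);
            failures++;
        }
    }
    return failures;
}
```
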
7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

Pitfall | Symptom | Solution
Using char for getc | EOF never detected (unsigned char), or byte 0xFF mistaken for EOF (signed char) | Store the result in an int
Missing last word | Off-by-one word count | Count on transition into word
Not handling fopen errors | Crash on missing file | Check for NULL and call perror

7.2 Debugging Strategies

  • Print state transitions when a word starts or ends.
  • Compare against wc for known inputs.

7.3 Performance Traps

Using getc is fine for correctness, but you can experiment with fread for speed after it works.
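
If you do try fread, a chunked version might look like the following sketch. The function name count_stream_chunked and the 64 KiB buffer size are assumptions, not measured optima. The one subtle point is that the in_word flag must live outside the chunk loop, because a word can straddle two chunks.

```c
#include <ctype.h>
#include <stdio.h>

typedef struct {
    unsigned long long lines, words, bytes;
} Counts;

/* Count a stream by reading it in fixed-size chunks with fread. */
Counts count_stream_chunked(FILE *fp)
{
    unsigned char buf[65536];
    Counts c = {0, 0, 0};
    int in_word = 0;    /* persists across chunks on purpose */
    size_t n;

    while ((n = fread(buf, 1, sizeof buf, fp)) > 0) {
        c.bytes += n;
        for (size_t i = 0; i < n; i++) {
            if (buf[i] == '\n')
                c.lines++;
            if (isspace(buf[i]))
                in_word = 0;
            else if (!in_word) {
                c.words++;
                in_word = 1;
            }
        }
    }
    return c;
}
```

Whether this is actually faster than getc depends on the platform's stdio implementation, so measure both before committing to it.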


8. Extensions & Challenges

8.1 Beginner Extensions

  • Add flags -l, -w, -c to select counts.
  • Print a header row describing columns.

8.2 Intermediate Extensions

  • Implement UTF-8 character counting.
  • Support reading from a pipe and a list of files in one run.

8.3 Advanced Extensions

  • Implement memory-mapped file counting (mmap).
  • Make a parallel version that splits large files.
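
As a starting point for the mmap extension, the sketch below counts only newlines in a memory-mapped regular file. It assumes a POSIX system; the function name count_lines_mmap is hypothetical. Pipes and terminals cannot be mapped, so a full tool would still need the streaming path as a fallback.

```c
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Count '\n' bytes in a regular file by mapping it into memory.
 * Returns the line count, or -1 on any error. */
long long count_lines_mmap(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat sb;
    if (fstat(fd, &sb) < 0) {
        close(fd);
        return -1;
    }
    if (sb.st_size == 0) {      /* mmap rejects a zero-length mapping */
        close(fd);
        return 0;
    }

    const unsigned char *p =
        mmap(NULL, (size_t)sb.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                  /* the mapping stays valid after close */
    if (p == MAP_FAILED)
        return -1;

    long long lines = 0;
    for (off_t i = 0; i < sb.st_size; i++) {
        if (p[i] == '\n')
            lines++;
    }
    munmap((void *)p, (size_t)sb.st_size);
    return lines;
}
```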

9. Real-World Connections

9.1 Industry Applications

  • Log analytics: Counting lines and tokens in log files.
  • Data pipelines: Quick sanity checks on large datasets.
  • GNU coreutils: The canonical implementation of wc.

9.3 Interview Relevance

Counting in a stream is a common way to test your understanding of state machines and I/O loops.


10. Resources

10.1 Essential Reading

  • “The C Programming Language” by Kernighan and Ritchie - Ch. 5, 7
  • “The Linux Programming Interface” by Michael Kerrisk - Ch. 5

10.2 Video Resources

  • C stdio walkthroughs (any POSIX C tutorial series)

10.3 Tools & Documentation

  • man 3 getc: Return values and EOF handling
  • man 3 isspace: Whitespace classification
  • my_cat: Teaches efficient read/write loops.
  • my_grep: Builds on line-based parsing and matching.

11. Self-Assessment Checklist

11.1 Understanding

  • I can explain how a word boundary is detected.
  • I understand why EOF handling uses int.
  • I can describe the difference between bytes and characters.

11.2 Implementation

  • All functional requirements are met.
  • Output matches wc for sample files.
  • Error handling is correct.

11.3 Growth

  • I can identify at least one optimization I would attempt next.
  • I have documented lessons learned.
  • I can explain this project in a job interview.

12. Submission / Completion Criteria

Minimum Viable Completion:

  • Counts lines, words, and bytes from stdin.
  • Counts lines, words, and bytes from one file.
  • Matches wc for simple inputs.

Full Completion:

  • Handles multiple files and prints totals.
  • Produces formatted output aligned with GNU wc.

Excellence (Going Above & Beyond):

  • Supports flags like -l, -w, -c.
  • Includes automated tests comparing against system wc.

This guide was generated from LEARN_GNU_TOOLS_DEEP_DIVE.md. For the complete learning path, see the parent directory.