Project 2: my_cat - The Concatenator

Build a cat-style utility that copies input to stdout using both stdio and raw syscalls.

Quick Reference

Attribute      | Value
Difficulty     | Beginner
Time Estimate  | A few hours
Language       | C (Alternatives: Rust, Go)
Prerequisites  | Project 1 or equivalent C I/O experience
Key Topics     | Syscalls, buffering, error handling

1. Learning Objectives

By completing this project, you will:

  1. Implement the canonical read/write loop for file copying.
  2. Compare stdio buffering with raw syscalls.
  3. Handle partial writes and system call errors.
  4. Design a CLI that supports stdin and multiple files.

2. Theoretical Foundation

2.1 Core Concepts

  • File descriptors: open() returns an integer handle. read() and write() operate on these low-level identifiers.
  • Buffering: The C library (stdio) collects reads and writes in a user-space buffer to reduce the number of syscalls. Raw read() and write() calls go straight to the kernel, so you allocate and manage the buffer yourself.
  • Partial writes: write() may write fewer bytes than requested, especially to pipes or sockets.
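
The partial-write point can be made concrete with a small helper. This is a minimal sketch, not the required design: the name write_all and its return convention are assumptions chosen for illustration. It keeps calling write() until every byte has been accepted, and retries when a signal interrupts the call (EINTR).

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper: write exactly len bytes to fd, retrying after
 * short writes. Returns 0 on success, -1 on error (errno is set). */
int write_all(int fd, const char *buf, size_t len)
{
    size_t done = 0;

    while (done < len) {
        ssize_t n = write(fd, buf + done, len - done);
        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted before any byte was written: retry */
            return -1;          /* real error: let the caller report it */
        }
        done += (size_t)n;      /* skip past the bytes the kernel accepted */
    }
    return 0;
}
```

A caller would typically use it as write_all(STDOUT_FILENO, buf, nread) right after a successful read().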

2.2 Why This Matters

Most Unix filters reduce to a read/transform/write loop. If you can implement cat correctly, you have the skeleton for compressors, filters, and other stream transformers.

2.3 Historical Context / Background

cat is short for “concatenate.” It was designed to glue files together and to serve as the simplest possible stream tool in pipelines.

2.4 Common Misconceptions

  • “read() returns exactly what I asked for”: It can return fewer bytes. You must loop.
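  • “stdio and syscalls are the same”: stdio adds a user-space buffer on top of the syscalls. That changes the number of syscalls issued, and a failed write may not be reported until the buffer is flushed by fflush() or fclose().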

3. Project Specification

3.1 What You Will Build

A command-line tool my_cat that prints file contents to stdout. If no files are provided, it reads from stdin. Include two implementations: stdio-based and syscall-based.

3.2 Functional Requirements

  1. Print stdin when no files are given.
  2. Print each file in order when multiple files are given.
  3. Handle errors for missing files while continuing to process others.
  4. Support a syscall mode (flag or separate build).

3.3 Non-Functional Requirements

  • Performance: Use an efficient buffer size (4 KB or larger).
  • Reliability: Handle partial writes and EOF correctly.
  • Correctness: Output must be byte-for-byte identical to input, including binary data and files with no trailing newline.

3.4 Example Usage / Output

# Print a file
$ ./my_cat file1.txt
Contents of file1.txt

# Concatenate two files
$ ./my_cat file1.txt file2.txt > combined.txt

# Act as a pass-through for stdin
$ echo "Hello from stdin" | ./my_cat
Hello from stdin

3.5 Real World Outcome

Run ./my_cat file1.txt file2.txt and your terminal prints the contents of both files in order. If you redirect output to another file, the destination file contains an exact byte-for-byte copy. This is the same behavior you get from GNU cat.


4. Solution Architecture

4.1 High-Level Design

input fd -> read() -> buffer -> write() -> stdout fd

4.2 Key Components

Component     | Responsibility            | Key Decisions
Arg parser    | Decide stdin vs file list | No flags initially
Reader        | Read chunks from fd       | read() loop
Writer        | Write chunks to stdout    | Handle partial writes
Error handler | Report and continue       | perror on failures

4.3 Data Structures

#define BUF_SIZE 4096
static char buf[BUF_SIZE];

4.4 Algorithm Overview

Key Algorithm: Copy loop

  1. Read up to BUF_SIZE bytes.
  2. Loop to write all bytes read, handling partial writes.
  3. Repeat until read() returns 0 (EOF). A return value of -1 is an error, not EOF.
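
The three steps above can be sketched as a single function. This is one possible shape, not the only one: the name copy_fd and the 0 / -1 return convention are assumptions for illustration, and the EINTR retries go slightly beyond the steps as listed.

```c
#include <errno.h>
#include <unistd.h>

#define BUF_SIZE 4096

/* Copy everything from in_fd to out_fd.
 * Returns 0 once EOF is reached, -1 on a read or write error (errno is set). */
int copy_fd(int in_fd, int out_fd)
{
    static char buf[BUF_SIZE];

    for (;;) {
        /* Step 1: read up to BUF_SIZE bytes. */
        ssize_t nread = read(in_fd, buf, sizeof buf);
        if (nread == 0)
            return 0;                   /* Step 3: 0 means EOF */
        if (nread < 0) {
            if (errno == EINTR)
                continue;               /* interrupted by a signal: retry */
            return -1;
        }

        /* Step 2: write all nread bytes, resuming after short writes. */
        ssize_t off = 0;
        while (off < nread) {
            ssize_t n = write(out_fd, buf + off, (size_t)(nread - off));
            if (n < 0) {
                if (errno == EINTR)
                    continue;
                return -1;
            }
            off += n;
        }
    }
}
```

Because the function takes two descriptors, the same code serves stdin (STDIN_FILENO) and any file returned by open().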

Complexity Analysis:

  • Time: O(n)
  • Space: O(1)

5. Implementation Guide

5.1 Development Environment Setup

cc -Wall -Wextra -O2 -o my_cat src/my_cat.c

5.2 Project Structure

my_cat/
├── src/
│   └── my_cat.c
├── tests/
│   └── test_cat.sh
├── Makefile
└── README.md

5.3 The Core Question You’re Answering

“How do I move bytes from one file descriptor to another without losing or duplicating data?”

This requires a careful read/write loop that handles partial writes and EOF.

5.4 Concepts You Must Understand First

Stop and research these before coding:

  1. File Descriptors
    • What is a file descriptor number?
    • How do open, read, write, and close work together?
    • Book Reference: “The Linux Programming Interface” (TLPI), Ch. 4
  2. Partial Writes
    • When can write() return fewer bytes?
    • Why do pipes and terminals behave differently?
  3. Error Handling
    • How does errno work?
    • What does perror() print?

5.5 Questions to Guide Your Design

Before implementing, think through these:

  1. How will you reuse the same copy loop for stdin and files?
  2. What buffer size is a good starting point?
  3. Should a missing file stop the program or just report and continue?

5.6 Thinking Exercise

Trace a Partial Write Scenario

Assume read() returns 1000 bytes, but write() only writes 400. What must your loop do with the remaining 600 bytes?
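
To check your answer, the scenario can be simulated without any real I/O. In this sketch, short_write is a hypothetical stand-in for write() that accepts at most 400 bytes per call, and trace_partial_writes is an illustrative name for the draining loop. It prints the offset before each call and the bytes still pending after it.

```c
#include <stddef.h>
#include <stdio.h>
#include <sys/types.h>

/* Hypothetical stand-in for write(): accepts at most 400 bytes per call. */
static ssize_t short_write(const char *buf, size_t len)
{
    (void)buf;                          /* the data itself is not inspected */
    return (ssize_t)(len > 400 ? 400 : len);
}

/* Drain len bytes through short_write, printing each step.
 * Returns the number of calls that were needed. */
int trace_partial_writes(const char *buf, size_t len)
{
    size_t off = 0;
    int calls = 0;

    while (off < len) {
        /* Always resume at buf + off and ask only for what is left. */
        ssize_t n = short_write(buf + off, len - off);
        calls++;
        printf("call %d: offset %zu, wrote %zd, %zu left\n",
               calls, off, n, len - off - (size_t)n);
        off += (size_t)n;
    }
    return calls;
}
```

For a 1000-byte buffer the loop makes three calls, writing 400, 400, and 200 bytes from offsets 0, 400, and 800. The remaining 600 bytes are never re-read from the input; they are written from the same buffer at an advanced offset.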

5.7 The Interview Questions They’ll Ask

Prepare to answer these:

  1. “Why can write() return fewer bytes than requested?”
  2. “What is the difference between buffered stdio calls such as fread() and a raw read()?”
  3. “How do you safely copy a file without reading it all into memory?”

5.8 Hints in Layers

Hint 1: Start with stdin only. Implement a loop that copies stdin to stdout.

Hint 2: Add file handling. Open each file and reuse the same copy function.

Hint 3: Handle partial writes. Use a nested loop to write remaining bytes after each read().

5.9 Books That Will Help

Topic             | Book                                | Chapter
File I/O syscalls | “The Linux Programming Interface”   | Ch. 3-5
stdio buffering   | “The C Programming Language”        | Ch. 7

5.10 Implementation Phases

Phase 1: Foundation (1-2 hours)

Goals:

  • Copy stdin to stdout

Tasks:

  1. Implement read()/write() loop on fd 0 and 1.

Checkpoint: echo hi | ./my_cat prints hi.

Phase 2: Core Functionality (1-2 hours)

Goals:

  • Add file input

Tasks:

  1. Loop over filenames in argv.
  2. Call open(), copy, close().

Checkpoint: Concatenating multiple files works.
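
One way to structure this phase is a driver that walks the file list and keeps going after a failed open(). The sketch below is illustrative only: cat_files, its parameters, and the "my_cat:" message prefix are assumptions, and the compact copy_fd is included just so the example stands alone.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Compact copy loop: returns 0 at EOF, -1 on error. */
static int copy_fd(int in_fd, int out_fd)
{
    char buf[4096];
    ssize_t nread;

    while ((nread = read(in_fd, buf, sizeof buf)) != 0) {
        if (nread < 0)
            return -1;
        for (ssize_t off = 0; off < nread; ) {
            ssize_t n = write(out_fd, buf + off, (size_t)(nread - off));
            if (n < 0)
                return -1;
            off += n;               /* resume after a short write */
        }
    }
    return 0;
}

/* Hypothetical driver: copy each named file to out_fd in order, or copy
 * stdin when no names are given. A file that cannot be opened is reported
 * on stderr and skipped. Returns 0 if everything succeeded, 1 otherwise. */
int cat_files(int nfiles, char *const files[], int out_fd)
{
    int status = 0;

    if (nfiles == 0)
        return copy_fd(STDIN_FILENO, out_fd) == 0 ? 0 : 1;

    for (int i = 0; i < nfiles; i++) {
        int fd = open(files[i], O_RDONLY);
        if (fd < 0) {
            fprintf(stderr, "my_cat: %s: %s\n", files[i], strerror(errno));
            status = 1;
            continue;               /* keep processing the remaining files */
        }
        if (copy_fd(fd, out_fd) != 0) {
            fprintf(stderr, "my_cat: %s: %s\n", files[i], strerror(errno));
            status = 1;
        }
        close(fd);                  /* avoid leaking one descriptor per file */
    }
    return status;
}
```

Under these assumptions, main() would reduce to return cat_files(argc - 1, argv + 1, STDOUT_FILENO); so the exit status is nonzero whenever any file failed.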

Phase 3: Polish & Edge Cases (1 hour)

Goals:

  • Error handling
  • Optional stdio implementation

Tasks:

  1. Print errors to stderr for missing files.
  2. Implement stdio version for comparison.

Checkpoint: Missing files do not crash the tool.
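
For the stdio comparison, the copy loop has the same shape but uses fread() and fwrite(). This is a minimal sketch with an assumed name (copy_stream) and return convention. Note the explicit fflush() at the end, since a buffered write error may only surface there.

```c
#include <stdio.h>

#define BUF_SIZE 4096

/* Copy everything from in to out using stdio.
 * Returns 0 on success, -1 on a read or write error. */
int copy_stream(FILE *in, FILE *out)
{
    char buf[BUF_SIZE];
    size_t n;

    /* fread() returns the number of bytes read; 0 means EOF or error. */
    while ((n = fread(buf, 1, sizeof buf, in)) > 0) {
        if (fwrite(buf, 1, n, out) != n)
            return -1;              /* short count from fwrite() is an error */
    }
    if (ferror(in))
        return -1;                  /* distinguish a read error from EOF */

    /* Push buffered bytes to the kernel so write errors are not lost. */
    return fflush(out) == 0 ? 0 : -1;
}
```

Unlike write(), fwrite() already loops internally over short writes, so the nested loop disappears. Running both versions under strace is one way to compare how many syscalls each issues.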

5.11 Key Implementation Decisions

Decision    | Options           | Recommendation | Rationale
Buffer size | 1 KB, 4 KB, 64 KB | 4 KB           | Good baseline; matches the common 4 KB page size
I/O API     | stdio vs syscalls | syscalls       | Teaches low-level behavior

6. Testing Strategy

6.1 Test Categories

Category          | Purpose                | Examples
Unit Tests        | Validate copy loop     | In-memory pipe tests
Integration Tests | Compare with cat       | diff output files
Edge Case Tests   | Empty file, large file | Zero bytes, 100 MB

6.2 Critical Test Cases

  1. Empty file: Output should be empty.
  2. Binary file: Output should be byte-identical.
  3. Stdin passthrough: Works with piped input.

6.3 Test Data

Empty file
Random binary (dd if=/dev/urandom)
Text file with long lines

7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

Pitfall                    | Symptom               | Solution
Ignoring partial writes    | Truncated output      | Loop until all bytes written
Not checking read() errors | Silent failures       | Check for -1 and errno
Forgetting close()         | File descriptor leaks | Close each file descriptor

7.2 Debugging Strategies

  • Use strace (Linux) or dtruss (macOS) to see the syscalls your program actually makes.
  • Compare output with cmp or diff.

7.3 Performance Traps

A buffer size of 1 byte works but is extremely slow. Use at least 4 KB.


8. Extensions & Challenges

8.1 Beginner Extensions

  • Add line numbering (-n).
  • Add visible tabs (-T).

8.2 Intermediate Extensions

  • Add -A, -b, -E style flags.
  • Implement a --squeeze-blank option.

8.3 Advanced Extensions

  • Add a zero-copy mode using sendfile() (not part of POSIX; its signature differs between operating systems).
  • Implement a multi-threaded copy for very large files.

9. Real-World Connections

9.1 Industry Applications

  • Build systems: Concatenate generated artifacts.
  • DevOps: Stream logs between processes and tools.
  • GNU coreutils: Reference implementation of cat.

9.2 Interview Relevance

The read/write loop is a classic question to test your understanding of syscalls and buffers.


10. Resources

10.1 Essential Reading

  • “The Linux Programming Interface” by Michael Kerrisk - Ch. 3-5
  • “The C Programming Language” by K&R - Ch. 7

10.2 Video Resources

  • POSIX I/O tutorials (any reputable systems programming series)

10.3 Tools & Documentation

  • man 2 read: Behavior and return values
  • man 2 write: Partial write handling
  • my_wc (related project): State machines over streams.
  • my_ls (related project): More complex filesystem interaction.

11. Self-Assessment Checklist

11.1 Understanding

  • I can explain when partial writes occur.
  • I can describe the difference between stdio and syscalls.
  • I understand how EOF is signaled by read().

11.2 Implementation

  • The tool works with stdin and files.
  • Output is byte-identical to input.
  • Errors are reported correctly.

11.3 Growth

  • I can identify one performance improvement.
  • I can explain this project in an interview.

12. Submission / Completion Criteria

Minimum Viable Completion:

  • Reads from stdin and writes to stdout.
  • Works for a single file.

Full Completion:

  • Supports multiple files with error handling.
  • Includes syscall and stdio implementations.

Excellence (Going Above & Beyond):

  • Implements sendfile() or other zero-copy optimization.
  • Includes automated tests with binary file comparisons.

This guide was generated from LEARN_GNU_TOOLS_DEEP_DIVE.md. For the complete learning path, see the parent directory.