Project 2: my_cat - The Concatenator

Build a cat-style utility that copies input to stdout using both stdio and raw syscalls.

Quick Reference

Attribute	Value
Difficulty	Beginner
Time Estimate	A few hours
Language	C (Alternatives: Rust, Go)
Prerequisites	Project 1 or equivalent C I/O experience
Key Topics	Syscalls, buffering, error handling

1. Learning Objectives

By completing this project, you will:

Implement the canonical read/write loop for file copying.
Compare stdio buffering with raw syscalls.
Handle partial writes and system call errors.
Design a CLI that supports stdin and multiple files.

2. Theoretical Foundation

2.1 Core Concepts

File descriptors: open() returns an integer handle. read() and write() operate on these low-level identifiers.
Buffering: The C library (stdio) buffers reads and writes in user space to reduce the number of syscalls. Raw read() and write() calls go straight to the kernel, so you supply and manage the buffer yourself.
Partial writes: write() may write fewer bytes than requested, especially to pipes or sockets.
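
The partial-write behavior described above is usually handled with a small retry helper. The following is a minimal sketch, not the project's required implementation; the name write_all is an assumption chosen for illustration.

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper: write exactly len bytes from buf to fd,
 * retrying after partial writes.
 * Returns 0 on success, -1 on error (errno is left as set by write()). */
int write_all(int fd, const void *buf, size_t len)
{
    const char *p = buf;

    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted before writing anything: retry */
            return -1;          /* real error, report it to the caller */
        }
        p += n;                 /* skip the bytes the kernel accepted */
        len -= (size_t)n;       /* only the remainder is still pending */
    }
    return 0;
}
```

A caller would typically use it as write_all(STDOUT_FILENO, buf, nread) right after a successful read().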

2.2 Why This Matters

Most Unix filters reduce to a read/transform/write loop. If you can implement cat correctly, you have the skeleton for compressors, filters, and other stream transformers.

2.3 Historical Context / Background

cat is short for “concatenate.” It was designed to glue files together and to serve as the simplest possible stream tool in pipelines.

2.4 Common Misconceptions

“read() returns exactly what I asked for”: It can return fewer bytes. You must loop.
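“stdio and syscalls are the same”: stdio adds a user-space buffer on top of the syscalls. That buffer can hide how many syscalls actually happen and can delay a write error until the buffer is flushed.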
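
To make the first misconception concrete, here is a sketch of what "you must loop" looks like when a caller really needs a fixed number of bytes. The helper name read_full is hypothetical. Note that my_cat itself does not need this: it can simply write out whatever each read() returned.

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper: keep calling read() until `want` bytes have arrived
 * or EOF is reached.
 * Returns the number of bytes stored in buf (less than want only at EOF),
 * or -1 on error. */
ssize_t read_full(int fd, void *buf, size_t want)
{
    char *p = buf;
    size_t got = 0;

    while (got < want) {
        ssize_t n = read(fd, p + got, want - got);
        if (n == 0)
            break;              /* EOF: no more data will arrive */
        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted by a signal: retry */
            return -1;
        }
        got += (size_t)n;       /* a short read just advances the offset */
    }
    return (ssize_t)got;
}
```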

3. Project Specification

3.1 What You Will Build

A command-line tool my_cat that prints file contents to stdout. If no files are provided, it reads from stdin. Include two implementations: stdio-based and syscall-based.

3.2 Functional Requirements

Print stdin when no files are given.
Print each file in order when multiple files are given.
Handle errors for missing files while continuing to process others.
Support a syscall mode (flag or separate build).

3.3 Non-Functional Requirements

Performance: Use an efficient buffer size (4 KB or larger).
Reliability: Handle partial writes and EOF correctly.
Correctness: Output should be byte-for-byte identical to input, including binary data.

3.4 Example Usage / Output

# Print a file
$ ./my_cat file1.txt
Contents of file1.txt

# Concatenate two files
$ ./my_cat file1.txt file2.txt > combined.txt

# Act as a pass-through for stdin
$ echo "Hello from stdin" | ./my_cat
Hello from stdin

3.5 Real World Outcome

Run ./my_cat file1.txt file2.txt and your terminal prints the contents of both files in order. If you redirect output to another file, the destination file contains an exact byte-for-byte copy. This is the same behavior you get from GNU cat.

4. Solution Architecture

4.1 High-Level Design

input fd -> read() -> buffer -> write() -> stdout fd

4.2 Key Components

Component	Responsibility	Key Decisions
Arg parser	Decide stdin vs file list	No flags initially
Reader	Read chunks from fd	`read()` loop
Writer	Write chunks to stdout	Handle partial writes
Error handler	Report and continue	`perror` on failures

4.3 Data Structures

#define BUF_SIZE 4096
static char buf[BUF_SIZE];

4.4 Algorithm Overview

Key Algorithm: Copy loop

Read up to BUF_SIZE bytes.
Loop to write all bytes read, handling partial writes.
Repeat until read() returns 0 (EOF) or -1 (error).
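
The three steps above can be sketched as a single function. This is one possible shape under stated assumptions: the name copy_fd and its return convention are not prescribed by the project, and retrying on EINTR is a common choice rather than a requirement.

```c
#include <errno.h>
#include <unistd.h>

#define BUF_SIZE 4096

/* Hypothetical copy loop: move every byte from in_fd to out_fd.
 * Returns 0 once read() reports EOF, -1 on a read or write error. */
int copy_fd(int in_fd, int out_fd)
{
    static char buf[BUF_SIZE];

    for (;;) {
        /* Step 1: read up to BUF_SIZE bytes. */
        ssize_t nread = read(in_fd, buf, sizeof buf);
        if (nread == 0)
            return 0;                   /* Step 3: EOF ends the loop */
        if (nread < 0) {
            if (errno == EINTR)
                continue;               /* interrupted: try the read again */
            return -1;
        }

        /* Step 2: write all nread bytes, resuming after partial writes. */
        ssize_t off = 0;
        while (off < nread) {
            ssize_t nw = write(out_fd, buf + off, (size_t)(nread - off));
            if (nw < 0) {
                if (errno == EINTR)
                    continue;
                return -1;
            }
            off += nw;
        }
    }
}
```

Copying stdin to stdout is then copy_fd(STDIN_FILENO, STDOUT_FILENO).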

Complexity Analysis:

Time: O(n), where n is the total number of bytes copied
Space: O(1), one fixed-size buffer regardless of input size

5. Implementation Guide

5.1 Development Environment Setup

cc -Wall -Wextra -O2 -o my_cat src/my_cat.c

5.2 Project Structure

my_cat/
├── src/
│   └── my_cat.c
├── tests/
│   └── test_cat.sh
├── Makefile
└── README.md

5.3 The Core Question You’re Answering

“How do I move bytes from one file descriptor to another without losing or duplicating data?”

This requires a careful read/write loop that handles partial writes and EOF.

5.4 Concepts You Must Understand First

Stop and research these before coding:

File Descriptors
- What is a file descriptor number?
- How do open, read, write, and close work together?
- Book Reference: TLPI Ch. 4 (File I/O: The Universal I/O Model)
Partial Writes
- When can write() return fewer bytes?
- Why do pipes and terminals behave differently?
Error Handling
- How does errno work?
- What does perror() print?

5.5 Questions to Guide Your Design

Before implementing, think through these:

How will you reuse the same copy loop for stdin and files?
What buffer size is a good starting point?
Should a missing file stop the program or just report and continue?

5.6 Thinking Exercise

Trace a Partial Write Scenario

Assume read() returns 1000 bytes, but write() only writes 400. What must your loop do with the remaining 600 bytes?
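
One way to check your answer is to simulate the scenario in memory. In the sketch below, capped_write is a made-up stand-in for write() that accepts at most 400 bytes per call, and struct sink is a hypothetical in-memory destination. Neither is part of any real API.

```c
#include <stddef.h>
#include <string.h>

#define CAP 400   /* assumed per-call limit, chosen to match the exercise */

/* Hypothetical in-memory destination that also counts write calls. */
struct sink {
    char   data[2048];
    size_t len;
    int    calls;
};

/* Stand-in for write(): accepts at most CAP bytes and returns how many
 * it took. The caller must not send more than the sink can hold. */
size_t capped_write(struct sink *s, const char *p, size_t len)
{
    size_t n = len < CAP ? len : CAP;
    memcpy(s->data + s->len, p, n);
    s->len += n;
    s->calls++;
    return n;
}

/* Drain len bytes by advancing an offset past whatever each call accepted. */
void drain(struct sink *s, const char *buf, size_t len)
{
    size_t off = 0;
    while (off < len)
        off += capped_write(s, buf + off, len - off);
}
```

With 1000 bytes of input, drain makes three calls: 400 bytes from offset 0, 400 bytes from offset 400, and the final 200 bytes from offset 800. The remaining 600 bytes are never re-read; only the offset into the same buffer moves.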

5.7 The Interview Questions They’ll Ask

Prepare to answer these:

“Why can write() return fewer bytes than requested?”
“What is the difference between buffered stdio (fread()/fwrite()) and raw read()/write()?”
“How do you safely copy a file without reading it all into memory?”

5.8 Hints in Layers

Hint 1: Start with stdin only. Implement a loop that copies stdin to stdout.

Hint 2: Add file handling. Open each file and reuse the same copy function.

Hint 3: Handle partial writes. Use a nested loop to write the remaining bytes after each read().

5.9 Books That Will Help

Topic	Book	Chapter
File I/O syscalls	“The Linux Programming Interface”	Ch. 3-5
stdio buffering	“The C Programming Language”	Ch. 7

5.10 Implementation Phases

Phase 1: Foundation (1-2 hours)

Goals:

Copy stdin to stdout

Tasks:

Implement read()/write() loop on fd 0 and 1.

Checkpoint: echo hi | ./my_cat prints hi.

Phase 2: Core Functionality (1-2 hours)

Goals:

Add file input

Tasks:

Loop over filenames in argv.
Call open(), copy, close().

Checkpoint: Concatenating multiple files works.
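
A possible shape for this phase is sketched below. The names cat_files and copy_fd, the out_fd parameter (added so the function can be tested against something other than stdout), and the 0/1 return convention are all assumptions made for illustration. A failed file is reported with perror() and skipped, matching the "report and continue" requirement.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Compact copy loop (hypothetical helper): 0 at EOF, -1 on error. */
static int copy_fd(int in_fd, int out_fd)
{
    char buf[4096];
    ssize_t nread;

    while ((nread = read(in_fd, buf, sizeof buf)) != 0) {
        if (nread < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        for (ssize_t off = 0; off < nread; ) {
            ssize_t nw = write(out_fd, buf + off, (size_t)(nread - off));
            if (nw < 0) {
                if (errno == EINTR)
                    continue;
                return -1;
            }
            off += nw;
        }
    }
    return 0;
}

/* Copy each named file to out_fd in order. A file that cannot be opened
 * or copied is reported on stderr and skipped.
 * Returns 0 if every file succeeded, 1 if any failed. */
int cat_files(int nfiles, char *const names[], int out_fd)
{
    int status = 0;

    for (int i = 0; i < nfiles; i++) {
        int fd = open(names[i], O_RDONLY);
        if (fd < 0) {
            perror(names[i]);   /* e.g. "missing.txt: No such file or directory" */
            status = 1;
            continue;           /* keep going with the next file */
        }
        if (copy_fd(fd, out_fd) < 0) {
            perror(names[i]);
            status = 1;
        }
        close(fd);              /* avoid leaking one descriptor per file */
    }
    return status;
}
```

From main, this could be called as cat_files(argc - 1, argv + 1, STDOUT_FILENO). One simplification to be aware of: a write error on out_fd is reported under the input file's name, where a production tool would likely treat it as fatal.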

Phase 3: Polish & Edge Cases (1 hour)

Goals:

Error handling
Optional stdio implementation

Tasks:

Print errors to stderr for missing files.
Implement stdio version for comparison.

Checkpoint: Missing files do not crash the tool.
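
For the comparison, the stdio version might look like the following sketch. The function name copy_stream is an assumption. fread() and fwrite() take the place of read() and write(), and the library does its own buffering underneath, so there is no explicit partial-write loop.

```c
#include <stdio.h>

#define BUF_SIZE 4096

/* Hypothetical stdio copy: move everything from `in` to `out`.
 * Returns 0 on success, -1 on a read or write error. */
int copy_stream(FILE *in, FILE *out)
{
    char buf[BUF_SIZE];
    size_t n;

    while ((n = fread(buf, 1, sizeof buf, in)) > 0) {
        if (fwrite(buf, 1, n, out) != n)
            return -1;              /* short count from fwrite means an error */
    }
    if (ferror(in))
        return -1;                  /* fread stopped on an error, not on EOF */

    /* Flush so a deferred write error surfaces here instead of at exit. */
    return fflush(out) == 0 ? 0 : -1;
}
```

Running both versions under strace with the same input is one way to see how the number of read and write syscalls differs.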

5.11 Key Implementation Decisions

Decision	Options	Recommendation	Rationale
Buffer size	1 KB, 4 KB, 64 KB	4 KB	Good baseline; matches the 4 KB page size common on x86-64 Linux
I/O API	stdio vs syscalls	syscalls	Teaches low-level behavior

6. Testing Strategy

6.1 Test Categories

Category	Purpose	Examples
Unit Tests	Validate copy loop	In-memory pipe tests
Integration Tests	Compare with `cat`	`diff` output files
Edge Case Tests	Empty file, large file	Zero bytes, 100 MB

6.2 Critical Test Cases

Empty file: Output should be empty.
Binary file: Output should be byte-identical.
Stdin passthrough: Works with piped input.

6.3 Test Data

Empty file
Random binary (dd if=/dev/urandom)
Text file with long lines

7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

Pitfall	Symptom	Solution
Ignoring partial writes	Truncated output	Loop until all bytes written
Not checking `read()` errors	Silent failures	Check for -1 and `errno`
Forgetting `close()`	File descriptor leaks	Close each file descriptor

7.2 Debugging Strategies

Use strace (Linux) or dtruss (macOS) to see which syscalls your program makes.
Compare output with cmp or diff.

7.3 Performance Traps

A buffer size of 1 byte works but is extremely slow. Use at least 4 KB.

8. Extensions & Challenges

8.1 Beginner Extensions

Add line numbering (-n).
Add visible tabs (-T).

8.2 Intermediate Extensions

Add GNU-style flags such as -A, -b, and -E.
Implement a --squeeze-blank option.

8.3 Advanced Extensions

Add a zero-copy mode using sendfile() (not portable: the call's signature differs across Linux, macOS, and the BSDs).
Implement a multi-threaded copy for very large files.

9. Real-World Connections

9.1 Industry Applications

Build systems: Concatenate generated artifacts.
DevOps: Stream logs between processes and tools.

9.2 Related Open Source Projects

GNU coreutils: Reference implementation of cat.

9.3 Interview Relevance

The read/write loop is a classic question to test your understanding of syscalls and buffers.

10. Resources

10.1 Essential Reading

“The Linux Programming Interface” by Michael Kerrisk - Ch. 3-5
“The C Programming Language” by K&R - Ch. 7

10.2 Video Resources

POSIX I/O tutorials (any reputable systems programming series)

10.3 Tools & Documentation

man 2 read: Behavior and return values
man 2 write: Partial write handling

10.4 Related Projects in This Series

my_wc: State machines over streams.
my_ls: More complex filesystem interaction.

11. Self-Assessment Checklist

11.1 Understanding

I can explain when partial writes occur.
I can describe the difference between stdio and syscalls.
I understand how EOF is signaled by read().

11.2 Implementation

The tool works with stdin and files.
Output is byte-identical to input.
Errors are reported correctly.

11.3 Growth

I can identify one performance improvement.
I can explain this project in an interview.

12. Submission / Completion Criteria

Minimum Viable Completion:

Reads from stdin and writes to stdout.
Works for a single file.

Full Completion:

Supports multiple files with error handling.
Includes syscall and stdio implementations.

Excellence (Going Above & Beyond):

Implements sendfile() or other zero-copy optimization.
Includes automated tests with binary file comparisons.

This guide was generated from LEARN_GNU_TOOLS_DEEP_DIVE.md. For the complete learning path, see the parent directory.
