Project 2: my_cat - The Concatenator
Build a cat-style utility that copies input to stdout using both stdio and raw syscalls.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Beginner |
| Time Estimate | A few hours |
| Language | C (Alternatives: Rust, Go) |
| Prerequisites | Project 1 or equivalent C I/O experience |
| Key Topics | Syscalls, buffering, error handling |
1. Learning Objectives
By completing this project, you will:
- Implement the canonical read/write loop for file copying.
- Compare stdio buffering with raw syscalls.
- Handle partial writes and system call errors.
- Design a CLI that supports stdin and multiple files.
2. Theoretical Foundation
2.1 Core Concepts
- File descriptors: open() returns an integer handle. read() and write() operate on these low-level identifiers.
- Buffering: The C library buffers reads and writes to reduce syscall overhead. Syscalls are direct and require you to manage the buffer.
- Partial writes: write() may write fewer bytes than requested, especially to pipes or sockets.
2.2 Why This Matters
Every Unix tool eventually reduces to read/transform/write. If you can implement cat, you can implement compressors, filters, and stream transformers reliably and efficiently.
2.3 Historical Context / Background
cat is short for “concatenate.” It was designed to glue files together and to serve as the simplest possible stream tool in pipelines.
2.4 Common Misconceptions
- “read() returns exactly what I asked for”: It can return fewer bytes. You must loop.
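- “stdio and syscalls are the same”: stdio adds a user-space buffer on top of read() and write(). That buffer can hide how many syscalls are actually made and can delay a write error until the buffer is flushed.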
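The first misconception is easy to observe with a pipe. The sketch below uses a hypothetical helper, short_read_demo(), which is not part of the project: it places 3 bytes in a pipe and then asks read() for up to 100. On a typical POSIX system the call returns only what is available.

```c
#include <unistd.h>

/* Hypothetical helper for illustration: put 3 bytes into a pipe, then ask
 * read() for up to 100. Returns what read() reported, or -1 on failure. */
ssize_t short_read_demo(void)
{
    int fds[2];
    char buf[100];

    if (pipe(fds) == -1)
        return -1;
    if (write(fds[1], "abc", 3) != 3) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    close(fds[1]);                 /* no more data will arrive */

    /* Request 100 bytes; only 3 are waiting in the pipe. */
    ssize_t n = read(fds[0], buf, sizeof buf);
    close(fds[0]);
    return n;
}
```

The return value is the only reliable count of bytes placed in the buffer, which is why the copy loop must use it rather than the requested size.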
3. Project Specification
3.1 What You Will Build
A command-line tool my_cat that prints file contents to stdout. If no files are provided, it reads from stdin. Include two implementations: stdio-based and syscall-based.
3.2 Functional Requirements
- Print stdin when no files are given.
- Print each file in order when multiple files are given.
- Handle errors for missing files while continuing to process others.
- Support a syscall mode (flag or separate build).
3.3 Non-Functional Requirements
- Performance: Use an efficient buffer size (4 KB or larger).
- Reliability: Handle partial writes and EOF correctly.
- Usability: Output should be byte-for-byte identical to input.
3.4 Example Usage / Output
# Print a file
$ ./my_cat file1.txt
Contents of file1.txt
# Concatenate two files
$ ./my_cat file1.txt file2.txt > combined.txt
# Act as a pass-through for stdin
$ echo "Hello from stdin" | ./my_cat
Hello from stdin
3.5 Real World Outcome
Run ./my_cat file1.txt file2.txt and your terminal prints the contents of both files in order. If you redirect output to another file, the destination file contains an exact byte-for-byte copy. This is the same behavior you get from GNU cat.
4. Solution Architecture
4.1 High-Level Design
input fd -> read buffer -> write buffer -> stdout fd
4.2 Key Components
| Component | Responsibility | Key Decisions |
|---|---|---|
| Arg parser | Decide stdin vs file list | No flags initially |
| Reader | Read chunks from fd | read() loop |
| Writer | Write chunks to stdout | Handle partial writes |
| Error handler | Report and continue | perror on failures |
4.3 Data Structures
#define BUF_SIZE 4096
static char buf[BUF_SIZE];
4.4 Algorithm Overview
Key Algorithm: Copy loop
- Read up to BUF_SIZE bytes.
- Loop to write all bytes read, handling partial writes.
- Repeat until read() returns 0.
Complexity Analysis:
- Time: O(n)
- Space: O(1)
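The three steps above could be sketched as follows. This is one possible shape, not the only one: the function name copy_fd and the choice to retry on EINTR are assumptions made for illustration.

```c
#include <errno.h>
#include <unistd.h>

#define BUF_SIZE 4096

/* Copy everything from in_fd to out_fd. Returns 0 at EOF, -1 on error
 * (errno is left as set by the failing call). copy_fd is an assumed name. */
int copy_fd(int in_fd, int out_fd)
{
    static char buf[BUF_SIZE];

    for (;;) {
        ssize_t nread = read(in_fd, buf, sizeof buf);
        if (nread == 0)
            return 0;                      /* EOF */
        if (nread == -1) {
            if (errno == EINTR)
                continue;                  /* interrupted before any data: retry */
            return -1;
        }

        /* write() may accept only part of the chunk, so advance by
         * however much it took and retry with the remainder. */
        ssize_t off = 0;
        while (off < nread) {
            ssize_t nw = write(out_fd, buf + off, (size_t)(nread - off));
            if (nw == -1) {
                if (errno == EINTR)
                    continue;
                return -1;
            }
            off += nw;
        }
    }
}
```

Because the function takes plain descriptors, the same code can serve stdin (fd 0) and any file returned by open().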
5. Implementation Guide
5.1 Development Environment Setup
cc -Wall -Wextra -O2 -o my_cat src/my_cat.c
5.2 Project Structure
my_cat/
├── src/
│ └── my_cat.c
├── tests/
│ └── test_cat.sh
├── Makefile
└── README.md
5.3 The Core Question You’re Answering
“How do I move bytes from one file descriptor to another without losing or duplicating data?”
This requires a careful read/write loop that handles partial writes and EOF.
5.4 Concepts You Must Understand First
Stop and research these before coding:
- File Descriptors
  - What is a file descriptor number?
  - How do open, read, write, and close work together?
  - Book Reference: TLPI Ch. 3
- Partial Writes
  - When can write() return fewer bytes?
  - Why do pipes and terminals behave differently?
- Error Handling
  - How does errno work?
  - What does perror() print?
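As a starting point for the error-handling questions, here is a small sketch. The helper name try_open is hypothetical. It saves errno immediately after the failing call, because later library calls may overwrite it, and then prints a message in the same "name: reason" shape that perror() produces.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: try to open path read-only. On failure, print a
 * perror-style message to stderr and return the saved errno value.
 * On success, close the descriptor again and return 0. */
int try_open(const char *prog, const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd == -1) {
        int saved = errno;         /* copy first: later calls may change errno */
        fprintf(stderr, "%s: %s: %s\n", prog, path, strerror(saved));
        return saved;
    }
    close(fd);
    return 0;
}
```

For a path that does not exist, the saved value is ENOENT, and strerror() turns it into the human-readable text that perror() would also print.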
5.5 Questions to Guide Your Design
Before implementing, think through these:
- How will you reuse the same copy loop for stdin and files?
- What buffer size is a good starting point?
- Should a missing file stop the program or just report and continue?
5.6 Thinking Exercise
Trace a Partial Write Scenario
Assume read() returns 1000 bytes, but write() only writes 400. What must your loop do with the remaining 600 bytes?
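One way to check your trace is to simulate it. The sketch below replaces write() with a hypothetical stub, stub_write(), that accepts at most 400 bytes per call and records what it received. The names drain, captured, and calls are illustrative only, and the 400-byte cap on every call is an assumption that extends the scenario.

```c
#include <stddef.h>
#include <string.h>
#include <sys/types.h>

/* Capture area so the result of the simulated writes can be inspected. */
static char captured[1000];
static size_t captured_len;
static int calls;

/* Hypothetical stand-in for write(): takes at most 400 bytes per call. */
static ssize_t stub_write(const char *p, size_t len)
{
    size_t take = len < 400 ? len : 400;
    memcpy(captured + captured_len, p, take);
    captured_len += take;
    calls++;
    return (ssize_t)take;
}

/* Push all len bytes (at most 1000 here) through stub_write, resuming at
 * the first unwritten byte after each short write. Returns the number of
 * calls it took, or -1 if the stub made no progress. */
int drain(const char *buf, size_t len)
{
    size_t off = 0;

    captured_len = 0;
    calls = 0;
    while (off < len) {
        ssize_t nw = stub_write(buf + off, len - off);
        if (nw <= 0)
            return -1;
        off += (size_t)nw;         /* skip past the bytes already written */
    }
    return calls;
}
```

With 1000 bytes and this stub, the loop writes 400, then 400, then the final 200, each time starting from buf + off rather than from the beginning of the buffer.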
5.7 The Interview Questions They’ll Ask
Prepare to answer these:
- “Why can write() return fewer bytes than requested?”
- “What is the difference between stdio buffering and read()?”
- “How do you safely copy a file without reading it all into memory?”
5.8 Hints in Layers
Hint 1: Start with stdin only
Implement a loop that copies stdin to stdout.
Hint 2: Add file handling
Open each file and reuse the same copy function.
Hint 3: Handle partial writes
Use a nested loop to write remaining bytes after each read().
5.9 Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| File I/O syscalls | “The Linux Programming Interface” | Ch. 3-5 |
| stdio buffering | “The C Programming Language” | Ch. 7 |
5.10 Implementation Phases
Phase 1: Foundation (1-2 hours)
Goals:
- Copy stdin to stdout
Tasks:
- Implement a read()/write() loop on fd 0 and 1.
Checkpoint: echo hi | ./my_cat prints hi.
Phase 2: Core Functionality (1-2 hours)
Goals:
- Add file input
Tasks:
- Loop over filenames in argv.
- Call open(), copy, close().
Checkpoint: Concatenating multiple files works.
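A possible shape for Phase 2 is sketched below. The driver name cat_files and its signature are assumptions, and a compact copy loop is included only so the example compiles on its own. A file that cannot be opened is reported and skipped, so the remaining files are still processed.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Compact copy loop, included so this sketch is self-contained. */
static int copy_fd(int in_fd, int out_fd)
{
    char buf[4096];
    ssize_t n;

    while ((n = read(in_fd, buf, sizeof buf)) != 0) {
        if (n == -1) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        for (ssize_t off = 0; off < n; ) {
            ssize_t nw = write(out_fd, buf + off, (size_t)(n - off));
            if (nw == -1) {
                if (errno == EINTR)
                    continue;
                return -1;
            }
            off += nw;
        }
    }
    return 0;
}

/* Hypothetical driver: copy each named file to out_fd in order, or copy
 * stdin when no names are given. Returns 0 if every file succeeded and
 * 1 if any file failed. */
int cat_files(int nfiles, char *const files[], int out_fd)
{
    int status = 0;

    if (nfiles == 0)
        return copy_fd(STDIN_FILENO, out_fd) == 0 ? 0 : 1;

    for (int i = 0; i < nfiles; i++) {
        int fd = open(files[i], O_RDONLY);
        if (fd == -1) {
            perror(files[i]);      /* report, then continue with the next file */
            status = 1;
            continue;
        }
        if (copy_fd(fd, out_fd) == -1) {
            perror(files[i]);
            status = 1;
        }
        close(fd);                 /* avoid leaking a descriptor per file */
    }
    return status;
}
```

A main() built on this sketch might call cat_files(argc - 1, argv + 1, STDOUT_FILENO) and use the result as the exit status.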
Phase 3: Polish & Edge Cases (1 hour)
Goals:
- Error handling
- Optional stdio implementation
Tasks:
- Print errors to stderr for missing files.
- Implement stdio version for comparison.
Checkpoint: Missing files do not crash the tool.
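For the stdio comparison, a minimal sketch might look like this. The name copy_stream is an assumption. Note that fwrite() handles short writes internally, so a short count from it indicates an error rather than something to retry, and that the final fflush() is where a delayed write error would surface.

```c
#include <stdio.h>

/* Copy in to out through stdio's buffers. Returns 0 on success and -1 if
 * either stream reports an error. copy_stream is an assumed name. */
int copy_stream(FILE *in, FILE *out)
{
    char buf[4096];
    size_t n;

    while ((n = fread(buf, 1, sizeof buf, in)) > 0) {
        /* A short count from fwrite means the stream hit an error. */
        if (fwrite(buf, 1, n, out) != n)
            return -1;
    }
    if (ferror(in))
        return -1;                 /* fread stopped on an error, not EOF */

    /* Flush so buffered bytes reach the descriptor and late errors show up. */
    return fflush(out) == 0 ? 0 : -1;
}
```

Running both versions under a syscall tracer is a simple way to compare how many read() and write() calls each one makes.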
5.11 Key Implementation Decisions
| Decision | Options | Recommendation | Rationale |
|---|---|---|---|
| Buffer size | 1 KB, 4 KB, 64 KB | 4 KB | Good baseline; matches the common 4 KB page size |
| I/O API | stdio vs syscalls | syscalls | Teaches low-level behavior |
6. Testing Strategy
6.1 Test Categories
| Category | Purpose | Examples |
|---|---|---|
| Unit Tests | Validate copy loop | In-memory pipe tests |
| Integration Tests | Compare with cat | diff output files |
| Edge Case Tests | Empty file, large file | Zero bytes, 100 MB |
6.2 Critical Test Cases
- Empty file: Output should be empty.
- Binary file: Output should be byte-identical.
- Stdin passthrough: Works with piped input.
6.3 Test Data
Empty file
Random binary (dd if=/dev/urandom)
Text file with long lines
7. Common Pitfalls & Debugging
7.1 Frequent Mistakes
| Pitfall | Symptom | Solution |
|---|---|---|
| Ignoring partial writes | Truncated output | Loop until all bytes written |
| Not checking read() errors | Silent failures | Check for -1 and errno |
| Forgetting close() | File descriptor leaks | Close each file descriptor |
7.2 Debugging Strategies
- Use strace or dtruss to see syscalls.
- Compare output with cmp or diff.
7.3 Performance Traps
A buffer size of 1 byte works but is extremely slow. Use at least 4 KB.
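To put rough numbers on this, the sketch below estimates how many read() calls a copy needs. The function read_calls is hypothetical, and it assumes every call fills the buffer except possibly the last, plus one final call that returns 0 for EOF. Real counts can be higher when reads come up short.

```c
/* Estimated read() calls to copy file_size bytes with a buf_size buffer:
 * ceil(file_size / buf_size) data reads plus one read that returns 0.
 * Assumes full reads; this is an idealized lower bound. */
unsigned long long read_calls(unsigned long long file_size,
                              unsigned long long buf_size)
{
    return (file_size + buf_size - 1) / buf_size + 1;
}
```

Under these assumptions, a 100 MB file (104,857,600 bytes) needs 104,857,601 reads with a 1-byte buffer, 25,601 with a 4 KB buffer, and 1,601 with a 64 KB buffer, with a similar number of write() calls in each case.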
8. Extensions & Challenges
8.1 Beginner Extensions
- Add line numbering (-n).
- Add visible tabs (-T).
8.2 Intermediate Extensions
- Add -A, -b, -E style flags.
- Implement a --squeeze-blank option.
8.3 Advanced Extensions
- Add a zero-copy mode using sendfile().
- Implement a multi-threaded copy for very large files.
9. Real-World Connections
9.1 Industry Applications
- Build systems: Concatenate generated artifacts.
- DevOps: Stream logs between processes and tools.
9.2 Related Open Source Projects
- GNU coreutils: Reference implementation of cat.
9.3 Interview Relevance
The read/write loop is a classic question to test your understanding of syscalls and buffers.
10. Resources
10.1 Essential Reading
- “The Linux Programming Interface” by Michael Kerrisk - Ch. 3-5
- “The C Programming Language” by K&R - Ch. 7
10.2 Video Resources
- POSIX I/O tutorials (any reputable systems programming series)
10.3 Tools & Documentation
- man 2 read: Behavior and return values
- man 2 write: Partial write handling
10.4 Related Projects in This Series
- my_wc: State machines over streams.
- my_ls: More complex filesystem interaction.
11. Self-Assessment Checklist
11.1 Understanding
- I can explain when partial writes occur.
- I can describe the difference between stdio and syscalls.
- I understand how EOF is signaled by read().
11.2 Implementation
- The tool works with stdin and files.
- Output is byte-identical to input.
- Errors are reported correctly.
11.3 Growth
- I can identify one performance improvement.
- I can explain this project in an interview.
12. Submission / Completion Criteria
Minimum Viable Completion:
- Reads from stdin and writes to stdout.
- Works for a single file.
Full Completion:
- Supports multiple files with error handling.
- Includes syscall and stdio implementations.
Excellence (Going Above & Beyond):
- Implements sendfile() or other zero-copy optimization.
- Includes automated tests with binary file comparisons.
This guide was generated from LEARN_GNU_TOOLS_DEEP_DIVE.md. For the complete learning path, see the parent directory.