Project 1: High-Performance File Copy Utility

Build a file copy utility that uses low-level system calls (open, read, write, close) with configurable buffer sizes and performance measurement.

Quick Reference

Attribute     │ Value
──────────────┼──────────────────────────────────────────────────────────
Difficulty    │ Level 2 - Intermediate
Time Estimate │ 10-20 hours
Language      │ C (primary), Rust/Go (alternatives)
Prerequisites │ C programming, basic UNIX usage, pointers
Key Topics    │ File I/O, System Calls, File Descriptors, Error Handling

1. Learning Objectives

After completing this project, you will:

  • Understand file descriptors as the universal handle for all I/O in UNIX
  • Master the open(), read(), write(), and close() system calls
  • Comprehend the real cost of system calls and user-kernel boundary crossings
  • Know how buffer size affects I/O performance and why
  • Handle partial reads and writes correctly in all situations
  • Implement proper error handling using errno and perror()
  • Use strace to trace and analyze system call behavior
  • Understand the difference between buffered (stdio) and unbuffered (syscall) I/O

2. Theoretical Foundation

2.1 Core Concepts

File Descriptors: The Universal Handle

In UNIX, everything is a file—regular files, directories, devices, network connections, pipes. A file descriptor is a small non-negative integer that serves as a handle to any open “file.”

File Descriptor Table (Per-Process)
┌─────┬────────────────────────────────────────────────────────────┐
│ FD  │  What It Points To                                         │
├─────┼────────────────────────────────────────────────────────────┤
│  0  │  Standard Input  (keyboard, or pipe, or file)              │
│  1  │  Standard Output (terminal, or pipe, or file)              │
│  2  │  Standard Error  (terminal, or file)                       │
│  3  │  /home/user/data.txt (regular file, opened for reading)    │
│  4  │  /dev/null (device file)                                   │
│  5  │  TCP socket to 192.168.1.1:80 (network connection)         │
│  6  │  Pipe read end (IPC with child process)                    │
│  7  │  Unix domain socket (local IPC)                            │
└─────┴────────────────────────────────────────────────────────────┘

       File Descriptor → File Table Entry → Inode/Socket/Pipe
             │                  │                    │
             │                  │                    │
        [Integer]          [Offset,Flags]      [Actual Resource]

When you call read(fd, buf, n), you don’t specify whether fd is a file, socket, or pipe. The kernel dispatches to the correct driver. This abstraction enables composition: cat file.txt | grep pattern works because both sides speak “file descriptor.”
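
To make the "one interface for everything" point concrete, here is a minimal sketch. The helper name count_bytes is hypothetical (it is not part of the mycp specification); it drains any descriptor to EOF and never asks what kind of object the descriptor refers to.

```c
#include <errno.h>
#include <unistd.h>

/* Hypothetical helper: count the bytes readable from any descriptor
 * until EOF. The same code works for a file, a pipe, or a socket. */
long count_bytes(int fd) {
    char buf[4096];
    long total = 0;

    for (;;) {
        ssize_t n = read(fd, buf, sizeof buf);
        if (n < 0) {
            if (errno == EINTR) continue;  /* interrupted: retry */
            return -1;                     /* real error */
        }
        if (n == 0) return total;          /* EOF */
        total += n;
    }
}
```

Passing the read end of a pipe or a descriptor from open() on a regular file goes through exactly the same code path in your program; only the kernel knows the difference.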

The System Call Interface

System calls are the gateway between user space and kernel space. Each call involves:

User Space                              Kernel Space
┌─────────────────────────┐            ┌─────────────────────────┐
│                         │            │                         │
│  Your C Program         │            │   Kernel                │
│  ┌───────────────────┐  │            │  ┌───────────────────┐  │
│  │  read(fd, buf, n) │──┼───────────▶│  │ sys_read()        │  │
│  └───────────────────┘  │  syscall   │  │ - validate fd     │  │
│                         │  trap      │  │ - check perms     │  │
│                         │            │  │ - copy data       │  │
│                         │◀───────────┼──│ - update offset   │  │
│  Process resumes        │  return    │  └───────────────────┘  │
│                         │            │                         │
└─────────────────────────┘            └─────────────────────────┘

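Each system call has a fixed cost, typically on the order of hundreds to a few thousand CPU cycles depending on the CPU, the kernel version, and the security mitigations in effect, due to: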

  • Mode switch (user → kernel → user)
  • TLB and cache effects
  • Argument validation
  • Security checks

Buffer Size and Performance

The relationship between buffer size and I/O performance is crucial:

Buffer Size vs. System Call Overhead

Small buffers (1-64 bytes):
┌────────────────────────────────────────────────────────────────┐
│ read(1 byte) → syscall overhead >>> data transfer time         │
│ 1 million syscalls for 1MB file = VERY SLOW                    │
└────────────────────────────────────────────────────────────────┘

Medium buffers (4KB-64KB):
┌────────────────────────────────────────────────────────────────┐
│ read(4096 bytes) → syscall overhead << data transfer time      │
│ ~250 syscalls for 1MB file = OPTIMAL                           │
└────────────────────────────────────────────────────────────────┘

Large buffers (1MB+):
┌────────────────────────────────────────────────────────────────┐
│ read(1MB) → diminishing returns, memory pressure               │
│ 1 syscall for 1MB file but: cache pollution, malloc overhead   │
└────────────────────────────────────────────────────────────────┘

Performance curve:
                    ▲ Throughput
                    │           ╭────────────────────
                    │          ╱
                    │         ╱
                    │        ╱
                    │       ╱
                    │      ╱
                    │     ╱
                    │____╱
                    └─────────────────────────────────▶ Buffer Size
                        1B    4KB    64KB   1MB
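
The arithmetic behind the diagram can be sketched as a small function. It assumes every read() fills the buffer completely (no partial reads); the function name expected_read_calls is hypothetical.

```c
#include <stdint.h>

/* Number of read() calls needed to consume file_size bytes with a
 * buffer of buf_size bytes, assuming each read fills the buffer.
 * The +1 is the final read() that returns 0 to signal EOF. */
uint64_t expected_read_calls(uint64_t file_size, uint64_t buf_size) {
    if (buf_size == 0) return 0;  /* invalid buffer size */
    return (file_size + buf_size - 1) / buf_size + 1;
}
```

For a 1 MiB file this gives 1,048,577 calls with a 1-byte buffer, 257 with a 4 KiB buffer, and 2 with a 1 MiB buffer. The fixed per-call cost is paid that many times, which is why the left end of the curve is so steep.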

2.2 Why This Matters

This is the “Hello World” of systems programming. Every UNIX application ultimately does I/O through these calls. Understanding file descriptors, system calls, and buffer management is foundational to:

  • Performance optimization: Knowing why buffer size matters helps you tune any I/O-heavy application
  • Debugging: When strace shows 1 million read() calls, you’ll know why the program is slow
  • Library implementation: Understanding how stdio buffers work internally
  • Networking: Sockets use the same read/write interface
  • Tool building: cp, cat, dd, rsync all use these primitives

Industry usage:

  • Database engines optimize I/O with O_DIRECT and aligned buffers
  • Web servers like nginx carefully manage buffer sizes
  • Container runtimes need efficient file copying for layer management

2.3 Historical Context

The file descriptor abstraction dates to the original UNIX (1969) at Bell Labs. Ken Thompson and Dennis Ritchie made a radical decision: unify I/O under a single interface.

Before UNIX, operating systems had different APIs for:

  • Reading files
  • Reading terminals
  • Communicating between programs
  • Accessing devices

UNIX simplified this to: open(), read(), write(), close(). This decision has proven remarkably durable—the same system calls work today, 55+ years later.

The POSIX standard (1988) codified this interface, ensuring portability across UNIX-like systems. Code written for these APIs is remarkably portable across Linux, macOS, FreeBSD, and others.

2.4 Common Misconceptions

Misconception 1: “read() always returns the requested number of bytes”

  • Reality: read() can return fewer bytes than requested (partial read). This is normal and must be handled.

Misconception 2: “write() guarantees all data is written”

  • Reality: write() can also return fewer bytes than requested. You need a loop.

Misconception 3: “Bigger buffers are always better”

  • Reality: Past ~64KB, returns diminish. Very large buffers can hurt cache performance.

Misconception 4: “fread/fwrite are slower than read/write”

  • Reality: stdio buffering often makes fread/fwrite faster for small operations, not slower.

Misconception 5: “close() always succeeds”

  • Reality: close() can fail (especially on NFS). The error should be checked.
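
As an illustration of Misconception 5, a checked close might look like the sketch below. The helper name close_checked and the message format are assumptions, not part of the specification.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: close fd and report a failure.
 * Returns 0 on success, -1 on error with errno preserved. */
int close_checked(int fd, const char *path) {
    if (close(fd) < 0) {
        int saved = errno;  /* fprintf() may overwrite errno */
        fprintf(stderr, "mycp: error closing '%s': %s\n", path, strerror(saved));
        errno = saved;
        return -1;
    }
    return 0;
}
```

Do not retry close() in a loop after a failure: on Linux the descriptor is released even when close() returns an error, so a second attempt could close an unrelated descriptor that was opened in the meantime.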

3. Project Specification

3.1 What You Will Build

A file copy utility called mycp that:

  • Copies any file from source to destination using low-level system calls
  • Supports configurable buffer sizes via command-line argument
  • Measures and reports copy performance (bytes/second)
  • Handles all error conditions properly
  • Produces byte-for-byte identical copies

3.2 Functional Requirements

  1. Basic copying: mycp source destination copies source to destination
  2. Buffer size option: -b SIZE sets buffer size in bytes (default: 4096)
  3. Size suffixes: Support K, M, G suffixes (e.g., -b 64K)
  4. Preserve permissions: Copy file permissions from source to destination
  5. Force overwrite: -f flag to overwrite without prompting
  6. Verbose mode: -v shows progress during copy
  7. Performance report: Always show bytes copied and throughput at end
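
Requirements 2 and 3 can be sketched as a single parsing function. The exact signature is an assumption: here parse_size returns the byte count directly and uses 0 as the error value, which works because a zero-byte buffer is never valid.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Parse "4096", "64K", "1M", or "2G" into a byte count.
 * Returns 0 on malformed input, a zero size, or overflow. */
size_t parse_size(const char *s) {
    char *end;
    unsigned long long n, mult = 1;

    /* Require a leading digit: rejects "", "-5", and leading spaces. */
    if (s == NULL || *s < '0' || *s > '9') return 0;

    errno = 0;
    n = strtoull(s, &end, 10);
    if (errno != 0 || n == 0) return 0;

    switch (*end) {
    case 'K': case 'k': mult = 1ULL << 10; end++; break;
    case 'M': case 'm': mult = 1ULL << 20; end++; break;
    case 'G': case 'g': mult = 1ULL << 30; end++; break;
    default: break;
    }
    if (*end != '\0') return 0;          /* trailing junk such as "64KB" */
    if (n > SIZE_MAX / mult) return 0;   /* result would overflow size_t */
    return (size_t)(n * mult);
}
```

Treating suffixes as powers of 1024 (K = 1024, not 1000) matches the -b 64K = 65536 convention used in this project.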

3.3 Non-Functional Requirements

  1. Correctness: md5sum/sha256sum of source and destination must match
  2. Error handling: Every system call return must be checked
  3. Memory safety: No buffer overflows, no memory leaks
  4. Resource cleanup: Files always closed, even on error
  5. Signal handling: Ctrl-C should clean up partial destination file
  6. Performance: With optimal buffer size, match or approach system cp speed

3.4 Example Usage / Output

# 1. Create a test file
$ dd if=/dev/urandom of=testfile bs=1M count=100
100+0 records in
100+0 records out
104857600 bytes (105 MB) copied, 0.542 s, 193 MB/s

# 2. Run your copy utility
$ ./mycp -b 64K testfile copyfile
mycp: copied 104857600 bytes in 0.089 seconds (1.1 GB/s)

# 3. Verify the copy is identical
$ md5sum testfile copyfile
7f9e5d1a2b3c4d5e6f7a8b9c0d1e2f3a  testfile
7f9e5d1a2b3c4d5e6f7a8b9c0d1e2f3a  copyfile

# 4. TEST: Different buffer sizes
$ ./mycp -b 1 testfile copy1       # 1 byte buffer
mycp: copied 104857600 bytes in 47.3 seconds (2.1 MB/s)  # SLOW!

$ ./mycp -b 4096 testfile copy2    # 4KB buffer
mycp: copied 104857600 bytes in 0.21 seconds (476 MB/s)

$ ./mycp -b 65536 testfile copy3   # 64KB buffer
mycp: copied 104857600 bytes in 0.089 seconds (1.1 GB/s)

$ ./mycp -b 1048576 testfile copy4 # 1MB buffer
mycp: copied 104857600 bytes in 0.087 seconds (1.1 GB/s)  # Diminishing returns

# 5. Trace system calls
$ strace -c ./mycp -b 65536 testfile copyfile
% time     calls  syscall
---------- ------ --------
 50.21      1601  read
 49.12      1600  write
  0.34         3  openat
  0.18         2  close
  0.15         1  fstat

3.5 Real World Outcome

What success looks like:

  1. A working copy utility: Copies any file correctly, byte-for-byte identical
  2. Performance comparison: You can measure time with different buffer sizes and see the dramatic difference
  3. strace output: You can trace exactly which system calls your program makes and understand why
  4. Deep understanding: You can explain to someone why a 1-byte buffer is hundreds of times slower than a 64KB buffer (about 500x in the sample run above)

4. Solution Architecture

4.1 High-Level Design

┌─────────────────────────────────────────────────────────────────┐
│                         mycp                                     │
│                                                                  │
│  ┌─────────────────┐                                            │
│  │  Parse Args     │ ← -b SIZE, -f, -v, source, dest            │
│  └────────┬────────┘                                            │
│           │                                                      │
│           v                                                      │
│  ┌─────────────────┐                                            │
│  │  Validate       │ ← Check source exists, dest writable       │
│  └────────┬────────┘                                            │
│           │                                                      │
│           v                                                      │
│  ┌─────────────────┐     ┌─────────────────┐                    │
│  │  open(source)   │────▶│   FD: 3         │                    │
│  └────────┬────────┘     └─────────────────┘                    │
│           │                                                      │
│           v                                                      │
│  ┌─────────────────┐     ┌─────────────────┐                    │
│  │  open(dest)     │────▶│   FD: 4         │                    │
│  └────────┬────────┘     └─────────────────┘                    │
│           │                                                      │
│           v                                                      │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │                     Copy Loop                            │    │
│  │                                                          │    │
│  │    ┌────────────────────┐                               │    │
│  │    │ read(src_fd, buf)  │◀────────────────────┐         │    │
│  │    └─────────┬──────────┘                     │         │    │
│  │              │                                │         │    │
│  │              v                                │         │    │
│  │    ┌────────────────────┐                     │         │    │
│  │    │ bytes_read > 0?    │──No──▶ EOF, done    │         │    │
│  │    └─────────┬──────────┘                     │         │    │
│  │              │ Yes                            │         │    │
│  │              v                                │         │    │
│  │    ┌────────────────────┐                     │         │    │
│  │    │ write(dst_fd, buf) │                     │         │    │
│  │    └─────────┬──────────┘                     │         │    │
│  │              │                                │         │    │
│  │              v                                │         │    │
│  │    ┌────────────────────┐                     │         │    │
│  │    │ All bytes written? │──No──▶ Keep writing │         │    │
│  │    └─────────┬──────────┘                     │         │    │
│  │              │ Yes                            │         │    │
│  │              └────────────────────────────────┘         │    │
│  └─────────────────────────────────────────────────────────┘    │
│           │                                                      │
│           v                                                      │
│  ┌─────────────────┐                                            │
│  │ close(src_fd)   │                                            │
│  │ close(dst_fd)   │                                            │
│  └────────┬────────┘                                            │
│           │                                                      │
│           v                                                      │
│  ┌─────────────────┐                                            │
│  │ Print stats     │ ← Bytes copied, elapsed time, throughput   │
│  └─────────────────┘                                            │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘

4.2 Key Components

Component       │ Purpose                                           │ Key Functions
────────────────┼───────────────────────────────────────────────────┼────────────────────────────
Argument Parser │ Handle -b, -f, -v flags and parse size suffixes   │ getopt() or manual parsing
File Handler    │ Open, validate, and manage file descriptors       │ open(), close(), stat()
Copy Engine     │ Main read/write loop with proper partial handling │ read(), write() in loops
Error Handler   │ Consistent error reporting, cleanup on failure    │ errno, perror(), strerror()
Stats Tracker   │ Time measurement and throughput calculation       │ clock_gettime()

4.3 Data Structures

#include <stddef.h>      // size_t
#include <sys/types.h>   // mode_t, off_t

// Configuration from command line
typedef struct {
    const char *src_path;
    const char *dst_path;
    size_t buffer_size;      // Default: 4096
    int force_overwrite;     // -f flag
    int verbose;             // -v flag
} config_t;

// Copy statistics
typedef struct {
    size_t bytes_copied;
    double elapsed_seconds;
    size_t read_calls;
    size_t write_calls;
} copy_stats_t;

// File context
typedef struct {
    int fd;
    const char *path;
    mode_t mode;             // File permissions
    off_t size;              // File size (for progress)
} file_ctx_t;

4.4 Algorithm Overview

FUNCTION copy_file(source, dest, buffer_size):
    // Phase 1: Setup
    buffer = allocate(buffer_size)
    src_fd = open(source, O_RDONLY)
    dst_fd = open(dest, O_WRONLY | O_CREAT | O_TRUNC, source_permissions)

    start_time = get_monotonic_time()
    total_copied = 0

    // Phase 2: Copy loop
    LOOP:
        bytes_read = read(src_fd, buffer, buffer_size)

        IF bytes_read < 0:
            IF errno == EINTR: CONTINUE  // Interrupted, retry
            ELSE: HANDLE_ERROR

        IF bytes_read == 0:
            BREAK  // EOF reached

        // Write all read bytes (handle partial writes)
        bytes_to_write = bytes_read
        write_offset = 0

        WHILE bytes_to_write > 0:
            bytes_written = write(dst_fd, buffer + write_offset, bytes_to_write)

            IF bytes_written < 0:
                IF errno == EINTR: CONTINUE
                ELSE: HANDLE_ERROR

            bytes_to_write -= bytes_written
            write_offset += bytes_written

        total_copied += bytes_read

    // Phase 3: Cleanup
    close(src_fd)
    close(dst_fd)
    free(buffer)

    elapsed = get_monotonic_time() - start_time
    PRINT statistics(total_copied, elapsed)
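
The copy loop in the pseudocode above could be written in C roughly as follows. This is a sketch under assumptions: the function name copy_fd and the choice to take already-open descriptors are mine, and opening, timing, and statistics are left to the caller.

```c
#include <errno.h>
#include <stdlib.h>
#include <unistd.h>

/* Copy everything from src_fd to dst_fd through a heap buffer of
 * buf_size bytes. Returns the number of bytes copied, or -1 on error
 * with errno set. */
long long copy_fd(int src_fd, int dst_fd, size_t buf_size) {
    int saved_errno;
    char *buf;
    long long total = 0;

    if (buf_size == 0) {
        errno = EINVAL;
        return -1;
    }
    buf = malloc(buf_size);
    if (buf == NULL) return -1;

    for (;;) {
        ssize_t nread = read(src_fd, buf, buf_size);
        if (nread < 0) {
            if (errno == EINTR) continue;   /* interrupted: retry the read */
            goto fail;
        }
        if (nread == 0) break;              /* EOF */

        size_t off = 0;
        while (off < (size_t)nread) {       /* handle partial writes */
            ssize_t nw = write(dst_fd, buf + off, (size_t)nread - off);
            if (nw < 0) {
                if (errno == EINTR) continue;
                goto fail;
            }
            off += (size_t)nw;
        }
        total += nread;
    }
    free(buf);
    return total;

fail:
    saved_errno = errno;   /* free() is allowed to change errno */
    free(buf);
    errno = saved_errno;
    return -1;
}
```

Note that the write call is given nread bytes, never buf_size: the last block of a file is almost always shorter than the buffer.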

5. Implementation Guide

5.1 Development Environment Setup

# Verify C compiler
$ gcc --version
gcc (Ubuntu 13.2.0-23ubuntu4) 13.2.0

# Verify make
$ make --version
GNU Make 4.3

# Verify strace (critical for this project)
$ strace --version
strace -- version 6.5

# Create project directory
$ mkdir -p ~/projects/mycp
$ cd ~/projects/mycp

# Create initial file structure
$ touch mycp.c Makefile

# Create test files
$ dd if=/dev/zero of=test_small bs=1K count=10   # 10KB
$ dd if=/dev/urandom of=test_medium bs=1M count=10  # 10MB
$ dd if=/dev/urandom of=test_large bs=1M count=100  # 100MB

5.2 Project Structure

mycp/
├── mycp.c              # Main source file
├── Makefile            # Build configuration
├── test_small          # 10KB test file
├── test_medium         # 10MB test file
├── test_large          # 100MB test file
└── README.md           # Documentation

Recommended Makefile:

CC = gcc
CFLAGS = -Wall -Wextra -Werror -O2 -g
TARGET = mycp

$(TARGET): mycp.c
	$(CC) $(CFLAGS) -o $@ $<

clean:
	rm -f $(TARGET) *.o

test: $(TARGET)
	./$(TARGET) test_small test_small.copy && cmp test_small test_small.copy
	./$(TARGET) test_medium test_medium.copy && cmp test_medium test_medium.copy
	@echo "All tests passed!"

.PHONY: clean test

5.3 The Core Question You’re Answering

“What is the actual cost of a system call, and how does buffer size affect I/O performance?”

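Before you write any code, sit with this question. Each read() and write() crosses the user-kernel boundary: there is a mode switch (a privilege-level change, not a full process context switch), argument validation, and some cache and TLB disturbance. Larger buffers amortize that fixed cost over more bytes, but they use more memory and stop helping past a certain size. The sweet spot depends on the storage device, OS, and workload.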

5.4 Concepts You Must Understand First

Stop and research these before coding:

  1. File Descriptors
    • What number does open() return? Why that number?
    • What happens to file descriptors across fork()?
    • Book Reference: “APUE” Ch. 3.2 - Stevens
  2. The open() System Call
    • What are O_RDONLY, O_WRONLY, O_CREAT, O_TRUNC?
    • What is the mode_t argument and when is it used?
    • Book Reference: “APUE” Ch. 3.3
  3. Partial Reads and Writes
    • Why might read(fd, buf, 4096) return only 1000?
    • What must you do when write() returns less than requested?
    • Book Reference: “APUE” Ch. 3.7-3.8
  4. Error Handling with errno
    • What is errno and when is it valid?
    • What does perror() do vs strerror()?

5.5 Questions to Guide Your Design

Before implementing, think through these:

  1. Buffer Management
    • Where should the buffer be allocated—stack or heap?
    • What if the file is smaller than the buffer?
    • What’s the maximum reasonable buffer size?
  2. Error Paths
    • What if the source file doesn’t exist?
    • What if you don’t have permission to write the destination?
    • What if you run out of disk space mid-copy?
    • What if the disk is removed during copy?
  3. Edge Cases
    • Should you handle copying a file onto itself?
    • What about symbolic links—follow them or copy the link?
    • What about empty files (0 bytes)?
    • What about /dev/null or other special files?
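
For the "file onto itself" edge case, one common approach is to compare device and inode numbers before opening the destination. The sketch below assumes a hypothetical helper named same_file; it uses stat(), so symbolic links are followed.

```c
#include <sys/stat.h>

/* Returns 1 if both paths refer to the same file (same device and
 * inode), 0 if they differ or dst does not exist yet, and -1 if src
 * cannot be examined. */
int same_file(const char *src, const char *dst) {
    struct stat a, b;

    if (stat(src, &a) < 0) return -1;
    if (stat(dst, &b) < 0) return 0;   /* dst missing: cannot be the same */
    return a.st_dev == b.st_dev && a.st_ino == b.st_ino;
}
```

Comparing path strings is not enough: ./file, a hard link, and a symlink can all name the same inode. The check matters because opening the destination with O_TRUNC would empty the source before a single byte is read.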

5.6 Thinking Exercise

Trace Through a System Call

Before coding, trace what happens when you call read(fd, buf, 4096):

User Space                    Kernel Space
    │
    │ read(3, buf, 4096)
    │
    └──────────────────────────────────────────────┐
                                                   │
    ┌──────────────────────────────────────────────┘
    │
    │  1. Trap into kernel (syscall instruction)
    │  2. Save user registers
    │  3. Look up fd 3 in process's fd table
    │  4. Find the file object (inode, position)
    │  5. Check if data in page cache
    │  6.   If not: schedule disk I/O, sleep
    │  7.   If yes: copy from page cache to buf
    │  8. Update file position
    │  9. Restore user registers
    │  10. Return to user space
    │
    └──────────────────────────────────────────────┐
                                                   │
    ┌──────────────────────────────────────────────┘
    │
    │ Returns: number of bytes read (or -1 on error)

Questions while tracing:

  • Why is step 6 expensive?
  • What makes step 7 fast?
  • Why does buffer size matter given step 7?

5.7 Hints in Layers

Hint 1: Basic Structure

Your main loop reads from source, writes to destination, until read returns 0 (EOF). Start with the simplest version that works.

Hint 2: The Read Loop Pattern

You need to handle partial reads. The pattern is: read into buffer, then write everything that was read. Check return values for errors.

Hint 3: Handling Partial Writes

// Robust write: loops until every byte is written or a real error occurs
#include <errno.h>
#include <unistd.h>

ssize_t write_all(int fd, const void *buf, size_t count) {
    const char *p = buf;               // byte pointer that keeps buf const
    size_t bytes_to_write = count;
    size_t offset = 0;

    while (bytes_to_write > 0) {
        ssize_t written = write(fd, p + offset, bytes_to_write);
        if (written < 0) {
            if (errno == EINTR) continue;  // Interrupted, retry
            return -1;  // Real error
        }
        bytes_to_write -= (size_t)written;
        offset += (size_t)written;
    }
    return (ssize_t)count;
}

Hint 4: Measuring Performance

Use clock_gettime(CLOCK_MONOTONIC, &ts) for accurate timing. Print bytes/second at the end.

#include <time.h>

struct timespec start, end;
clock_gettime(CLOCK_MONOTONIC, &start);
// ... do copy ...
clock_gettime(CLOCK_MONOTONIC, &end);

double elapsed = (end.tv_sec - start.tv_sec) +
                 (end.tv_nsec - start.tv_nsec) / 1e9;
// Guard against a zero interval when the file is tiny
double throughput = elapsed > 0 ? bytes_copied / elapsed : 0.0;

5.8 The Interview Questions They’ll Ask

Prepare to answer these:

  1. “What’s the difference between read() and fread()?”
    • read() is a system call, unbuffered, returns bytes read
    • fread() is a library function, uses stdio buffering, returns items read
    • read() is lower-level, fread() can be more efficient for small reads
  2. “Why might you prefer open() over fopen()?”
    • More control over flags (O_DIRECT, O_SYNC, O_NOFOLLOW)
    • File descriptor for use with select/poll/epoll
    • Direct access to descriptor-based operations (flock, ftruncate) without going through fileno()
  3. “How would you efficiently copy a 10GB file?”
    • Large buffer (32KB-1MB), mmap, or sendfile()
    • O_DIRECT for bypassing page cache if memory-limited
    • Consider using copy_file_range() on modern Linux
  4. “What happens if read() is interrupted by a signal?”
    • Returns -1 with errno == EINTR
    • Must retry the read (or use SA_RESTART)
    • This is why read loops check for EINTR
  5. “Explain the O_DIRECT flag and when you’d use it.”
    • Bypasses the page cache
    • Requires aligned buffers and sizes
    • Used by databases for their own caching

5.9 Books That Will Help

Topic                 │ Book                                           │ Chapter
──────────────────────┼────────────────────────────────────────────────┼───────────
File I/O fundamentals │ “APUE” by Stevens                              │ Ch. 3
System call mechanics │ “Computer Systems: A Programmer’s Perspective” │ Ch. 8, 10
I/O performance       │ “The Linux Programming Interface” by Kerrisk   │ Ch. 13
Advanced file I/O     │ “The Linux Programming Interface” by Kerrisk   │ Ch. 5

5.10 Implementation Phases

Phase 1: Minimal Working Copy (2-3 hours)

  • Parse source and destination arguments
  • Open both files with basic flags
  • Simple read/write loop (ignoring partial writes)
  • Verify it works with cmp source dest

Phase 2: Robust Error Handling (2-3 hours)

  • Check all system call returns
  • Handle partial reads and writes properly
  • Add errno-based error messages
  • Clean up on failure (close files, delete partial dest)

Phase 3: Command Line Options (2-3 hours)

  • Add -b flag for buffer size
  • Parse size suffixes (K, M, G)
  • Add -f and -v flags
  • Use getopt() for proper parsing

Phase 4: Performance and Polish (2-3 hours)

  • Add timing with clock_gettime()
  • Print throughput statistics
  • Preserve file permissions with fchmod() or open mode
  • Test with various file sizes

Phase 5: Edge Cases and Testing (2-3 hours)

  • Test empty files
  • Test copying file onto itself
  • Test with insufficient permissions
  • Test interrupted by Ctrl-C
  • Run with valgrind for memory leaks

5.11 Key Implementation Decisions

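Decision                     │ Trade-offs
─────────────────────────────┼──────────────────────────────────────────────────────────────────────
Buffer on stack vs heap      │ Stack: faster allocation, size limited. Heap: flexible size, must free
Fixed vs dynamic buffer size │ Fixed: simpler. Dynamic: allows -b option
O_TRUNC vs unlink first      │ O_TRUNC: simpler, reuses the existing inode (hard links and ownership survive). Unlink: destination gets a new inode
Preserve permissions         │ Use fstat() on source, pass mode to open() of dest; the umask still applies, so use fchmod() for an exact match
Handle EINTR                 │ Must retry read/write on EINTR, or copy will fail on signals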
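
The "preserve permissions" decision can be sketched as follows. The helper name open_dest_like is hypothetical, and copying only the rwx bits (dropping setuid, setgid, and sticky) is a design assumption, not a requirement stated above.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Create (or truncate) dst_path and give it the same permission bits
 * as the already-open source. Returns the new descriptor, or -1 on
 * error with errno set. */
int open_dest_like(int src_fd, const char *dst_path) {
    struct stat st;
    mode_t perms;
    int dst_fd;

    if (fstat(src_fd, &st) < 0) return -1;
    perms = st.st_mode & 0777;          /* rwx bits only */

    dst_fd = open(dst_path, O_WRONLY | O_CREAT | O_TRUNC, perms);
    if (dst_fd < 0) return -1;

    /* open() filters the mode through the umask and ignores it when
     * the file already exists, so set the bits explicitly. */
    if (fchmod(dst_fd, perms) < 0) {
        int saved = errno;
        close(dst_fd);
        errno = saved;
        return -1;
    }
    return dst_fd;
}
```

Using fstat() and fchmod() on open descriptors, rather than stat() and chmod() on paths, avoids a race in which the path is swapped for a different file between the two calls.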

6. Testing Strategy

6.1 Unit Tests

Test individual functions:

Test                │ Description               │ Expected Result
────────────────────┼───────────────────────────┼────────────────────────
parse_size("4096")  │ Parse plain number        │ 4096
parse_size("64K")   │ Parse with K suffix       │ 65536
parse_size("1M")    │ Parse with M suffix       │ 1048576
write_all() partial │ Mock write returning less │ Still writes all bytes

6.2 Integration Tests

# Create test files of various sizes
$ dd if=/dev/zero of=tiny bs=1 count=1
$ dd if=/dev/zero of=small bs=1K count=10
$ dd if=/dev/zero of=medium bs=1M count=10
$ dd if=/dev/urandom of=large bs=1M count=100

# Copy each and verify
$ for f in tiny small medium large; do
    ./mycp $f ${f}_copy && \
    cmp $f ${f}_copy && echo "$f: OK" || echo "$f: FAILED"
done

# Test different buffer sizes
$ for bs in 1 512 4096 65536 1048576; do
    echo "Testing buffer size $bs"
    time ./mycp -b $bs large large_copy
    cmp large large_copy
done

6.3 Edge Cases to Test

Case                │ Command                                    │ Expected Behavior
────────────────────┼────────────────────────────────────────────┼──────────────────────────────
Empty file          │ touch empty && ./mycp empty empty.copy     │ Creates empty destination
Non-existent source │ ./mycp nosuchfile dest                     │ Error: No such file
No write permission │ ./mycp source /root/dest                   │ Error: Permission denied
Source = dest       │ ./mycp file file                           │ Error or warning (O_TRUNC on dest would empty the source)
Symbolic link       │ ln -s target link && ./mycp link link.copy │ Copies target content
Binary file         │ ./mycp /bin/ls ls.copy                     │ Identical binary copy
File with holes     │ Sparse file                                │ Preserves or expands holes

6.4 Verification Commands

# Trace system calls
$ strace -c ./mycp source dest
# Should show reasonable number of read/write calls

# Check for memory leaks
$ valgrind --leak-check=full ./mycp source dest
# Should show "no leaks are possible"

# Verify identical content
$ md5sum source dest
# Both should have same hash

$ cmp source dest
# Should produce no output (files identical)

# Check file permissions preserved
$ ls -l source dest
# Permissions should match (the umask can clear bits unless you call fchmod())

# Performance comparison
$ time ./mycp large large.copy
$ time cp large large.copy2
# Should be comparable

7. Common Pitfalls & Debugging

Problem 1: “Copy works but destination has wrong size”

  • Why: Not handling partial writes, or not writing all bytes read
  • Fix: Use a write loop that handles partial writes
  • Quick test: ls -l source dest and compare sizes

Problem 2: “Program hangs on large files”

  • Why: Possibly using 1-byte buffer
  • Debug: Run with strace -c to count syscalls
  • Fix: Use at least 4KB buffer

Problem 3: “Permission denied on destination”

  • Why: Missing mode argument to open() with O_CREAT. Without it the new file gets arbitrary permission bits, so later attempts to open it can fail
  • Fix: open(dest, O_WRONLY|O_CREAT|O_TRUNC, 0644)

Problem 4: “Segmentation fault”

  • Why: Buffer overflow or NULL pointer dereference
  • Debug: Run with gdb ./mycp and get backtrace
  • Fix: Check buffer allocation and bounds

Problem 5: “Resource temporarily unavailable (EAGAIN)”

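  • Why: The descriptor is in non-blocking mode (O_NONBLOCK). Regular files do not normally produce EAGAIN, so this usually means the source or destination is a pipe, FIFO, terminal, or socket
  • Fix: Don’t open files with O_NONBLOCK; if the descriptor was inherited in non-blocking mode, clear the flag with fcntl() or wait with poll() before retrying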

Problem 6: “File exists after Ctrl-C”

  • Why: Partial file left behind
  • Fix: Install signal handler for SIGINT that cleans up

8. Extensions & Challenges

8.1 Easy Extensions

Extension           │ Description                          │ Learning
────────────────────┼──────────────────────────────────────┼────────────────────────
Progress bar        │ Show percentage complete             │ Working with file sizes
Preserve timestamps │ Copy mtime/atime                     │ utime() or utimensat()
Recursive copy      │ Copy directories                     │ Directory traversal
Dry run mode        │ -n flag to show what would be copied │ Useful for testing

8.2 Advanced Challenges

Challenge            │ Description                    │ Learning
─────────────────────┼────────────────────────────────┼─────────────────────────────
sendfile()           │ Use sendfile() for zero-copy   │ Kernel-space data movement
copy_file_range()    │ Use Linux 4.5+ efficient copy  │ Modern syscalls
O_DIRECT             │ Bypass page cache              │ DMA, alignment requirements
mmap-based copy      │ Use memory mapping             │ Virtual memory, page faults
Parallel copy        │ Multiple threads/processes     │ Concurrency
Sparse file handling │ Preserve holes in sparse files │ lseek() SEEK_HOLE/SEEK_DATA
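
For the sparse-file challenge, the table names lseek() with SEEK_HOLE/SEEK_DATA. A simpler and more portable starting point, which is a different technique and is shown here only as a sketch, is to detect blocks that contain nothing but zero bytes. The function name is_all_zero is hypothetical.

```c
#include <stddef.h>

/* Returns 1 if the first n bytes of buf are all zero, otherwise 0.
 * An empty range (n == 0) counts as all zero. */
int is_all_zero(const unsigned char *buf, size_t n) {
    for (size_t i = 0; i < n; i++) {
        if (buf[i] != 0) return 0;
    }
    return 1;
}
```

A copy loop could call this on each block it reads and, for an all-zero block, advance the destination offset with lseek(dst_fd, n, SEEK_CUR) instead of writing. If the file ends in a hole, a final ftruncate() to the full length is needed, because seeking alone does not extend a file.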

8.3 Research Topics

  • How does cp --reflink work on filesystems like btrfs?
  • What is copy-on-write at the filesystem level?
  • How does rsync achieve incremental efficient copies?
  • What is io_uring and how does it improve I/O?

9. Real-World Connections

9.1 Production Systems Using This

System           │ How It Uses File I/O            │ Notable Feature
─────────────────┼─────────────────────────────────┼─────────────────────────
GNU coreutils cp │ read/write with smart buffering │ Sparse file detection
rsync            │ Block-level delta copying       │ Checksum-based transfer
Docker           │ Layer copying and extraction    │ Union filesystem overlay
dd               │ Direct disk copying             │ Block size optimization
tar              │ Archive creation/extraction     │ Streaming I/O

9.2 How the Pros Do It

GNU cp (coreutils):

  • Tries a copy-on-write clone (reflink) first where the filesystem supports it
  • Uses copy_file_range() when available
  • Falls back to read/write with a large buffer (128KB or more, depending on the version)
  • Handles sparse files specially

rsync:

  • Checksums file blocks
  • Only transfers changed blocks
  • Uses zero-copy when possible

Database engines:

  • Use O_DIRECT to bypass page cache
  • Manage their own buffer pools
  • Use aligned memory allocation

9.3 Reading the Source

# View GNU coreutils cp source
$ git clone https://github.com/coreutils/coreutils
$ less coreutils/src/copy.c

# Key functions to study:
# - copy_reg() for regular file copy
# - sparse_copy() for sparse file handling
# - copy_internal() for the main copy logic

10. Resources

10.1 Man Pages

$ man 2 open        # open() system call
$ man 2 read        # read() system call
$ man 2 write       # write() system call
$ man 2 close       # close() system call
$ man 2 stat        # stat() for file info
$ man 2 fstat       # fstat() on file descriptor
$ man 3 errno       # Error number reference
$ man 3 perror      # Print error message
$ man 1 strace      # System call tracer

10.2 Online Resources

10.3 Book Chapters

Book                               │ Chapters     │ Topics Covered
───────────────────────────────────┼──────────────┼───────────────────────────────────────────────
“APUE” by Stevens                  │ Ch. 3, 4, 5  │ File I/O, File Metadata, Buffered I/O
“TLPI” by Kerrisk                  │ Ch. 4, 5, 13 │ File I/O, Further Details, File I/O Buffering
“CS:APP” by Bryant                 │ Ch. 10       │ System-Level I/O
“Linux System Programming” by Love │ Ch. 2, 3     │ File I/O, Buffered I/O

11. Self-Assessment Checklist

Before considering this project complete, verify:

  • I can explain what a file descriptor is and how the kernel manages them
  • I understand why system calls are expensive and how buffering helps
  • My implementation handles partial reads correctly
  • My implementation handles partial writes correctly
  • Every system call return value is checked
  • Files are always closed, even when errors occur
  • I can use strace to trace my program’s system calls
  • I can use valgrind and my program has zero memory errors
  • I understand the trade-offs of different buffer sizes
  • I can explain the difference between read()/write() and fread()/fwrite()
  • My copy produces byte-identical output (verified with md5sum/cmp)
  • I can answer all the interview questions listed above

12. Submission / Completion Criteria

Your project is complete when:

  1. Functionality
    • ./mycp source dest produces byte-identical copy
    • -b SIZE option works with K/M/G suffixes
    • Performance statistics are printed
  2. Quality
    • Compiles with gcc -Wall -Wextra -Werror with no warnings
    • Zero valgrind errors on all test cases
    • All error paths have proper messages
  3. Testing
    • Works with empty files
    • Works with large files (100MB+)
    • Works with binary files
    • Handles permission errors gracefully
  4. Understanding
    • Can explain each system call used
    • Can demonstrate performance difference with buffer sizes
    • Can read and interpret strace output
  5. Documentation
    • README explains how to build and use
    • Code has meaningful comments for non-obvious parts

Next Steps

After completing this project, you’ll be well-prepared for:

  • Project 2: File Information Tool - Learn stat() and file metadata
  • Project 3: Directory Walker - Apply I/O concepts to directory traversal
  • Project 12: Network Echo Server - Apply read/write to sockets

The patterns you’ve learned here—file descriptors, system call handling, error management, and buffering—are foundational to everything that follows.