Project 3: Recursive Directory Walker

Build a recursive directory traversal tool that walks entire directory trees, reporting files, computing sizes, and optionally filtering by criteria (name pattern, size, type).

Quick Reference

Attribute Value
Difficulty Level 3 - Advanced
Time Estimate 20-40 hours
Language C (primary), Rust/Go (alternatives)
Prerequisites Projects 1-2, opendir/readdir, recursion
Key Topics Directory Operations, Recursion, Path Construction, Cycle Detection

1. Learning Objectives

After completing this project, you will:

  • Master opendir(), readdir(), and closedir() for directory operations
  • Understand how directories are structured as files containing name→inode mappings
  • Handle path construction safely without buffer overflows
  • Detect and handle symbolic link cycles to prevent infinite loops
  • Implement depth-first vs breadth-first traversal strategies
  • Use nftw() for standardized directory walking (and understand when to roll your own)
  • Implement file filtering with fnmatch() for glob patterns
  • Calculate directory sizes like du while handling hard link deduplication
  • Handle permissions errors and disappearing files gracefully (TOCTOU issues)

2. Theoretical Foundation

2.1 Core Concepts

Directories as Files

In UNIX, a directory is a special file containing a list of (name, inode) pairs:

Directory File Content (Conceptual View)
┌────────────────────────────────────────────────────────────────┐
│                    /home/user/project                          │
│                                                                │
│  ┌─────────────────────┬──────────────────────────────────┐   │
│  │ Entry Name          │ Inode Number                      │   │
│  ├─────────────────────┼──────────────────────────────────┤   │
│  │ .                   │ 12345 (this directory)           │   │
│  │ ..                  │ 12340 (parent directory)          │   │
│  │ README.md           │ 12346                             │   │
│  │ src                 │ 12350 (subdirectory)              │   │
│  │ Makefile            │ 12347                             │   │
│  │ build               │ 12360 (subdirectory)              │   │
│  └─────────────────────┴──────────────────────────────────┘   │
│                                                                │
└────────────────────────────────────────────────────────────────┘

Key Points:
- "." always points to self
- ".." always points to parent
- Entries are NOT sorted (order depends on creation/deletion history)
- Names are just labels; the inode holds all metadata

Directory Traversal Flow

readdir() Loop Pattern
┌─────────────────────────────────────────────────────────────────┐
│                                                                 │
│  dir = opendir("/path/to/dir")                                  │
│       │                                                         │
│       ▼                                                         │
│  ┌────────────────────┐                                         │
│  │ readdir(dir) ◄─────┼─────────────────────────────────────┐   │
│  └─────────┬──────────┘                                     │   │
│            │                                                │   │
│            ▼                                                │   │
│  ┌────────────────────┐                                     │   │
│  │ entry == NULL?     │──Yes──▶ closedir(dir), DONE         │   │
│  └─────────┬──────────┘                                     │   │
│            │ No                                             │   │
│            ▼                                                │   │
│  ┌────────────────────┐                                     │   │
│  │ Skip "." and ".."  │                                     │   │
│  └─────────┬──────────┘                                     │   │
│            │                                                │   │
│            ▼                                                │   │
│  ┌────────────────────┐                                     │   │
│  │ Build full path:   │                                     │   │
│  │ dir + "/" + name   │                                     │   │
│  └─────────┬──────────┘                                     │   │
│            │                                                │   │
│            ▼                                                │   │
│  ┌────────────────────┐                                     │   │
│  │ lstat(full_path)   │                                     │   │
│  └─────────┬──────────┘                                     │   │
│            │                                                │   │
│            ▼                                                │   │
│  ┌────────────────────┐         ┌───────────────────────┐   │   │
│  │ Is directory?      │──Yes───▶│ Recurse into subdir   │───┘   │
│  └─────────┬──────────┘         └───────────────────────┘       │
│            │ No                                                 │
│            ▼                                                    │
│  ┌────────────────────┐                                         │
│  │ Process file       │──────────────────────────────────────┶   │
│  │ (print, count, etc)│                                         │
│  └────────────────────┘                                         │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Symlinks can create infinite loops that naive traversal will never escape:

Filesystem with Cycle:
/home/user/
├── project/
│   ├── src/
│   │   └── link -> ../..   ← Points back to /home/user!
│   └── data/
└── docs/

Traversal without cycle detection:
1. Enter /home/user/project
2. Enter /home/user/project/src
3. Follow link -> /home/user
4. Enter /home/user/project (AGAIN!)
5. Enter /home/user/project/src
6. Follow link -> /home/user
7. ... INFINITE LOOP ...

Detection Strategy:
Track (device, inode) pairs of directories visited.
Before entering a directory, check if (dev, ino) already seen.

┌─────────────────────────────────────────────────────────┐
│              Visited Set (Hash Table)                    │
│                                                          │
│  Key: (dev, ino)     Seen?                              │
│  ─────────────────────────────                          │
│  (8, 12345)          YES   ← /home/user                 │
│  (8, 12350)          YES   ← /home/user/project         │
│  (8, 12351)          YES   ← /home/user/project/src     │
│                                                          │
│  When link resolves to (8, 12345), we skip it!          │
└─────────────────────────────────────────────────────────┘

Path Construction Safety

Building paths by concatenation is error-prone:

WRONG: Potential buffer overflow
char path[100];
strcpy(path, dir);      // What if dir is 99 chars?
strcat(path, "/");      // Now 100 chars
strcat(path, entry);    // OVERFLOW if entry has any length!

RIGHT: Safe with bounds checking
char path[PATH_MAX];
int len = snprintf(path, sizeof(path), "%s/%s", dir, entry->d_name);
if (len >= sizeof(path)) {
    // Path too long! Handle error
    fprintf(stderr, "Path too long: %s/%s\n", dir, entry->d_name);
    continue;  // Skip this entry
}

RIGHT: Using realpath() for canonical paths
char resolved[PATH_MAX];
if (realpath(path, resolved) == NULL) {
    // Path doesn't exist or other error
    perror(path);
}

nftw() vs Manual Recursion

The nftw() function provides standardized directory walking:

// nftw callback signature
int callback(const char *fpath, const struct stat *sb,
             int typeflag, struct FTW *ftwbuf) {
    // fpath: full path to file
    // sb: stat structure (already called for you!)
    // typeflag: FTW_F (file), FTW_D (directory), FTW_SL (symlink), etc.
    // ftwbuf->level: depth in tree (0 = starting directory)
    // ftwbuf->base: index in fpath where basename starts

    printf("Level %d: %s\n", ftwbuf->level, fpath);
    return 0;  // 0 to continue, non-zero to stop
}

// Usage
nftw("/path", callback, 20, FTW_PHYS);
// 20 = max file descriptors to use
// FTW_PHYS = don't follow symlinks (like lstat)

When to use nftw():

  • Simple traversal with standard behavior
  • Portability is important
  • Don’t need fine-grained control

When to roll your own:

  • Need custom cycle detection
  • Need to modify traversal order
  • Need parallel traversal
  • Building interactive tool (need to pause/resume)

2.2 Why This Matters

Directory traversal is fundamental to countless tools:

  • Backup software: Walk directories to find files to backup
  • Search tools: find, locate, grep -r
  • Build systems: make, cmake (find source files)
  • Sync tools: rsync, dropbox
  • Disk usage: du, ncdu
  • Security scanning: Find setuid files, check permissions
  • Code editors: File tree views, project search

Real-world scale:

  • A typical Linux system has 200,000+ files
  • Large source trees (Linux kernel) have 70,000+ files
  • Backup software must handle millions of files efficiently

2.3 Historical Context

The opendir/readdir API dates to the original UNIX but was standardized in POSIX.1. The key innovation was treating directories as files that you open and read entries from, rather than a special API.

The nftw() function (new file tree walk) replaced older ftw() to handle symbolic links properly. The FTW_PHYS flag was added specifically because symlink handling was a major source of bugs and security issues.

2.4 Common Misconceptions

Misconception 1: “readdir() returns entries in sorted order”

  • Reality: Order depends on filesystem and creation/deletion history. Sort yourself if needed.

Misconception 2: “d_type in dirent is always valid”

  • Reality: Some filesystems don’t support d_type (returns DT_UNKNOWN). Always stat() if you need the type.

Misconception 3: “Symlinks to directories should always be followed”

  • Reality: Following symlinks can cause loops and security issues. Default to not following.

Misconception 4: “PATH_MAX is the absolute maximum path length”

  • Reality: Some filesystems and operations can exceed PATH_MAX. Handle long paths gracefully.

Misconception 5: “File won’t disappear between readdir and stat”

  • Reality: This is a TOCTOU (time-of-check-time-of-use) race. Handle ENOENT from stat().

3. Project Specification

3.1 What You Will Build

A directory traversal tool called myfind that combines features of tree, du, and find:

  • Tree mode: Display directory structure visually
  • Disk usage mode: Calculate sizes like du
  • Search mode: Find files matching patterns
  • All modes with proper cycle detection and error handling

3.2 Functional Requirements

  1. Tree display: myfind --tree DIR shows directory structure
  2. Disk usage: myfind --du DIR shows sizes like du
  3. Name search: myfind --name "PATTERN" DIR finds matching files
  4. Size filter: myfind --size +1M DIR finds files larger than 1MB
  5. Type filter: myfind --type f|d|l DIR filters by file type
  6. Symlink handling: Default don’t follow; -L to follow
  7. Max depth: --maxdepth N limits recursion depth
  8. Execute action: --exec CMD {} runs command on matches

3.3 Non-Functional Requirements

  1. Cycle safety: Must detect and skip symbolic link cycles
  2. Memory efficiency: Handle directories with millions of files
  3. Error resilience: Continue on permission denied, missing files
  4. Performance: Comparable to standard tools on same workload
  5. Correct sizes: Handle hard links correctly for du mode

3.4 Example Usage / Output

# 1. Basic tree display
$ ./myfind --tree /usr/include
/usr/include
├── assert.h
├── complex.h
├── errno.h
├── linux/
│   ├── capability.h
│   ├── fs.h
│   └── types.h
├── sys/
│   ├── socket.h
│   ├── stat.h
│   └── types.h
└── unistd.h
4 directories, 10 files

# 2. Disk usage mode
$ ./myfind --du /var/log
12K     /var/log/apt
4K      /var/log/apt/history.log
8K      /var/log/apt/term.log
256K    /var/log/journal
1.2M    /var/log
Total: 1.2M in 23 files, 5 directories

# 3. Find files matching pattern
$ ./myfind --name "*.h" /usr/include | head -5
/usr/include/assert.h
/usr/include/complex.h
/usr/include/errno.h
/usr/include/features.h
/usr/include/float.h

# 4. Find files larger than 1MB
$ ./myfind --size +1M /usr
/usr/lib/libc.so.6
/usr/lib/libLLVM-14.so.1
/usr/bin/clang
...

# 5. Handle symbolic link cycle gracefully
$ ln -s . loop
$ ./myfind --tree .
.
├── file.txt
└── loop -> . (symlink, skipping to avoid cycle)

# 6. Find with type filter
$ ./myfind --type d /etc | head -5
/etc
/etc/apt
/etc/apt/apt.conf.d
/etc/apt/preferences.d
/etc/alternatives

# 7. Execute command on matches
$ ./myfind --name "*.log" --exec "wc -l {}" /var/log
42 /var/log/syslog
127 /var/log/auth.log
...

3.5 Real World Outcome

What success looks like:

  1. Tree display: Properly indented, shows structure clearly
  2. Accurate sizes: Matches du output (accounting for hard links)
  3. Pattern matching: Works with *, ?, [abc] patterns
  4. Cycle safety: Never hangs on symlink loops
  5. Error handling: Continues gracefully on permission denied
  6. Performance: Handles /usr (100k+ files) in seconds

4. Solution Architecture

4.1 High-Level Design

┌─────────────────────────────────────────────────────────────────┐
│                          myfind                                  │
│                                                                  │
│  ┌─────────────────┐                                            │
│  │  Parse Args     │ ← --tree, --du, --name, --size, etc.       │
│  └────────┬────────┘                                            │
│           │                                                      │
│           ▼                                                      │
│  ┌─────────────────┐                                            │
│  │  Initialize     │ ← Create visited set for cycle detection   │
│  │  State          │ ← Set up filters, output mode              │
│  └────────┬────────┘                                            │
│           │                                                      │
│           ▼                                                      │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │              Recursive Walker                            │    │
│  │                                                          │    │
│  │  ┌─────────────────┐                                    │    │
│  │  │ Check depth     │ ← Respect --maxdepth               │    │
│  │  └────────┬────────┘                                    │    │
│  │           │                                              │    │
│  │           ▼                                              │    │
│  │  ┌─────────────────┐                                    │    │
│  │  │ Check cycle     │ ← Is (dev,ino) in visited set?     │    │
│  │  └────────┬────────┘                                    │    │
│  │           │ Not a cycle                                  │    │
│  │           ▼                                              │    │
│  │  ┌─────────────────┐                                    │    │
│  │  │ opendir()       │                                    │    │
│  │  └────────┬────────┘                                    │    │
│  │           │                                              │    │
│  │           ▼                                              │    │
│  │  ┌─────────────────────────────────────────────────┐    │    │
│  │  │            For each entry (readdir)             │    │    │
│  │  │                                                  │    │    │
│  │  │  ┌─────────────────┐                            │    │    │
│  │  │  │ Skip . and ..   │                            │    │    │
│  │  │  └────────┬────────┘                            │    │    │
│  │  │           │                                      │    │    │
│  │  │           ▼                                      │    │    │
│  │  │  ┌─────────────────┐                            │    │    │
│  │  │  │ Build path      │                            │    │    │
│  │  │  └────────┬────────┘                            │    │    │
│  │  │           │                                      │    │    │
│  │  │           ▼                                      │    │    │
│  │  │  ┌─────────────────┐                            │    │    │
│  │  │  │ lstat()         │                            │    │    │
│  │  │  └────────┬────────┘                            │    │    │
│  │  │           │                                      │    │    │
│  │  │           ▼                                      │    │    │
│  │  │  ┌─────────────────┐                            │    │    │
│  │  │  │ Apply filters   │ ← name, size, type         │    │    │
│  │  │  └────────┬────────┘                            │    │    │
│  │  │           │ Matches                              │    │    │
│  │  │           ▼                                      │    │    │
│  │  │  ┌─────────────────┐                            │    │    │
│  │  │  │ Output/Action   │ ← print, accumulate size   │    │    │
│  │  │  └────────┬────────┘                            │    │    │
│  │  │           │                                      │    │    │
│  │  │           ▼                                      │    │    │
│  │  │  ┌─────────────────┐         ┌────────────────┐ │    │    │
│  │  │  │ Is directory?   │──Yes───▶│ Recurse        │ │    │    │
│  │  │  └─────────────────┘         └────────────────┘ │    │    │
│  │  │                                                  │    │    │
│  │  └──────────────────────────────────────────────────┘    │    │
│  │                                                          │    │
│  │  ┌─────────────────┐                                    │    │
│  │  │ closedir()      │                                    │    │
│  │  └─────────────────┘                                    │    │
│  │                                                          │    │
│  └──────────────────────────────────────────────────────────┘    │
│                                                                  │
│  ┌─────────────────┐                                            │
│  │  Print Summary  │ ← Total files, dirs, size                  │
│  └─────────────────┘                                            │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘

4.2 Key Components

Component Purpose Key Functions
Argument Parser Parse command line options getopt_long()
Cycle Detector Track visited (dev,ino) pairs Hash set operations
Path Builder Safely construct paths snprintf(), realpath()
Walker Recursive directory traversal opendir(), readdir(), closedir()
Filter Engine Apply name/size/type filters fnmatch(), stat comparisons
Output Formatter Tree, du, or plain output printf with formatting
Stats Tracker Count files, dirs, accumulate sizes Running totals

4.3 Data Structures

// Configuration from command line
typedef struct {
    const char *start_path;

    // Mode (mutually exclusive)
    int tree_mode;           // --tree
    int du_mode;             // --du
    int find_mode;           // (default or with filters)

    // Filters
    const char *name_pattern;  // --name "*.c"
    off_t min_size;            // --size +1M (positive = minimum)
    off_t max_size;            // --size -1M (negative = maximum)
    int file_type;             // --type f/d/l

    // Options
    int follow_symlinks;       // -L
    int max_depth;             // --maxdepth N, -1 = unlimited
    const char *exec_cmd;      // --exec CMD
    int verbose;               // -v
} config_t;

// Traversal state
typedef struct {
    // Cycle detection
    struct {
        dev_t dev;
        ino_t ino;
    } *visited;
    size_t visited_count;
    size_t visited_capacity;

    // Statistics
    size_t file_count;
    size_t dir_count;
    size_t symlink_count;
    off_t total_size;

    // For du mode: track seen inodes (hard link dedup)
    struct {
        dev_t dev;
        ino_t ino;
    } *seen_inodes;
    size_t seen_count;
    size_t seen_capacity;

    // Current depth
    int current_depth;
} walker_state_t;

// Directory entry info
typedef struct {
    char path[PATH_MAX];
    struct stat st;
    int stat_failed;       // 1 if stat() failed
    int is_cycle;          // 1 if this is a cycle
} entry_info_t;

4.4 Algorithm Overview

FUNCTION walk_directory(path, depth, state, config):
    // Check depth limit
    IF depth > config.max_depth AND config.max_depth >= 0:
        RETURN

    // lstat the directory itself
    IF lstat(path, &st) == -1:
        PRINT error
        RETURN

    // Check for cycle (for directories only)
    IF is_directory(st.st_mode):
        IF (st.st_dev, st.st_ino) IN state.visited:
            IF config.verbose:
                PRINT "Skipping cycle: " + path
            RETURN
        ADD (st.st_dev, st.st_ino) TO state.visited

    // Open directory
    dir = opendir(path)
    IF dir == NULL:
        PRINT error
        RETURN

    // Read entries
    WHILE (entry = readdir(dir)) != NULL:
        IF entry.d_name == "." OR entry.d_name == "..":
            CONTINUE

        // Build full path
        full_path = path + "/" + entry.d_name
        IF path_too_long:
            PRINT warning
            CONTINUE

        // Get entry info
        IF lstat(full_path, &entry_st) == -1:
            IF errno == ENOENT:
                // File disappeared (TOCTOU race), skip silently
                CONTINUE
            PRINT error
            CONTINUE

        // Apply filters
        IF NOT matches_filters(full_path, entry_st, config):
            // Still recurse into directories even if they don't match
            IF is_directory(entry_st.st_mode):
                walk_directory(full_path, depth + 1, state, config)
            CONTINUE

        // Process matching entry
        process_entry(full_path, entry_st, depth, state, config)

        // Recurse into directories
        IF is_directory(entry_st.st_mode):
            IF NOT is_symlink(entry_st.st_mode) OR config.follow_symlinks:
                walk_directory(full_path, depth + 1, state, config)

    closedir(dir)


FUNCTION matches_filters(path, st, config):
    // Name filter
    IF config.name_pattern != NULL:
        IF NOT fnmatch(config.name_pattern, basename(path)):
            RETURN FALSE

    // Size filter
    IF config.min_size > 0 AND st.st_size < config.min_size:
        RETURN FALSE
    IF config.max_size > 0 AND st.st_size > config.max_size:
        RETURN FALSE

    // Type filter
    IF config.file_type != 0:
        IF config.file_type == 'f' AND NOT is_regular(st.st_mode):
            RETURN FALSE
        IF config.file_type == 'd' AND NOT is_directory(st.st_mode):
            RETURN FALSE
        IF config.file_type == 'l' AND NOT is_symlink(st.st_mode):
            RETURN FALSE

    RETURN TRUE

5. Implementation Guide

5.1 Development Environment Setup

# Create project directory
$ mkdir -p ~/projects/myfind
$ cd ~/projects/myfind

# Create test directory structure
$ mkdir -p testdir/subdir1/subsubdir
$ mkdir -p testdir/subdir2
$ touch testdir/file1.txt testdir/file2.c
$ touch testdir/subdir1/file3.h
$ touch testdir/subdir1/subsubdir/file4.cpp
$ touch testdir/subdir2/file5.py

# Create symlink cycle for testing
$ ln -s .. testdir/subdir1/parent_link

# Create various file types
$ mkfifo testdir/mypipe
$ ln -s file1.txt testdir/link_to_file1

# Verify structure
$ find testdir -type f -o -type d -o -type l | sort

5.2 Project Structure

myfind/
├── myfind.c              # Main source file
├── Makefile              # Build configuration
├── testdir/              # Test directory structure
│   ├── file1.txt
│   ├── file2.c
│   ├── link_to_file1 -> file1.txt
│   ├── mypipe            # FIFO
│   ├── subdir1/
│   │   ├── file3.h
│   │   ├── parent_link -> ..  # Cycle!
│   │   └── subsubdir/
│   │       └── file4.cpp
│   └── subdir2/
│       └── file5.py
└── README.md

5.3 The Core Question You’re Answering

“How do you traverse an entire filesystem efficiently while handling all the edge cases (permissions, symlinks, mount points)?”

This question underlies every backup tool, search utility, and file synchronizer ever written. The naive approach (recursion with string concatenation) fails on deep trees and cyclic links.

5.4 Concepts You Must Understand First

Stop and research these before coding:

  1. Directory as a File
    • A directory is a file containing name→inode mappings
    • What does opendir/readdir/closedir return?
    • Book Reference: “APUE” Ch. 4.21-4.22
  2. Path Construction
    • How to join directory path with filename?
    • Maximum path length (PATH_MAX)
    • Book Reference: “APUE” Ch. 4.14
  3. Symbolic Link Handling
    • When to follow vs. when to skip?
    • How to detect cycles?
    • lstat() vs stat() for link detection
  4. nftw() vs Manual Recursion
    • What does nftw() provide?
    • When would you prefer manual control?
    • Book Reference: “APUE” Ch. 4.22

5.5 Questions to Guide Your Design

Before implementing, think through these:

  1. Recursion Strategy
    • Use actual recursion or maintain a work queue?
    • How deep can the call stack go? (Default stack ~8MB, each frame ~4KB = ~2000 levels)
    • For extreme depth, consider iterative with explicit stack
  2. Memory Management
    • How to handle paths—static buffer or malloc?
    • How to track visited inodes for cycle detection?
    • How to handle very large directories (millions of entries)?
  3. Error Handling
    • What if you can’t read a directory (permission)?
    • What if a file disappears mid-traversal (TOCTOU)?
    • Should errors stop traversal or just be reported?

5.6 Thinking Exercise

Model the Directory Structure

Consider this filesystem:

/home/user/
├── file1.txt (inode 100)
├── subdir/ (inode 101)
│   └── file2.txt (inode 102)
├── link1 -> subdir (symlink, inode 103)
└── hardlink.txt (inode 100, same as file1.txt!)

What happens if you:
1. Traverse without following symlinks?
   - Visit: /home/user, subdir, file2.txt
   - Skip link1 (it's a symlink)
   - Total: 3 entries

2. Follow symlinks but don't track inodes?
   - Visit: /home/user, subdir, file2.txt
   - Follow link1 -> enter subdir AGAIN
   - Visit file2.txt AGAIN
   - Could loop forever if symlink points to ancestor!

3. Count bytes for hardlink.txt—double count or not?
   - file1.txt and hardlink.txt share inode 100
   - du counts each inode only once
   - Track (dev, ino) pairs to avoid double-counting

Trace this mentally before coding!

5.7 Hints in Layers

Hint 1: Basic Structure Open directory with opendir(), loop with readdir(), skip “.” and “..”, recurse on directories, close with closedir().

void walk(const char *path, int depth) {
    DIR *dir = opendir(path);
    if (!dir) { perror(path); return; }

    struct dirent *entry;
    while ((entry = readdir(dir)) != NULL) {
        if (strcmp(entry->d_name, ".") == 0 ||
            strcmp(entry->d_name, "..") == 0)
            continue;

        // Build full path and process...
    }
    closedir(dir);
}

Hint 2: Path Building

// Safe path construction
char full_path[PATH_MAX];
int len = snprintf(full_path, sizeof(full_path), "%s/%s",
                   dir_path, entry->d_name);
if (len >= sizeof(full_path)) {
    fprintf(stderr, "Path too long: %s/%s\n", dir_path, entry->d_name);
    continue;  // Skip this entry
}

Hint 3: Using nftw()

#define _XOPEN_SOURCE 500  // Required for nftw
#include <ftw.h>

int callback(const char *fpath, const struct stat *sb,
             int typeflag, struct FTW *ftwbuf) {
    // typeflag: FTW_F (file), FTW_D (directory pre),
    //           FTW_DP (directory post), FTW_SL (symlink)
    // ftwbuf->level: depth in tree
    // ftwbuf->base: index where basename starts

    printf("%*s%s\n", ftwbuf->level * 2, "", fpath + ftwbuf->base);
    return 0;  // Continue walking
}

int main(int argc, char *argv[]) {
    // FTW_PHYS: Don't follow symlinks
    // 20: max file descriptors to use
    nftw(argv[1], callback, 20, FTW_PHYS);
    return 0;
}

Hint 4: Cycle Detection Track (device, inode) pairs in a hash set. Before entering a directory, check if already visited.

typedef struct {
    dev_t dev;
    ino_t ino;
} dir_id_t;

int is_visited(dir_id_t *visited, size_t count, dev_t dev, ino_t ino) {
    for (size_t i = 0; i < count; i++) {
        if (visited[i].dev == dev && visited[i].ino == ino)
            return 1;
    }
    return 0;
}

5.8 The Interview Questions They’ll Ask

  1. “How would you find all files modified in the last 24 hours?”
    • Use stat() to get st_mtime
    • Compare to current time minus 24 hours
    • find -mtime -1 equivalent
  2. “How does du avoid counting hardlinked files twice?”
    • Tracks (dev, ino) pairs of all files seen
    • Only counts size for first occurrence of each inode
    • This is why du can be slower than ls
  3. “What’s the difference between depth-first and breadth-first traversal for filesystems?”
    • DFS: Process subdirectories before siblings (typical)
    • BFS: Process all entries at one level before going deeper
    • DFS uses less memory, BFS better for finding nearby files
  4. “How would you make a parallel directory walker?”
    • Producer-consumer with directory queue
    • Multiple worker threads calling readdir
    • Must handle shared visited set
    • Consider io_uring for modern async I/O
  5. “What is a TOCTOU race in filesystem operations?”
    • Time-Of-Check-Time-Of-Use
    • File can change between stat() and open()
    • Or disappear between readdir() and stat()
    • Handle with retry or openat() for atomic operations

5.9 Books That Will Help

Topic Book Chapter
Directory operations “APUE” by Stevens Ch. 4.21-4.22
nftw() details “The Linux Programming Interface” Ch. 18
Filesystem internals “Understanding the Linux Kernel” Ch. 12
Pattern matching “The Linux Programming Interface” Ch. 18.8

5.10 Implementation Phases

Phase 1: Basic Walk (3-4 hours)

  • Walk a single directory
  • Print all entries
  • Handle “.” and “..” correctly

Phase 2: Recursion (2-3 hours)

  • Add recursion for subdirectories
  • Track depth
  • Handle max depth option

Phase 3: Cycle Detection (2-3 hours)

  • Implement visited set
  • Test with symlink cycles
  • Add verbose output for skipped cycles

Phase 4: Tree Output (2-3 hours)

  • Format output with tree characters (├── └──)
  • Track last-entry state for correct formatting
  • Add file count summary

Phase 5: Filters (3-4 hours)

  • Add name pattern matching with fnmatch()
  • Add size filters
  • Add type filters

Phase 6: Du Mode (2-3 hours)

  • Accumulate sizes
  • Track inodes for hard link deduplication
  • Human-readable size output

5.11 Key Implementation Decisions

Decision Trade-offs
Recursion vs iteration Recursion is cleaner; iteration handles extreme depth
nftw() vs manual nftw() is portable; manual gives control
Hash table vs array for visited Hash is O(1); array is simpler but O(n)
Static vs dynamic path buffer Static is simpler; dynamic handles any length
Stop vs continue on error Continue is user-friendly; stop may be safer

6. Testing Strategy

6.1 Unit Tests

Test Input Expected Result
Skip . and .. readdir loop Only real entries
Path building “/foo” + “bar” “/foo/bar”
Path too long near PATH_MAX Warning, skip
fnmatch “*.c” “test.c” Match
fnmatch “*.c” “test.h” No match
Size filter +1M 2MB file Match
Size filter +1M 500KB file No match

6.2 Integration Tests

# Test basic tree
$ ./myfind --tree testdir
# Should show proper tree structure

# Test cycle handling
$ ln -s .. testdir/subdir1/cycle
$ ./myfind --tree testdir
# Should not hang, should note cycle

# Test name filter
$ ./myfind --name "*.c" testdir
# Should only show .c files

# Test du mode
$ ./myfind --du testdir
$ du -sh testdir
# Sizes should match

# Compare with find
$ ./myfind --name "*.h" /usr/include | wc -l
$ find /usr/include -name "*.h" | wc -l
# Counts should match

6.3 Edge Cases to Test

Case Setup Expected Behavior
Empty directory mkdir empty Shows directory, no entries
Permission denied chmod 000 dir Error message, continue
Symlink cycle ln -s . loop Detect, skip with message
Dangling symlink ln -s /nonexistent link Report link, note dangling
Very deep tree 1000+ levels Handle without stack overflow
Huge directory 100k+ files Complete in reasonable time
Hard links Multiple links to same file Count once for du
Special files FIFO, socket Handle correctly

6.4 Verification Commands

# Test for memory leaks
$ valgrind --leak-check=full ./myfind --tree /usr/include

# Compare output with standard tools
$ diff <(./myfind --name "*.c" src | sort) <(find src -name "*.c" | sort)

# Performance comparison
$ time ./myfind --tree /usr >/dev/null
$ time tree /usr >/dev/null

# Test with strace
$ strace -c ./myfind --tree /home
# Check syscall counts are reasonable

7. Common Pitfalls & Debugging

Problem 1: “Crashes on deep directory trees”

  • Why: Stack overflow from recursion
  • Fix: Use iterative approach with explicit stack, or increase stack size

Problem 2: “Infinite loop with symlinks”

  • Why: Following symlinks without cycle detection
  • Fix: Use FTW_PHYS flag with nftw(), or track visited inodes

Problem 3: “Wrong sizes reported (du mode)”

  • Why: Not using lstat() for symlinks, or double-counting hardlinks
  • Fix: Use lstat(), track (dev, ino) pairs for hardlink dedup

Problem 4: “Buffer overflow in path construction”

  • Why: Using strcpy/strcat without bounds checking
  • Fix: Use snprintf() with proper size checks

Problem 5: “Missing files in output”

  • Why: TOCTOU race—file disappeared between readdir and stat
  • Fix: Handle ENOENT from stat() gracefully

Problem 6: “Permission denied breaks entire walk”

  • Why: Error handling stops traversal
  • Fix: Report error and continue to next entry

8. Extensions & Challenges

8.1 Easy Extensions

Extension Description Learning
–mtime filter Find by modification time Time comparisons
–perm filter Find by permissions Bitmask operations
–empty Find empty files/dirs Size and link count
–prune Skip directories matching pattern Traversal control

8.2 Advanced Challenges

Challenge Description Learning
Parallel walk Multi-threaded traversal Concurrency
Content search grep -r equivalent File reading + traversal
Watch mode Continuous monitoring inotify integration
Network filesystems Handle NFS/CIFS Timeout, error handling
Persistent state Resume interrupted walk Checkpointing

8.3 Research Topics

  • How does locate achieve instant search? (mlocate database)
  • How does git efficiently check status of large repos?
  • What is fanotify and how does it differ from inotify?
  • How do backup tools handle millions of files efficiently?

9. Real-World Connections

9.1 Production Systems Using This

System How It Uses Directory Walking Notable Feature
find Recursive traversal with predicates Expression language
rsync Compare source/dest directories Incremental sync
tar Archive directory trees Preserve metadata
du Calculate disk usage Hard link tracking
git Index working directory Stat cache
webpack Watch source files File change detection

9.2 How the Pros Do It

GNU find:

  • Uses nftw() internally on most systems
  • Implements full predicate language (-name, -type, -exec, etc.)
  • Handles -prune for skipping directories

rsync:

  • Builds file lists from both source and destination
  • Compares metadata to find changes
  • Optimized for network transfer

fd (modern find replacement):

  • Uses parallel traversal
  • Respects .gitignore by default
  • Written in Rust for performance

9.3 Reading the Source

# View GNU findutils source
$ git clone https://git.savannah.gnu.org/git/findutils.git
$ less findutils/find/ftsfind.c

# View tree source
$ apt source tree
$ less tree-*/tree.c

# View fd source (Rust)
$ git clone https://github.com/sharkdp/fd
$ less fd/src/walk.rs

10. Resources

10.1 Man Pages

$ man 3 opendir     # Open directory stream
$ man 3 readdir     # Read directory entry
$ man 3 closedir    # Close directory stream
$ man 3 nftw        # File tree walk
$ man 3 fnmatch     # Filename matching
$ man 2 stat        # Get file status
$ man 2 lstat       # Get symlink status
$ man 7 path_resolution  # How paths are resolved

10.2 Online Resources

10.3 Book Chapters

Book Chapters Topics Covered
“APUE” by Stevens Ch. 4.21-4.22 Directory Reading, File Tree Walking
“TLPI” by Kerrisk Ch. 18 Directories and Links
“Linux System Programming” by Love Ch. 7 File and Directory Management

11. Self-Assessment Checklist

Before considering this project complete, verify:

  • I can explain how directories are stored as files
  • I understand the difference between opendir/readdir and nftw
  • My tool correctly handles symbolic link cycles
  • Path construction is safe from buffer overflows
  • Permission denied errors don’t crash the program
  • TOCTOU races are handled gracefully
  • Hard links are counted once in du mode
  • Tree output formatting is correct
  • Pattern matching with fnmatch() works correctly
  • I can answer all the interview questions listed above
  • My code has zero memory leaks (valgrind verified)

12. Submission / Completion Criteria

Your project is complete when:

  1. Functionality
    • --tree mode displays directory structure correctly
    • --du mode matches system du output
    • --name filter works with glob patterns
    • Cycle detection prevents infinite loops
  2. Quality
    • Compiles with gcc -Wall -Wextra -Werror with no warnings
    • Zero valgrind errors
    • Handles all error conditions gracefully
  3. Testing
    • Works on /usr/include (large directory)
    • Handles permission denied gracefully
    • Detects and reports symlink cycles
    • Hard links counted correctly
  4. Understanding
    • Can explain directory structure internals
    • Can explain cycle detection algorithm
    • Understands nftw() vs manual walking trade-offs

Next Steps

After completing this project, you’re ready for:

  • Project 4: Shell Implementation - Use directory walking for command completion
  • Project 5: Process Monitor - Walk /proc filesystem
  • Project 10: File Watcher - Combine walking with inotify

The directory traversal patterns you’ve learned here appear in every tool that deals with the filesystem.