Project 3: Recursive Directory Walker

Build a recursive directory traversal tool that walks entire directory trees, reporting files, computing sizes, and optionally filtering by criteria (name pattern, size, type).

Quick Reference

Attribute	Value
Difficulty	Level 3 - Advanced
Time Estimate	20-40 hours
Language	C (primary), Rust/Go (alternatives)
Prerequisites	Projects 1-2, opendir/readdir, recursion
Key Topics	Directory Operations, Recursion, Path Construction, Cycle Detection

1. Learning Objectives

After completing this project, you will:

Master opendir(), readdir(), and closedir() for directory operations
Understand how directories are structured as files containing name→inode mappings
Handle path construction safely without buffer overflows
Detect and handle symbolic link cycles to prevent infinite loops
Implement depth-first vs breadth-first traversal strategies
Use nftw() for standardized directory walking (and understand when to roll your own)
Implement file filtering with fnmatch() for glob patterns
Calculate directory sizes like du while handling hard link deduplication
Handle permissions errors and disappearing files gracefully (TOCTOU issues)

2. Theoretical Foundation

2.1 Core Concepts

Directories as Files

In UNIX, a directory is a special file containing a list of (name, inode) pairs:

Directory File Content (Conceptual View)
┌────────────────────────────────────────────────────────────────┐
│                    /home/user/project                          │
│                                                                │
│  ┌─────────────────────┬──────────────────────────────────┐   │
│  │ Entry Name          │ Inode Number                      │   │
│  ├─────────────────────┼──────────────────────────────────┤   │
│  │ .                   │ 12345 (this directory)           │   │
│  │ ..                  │ 12340 (parent directory)          │   │
│  │ README.md           │ 12346                             │   │
│  │ src                 │ 12350 (subdirectory)              │   │
│  │ Makefile            │ 12347                             │   │
│  │ build               │ 12360 (subdirectory)              │   │
│  └─────────────────────┴──────────────────────────────────┘   │
│                                                                │
└────────────────────────────────────────────────────────────────┘

Key Points:
- "." always points to self
- ".." always points to parent
- Entries are NOT sorted (order depends on creation/deletion history)
- Names are just labels; the inode holds all metadata

Directory Traversal Flow

readdir() Loop Pattern
┌─────────────────────────────────────────────────────────────────┐
│                                                                 │
│  dir = opendir("/path/to/dir")                                  │
│       │                                                         │
│       ▼                                                         │
│  ┌────────────────────┐                                         │
│  │ readdir(dir) ◄─────┼─────────────────────────────────────┐   │
│  └─────────┬──────────┘                                     │   │
│            │                                                │   │
│            ▼                                                │   │
│  ┌────────────────────┐                                     │   │
│  │ entry == NULL?     │──Yes──▶ closedir(dir), DONE         │   │
│  └─────────┬──────────┘                                     │   │
│            │ No                                             │   │
│            ▼                                                │   │
│  ┌────────────────────┐                                     │   │
│  │ Skip "." and ".."  │                                     │   │
│  └─────────┬──────────┘                                     │   │
│            │                                                │   │
│            ▼                                                │   │
│  ┌────────────────────┐                                     │   │
│  │ Build full path:   │                                     │   │
│  │ dir + "/" + name   │                                     │   │
│  └─────────┬──────────┘                                     │   │
│            │                                                │   │
│            ▼                                                │   │
│  ┌────────────────────┐                                     │   │
│  │ lstat(full_path)   │                                     │   │
│  └─────────┬──────────┘                                     │   │
│            │                                                │   │
│            ▼                                                │   │
│  ┌────────────────────┐         ┌───────────────────────┐   │   │
│  │ Is directory?      │──Yes───▶│ Recurse into subdir   │───┘   │
│  └─────────┬──────────┘         └───────────────────────┘       │
│            │ No                                                 │
│            ▼                                                    │
│  ┌────────────────────┐                                         │
│  │ Process file       │──────────────────────────────────────┶   │
│  │ (print, count, etc)│                                         │
│  └────────────────────┘                                         │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Symbolic Link Cycles

Symlinks can create infinite loops that naive traversal will never escape:

Filesystem with Cycle:
/home/user/
├── project/
│   ├── src/
│   │   └── link -> ../..   ← Points back to /home/user!
│   └── data/
└── docs/

Traversal without cycle detection:
1. Enter /home/user/project
2. Enter /home/user/project/src
3. Follow link -> /home/user
4. Enter /home/user/project (AGAIN!)
5. Enter /home/user/project/src
6. Follow link -> /home/user
7. ... INFINITE LOOP ...

Detection Strategy:
Track (device, inode) pairs of directories visited.
Before entering a directory, check if (dev, ino) already seen.

┌─────────────────────────────────────────────────────────┐
│              Visited Set (Hash Table)                    │
│                                                          │
│  Key: (dev, ino)     Seen?                              │
│  ─────────────────────────────                          │
│  (8, 12345)          YES   ← /home/user                 │
│  (8, 12350)          YES   ← /home/user/project         │
│  (8, 12351)          YES   ← /home/user/project/src     │
│                                                          │
│  When link resolves to (8, 12345), we skip it!          │
└─────────────────────────────────────────────────────────┘

Path Construction Safety

Building paths by concatenation is error-prone:

WRONG: Potential buffer overflow
char path[100];
strcpy(path, dir);      // What if dir is 99 chars?
strcat(path, "/");      // Now 100 chars
strcat(path, entry);    // OVERFLOW if entry has any length!

RIGHT: Safe with bounds checking
char path[PATH_MAX];
int len = snprintf(path, sizeof(path), "%s/%s", dir, entry->d_name);
if (len >= sizeof(path)) {
    // Path too long! Handle error
    fprintf(stderr, "Path too long: %s/%s\n", dir, entry->d_name);
    continue;  // Skip this entry
}

RIGHT: Using realpath() for canonical paths
char resolved[PATH_MAX];
if (realpath(path, resolved) == NULL) {
    // Path doesn't exist or other error
    perror(path);
}

nftw() vs Manual Recursion

The nftw() function provides standardized directory walking:

// nftw callback signature
int callback(const char *fpath, const struct stat *sb,
             int typeflag, struct FTW *ftwbuf) {
    // fpath: full path to file
    // sb: stat structure (already called for you!)
    // typeflag: FTW_F (file), FTW_D (directory), FTW_SL (symlink), etc.
    // ftwbuf->level: depth in tree (0 = starting directory)
    // ftwbuf->base: index in fpath where basename starts

    printf("Level %d: %s\n", ftwbuf->level, fpath);
    return 0;  // 0 to continue, non-zero to stop
}

// Usage
nftw("/path", callback, 20, FTW_PHYS);
// 20 = max file descriptors to use
// FTW_PHYS = don't follow symlinks (like lstat)

When to use nftw():

Simple traversal with standard behavior
Portability is important
Don’t need fine-grained control

When to roll your own:

Need custom cycle detection
Need to modify traversal order
Need parallel traversal
Building interactive tool (need to pause/resume)

2.2 Why This Matters

Directory traversal is fundamental to countless tools:

Backup software: Walk directories to find files to backup
Search tools: find, locate, grep -r
Build systems: make, cmake (find source files)
Sync tools: rsync, dropbox
Disk usage: du, ncdu
Security scanning: Find setuid files, check permissions
Code editors: File tree views, project search

Real-world scale:

A typical Linux system has 200,000+ files
Large source trees (Linux kernel) have 70,000+ files
Backup software must handle millions of files efficiently

2.3 Historical Context

The opendir/readdir API dates to the original UNIX but was standardized in POSIX.1. The key innovation was treating directories as files that you open and read entries from, rather than a special API.

The nftw() function (new file tree walk) replaced older ftw() to handle symbolic links properly. The FTW_PHYS flag was added specifically because symlink handling was a major source of bugs and security issues.

2.4 Common Misconceptions

Misconception 1: “readdir() returns entries in sorted order”

Reality: Order depends on filesystem and creation/deletion history. Sort yourself if needed.

Misconception 2: “d_type in dirent is always valid”

Reality: Some filesystems don’t support d_type (returns DT_UNKNOWN). Always stat() if you need the type.

Misconception 3: “Symlinks to directories should always be followed”

Reality: Following symlinks can cause loops and security issues. Default to not following.

Misconception 4: “PATH_MAX is the absolute maximum path length”

Reality: Some filesystems and operations can exceed PATH_MAX. Handle long paths gracefully.

Misconception 5: “File won’t disappear between readdir and stat”

Reality: This is a TOCTOU (time-of-check-time-of-use) race. Handle ENOENT from stat().

3. Project Specification

3.1 What You Will Build

A directory traversal tool called myfind that combines features of tree, du, and find:

Tree mode: Display directory structure visually
Disk usage mode: Calculate sizes like du
Search mode: Find files matching patterns
All modes with proper cycle detection and error handling

3.2 Functional Requirements

Tree display: myfind --tree DIR shows directory structure
Disk usage: myfind --du DIR shows sizes like du
Name search: myfind --name "PATTERN" DIR finds matching files
Size filter: myfind --size +1M DIR finds files larger than 1MB
Type filter: myfind --type f|d|l DIR filters by file type
Symlink handling: Default don’t follow; -L to follow
Max depth: --maxdepth N limits recursion depth
Execute action: --exec CMD {} runs command on matches

3.3 Non-Functional Requirements

Cycle safety: Must detect and skip symbolic link cycles
Memory efficiency: Handle directories with millions of files
Error resilience: Continue on permission denied, missing files
Performance: Comparable to standard tools on same workload
Correct sizes: Handle hard links correctly for du mode

3.4 Example Usage / Output

# 1. Basic tree display
$ ./myfind --tree /usr/include
/usr/include
├── assert.h
├── complex.h
├── errno.h
├── linux/
│   ├── capability.h
│   ├── fs.h
│   └── types.h
├── sys/
│   ├── socket.h
│   ├── stat.h
│   └── types.h
└── unistd.h
4 directories, 10 files

# 2. Disk usage mode
$ ./myfind --du /var/log
12K     /var/log/apt
4K      /var/log/apt/history.log
8K      /var/log/apt/term.log
256K    /var/log/journal
1.2M    /var/log
Total: 1.2M in 23 files, 5 directories

# 3. Find files matching pattern
$ ./myfind --name "*.h" /usr/include | head -5
/usr/include/assert.h
/usr/include/complex.h
/usr/include/errno.h
/usr/include/features.h
/usr/include/float.h

# 4. Find files larger than 1MB
$ ./myfind --size +1M /usr
/usr/lib/libc.so.6
/usr/lib/libLLVM-14.so.1
/usr/bin/clang
...

# 5. Handle symbolic link cycle gracefully
$ ln -s . loop
$ ./myfind --tree .
.
├── file.txt
└── loop -> . (symlink, skipping to avoid cycle)

# 6. Find with type filter
$ ./myfind --type d /etc | head -5
/etc
/etc/apt
/etc/apt/apt.conf.d
/etc/apt/preferences.d
/etc/alternatives

# 7. Execute command on matches
$ ./myfind --name "*.log" --exec "wc -l {}" /var/log
42 /var/log/syslog
127 /var/log/auth.log
...

3.5 Real World Outcome

What success looks like:

Tree display: Properly indented, shows structure clearly
Accurate sizes: Matches du output (accounting for hard links)
Pattern matching: Works with *, ?, [abc] patterns
Cycle safety: Never hangs on symlink loops
Error handling: Continues gracefully on permission denied
Performance: Handles /usr (100k+ files) in seconds

4. Solution Architecture

4.1 High-Level Design

┌─────────────────────────────────────────────────────────────────┐
│                          myfind                                  │
│                                                                  │
│  ┌─────────────────┐                                            │
│  │  Parse Args     │ ← --tree, --du, --name, --size, etc.       │
│  └────────┬────────┘                                            │
│           │                                                      │
│           ▼                                                      │
│  ┌─────────────────┐                                            │
│  │  Initialize     │ ← Create visited set for cycle detection   │
│  │  State          │ ← Set up filters, output mode              │
│  └────────┬────────┘                                            │
│           │                                                      │
│           ▼                                                      │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │              Recursive Walker                            │    │
│  │                                                          │    │
│  │  ┌─────────────────┐                                    │    │
│  │  │ Check depth     │ ← Respect --maxdepth               │    │
│  │  └────────┬────────┘                                    │    │
│  │           │                                              │    │
│  │           ▼                                              │    │
│  │  ┌─────────────────┐                                    │    │
│  │  │ Check cycle     │ ← Is (dev,ino) in visited set?     │    │
│  │  └────────┬────────┘                                    │    │
│  │           │ Not a cycle                                  │    │
│  │           ▼                                              │    │
│  │  ┌─────────────────┐                                    │    │
│  │  │ opendir()       │                                    │    │
│  │  └────────┬────────┘                                    │    │
│  │           │                                              │    │
│  │           ▼                                              │    │
│  │  ┌─────────────────────────────────────────────────┐    │    │
│  │  │            For each entry (readdir)             │    │    │
│  │  │                                                  │    │    │
│  │  │  ┌─────────────────┐                            │    │    │
│  │  │  │ Skip . and ..   │                            │    │    │
│  │  │  └────────┬────────┘                            │    │    │
│  │  │           │                                      │    │    │
│  │  │           ▼                                      │    │    │
│  │  │  ┌─────────────────┐                            │    │    │
│  │  │  │ Build path      │                            │    │    │
│  │  │  └────────┬────────┘                            │    │    │
│  │  │           │                                      │    │    │
│  │  │           ▼                                      │    │    │
│  │  │  ┌─────────────────┐                            │    │    │
│  │  │  │ lstat()         │                            │    │    │
│  │  │  └────────┬────────┘                            │    │    │
│  │  │           │                                      │    │    │
│  │  │           ▼                                      │    │    │
│  │  │  ┌─────────────────┐                            │    │    │
│  │  │  │ Apply filters   │ ← name, size, type         │    │    │
│  │  │  └────────┬────────┘                            │    │    │
│  │  │           │ Matches                              │    │    │
│  │  │           ▼                                      │    │    │
│  │  │  ┌─────────────────┐                            │    │    │
│  │  │  │ Output/Action   │ ← print, accumulate size   │    │    │
│  │  │  └────────┬────────┘                            │    │    │
│  │  │           │                                      │    │    │
│  │  │           ▼                                      │    │    │
│  │  │  ┌─────────────────┐         ┌────────────────┐ │    │    │
│  │  │  │ Is directory?   │──Yes───▶│ Recurse        │ │    │    │
│  │  │  └─────────────────┘         └────────────────┘ │    │    │
│  │  │                                                  │    │    │
│  │  └──────────────────────────────────────────────────┘    │    │
│  │                                                          │    │
│  │  ┌─────────────────┐                                    │    │
│  │  │ closedir()      │                                    │    │
│  │  └─────────────────┘                                    │    │
│  │                                                          │    │
│  └──────────────────────────────────────────────────────────┘    │
│                                                                  │
│  ┌─────────────────┐                                            │
│  │  Print Summary  │ ← Total files, dirs, size                  │
│  └─────────────────┘                                            │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘

4.2 Key Components

Component	Purpose	Key Functions
Argument Parser	Parse command line options	getopt_long()
Cycle Detector	Track visited (dev,ino) pairs	Hash set operations
Path Builder	Safely construct paths	snprintf(), realpath()
Walker	Recursive directory traversal	opendir(), readdir(), closedir()
Filter Engine	Apply name/size/type filters	fnmatch(), stat comparisons
Output Formatter	Tree, du, or plain output	printf with formatting
Stats Tracker	Count files, dirs, accumulate sizes	Running totals

4.3 Data Structures

// Configuration from command line
typedef struct {
    const char *start_path;

    // Mode (mutually exclusive)
    int tree_mode;           // --tree
    int du_mode;             // --du
    int find_mode;           // (default or with filters)

    // Filters
    const char *name_pattern;  // --name "*.c"
    off_t min_size;            // --size +1M (positive = minimum)
    off_t max_size;            // --size -1M (negative = maximum)
    int file_type;             // --type f/d/l

    // Options
    int follow_symlinks;       // -L
    int max_depth;             // --maxdepth N, -1 = unlimited
    const char *exec_cmd;      // --exec CMD
    int verbose;               // -v
} config_t;

// Traversal state
typedef struct {
    // Cycle detection
    struct {
        dev_t dev;
        ino_t ino;
    } *visited;
    size_t visited_count;
    size_t visited_capacity;

    // Statistics
    size_t file_count;
    size_t dir_count;
    size_t symlink_count;
    off_t total_size;

    // For du mode: track seen inodes (hard link dedup)
    struct {
        dev_t dev;
        ino_t ino;
    } *seen_inodes;
    size_t seen_count;
    size_t seen_capacity;

    // Current depth
    int current_depth;
} walker_state_t;

// Directory entry info
typedef struct {
    char path[PATH_MAX];
    struct stat st;
    int stat_failed;       // 1 if stat() failed
    int is_cycle;          // 1 if this is a cycle
} entry_info_t;

4.4 Algorithm Overview

FUNCTION walk_directory(path, depth, state, config):
    // Check depth limit
    IF depth > config.max_depth AND config.max_depth >= 0:
        RETURN

    // lstat the directory itself
    IF lstat(path, &st) == -1:
        PRINT error
        RETURN

    // Check for cycle (for directories only)
    IF is_directory(st.st_mode):
        IF (st.st_dev, st.st_ino) IN state.visited:
            IF config.verbose:
                PRINT "Skipping cycle: " + path
            RETURN
        ADD (st.st_dev, st.st_ino) TO state.visited

    // Open directory
    dir = opendir(path)
    IF dir == NULL:
        PRINT error
        RETURN

    // Read entries
    WHILE (entry = readdir(dir)) != NULL:
        IF entry.d_name == "." OR entry.d_name == "..":
            CONTINUE

        // Build full path
        full_path = path + "/" + entry.d_name
        IF path_too_long:
            PRINT warning
            CONTINUE

        // Get entry info
        IF lstat(full_path, &entry_st) == -1:
            IF errno == ENOENT:
                // File disappeared (TOCTOU race), skip silently
                CONTINUE
            PRINT error
            CONTINUE

        // Apply filters
        IF NOT matches_filters(full_path, entry_st, config):
            // Still recurse into directories even if they don't match
            IF is_directory(entry_st.st_mode):
                walk_directory(full_path, depth + 1, state, config)
            CONTINUE

        // Process matching entry
        process_entry(full_path, entry_st, depth, state, config)

        // Recurse into directories
        IF is_directory(entry_st.st_mode):
            IF NOT is_symlink(entry_st.st_mode) OR config.follow_symlinks:
                walk_directory(full_path, depth + 1, state, config)

    closedir(dir)


FUNCTION matches_filters(path, st, config):
    // Name filter
    IF config.name_pattern != NULL:
        IF NOT fnmatch(config.name_pattern, basename(path)):
            RETURN FALSE

    // Size filter
    IF config.min_size > 0 AND st.st_size < config.min_size:
        RETURN FALSE
    IF config.max_size > 0 AND st.st_size > config.max_size:
        RETURN FALSE

    // Type filter
    IF config.file_type != 0:
        IF config.file_type == 'f' AND NOT is_regular(st.st_mode):
            RETURN FALSE
        IF config.file_type == 'd' AND NOT is_directory(st.st_mode):
            RETURN FALSE
        IF config.file_type == 'l' AND NOT is_symlink(st.st_mode):
            RETURN FALSE

    RETURN TRUE

5. Implementation Guide

5.1 Development Environment Setup

# Create project directory
$ mkdir -p ~/projects/myfind
$ cd ~/projects/myfind

# Create test directory structure
$ mkdir -p testdir/subdir1/subsubdir
$ mkdir -p testdir/subdir2
$ touch testdir/file1.txt testdir/file2.c
$ touch testdir/subdir1/file3.h
$ touch testdir/subdir1/subsubdir/file4.cpp
$ touch testdir/subdir2/file5.py

# Create symlink cycle for testing
$ ln -s .. testdir/subdir1/parent_link

# Create various file types
$ mkfifo testdir/mypipe
$ ln -s file1.txt testdir/link_to_file1

# Verify structure
$ find testdir -type f -o -type d -o -type l | sort

5.2 Project Structure

myfind/
├── myfind.c              # Main source file
├── Makefile              # Build configuration
├── testdir/              # Test directory structure
│   ├── file1.txt
│   ├── file2.c
│   ├── link_to_file1 -> file1.txt
│   ├── mypipe            # FIFO
│   ├── subdir1/
│   │   ├── file3.h
│   │   ├── parent_link -> ..  # Cycle!
│   │   └── subsubdir/
│   │       └── file4.cpp
│   └── subdir2/
│       └── file5.py
└── README.md

5.3 The Core Question You’re Answering

“How do you traverse an entire filesystem efficiently while handling all the edge cases (permissions, symlinks, mount points)?”

This question underlies every backup tool, search utility, and file synchronizer ever written. The naive approach (recursion with string concatenation) fails on deep trees and cyclic links.

5.4 Concepts You Must Understand First

Stop and research these before coding:

Directory as a File
- A directory is a file containing name→inode mappings
- What does opendir/readdir/closedir return?
- Book Reference: “APUE” Ch. 4.21-4.22
Path Construction
- How to join directory path with filename?
- Maximum path length (PATH_MAX)
- Book Reference: “APUE” Ch. 4.14
Symbolic Link Handling
- When to follow vs. when to skip?
- How to detect cycles?
- lstat() vs stat() for link detection
nftw() vs Manual Recursion
- What does nftw() provide?
- When would you prefer manual control?
- Book Reference: “APUE” Ch. 4.22

5.5 Questions to Guide Your Design

Before implementing, think through these:

Recursion Strategy
- Use actual recursion or maintain a work queue?
- How deep can the call stack go? (Default stack ~8MB, each frame ~4KB = ~2000 levels)
- For extreme depth, consider iterative with explicit stack
Memory Management
- How to handle paths—static buffer or malloc?
- How to track visited inodes for cycle detection?
- How to handle very large directories (millions of entries)?
Error Handling
- What if you can’t read a directory (permission)?
- What if a file disappears mid-traversal (TOCTOU)?
- Should errors stop traversal or just be reported?

5.6 Thinking Exercise

Model the Directory Structure

Consider this filesystem:

/home/user/
├── file1.txt (inode 100)
├── subdir/ (inode 101)
│   └── file2.txt (inode 102)
├── link1 -> subdir (symlink, inode 103)
└── hardlink.txt (inode 100, same as file1.txt!)

What happens if you:
1. Traverse without following symlinks?
   - Visit: /home/user, subdir, file2.txt
   - Skip link1 (it's a symlink)
   - Total: 3 entries

2. Follow symlinks but don't track inodes?
   - Visit: /home/user, subdir, file2.txt
   - Follow link1 -> enter subdir AGAIN
   - Visit file2.txt AGAIN
   - Could loop forever if symlink points to ancestor!

3. Count bytes for hardlink.txt—double count or not?
   - file1.txt and hardlink.txt share inode 100
   - du counts each inode only once
   - Track (dev, ino) pairs to avoid double-counting

Trace this mentally before coding!

5.7 Hints in Layers

Hint 1: Basic Structure Open directory with opendir(), loop with readdir(), skip “.” and “..”, recurse on directories, close with closedir().

void walk(const char *path, int depth) {
    DIR *dir = opendir(path);
    if (!dir) { perror(path); return; }

    struct dirent *entry;
    while ((entry = readdir(dir)) != NULL) {
        if (strcmp(entry->d_name, ".") == 0 ||
            strcmp(entry->d_name, "..") == 0)
            continue;

        // Build full path and process...
    }
    closedir(dir);
}

Hint 2: Path Building

// Safe path construction
char full_path[PATH_MAX];
int len = snprintf(full_path, sizeof(full_path), "%s/%s",
                   dir_path, entry->d_name);
if (len >= sizeof(full_path)) {
    fprintf(stderr, "Path too long: %s/%s\n", dir_path, entry->d_name);
    continue;  // Skip this entry
}

Hint 3: Using nftw()

#define _XOPEN_SOURCE 500  // Required for nftw
#include <ftw.h>

int callback(const char *fpath, const struct stat *sb,
             int typeflag, struct FTW *ftwbuf) {
    // typeflag: FTW_F (file), FTW_D (directory pre),
    //           FTW_DP (directory post), FTW_SL (symlink)
    // ftwbuf->level: depth in tree
    // ftwbuf->base: index where basename starts

    printf("%*s%s\n", ftwbuf->level * 2, "", fpath + ftwbuf->base);
    return 0;  // Continue walking
}

int main(int argc, char *argv[]) {
    // FTW_PHYS: Don't follow symlinks
    // 20: max file descriptors to use
    nftw(argv[1], callback, 20, FTW_PHYS);
    return 0;
}

Hint 4: Cycle Detection Track (device, inode) pairs in a hash set. Before entering a directory, check if already visited.

typedef struct {
    dev_t dev;
    ino_t ino;
} dir_id_t;

int is_visited(dir_id_t *visited, size_t count, dev_t dev, ino_t ino) {
    for (size_t i = 0; i < count; i++) {
        if (visited[i].dev == dev && visited[i].ino == ino)
            return 1;
    }
    return 0;
}

5.8 The Interview Questions They’ll Ask

“How would you find all files modified in the last 24 hours?”
- Use stat() to get st_mtime
- Compare to current time minus 24 hours
- find -mtime -1 equivalent
“How does du avoid counting hardlinked files twice?”
- Tracks (dev, ino) pairs of all files seen
- Only counts size for first occurrence of each inode
- This is why du can be slower than ls
“What’s the difference between depth-first and breadth-first traversal for filesystems?”
- DFS: Process subdirectories before siblings (typical)
- BFS: Process all entries at one level before going deeper
- DFS uses less memory, BFS better for finding nearby files
“How would you make a parallel directory walker?”
- Producer-consumer with directory queue
- Multiple worker threads calling readdir
- Must handle shared visited set
- Consider io_uring for modern async I/O
“What is a TOCTOU race in filesystem operations?”
- Time-Of-Check-Time-Of-Use
- File can change between stat() and open()
- Or disappear between readdir() and stat()
- Handle with retry or openat() for atomic operations

5.9 Books That Will Help

Topic	Book	Chapter
Directory operations	“APUE” by Stevens	Ch. 4.21-4.22
nftw() details	“The Linux Programming Interface”	Ch. 18
Filesystem internals	“Understanding the Linux Kernel”	Ch. 12
Pattern matching	“The Linux Programming Interface”	Ch. 18.8

5.10 Implementation Phases

Phase 1: Basic Walk (3-4 hours)

Walk a single directory
Print all entries
Handle “.” and “..” correctly

Phase 2: Recursion (2-3 hours)

Add recursion for subdirectories
Track depth
Handle max depth option

Phase 3: Cycle Detection (2-3 hours)

Implement visited set
Test with symlink cycles
Add verbose output for skipped cycles

Phase 4: Tree Output (2-3 hours)

Format output with tree characters (├── └──)
Track last-entry state for correct formatting
Add file count summary

Phase 5: Filters (3-4 hours)

Add name pattern matching with fnmatch()
Add size filters
Add type filters

Phase 6: Du Mode (2-3 hours)

Accumulate sizes
Track inodes for hard link deduplication
Human-readable size output

5.11 Key Implementation Decisions

Decision	Trade-offs
Recursion vs iteration	Recursion is cleaner; iteration handles extreme depth
nftw() vs manual	nftw() is portable; manual gives control
Hash table vs array for visited	Hash is O(1); array is simpler but O(n)
Static vs dynamic path buffer	Static is simpler; dynamic handles any length
Stop vs continue on error	Continue is user-friendly; stop may be safer

6. Testing Strategy

6.1 Unit Tests

Test	Input	Expected Result
Skip . and ..	readdir loop	Only real entries
Path building	“/foo” + “bar”	“/foo/bar”
Path too long	near PATH_MAX	Warning, skip
fnmatch “*.c”	“test.c”	Match
fnmatch “*.c”	“test.h”	No match
Size filter +1M	2MB file	Match
Size filter +1M	500KB file	No match

6.2 Integration Tests

# Test basic tree
$ ./myfind --tree testdir
# Should show proper tree structure

# Test cycle handling
$ ln -s .. testdir/subdir1/cycle
$ ./myfind --tree testdir
# Should not hang, should note cycle

# Test name filter
$ ./myfind --name "*.c" testdir
# Should only show.c files

# Test du mode
$ ./myfind --du testdir
$ du -sh testdir
# Sizes should match

# Compare with find
$ ./myfind --name "*.h" /usr/include | wc -l
$ find /usr/include -name "*.h" | wc -l
# Counts should match

6.3 Edge Cases to Test

Case	Setup	Expected Behavior
Empty directory	mkdir empty	Shows directory, no entries
Permission denied	chmod 000 dir	Error message, continue
Symlink cycle	ln -s . loop	Detect, skip with message
Dangling symlink	ln -s /nonexistent link	Report link, note dangling
Very deep tree	1000+ levels	Handle without stack overflow
Huge directory	100k+ files	Complete in reasonable time
Hard links	Multiple links to same file	Count once for du
Special files	FIFO, socket	Handle correctly

6.4 Verification Commands

# Test for memory leaks
$ valgrind --leak-check=full ./myfind --tree /usr/include

# Compare output with standard tools
$ diff <(./myfind --name "*.c" src | sort) <(find src -name "*.c" | sort)

# Performance comparison
$ time ./myfind --tree /usr >/dev/null
$ time tree /usr >/dev/null

# Test with strace
$ strace -c ./myfind --tree /home
# Check syscall counts are reasonable

7. Common Pitfalls & Debugging

Problem 1: “Crashes on deep directory trees”

Why: Stack overflow from recursion
Fix: Use iterative approach with explicit stack, or increase stack size

Problem 2: “Infinite loop with symlinks”

Why: Following symlinks without cycle detection
Fix: Use FTW_PHYS flag with nftw(), or track visited inodes

Problem 3: “Wrong sizes reported (du mode)”

Why: Not using lstat() for symlinks, or double-counting hardlinks
Fix: Use lstat(), track (dev, ino) pairs for hardlink dedup

Problem 4: “Buffer overflow in path construction”

Why: Using strcpy/strcat without bounds checking
Fix: Use snprintf() with proper size checks

Problem 5: “Missing files in output”

Why: TOCTOU race—file disappeared between readdir and stat
Fix: Handle ENOENT from stat() gracefully

Problem 6: “Permission denied breaks entire walk”

Why: Error handling stops traversal
Fix: Report error and continue to next entry

8. Extensions & Challenges

8.1 Easy Extensions

Extension	Description	Learning
–mtime filter	Find by modification time	Time comparisons
–perm filter	Find by permissions	Bitmask operations
–empty	Find empty files/dirs	Size and link count
–prune	Skip directories matching pattern	Traversal control

8.2 Advanced Challenges

Challenge	Description	Learning
Parallel walk	Multi-threaded traversal	Concurrency
Content search	grep -r equivalent	File reading + traversal
Watch mode	Continuous monitoring	inotify integration
Network filesystems	Handle NFS/CIFS	Timeout, error handling
Persistent state	Resume interrupted walk	Checkpointing

8.3 Research Topics

How does locate achieve instant search? (mlocate database)
How does git efficiently check status of large repos?
What is fanotify and how does it differ from inotify?
How do backup tools handle millions of files efficiently?

9. Real-World Connections

9.1 Production Systems Using This

System	How It Uses Directory Walking	Notable Feature
find	Recursive traversal with predicates	Expression language
rsync	Compare source/dest directories	Incremental sync
tar	Archive directory trees	Preserve metadata
du	Calculate disk usage	Hard link tracking
git	Index working directory	Stat cache
webpack	Watch source files	File change detection

9.2 How the Pros Do It

GNU find:

Uses nftw() internally on most systems
Implements full predicate language (-name, -type, -exec, etc.)
Handles -prune for skipping directories

rsync:

Builds file lists from both source and destination
Compares metadata to find changes
Optimized for network transfer

fd (modern find replacement):

Uses parallel traversal
Respects .gitignore by default
Written in Rust for performance

9.3 Reading the Source

# View GNU findutils source
$ git clone https://git.savannah.gnu.org/git/findutils.git
$ less findutils/find/ftsfind.c

# View tree source
$ apt source tree
$ less tree-*/tree.c

# View fd source (Rust)
$ git clone https://github.com/sharkdp/fd
$ less fd/src/walk.rs

10. Resources

10.1 Man Pages

$ man 3 opendir     # Open directory stream
$ man 3 readdir     # Read directory entry
$ man 3 closedir    # Close directory stream
$ man 3 nftw        # File tree walk
$ man 3 fnmatch     # Filename matching
$ man 2 stat        # Get file status
$ man 2 lstat       # Get symlink status
$ man 7 path_resolution  # How paths are resolved

10.2 Online Resources

10.3 Book Chapters

Book	Chapters	Topics Covered
“APUE” by Stevens	Ch. 4.21-4.22	Directory Reading, File Tree Walking
“TLPI” by Kerrisk	Ch. 18	Directories and Links
“Linux System Programming” by Love	Ch. 7	File and Directory Management

11. Self-Assessment Checklist

Before considering this project complete, verify:

12. Submission / Completion Criteria

Your project is complete when:

Functionality
- --tree mode displays directory structure correctly
- --du mode matches system du output
- --name filter works with glob patterns
- Cycle detection prevents infinite loops
Quality
- Compiles with gcc -Wall -Wextra -Werror with no warnings
- Zero valgrind errors
- Handles all error conditions gracefully
Testing
- Works on /usr/include (large directory)
- Handles permission denied gracefully
- Detects and reports symlink cycles
- Hard links counted correctly
Understanding
- Can explain directory structure internals
- Can explain cycle detection algorithm
- Understands nftw() vs manual walking trade-offs

Next Steps

After completing this project, you’re ready for:

Project 4: Shell Implementation - Use directory walking for command completion
Project 5: Process Monitor - Walk /proc filesystem
Project 10: File Watcher - Combine walking with inotify

The directory traversal patterns you’ve learned here appear in every tool that deals with the filesystem.