Project 3: Recursive Directory Walker
Build a recursive directory traversal tool that walks entire directory trees, reporting files, computing sizes, and optionally filtering by criteria (name pattern, size, type).
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Level 3 - Advanced |
| Time Estimate | 20-40 hours |
| Language | C (primary), Rust/Go (alternatives) |
| Prerequisites | Projects 1-2, opendir/readdir, recursion |
| Key Topics | Directory Operations, Recursion, Path Construction, Cycle Detection |
1. Learning Objectives
After completing this project, you will:
- Master opendir(), readdir(), and closedir() for directory operations
- Understand how directories are structured as files containing name→inode mappings
- Handle path construction safely without buffer overflows
- Detect and handle symbolic link cycles to prevent infinite loops
- Implement depth-first vs breadth-first traversal strategies
- Use nftw() for standardized directory walking (and understand when to roll your own)
- Implement file filtering with fnmatch() for glob patterns
- Calculate directory sizes like
duwhile handling hard link deduplication - Handle permissions errors and disappearing files gracefully (TOCTOU issues)
2. Theoretical Foundation
2.1 Core Concepts
Directories as Files
In UNIX, a directory is a special file containing a list of (name, inode) pairs:
Directory File Content (Conceptual View)
┌────────────────────────────────────────────────────────────────┐
│ /home/user/project │
│ │
│ ┌─────────────────────┬──────────────────────────────────┐ │
│ │ Entry Name │ Inode Number │ │
│ ├─────────────────────┼──────────────────────────────────┤ │
│ │ . │ 12345 (this directory) │ │
│ │ .. │ 12340 (parent directory) │ │
│ │ README.md │ 12346 │ │
│ │ src │ 12350 (subdirectory) │ │
│ │ Makefile │ 12347 │ │
│ │ build │ 12360 (subdirectory) │ │
│ └─────────────────────┴──────────────────────────────────┘ │
│ │
└────────────────────────────────────────────────────────────────┘
Key Points:
- "." always points to self
- ".." always points to parent
- Entries are NOT sorted (order depends on creation/deletion history)
- Names are just labels; the inode holds all metadata
Directory Traversal Flow
readdir() Loop Pattern
┌─────────────────────────────────────────────────────────────────┐
│ │
│ dir = opendir("/path/to/dir") │
│ │ │
│ ▼ │
│ ┌────────────────────┐ │
│ │ readdir(dir) ◄─────┼─────────────────────────────────────┐ │
│ └─────────┬──────────┘ │ │
│ │ │ │
│ ▼ │ │
│ ┌────────────────────┐ │ │
│ │ entry == NULL? │──Yes──▶ closedir(dir), DONE │ │
│ └─────────┬──────────┘ │ │
│ │ No │ │
│ ▼ │ │
│ ┌────────────────────┐ │ │
│ │ Skip "." and ".." │ │ │
│ └─────────┬──────────┘ │ │
│ │ │ │
│ ▼ │ │
│ ┌────────────────────┐ │ │
│ │ Build full path: │ │ │
│ │ dir + "/" + name │ │ │
│ └─────────┬──────────┘ │ │
│ │ │ │
│ ▼ │ │
│ ┌────────────────────┐ │ │
│ │ lstat(full_path) │ │ │
│ └─────────┬──────────┘ │ │
│ │ │ │
│ ▼ │ │
│ ┌────────────────────┐ ┌───────────────────────┐ │ │
│ │ Is directory? │──Yes───▶│ Recurse into subdir │───┘ │
│ └─────────┬──────────┘ └───────────────────────┘ │
│ │ No │
│ ▼ │
│ ┌────────────────────┐ │
│ │ Process file │──────────────────────────────────────┶ │
│ │ (print, count, etc)│ │
│ └────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
Symbolic Link Cycles
Symlinks can create infinite loops that naive traversal will never escape:
Filesystem with Cycle:
/home/user/
├── project/
│ ├── src/
│ │ └── link -> ../.. ← Points back to /home/user!
│ └── data/
└── docs/
Traversal without cycle detection:
1. Enter /home/user/project
2. Enter /home/user/project/src
3. Follow link -> /home/user
4. Enter /home/user/project (AGAIN!)
5. Enter /home/user/project/src
6. Follow link -> /home/user
7. ... INFINITE LOOP ...
Detection Strategy:
Track (device, inode) pairs of directories visited.
Before entering a directory, check if (dev, ino) already seen.
┌─────────────────────────────────────────────────────────┐
│ Visited Set (Hash Table) │
│ │
│ Key: (dev, ino) Seen? │
│ ───────────────────────────── │
│ (8, 12345) YES ← /home/user │
│ (8, 12350) YES ← /home/user/project │
│ (8, 12351) YES ← /home/user/project/src │
│ │
│ When link resolves to (8, 12345), we skip it! │
└─────────────────────────────────────────────────────────┘
Path Construction Safety
Building paths by concatenation is error-prone:
WRONG: Potential buffer overflow
char path[100];
strcpy(path, dir); // What if dir is 99 chars?
strcat(path, "/"); // Now 100 chars
strcat(path, entry); // OVERFLOW if entry has any length!
RIGHT: Safe with bounds checking
char path[PATH_MAX];
int len = snprintf(path, sizeof(path), "%s/%s", dir, entry->d_name);
if (len >= sizeof(path)) {
// Path too long! Handle error
fprintf(stderr, "Path too long: %s/%s\n", dir, entry->d_name);
continue; // Skip this entry
}
RIGHT: Using realpath() for canonical paths
char resolved[PATH_MAX];
if (realpath(path, resolved) == NULL) {
// Path doesn't exist or other error
perror(path);
}
nftw() vs Manual Recursion
The nftw() function provides standardized directory walking:
// nftw callback signature
int callback(const char *fpath, const struct stat *sb,
int typeflag, struct FTW *ftwbuf) {
// fpath: full path to file
// sb: stat structure (already called for you!)
// typeflag: FTW_F (file), FTW_D (directory), FTW_SL (symlink), etc.
// ftwbuf->level: depth in tree (0 = starting directory)
// ftwbuf->base: index in fpath where basename starts
printf("Level %d: %s\n", ftwbuf->level, fpath);
return 0; // 0 to continue, non-zero to stop
}
// Usage
nftw("/path", callback, 20, FTW_PHYS);
// 20 = max file descriptors to use
// FTW_PHYS = don't follow symlinks (like lstat)
When to use nftw():
- Simple traversal with standard behavior
- Portability is important
- Don’t need fine-grained control
When to roll your own:
- Need custom cycle detection
- Need to modify traversal order
- Need parallel traversal
- Building interactive tool (need to pause/resume)
2.2 Why This Matters
Directory traversal is fundamental to countless tools:
- Backup software: Walk directories to find files to backup
- Search tools: find, locate, grep -r
- Build systems: make, cmake (find source files)
- Sync tools: rsync, dropbox
- Disk usage: du, ncdu
- Security scanning: Find setuid files, check permissions
- Code editors: File tree views, project search
Real-world scale:
- A typical Linux system has 200,000+ files
- Large source trees (Linux kernel) have 70,000+ files
- Backup software must handle millions of files efficiently
2.3 Historical Context
The opendir/readdir API dates to the original UNIX but was standardized in POSIX.1. The key innovation was treating directories as files that you open and read entries from, rather than a special API.
The nftw() function (new file tree walk) replaced older ftw() to handle symbolic links properly. The FTW_PHYS flag was added specifically because symlink handling was a major source of bugs and security issues.
2.4 Common Misconceptions
Misconception 1: “readdir() returns entries in sorted order”
- Reality: Order depends on filesystem and creation/deletion history. Sort yourself if needed.
Misconception 2: “d_type in dirent is always valid”
- Reality: Some filesystems don’t support d_type (returns DT_UNKNOWN). Always stat() if you need the type.
Misconception 3: “Symlinks to directories should always be followed”
- Reality: Following symlinks can cause loops and security issues. Default to not following.
Misconception 4: “PATH_MAX is the absolute maximum path length”
- Reality: Some filesystems and operations can exceed PATH_MAX. Handle long paths gracefully.
Misconception 5: “File won’t disappear between readdir and stat”
- Reality: This is a TOCTOU (time-of-check-time-of-use) race. Handle ENOENT from stat().
3. Project Specification
3.1 What You Will Build
A directory traversal tool called myfind that combines features of tree, du, and find:
- Tree mode: Display directory structure visually
- Disk usage mode: Calculate sizes like
du - Search mode: Find files matching patterns
- All modes with proper cycle detection and error handling
3.2 Functional Requirements
- Tree display:
myfind --tree DIRshows directory structure - Disk usage:
myfind --du DIRshows sizes likedu - Name search:
myfind --name "PATTERN" DIRfinds matching files - Size filter:
myfind --size +1M DIRfinds files larger than 1MB - Type filter:
myfind --type f|d|l DIRfilters by file type - Symlink handling: Default don’t follow;
-Lto follow - Max depth:
--maxdepth Nlimits recursion depth - Execute action:
--exec CMD {}runs command on matches
3.3 Non-Functional Requirements
- Cycle safety: Must detect and skip symbolic link cycles
- Memory efficiency: Handle directories with millions of files
- Error resilience: Continue on permission denied, missing files
- Performance: Comparable to standard tools on same workload
- Correct sizes: Handle hard links correctly for du mode
3.4 Example Usage / Output
# 1. Basic tree display
$ ./myfind --tree /usr/include
/usr/include
├── assert.h
├── complex.h
├── errno.h
├── linux/
│ ├── capability.h
│ ├── fs.h
│ └── types.h
├── sys/
│ ├── socket.h
│ ├── stat.h
│ └── types.h
└── unistd.h
4 directories, 10 files
# 2. Disk usage mode
$ ./myfind --du /var/log
12K /var/log/apt
4K /var/log/apt/history.log
8K /var/log/apt/term.log
256K /var/log/journal
1.2M /var/log
Total: 1.2M in 23 files, 5 directories
# 3. Find files matching pattern
$ ./myfind --name "*.h" /usr/include | head -5
/usr/include/assert.h
/usr/include/complex.h
/usr/include/errno.h
/usr/include/features.h
/usr/include/float.h
# 4. Find files larger than 1MB
$ ./myfind --size +1M /usr
/usr/lib/libc.so.6
/usr/lib/libLLVM-14.so.1
/usr/bin/clang
...
# 5. Handle symbolic link cycle gracefully
$ ln -s . loop
$ ./myfind --tree .
.
├── file.txt
└── loop -> . (symlink, skipping to avoid cycle)
# 6. Find with type filter
$ ./myfind --type d /etc | head -5
/etc
/etc/apt
/etc/apt/apt.conf.d
/etc/apt/preferences.d
/etc/alternatives
# 7. Execute command on matches
$ ./myfind --name "*.log" --exec "wc -l {}" /var/log
42 /var/log/syslog
127 /var/log/auth.log
...
3.5 Real World Outcome
What success looks like:
- Tree display: Properly indented, shows structure clearly
- Accurate sizes: Matches
duoutput (accounting for hard links) - Pattern matching: Works with *, ?, [abc] patterns
- Cycle safety: Never hangs on symlink loops
- Error handling: Continues gracefully on permission denied
- Performance: Handles /usr (100k+ files) in seconds
4. Solution Architecture
4.1 High-Level Design
┌─────────────────────────────────────────────────────────────────┐
│ myfind │
│ │
│ ┌─────────────────┐ │
│ │ Parse Args │ ← --tree, --du, --name, --size, etc. │
│ └────────┬────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────┐ │
│ │ Initialize │ ← Create visited set for cycle detection │
│ │ State │ ← Set up filters, output mode │
│ └────────┬────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Recursive Walker │ │
│ │ │ │
│ │ ┌─────────────────┐ │ │
│ │ │ Check depth │ ← Respect --maxdepth │ │
│ │ └────────┬────────┘ │ │
│ │ │ │ │
│ │ ▼ │ │
│ │ ┌─────────────────┐ │ │
│ │ │ Check cycle │ ← Is (dev,ino) in visited set? │ │
│ │ └────────┬────────┘ │ │
│ │ │ Not a cycle │ │
│ │ ▼ │ │
│ │ ┌─────────────────┐ │ │
│ │ │ opendir() │ │ │
│ │ └────────┬────────┘ │ │
│ │ │ │ │
│ │ ▼ │ │
│ │ ┌─────────────────────────────────────────────────┐ │ │
│ │ │ For each entry (readdir) │ │ │
│ │ │ │ │ │
│ │ │ ┌─────────────────┐ │ │ │
│ │ │ │ Skip . and .. │ │ │ │
│ │ │ └────────┬────────┘ │ │ │
│ │ │ │ │ │ │
│ │ │ ▼ │ │ │
│ │ │ ┌─────────────────┐ │ │ │
│ │ │ │ Build path │ │ │ │
│ │ │ └────────┬────────┘ │ │ │
│ │ │ │ │ │ │
│ │ │ ▼ │ │ │
│ │ │ ┌─────────────────┐ │ │ │
│ │ │ │ lstat() │ │ │ │
│ │ │ └────────┬────────┘ │ │ │
│ │ │ │ │ │ │
│ │ │ ▼ │ │ │
│ │ │ ┌─────────────────┐ │ │ │
│ │ │ │ Apply filters │ ← name, size, type │ │ │
│ │ │ └────────┬────────┘ │ │ │
│ │ │ │ Matches │ │ │
│ │ │ ▼ │ │ │
│ │ │ ┌─────────────────┐ │ │ │
│ │ │ │ Output/Action │ ← print, accumulate size │ │ │
│ │ │ └────────┬────────┘ │ │ │
│ │ │ │ │ │ │
│ │ │ ▼ │ │ │
│ │ │ ┌─────────────────┐ ┌────────────────┐ │ │ │
│ │ │ │ Is directory? │──Yes───▶│ Recurse │ │ │ │
│ │ │ └─────────────────┘ └────────────────┘ │ │ │
│ │ │ │ │ │
│ │ └──────────────────────────────────────────────────┘ │ │
│ │ │ │
│ │ ┌─────────────────┐ │ │
│ │ │ closedir() │ │ │
│ │ └─────────────────┘ │ │
│ │ │ │
│ └──────────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────┐ │
│ │ Print Summary │ ← Total files, dirs, size │
│ └─────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
4.2 Key Components
| Component | Purpose | Key Functions |
|---|---|---|
| Argument Parser | Parse command line options | getopt_long() |
| Cycle Detector | Track visited (dev,ino) pairs | Hash set operations |
| Path Builder | Safely construct paths | snprintf(), realpath() |
| Walker | Recursive directory traversal | opendir(), readdir(), closedir() |
| Filter Engine | Apply name/size/type filters | fnmatch(), stat comparisons |
| Output Formatter | Tree, du, or plain output | printf with formatting |
| Stats Tracker | Count files, dirs, accumulate sizes | Running totals |
4.3 Data Structures
// Configuration from command line
typedef struct {
const char *start_path;
// Mode (mutually exclusive)
int tree_mode; // --tree
int du_mode; // --du
int find_mode; // (default or with filters)
// Filters
const char *name_pattern; // --name "*.c"
off_t min_size; // --size +1M (positive = minimum)
off_t max_size; // --size -1M (negative = maximum)
int file_type; // --type f/d/l
// Options
int follow_symlinks; // -L
int max_depth; // --maxdepth N, -1 = unlimited
const char *exec_cmd; // --exec CMD
int verbose; // -v
} config_t;
// Traversal state
typedef struct {
// Cycle detection
struct {
dev_t dev;
ino_t ino;
} *visited;
size_t visited_count;
size_t visited_capacity;
// Statistics
size_t file_count;
size_t dir_count;
size_t symlink_count;
off_t total_size;
// For du mode: track seen inodes (hard link dedup)
struct {
dev_t dev;
ino_t ino;
} *seen_inodes;
size_t seen_count;
size_t seen_capacity;
// Current depth
int current_depth;
} walker_state_t;
// Directory entry info
typedef struct {
char path[PATH_MAX];
struct stat st;
int stat_failed; // 1 if stat() failed
int is_cycle; // 1 if this is a cycle
} entry_info_t;
4.4 Algorithm Overview
FUNCTION walk_directory(path, depth, state, config):
// Check depth limit
IF depth > config.max_depth AND config.max_depth >= 0:
RETURN
// lstat the directory itself
IF lstat(path, &st) == -1:
PRINT error
RETURN
// Check for cycle (for directories only)
IF is_directory(st.st_mode):
IF (st.st_dev, st.st_ino) IN state.visited:
IF config.verbose:
PRINT "Skipping cycle: " + path
RETURN
ADD (st.st_dev, st.st_ino) TO state.visited
// Open directory
dir = opendir(path)
IF dir == NULL:
PRINT error
RETURN
// Read entries
WHILE (entry = readdir(dir)) != NULL:
IF entry.d_name == "." OR entry.d_name == "..":
CONTINUE
// Build full path
full_path = path + "/" + entry.d_name
IF path_too_long:
PRINT warning
CONTINUE
// Get entry info
IF lstat(full_path, &entry_st) == -1:
IF errno == ENOENT:
// File disappeared (TOCTOU race), skip silently
CONTINUE
PRINT error
CONTINUE
// Apply filters
IF NOT matches_filters(full_path, entry_st, config):
// Still recurse into directories even if they don't match
IF is_directory(entry_st.st_mode):
walk_directory(full_path, depth + 1, state, config)
CONTINUE
// Process matching entry
process_entry(full_path, entry_st, depth, state, config)
// Recurse into directories
IF is_directory(entry_st.st_mode):
IF NOT is_symlink(entry_st.st_mode) OR config.follow_symlinks:
walk_directory(full_path, depth + 1, state, config)
closedir(dir)
FUNCTION matches_filters(path, st, config):
// Name filter
IF config.name_pattern != NULL:
IF NOT fnmatch(config.name_pattern, basename(path)):
RETURN FALSE
// Size filter
IF config.min_size > 0 AND st.st_size < config.min_size:
RETURN FALSE
IF config.max_size > 0 AND st.st_size > config.max_size:
RETURN FALSE
// Type filter
IF config.file_type != 0:
IF config.file_type == 'f' AND NOT is_regular(st.st_mode):
RETURN FALSE
IF config.file_type == 'd' AND NOT is_directory(st.st_mode):
RETURN FALSE
IF config.file_type == 'l' AND NOT is_symlink(st.st_mode):
RETURN FALSE
RETURN TRUE
5. Implementation Guide
5.1 Development Environment Setup
# Create project directory
$ mkdir -p ~/projects/myfind
$ cd ~/projects/myfind
# Create test directory structure
$ mkdir -p testdir/subdir1/subsubdir
$ mkdir -p testdir/subdir2
$ touch testdir/file1.txt testdir/file2.c
$ touch testdir/subdir1/file3.h
$ touch testdir/subdir1/subsubdir/file4.cpp
$ touch testdir/subdir2/file5.py
# Create symlink cycle for testing
$ ln -s .. testdir/subdir1/parent_link
# Create various file types
$ mkfifo testdir/mypipe
$ ln -s file1.txt testdir/link_to_file1
# Verify structure
$ find testdir -type f -o -type d -o -type l | sort
5.2 Project Structure
myfind/
├── myfind.c # Main source file
├── Makefile # Build configuration
├── testdir/ # Test directory structure
│ ├── file1.txt
│ ├── file2.c
│ ├── link_to_file1 -> file1.txt
│ ├── mypipe # FIFO
│ ├── subdir1/
│ │ ├── file3.h
│ │ ├── parent_link -> .. # Cycle!
│ │ └── subsubdir/
│ │ └── file4.cpp
│ └── subdir2/
│ └── file5.py
└── README.md
5.3 The Core Question You’re Answering
“How do you traverse an entire filesystem efficiently while handling all the edge cases (permissions, symlinks, mount points)?”
This question underlies every backup tool, search utility, and file synchronizer ever written. The naive approach (recursion with string concatenation) fails on deep trees and cyclic links.
5.4 Concepts You Must Understand First
Stop and research these before coding:
- Directory as a File
- A directory is a file containing name→inode mappings
- What does opendir/readdir/closedir return?
- Book Reference: “APUE” Ch. 4.21-4.22
- Path Construction
- How to join directory path with filename?
- Maximum path length (PATH_MAX)
- Book Reference: “APUE” Ch. 4.14
- Symbolic Link Handling
- When to follow vs. when to skip?
- How to detect cycles?
- lstat() vs stat() for link detection
- nftw() vs Manual Recursion
- What does nftw() provide?
- When would you prefer manual control?
- Book Reference: “APUE” Ch. 4.22
5.5 Questions to Guide Your Design
Before implementing, think through these:
- Recursion Strategy
- Use actual recursion or maintain a work queue?
- How deep can the call stack go? (Default stack ~8MB, each frame ~4KB = ~2000 levels)
- For extreme depth, consider iterative with explicit stack
- Memory Management
- How to handle paths—static buffer or malloc?
- How to track visited inodes for cycle detection?
- How to handle very large directories (millions of entries)?
- Error Handling
- What if you can’t read a directory (permission)?
- What if a file disappears mid-traversal (TOCTOU)?
- Should errors stop traversal or just be reported?
5.6 Thinking Exercise
Model the Directory Structure
Consider this filesystem:
/home/user/
├── file1.txt (inode 100)
├── subdir/ (inode 101)
│ └── file2.txt (inode 102)
├── link1 -> subdir (symlink, inode 103)
└── hardlink.txt (inode 100, same as file1.txt!)
What happens if you:
1. Traverse without following symlinks?
- Visit: /home/user, subdir, file2.txt
- Skip link1 (it's a symlink)
- Total: 3 entries
2. Follow symlinks but don't track inodes?
- Visit: /home/user, subdir, file2.txt
- Follow link1 -> enter subdir AGAIN
- Visit file2.txt AGAIN
- Could loop forever if symlink points to ancestor!
3. Count bytes for hardlink.txt—double count or not?
- file1.txt and hardlink.txt share inode 100
- du counts each inode only once
- Track (dev, ino) pairs to avoid double-counting
Trace this mentally before coding!
5.7 Hints in Layers
Hint 1: Basic Structure
Open directory with opendir(), loop with readdir(), skip “.” and “..”, recurse on directories, close with closedir().
void walk(const char *path, int depth) {
DIR *dir = opendir(path);
if (!dir) { perror(path); return; }
struct dirent *entry;
while ((entry = readdir(dir)) != NULL) {
if (strcmp(entry->d_name, ".") == 0 ||
strcmp(entry->d_name, "..") == 0)
continue;
// Build full path and process...
}
closedir(dir);
}
Hint 2: Path Building
// Safe path construction
char full_path[PATH_MAX];
int len = snprintf(full_path, sizeof(full_path), "%s/%s",
dir_path, entry->d_name);
if (len >= sizeof(full_path)) {
fprintf(stderr, "Path too long: %s/%s\n", dir_path, entry->d_name);
continue; // Skip this entry
}
Hint 3: Using nftw()
#define _XOPEN_SOURCE 500 // Required for nftw
#include <ftw.h>
int callback(const char *fpath, const struct stat *sb,
int typeflag, struct FTW *ftwbuf) {
// typeflag: FTW_F (file), FTW_D (directory pre),
// FTW_DP (directory post), FTW_SL (symlink)
// ftwbuf->level: depth in tree
// ftwbuf->base: index where basename starts
printf("%*s%s\n", ftwbuf->level * 2, "", fpath + ftwbuf->base);
return 0; // Continue walking
}
int main(int argc, char *argv[]) {
// FTW_PHYS: Don't follow symlinks
// 20: max file descriptors to use
nftw(argv[1], callback, 20, FTW_PHYS);
return 0;
}
Hint 4: Cycle Detection Track (device, inode) pairs in a hash set. Before entering a directory, check if already visited.
typedef struct {
dev_t dev;
ino_t ino;
} dir_id_t;
int is_visited(dir_id_t *visited, size_t count, dev_t dev, ino_t ino) {
for (size_t i = 0; i < count; i++) {
if (visited[i].dev == dev && visited[i].ino == ino)
return 1;
}
return 0;
}
5.8 The Interview Questions They’ll Ask
- “How would you find all files modified in the last 24 hours?”
- Use stat() to get st_mtime
- Compare to current time minus 24 hours
find -mtime -1equivalent
- “How does
duavoid counting hardlinked files twice?”- Tracks (dev, ino) pairs of all files seen
- Only counts size for first occurrence of each inode
- This is why du can be slower than ls
- “What’s the difference between depth-first and breadth-first traversal for filesystems?”
- DFS: Process subdirectories before siblings (typical)
- BFS: Process all entries at one level before going deeper
- DFS uses less memory, BFS better for finding nearby files
- “How would you make a parallel directory walker?”
- Producer-consumer with directory queue
- Multiple worker threads calling readdir
- Must handle shared visited set
- Consider io_uring for modern async I/O
- “What is a TOCTOU race in filesystem operations?”
- Time-Of-Check-Time-Of-Use
- File can change between stat() and open()
- Or disappear between readdir() and stat()
- Handle with retry or openat() for atomic operations
5.9 Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Directory operations | “APUE” by Stevens | Ch. 4.21-4.22 |
| nftw() details | “The Linux Programming Interface” | Ch. 18 |
| Filesystem internals | “Understanding the Linux Kernel” | Ch. 12 |
| Pattern matching | “The Linux Programming Interface” | Ch. 18.8 |
5.10 Implementation Phases
Phase 1: Basic Walk (3-4 hours)
- Walk a single directory
- Print all entries
- Handle “.” and “..” correctly
Phase 2: Recursion (2-3 hours)
- Add recursion for subdirectories
- Track depth
- Handle max depth option
Phase 3: Cycle Detection (2-3 hours)
- Implement visited set
- Test with symlink cycles
- Add verbose output for skipped cycles
Phase 4: Tree Output (2-3 hours)
- Format output with tree characters (├── └──)
- Track last-entry state for correct formatting
- Add file count summary
Phase 5: Filters (3-4 hours)
- Add name pattern matching with fnmatch()
- Add size filters
- Add type filters
Phase 6: Du Mode (2-3 hours)
- Accumulate sizes
- Track inodes for hard link deduplication
- Human-readable size output
5.11 Key Implementation Decisions
| Decision | Trade-offs |
|---|---|
| Recursion vs iteration | Recursion is cleaner; iteration handles extreme depth |
| nftw() vs manual | nftw() is portable; manual gives control |
| Hash table vs array for visited | Hash is O(1); array is simpler but O(n) |
| Static vs dynamic path buffer | Static is simpler; dynamic handles any length |
| Stop vs continue on error | Continue is user-friendly; stop may be safer |
6. Testing Strategy
6.1 Unit Tests
| Test | Input | Expected Result |
|---|---|---|
| Skip . and .. | readdir loop | Only real entries |
| Path building | “/foo” + “bar” | “/foo/bar” |
| Path too long | near PATH_MAX | Warning, skip |
| fnmatch “*.c” | “test.c” | Match |
| fnmatch “*.c” | “test.h” | No match |
| Size filter +1M | 2MB file | Match |
| Size filter +1M | 500KB file | No match |
6.2 Integration Tests
# Test basic tree
$ ./myfind --tree testdir
# Should show proper tree structure
# Test cycle handling
$ ln -s .. testdir/subdir1/cycle
$ ./myfind --tree testdir
# Should not hang, should note cycle
# Test name filter
$ ./myfind --name "*.c" testdir
# Should only show .c files
# Test du mode
$ ./myfind --du testdir
$ du -sh testdir
# Sizes should match
# Compare with find
$ ./myfind --name "*.h" /usr/include | wc -l
$ find /usr/include -name "*.h" | wc -l
# Counts should match
6.3 Edge Cases to Test
| Case | Setup | Expected Behavior |
|---|---|---|
| Empty directory | mkdir empty | Shows directory, no entries |
| Permission denied | chmod 000 dir | Error message, continue |
| Symlink cycle | ln -s . loop | Detect, skip with message |
| Dangling symlink | ln -s /nonexistent link | Report link, note dangling |
| Very deep tree | 1000+ levels | Handle without stack overflow |
| Huge directory | 100k+ files | Complete in reasonable time |
| Hard links | Multiple links to same file | Count once for du |
| Special files | FIFO, socket | Handle correctly |
6.4 Verification Commands
# Test for memory leaks
$ valgrind --leak-check=full ./myfind --tree /usr/include
# Compare output with standard tools
$ diff <(./myfind --name "*.c" src | sort) <(find src -name "*.c" | sort)
# Performance comparison
$ time ./myfind --tree /usr >/dev/null
$ time tree /usr >/dev/null
# Test with strace
$ strace -c ./myfind --tree /home
# Check syscall counts are reasonable
7. Common Pitfalls & Debugging
Problem 1: “Crashes on deep directory trees”
- Why: Stack overflow from recursion
- Fix: Use iterative approach with explicit stack, or increase stack size
Problem 2: “Infinite loop with symlinks”
- Why: Following symlinks without cycle detection
- Fix: Use
FTW_PHYSflag with nftw(), or track visited inodes
Problem 3: “Wrong sizes reported (du mode)”
- Why: Not using lstat() for symlinks, or double-counting hardlinks
- Fix: Use lstat(), track (dev, ino) pairs for hardlink dedup
Problem 4: “Buffer overflow in path construction”
- Why: Using strcpy/strcat without bounds checking
- Fix: Use snprintf() with proper size checks
Problem 5: “Missing files in output”
- Why: TOCTOU race—file disappeared between readdir and stat
- Fix: Handle ENOENT from stat() gracefully
Problem 6: “Permission denied breaks entire walk”
- Why: Error handling stops traversal
- Fix: Report error and continue to next entry
8. Extensions & Challenges
8.1 Easy Extensions
| Extension | Description | Learning |
|---|---|---|
| –mtime filter | Find by modification time | Time comparisons |
| –perm filter | Find by permissions | Bitmask operations |
| –empty | Find empty files/dirs | Size and link count |
| –prune | Skip directories matching pattern | Traversal control |
8.2 Advanced Challenges
| Challenge | Description | Learning |
|---|---|---|
| Parallel walk | Multi-threaded traversal | Concurrency |
| Content search | grep -r equivalent | File reading + traversal |
| Watch mode | Continuous monitoring | inotify integration |
| Network filesystems | Handle NFS/CIFS | Timeout, error handling |
| Persistent state | Resume interrupted walk | Checkpointing |
8.3 Research Topics
- How does
locateachieve instant search? (mlocate database) - How does git efficiently check status of large repos?
- What is fanotify and how does it differ from inotify?
- How do backup tools handle millions of files efficiently?
9. Real-World Connections
9.1 Production Systems Using This
| System | How It Uses Directory Walking | Notable Feature |
|---|---|---|
| find | Recursive traversal with predicates | Expression language |
| rsync | Compare source/dest directories | Incremental sync |
| tar | Archive directory trees | Preserve metadata |
| du | Calculate disk usage | Hard link tracking |
| git | Index working directory | Stat cache |
| webpack | Watch source files | File change detection |
9.2 How the Pros Do It
GNU find:
- Uses nftw() internally on most systems
- Implements full predicate language (-name, -type, -exec, etc.)
- Handles -prune for skipping directories
rsync:
- Builds file lists from both source and destination
- Compares metadata to find changes
- Optimized for network transfer
fd (modern find replacement):
- Uses parallel traversal
- Respects .gitignore by default
- Written in Rust for performance
9.3 Reading the Source
# View GNU findutils source
$ git clone https://git.savannah.gnu.org/git/findutils.git
$ less findutils/find/ftsfind.c
# View tree source
$ apt source tree
$ less tree-*/tree.c
# View fd source (Rust)
$ git clone https://github.com/sharkdp/fd
$ less fd/src/walk.rs
10. Resources
10.1 Man Pages
$ man 3 opendir # Open directory stream
$ man 3 readdir # Read directory entry
$ man 3 closedir # Close directory stream
$ man 3 nftw # File tree walk
$ man 3 fnmatch # Filename matching
$ man 2 stat # Get file status
$ man 2 lstat # Get symlink status
$ man 7 path_resolution # How paths are resolved
10.2 Online Resources
10.3 Book Chapters
| Book | Chapters | Topics Covered |
|---|---|---|
| “APUE” by Stevens | Ch. 4.21-4.22 | Directory Reading, File Tree Walking |
| “TLPI” by Kerrisk | Ch. 18 | Directories and Links |
| “Linux System Programming” by Love | Ch. 7 | File and Directory Management |
11. Self-Assessment Checklist
Before considering this project complete, verify:
- I can explain how directories are stored as files
- I understand the difference between opendir/readdir and nftw
- My tool correctly handles symbolic link cycles
- Path construction is safe from buffer overflows
- Permission denied errors don’t crash the program
- TOCTOU races are handled gracefully
- Hard links are counted once in du mode
- Tree output formatting is correct
- Pattern matching with fnmatch() works correctly
- I can answer all the interview questions listed above
- My code has zero memory leaks (valgrind verified)
12. Submission / Completion Criteria
Your project is complete when:
- Functionality
--treemode displays directory structure correctly--dumode matches system du output--namefilter works with glob patterns- Cycle detection prevents infinite loops
- Quality
- Compiles with
gcc -Wall -Wextra -Werrorwith no warnings - Zero valgrind errors
- Handles all error conditions gracefully
- Compiles with
- Testing
- Works on /usr/include (large directory)
- Handles permission denied gracefully
- Detects and reports symlink cycles
- Hard links counted correctly
- Understanding
- Can explain directory structure internals
- Can explain cycle detection algorithm
- Understands nftw() vs manual walking trade-offs
Next Steps
After completing this project, you’re ready for:
- Project 4: Shell Implementation - Use directory walking for command completion
- Project 5: Process Monitor - Walk /proc filesystem
- Project 10: File Watcher - Combine walking with inotify
The directory traversal patterns you’ve learned here appear in every tool that deals with the filesystem.