Project 10: Parallel Directory Traversal

Build a fast, parallel directory walker that respects ignore rules, detects binaries, and feeds file paths to a search engine.

Quick Reference

Attribute	Value
Difficulty	Level 3: Advanced
Time Estimate	2 weeks
Main Programming Language	Rust (Alternatives: C++, Go)
Alternative Programming Languages	C++, Go
Coolness Level	Level 3: Clever
Business Potential	Level 3: Infra / Dev tools
Prerequisites	concurrency basics, filesystem APIs
Key Topics	work queues, ignore rules, file filtering, traversal order

1. Learning Objectives

By completing this project, you will:

Implement a parallel directory walker with a work queue.
Respect .gitignore and custom ignore patterns.
Detect binary files and skip them by default.
Balance work across threads without contention.
Produce deterministic traversal order for testing.

2. All Theory Needed (Per-Concept Breakdown)

2.1 Parallel File System Traversal

Fundamentals

Directory traversal is the process of walking a directory tree and discovering files. A naive traversal is depth-first or breadth-first, single-threaded. For large repositories, this becomes a bottleneck: the search engine may be fast, but it starves for work because path discovery is slow. Parallel traversal uses a shared work queue: each worker thread pops a directory path, enumerates its entries, and pushes subdirectories back into the queue while emitting file paths to a downstream channel.

Correctness also requires filtering: ignore rules, hidden files, file types, and binary detection. These filters reduce work and improve throughput. A good traversal system is not just fast, it is precise about what it includes.

Deep Dive into the concept

The work queue is the core concurrency primitive. It can be implemented as a mutex-protected queue, a lock-free queue, or a work-stealing deque per worker. Work stealing improves scalability because each worker primarily works locally but can steal from others when idle. Rust’s crossbeam deque and Go’s worker pools are typical choices.

Ignore rules add complexity: .gitignore patterns apply hierarchically, and some patterns can re-include files. Your walker must maintain a stack of ignore matchers as it descends directories. Binary detection is a pragmatic heuristic: read a small chunk and classify as binary if it contains NUL bytes. This should be fast and optional, because some files are binary but still contain text data.

Traversal order affects determinism. For testing, you may want lexicographic ordering of emitted paths. For performance, you may emit as soon as discovered. You can support both by buffering outputs when --deterministic is enabled.

How this fits on projects

This walker is a core component of the final search tool (P15) and influences I/O performance in P11.

Definitions & key terms

work queue -> shared list of directories to process
work stealing -> idle workers steal tasks from others
ignore rules -> patterns that exclude or include paths
binary detection -> heuristic to skip non-text files

Mental model diagram (ASCII)

Queue: [dirA, dirB]
Worker1 -> dirA -> push dirA/sub1, emit files
Worker2 -> dirB -> push dirB/sub2, emit files

How it works (step-by-step)

Seed queue with root directory.
Workers pop directories and enumerate entries.
Apply ignore rules and filters.
Push subdirectories to queue and emit files.
Continue until queue empty and workers idle.

Minimal concrete example

while let Some(dir) = queue.pop() {
    for entry in read_dir(dir) { /* filter + enqueue */ }
}

Common misconceptions

Misconception: Traversal is always I/O bound. Correction: Ignore matching and metadata calls can be CPU heavy.
Misconception: Deterministic order is free. Correction: It requires sorting and buffering, which costs time.

Check-your-understanding questions

Why does work stealing scale better than a single queue?
How do .gitignore rules propagate into subdirectories?
Why skip binary files by default?

Check-your-understanding answers

It reduces contention and keeps workers busy.
Rules apply hierarchically; each directory adds patterns.
Binary files rarely contain meaningful text matches and are expensive to scan.

Real-world applications

ripgrep, fd, ag, find replacements
build systems and code indexing

Where you’ll apply it

This project: Section 3.2 Functional Requirements and Section 4.4 Algorithm.
Also used in: P11-mmap-vs-read-strategy-selector, P15-build-your-own-ripgrep.

References

ripgrep ignore crate
bfs directory traversal blog (tavianator)

Key insights

Traversal speed often dominates search speed; parallelism and ignore rules are critical.

Summary

A parallel walker feeds your search engine efficiently while respecting user expectations about ignored files.

Homework/Exercises to practice the concept

Implement a single-threaded walker and benchmark it.
Add a second thread and measure throughput.

Solutions to the homework/exercises

Use recursive DFS and collect file paths.
Introduce a queue and split work across threads.

3. Project Specification

3.1 What You Will Build

A CLI tool parwalk that recursively walks directories in parallel, applies ignore rules, detects binaries, and outputs file paths to stdout or a JSON list.

3.2 Functional Requirements

Parallel traversal: configurable worker count.
Ignore rules: support .gitignore and custom ignore file.
Hidden files: skip by default, include with --hidden.
Binary detection: skip binary by default, include with --binary.
Deterministic mode: optional sorted output.

3.3 Non-Functional Requirements

Performance: faster than find on large repos.
Reliability: handle permission errors gracefully.
Usability: clear stats (files visited, skipped).

3.4 Example Usage / Output

$ ./parwalk --threads 8 --hidden ~/code
/home/user/code/README.md
/home/user/code/src/main.rs

3.5 Data Formats / Schemas / Protocols

JSON output example:

{"files":["/path/a","/path/b"],"skipped":120}

3.6 Edge Cases

Symlink loops
Permission denied directories
Very deep directory trees

3.7 Real World Outcome

3.7.1 How to Run (Copy/Paste)

cargo run -- --threads 8 --hidden ~/code

3.7.2 Golden Path Demo (Deterministic)

Use --deterministic to produce sorted output for tests.

3.7.3 CLI Transcript (Success + Failure)

$ ./parwalk --threads 4 fixtures/tree
fixtures/tree/a.txt
fixtures/tree/sub/b.txt
exit code: 0

$ ./parwalk /root/secret
warning: permission denied: /root/secret
exit code: 0

4. Solution Architecture

4.1 High-Level Design

+-----------+   +-----------+   +------------+   +-----------+
| CLI Parse |-->| Work Queue|-->| Dir Worker |-->| Output    |
+-----------+   +-----------+   +------------+   +-----------+

4.2 Key Components

4.3 Data Structures (No Full Code)

struct Task { path: PathBuf, ignore: IgnoreState }

4.4 Algorithm Overview

Seed queue with root.
Workers pop tasks, read directory entries.
Filter entries with ignore rules.
Enqueue subdirectories, emit files.

Complexity Analysis

Time: O(number of entries)
Space: O(depth of traversal + queue size)

5. Implementation Guide

5.1 Development Environment Setup

cargo build --release

5.2 Project Structure

parwalk/
├── src/
│   ├── main.rs
│   ├── queue.rs
│   ├── ignore.rs
│   ├── binary.rs
│   └── output.rs
└── fixtures/

5.3 The Core Question You’re Answering

“How do I traverse huge directory trees fast without scanning junk files?”

5.4 Concepts You Must Understand First

Work queues and worker pools.
Hierarchical ignore rules.
Binary detection heuristics.

5.5 Questions to Guide Your Design

How will you prevent symlink loops?
How will you handle permission errors without stopping?
How will you keep output deterministic for tests?

5.6 Thinking Exercise

What happens if two workers scan the same directory? How do you prevent it?

5.7 The Interview Questions They’ll Ask

Why is .gitignore handling tricky?
How do you prevent contention in a parallel walker?

5.8 Hints in Layers

Hint 1: Start with a single-threaded walker. Hint 2: Add a queue with multiple worker threads. Hint 3: Add ignore rule stacking.

5.9 Books That Will Help

5.10 Implementation Phases

Phase 1: Single Thread (3 days)

Implement traversal and filtering.
Checkpoint: correct output on fixtures.

Phase 2: Parallelism (4 days)

Add worker pool and queue.
Checkpoint: speedup on large repo.

Phase 3: Ignore + Binary (3 days)

Add ignore rules and binary detection.
Checkpoint: matches rg --files on fixtures.

5.11 Key Implementation Decisions

6. Testing Strategy

6.1 Test Categories

6.2 Critical Test Cases

Ignore rules exclude node_modules.
Binary files skipped by default.
Permission errors do not abort traversal.

6.3 Test Data

fixtures/tree/
  a.txt
  sub/ignored.bin

7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

7.2 Debugging Strategies

Log worker activity on small trees.
Compare output with rg --files.

8. Extensions & Challenges

8.1 Beginner Extensions

Add --max-depth flag.
Add --type filters (files only, dirs only).

8.2 Intermediate Extensions

Add file size filters.
Cache ignore matcher per directory.

8.3 Advanced Extensions

Use git index for faster traversal inside repos.
Add adaptive thread scaling based on I/O.

9. Real-World Connections

9.1 Industry Applications

Code search tools and build systems
Backup and sync tools

ripgrep ignore crate
fd and bfs

9.3 Interview Relevance

Parallel traversal and work-stealing design

10. Resources

10.1 Essential Reading

ripgrep ignore documentation
bfs performance notes

10.2 Tools & Documentation

strace to inspect syscalls
hyperfine for traversal benchmarks

Previous: P09-teddy-algorithm-multi-literal-simd
Next: P11-mmap-vs-read-strategy-selector

11. Self-Assessment Checklist

11.1 Understanding

I can explain work-stealing traversal.
I can explain ignore rule stacking.

11.2 Implementation

Output matches rg --files on fixtures.
Deterministic mode works.

11.3 Growth

I can identify traversal bottlenecks with profiling.

12. Submission / Completion Criteria

Minimum Viable Completion:

Parallel walker with basic filtering

Full Completion:

.gitignore + binary detection

Excellence (Going Above & Beyond):

Git index traversal optimization
Adaptive thread scaling