Project 10: Parallel Directory Traversal
Build a fast, parallel directory walker that respects ignore rules, detects binaries, and feeds file paths to a search engine.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Level 3: Advanced |
| Time Estimate | 2 weeks |
| Main Programming Language | Rust (Alternatives: C++, Go) |
| Alternative Programming Languages | C++, Go |
| Coolness Level | Level 3: Clever |
| Business Potential | Level 3: Infra / Dev tools |
| Prerequisites | concurrency basics, filesystem APIs |
| Key Topics | work queues, ignore rules, file filtering, traversal order |
1. Learning Objectives
By completing this project, you will:
- Implement a parallel directory walker with a work queue.
- Respect .gitignore and custom ignore patterns.
- Detect binary files and skip them by default.
- Balance work across threads without contention.
- Produce deterministic traversal order for testing.
2. All Theory Needed (Per-Concept Breakdown)
2.1 Parallel File System Traversal
Fundamentals
Directory traversal is the process of walking a directory tree and discovering files. A naive traversal is depth-first or breadth-first, single-threaded. For large repositories, this becomes a bottleneck: the search engine may be fast, but it starves for work because path discovery is slow. Parallel traversal uses a shared work queue: each worker thread pops a directory path, enumerates its entries, and pushes subdirectories back into the queue while emitting file paths to a downstream channel.
Correctness also requires filtering: ignore rules, hidden files, file types, and binary detection. These filters reduce work and improve throughput. A good traversal system is not just fast, it is precise about what it includes.
Deep Dive into the concept
The work queue is the core concurrency primitive. It can be implemented as a mutex-protected queue, a lock-free queue, or a work-stealing deque per worker. Work stealing improves scalability because each worker primarily works locally but can steal from others when idle. Rust’s crossbeam deque and Go’s worker pools are typical choices.
Ignore rules add complexity: .gitignore patterns apply hierarchically, and some patterns can re-include files. Your walker must maintain a stack of ignore matchers as it descends directories. Binary detection is a pragmatic heuristic: read a small chunk and classify as binary if it contains NUL bytes. This should be fast and optional, because some files are binary but still contain text data.
Traversal order affects determinism. For testing, you may want lexicographic ordering of emitted paths. For performance, you may emit as soon as discovered. You can support both by buffering outputs when --deterministic is enabled.
How this fits on projects
This walker is a core component of the final search tool (P15) and influences I/O performance in P11.
Definitions & key terms
- work queue -> shared list of directories to process
- work stealing -> idle workers steal tasks from others
- ignore rules -> patterns that exclude or include paths
- binary detection -> heuristic to skip non-text files
Mental model diagram (ASCII)
Queue: [dirA, dirB]
Worker1 -> dirA -> push dirA/sub1, emit files
Worker2 -> dirB -> push dirB/sub2, emit files
How it works (step-by-step)
- Seed queue with root directory.
- Workers pop directories and enumerate entries.
- Apply ignore rules and filters.
- Push subdirectories to queue and emit files.
- Continue until queue empty and workers idle.
Minimal concrete example
while let Some(dir) = queue.pop() {
for entry in read_dir(dir) { /* filter + enqueue */ }
}
Common misconceptions
- Misconception: Traversal is always I/O bound. Correction: Ignore matching and metadata calls can be CPU heavy.
- Misconception: Deterministic order is free. Correction: It requires sorting and buffering, which costs time.
Check-your-understanding questions
- Why does work stealing scale better than a single queue?
- How do .gitignore rules propagate into subdirectories?
- Why skip binary files by default?
Check-your-understanding answers
- It reduces contention and keeps workers busy.
- Rules apply hierarchically; each directory adds patterns.
- Binary files rarely contain meaningful text matches and are expensive to scan.
Real-world applications
- ripgrep, fd, ag, find replacements
- build systems and code indexing
Where you’ll apply it
- This project: Section 3.2 Functional Requirements and Section 4.4 Algorithm.
- Also used in: P11-mmap-vs-read-strategy-selector, P15-build-your-own-ripgrep.
References
- ripgrep ignore crate
- bfs directory traversal blog (tavianator)
Key insights
Traversal speed often dominates search speed; parallelism and ignore rules are critical.
Summary
A parallel walker feeds your search engine efficiently while respecting user expectations about ignored files.
Homework/Exercises to practice the concept
- Implement a single-threaded walker and benchmark it.
- Add a second thread and measure throughput.
Solutions to the homework/exercises
- Use recursive DFS and collect file paths.
- Introduce a queue and split work across threads.
3. Project Specification
3.1 What You Will Build
A CLI tool parwalk that recursively walks directories in parallel, applies ignore rules, detects binaries, and outputs file paths to stdout or a JSON list.
3.2 Functional Requirements
- Parallel traversal: configurable worker count.
- Ignore rules: support .gitignore and custom ignore file.
- Hidden files: skip by default, include with
--hidden. - Binary detection: skip binary by default, include with
--binary. - Deterministic mode: optional sorted output.
3.3 Non-Functional Requirements
- Performance: faster than
findon large repos. - Reliability: handle permission errors gracefully.
- Usability: clear stats (files visited, skipped).
3.4 Example Usage / Output
$ ./parwalk --threads 8 --hidden ~/code
/home/user/code/README.md
/home/user/code/src/main.rs
3.5 Data Formats / Schemas / Protocols
JSON output example:
{"files":["/path/a","/path/b"],"skipped":120}
3.6 Edge Cases
- Symlink loops
- Permission denied directories
- Very deep directory trees
3.7 Real World Outcome
3.7.1 How to Run (Copy/Paste)
cargo run -- --threads 8 --hidden ~/code
3.7.2 Golden Path Demo (Deterministic)
Use --deterministic to produce sorted output for tests.
3.7.3 CLI Transcript (Success + Failure)
$ ./parwalk --threads 4 fixtures/tree
fixtures/tree/a.txt
fixtures/tree/sub/b.txt
exit code: 0
$ ./parwalk /root/secret
warning: permission denied: /root/secret
exit code: 0
4. Solution Architecture
4.1 High-Level Design
+-----------+ +-----------+ +------------+ +-----------+
| CLI Parse |-->| Work Queue|-->| Dir Worker |-->| Output |
+-----------+ +-----------+ +------------+ +-----------+
4.2 Key Components
| Component | Responsibility | Key Decisions | |———–|—————-|—————| | Queue | distribute directories | work-stealing deque | | Ignore engine | match patterns | per-dir stack | | Binary detector | sample bytes | NUL heuristic | | Output | emit paths | optional sorting |
4.3 Data Structures (No Full Code)
struct Task { path: PathBuf, ignore: IgnoreState }
4.4 Algorithm Overview
- Seed queue with root.
- Workers pop tasks, read directory entries.
- Filter entries with ignore rules.
- Enqueue subdirectories, emit files.
Complexity Analysis
- Time: O(number of entries)
- Space: O(depth of traversal + queue size)
5. Implementation Guide
5.1 Development Environment Setup
cargo build --release
5.2 Project Structure
parwalk/
├── src/
│ ├── main.rs
│ ├── queue.rs
│ ├── ignore.rs
│ ├── binary.rs
│ └── output.rs
└── fixtures/
5.3 The Core Question You’re Answering
“How do I traverse huge directory trees fast without scanning junk files?”
5.4 Concepts You Must Understand First
- Work queues and worker pools.
- Hierarchical ignore rules.
- Binary detection heuristics.
5.5 Questions to Guide Your Design
- How will you prevent symlink loops?
- How will you handle permission errors without stopping?
- How will you keep output deterministic for tests?
5.6 Thinking Exercise
What happens if two workers scan the same directory? How do you prevent it?
5.7 The Interview Questions They’ll Ask
- Why is .gitignore handling tricky?
- How do you prevent contention in a parallel walker?
5.8 Hints in Layers
Hint 1: Start with a single-threaded walker. Hint 2: Add a queue with multiple worker threads. Hint 3: Add ignore rule stacking.
5.9 Books That Will Help
| Topic | Book | Chapter | |——-|——|———| | Filesystems | “The Linux Programming Interface” | directory APIs | | Concurrency | “Programming Rust” | thread pools |
5.10 Implementation Phases
Phase 1: Single Thread (3 days)
- Implement traversal and filtering.
- Checkpoint: correct output on fixtures.
Phase 2: Parallelism (4 days)
- Add worker pool and queue.
- Checkpoint: speedup on large repo.
Phase 3: Ignore + Binary (3 days)
- Add ignore rules and binary detection.
- Checkpoint: matches
rg --fileson fixtures.
5.11 Key Implementation Decisions
| Decision | Options | Recommendation | Rationale | |———-|———|—————-|———–| | Queue | mutex queue vs work stealing | work stealing | better scaling | | Output order | immediate vs sorted | immediate default | faster |
6. Testing Strategy
6.1 Test Categories
| Category | Purpose | Examples | |———-|———|———-| | Unit | ignore matching | .gitignore cases | | Integration | traversal output | fixture tree | | Stress | many files | generated directory |
6.2 Critical Test Cases
- Ignore rules exclude
node_modules. - Binary files skipped by default.
- Permission errors do not abort traversal.
6.3 Test Data
fixtures/tree/
a.txt
sub/ignored.bin
7. Common Pitfalls & Debugging
7.1 Frequent Mistakes
| Pitfall | Symptom | Solution | |———|———|———-| | Deadlocks | traversal hangs | avoid holding locks while I/O | | Duplicate paths | repeated output | ensure tasks unique | | Ignored files included | wrong results | test ignore hierarchy |
7.2 Debugging Strategies
- Log worker activity on small trees.
- Compare output with
rg --files.
8. Extensions & Challenges
8.1 Beginner Extensions
- Add
--max-depthflag. - Add
--typefilters (files only, dirs only).
8.2 Intermediate Extensions
- Add file size filters.
- Cache ignore matcher per directory.
8.3 Advanced Extensions
- Use git index for faster traversal inside repos.
- Add adaptive thread scaling based on I/O.
9. Real-World Connections
9.1 Industry Applications
- Code search tools and build systems
- Backup and sync tools
9.2 Related Open Source Projects
- ripgrep ignore crate
- fd and bfs
9.3 Interview Relevance
- Parallel traversal and work-stealing design
10. Resources
10.1 Essential Reading
- ripgrep ignore documentation
- bfs performance notes
10.2 Tools & Documentation
straceto inspect syscallshyperfinefor traversal benchmarks
10.3 Related Projects in This Series
11. Self-Assessment Checklist
11.1 Understanding
- I can explain work-stealing traversal.
- I can explain ignore rule stacking.
11.2 Implementation
- Output matches
rg --fileson fixtures. - Deterministic mode works.
11.3 Growth
- I can identify traversal bottlenecks with profiling.
12. Submission / Completion Criteria
Minimum Viable Completion:
- Parallel walker with basic filtering
Full Completion:
- .gitignore + binary detection
Excellence (Going Above & Beyond):
- Git index traversal optimization
- Adaptive thread scaling